Phi coefficients, Christmas and the number 42
People like familiarity. That’s probably one reason we enjoy the holidays so much: we know all the words to Silent Night, how to carve a turkey, which of the Christmas cookies taste the best. If I am going to convince you to give up statistics with which you feel comfortable, such as chi-square and correlation coefficients, and dive into logit models, I probably owe you an explanation as to why that might be desirable and even necessary. Kind of like convincing you that the turkey will give you food poisoning or that the Christmas ornament was made by shoeless, small children working for pennies a day down in Guatemala.
What could possibly be wrong with phi coefficients?
Phi coefficients are interpreted similarly to the Pearson coefficient which all of us learned in our first statistics course, along with having the mantra “Never infer causation from correlation” beaten into our heads.
Correlation coefficients are comfortable.
If you liked The Hitchhiker’s Guide to the Galaxy (and how could you possibly not?), you might remember that when Arthur Dent found out that 42, the answer to the ultimate question of Life, the Universe, and Everything, was the answer to the question “What is six times nine?”, he responded,
“I always thought something was fundamentally wrong with the universe.”
Facts about the phi coefficient you maybe never knew or gave much thought to
Here is the problem with the phi coefficient: the statement that phi is interpreted similarly to the Pearson correlation coefficient is based on the assumption that the marginal distributions of the two variables are equal. For example, for dichotomous variables it is assumed that 50% of the population falls in each category.
Now we all know that restricted variance has a negative impact on the size of the correlation, the well-known ‘restriction of range’. If you are reading this blog, there is a high probability that you know the formula for the population variance: the sum of the squared differences between each x and the mean of x, divided by n. If every value is close to the mean, you have relatively little variance.
Think about this for a moment – the variance of a binary variable = p*q where p is the probability of a given event, say getting a coherent answer from me before 7 a.m., and q is the probability of the opposite of that event. If you don’t remember some of these formulae or just don’t believe me, here is a nice site from the Visual Statistics Studio to remind you.
So… the variance is at its maximum when p and q are both .50, and that maximum is .25. (Believe me, the probability of getting a coherent response from me before 7 a.m. is NOT .50!) When we have a distribution that departs substantially from 50-50, say 95% of our patients lived and only 5% died, we have a restriction of range.
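To see how quickly the variance shrinks as the split gets more lopsided, here is a small Python sketch. The function name bernoulli_variance is just something made up for this illustration; all it does is compute p*q for a few splits, including the 95-5 one above.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 variable where p is the probability of a 1."""
    q = 1.0 - p  # probability of the opposite outcome
    return p * q


# Print the variance for a range of splits, from lopsided to 50-50.
for p in (0.05, 0.10, 0.25, 0.50, 0.75, 0.95):
    print(f"p = {p:.2f}   variance = {bernoulli_variance(p):.4f}")
```

The 50-50 split gives .25, and the 95-5 split gives less than a fifth of that.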
So, problem number one:
The phi coefficient’s maximum value is NOT always 1.0. Here is something you probably always intuitively knew but never gave much thought to:
When the split is fairly extreme on one of the variables and not so extreme on the other, the maximum phi coefficient is considerably less than 1.0. In fact, if it is a 90-10 split on one variable, say, the probability of mortality from a given disease, and 50-50 on the other, for example whether a patient was in the treatment or control group, the maximum possible phi value is only .33. How depressing is that? For a really interesting discussion of this whole topic of marginal probabilities and maximum phi coefficients, I recommend the book Practical Meta-Analysis, by Lipsey & Wilson. And no, I have never met them and do not receive a cut from any books they sell. However, if they are reading this, they are most welcome to show their gratitude via Starbucks gift cards.
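If you would rather check that .33 than take my word for it, here is a rough Python sketch. It assumes the usual formula for phi in terms of proportions, and it gets the largest positive phi by putting as many cases as the margins allow into the cell where both variables equal 1. The function names phi and max_phi are my own, invented for this example.

```python
import math


def phi(p11, p1, p2):
    """Phi for a 2x2 table, written in proportions.

    p1  = proportion scoring 1 on the first variable
    p2  = proportion scoring 1 on the second variable
    p11 = proportion scoring 1 on both
    """
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))


def max_phi(p1, p2):
    """Largest positive phi that the two margins allow.

    The joint proportion p11 can never exceed the smaller margin,
    so the best case is p11 = min(p1, p2).
    """
    return phi(min(p1, p2), p1, p2)


print(f"90-10 versus 50-50: {max_phi(0.10, 0.50):.2f}")
print(f"50-50 versus 50-50: {max_phi(0.50, 0.50):.2f}")
print(f"90-10 versus 90-10: {max_phi(0.10, 0.10):.2f}")
```

Notice that phi can still reach 1.0 whenever the two margins match, even at 90-10. It is the mismatch between the margins that puts the ceiling on it.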
It is because of this, and other problems with the phi coefficient, that other statistics and models are preferable. Enter, for example, the odds ratio. The odds ratio does not range from -1 to +1, even in theory. It ranges from zero to infinity.
Odds ratios are good for studying diseases. Personally, I believe coffee cures all ills and particularly reduces your stress in the morning. Let’s say that in a sample of 100 non-coffee drinkers, 50 of them get heart disease, and of 100 coffee drinkers, 25 of them get heart disease. The odds of getting heart disease versus not for a non-coffee drinker are even, 1:1, while the odds for a coffee drinker are 3:1 against (25 who get it to 75 who don’t). Divide the first odds by the second, 1 divided by 1/3, and the odds ratio is 3.0.
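Here is the same arithmetic as a short Python sketch, using the made-up coffee numbers. The function names odds and odds_ratio are only for illustration and do not come from any particular library.

```python
def odds(events, total):
    """Odds of the event: how many had it versus how many did not."""
    return events / (total - events)


def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds of the event in group A divided by the odds in group B."""
    return odds(events_a, total_a) / odds(events_b, total_b)


non_coffee_odds = odds(50, 100)  # 50 sick to 50 well
coffee_odds = odds(25, 100)      # 25 sick to 75 well

print(f"Non-coffee drinkers: {non_coffee_odds:.2f}")
print(f"Coffee drinkers:     {coffee_odds:.2f}")
print(f"Odds ratio:          {odds_ratio(50, 100, 25, 100):.2f}")
```

Flip the two groups around and you get 1/3 instead of 3, which says the same thing from the coffee drinkers’ point of view.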
Incidentally, these data are completely made up as an example and you should not take this post as evidence to either begin drinking coffee or invest in shares of Maxwell House. On the other hand, if you need to be told that, you probably should not be allowed to have your own checkbook, anyway.
The odds ratio itself also has problems. One problem is that while it would be very nice to have a standard error for the odds ratio, the distribution of the odds ratio is skewed. Remember that whole zero to infinity thing?
A standard error, confidence intervals, that whole thing, all assume a random, normal distribution of errors. Enter the natural logarithm of the odds ratio, but that is a post for another day, after three cups of coffee.
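In the meantime, here is a tiny Python sketch of the lopsidedness, using only the standard math module. An odds ratio of 3 and an odds ratio of 1/3 describe effects of the same size in opposite directions, yet they sit at very different distances from the no-effect value of 1. Take the natural log and they become mirror images around zero. The variable names are made up for this example.

```python
import math

or_harm = 3.0           # odds three times as high
or_protect = 1.0 / 3.0  # odds one third as high, the same effect reversed

# On the raw scale the two are not the same distance from 1 (no effect).
dist_harm = or_harm - 1.0
dist_protect = 1.0 - or_protect
print(f"Distance from 1: {dist_harm:.2f} versus {dist_protect:.2f}")

# On the log scale they are the same distance from 0 (no effect).
log_harm = math.log(or_harm)
log_protect = math.log(or_protect)
print(f"Log odds ratios: {log_harm:.2f} versus {log_protect:.2f}")
```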