Multicollinearity statistics with SPSS
“Can you explain multicollinearity statistics?”
she asked.
Why, yes, yes I can.
First of all, as noted in the Journal of Polymorphous Perversity,
“Multicollinearity is not a life-threatening condition except when a depressed graduate student employs multiple, redundant measures.”
What is multicollinearity, then, and how do you know if you have it?
Multicollinearity is a problem that occurs in regression analysis when at least one independent variable is highly correlated with a combination of the other independent variables. The most extreme example would be two completely overlapping variables. Say you were predicting income from the Excellent Test for Income Prediction (ETIP). Unfortunately, you are a better test designer than statistician, so your two independent variables are Number of Answers Correct (CORRECT) and Number of Answers Incorrect (INCORRECT). Those two are going to have a perfect negative correlation of -1. Multicollinearity. You are not going to be able to find a single least squares solution. For example, if you have this equation:
Income = .5*Correct + 0*Incorrect
or (with the intercept shifting to absorb the difference, since Correct + Incorrect adds up to the same total for everyone)
Income = 0*Correct - .5*Incorrect
You will get the exact same prediction. Now that is a pretty trivial example, but you can have a similar problem if you use two or more predictors that are very highly correlated. Let’s assume you’re predicting income from high school GPA, college GPA and SAT score. It may be that high school GPA and SAT score together have a very high multiple correlation with college GPA.
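If you want to watch SPSS call this out, here is a minimal sketch; the income variable and the 20-item test length are hypothetical, made up just for this illustration:
* Hypothetical 20-item test: INCORRECT is completely determined by CORRECT.
COMPUTE incorrect = 20 - correct.
EXECUTE.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/DEPENDENT income
/METHOD=ENTER correct incorrect.
* SPSS flags the perfect collinearity: the tolerance is .000 and one of the pair lands in the Excluded Variables table.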
For more about why multicollinearity is a bad thing, read this very nice web page by a person in Michigan whom I don’t know. Let’s say you already know multicollinearity is bad and you want to know how to spot it, kind of like cheating boyfriends. Well, I can’t help you with THAT (although you can try looking for lipstick on his collar), but I can help you with multicollinearity.
One suggestion some people give is to look at your correlation matrix and see if you have any independent variables that correlate above some level with one another. Some people say .75, some say .90, some say potato. I say that looking at your correlation matrix is fine as far as it goes, but it doesn’t go far enough. Certainly if I had variables correlated above .90 I would not include both in the equation. Even if it was above .75, I would look a bit askance, but I might go ahead and try it anyway and see the results.
The problem with just looking at the correlation matrix: what if you have four variables that together explain 100% of the variance in a fifth independent variable? You aren’t going to be able to tell that just by looking at the correlation matrix. Enter the Tolerance Statistic, wearing a black cape, here to save the day. Okay, I lied, it isn’t really wearing a black cape; it’s a green cape. (By the way, if you have a mad urge to buy said green cape, or a Viking tunic, you can fulfill your desires here. I am not affiliated with this website in any way. I am just impressed that they seem to be finding a niche in the Pirate Garb / Viking tunic / cloak market.)
In complete seriousness now, ahem ….
To compute a tolerance statistic for an independent variable, to test for multicollinearity, a multiple regression is performed with that variable as the new dependent and all of the other independent variables in the model as independents. The tolerance statistic is 1 – R² for this second regression. (R-square, just to remind you, is the amount of variance in a dependent variable in a multiple regression explained by the combination of all of the independent variables.) In other words, tolerance is 1 minus the amount of variance in the independent variable explained by all of the other independent variables. A tolerance statistic below .20 is generally considered cause for concern. Of course, in real life, you don’t actually compute a bunch of regressions with all of your independent variables as dependents; you just look at the collinearity statistics.
Let’s take a look at an example in SPSS, shall we?
The code is below or you can just pick REGRESSION from the ANALYZE menu. Don’t forget to click on the STATISTICS button and select COLLINEARITY STATISTICS.
Here I have a dependent variable that is the rating of problems a person has with sexual behavior, sexual attitudes and mental state. The three independent variables are ratings of symptoms of anorexia, symptoms of bulimia and problems in body perception.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT problems
/METHOD=ENTER anorexic perceptprob bulimia.
Let’s just take a look at the first variable “anorexic”. It has a Tolerance of .669. What does that mean? It means that if I ran a multiple regression with anorexic as the dependent, and perceptprob and bulimia as the independent variables, I would get an R-square value of .331. Don’t take my word for it. Let’s try it. Notice that now anorexic is the dependent variable.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT anorexic
/METHOD=ENTER perceptprob bulimia.
Now, look at that. When we do a regression with anorexic as the dependent variable and bulimia and perceptprob as the two independent variables, the R-square is .331. If we take 1 – .331 we get .669, which is exactly the Tolerance Statistic for anorexic in the previous regression analysis above. Don’t you just love it when everything works out?
So WHY is a tolerance below .20 considered a cause for concern? It means that at least 80% of the variance of this independent variable is shared with some other independent variables. It also means that the multiple correlation of the other independent variables with this independent variable is nearly .90 or higher (because √.80 ≈ .894).
Another statistic sometimes used for multicollinearity is the Variance Inflation Factor, which is just the reciprocal of the tolerance statistic. A VIF greater than 5 is generally considered evidence of multicollinearity. If you divide 1 by .669 you’ll get 1.495, which is exactly the same as the VIF statistic shown above.
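To put the arithmetic in one place (this is just algebra, nothing SPSS-specific, using the numbers from above):
Tolerance = 1 – R² (from regressing that predictor on all of the other predictors)
VIF = 1 / Tolerance
Tolerance = .669 → VIF = 1/.669 ≈ 1.495
Tolerance = .20 → VIF = 1/.20 = 5, and the multiple correlation is √.80 ≈ .894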
And you thought I just made this sh*t up as I went along, didn’t you?
Thank you so much for such a great explanation.
I am grateful to you for this great explanation. May I request that you explain the condition index: what it is and how it works? And how to solve the multicollinearity problem? Thanks a lot.
Thank you for a wonderful, simple and entertaining blog post. The link to the Chicago guy was extremely helpful, too 🙂
You’ve made one little psychology student much less stressed, thank you!
xx
Thank you for this. Working on my thesis now and was looking for some help with collinearity – the article was very helpful. I have some results!
Hi,
I know that some smart-ass student is going to ask me about the tolerance statistic, even if the other 99% couldn’t care less. I couldn’t find anything really helpful in French so I did a more general search, just to be sure of my facts! Your explanation is delicious.
Thank you.
Very simple and clear! Thanks for making this interesting 🙂
This was the most fun I have had reading about statistics in forever. You are hilarious, and you can explain something in simple terms, which means you are very knowledgeable as well.
Thanks!
It is a nice explanation, but is it possible to test multicollinearity using VIF for both categorical and continuous variables?
You can use VIF if you dummy-code your variables, e.g., code Male = 1, Female = 0 for a variable gender. Now include another variable rsex and code it Male = 0, Female = 1. See what happens if you put both into an equation, as in the sketch below.
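If you want to actually run that experiment, here is a sketch in syntax; sex (coded 1 = male, 2 = female) and the outcome y are hypothetical names, not from any real data set:
* Two dummies that are mirror images of each other.
COMPUTE gender = (sex = 1).
COMPUTE rsex = (sex = 2).
EXECUTE.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/DEPENDENT y
/METHOD=ENTER gender rsex.
* With no missing data, gender + rsex = 1 for every case, so the tolerance is zero and SPSS drops one of them.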
Cool explanation! Thanks! 🙂
Master’s student of clinical psychology at UT; it was great, useful and practical. Thanks.
What shall I do if VIF values are between 1.000 and 2.000, but when removing ‘suspicious’ predictors the coefficients and significance still change?
Can you give a little more detail, Ursula? How many variables do you have and how many observations?
I can now ignore everything else I’ve found on the Internet regarding multicollinearity thanks to your post. You’ve made it SOOO much easier to understand. I really appreciate it!
Thanks. This was very helpful. Just to confirm (I am quite statistically challenged): I am doing binary logistic regression (an assignment) and I am just using linear regression to do the multicollinearity diagnostic. I do not have to run all my independents with one as the dependent each time for this? That was the impression I was getting from reading other explanations. I just run all my independents with my “real” dependent and use the tolerance level to tell me whether or not I have multicollinearity, as you have done? Thanks very much.
Carolyn –
The answer is yes, and if you need a citation for the “yes”, see Applied Logistic Regression Analysis by Scott Menard.
Thank you very much.
Hi: I am doing a logistic regression. I have my dependent variable categorical (0/1) and 5 independent categorical (0/1) variables such as gender. Please tell me if it is correct to get VIF, considering that R² is only for linear regression. Please cite an author to support your answer if you can. Thank you very much, sorry for my English, I’m from Mexico.
First, I would look at my standard errors in the logistic regression; large standard errors are associated with multicollinearity, although you should be aware that you could also have large standard errors for other reasons.
IBM (which now owns SPSS) suggests that you dummy-code your categorical variables and then run a linear regression to get the multicollinearity statistics; then, after dropping variables as appropriate, run a logistic regression to get the predicted probabilities, etc. A sketch of that workflow follows the link below.
http://www-01.ibm.com/support/docview.wss?uid=swg21476696
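Roughly, that two-step workflow looks like this in syntax; outcome and x1 to x5 are placeholder names for illustration, not variables from the question above:
* Step 1: run a linear regression purely for the collinearity diagnostics.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/DEPENDENT outcome
/METHOD=ENTER x1 x2 x3 x4 x5.
* Step 2: after dropping any offending dummies, fit the logistic model you actually report.
LOGISTIC REGRESSION VARIABLES outcome
/METHOD=ENTER x1 x2 x3.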
Stats made sexy! Thanks for simplifying the explanations. Much appreciated.
Hi,
I have just read your comments about multicollinearity in categorical variables. As per your instructions I have created dummy variables (dichotomous). Now when I run the collinearity stats keeping age band as my dependent variable and gender=1 and gender=0 as my independent variables, I get a VIF of 15.02 against gender=1 and a VIF of 1.67 against gender=0, so which one should I remove?
Further, gender=1 comes out to be statistically significant in the t test (having a p value less than 0.05). Another possibility: if the p value of gender=1 is greater than 0.05, that is, not significant, then in that case should we remove it? Please suggest what should be done.
THANKS
hello!
I’m Joselyne. I would like to know how I can remove multicollinearity using SPSS without removing any independent variable, because all the independent variables are very important (they are imports, exports and exchange rate). The VIFs of imports and exports are very high.
thank you
Thank you so much for such a clear, concise and funny explanation. I appreciate your help loadddsss! 🙂
Thank you, very good explanation. I would like to know the steps to find CFI and TLI.
I am going to remember this forever. Thank you for explaining it so well…
Thanks so much!! That was awesome (I mean it)! xo.
I think that collinearity is something we have to deal with. It is like being invisible: it inflates our coefficients, driving us to misinterpretation. Anyway, on my website I wrote an interesting article about it:
http://randomwalkproject.it/2015/01/11/why-collinearity-matters-in-regression-analysis/
By the way, your explanation is really accurate. Congrats
Hi! For each regression I compute, I am finding a Tolerance below .2 and a VIF above 5 for the same 2 independent variables. However, when I put either of these 2 problem children as the dependent variable, I do not find multicollinearity (i.e., Tolerance above .2 and VIF below 5). What does this mean? Do I not have multicollinearity?
Such an easy and crystal clear explanation. Thanks.
Hi, thank you very much for the clear blog. I had a severe problem regarding multicollinearity in my research and you made my life easier. Your blog is great. I highly appreciate your work. 🙂
Thanks for this post. It helped me a lot. Your jokes are excellent by the way 🙂
Quite simply explained. Most doctoral students find the interpretation of tolerance and VIF confusing. They hardly conduct a multicollinearity test before embarking on regression analysis. We shall continue sharing experiences as we demystify statistics for all learners.
Prof. John Aluko Orodho [Associate Professor of Research and Statistics]
Hi, when should we conduct the multicollinearity test? At the beginning of variable screening or after the regression (meaning after we have the final variables)?
Thank you so much for the detailed explanation!! I’m so bad at SPSS and always have to count on others. But thanks to your detailed explanation, my mind is blown! Haha, not really, but I really appreciate your taking the time to write this article for clueless students like me!
Hahaha. I am doing the final touches on preparation for my research proposal in the next 3-5 hours. I was trying to understand these tests. Wow, you have made my day. Thanks.
Dear all, I have the final model, using survey logistic regression, with collinearity especially in the dummy variables. Nevertheless, for the logistic model for disease, the goodness-of-fit test gives Prob > F: 0.002 and each of the variables in the model is significant. May I keep the model with the collinearity, or should I remove the variables that have collinearity? Please give me a suggestion. Thank you so much.
Thank you very much for this wonderful explanation and for making me laugh while learning! I am working on my PhD and needed to handle collinearity; this has been really helpful and so easy to read!
I will sign up for a seminar with you at any time.
Thank you again
Maria
Barcelona
Great! Thank you!
Your explanation about tolerance & VIF is very easy to follow. Thanks a lot for your great service & contribution to the community.
Hello,
I have quite a different sort of problem. I’m working with survey data for a huge dataset (DHS 2013 for Nigeria). I’m trying to use the tolerance and VIF scores to determine if I have multicollinearity. After running a regression model of my predictor variables and then the follow-up VIF command (in Stata), this is what I get:
tolerance=. & VIF=.
I’m stumped. I got a nice output for my regression model though, unfortunately I have no way of showing it to you. What does it mean that my tolerance and VIF values are missing? Is that bad?
What could I be doing wrong? Thanks in advance.
Hi
Please, how can I get combined VIF and tolerance statistics for multicollinearity using one dependent and 5 independent variables?
In SPSS I am using the enter method under regression analysis, but the results still show some excluded variables. I need to include all of them. Kindly tell me the way out. Thank you in advance.
This was fun to read and informative. Thank you.
Hi there,
So if this were my output, would this be multicollinearity because they are greater than .9?
Collinearity
Tolerance   VIF
.952        1.050
.835        1.197
.872        1.147
Thanks!
How do we compute collinearity with dummy variables in SPSS?
Very useful article on multicollinearity. It has really helped me. Thanks.