Logistic regression never screws up when you want it to
All my programs are working today and I am sad.
Fortunately for everyone else, but unfortunately for me today, SAS has increasingly automated or semi-automated fixing common errors. It’s unfortunate for me because I wanted to talk about errors and how to fix them. I could create a simulated dataset but I hate doing that. I think if whatever the issue is occurs often enough to bother talking about it, you ought to be able to find a real dataset it applies to without going to the extreme of making one up.
Ever notice how programs like SAS have 300-page manuals for something that you can code in three statements, like logistic regression? How does that make any sense?
How it makes sense is that coding those three or five or seven statements correctly, understanding output like the Nagelkerke pseudo-R-squared, and fixing many of the errors that you encounter all require understanding a lot of terms and some underlying mathematics.
On the other hand (where you have different fingers), very commonly the errors that occur when you are first learning a language have nothing to do with a Hessian matrix that is not positive definite and everything to do with having misspelled a word.
Years ago, I used to say that I could make a billion dollars if I could come up with a language that did what I meant to tell it instead of what I told it. SAS did this a version or two back. They also made a billion dollars. So I was right, but they still didn’t give me any of it. How rude!
(There is a lesson in here to silly young people who ask for NDAs, by the way. What is worth the billion dollars is not the idea, it’s the implementation.)
Now if you type DAAT instead of DATA, your program will run anyway with a polite note in your SAS log telling you that it has assumed you meant DATA and went ahead and executed based on that assumption, but if not, hey, feel free to let it know. (Am I the only one who feels this bears a little creepy resemblance to the happy doors in the Hitchhiker’s Guide to the Galaxy?)
So now, with almost no fanfare whatever, SAS has gone on and done this with its statistical procedures as well. There is ODS Graphics, which guesses what diagnostic graphs you would probably want, and there are also automatic self-correcting mechanisms.
I TRIED to get PROC LOGISTIC to screw up by making some of the basic errors I see and here is what happened. Just so you know, the dependent variable in all cases except #3 and #4 was whether the person was employed, coded as 0 for no, 1 for yes.
1. I used two variables that were perfectly correlated as independent variables. Whether the subject had difficulty in job training was coded as 1 or 0, for a variable I named “difficulty”. Whether the person found job training easy was coded as
ease = 1 - difficulty ;
I see people do this when they are unfamiliar with the concept of dummy variables. In fact (I’m not making this up), they think I am insulting them when I tell them that their problem is having too many dummy variables.
What did SAS do?
Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
ease = 0.5 * Intercept + 0.5 * difficulty0
So, it dropped the redundant parameter, then ran with the corrected code.
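If you want to reproduce this yourself, a minimal sketch might look like the following. The dataset name jobs and the response name employed are placeholders I made up, not names from the run above. The CLASS statement is also an assumption: the difficulty0 in the note looks like an effect-coded column (1 when difficulty is 0, -1 when it is 1), which is what makes 0.5 * Intercept + 0.5 * difficulty0 come out to 1 and 0 in the right places.

```sas
/* Sketch only: "jobs" and "employed" are placeholder names */
data jobs2 ;
   set jobs ;
   ease = 1 - difficulty ;      /* perfectly collinear with difficulty */
run ;

proc logistic data = jobs2 descending ;   /* model the probability of employed = 1 */
   class difficulty ;           /* assumed: default effect coding (1 / -1) */
   model employed = difficulty ease ;
run ;
```

The fix on your end is simply to leave one of the two variables out of the MODEL statement.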
2. I created a constant that I creatively named const.
const = 5 ;
This happens to people usually when their sample size is too small or restricted. Say your dependent variable is divorced, coded 0 or 1. The variable really does vary in the population. However, if your sample is of high school students, very few of them are divorced and you would need a large sample size (or luck) to have any variation. If your sample size is 15 people, since only 10.4% of the U.S. population is divorced, it is perfectly possible you might not have anyone who is divorced.
So, what does SAS do in this situation?
Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
const = 5 * Intercept
Didn’t really think about that, did you? A constant is just the intercept multiplied by some number. So, SAS goes ahead and runs, as if your code did not have that variable in the equation.
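To put a number on "perfectly possible," and to catch constants before they reach the model, something like the sketch below would do. It assumes the 15 people are independent draws from a population where 10.4% are divorced (a sample of high school students would make an all-zero variable far more likely than that), and jobs is again a placeholder dataset name.

```sas
/* Chance that a sample of 15 contains nobody from a group that is
   10.4% of the population, assuming independent draws */
data _null_ ;
   p_none = (1 - 0.104) ** 15 ;
   put p_none= 6.4 ;            /* works out to roughly 0.19 */
run ;

/* Spot constants before modeling: a variable whose minimum equals
   its maximum has no variance. "jobs" is a placeholder name. */
proc means data = jobs n min max ;
   var _numeric_ ;
run ;
```

So even under the generous assumption, close to one sample in five of that size would hand you a constant.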
3. Okay, now I’m getting pissed. I set the dependent variable to be equal to a constant where everyone has a job. Now it finally does give me an error.
ERROR: All observations have the same response. No statistics are computed.
NOTE: The SAS System stopped processing this step because of errors.
You might think,
“Why didn’t it just say that the dependent variable was a constant?”
In its computer brain, it did. Remember, in logistic regression, the dependent is called the response variable. (That was in that 300-page manual.)
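A one-way frequency table on the response, run before the model, would have told you the same thing in plainer language. This is only a sketch, and jobs and employed are placeholder names rather than the ones I actually used.

```sas
/* Sketch only: check that the response takes more than one value */
proc freq data = jobs ;
   tables employed ;            /* a single row here means a constant response */
run ;
```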
4. Finally, I am getting really annoyed trying to create an obscure error message that would justify having invested the time to read a 300-page manual (not to mention all of those statistics courses in graduate school). I create a variable that has very little variance.
if _n_ < 10 then wrong_job = 0 ;
else wrong_job = 1 ;
With the result that the first 9 people have no job (_n_ < 10 stops at observation 9) while the other 470 or so do. I finally get SAS to run and give me a kind of obscure message:
Model Convergence Status
Quasi-complete separation of data points detected.
Warning: The maximum likelihood estimate may not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.
Just to make sure you don't overlook this message by skipping over all of the other tables to the end, looking at the hypothesis tests and seeing if you have significance (oh, yes, both SAS and I have met people like you before), it helpfully prints this heading on EVERY SINGLE PAGE for the remainder of the output.
WARNING: The validity of the model fit is questionable.
AND, in my log it adds, just for good measure and a little extra nagging:
WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood
estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on
the last maximum likelihood iteration. Validity of the model fit is questionable.
If it was my grandmother, it would have added,
And if you go ahead with what you want to do anyway after I warned you, on your head be it!
I expect to see that included in my log with SAS version 9.3.
As for quasi-complete separation, you can find that on page 84 of the SAS/STAT 9.22 User's Guide: The Logistic Procedure. No, that's not on page 84 of the guide, that's on page 84 of the LOGISTIC PROCEDURE SECTION of the guide.
There is also a really good article by Paul Allison on complete and quasi-complete separation, called "Convergence Failures in Logistic Regression".
So, even when you do get an error, it is not too hard to find fairly well-written explanations of the problem.
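If you would rather see the separation for yourself than take the log's word for it, a crosstab of the response against a categorical predictor will show it as a cell with a zero count. The sketch below assumes a dataset named jobs and uses difficulty as the predictor purely for illustration; I have not said which predictors were in the model above. The second step uses the FIRTH option on the MODEL statement, which requests penalized likelihood estimation and is one commonly suggested way to get finite estimates when separation is present. Treat it as something to read up on, not as a recommendation for your data.

```sas
/* Sketch only: "jobs" and the choice of difficulty as predictor are assumptions */
proc freq data = jobs ;
   tables difficulty * wrong_job / norow nocol nopercent ;   /* look for a zero cell */
run ;

proc logistic data = jobs descending ;
   model wrong_job = difficulty / firth ;   /* penalized likelihood estimation */
run ;
```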
Two points occur to me:
- You don't really need to read 90% of the manual to get started and get pretty far along with any of these procedures. If you continue with statistical analysis, though, at some time, you will indeed need to sit down and RTFM, but you can push that time off for weeks, months, years even - if you read a whole lot of other books, websites and articles that cover the same information and may be more interesting (in the same way that a dead jellyfish may not be a good thing to give a woman instead of flowers on a first date. Just sayin'.)
- I cannot BELIEVE I never wrote a blog about complete and quasi-complete separation! I even wrote one day last month that it was going to be my next post and then got distracted.
The world's most spoiled twelve-year-old is rolling her eyes and saying,
"I can't believe you forgot to write a blog about whatever that is either! But it's not like it's something really important, like that you forgot you were going to take me to Becca's dad's house for a sleepover."
Um, I think I have to go drive someone to Torrance for a sleepover.
As I sit here on a Sunday night trying to parse the output of the logistic regression that is at the core of my dissertation, I have just one thought.
WHY COULDN’T THIS ARTICLE HAVE BEEN ABOUT SPSS????
*sigh*
How can I understand the concepts underlying logistic regression well enough to explain it to my statistically illiterate friends, yet be completely stumped by all the stuff that SPSS spits out when I try to run one?
Have a nice drive!
I’ve had the same issue re: people thinking you’re insulting them by calling some of their variables dummy variables. Then you have to go on explaining that it’s not an insult to them nor their data, just a description of the process for their coding.
Also, I agree with Rebecca: more SPSS. I know, it’s stats for idiots, but it’s so much nicer to use and easier to teach students/demonstrate concepts with.