7 Tips for Not Getting Screwed by Your Data
First of all, I want to draw your attention to this retraction in the Journal of the American Medical Association. Mad props to Drs. Aboumatar and Wise and Johns Hopkins for doing the right thing in publicly retracting it.
For the TL;DR crowd
Someone who is probably now unemployed miscoded the study groups in this randomized clinical trial of self-management of Chronic Obstructive Pulmonary Disease. What does that mean? In this case, it meant that the reported results were the exact opposite of what was really observed because the treatment groups were coded incorrectly. Also, read the seven tips at the end of this post.
When I talk about statistical analysis, I focus 80% or more of my time and attention on the basics of knowing your data, cleaning your data and examining your data some more. To some, mostly younger, statisticians, that is not the sexy stuff. Why am I not talking about neural nets or generalized linear mixed models? Don’t I know that improving your prediction by 0.3% can result in millions of dollars in profit for a corporation that has 38 million customers?
What I know is that problems like the one in that JAMA article occur more often than we like to admit.
Recently, a student sent thesis results and then the next day sent an email saying, “Oops, I meant to use the DESCENDING option in PROC LOGISTIC but I didn’t, so the results are the exact opposite of what I said.”
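To see why getting the modeled event backwards flips everything, here is a minimal sketch in plain Python rather than SAS. The counts are invented for illustration and are not from the student's thesis or the JAMA study, and `log_odds_ratio` is a hypothetical helper, not a library function. It only computes a log odds ratio from a 2x2 table twice, once treating death as the event and once treating survival as the event.

```python
import math

# Hypothetical 2x2 table, invented for illustration only:
# outcome counts within each study group.
counts = {
    "treatment": {"died": 10, "lived": 90},
    "control": {"died": 30, "lived": 70},
}


def log_odds_ratio(table, event, non_event):
    """Log odds ratio of `event` for treatment versus control."""
    t, c = table["treatment"], table["control"]
    return math.log((t[event] / t[non_event]) / (c[event] / c[non_event]))


# Same data, two choices of which outcome is "the event".
effect_on_dying = log_odds_ratio(counts, "died", "lived")
effect_on_living = log_odds_ratio(counts, "lived", "died")

print(f"modeling death:    {effect_on_dying:+.3f}")
print(f"modeling survival: {effect_on_living:+.3f}")
```

The two numbers have the same size and opposite signs. Nothing about the data changed, only which outcome was treated as the event, which is why a reader who assumes the wrong one reports the exact opposite conclusion.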
A couple of years ago, I did an analysis with a depression scale for which the standardized coding is 0 to 3, but the application had used 1 to 4. The first analysis showed that every single person in the sample was clinically depressed. Fortunately, I caught this before it was published. Even when I re-analyzed the data with the correct scoring, the mean score was extremely high. This was not a random sample of the population, but rather, children with a family member addicted to methamphetamine. The original (incorrect) analysis wasn’t in the opposite direction, but it did somewhat overstate the problem.
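A rough sketch of how an off-by-one coding can flag everybody, again in plain Python. The scale length (20 items) and the cutoff (16) are assumptions picked for illustration, not the details of the scale in my analysis, and `out_of_range` is a hypothetical helper.

```python
# Assumed for illustration: 20 items, each scored 0-3, screening cutoff of 16.
N_ITEMS = 20
VALID_MIN, VALID_MAX = 0, 3
CUTOFF = 16


def out_of_range(responses, lo=VALID_MIN, hi=VALID_MAX):
    """Return the sorted distinct values that fall outside the documented range."""
    return sorted({r for r in responses if not lo <= r <= hi})


# A respondent who chose the lowest option on every item,
# as stored by an application that codes the options 1-4.
stored = [1] * N_ITEMS
rescored = [r - 1 for r in stored]  # shift back to the documented 0-3 coding

print(sum(stored) >= CUTOFF)    # True: flagged under the 1-4 coding
print(sum(rescored) >= CUTOFF)  # False: not flagged once rescored

# Any 4 in the data is a giveaway that the coding is not 0-3.
print(out_of_range([1, 2, 4, 3]))  # [4]
```

Under these assumptions the lowest possible total with 1-to-4 coding is 20, which is already above the cutoff, so nobody can score as not depressed. A look at the minimum and maximum of each item is enough to catch it.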
Several years before that, I worked for a client who had a previous consultant with no knowledge of their particular field but who was a very good programmer. In reviewing some of that person’s code to understand the data and how it had been scored, I found that NONE of the items that should have been reverse-coded had been. The consultant had simply taken the sum of all of the items. This research had been published, by the way. I mentioned this to the client and suggested that a retraction was in order. That retraction never happened and I never worked for that client again.
My Six Tips for Saving Your Ass
- Learn to code. I don’t mean you need to be the greatest SAS/R/Python/whatever guru in the world, but you should be able to read through the code someone else wrote and understand it. This means you should be able to read an IF-THEN statement, a loop re-coding all the items in an array and the statistical procedures used in your analysis.
- Understand that the DESCENDING option in PROC LOGISTIC means that the probability modeled is reversed. By default, PROC LOGISTIC models the probability of response levels with lower Ordered Value, so if you have death (coded 0 = lived, 1 = died) as the dependent variable, the procedure is predicting who lived. If you use the DESCENDING option, it’s going to predict who died.
- Know how many people should be in each group: control, experimental condition 1, experimental condition 2. Do a PROC FREQ and see if it matches what you expect.
- Know the range for each item in your analysis and do a PROC MEANS with mean, minimum, maximum and standard deviation. Even if you have 500 or 600 variables it shouldn’t take you all that long to scan through that many lines and see if anything is out of range.
- Know which items should have been reverse-coded and check if that was done.
- Compute reliabilities for each scale in an analysis. While the reliability would not have been changed in the depression example where 1 was added to every response, it would have picked up those cases where the variables were not re-coded by showing very low reliabilities.
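The reliability tip can be sketched in a few lines of plain Python. The responses below are made up, the two "negatively worded" items are an assumption for the example, and `cronbach_alpha` and `reverse_items` are hypothetical helpers that use the usual Cronbach's alpha formula, not output from any SAS procedure.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def reverse_items(rows, reversed_idx, lo=0, hi=3):
    """Reverse-score the items at the given column positions."""
    return [
        [(lo + hi - v) if i in reversed_idx else v for i, v in enumerate(row)]
        for row in rows
    ]


# Made-up raw responses on a 0-3 scale; the last two items are
# assumed to be negatively worded and so need reverse-coding.
raw = [
    [0, 0, 2, 3],
    [1, 1, 2, 3],
    [1, 2, 2, 2],
    [2, 2, 0, 1],
    [3, 2, 0, 0],
    [3, 3, 1, 0],
]

scored = reverse_items(raw, reversed_idx={2, 3})
shifted = [[v + 1 for v in row] for row in scored]  # the 1-4 coding mistake

alpha_raw = cronbach_alpha(raw)          # items summed without reversing
alpha_scored = cronbach_alpha(scored)    # items reversed correctly
alpha_shifted = cronbach_alpha(shifted)  # correct direction, wrong offset

print(f"not reverse-coded: {alpha_raw:.2f}")
print(f"reverse-coded:     {alpha_scored:.2f}")
print(f"shifted by one:    {alpha_shifted:.2f}")
```

With this toy data, the unreversed version comes out with a negative alpha, while the correctly scored version is above 0.9. Adding 1 to every response leaves alpha unchanged, because alpha depends only on variances, so you still need the range check to catch that kind of error.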
A seventh, extra bonus tip
If you can’t understand the code that someone has written, not because you are a moron (can’t help you there), but because they are one of those people who never write comments in code, don’t believe in documentation and write code that includes an unnecessary number of macro variables, user-written macros and overly complicated solutions, fire their sorry ass and hire someone less pompous. I’m not saying you shouldn’t have macros or that because a person uses a DATA step and you prefer PROC SQL you should get rid of them. What I am saying is if you ask a person what decisions they made in writing that code and what was the reason for, say, using a generalized linear model instead of a general linear model, they should be able to tell you.