Statisticians need an Occam, or at least a razor
My initial exposure to Occam’s razor came in my first undergraduate economics class. Perhaps due to my tender years, it made a great impression, and I have tried to apply it ever since. In short, Occam’s razor advises that when presented with competing, plausible choices, the default choice should be the simplest one.
Plausible is a key word in that sentence. So, don’t go writing down the answer to every statistics homework problem as 42 and saying Dr. De Mars told you to use Occam’s razor.
When sorting through propensity score macros this weekend, I was struck by how much the solutions for one problem can vary in complexity.
The concept of propensity scores is not that difficult. You have two groups: people who went to the emergency room and people seen in urgent care centers, or people who visited a museum exhibit on the Holocaust and people who didn’t. They can be any two groups, but the key point is this – they were not randomly assigned and they are not equivalent. If you want to decide whether visiting the Museum of Tolerance changes a person’s views on diversity, government or anything else, you need to consider that people who choose to visit the museum are probably different from people who don’t. Similarly, people who went to the ER are probably different in many ways from people who go to an urgent care center – severity of injury, income, insurance and so on.
So … you do some type of statistical analysis and get a score that takes all of these variables into account, called a propensity score. You try to match the two groups on this score as closely as possible. If you have 2,000 people who went to an ER and 12,000 who went to urgent care centers, you would select the 2,000 urgent care patients whose scores were closest to those of the Emergency Room sample. (You can also select more than one match for each person, but remember, we are keeping it simple.)
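In practice, the score is usually just the predicted probability of being in one of the two groups, given the variables you observed, and a logistic regression is the most common way to get it. Here is a minimal sketch in SAS – the data set and variable names (patients, er_visit, pscore and so on) are made up for illustration, not taken from any particular study:

```sas
/* Minimal sketch: estimate a propensity score as the predicted      */
/* probability of being in the ER group. Data set and variable       */
/* names are placeholders.                                           */
proc logistic data=patients;
   class insurance / param=ref;
   model er_visit(event='1') = age income insurance injury_severity;
   output out=scored prob=pscore;   /* pscore = the propensity score */
run;
```

However you estimate it, everything that follows – exact matching, digit matching, calipers – only needs that one pscore column.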
Most people don’t have much trouble grasping the LOGIC of propensity score matching, but then the complexity begins. You can match on exact scores.
“We matched people who had the same score.”
This is my preferred method right out of the gate because it fits the AnnMaria Scale (my personal version of Occam’s razor) which is,
“Can you explain what you did in 25 words or less, without using any words the reader has to look up in a dictionary?”
(Be impressed that the AnnMaria Scale also fits the AnnMaria Scale.)
On my latest project, I used a modification of the custom macro published by Lori Parsons and presented at the SAS Users Group International (now known as SAS Global Forum). She finds scores that match down to the fifth decimal place and spits those out into the final data set. Next, her macro pulls out scores that match down to the fourth decimal place, then three decimal places, then two, then one.
If the propensity scores aren’t even close – say, subject A has a score of .2, subject B has a score of .8, and these are the only two subjects left without a match – then no match occurs.
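For the curious, here is a rough sketch of a single pass of that idea in SAS – not Parsons’ actual macro, just the logic – pairing cases and controls whose scores agree after rounding to five decimal places. The real macro repeats this for the still-unmatched people at four decimals, then three, two and one. The data set names (cases, controls) and variables (id, pscore) are placeholders:

```sas
/* One pass of greedy digit matching, rounding scores to 5 decimals. */
%let digits = 5;

data cases_r;    set cases;    match_key = round(pscore, 10**(-&digits)); run;
data controls_r; set controls; match_key = round(pscore, 10**(-&digits)); run;

proc sort data=cases_r;    by match_key; run;
proc sort data=controls_r; by match_key; run;

/* Number the people within each rounded score so the merge pairs    */
/* them off one-to-one instead of many-to-many.                      */
data cases_r;    set cases_r;    by match_key; if first.match_key then seq = 0; seq + 1; run;
data controls_r; set controls_r; by match_key; if first.match_key then seq = 0; seq + 1; run;

data matched_5 leftover_cases leftover_controls;
   merge cases_r    (in=in_case    rename=(id=case_id    pscore=case_pscore))
         controls_r (in=in_control rename=(id=control_id pscore=control_pscore));
   by match_key seq;
   if in_case and in_control then output matched_5;        /* a matched pair        */
   else if in_case           then output leftover_cases;   /* retry at fewer digits */
   else                           output leftover_controls;
run;
```

The leftover data sets feed the next, looser pass, which is what makes the whole thing explainable in well under 25 words.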
Advantages of this method
- It’s easy to understand
- For data sets with a large control group and lots of close matches, it makes a lot of sense and yields a very closely matched control group
- It’s not very computer-resource intensive. With a data set of 15,000 records, it took about a second to run on my iMac running Boot Camp with 4GB of RAM on Windows 7 with a 32-bit operating system. (Translation: it took one second on a not very fancy system.)
Really, if you can match 75% of your subjects down to the fifth decimal place on the propensity score and 90% of them within three decimal places, I think any additional precision you get in matching is going to be trivial.
Disadvantages of this method
- You may not match everyone.
- If your two groups are very different on the variables included in the equation used to create the propensity scores, a substantial proportion of your subjects may go unmatched.
This seems to occur a lot more often in medical studies than in the social science types of research I am usually working on. If you’re looking, for example, at people who elect to be in an experimental treatment study, they may be far sicker (and thus more willing to try an unproven drug or procedure) than people who go with a more orthodox treatment. People who are airlifted somewhere are probably much different from those who don’t get helicopter transport. So, how do you create groups when you are NOT going to find many close matches?
One way is to use calipers. In short, you match people who are within a given range, say within .10 on the propensity score or, as used by another program, within .25 standard deviations of the logit. A second custom macro, which was, as far as I can see, perfectly appropriate for the data in that study, performed this type of match using a combination of principal components analysis, cluster analysis, logistic regression, PROC SQL and hash tables.
PROC SQL made perfect sense because you are doing a many-to-many match. You want to match all of the people who fall within a given range with all of the other people. Hash tables I usually consider unnecessary, because their main advantage is increased speed, and whether my program runs in 1.5 seconds instead of 1.8 seconds is usually not a major concern of mine. Right now, the version of this program without hash tables that I am running as a test is on its second hour. So, hash tables make a lot of sense IF you are running on a computer with limited processing speed. (I have learned to test things on the worst piece of crap equipment I can find and thus spare clients that old canard of programmers and IT support, “It worked on my system.” If it works on all of our test systems, it ought to run on anything short of a solar-powered graphing calculator.)
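To make the PROC SQL point concrete, here is a hedged sketch of the many-to-many caliper step – again with placeholder data set and variable names, and not the macro from that paper. Every case is paired with every control whose propensity score falls within the caliper; choosing one “best” control per case, without reusing controls, is the part where the real macros earn their complexity:

```sas
/* Placeholder caliper width; .25 SD of the logit is another common choice. */
%let caliper = 0.10;

proc sql;
   create table candidate_pairs as
   select a.id as case_id,
          b.id as control_id,
          abs(a.pscore - b.pscore) as distance
   from cases as a, controls as b                  /* many-to-many join         */
   where abs(a.pscore - b.pscore) <= &caliper      /* keep pairs inside caliper */
   order by case_id, distance;                     /* closest control first     */
quit;
```

Sorting each case’s candidates by distance is what lets a later greedy (or optimal) pass grab the nearest unused control, which is where the hash tables or more elaborate optimization come in.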
What the folks in the second article were trying to do was find the optimal possible match. Given the problem with their data, that made perfect sense. If, as in the data I have at the moment, you can match your subjects very closely without all of the extra steps, it makes sense to do so.
There is a second thing that bothers me. I remember all the way back to an article on Analysis of Covariance I read in graduate school. The article was written even further back still, in 1969, by Janet Elashoff. Her main point was that analysis of covariance is “a delicate instrument.”
While I understand from a statistical and programming point of view what researchers are trying to do with some of the optimizing algorithms for matching, the premise is troubling. Some of these groups being compared are very different.
Elashoff was most concerned (and I agree) with situations where the two groups depart very far from the assumption of random assignment. We don’t have any new, magical mathematics today that makes that no longer a concern.
If you use a high-performance computer to crank through your population of 26,856 control subjects to find the 843 patients who are optimally like the 843 people in your study who elected to undertake a very risky treatment, at a minimum you have a non-random sample of the population of people who did not elect treatment. It’s also plausible that they are very different in other ways from the people who did receive treatment. The person who is willing to take any chance, no matter how remote, regardless of potential side effects, to find a cure, and the one who is not interested in something that promises the possibility of six more months of life of questionable quality – those might be quite different people. All propensity score discussions mention the concern about “unobserved differences” and then scurry on to complex mathematics. I think they should linger a bit.
In some cases, though, it seems that there are NOT really many very good matches. You really can’t statistically control for the difference in intelligence between a genius and a moron using analysis of covariance – or anything else.
Curiously, it reminds me of the scene from Cool Runnings where the captain of the Jamaican bobsled team is whacking his teammates’ helmets. When one of his teammates asked him why he was doing that, the captain said it was because the Swiss team did it and they won a lot of races. To which the teammate responded,
“Yes, well they make them little knives, too, and I don’t see you doing that.”
There’s a lot of interest currently in applying these methods developed in medical research to social science. Some of those ideas are good, but some ought to be examined a lot more before we start widely applying them.
I noticed your more recent post about using PROC SurveySelect in exciting, or at least different, ways. Jamie culls my stream for entries in the NISS/SAMSI paper.li too, which is where I noticed your blog. I’m a very practical statistician, i.e. probability models, descriptive stats, mostly for data quality, regulatory reporting, financial compliance, some survey analysis and SAS for large data sets … but NOT #big data!
I appreciated your sentiments here. They are not reductive, they aren’t negative, but they are practical: about using propensity scores in an appropriate context, while urging care before applying them to the social sciences. I laughed at this:
and of course,
A question: You are AnnMariaStat, the blog name is The Julia Group. Are you Julia, or is that someone else, or just a corporate entity? “Julia” makes quaternions float through my mind, can’t be unseen…!
Thank you for your insights here!
Julia is my youngest daughter. She is named after Gaston Julia, the mathematician after whom the Julia set of fractals is also named.
The Julia Group was a satellite office of Spirit Lake Consulting and when we spun off to a separate company we needed a new name.
We also sell the fractaldomains software. The rocket scientist wrote it.