Survival analysis tip from Ogden Nash
I might as well give you my opinion of these two kinds of sin as long as, in a way, against each other we are pitting them,
And that is, don’t bother your head about the sins of commission because however sinful, they must at least be fun or else you wouldn’t be committing them.
It is the sin of omission, the second kind of sin,
That lays eggs under your skin.
The way you really get painfully bitten
Is by the insurance you haven’t taken out and the checks you haven’t added up the stubs of and the appointments you haven’t kept and the bills you haven’t paid and the letters you haven’t written.
I had to respond to the post by Mike Nemecek and tweets by Rick Wicklin quoting Shakespeare with some culture of my own. Not having any degrees in liberal arts, though, the best I could do was this excerpt from the poem by Ogden Nash, Portrait of the Artist as a Prematurely Old Man.
Lately I’ve been playing around with PROC LIFETEST, and it occurred to me that a way to get painfully bitten with this, and other survival analysis procedures, is not to think about some obvious facts. I’m assuming you are either new to these procedures or in a big hurry, and you don’t scrutinize your output carefully. In that case, you may misinterpret the mean survival time.
The mean survival time is the mean length of time people / bacteria / rats survived, right? Not necessarily.
Many procedures, such as factor analysis and regression, automatically drop observations with missing data. Survival analysis procedures don’t work exactly the same way: an observation with an unknown survival time is censored, not missing, and it still tells you that the subject survived at least as long as you observed it.
I am telling you this because it is a mistake I have seen made by people who were familiar with other statistical procedures and, I can only presume, in a hurry. Their solution to not knowing the length of survival time for some of their subjects was to drop those subjects.
Let’s try this with a data set I have lying around. I use only those observations for which I have complete data, that is, for which I know the survival time. It gives me the following information on survival time in days.
Mean | Standard Error |
360.934 | 22.183 |
Quartile Estimates (95% confidence interval)
Percent | Point Estimate | Transform | [Lower | Upper) |
75 | 532.00 | LOGLOG | 489.00 | 612.00 |
50 | 308.00 | LOGLOG | 244.00 | 393.00 |
25 | 167.00 | LOGLOG | 117.00 | 193.00 |
Easy, right? The mean survival time is about 361 days. The median is 308 days.
However, this is only using those people for whom we have a survival time. What about the other people? When I include EVERYONE, whether they died or not, I get the following:
Mean | Standard Error |
431.466 | 22.506 |
So, is this the correct value then? Are these the correct percentile points?
Percent | Point Estimate | Transform | [Lower | Upper) |
75 | 652.00 | LOGLOG | 560.00 | 755.00 |
50 | 428.00 | LOGLOG | 341.00 | 512.00 |
25 | 192.00 | LOGLOG | 157.00 | 237.00 |
Well, not exactly. In fact, if you are using SAS, you will see this helpful note in your log:
The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time.
In my sample data set, 25% of the observations were censored, that is, we don’t know when they died.
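To make that log note a little more concrete, here is a minimal sketch in plain Python of the product-limit (Kaplan-Meier) estimate, the kind of estimate PROC LIFETEST reports by default. The eight observations are invented for illustration and have nothing to do with my data set, and the function names are my own. The median here is simply the first event time at which estimated survival drops to 0.5 or below, ignoring any tie-handling conventions particular software may apply. Because the largest observation is censored, the curve never reaches zero, so the mean can only be computed as the area under the curve up to the largest event time.

```python
from itertools import groupby

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    for t, group in groupby(data, key=lambda row: row[0]):
        group = list(group)
        deaths = sum(e for _, e in group)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Everyone observed at time t, dead or censored, leaves the risk set.
        at_risk -= len(group)
    return curve

def restricted_mean(curve):
    """Area under the step function S(t) from 0 to the largest event time."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area

def km_median(curve):
    """First event time at which estimated survival is 0.5 or less."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # survival never dropped to 0.5 in the observed range

# Hypothetical toy data: days observed, and 1 = died, 0 = censored.
times = [5, 8, 12, 15, 20, 25, 30, 40]
events = [1, 1, 0, 1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
event_times = [t for t, e in zip(times, events) if e]
complete_case_mean = sum(event_times) / len(event_times)

print("Complete cases only, mean:", complete_case_mean)
print("Kaplan-Meier restricted mean:", restricted_mean(curve))
print("Kaplan-Meier median:", km_median(curve))
```

With these made-up numbers, dropping the censored subjects gives a mean of 15.6 days, while the restricted mean that uses everyone is roughly 20.4 days, and even that is an underestimate because the subject censored at day 40 was still alive.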
Can we then say that the median survival time really is 428 days? Okay, the mean is not correct, because some of those 25% may have died years later. What about the median? The answer is, “It depends.”
Depends on what, you might ask. Well, it depends on why you don’t know when they died. If they dropped out of your study and you have no idea what happened to them, then I would say that you want to be a bit cautious in your interpretation of both the mean and the median survival times.
If everyone whose survival time you don’t know is censored because your study ended and they were still alive, then I would say, yes, the median survival time is accurate, assuming you had data on all of those people at the beginning and the end of the study.
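Here is a small sketch of why the median holds up in that situation, again in plain Python with invented numbers. I am assuming nine hypothetical subjects whose true survival times we happen to know, a study that ends on day 650, and no dropouts, so the only censoring is at the end of the study. Under that assumption the Kaplan-Meier estimate is just the fraction of subjects still alive, which is what the function below uses. It is not a general-purpose estimator.

```python
def median_survival(times, died):
    """First death time at which the fraction still alive is 0.5 or less.

    Valid as a survival estimate only when all censoring happens after the
    last observed death, as with pure end-of-study censoring.
    """
    n = len(times)
    for t in sorted(t for t, d in zip(times, died) if d):
        if sum(1 for u in times if u > t) / n <= 0.5:
            return t
    return None

# Hypothetical true survival times in days (never fully seen in a real study).
true_times = [100, 200, 300, 400, 500, 600, 700, 800, 900]
study_end = 650

# What the study actually records: follow-up stops at study_end.
observed = [min(t, study_end) for t in true_times]
died = [t <= study_end for t in true_times]

true_median = median_survival(true_times, [True] * len(true_times))
study_median = median_survival(observed, died)

true_mean = sum(true_times) / len(true_times)
observed_mean = sum(observed) / len(observed)

print("Median, full follow-up vs. study:", true_median, study_median)
print("Mean, full follow-up vs. study:", true_mean, observed_mean)
```

More than half of these made-up subjects died before day 650, so the median is the same 500 days either way. The mean is not so lucky: it is 500 days with full follow-up but only 450 when the three survivors are counted as if they had died on the last day of the study.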
Although survival analysis is generally thought of as predicting whether one survived in the sense of not dying, it is also used for other applications, such as predicting how long people last in a treatment program, or how long they go without committing another crime, drinking or using drugs. In those cases, when people are lost by dropping out of your study and out of sight, I would suspect it is very possible that at least some of those people began drinking, committing crimes or whatever prior to the end of your study. So, when people are censored because they dropped out and their data are missing, I would be a bit skeptical of both the median and the mean, and, as with all data problems, the larger the percentage of your sample this affects, the more worried about it I would be.
An interesting point came up the other day when I was listening to a lecture. I’d assumed that in animal studies the only problem you would have is right-censored data due to the subjects living past the end of the study period. The lecturer mentioned that a couple of her subjects were censored because the rats had died of other causes unrelated to the disease under study. Not sure what occupational hazards a rat faces, but it’s getting pretty bad when you can’t even count on a rat these days.