SAS Tricks for Massaging Data into Shape
Today, I was thinking about using data from the National Hospital Discharge Survey (NHDS) to try to predict type of hospital admission. Is it true that some people use the emergency room as their primary method of care? Mostly, I wanted to poke around with the NHDS data and get to know it better for possible use in teaching statistics. Before I could do anything, though, I needed to get the data into a usable form.
I decided to use the type of hospital admission as my dependent variable. There were certain categories, though, that were not going to depend on much else. For example, if you are an infant born in a hospital, your admission type is newborn. I dropped the newborns, and I also deleted the people whose admission type was not given.
The next question was what would be interesting predictor variables. Unfortunately, some of what I thought would be useful had less than perfect data. Take discharge status: about 5% of the patients had a status of "Alive, disposition not stated".
I also thought either diagnostic group or primary diagnosis would be a good variable for prediction. When I did a frequency distribution for each, it was ridiculously long, so I thought I would be clever and only select those diagnoses that accounted for .05% or more of the records, which is over 60 people. Apparently, there is more variation in diagnosis than I thought, because in both cases that was over 330 different diagnoses.
Here is a handy little tip, by the way:
PROC FREQ DATA = analyze1 NOPRINT ;
TABLES dx1 / OUT = freqcnt ;
RUN ;
PROC PRINT DATA = freqcnt ;
WHERE PERCENT > 0.05 ;
RUN ;
This will only print out the diagnoses that occurred more than the specified percentage of the time. (PERCENT in the output dataset is on a 0 to 100 scale, so 0.05 here means .05%.)
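A variation on the tip above, as a rough sketch: the WHERE= dataset option can be put on the OUT= dataset itself, so only the rows above the cutoff are ever written. The dataset name freq_common is made up for illustration, and I am assuming the same analyze1 dataset and dx1 variable as above.

```sas
/* Sketch: filter the frequency table as it is written out.     */
/* freq_common is a hypothetical name, not from the original.   */
PROC FREQ DATA = analyze1 NOPRINT ;
  TABLES dx1 / OUT = freq_common (WHERE = (PERCENT > 0.05)) ;
RUN ;

/* Everything left in freq_common is already above the cutoff.  */
PROC PRINT DATA = freq_common ;
RUN ;
```

This keeps the cutoff in one place, which helps if the filtered counts feed a later step instead of just being printed.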
Then I thought, what about the diagnoses that were at least .5% of the admissions? So, I re-ran the analysis with a cutoff of 0.5, this time on drg, and came up with 41 DRGs. I didn't want to type in 41 separate DRGs, especially because I thought I might want to change the cutoff point later, so I used a SAS format, like this. Note that in a CNTLIN dataset, which I am creating, the variables MUST have the names fmtname, label and start.
Also, note that the RENAME statement doesn’t take effect until you write out the new dataset, so your KEEP statement has to have the old variable name, in this case, drg.
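data fmtc ;
set freqcnt ;
if percent > 0.5 ; /* keep only DRGs above the cutoff */
retain fmtname 'drgf' ; /* name of the format to create */
retain label 'in5' ; /* formatted value for every kept DRG */
rename drg = start ;
keep fmtname drg label ; /* old name, because RENAME applies on output */
run ;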
Okay, so, what I did here was create a dataset that assigns the formatted value of in5 to every one of my diagnosis related groups that occurs in more than .5% of the discharges.
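If keeping track of the old name in the KEEP statement feels error-prone, one alternative, sketched here under the same assumptions (freqcnt holds drg and percent), is the RENAME= dataset option on the SET statement. That rename happens as the data is read in, so the rest of the step uses the new name.

```sas
/* Sketch: rename on input instead of on output.                */
DATA fmtc ;
  SET freqcnt (RENAME = (drg = start)) ; /* start exists from here on */
  IF percent > 0.5 ;
  RETAIN fmtname 'drgf' label 'in5' ;
  KEEP fmtname start label ;             /* new name this time  */
RUN ;
```

Either way, the resulting dataset has the three variables a CNTLIN dataset needs.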
To actually create the format, I need one more step:
proc format cntlin = fmtc ;
run ;
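Before relying on the new format, it can be worth a quick look at what actually got created. A minimal sketch, assuming the format landed in the default WORK library:

```sas
/* Sketch: print the contents of the drgf format to check that  */
/* each selected DRG maps to the label in5.                     */
PROC FORMAT FMTLIB ;
  SELECT drgf ;
RUN ;
```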
Then, I can use this format to pull out my sample:
DATA analyze2 ;
SET nhds.nhds10 ;
IF admisstype in (1,2,3) ; /* drops newborns and admission type not given */
IF dischargestatus in (1,3,4,6) & PUT(drg,drgf.) = 'in5' then insample = 1 ;
ELSE insample = 0 ;
RUN ;
I could have just selected the sample that met these criteria, but I wanted to have the option of comparing those I kept in and those I dropped out. Now, I have 71,869 people dropped from the sample and 59,743 that I kept. (I excluded the newborns from the beginning because we KNOW they are significantly different. They’re younger, for one thing.)
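Since the flag is there, a quick way to see the kept and dropped counts, and how they break down by admission type, is a frequency table. This is only a sketch using the variables created above:

```sas
/* Sketch: count kept (1) versus dropped (0) records, then      */
/* cross-tabulate the flag against admission type.              */
PROC FREQ DATA = analyze2 ;
  TABLES insample insample * admisstype ;
RUN ;
```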
So, now I am perfectly well set up to do a MANOVA with age and days of care as dependent variables. (You’d think there would be more numeric variables in this data set than those two, but surprisingly, even though many variables in the data set are stored as numeric they are actually categorical or ordinal and not really suited to a MANOVA.)
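For what it's worth, a MANOVA along those lines might look something like the sketch below. The variable names age and daysofcare are my guesses at what the two numeric variables are called, and using admisstype as the grouping factor is an assumption too, so adjust both to match the actual dataset.

```sas
/* Sketch only: age and daysofcare are hypothetical names, and  */
/* admisstype as the grouping factor is an assumption.          */
PROC GLM DATA = analyze2 ;
  WHERE insample = 1 ;              /* kept records only        */
  CLASS admisstype ;
  MODEL age daysofcare = admisstype ;
  MANOVA H = admisstype ;           /* multivariate tests       */
RUN ;
QUIT ;
```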
Anyway …. I think that MANOVA will be one of the first analyses we do in my multivariate course. It’s going to be fun.