Fixing data the easy way, part 2
I have learned not to be too smart for my own good. Yesterday was an example.
My client provides many different types of services to the consumers who use their program. There are about 15 different options, from counseling to on-the-job training to assistive technology. We want to get the total number of services each participant in their program receives and the average number for the total customer base. However, there are fifteen different variables and for each one, the data are entered “Yes” for if the customer received it and “No” if they didn’t. Of course, these data being entered by real human beings and not computers, the data are actually entered, “Yes” or “yes” or “YES” or “Yes ” . You get the idea. And, of course, all of those yes’s are interpreted differently by the computer.
My first (dumb) thought was to do two array statements, create 15 different variables, all numeric and use a DO-loop to convert to 0=No, 1 = Yes. Then I could sum those variables to get how many services each customer received. I could apply a PROC FORMAT so when the report printed out it would print as “YES” and “NO”.
There is a much easier way.
Data services ;
set report7_11 ;
Array nums{*} counsel -- Other_services ;
num_services = 0 ;
do i = 1 to dim(nums) ;
nums{i} = upcase(nums{i}) ;
if trim(nums{i}) = "YES" then num_services + 1 ;
end ;
Let’s look at each of these statements ….
Data services ;
set in.consumers7_11 ;
Well, obviously, the first one creates a data set. The second statement reads in data from a permanent SAS dataset.
Array nums{*} counsel — Other_services ;
This creates an array of all of the variables in the dataset from the “counsel” variable to the “Other_services” variable. Conveniently for me, and for the data entry personnel, their dataset was set up so these were 15 consecutive variables.
num_services = 0 ;
This sets my num_services variable to 0. If you DON’T do this the value will be retained from one iteration of the data step to the next. Your first record will be correct, but for the second person you’ll get the number of services they received plus the number the first person received. Bad!
do i = 1 to dim(nums) ;
Obviously, sets up the DO loop to go from the first variable to however many variables there were in the array. I don’t know if there were exactly 15 and I was too lazy to count them.
nums{i} = upcase(nums{i}) ;
This changes all of the values to upper case.
if trim(nums{i}) = “YES” then num_services + 1 ;
This trims the trailing blanks for the values and, if the result is “YES” (notice it is now upper case with no trailing blanks) the value for num_services is increased by one.
end ;
End the DO-loop and we’re done.
“This sets my num_services variable to 0. If you DON’T do this the value will be retained from one iteration of the data step to the next. Your first record will be correct, but for the second person you’ll get the number of services they received plus the number the first person received.”
I don’t think that’s correct, actually. I don’t see a RETAIN statement, so I’m not sure the value would be retained from one record to the next. If you didn’t set it to 0 before the DO loop, but still had “then num_services + 1”, you’d probably just get a bunch of missing values, and associated warnings in the log, right?
Yes, it is correct. I am using the sum statement (which doesn’t actually have the word “sum” in it anywhere).
With the
Sumvar + 1
notation, you don’t need the RETAIN statement.
My log is free of warnings – for today, anyway!
Here is one of many sites that explains RETAIN and the sum statement.
http://oregonstate.edu/dept/statistics/sasclass/3-5retainsum.htm
The log is free of warnings precisely because you’re starting the processing of each record by setting “num_services = 0 ;”. So, the value of num_services is getting set to 0, and then the sum statement is incrementing it upwards from there. It does it separately for each record.
What I was saying in my previous comment was that not having the “num_services = 0 ;” line would NOT cause the value of num_services to keep adding from the value of the previous record, unless you ALSO had a “RETAIN” statement in there. I wasn’t actually suggesting you put a RETAIN statement though.