Fixing character data without beatings: SAS Enterprise Guide
At the JMP seminar on Monday, when Dick De Veaux said that 65-70% of time in all research projects is spent on data cleaning, everyone in the audience groaned in agreement.
One of the biggest problems I run into is recoding those simple textboxes. For example, we often want to look at data for one reservation or tribe. Now, one would THINK that the answer to a question like:
If you are an enrolled tribal member, in what tribe?
would be SIMPLE? I mean, you know what tribe you’re in, right?Most likely it is on the sign on every **^ building on the reservation.
No. People need to abbreviate because it takes too much time to write, for example, “Turtle Mountain”, so they have to abbreviate it as TM because that extra 4 seconds they saved could be so profitably spent driving slowly on the back roads in front of me when I need to be somewhere. Other people, helpfully, put TMT (for Turtle Mountain Tribe), or the legal name, “Turtle Mountain Band of Chippewa Indians”, which is abbreviated by some as TMBCI. Still others put Ojibwe which is also sometimes spelled Ojibwa which is the name for Chippewa in Ojibwe (or is it Ojibwa). I can go on for another several paragraphs on this, because there are also Red Lake, White Earth and more reservations of the same tribe.
So … (hyperventilating here), in the old days, if I wanted to do an analysis of just the respondents from the Chippewa tribe, I would do something like :
1. Do a frequency distribution to find all the zillion permutations of this ONE question.
2. Be refrained by the wiser, kinder members of our office staff from beating the participants in our research with sticks.
3. Explain for the 59th time in the staff meeting why tribe cannot be multiple choice item (there are 562 federally recognized tribes. We cannot have a multiple choice item with 562 choices. )
4. Write SAS statements that look something like this:
Tribe = upcase(Tribe) ;
Tribe_s = substr(Tribe,1,4) ;
If tribe_s in (“TMBC”, “TURT”,”CHIP”,”OJIB”)
OR
tribe in (“RED LAKE”, “WHITE EARTH”) then chippewa = 1 ;
Except that the IF statement would be much, much, much longer. This is a common type of problem and you can find lots of solutions in many SAS Global Forum papers. Here is my new one, no beatings required.
In SAS Enterprise Guide, go to TASKS, select DATA, select FILTER & SORT. Click on the thing that looks like a filter.
From the drop-down variable list, I pick TRIBE. From the next list, I select IN A LIST. I click on the three dots and a list of all values appears. I can hold down the shift key and select several in a row, all the Chip, Chippewa, Chippewa Tribe and so on. After clicking OK, I can go back and select more values to add to the list.
Done. Pointing and clicking and no clever uses of SUBSTR, UPCASE or other nifty functions required. (Note to self: Find out what new careers the SAS function- users have now.)
So, now, with a few points and clicks, by going to TASKS, selecting DESCRIBE, then SUMMARY TABLES, I can produce this table that tells me if you are on one of the Chippewa reservations surveyed, your internet usage is related to your years of education and age. The relationship between internet usage and age seems to be curvilinear here.
My suspicion is that older people are slower to adopt new technologies, however, technology adoption is also enabled by money and those with more money tend to have more education (which is somewhat related to age, you don’t have a lot of 18-year-old college graduates). I can begin to examine some multivariate relationships now.
Notice what is going on here ! In about 12 seconds I have motored through the data cleaning part combining the frequency distributions, recoding and selection and am now delving into data analysis.
I would not go so far as to say that this is better than sex (hence explaining the four children) but it is definitely way cool and makes me happy.
In the interest of full disclosure, I must say this. If you have never used SAS before, it will take you longer than 12 seconds. Here is why :
- The filter on filter and query is three blanks followed by a box with three dots. The reaction of an experienced SAS programmer, especially one who ever used the analyst application, is to recognize that as an IF statement with the first box as the variable (click arrow for drop down list of variables), the second box as the operation (click arrow for drop down list of operations) and the third box as whatever you want to select (click the three dots for more). The reaction of the rest of humanity is going to be WTF?
- In creating computed columns, which I did to recode the internet usage variables, I immediately knew that the format I wanted was $CHARw. and that I needed a length of 10.
So, SAS EG is a great thing for any researcher who is a SAS programmer. It is also a great thing for any researcher who wants to be a SAS programmer. I won’t lie to you and say it will be completely easy and painless, but it is true that less beating of subjects with sticks will be required.