Discovering if your data blow with help from SAS Enterprise Guide
“Is there anything you can do to help? I’d kill you but there is a law against it. You’d better leave before I figure out a way around that.”
This comment was made by a co-worker of mine who had saved all of the data for his thesis for a master’s in computer science on his hard drive. Someone who needed assistance had stopped in his office, popped in a floppy disk and accidentally formatted the hard drive instead of the floppy. I tell this story just to point out that people screwing with your data is a phenomenon that dates back to at least floppy disks, which, if you ask my children, is equivalent to prehistoric.
Why You Need to Look at Your Data Seven Different Ways Before You Do ANY Statistical Analyses
- The data were entered by clerks making minimum wage who hate that they are doing a job that, were it not for animal cruelty laws, would be done by a half-trained monkey.
- The data were entered by really bright undergraduates at a prestigious university who smoked something really good before coming in to work. (Are they still called joints? Email me if you know the answer.)
- After you taught all day, graded papers, read the RFP for your next grant, you entered all the data yourself – and finished both data entry and your third martini at 2 a.m.
So… you have your data entered into SAS Enterprise Guide. Congratulations. The very first thing you should do is go to the Tasks menu, select Describe, and then select the List Data option. If you have a small dataset, you may want to list the whole thing. Otherwise, click on the Options tab. In the window to the right, in the drop-down box under Rows to list, select ‘Every nth row’ and give a value for n, say 10. This is what statisticians refer to as a systematic sample and what other people, who do not invite us to their parties, refer to as every tenth row.
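For anyone who wants to see the every-nth-row idea outside the menus, here is a minimal Python sketch. It is not what SAS Enterprise Guide runs behind the scenes; the function name `every_nth_row`, the `start` parameter and the 25 made-up records are all assumptions for illustration. Whether the List Data task starts counting at the first row or the nth row is something to check in your own output.

```python
from itertools import islice


def every_nth_row(rows, n, start=0):
    """Return every nth row, beginning at 0-based index `start`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return list(islice(rows, start, None, n))


# Hypothetical records standing in for a data set of 25 rows.
rows = [{"id": i, "score": 50 + i} for i in range(1, 26)]

# Assumption: start with the first row, then take every tenth row after it.
sample = every_nth_row(rows, 10)
for row in sample:
    print(row)
```

If you pick `start` at random between 0 and n - 1, the listing becomes a systematic sample with a random start, which is the version statisticians tend to prefer.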
The output is very plain vanilla, as you can see. You could make it prettier, but why? I do like the fact that SAS EG lets me output it as an HTML file so it can be uploaded easily and read by anyone. Because I do a lot of work as a telecommuter, this makes my life easier. Unlike most of what makes my life easier – the housekeeper, the detail car wash guy, Safari Books – the HTML output feature doesn’t charge me. So, props to it.
Go here for more step-by-step on how to use List Data. This is my personal university web page.
(I can link from here to there but not vice versa because some people are concerned about a rumor that this blog is written without supervision by the university attorneys, or in fact, by a responsible adult of any profession. This rumor is true.)
Next awesome innovation, go to Tasks again, then Describe then Characterize Data. This task reminds me of the first grader who wrote in his book report, “This book taught me more about penguins than I wanted to know.”
The characterize data task may tell you more about your data than you want to know if you just go with the default options, so I wouldn’t. I’d recommend unchecking the boxes next to Graphs and also the one next to SAS Datasets, the option that produces datasets containing univariate statistics and frequencies. You may need those datasets or charts for every variable, but usually you don’t. They just slow down your job and produce a bunch of output you aren’t going to look at, especially if you have dozens or hundreds of variables. You may want to look at graphs for some selected variables later.
By default, the characterize data task will give you frequency distributions for categorical variables with 30 or fewer categories, and, for other categorical variables, the frequencies of the 30 most common categories. You can change the default from 30, if you would like. It will also produce descriptive statistics for all numeric variables, as well as the number of missing values. Again, you can make your output prettier than my output shown here, with titles, footnotes and probably embedded images of bells and whistles, but since the purpose of this page is to check for out-of-range values, outliers, etc., why bother, unless you are really, really bored.
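To make the logic of that summary concrete, here is a rough Python sketch of the same kind of report: frequencies (capped at the 30 most common values) for text columns, and descriptive statistics plus a missing count for numeric ones. This is an illustration under my own assumptions, not the code the Characterize Data task generates. The function `characterize`, the column names and the five records are hypothetical, and treating `None` and empty strings as missing is a choice made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def characterize(rows, max_categories=30):
    """Summarize each column: frequencies for text, statistics for numbers."""
    summary = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumption: None and "" both count as missing.
        present = [v for v in values if v is not None and v != ""]
        info = {"n": len(present), "n_missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            info.update(
                min=min(present),
                max=max(present),
                mean=mean(present),
                std=stdev(present) if len(present) > 1 else None,
            )
        else:
            # Keep only the most common categories, like the default of 30.
            info["frequencies"] = Counter(present).most_common(max_categories)
        summary[col] = info
    return summary


# Hypothetical records, including one missing age, one missing gender,
# and one age that somebody typed after the third martini.
rows = [
    {"gender": "F", "age": 34},
    {"gender": "M", "age": 29},
    {"gender": "F", "age": None},
    {"gender": "", "age": 41},
    {"gender": "F", "age": 290},
]

report = characterize(rows)
for column, info in report.items():
    print(column, info)
```

The point of the minimum and maximum is exactly the reality check described above: an age of 290 jumps out of the summary long before it wrecks a regression.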
In some cases, it may be of interest to see if you have a normal distribution because you really do expect one. In this case, go to Tasks again, select Describe then Distribution Analysis. If you select Normal under Distributions, you can enter the hypothesized mean (except it isn’t hypothesized at all since you just saw it in the previous task) and standard deviation, too, if you so desire. Click on Plots and then select Histogram to see a histogram of your data with a normal curve superimposed.
You can also use the Titles options to enter titles and footnotes, since one should never miss the opportunity to suck up to the funding agency. If you want to change the output for some reason, say, you have a purple fixation, you can go to the Tools menu and select Options. Click Results, then HTML. You can select a different style for your output, then re-run the distribution analysis.
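The idea behind a histogram with a normal curve on top is a comparison of how many observations fall in each bin against how many a normal distribution with the same mean and standard deviation would put there. A minimal Python sketch of that comparison follows. It is not the Distribution Analysis task itself; the function `histogram_vs_normal`, the choice of five equal-width bins and the fourteen scores are assumptions for illustration.

```python
from statistics import NormalDist, mean, stdev


def histogram_vs_normal(values, n_bins=5, mu=None, sigma=None):
    """Return (left, right, observed, expected) for equal-width bins."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are identical; nothing to bin")
    # Use the sample mean and standard deviation unless others are supplied.
    mu = mean(values) if mu is None else mu
    sigma = stdev(values) if sigma is None else sigma
    dist = NormalDist(mu, sigma)
    width = (high - low) / n_bins
    result = []
    for i in range(n_bins):
        last = i == n_bins - 1
        left = low + i * width
        right = high if last else low + (i + 1) * width
        # Bins are half-open, except the last one, which includes the maximum.
        observed = sum(
            1 for v in values if left <= v < right or (last and v == high)
        )
        expected = len(values) * (dist.cdf(right) - dist.cdf(left))
        result.append((left, right, observed, expected))
    return result


# Hypothetical scores on a dependent variable.
values = [61, 64, 66, 67, 68, 69, 70, 70, 71, 72, 73, 74, 76, 79]

bins = histogram_vs_normal(values)
for left, right, observed, expected in bins:
    # One '#' per observation; the expected count is printed alongside.
    print(f"{left:6.1f} to {right:6.1f} {'#' * observed:<8} expected {expected:.1f}")
```

If the observed and expected counts are wildly different, that is your cue to look harder at the data before reaching for a parametric model. Eyeballing like this is a sanity check, not a formal test of normality.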
There. Purple. Are you happy now?
Actually, I am happy. The data look pretty good. Everything is pretty much in range, as shown in the descriptive statistics; there is not much missing data; the values and distributions on all the categorical variables are reasonable; and the dependent variable is approximately normally distributed, so we are good to go on parametric models.
Reality check passed. For the data, that is. As for those smoking, martini-drinking, minimum-wage-earning data entry people, the jury is still out.