SUPER BASIC INTRODUCTION TO DATA ANALYSIS
I was going to write more about reading JSON data but that will have to wait because I’m teaching a biostatistics class and I think this will be helpful to them.
What’s a codebook?
If you are using even a moderately complex data set, you will want a codebook. At a minimum, it will tell you the name of each variable, the type (character, numeric or date), a label if it has one, and its position in the data set. It will also tell you the number of records and number of variables in a data set. In SAS, you can get all of this by running a PROC CONTENTS. (Also from a PROC DATASETS but we don’t cover that procedure in this class.)
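Here is a minimal sketch of that step, using the sashelp.heart data set that ships with SAS. The VARNUM option is my addition and is optional; it lists the variables in the order they sit in the data set instead of alphabetically.

```sas
* List the variable names, types, lengths, labels and positions ;
proc contents data=sashelp.heart varnum ;
run ;
```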
So, for the sashelp.heart data set, for example, you would see:
The variable AgeAtDeath is the 12th variable in the data set. It is numeric, with a length of 8, and its label is “Age At Death”. Because it is a numeric variable, character functions, like finding a substring, won’t work the way you expect. Depending on where you use them, SAS will either give you an error or convert the number to text and write a note in the log. (A substring is a subset of a string, so ‘ABC’ is a substring of ‘ABCDE’.)
Similarly, BP_Status is the 15th variable in the data set. It is a character variable, with a length of 7 and a label of “Blood Pressure Status”. Because it’s a character variable, if you try to do any procedures or functions that expect numeric variables, like finding the mean, you will get an error. The label will be used in output, like in the table below.
This is useful because you may have no idea what BP_Status is supposed to mean. HOWEVER, if you use “Blood Pressure Status” in your statements like the example below, you will get an error.
**** WRONG!!! ;
Proc means data=sashelp.heart ;
Var blood pressure status ;
run ;
Seems unfair, but that’s the way it is.
The above statement will assume you want the means for three separate variables named “blood” “pressure” and “status”.
There are no variables in the data set named “blood” or “pressure” so you will get an error. There is a variable named “status”, but it’s something completely different, a variable telling if the subject is alive or dead.
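Here is a sketch of a version that does run, assuming what you wanted was the mean of some numeric variables. AgeAtDeath is the numeric variable discussed above; Systolic and Diastolic are two other numeric variables in sashelp.heart that I picked just for illustration. Note that BP_Status itself would not work here, because it is a character variable.

```sas
* Use the variable NAMES from PROC CONTENTS, not the labels ;
proc means data=sashelp.heart ;
  var AgeAtDeath Systolic Diastolic ;
run ;
```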
Even if you don’t have a real codebook available, you should at a minimum start any analysis by doing a PROC CONTENTS so you have the correct variable names and types.
What about these errors I was talking about, though? Where will you see them?
LOOK AT YOUR SAS LOG!!
If you are using SAS Studio, it’s the second tab in the middle window, to the right of the tab that says CODE.
Click on that tab and if you have any SYNTAX errors, they will conveniently show up in red.
Also, if you are taking a course and want help from your professor or a classmate, the easiest way for them to help you is for you to copy and paste your SAS log into an email, or even better, download it and send it as an attachment.
Just because you have no errors in the SAS log doesn’t mean everything is all good, but it’s always the first place you should look.
To get a table of blood pressure status, you may have typed something like
Proc freq data=sashelp.heart ;
Tables status ;
run ;
That will run without errors but it will give you a table that gives status as alive or dead, not blood pressure as high, normal or optimal.
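What you probably meant is sketched below, assuming the goal is the table of blood pressure categories: ask for the variable by its actual name, BP_Status, not by a word from its label.

```sas
* BP_Status is the variable name; "Blood Pressure Status" is only its label ;
proc freq data=sashelp.heart ;
  tables BP_Status ;
run ;
```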
PROC CONTENTS is a sort of “codebook light”. A real codebook should also include the mean, minimum, maximum and more for each variable. We’ll talk about that in the next post. Or, who knows, maybe I’ll finally finish talking about reading in JSON data.
Is it wishful thinking that all organizations will have a data catalog or lineage procedures in place? 🙂