Data Quality Macro Explained
Yesterday, I talked about using a macro for beginning checks on data quality and I promised to explain it today. So, here we go…
options mstored sasmstore = maclib ;
If you want to store your macro, you need to use two options in the OPTIONS statement.
MSTORED that you want the stored compiled macros in the library specified after SASMSTORE = .
Why you need two options, I don’t know.
LIBNAME MACLIB "C:\Users\MyDir\Documents\My SAS Files" ;
*** That’s the directory where my macro will be stored.
%macro dataqual(dsn,idvar,startvar,endvar,obsnum) / store ;
*** This creates a macro named dataqual that will take parameters for
dsn = data set name
idvar = the subject identifier in the data set . This is something like social security number, telephone number (if you’re a phone company), customer number or other variable that should NOT have duplicates.
startvar = this is the first variable in the data set that I want to get descriptive statistics on
endvar = this is the last variable I want descriptive statistics on
obsnum = the number of variables in the data set
The / store is an option that tells SAS to store this macro, and it will be stored in the directory specified after the sasmstored option and matched in the LIBNAME statement.
Title "Duplicate ID Numbers" ;
Proc freq data = lib.&dsn noprint ;
tables &idvar / out = &dsn._freq (where = ( count > 1 )) ;
format &idvar ;
***** This will create a frequency distribution but not print it (that ‘s the noprint option on the PROC FREQ statement ). This is important to remember because many of my data sets will have hundreds of thousands or millions of records, each with a unique identifier. It will output the duplicate values to a data set named &dsn._freq , with, of course &dsn replaced by whatever name I give it.
Since many of the public data sets I use have formats for every value, and I don’t want the formatted value used, the statement
FORMAT &idvar ;
will cause it to use the unformatted value for &idvar .
proc print data = &dsn._freq (obs = 10 ) ;
run ;
*** This prints the first ten duplicate values. You want to be careful to put in that (obs = 10 ) just in case something went wrong in the FREQ procedure and it ended outputting everybody. For example, if you accidentally put (count = 1 ) . Then you may get 2 million records in your &dsn._freq data set and it would not be very good to print all of those out.
(Skipping story of dumping reams of green and white lined computer paper. If you’re my age, you have one of those stories of your own.)
proc summary data = lib.&dsn mean min n std ;
output out = &dsn._stats ;
var &startvar -- &endvar ;
**** This is going to create a data set , &dsn._stats with the mean, minimum, n and standard deviation for each variable in your data set from &startvar to &endvar /
proc transpose data = &dsn._stats out = &dsn._stats_trans ;
id _STAT_ ;
This is going to transpose your dataset so that instead of five records with 200 or 500 or however many variables you have, instead you have 500 records with variables for _name_ , _label_ , minimum, mean, n and standard deviation.
data &dsn._chk ;
set &dsn._stats_trans ;
pctmiss = 1 – (n/&obsnum) ;
if min < 0 then neg_min = 1 ;
else neg_min = 0 ;
if std = 0 then constant = 1 ;
else constant = 0 ;
if (pctmiss > .05 or neg_min = 1 or constant = 1) then output ;
Title “Deviant variables to check ” ;
proc print data = &dsn._chk ;
run;
**** This reads in the transposed data set and then does some quality checks – if the standard deviation is 0, more than 5% of the data are missing or there is a negative minimum value, the variable is output. Then, I get a listing of variables that are in some way warranting a second look. The columns will show the descriptive statistics, plus new variables that show the percent missing, whether the variable is a constant or has a negative minimum.
Title “First 10 observations with ALL of the variables unformatted ” ;
proc print data = lib.&dsn (obs= 10) ;
format _all_ ;
run ;
**** You should always stare at your data. This prints out the first 10 values without formats, just so I can see what it looks like. I use a lot of public data sets that come with user-defined formats. The FORMAT _ALL_ statement removes all formats for this step and will print just the unformatted values.
Title “First 10 observations with ALL of the variables formatted ” ;
proc print data = lib.&dsn (obs= 10) ;
run ;
*** This prints the first 10 observations with formatted values, obviously. Using formatted values is the default so I didn’t need a FORMAT statement.
Title “Deviant variables to drop ” ;
proc print data = &dsn._chk noobs;
var _name_ ;
**** This prints the names of all of those variables that have problems. I can copy this from the output window, type the word
DROP
paste the list of variables, add a semi-colon and I have my drop statement.
%put You defined the following macro variables ;
%put _user_ ;
**** This just helps with trouble-shooting. It will put the user defined macro variables to the log so it looks something like:
dsn : studentgr8
idvar: IDSTUD
and so on.
run ;
%mend dataqual ;
*** and that is the end of my macro.
I hope you love it as much as I do. While it isn’t all fun sexy cool like doing a proportional hazards regression model or even a factor analysis, if your data are no good all of those fancy statistics are at best, wrong and a waste of time, and , at worst, wrong and a major blow to your career depending on whose life your wrong data screwed up.
One Comment