Interestingness from WUSS: Part 2 Condensing Big Data
Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me —
Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?
It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.
If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.
Creating a covariance data set is a piece of cake. Just do this:
proc corr data=sashelp.heart cov outp=mydata.test2 ;
var ageatdeath ageatstart ageCHDdiag ;
The COV option requests the covariances and the OUTP option has those written to a SAS data set.
If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.
So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.
What about means, frequencies and chi-square and all that, though?
Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)
Proc freq data= sashelp.heart ;
Tables AgeAtDeath
*BP_Status
*Chol_Status
*DeathCause
*Sex
*Smoking /noprint out=mydata.test1;
This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like
proc means data = test1 ;
weight count ;
var ageatdeath ;
Really, you can create “cubes” and analyze your big data on SAS Studio that way.
Yeah, obvious, I know, but I hadn’t been doing it with my students.