Finding Groups in Data
Today, Dr. De Mars is — happy.
One of the fun things about my job is that I get to do lots of different things. That can be a bit troubling some days, because “statistical software consultant” covers a wide range: different types of models, coding, various operating systems, and all of the non-parametric, parametric, Bayesian and other statistics that I cannot remember at the moment.
Because the range of people I work with continually increases, I am now more often running into questions I cannot answer off the top of my head. I do know how Mahalanobis’ distance is used, even though I had not thought about it in years until someone asked me a question yesterday. I do know the calculation for pooled variance, which should be used when Levene’s test is not rejected, that is, when the group variances can be treated as equal. Still, once a day or so, someone asks me a question I have to look up. Sometimes these are on techniques I have not used before, and just as many times the question relates to something that I KNOW can be done, and I know this because I personally have used that statistic or written that code before. I just can’t remember how.
You know that saying,
“I have forgotten more about statistics than you’ll ever know.”
Well, that is my problem. I keep forgetting it. Fortunately for me, and this is why I am happy, I get to consult on a lot of different projects each week that remind me of things I used to know. For example, cluster analysis, as the Stata multivariate statistics guide so poetically says, is used for finding groups in data. You can use it to identify or validate specific diagnostic groups, or you can try to group just about anything. Most often, cluster analysis is used as an exploratory technique, which is my favorite type of statistics, where you are turning a bunch of numbers into knowledge.
The most common way to use cluster analysis is the k-means technique. You assume there are k groups (with k being a number you specify) and the program iterates to a solution. The program starts with k “seeds”, which serve as the initial means for each group. Every observation is assigned to the group whose mean is closest to it. New group means are then calculated based on the observations in each group. If an observation is now closer to the mean of a different group, it is moved into that group. Then, group means are calculated again. This continues until a step is reached where none of the observations change groups. And that is one way to do cluster analysis.
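The k-means steps described above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions of mine, not what Stata or any other package does internally: the function names (`k_means`, `squared_distance`, `mean_point`), the use of squared Euclidean distance, the `max_iter` safety cap, and the tiny made-up data set are all hypothetical choices for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, seeds, max_iter=100):
    """Iterate from the given seeds until no observation changes groups."""
    means = list(seeds)      # the k seeds are the starting group means
    k = len(means)
    assignments = None
    for _ in range(max_iter):  # safety cap, an assumption of this sketch
        # Assign every observation to the group whose mean is closest.
        new_assignments = [
            min(range(k), key=lambda g: squared_distance(point, means[g]))
            for point in data
        ]
        # Stop when a full pass moves nobody to a different group.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recalculate each group mean from its current members.
        for g in range(k):
            members = [p for p, a in zip(data, assignments) if a == g]
            if members:  # keep the old mean if a group ends up empty
                means[g] = mean_point(members)
    return assignments, means


# Made-up two-dimensional observations forming two obvious groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
seeds = [data[0], data[3]]   # k = 2 seeds, picked by hand for the example
groups, centers = k_means(data, seeds)
print(groups)   # [0, 0, 0, 1, 1, 1]
```

Here the seeds are chosen by hand so the result is reproducible. Real programs usually offer several ways of picking seeds, and different seeds can lead to different final groups, which is one reason to try more than one starting point.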