Text Mining with Statistica (or anything else) – look again!
There are some things to like about Statistica. The scatter plot matrix, for one. I’d done a sentiment analysis of a data set on blog posts (not mine). For each post, I had three variables:
- number of negative sentiments expressed in the post,
- number of positive sentiments expressed in the post,
- total number of comments that poster had made, ranging from 1 to over 1,000.
I expected that people who comment a lot would be the ones with the most negative comments, and that the correlation between positive comments and comment frequency would be weaker.
I like the graphic output you get, which shows a frequency distribution for each variable and a plot for each pair. All at once you can get a sense of the strength of each correlation and whether it might be affected by restriction of range – as shown by a skewed distribution – or by outliers.
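To see why it pays to look at the distributions before trusting a correlation, here is a minimal sketch in Python (not Statistica). The numbers are made up purely for illustration and are not from the blog data; `pearson` is a small helper defined just for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: total comments per poster and negative sentiments.
total_comments = [1, 2, 3, 4, 5, 6]
negative = [2, 0, 3, 1, 2, 1]

# Correlation among the ordinary posters only.
r_without_outlier = pearson(total_comments, negative)

# Add one very heavy commenter (1,000 comments, 40 negative sentiments).
r_with_outlier = pearson(total_comments + [1000], negative + [40])

print(f"without outlier: {r_without_outlier:.2f}")
print(f"with outlier:    {r_with_outlier:.2f}")
```

In this toy data, the six ordinary posters show almost no relationship, yet adding a single heavy commenter pushes the coefficient close to 1. That is the kind of thing the scatter plot matrix lets you spot at a glance.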
There seems to be an actual correlation between the number of positive comments and the number of negative comments. Also, positive comments outnumber negative comments almost three to one.
One might be tempted at this point to run out and say,
“Oh, look! Sentiment is very positive!”
Also, it appears that people who have more negative comments also have more positive comments. This means that ….
Just stop right there.
Before saying this means anything, you should go back and take a look at the comments being categorized as positive or negative. The first thing you will note is that computers are very poor at detecting sarcasm, subject changes and idioms. The data came from comments on blogs related to Apple computer products. Here are just a few of the cases where I disagreed with the computer.
- “Yeah, tell us you’ll improve conditions at your manufacturing plant in China. That would be great, wouldn’t it?” (Includes “improve” and “great”, so counted as two positive sentiments.)
- “I’d rather not say nice try, but … ” (Counts as one positive comment because of the word “nice”.)
- “Buy Windows! It’s superior” (Counts as one positive comment because of the word “superior”.)
- “Too bad I can’t buy it right now.” (Counts as negative because of the word “bad”.)
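The miscounts above are what you would expect from any approach that matches words against a list. The following is a minimal sketch of such a word-list counter in Python. It is not Statistica's algorithm, and the tiny `POSITIVE` and `NEGATIVE` lexicons are assumptions made up for this illustration.

```python
import re

# Hypothetical mini-lexicons, chosen only to cover the examples above.
POSITIVE = {"improve", "great", "nice", "superior"}
NEGATIVE = {"bad"}

def count_sentiments(text):
    """Return (positive_count, negative_count) by naive word matching."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos, neg

comments = [
    "Yeah, tell us you'll improve conditions at your manufacturing plant "
    "in China. That would be great, wouldn't it?",
    "I'd rather not say nice try, but ...",
    "Buy Windows! It's superior",
    "Too bad I can't buy it right now.",
]

results = [count_sentiments(c) for c in comments]
for comment, (pos, neg) in zip(comments, results):
    print(f"+{pos} / -{neg}  {comment[:40]}")
```

The counter has no idea about sarcasm, negation or what the praise is aimed at, so the sarcastic first comment scores two positives and the pro-Windows comment scores as positive on an Apple blog.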
I’m not saying that Statistica is bad – I don’t think it is – or that text mining is useless – I don’t think that, either.
What I DO think is that text mining has to be an iterative process. First you get your results, then you examine them, make some changes – in this case I would start with the synonyms data – and re-run your analysis.
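One pass of that loop might look like the sketch below, again in Python rather than Statistica. The word lists and the `neutral_phrases` parameter are hypothetical, and treating the idiom “too bad” as not negative is my assumption about how one might adjust the dictionary, not a description of how Statistica's synonym data works.

```python
import re

def count_sentiments(text, positive, negative, neutral_phrases=()):
    """Count lexicon hits after blanking out phrases marked as neutral."""
    lowered = text.lower()
    for phrase in neutral_phrases:
        lowered = lowered.replace(phrase, " ")
    words = re.findall(r"[a-z']+", lowered)
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return pos, neg

# Hypothetical lexicons for illustration only.
positive = {"great", "nice", "superior"}
negative = {"bad"}

comment = "Too bad I can't buy it right now."

# First pass: run with the lexicon as-is and inspect the result.
first_pass = count_sentiments(comment, positive, negative)

# After reading the comment, mark the idiom as neutral and re-run.
second_pass = count_sentiments(
    comment, positive, negative, neutral_phrases=["too bad"]
)

print(first_pass, second_pass)
```

Each fix like this is small, which is why the examine-and-re-run cycle usually needs to be repeated several times before the counts are worth interpreting.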
Off to bed. I have to be up in six hours and head to the Black Belt Magazine studios for a photo shoot on our new book that is coming out this fall, Winning on the Ground: Championship matwork for judo, grappling and mixed martial arts.
It’s a bit of a leap from text mining, but variety IS the spice of life.