
Text Mining with Statistica (or anything else) – look again!

By AnnMaria De Mars, May 16, 2012

There are some things to like about Statistica. The scatter plot matrix, for one. I’d done a sentiment analysis of a data set on blog posts (not mine). For each post, I had three variables:

number of negative sentiments expressed in the post,
number of positive sentiments expressed in the post,
total number of comments that poster had made, ranging from 1 to over 1,000.

I expected that the people who comment a lot would be the ones with the most negative comments, and that there would be less of a correlation between positive comments and comment frequency.

I like the graphic output you get, which shows a frequency distribution for each variable and a plot for each pair. All at once you can get a sense of the strength of the correlation, whether it might be affected by restriction of range – as shown by a skewed distribution – or by outliers.

There seems to be an actual correlation between the number of positive comments and the number of negative comments. Also, positive comments outnumber negative comments almost three to one.
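To make those two summary numbers concrete, here is a minimal sketch of how a correlation and a positive-to-negative ratio could be computed by hand. The counts are made up for six hypothetical posts and are NOT the blog data; `pearson_r` is just an illustrative helper name, not anything Statistica provides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


# Invented counts for six hypothetical posts (illustration only)
positive = [3, 5, 10, 11, 16, 30]
negative = [1, 2, 3, 4, 5, 10]

r = pearson_r(positive, negative)
ratio = sum(positive) / sum(negative)  # positive sentiments per negative one

print(f"r = {r:.3f}, positive:negative = {ratio:.1f}:1")
```

Note that the last invented post is far larger than the rest, which is exactly the kind of outlier the scatter plot matrix lets you spot before trusting the coefficient.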

One might be tempted at this point to run out and say,

“Oh, look! Sentiment is very positive!”

Also, it appears that people who have more negative comments also have more positive comments, so this means that …

Just stop right there.

Before saying this means anything, you should go back and take a look at the comments being categorized as positive or negative. The first thing you will note is that computers are very poor at detecting sarcasm, subject changes and idioms. The data came from comments on blogs related to Apple computer products. Here are just a few of the cases where I disagreed with the computer.

“Yeah, tell us you’ll improve conditions at your manufacturing plant in China. That would be great, wouldn’t it?” (Includes “improve” and “great”, so counted as two positive sentiments.)
“I’d rather not say nice try, but … ” (Counts as one positive comment, with the word “nice”.)
“Buy Windows! It’s superior” (Counts as one positive comment, with the word “superior”.)
“Too bad I can’t buy it right now.” (Counts as negative, with the word “bad”.)
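The miscounts above are what you would expect from simple word matching. As a rough sketch only, assuming the tool does something like a dictionary lookup (I have not seen Statistica's internals, and the word lists below are tiny hypothetical stand-ins), a naive counter reproduces all four results:

```python
import re

# Tiny hypothetical word lists, NOT Statistica's actual dictionary
POSITIVE = {"improve", "great", "nice", "superior"}
NEGATIVE = {"bad"}


def count_sentiment(text):
    """Return (positive_hits, negative_hits) by plain word lookup."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg


comments = [
    "Yeah, tell us you'll improve conditions at your manufacturing plant "
    "in China. That would be great, wouldn't it?",
    "I'd rather not say nice try, but ...",
    "Buy Windows! It's superior",
    "Too bad I can't buy it right now.",
]

for comment in comments:
    print(count_sentiment(comment), comment)
```

A lookup like this has no notion of sarcasm, of who or what is being praised, or of idioms such as "too bad", which is why the counts need a human look before anyone interprets them.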

I’m not saying that Statistica is bad – I don’t think it is – or that text mining is useless – I don’t think that, either.

What I DO think is that text mining has to be an iterative process. First, you get your results and then you examine them, make some changes – in this case I would start with the synonyms data – and you re-run your analysis.
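One way to picture that loop is sketched below. Everything in it is an assumption for illustration: the word lists, the `neutral_phrases` parameter, and the idea of blanking out reviewed idioms before counting are hypothetical stand-ins for whatever changes you would make to the synonyms data in your own tool.

```python
import re


def count_sentiment(text, positive, negative, neutral_phrases=()):
    """Count positive/negative word hits after blanking out reviewed phrases."""
    lowered = text.lower()
    for phrase in neutral_phrases:
        # Phrases a human reviewer decided should not count either way
        lowered = lowered.replace(phrase, " ")
    words = re.findall(r"[a-z']+", lowered)
    return (sum(w in positive for w in words),
            sum(w in negative for w in words))


# Hypothetical word lists for illustration only
positive = {"great", "nice", "superior"}
negative = {"bad"}

comments = [
    "I'd rather not say nice try, but ...",
    "Too bad I can't buy it right now.",
]

# Pass 1: run with the word lists as they are
first_pass = [count_sentiment(c, positive, negative) for c in comments]

# Review the categorized comments by eye, then record the idioms that fooled it
reviewed_phrases = ("nice try", "too bad")

# Pass 2: re-run with the revised rules
second_pass = [
    count_sentiment(c, positive, negative, reviewed_phrases) for c in comments
]

print(first_pass)
print(second_pass)
```

The second pass only fixes the cases you caught on review, so you would repeat the examine-and-revise step until the remaining disagreements are rare enough to live with.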

Off to bed. I have to be up in six hours and head to the Black Belt Magazine studios for a photo shoot on our new book that is coming out this fall, Winning on the Ground: Championship matwork for judo, grappling and mixed martial arts.

It’s a bit of a leap from text mining, but variety IS the spice of life.

