statistics

What Good is Cook’s D?

By AnnMaria De Mars, October 19, 2014

I’ve written here before about visual literacy, and Cook’s D is just my latest example.

Most people intuitively understand that any sample can have outliers: say, an 80-year-old man who is the father of a six-year-old child, or the new college graduate who is making $150,000 a year. We understand that those people may throw off our predictions, and perhaps we want to exclude those outliers from our models.

What if you have multiple variables, though? It’s possible that each individual value may not be very extreme but the combination is. Take this data set below that I totally made up, with mom’s age, dad’s age and child’s age.

Mom Dad Child

30 32 6
20 27 5
31 33 8
29 28 6
40 42 20
44 44 21
37 39 14
25 29 7
30 32 6
20 27 5
31 33 8
29 28 6
39 42 19
43 44 20
37 39 13
25 28 6
40 29 15

Look at our last record. The mother has an age of 40, the father an age of 29 and the child an age of 15. None of these is an extreme score on its own. They aren’t even the minimum or maximum for any of the variables. There are mothers older (and younger) than 40 and fathers younger (and older) than 29, and 15 isn’t that extreme an age in our sample of children. The COMBINATION, however, of a 40-year-old mother, 29-year-old father and 15-year-old child is an extreme case.
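If you want to check that claim with numbers, here is a minimal sketch in plain Python. It computes a z-score for the last record on each variable, and then on the gap between dad's age and mom's age. The gap column is my own illustration, not something in the original table; it is just one simple way to put a number on "the combination is odd".

```python
from statistics import mean, stdev

# (mom, dad, child) ages copied from the table above
rows = [
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (40, 42, 20), (44, 44, 21), (37, 39, 14), (25, 29, 7),
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (39, 42, 19), (43, 44, 20), (37, 39, 13), (25, 28, 6),
    (40, 29, 15),
]

def z_last(values):
    """z-score of the last value relative to the whole column (sample SD)."""
    return (values[-1] - mean(values)) / stdev(values)

mom, dad, child = (list(col) for col in zip(*rows))

# Assumption for illustration: dad's age minus mom's age as a derived column
gap = [d - m for m, d in zip(mom, dad)]

for name, col in [("mom", mom), ("dad", dad), ("child", child), ("dad - mom", gap)]:
    print(f"{name:10s} z = {z_last(col):+.2f}")
```

On each single variable the last record sits about one standard deviation from the mean or closer, while on the age gap it is more than three standard deviations out.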

Enter Cook’s distance, a.k.a. Cook’s D, which measures how much the fitted values of a regression change when a single observation is deleted. The larger the distance, the more influential that point is on the results. Take a look at my graph below.

It is pretty clear that the last observation is very influential. Now, you might have guessed that if you had thought to look at the data. However, if you had 11 variables and 100 observations it wouldn’t be so easy to see by looking at the data and you might be really happy you had Cook around to help you out.
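For anyone who wants to see the arithmetic behind the plot, here is a rough sketch in plain Python. It assumes the model is child's age regressed on mom's age and dad's age by ordinary least squares (the post does not spell the model out), and it uses the standard formula D = e² / (p × MSE) × h / (1 − h)², where e is the residual, h is the leverage and p is the number of parameters. The function name `cooks_d` and the text bar chart are mine. It also reruns the calculation with the last row dropped.

```python
# (mom, dad, child) ages copied from the table above
rows = [
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (40, 42, 20), (44, 44, 21), (37, 39, 14), (25, 29, 7),
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (39, 42, 19), (43, 44, 20), (37, 39, 13), (25, 28, 6),
    (40, 29, 15),
]

def cooks_d(rows):
    """Cook's D per row for the assumed OLS model child ~ mom + dad."""
    n, p = len(rows), 3  # p counts the intercept plus two slopes
    mbar, fbar, cbar = (sum(col) / n for col in zip(*rows))
    # Center every variable so the intercept drops out of the algebra
    m = [r[0] - mbar for r in rows]
    f = [r[1] - fbar for r in rows]
    c = [r[2] - cbar for r in rows]
    smm = sum(x * x for x in m)
    sff = sum(x * x for x in f)
    smf = sum(x * y for x, y in zip(m, f))
    smc = sum(x * y for x, y in zip(m, c))
    sfc = sum(x * y for x, y in zip(f, c))
    det = smm * sff - smf ** 2
    # Slopes from the 2x2 normal equations
    b_m = (sff * smc - smf * sfc) / det
    b_f = (smm * sfc - smf * smc) / det
    resid = [ci - b_m * mi - b_f * fi for mi, fi, ci in zip(m, f, c)]
    mse = sum(e * e for e in resid) / (n - p)
    # Leverage: 1/n plus the quadratic form in the centered predictors
    lev = [1 / n + (sff * mi * mi - 2 * smf * mi * fi + smm * fi * fi) / det
           for mi, fi in zip(m, f)]
    return [e * e / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

d_full = cooks_d(rows)
d_without = cooks_d(rows[:-1])  # same fit with the last record dropped

for label, ds in [("all 17 rows", d_full), ("last row dropped", d_without)]:
    print(label)
    for i, d in enumerate(ds, start=1):
        bar = "#" * min(60, round(d * 10))
        print(f"  obs {i:2d}  D = {d:6.3f}  {bar}")
```

Commonly quoted rules of thumb flag points with D above 1, or above 4/n, for a closer look. Those are conventions rather than tests, so treat them as a prompt to inspect the record, not a verdict.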

Let’s look at the data re-analyzed without that last observation. Here is what our plot of Cook’s D looks like now.

This gives you a very different picture. While a couple of points are higher than the others, it is certainly not the extreme case we saw before.

In fact, dropping that one point raised our explained variance (R-squared) from 89% to 93%.
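Here is a quick way to check that comparison yourself, again as a sketch under an assumption: that the model is child's age predicted from mom's age and dad's age with ordinary least squares. The helper name `r_squared` is made up for this example. Under that assumption the two values it prints come out close to the 89% and 93% quoted above.

```python
# (mom, dad, child) ages copied from the table above
rows = [
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (40, 42, 20), (44, 44, 21), (37, 39, 14), (25, 29, 7),
    (30, 32, 6), (20, 27, 5), (31, 33, 8), (29, 28, 6),
    (39, 42, 19), (43, 44, 20), (37, 39, 13), (25, 28, 6),
    (40, 29, 15),
]

def r_squared(rows):
    """R-squared for the assumed OLS model child ~ mom + dad."""
    n = len(rows)
    mbar, fbar, cbar = (sum(col) / n for col in zip(*rows))
    # Centered variables
    m = [r[0] - mbar for r in rows]
    f = [r[1] - fbar for r in rows]
    c = [r[2] - cbar for r in rows]
    smm = sum(x * x for x in m)
    sff = sum(x * x for x in f)
    smf = sum(x * y for x, y in zip(m, f))
    smc = sum(x * y for x, y in zip(m, c))
    sfc = sum(x * y for x, y in zip(f, c))
    det = smm * sff - smf ** 2
    # Slopes from the 2x2 normal equations
    b_m = (sff * smc - smf * sfc) / det
    b_f = (smm * sfc - smf * smc) / det
    ss_total = sum(x * x for x in c)
    ss_model = b_m * smc + b_f * sfc  # variation explained by the fit
    return ss_model / ss_total

r2_all = r_squared(rows)
r2_dropped = r_squared(rows[:-1])  # refit without the last record

print(f"R-squared, all 17 rows:      {r2_all:.3f}")
print(f"R-squared, last row dropped: {r2_dropped:.3f}")
```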

So … knowing how to use Cook’s D for regression diagnostics is our latest lesson in visual literacy.

You’re welcome.
