Can you say Caveat Emptor if the data are free?
Let the buyer beware – that phrase certainly applies to open data, as does the less historical but equally true statement that students always want to work with real data until they get some.
Lately, I have had students working with two different data sets that have led me to the inevitable conclusion that self-report data is another way of proving that people are big fat liars. One data set is on campus crime and, according to the data provided by personnel at college campuses, the modal number of crimes committed per year – rape, burglary, vehicle theft, intimidation, assault, vandalism – is zero. Having taught at a very wide variety of campuses, from tribal colleges to extremely expensive private universities, and never seen one that was 100% crime free, I suspected – and this is a technical term here – that these data were complete bullshit. I looked up a couple of campuses that were in high crime areas and where I knew faculty or staff members who verified crime had occurred and been reported to the campus and city police. These institutions conveniently had not reported their data, which is morally preferable to lying through their teeth, with the added benefit of requiring less effort.
Jumping from there to a second study on school bullying, we found, as reported by school administrators in a national sample of thousands of public elementary, middle and secondary schools, that bullying and sexual harassment never, or almost never, occur and there are no schools in the country where these occur on a daily basis. Are you fucking kidding me? Have you never walked down the hall at a middle school or high school? Have you never attended a school? What the administrators thought to gain or avoid by lying on these surveys, I don’t know, but it certainly reduces the value of the data for, well, for anything.
So …. the students learn a valuable life lesson about not trusting their data too much. In fact, this may be the most valuable lesson they learn, Stamp’s Law
The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.
From an analysis standpoint, this is my soapbox that I am ranting on every day. Before you do anything, do a reality check. If you use SAS Enterprise Guide, the Characterize Data task is good for this, but any statistical software, or any programming language, for that matter, will have options for descriptive statistics – means, frequencies, standard deviations.
This isn’t to say that all open data sucks. On the contrary, I’m working with two other data sets at the moment that are brilliant. One used abstracts of medical records data over nine years plus state vital records to record data on medical care, diagnoses and death for patients over 65. Since Medicare doesn’t pay unless you have data on the care provided and diagnosis, and the state is kind of insistent on recording deaths, these data are beautifully complete and probably pretty accurate.
I’ve also been working with the TIMSS data. You can argue all you want about teaching to the test, but it’s not debatable that the TIMSS data is pretty complete and passes the reality test with flying colors. Distributions by race, gender, income and every other variable you can imagine is pretty much what you would expect based on U.S. Census data.
So, I am not saying open data = bad. What I am saying is let the buyer beware, and run some statistics to verify the data quality before you start believing your numbers too much.