Microdata that don’t add up OR “The government is smarter than you think”
You know those people who have perfect theories about raising children, and then they give birth to an actual child who (surprise, surprise) progresses from throwing up baby food on the cat to staying out after curfew? Well, the open data initiative seems to be like that. The people who are full of hype about the 300,000-plus datasets released by the federal government alone, not to mention data from other countries – apps galore are going to come out, making brilliant people millionaires while solving all the nation’s problems – don’t seem to actually have much experience with the data.
Now, I love the idea of open data. Not nearly as much as I love my children, but still, I think it is amazingly cool. The applications produced by the federal government are pretty cool, too. ArcExplorer2 is a handy-dandy FREE (as in free beer, not free puppy) application I used for analyzing FCC data. Not to be topped, the Census Bureau offers DataFerrett. If you don’t want to do your own analyses, you can download this nifty little critter and be producing analyses in next to no time. So, I did.
You can also, if you want to run analyses on the actual data, download the PUMS, which is the Public Use Microdata Sample. I downloaded a SAS datafile that was a 1% sample for the state of California, around 320,000 people. That part was a piece of cake.
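For anyone playing along at home, pointing SAS at the unzipped download takes one LIBNAME statement; the path below is a placeholder for wherever yours landed.

```sas
/* Point SAS at the folder where the PUMS download was unzipped.
   The path here is a placeholder, not the real one. */
libname pums 'C:\pums\california';

/* See what datasets actually arrived */
proc contents data=pums._all_ nods;
run;
```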
Then, I read the documentation to figure out what those 80 weight variables were for, whether AGEP was something different from age (it isn’t) and what PINCP was (person’s income, from the person record). I wrote some formats because I wanted the results to print, for example, “State Government Employee” rather than “4” as the label for each row.
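In case you have never written one, a format is just a lookup from code to label, and mine looked something like this. The value labels are paraphrased from the data dictionary, so check the codes for your year (and whether the variable comes in as numeric or character) before trusting mine:

```sas
proc format;
    /* Class of worker codes, paraphrased from the PUMS data dictionary */
    value cowf
        1 = 'Private for-profit employee'
        2 = 'Private non-profit employee'
        3 = 'Local government employee'
        4 = 'State government employee'
        5 = 'Federal government employee'
        6 = 'Self-employed, not incorporated'
        7 = 'Self-employed, incorporated'
        8 = 'Working without pay in family business'
        9 = 'Unemployed, last worked 5 years ago or never'
        . = 'Not reported';
run;
```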
That being done, I ran some statistics of my own and compared them against the DataFerrett results, being careful that both selected only California as the state and both used the proper weight variable. The two sets of results came out identical. What could be better? Except… (you knew it couldn’t be that easy)…
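The whole comparison boils down to a few lines of SAS. PWGTP is the person-level weight (the other 80 are replicate weights, used for standard errors); the dataset name is whatever you called yours:

```sas
/* Weighted frequency of class of worker, to compare against DataFerrett */
proc freq data=pums.ca2009;
    tables cow;
    weight pwgtp;   /* the person weight; pwgtp1-pwgtp80 are for SEs */
    format cow cowf.;
run;
```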
For example, the data show that only 293,096 people in California are unemployed, out of 36 million. Hallelujah, recession over! Both DataFerrett and my little SAS program gave the same result. This is based on the “Class of Worker” variable, which asked people where they worked – business, non-profit, state government, self-employed and so on. Less than 2% said they were unemployed, BUT this variable was blank for nearly a quarter of those surveyed. I’m going to guess some were unemployed, some were retired and some just felt it was none of your damn business.
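If you want that blank quarter to show up in the table, you have to ask for it. By default, PROC FREQ calculates percentages over the non-missing records only, which is exactly how “unemployed” comes out at less than 2%:

```sas
/* The MISSING option keeps the blank records in the table and in the
   percentage base, instead of quietly dropping them */
proc freq data=pums.ca2009;
    tables cow / missing;
    weight pwgtp;
    format cow cowf.;
run;
```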
I called the Census help desk for PUMS and they were quite helpful. In fact, I don’t know why people constantly bash government employees. They generally seem to do a good job. When I asked why there was such a glaring discrepancy with the economic data one normally reads in the paper, the gentleman responded:
“We record what the respondents tell us, which is a very different method from how many of those indicators are gathered, say, using unemployment claims. People don’t like to say they’re unemployed.”
So, presumably, they leave the question blank or provide information on the last job in which they were employed. Fair enough. My point, though, is that if you were going to just do a frequency distribution and look at the percentage who said they were unemployed to get a percentage of people in California who are unemployed (which sounds like a pretty reasonable thing to do), you would be FAR off.
The government, it turns out, is smarter than you think. Since people don’t volunteer that they are unemployed when asked what sector they work in, there is also a question on whether they worked in the last week, counting paid vacation. If you look at this question, you’ll find that 23% of the people said they did not work in the previous week. If you drop people over 62, some of whom are presumably retired, that falls to 22%. So, is the unemployment rate the less than 2% who gave the sector in which they worked as “Unemployed”, or is it the 22% who didn’t work for pay last week?
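Those two numbers came from something like the following. WRK is the name I found in the data dictionary for the worked-last-week question, and the cutoff of 62 is just my guess at “presumably retired”, so take both with the appropriate grain of salt:

```sas
/* Worked last week, for everyone... */
proc freq data=pums.ca2009;
    tables wrk / missing;   /* per the dictionary: 1 = worked, 2 = did not */
    weight pwgtp;
run;

/* ...and again without the over-62 crowd */
proc freq data=pums.ca2009;
    where agep <= 62;
    tables wrk / missing;
    weight pwgtp;
run;
```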
By this point, I have been looking at the PUMS data for two days. I have downloaded the data dictionary, the actual survey questions, the coding for the questions, the data itself and a couple of tutorials. Even with all of that, I haven’t found where they actually explain how they coded ESR, the “Employment Status Recode” variable.
What I do find, running my SAS PROC FREQ, is that 99.9% of those shown as employed in the civilian or military labor force reported working last week, and 99.9% of those coded as unemployed or not in the labor force did NOT report working last week. Now, that’s what I’m talking about – the people who are unemployed should not be in paid employment (that being the definition of unemployment and all).
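That PROC FREQ was nothing fancier than a crosstab of the recode against the worked-last-week answer; the row percentages are where the 99.9% agreement shows up:

```sas
/* Employment status recode by worked-last-week. NOCOL and NOPERCENT
   leave just the counts and row percentages. */
proc freq data=pums.ca2009;
    tables esr * wrk / missing nocol nopercent;
    weight pwgtp;
run;
```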
Based on this final variable, 11% of the people surveyed in 2009 were unemployed, with another 2.5% who reported that they had a job but did not work in the past week. I presume these are people in seasonal work, like farming, or in jobs like firefighting, where you don’t necessarily work every week.
Bottom line: There are a lot of great resources available for analysis of public data, including datasets, codebooks and applications. However, this is not Google. It’s not even Wolfram Alpha (which I have found very disappointing, but that’s another post). Finding answers for yourself using even very well documented datasets is a lot of work and, unlike raising children, I can’t imagine the average person is going to have the time or inclination to do it. I mean, at least with children, you get to have sex a few times at the beginning, and I don’t see the Census Bureau offering anything nearly equivalent as an enticement.
Does this mean it’s a bad idea? Not at all, and I do think, with more encouragement, a great deal can be learned. I chose this dataset because I am giving a presentation for middle school students in a very low-income neighborhood and I wanted to show them how they can use data to analyze their lives – the poverty rates by race, education and age, the distribution of income, the disparities in unemployment and more. I think it will be fun and interesting because I DON’T work for the government, so I don’t have to be politically correct. We can sort the data by income and go backwards down the list of the highest earners until we find the highest-paid Latino in the sample. We can look at what percentage of people earning $X are African-American females, see how skewed the distribution of income is, and see the huge difference between mean earnings and median earnings.
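Each of those demonstrations is one procedure. Something like this, with PINCP as the income variable mentioned earlier and the dataset name still hypothetical:

```sas
/* Earners sorted from the top down, for walking backwards through
   the highest-paid people in the sample */
proc sort data=pums.ca2009 out=byincome;
    by descending pincp;
run;

/* Mean vs. median earnings: income skew shows up as a big gap */
proc means data=pums.ca2009 mean median maxdec=0;
    var pincp;
    weight pwgtp;
run;
```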
My points, and I have two, are:
1. There really IS potential for ‘crowd-sourcing’ the analysis of open data, if people who need to use data anyway, e.g., for teaching, class presentations or research papers, choose to use THESE data.
2. The useful work done by the people in #1 could be greatly extended by curation – having a repository where the results are posted, categorized, critiqued and edited.
Unfortunately, I don’t see either of these things happening much at the moment.