Mucking about with data & making life better

We interrupt the prior rambling discussion of high performance computing for a new rambling discussion.

A lot of things bother me – hate crimes, domestic violence, terrorism, drug-related crimes in Mexico, low college graduation rates of minority youth – well, it’s a very long list.

When I was younger, so much younger than today …..

I was fascinated by statistics because I believed that, as (I think it was) Galton said, “If anything exists, then it exists in some quantity and that quantity can be measured.”

My imagination was captured by the idea that EVERYTHING was measurable, from the weight of your cat to your love for your spouse, just some things we hadn’t quite figured out how to measure yet. I also wasted a great many hours reading science fiction books with no real socially redeeming value, and those that painted a picture of a time when we could accurately predict EVERYTHING also enthralled me. On the one hand, those were pretty scary, the idea that some authority would know everything you would do. On the other hand, knowing not to go down that alley or you’ll get knifed, not to have a baby with that person during that month or the child will be anencephalic, well, I can certainly see some value to that.

We now have unthinkably more powerful computers than when I was young and equally incredible masses of data, and yet, we haven’t really come all that far. I mean, it’s great that we now have 46 brands of ketchup, but I think we could do better.

So, I got to thinking: why not a kind of Wikipedia for data? The great thing about Wikipedia isn’t that it has facts that you can’t find anywhere else. Quite the contrary, I can’t think of a single topic from sea horse to stars (either the kind tracked by astronomers or the kind tracked by paparazzi) that you couldn’t find more information on somewhere else. I know a lot of academic researchers look down their noses at Wikipedia, but the fact is that it is accessed about a million times more than any academic journal. The three main advantages of Wikipedia seem to be:

It’s free, so anyone can use the information.
It organizes all the topics in one place.
It’s added to by lots of people all the time without a lot of bureaucracy in place. So you have information accumulating at a very rapid pace.

There are a lot of people out there with Ph.D.s. A lot of them aren’t full-time academics. While we have a lot of awesome open courseware and some really good journals that are free and open access, there isn’t that much “open research”. There are master’s students, doctoral students, and really, really smart people who for whatever reason never finished their bachelor’s, master’s or doctorate and are now teaching high school math or delivering pizzas. (I once knew a guy with a degree in economics from Johns Hopkins who was a pizza delivery person. He never volunteered why and I never asked.)

My point is that there are a lot of smart people out there who could contribute.

What if we had everything from NCES to Census to the Inter-university Consortium for Political and Social Research to data gathered for dissertations and from grant-funded projects just referenced in one spot? (You wouldn’t have to upload 90% of it again; you could just link to it with a description.)

And what if people put the results of their analyses up there, too?
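For the code monkeys in the audience, here is roughly what I have in mind, as a minimal sketch and nothing more. Every name in it (`Catalog`, `DatasetEntry`, `Analysis`) is something I made up for illustration, the link is a placeholder rather than a real dataset, and the sample entry and its "correction" are invented. The only point is that an entry is just a link plus a description, and that analyses and other people's corrections hang off of it.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """A result someone posted about a dataset (hypothetical structure)."""
    author: str
    summary: str
    corrections: list = field(default_factory=list)  # fixes noticed by other people


@dataclass
class DatasetEntry:
    """A pointer to data hosted elsewhere: a link and a description, no re-upload."""
    title: str
    source: str       # e.g. "NCES", "Census", "ICPSR", or a dissertation
    url: str          # placeholder links only in this sketch
    description: str
    analyses: list = field(default_factory=list)


class Catalog:
    """One spot that references many datasets and the analyses run on them."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)
        return entry

    def search(self, keyword):
        """Return entries whose title or description mentions the keyword."""
        kw = keyword.lower()
        return [e for e in self.entries
                if kw in e.title.lower() or kw in e.description.lower()]


catalog = Catalog()
entry = catalog.add(DatasetEntry(
    title="College graduation rates",
    source="NCES",
    url="https://example.org/placeholder-dataset",  # not a real dataset link
    description="Graduation rates by institution and student group",
))

# Someone posts a result, and someone else notices a problem with it.
result = Analysis(author="some_grad_student",
                  summary="First-pass crosstab of rates by group")
entry.analyses.append(result)
result.corrections.append("Denominator should exclude transfer students")

matches = catalog.search("graduation")
print([e.title for e in matches])
```

A real version would need accounts, edit history and some way to flag disputed results, but the core is not much more than a searchable list of links with notes attached.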

Kind of like ManyEyes with numbers added meets Wikipedia.

This would be as different from a refereed article as Wikipedia is. And, in the same way, some of the results would be wrong and other people would have to notice and correct them.

You could do this group effort virtually.

Or, you could actually collaborate in person on occasion, kind of like the hack-a-thons.

Why would anybody do that? After all, the prize in these marathon coding sessions is that you develop an app and maybe some venture capitalist will give you money and you can work night and day and then be rich and girls will set fire to their scarves in your dorm room.

Maybe some agency could award the winning team a grant to further develop their research. Or maybe we could just do it for the intellectual joy of it and the possibility of discovering knowledge in data that could make the world better. And that knowledge would be available to everybody, not just some central authority or the people who can pay $2,000 for a report.

If you know of anything like this already, please let me know. It’s possible that I just missed it because, you know, there are a lot of websites out there and I have not checked out all of them yet.

Yes, I’d rather do this than make a game where zombies kill people by running over them with stolen cars and then eat their brains.

I know, it’s self-defeating attitudes like mine that keep women from being major players in tech. I’m going to do it anyway.

One Comment

Chris Hemedinger says:

October 9, 2010 at 3:08 pm

I just read your post *after* writing my bit about harvesting data from kiva.org. Read about it here:

http://blogs.sas.com/sasdummy/index.php?/archives/208-World-Statistics,-FTW!.html

But here’s the thing: I’m not a statistician; I’m just a code monkey. So I’d love to see what a real analyst can do with data like this, perhaps even meshing it with other public data.

Chris
