Using SAS functions to force structure on unstructured data
Unstructured data is to the usual database as Toontown is to Irvine Ranch (or Diamond Bar or Porter Ranch or any other white bread community that has two names, six types of floor plans and where half the children are named Buffy, Jessica, Jason or Justin – you know who you are).
If one were to have structured data to answer the question,
“What are the two websites you visit the most?”
One could have a list of the most common websites and have people pick from those in a drop down menu. The problem is, of course, that you might need a very, very long menu and it would still leave some sites off.
Instead, we just have a question and people write whatever the hell they want, which can include the word “none”, one website, two websites or five or more websites because they either cannot read or are terribly indecisive.
Here is a start on putting some of these data into a reasonable structure for analysis. I happened to use SAS for this but you could probably do it just as easily using a bunch of other languages.
/*THIS FIRST PART READS IN THE DATASET, CREATES TWO VARIABLES LONG ENOUGH */
/* MAKES EVERYTHING LOWER CASE FOR ALL OF THOSE PEOPLE WHO PUT */
/* Yahoo, yahoo, YAHOO etc TO BE COUNTED AS ALL ONE WEBSITE */
data readsites ;
set urls ;
attrib website1 website2 length = $32. ;
web = lowcase(web) ;
**** If someone put “none” (with trailing blanks trimmed) , then
**** Website1 is equal to blank and the number of sites named = 0 ;
if trim(web) = “none” then do ;
website1 = “”;
namesite = 0 ;
end;
**** I use the index function to see if there is a “,”
**** If there is a “,” in the field that means the person entered two websites (or more)
**** separated by a comma, so I set the value for number of sites named to 2
**** I read in the first website from the beginning of the string until just before the comma
**** I read in the second website starting after the comma to the end ;
else do ;
fnd = index(web,”,”) ;
if fnd > 0 then do ;
namesite = 2 ;
website1 = substr(web,1,(fnd -1)) ;
website2 = substr(web,(fnd +1),length(web)) ;
end ;
**** If there is only one website listed, then the first website is equal to the whole string ;
if fnd = 0 then do ;
website1 = web ;
namesite = 1 ;
end ;
**** If there is no “.” then the person just entered something like ‘yahoo’
**** I tack an http://www. in the front and a .com at the end ;
if index(website1,”.”) = 0 then
website1 = “http://www.” || trim(website1) || “.com” ;
**** If I get to this point without a www then the person entered something like bebo.com
**** so, I tack http://www at the beginning ;
if index(website1,”www”) = 0 then
website1 = “http://www.” || trim(website1) ;
if index(website2,”.”) = 0 then
website2 = “http://www.” || trim(website2) || “.com” ;
else if index(website2,”www”) = 0 then
website2 = “http://www.” || trim(website2) ;
end;
When I get done, for all of those who entered 0, 1 or 2 websites I have a correctly formed url, thanks to the INDEX, SUBSTR, LENGTH and TRIM functions in SAS. The || concatenates strings, but you probably guessed that. I also have a variable that tells me if the person entered zero, one or two or more websites. You might have wondered about appending the .com. I just happened to know that in this dataset the few sites that were .edu or .gov had that entered.
My point – and it may surprise you to hear that I have one – is that while several years ago character functions were mildly interesting and occasionally useful, with the explosion in the amount of unstructured data being collected, functions like these are now one of the basic prerequisites of life. They are right up there with, say, coffee at any hour before 9 a.m. when one is expected to simulate a coherent human being.