Why won’t SAS see this character variable is equal?
Playing around today with possible activities for my Biostatistics course and used this simulated random sample of race by antigen type. Here is a random trivia question for you that took me about 10 minutes to figure out.
By the way, ChatGPT did not help
I’m interested in how my students could use ChatGPT or other AI, so I often paste questions into ChatGPT or Google Bard to see how the answers compare to reality. ChatGPT said this code should return observations for the second analysis, but it did not.
The second PROC FREQ said “no observations”. Why?
The first chi-square showed, as expected, not enough observations per cell. I decided to try to get a 2x 2 contingency table and a Fisher’s chi-square, but first I was just going to do a chi-square for two racial groups.
data blood ;
input race $ num antigen $ ;
datalines;
white 15 a_anti
white 6 b_anti
white 0 ab_anti
white 29 no_antigen
african_am 7 a_anti
african_am 4 b_anti
african_am 0 ab_anti
african_am 9 no_antigen
asian_am 7 a_anti
asian_am 5 b_anti
asian_am 1 ab_anti
asian_am 7 no_antigen
;
Title "Data from https://sciencecourseware.org/biol300/bloodgroups.html" ;
proc freq data=blood ;
tables race*antigen / chisq;
weight num ;
run;
proc freq data=blood ;
tables race*antigen / chisq;
where race in ("white","asian_am") ;
weight num ;
run;
How I figured it out
First, I looked at the output for the initial chi-square and saw race was listed as “white” and “asian_a”. So, I thought perhaps it was truncating the value of race when it was read in to 8 characters – but that doesn’t explain why “white” didn’t match. Anyway, I tried it again with “white” and “asian_a” as the two values and still nothing.
I thought perhaps there were trailing blanks, I tried using the trim function to remove trailing blanks. Still, no observations.
I did this:
DATA test2;
SET blood ;
SUBSTR(race,1,3) ;
PROC PRINT DATA = test2 ;
And that’s when I spotted it – the first three characters were ” wh” and ” as”. There was a LEADING blank!
There is a function for deleting all blanks as well but I actually just went and deleted the 12 spaces in the 12 lines and it worked.
How did those leading blanks get in there in the first place? I don’t know. I might have just typed a space without thinking or copied the labels from a table.
It’s a little thing, but all told, I spent 10 minutes trying to figure this out, so I thought I’d post the answer here.
I copied/pasted this code directly into ChatGPT to debug and this was the results:
https://imgur.com/a/kqQpCG2
It’s helpful, but you got it without ‘cheating’
The funny thing is, none of those ChatGPT comments indicated the real problem!
It said ‘Inconsistent spacing’ which is in the ballpark, but you are right no mention of LEADING space. FYI this is what Google Bard had to say: https://imgur.com/a/iH7CCGx