Random basic SAS tips on ranuni, sampling with replacement & junk
1. Use the data sets in the sashelp directory for dummy data.
2. The RANUNI function is worth remembering
Today I needed to check something. Specifically, I was using the ranuni function to generate random numbers for sampling with replacement. I wanted the data set to be sorted randomly, select the first match, then re-sort the data and re-sample. (Yes, obviously, I was doing propensity score matching.)
When you use a 0 for the seed, e.g.,
randnum = ranuni(0) ;
SAS actually uses the time of day. I was not sure to how many seconds, microseconds, nanoseconds or whatever it rounds the time. Even if I had known, I’m not sure that would have helped me, because what I wanted to know was whether the rounding factor is small enough that, when my program is running with a small data set, there is already a new seed by the next step. I *thought* so but, you know, there is a reason we do testing. It’s because things don’t always come out the way you think.
I was going to create a dummy data set and then I realized, hey! There are all kinds of data sets already in the sashelp directory. You may never have looked at them, or noticed, but they are there. So, I did this :
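* First pass: a seed of 0 takes the seed from the system clock ;
Data yes ;
set sashelp.air ;
randnum = ranuni(0) ;
run ;
proc means data = yes ;
var randnum ;
run ;
* Second pass in a new step: overwrite randnum with a fresh draw ;
Data yes ;
set yes ;
randnum = ranuni(0) ;
run ;
proc means data = yes ;
var randnum ;
run ;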
I did the PROC MEANS because I was too lazy to open the two data sets and look at the numbers. Yes, it worked. I expected it would.
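If you would rather not eyeball two sets of means, here is a minimal sketch of a more direct check. It is not what I ran above, just one way to do it: it keeps both draws side by side and flags any row where they match. The names first, both, rand1, rand2 and same are made up for illustration.

```sas
/* Sketch only: first, both, rand1, rand2 and same are hypothetical names. */
data first;
  set sashelp.air;
  rand1 = ranuni(0);        /* seed 0: stream is initialized from the system clock */
run;

data both;
  set first;
  rand2 = ranuni(0);        /* new step, so the stream is initialized again */
  same = (rand1 = rand2);   /* 1 if the two draws match on this row, else 0 */
run;

/* A sum of 0 for same means no row received the same number twice. */
proc means data = both n sum;
  var same;
run;
```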
3. Avoid the unconditional ELSE statement.
This is a habit I got into years ago. I’m not one for giving other people rules for how to code because I think most of those rules given out as gospel are just one of many acceptable ways of doing things. This one, however, is worth remembering. Say you have two possible conditions, experimental and control. You would *think* it would make sense to type
If group = 1 then output experiment ;
else output control ;
It would make sense to you if you have never met any actual people. There are data entry errors, where someone types 11 or the letter “l” instead of a 1. People type “experiment” instead of 1. They leave that field blank because they don’t know if this person was in the experimental or control group. All of those people end up in the control group. The technical term statisticians use for this state of affairs is “bad”.
Instead, at a minimum, do this.
If group = 1 then output experiment ;
else if group = 0 then output control ;
I worked with someone who had a good habit of always creating one more data set than he needed, which he named junk. Then, at the end of every IF statement that sent data to different groups, came this:
If group = 1 then output experiment ;
else if group = 0 then output control ;
else output junk ;
Not only did your 1s only go to the experimental group and your 0s only go to the control group but you also had a dataset that collected the junk where you could look at who these people were and try to figure out what their problem was. Personally, I don’t do that as a routine, but it is a good habit.
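To see the whole thing run end to end, here is a small sketch. The trial data set and its values are invented for illustration; group is assumed to be numeric, so a typed-in word or letter would need a character variable and quoted comparisons instead.

```sas
/* Hypothetical input: trial, id and the group values are made up. */
data trial;
  input id group;
  datalines;
1 1
2 0
3 11
4 .
5 0
;
run;

/* All three output data sets must be named on the DATA statement. */
data experiment control junk;
  set trial;
  if group = 1 then output experiment;
  else if group = 0 then output control;
  else output junk;   /* the mistyped 11 and the missing value land here */
run;

proc print data = junk;
run;
```

With this input, ids 3 and 4 are the ones that end up in junk, which is exactly the list you want to hand back to whoever did the data entry.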
Ooh… I like John Cook’s blog, and I read some SAS blogs…but I REALLY like your blog – and now I get some SAS tips too? Life is good…
For more information on random number seeds and how they work in SAS, see http://blogs.sas.com/content/iml/2011/08/31/random-number-streams-in-sas-how-do-they-work/
For long-running simulations/permutations, you might want to use the newer RAND function instead of the older RANUNI function. See the last paragraph of
http://blogs.sas.com/content/iml/2011/10/19/four-essential-functions-for-statistical-programmers/
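As a rough sketch of what that swap might look like for the example in the post (the seed 12345 is an arbitrary choice, and this is only one way to write it):

```sas
/* Sketch: same idea as the post's DATA step, using RAND instead of RANUNI. */
data yes;
  set sashelp.air;
  if _n_ = 1 then call streaminit(12345);   /* arbitrary seed, set once */
  randnum = rand('UNIFORM');                /* uniform draw between 0 and 1 */
run;

proc means data = yes;
  var randnum;
run;
```

A fixed positive seed makes the run repeatable, which helps when debugging a long simulation.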
Thanks, Rick. I’ll check it now. Turns out I just finished something which is a very long running simulation and although it’s running (good) I’m looking for ways to improve it.
If I already have a dataset, how do I randomize it using ranuni()?
If you mean how do you get it in random order, you can create a variable just as shown above and then sort on that variable.
Data yes ;
Set yes ;
randnum = ranuni(0) ;
run ;
proc sort data = yes ;
by randnum ;
run ;
data rand;
input CT RT @@;
Label CT = 'Circuit Type';
Label RT = 'Response Time';
datalines;
1 9 1 12 1 10 1 8 1 15
2 20 2 21 2 23 2 17 2 30
3 6 3 5 3 8 3 16 3 7
;
order = ranuni(0);
output;
end;
end;
proc print data = rand;
title 'Raw Dataset';
run;
proc sort data = rand;
by order;
run;
proc print data = rand;
title 'Randomized Dataset';
run;
This is my SAS code. I am getting the same result in the raw and the randomized dataset and cannot figure out my mistake. It is either in the way I entered the data or in how I used ranuni().
You should have also gotten some errors in your log.
Once you had the datalines statement, the data, and the ; after the data, that ended that data step. To create the order variable, you need to start a new data step.
Try this. It will work.
data rand;
input CT RT @@;
Label CT = 'Circuit Type';
Label RT = 'Response Time';
datalines;
1 9 1 12 1 10 1 8 1 15
2 20 2 21 2 23 2 17 2 30
3 6 3 5 3 8 3 16 3 7
;
proc print data = rand;
title 'Raw Dataset';
run;
* A new data step is needed to add the order variable ;
data rand ;
set rand ;
order = ranuni(0);
run;
proc sort data = rand;
by order;
run;
proc print data = rand;
title 'Randomized Dataset';
run;
Hi,
I do not know why the following two data steps produce the same random numbers for x_10 and x_20. Could you please help me in this regard? Is it a bug, or am I doing something illegal? Thank you in advance.
data x1;
do i=1 to 3;
xx=ranuni(401);
do j= 1 to 2;
x_10=ranuni(10); output;
end;
end;
run;
data x2;
do i=1 to 3;
xx=ranuni(401);
do j= 1 to 2;
x_20=ranuni(20); output;
end;
end;
run;
I also checked the rand() function; it also produces the same random numbers. When I delete “xx=ranuni(401);”, the two steps produce different random numbers, as expected. Thank you again.
Actually, the seed is only used on your first observation. Even if you used ranuni(1), your random number wouldn’t be the same for every observation. What SAS does is this: for _n_=1, it uses the time of day as the seed. Let’s say the random number it generated was 12345/(2^31-1) (ranuni generates a random integer between 1 and 2^31-2, then normalizes it to be between 0 and 1 by dividing by 2^31-1). At _n_=2, the new seed would be 12345, NOT the new time of day.
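The same idea explains the x_10 and x_20 question above. As I understand the documented behavior, every RANUNI function call in a DATA step draws from one stream, and the first seed that executes (401 here) initializes it, so the later 10 and 20 are ignored. If separate streams are what you want, one option is the CALL RANUNI routine with one seed variable per stream. A minimal sketch, where x1_streams, seed401 and seed10 are hypothetical names:

```sas
/* Sketch: one seed variable per stream, using the CALL RANUNI routine. */
data x1_streams;
  seed401 = 401;   /* seed variables must be initialized before the call */
  seed10  = 10;
  do i = 1 to 3;
    call ranuni(seed401, xx);       /* advances only the seed401 stream */
    do j = 1 to 2;
      call ranuni(seed10, x_10);    /* advances only the seed10 stream */
      output;
    end;
  end;
run;
```

Changing seed10 to 20 in a second step would then change that stream without touching the one driven by seed401.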
data train validation;
set new;
if ranuni(0)<=0.60 then output train;
else output valid;
run;
I am getting this error message: ERROR 455-185: Data set was not specified on the DATA statement.
The dataset in the SET statement is saved in my work folder. Please advise.
Hi, Shivi –
In your DATA step you named the dataset validation but in your output you called it valid.
Change one of those two so that the names match.
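For example, keeping the name validation in both places would look like the sketch below. The first step builds a stand-in for new from sashelp.air only so the code runs on its own; your own data set would take its place.

```sas
/* Stand-in for the commenter's data set, for illustration only. */
data new;
  set sashelp.air;
run;

/* Every name used in an OUTPUT statement also appears on the DATA statement. */
data train validation;
  set new;
  if ranuni(0) <= 0.60 then output train;   /* roughly 60 percent of rows */
  else output validation;                   /* the remaining rows */
run;
```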