Title: Simple random sampling without replacement
Author: shiyiming Time: 2003-10-3 23:42

/* To select a random sample where no observation can be chosen more than once */
/* Build a test population: grades 9 to 12, each with 100 to 300 students */
data EastHigh;
   format GPA 3.1;
   do Grade=9 to 12;
      do StudentID=1 to 100+int(201*ranuni(432098));
         GPA=2.0 + (2.1*ranuni(34280));
         output;
      end;
   end;
run;
/* Method 1: PROC SURVEYSELECT, simple random sampling (SRS) of 50 rows without replacement */
proc surveyselect data=EastHigh method=srs n=50 out=sample;
run;
proc print data=sample;
run;
/* Method 2: attach a uniform random key, sort by it, keep the first 50 rows */
data random;
   set EastHigh;
   x=ranuni(1234);
run;
proc sort data=random;
   by x;
run;
data sample(drop=x);
   set random (obs=50);
run;
proc print data=sample;
run;
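The sort-based step above can also be sketched outside SAS. The following Python fragment is only an illustration of the same idea (tag every record with a uniform random key, sort by the key, keep the first n). The function name `sample_by_random_sort` and the toy population of record IDs are assumptions, not part of the original post.

```python
import random

def sample_by_random_sort(rows, n, seed=1234):
    """Return n distinct rows: tag each row with a uniform random key,
    sort by the key, and keep the first n rows."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]   # like x=ranuni(1234)
    keyed.sort(key=lambda pair: pair[0])            # like proc sort; by x
    return [row for _, row in keyed[:n]]            # like set random(obs=50)

population = list(range(1, 801))                    # hypothetical record IDs
sample = sample_by_random_sort(population, 50)
print(len(sample), len(set(sample)))
```

The sort costs O(N log N) and needs a full copy of the data, which is the main reason the one-pass methods later in the thread are attractive for large tables.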
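/* Method 3: one sequential pass; k = rows still needed, n = rows still unread */
data sample(drop=k n);
   retain k 50 n;
   /* nobs= is filled in at compile time, so total is available before SET runs */
   if _n_=1 then n=total;
   set EastHigh nobs=total;
   /* keep the current row with probability k/n */
   if ranuni(1230498) <= k/n then
      do;
         output;
         k=k-1;
      end;
   n=n-1;
   if k=0 then stop;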
run;

Author: shiyiming Time: 2003-10-9 23:45
That is too complicated; there is a much simpler way.
%let sampsize=100;
data tmp;
   set sashelp.prdsale nobs=nobs;
   retain _cnt_ 0;
   /* keep row _N_ with probability (rows still needed) / (rows still unread) */
   if &sampsize > _cnt_ and ranuni(0) * (nobs + 1 - _N_) < (&sampsize - _cnt_) then do;
      _cnt_+1;
      output;
   end;
   drop _cnt_;
run;
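To make the selection rule in this DATA step easier to follow, here is a rough Python sketch of the same logic. It is a translation for illustration only, not the poster's code. The names `select_one_pass`, `still_needed` and `still_unread` are made up for this sketch, and the input is a hypothetical list of row IDs rather than `sashelp.prdsale`.

```python
import random

def select_one_pass(rows, k, seed=None):
    """One sequential pass over rows; keeps exactly k of them when k <= len(rows)."""
    rng = random.Random(seed)
    n_total = len(rows)                        # plays the role of nobs
    kept = []                                  # len(kept) plays the role of _cnt_
    for i, row in enumerate(rows, start=1):    # i plays the role of _N_
        still_needed = k - len(kept)
        still_unread = n_total + 1 - i         # rows left, including this one
        # same test as: ranuni(0) * (nobs + 1 - _N_) < (&sampsize - _cnt_)
        if still_needed > 0 and rng.random() * still_unread < still_needed:
            kept.append(row)
    return kept

sample = select_one_pass(list(range(1, 1001)), 100, seed=42)
print(len(sample))
```

Because the random draw is always below 1, a row is kept for certain as soon as the number still needed equals the number still unread, so the pass cannot finish short of k rows.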
With a small change, this can also do stratified random sampling.

Author: shiyiming Time: 2003-10-20 14:39 Title: Wouldn't this do the job?

/* Set your input data set here */
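%let in=%str(in_data);
/* Set your output sample data set here */
%let sample=%str(sample);
/* Set your sampling fraction here, e.g. 0.1 for 10 percent */
/* (%eval handles integers only, so the value is assigned directly) */
%let percent=0.1;
/* Each row is kept independently, so the sample size varies from run to run */
data &sample;
   set &in;
   if ranuni(2)<=&percent then output &sample;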
run;

Author: shiyiming Time: 2004-3-6 01:17 Title: to charles
Ha, brilliant, truly brilliant!

Author: shiyiming Time: 2004-3-6 03:03
I have not dug into it, but if you think this algorithm over again, you may find some problems. I have not tried to run it; this is only a conjecture from reading the code. Admittedly, the requirements are not stated in the title.
There might be two problems:
1. I do not think the algorithm will always produce a sample of the given size.
2. The sample will not be uniformly distributed over the sample space.
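Both points can be checked exactly on a small case. The sketch below assumes the concerns refer to the one-pass rule from the second post (keep record i of N when u * (N + 1 - i) < k - kept, that is, with probability "still needed" over "still unread"). It walks through every keep/skip path with exact fractions and adds up the probability of each resulting sample. The function name `subset_probabilities` and the choice of 6 records with a sample of 3 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def subset_probabilities(n_total, k):
    """Exact probability of every sample the one-pass rule can produce."""
    probs = {}

    def walk(i, chosen, p):
        if p == 0:                      # path cannot happen, prune it
            return
        if i > n_total:                 # all records read: record this sample
            probs[chosen] = probs.get(chosen, Fraction(0)) + p
            return
        needed = k - len(chosen)
        unread = n_total + 1 - i        # records left, including record i
        p_keep = min(Fraction(needed, unread), Fraction(1))
        walk(i + 1, chosen + (i,), p * p_keep)     # record i kept
        walk(i + 1, chosen, p * (1 - p_keep))      # record i skipped

    walk(1, (), Fraction(1))
    return probs

probs = subset_probabilities(6, 3)
print({len(s) for s in probs})          # sizes of all reachable samples
print(len(probs) == comb(6, 3))         # are all 3-subsets reachable?
print(set(probs.values()))              # distinct sample probabilities
```

For this small case every reachable sample has exactly 3 records, all 20 possible 3-record samples occur, and each has probability 1/20, so under the stated assumption neither concern applies to that rule. The independent-draw approach (`ranuni(2)<=&percent`) is a different matter: it does not fix the sample size.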
I might be wrong; just my 2 cents.

Author: shiyiming Time: 2004-3-6 11:56
If you look at the log window of SAS Enterprise Miner, you will find that SAS does it the same way. P.S. As far as I know, the code has been tested on some decent financial cases overseas.
The algorithm is simple to verify: create the sample, compare it with the whole population, and check whether it meets the row count and the distribution.
There are quite a few ways to do sampling in SAS:
PROC SURVEYSELECT provides built-in options for choosing a specific method, but it is quite inefficient;
using a distribution function such as uniform() to customize the method (the approach above) is suitable for experienced programmers;
for cases that do not need to be very accurate, you can even use rantbl(), which is quite efficient for large databases.
For example, for a 10% sample:
data sample;
   set population(where=(rantbl(-1,0.1) = 1));
run;
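Since each row is kept independently with probability 0.1, the number of rows in such a sample is not fixed: it follows a binomial distribution. The short Python calculation below is a sketch of how large that variation is. The function name `bernoulli_sample_size_stats` is an assumption, and the row count of 400 million is the figure quoted in this post.

```python
import math

def bernoulli_sample_size_stats(n_rows, p):
    """Mean and standard deviation of the sample size when every row is kept
    independently with probability p, i.e. size ~ Binomial(n_rows, p)."""
    mean = n_rows * p
    std = math.sqrt(n_rows * p * (1.0 - p))
    return mean, std

mean, std = bernoulli_sample_size_stats(400_000_000, 0.1)
print(f"expected rows: {mean:,.0f}, standard deviation: {std:,.0f}")
```

On a table of that size the expected sample is about 40 million rows with a standard deviation of about 6,000 rows, so the relative deviation from 10% is tiny. On a small table the same method can miss the target size by a noticeable margin.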
We tested it on 30 GB of data (400M rows); it also works with SAS/ACCESS to other DBMSs.

Author: shiyiming Time: 2004-3-7 09:24
SAS_Dream might be correct. Although I still do not see how to guarantee that the sample is uniformly distributed over the sample space, I ran the following simulation. Over 1000 runs, the samples appear to be fairly evenly distributed.
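/* Note: PROC APPEND adds to work.total, so delete it before rerunning */
%macro test;
   %let sampsize=1;
   /* population of 10 records, i = 1 to 10 */
   data temp;
      do i=1 to 10;
         output;
      end;
   run;
   /* draw a sample of size 1, 1000 times, and stack the results */
   %do i=1 %to 1000;
      data tmp;
         set temp nobs=nobs;
         retain _cnt_ 0;
         if &sampsize > _cnt_ and ranuni(0) * (nobs + 1 - _N_) < (&sampsize - _cnt_) then do;
            _cnt_+1;
            output;
         end;
         drop _cnt_;
      run;
      proc append base=total data=tmp force;
      run;
   %end;
%mend;
%test;
/* each value of i should appear roughly 100 times */
proc freq data=total;
   table i;
run;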
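One way to put a number on "fairly distributed" is a chi-square goodness-of-fit statistic against equal frequencies. The Python sketch below repeats the same experiment (10 records, sample size 1, 1000 runs) and computes that statistic. It is an illustration with made-up names (`pick_one_pass`, the seed 2004), not a rerun of the SAS macro, and its counts will differ from any SAS output. In SAS, adding the CHISQ option to the TABLES statement should give a comparable test for a one-way table.

```python
import random

def pick_one_pass(n_total, k, rng):
    """Indices 1..n_total kept by the rule u * (n_total + 1 - i) < k - kept."""
    kept = []
    for i in range(1, n_total + 1):
        if k > len(kept) and rng.random() * (n_total + 1 - i) < k - len(kept):
            kept.append(i)
    return kept

rng = random.Random(2004)            # fixed seed so the run is repeatable
runs, n_total, k = 1000, 10, 1       # same settings as the macro
counts = {i: 0 for i in range(1, n_total + 1)}
for _ in range(runs):
    for i in pick_one_pass(n_total, k, rng):
        counts[i] += 1

expected = runs * k / n_total        # 100 per value if selection is uniform
chi_square = sum((c - expected) ** 2 / expected for c in counts.values())
print(counts)
print(f"chi-square = {chi_square:.2f} on {n_total - 1} degrees of freedom")
```

With 9 degrees of freedom the 5% critical value is about 16.92, so a statistic below that gives no evidence against uniform selection. A simulation like this can only fail to find a problem; it does not prove uniformity.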