Simple random sampling without replacement

shiyiming · posted on 2003-10-3 23:42:10

/* To select a random sample where no observation can be chosen more than once */

/* Test data: 100 to 300 students in each of grades 9 to 12, GPA between 2.0 and 4.1 */
data EastHigh;
  format GPA 3.1;
  do Grade=9 to 12;
    do StudentID=1 to 100+int(201*ranuni(432098));
      GPA=2.0 + (2.1*ranuni(34280));
      output;
    end;
  end;
run;

/* Method 1: PROC SURVEYSELECT, simple random sampling of 50 rows */
proc surveyselect data=EastHigh method=srs n=50 out=sample;
run;
proc print data=sample;
run;

/* Method 2: attach a random key, sort by it, keep the first 50 rows */
data random;
  set EastHigh;
  x=ranuni(1234);
run;
proc sort data=random;
  by x;
run;
data sample(drop=x);
  set random (obs=50);
run;
proc print data=sample;
run;
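
/* Method 3: one sequential pass; keep each row with probability k/n,
   where k = rows still needed and n = rows not yet examined */
data sample(drop=k n);
  retain k 50 n;
  if _n_=1 then n=total;
  set EastHigh nobs=total;
  if ranuni(1230498) <= k/n then
  do;
    output;
    k=k-1;
  end;
  n=n-1;
  if k=0 then stop;
run;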

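shiyiming · posted on 2003-10-9 23:45:01

That's too complicated; there is a much simpler way.
%let sampsize=100;
data tmp;
  set sashelp.prdsale nobs=nobs;
  retain _cnt_ 0;
  /* keep this row with probability (rows still needed)/(rows not yet examined) */
  if &sampsize > _cnt_ and ranuni(0) * (nobs + 1 - _N_) < (&sampsize - _cnt_) then do;
    _cnt_+1;
    output;
  end;
  drop _cnt_;
run;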

With a small change, this can also do stratified random sampling.
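
The post does not show the stratified version, so the following is only one possible sketch. It assumes the strata are the values of COUNTRY in sashelp.prdsale and that 5 rows are wanted from every stratum; the data set names sorted, cnt and strat_sample and the counter _seen_ are made up for the illustration. The selection test is the one above, with the stratum size and the position within the stratum in place of nobs and _N_.

```sas
%let sampsize=5;   /* rows wanted per stratum (assumed value) */

/* BY-group processing needs the data sorted by the stratum variable */
proc sort data=sashelp.prdsale out=sorted;
  by country;
run;

/* stratum sizes: one row per COUNTRY, COUNT = number of rows in it */
proc freq data=sorted noprint;
  tables country / out=cnt(keep=country count);
run;

data strat_sample;
  merge sorted cnt;          /* attaches COUNT to every row of its stratum */
  by country;
  if first.country then do;  /* restart both counters in each stratum */
    _cnt_=0;
    _seen_=0;
  end;
  _seen_+1;                  /* sum statements retain their variables */
  if &sampsize > _cnt_
     and ranuni(0) * (count + 1 - _seen_) < (&sampsize - _cnt_) then do;
    _cnt_+1;
    output;
  end;
  drop _cnt_ _seen_ count;
run;
```

If a stratum has fewer than &sampsize rows, this sketch simply keeps all of them.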

shiyiming · posted on 2003-10-20 14:39:25

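/* Please set your input data here */
%let in=in_data;
/* Please set your output sample here */
%let sample=sample;
/* Please set your sampling fraction here, e.g. 0.1 for 10 percent
   (%eval only does integer arithmetic, so assign the value directly) */
%let percent=0.1;

/* Keep each row independently with probability &percent */
data &sample;
  set &in;
  if ranuni(2)<=&percent then output &sample;
run;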
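
A short note on what this filter produces, as a sketch and not something the post states: each row is kept independently with probability p (the value of &percent), so the number of rows kept, K, is itself random. Writing N for the number of rows in &in (N and K are names introduced here):

```latex
K \sim \mathrm{Binomial}(N, p), \qquad
\mathbb{E}[K] = Np, \qquad
\operatorname{sd}(K) = \sqrt{Np(1-p)}
```

For example, with N = 10000 and p = 0.1 the sample has 1000 rows on average, with a standard deviation of sqrt(900) = 30, so the row count is only approximately 10% of the input.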

shiyiming · posted on 2004-3-6 01:17:05

Haha, brilliant, truly brilliant!

shiyiming · posted on 2004-3-6 03:03:17

I have not dug into it or run it; this is only a conjecture from reading the code. But if you all look at this algorithm again, you may find some problems. Admittedly, the title does not spell out the requirements.

There might be two problems:

1. I do not think the algorithm will always produce a sample of the given size.

2. The sample will not be uniformly distributed over the sample space.

I might be wrong; just my 2 cents.
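
Assuming the two concerns refer to the sequential data step posted on 2003-10-9 (the test `ranuni(0) * (nobs + 1 - _N_) < (&sampsize - _cnt_)`), both can be checked on paper. The sketch below uses symbols that are not in the thread: N for nobs, n for &sampsize (with n no larger than N), t for _N_, and c for the value of _cnt_ before record t is examined.

```latex
% m = N - t + 1 records are still unexamined (record t included),
% r = n - c records are still needed.
\Pr(\text{keep record } t \mid c) = \frac{r}{m} = \frac{n - c}{N - t + 1}

% (1) Exact size: r = 0 gives probability 0, so never more than n rows;
%     r = m gives probability 1, so never fewer than n rows.

% (2) Uniformity: fix any subset S with |S| = n and multiply the step probabilities.
%     Denominators: N, N-1, ..., 1.
%     Keeps contribute the numerators n, n-1, ..., 1.
%     Skips contribute m - r, which starts at N - n, is unchanged by a keep,
%     and drops by 1 at each skip: N-n, N-n-1, ..., 1.
\Pr(S) = \frac{n!\,(N-n)!}{N!} = \binom{N}{n}^{-1},
\qquad
\Pr(\text{record } t \in S) = \binom{N-1}{n-1} \Big/ \binom{N}{n} = \frac{n}{N}
```

Under these assumptions the data step returns exactly n rows, and every subset of n rows is equally likely.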

shiyiming · posted on 2004-3-6 11:56:53

If you look in the log window of SAS Enterprise Miner, you will find that SAS does it the same way. P.S. As far as I know, the code has been tested on some decent financial cases overseas.

The algorithm is simple to verify: create the sample, compare it with the whole population, and check that it meets the row count and the distribution.

There are quite a few ways to do sampling in SAS:
PROC SURVEYSELECT has built-in options for choosing a specific method, but it is quite inefficient;
using a distribution function such as uniform() to customize the method, as above, suits experienced programmers;
for cases that do not need to be very accurate, you can even use rantbl(), which is quite efficient on a large database.
E.g. for a 10% sample:
data sample;
  set population(where=(rantbl(-1,0.1) = 1));
run;

We tested it on 30 GB of data with 400M rows; it also works with SAS/ACCESS to other DBMSs.

shiyiming · posted on 2004-3-7 09:24:47

SAS_Dream might be correct. Although I still do not see how to guarantee that the sample is uniformly distributed over the sample space, I ran the following simulation. The samples look fairly evenly distributed over 1000 runs.

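%macro test;
  %let sampsize=1;
  /* population: i = 1 to 10 */
  data temp;
    do i=1 to 10;
      output;
    end;
  run;
  /* draw one row 1000 times and stack the draws in TOTAL
     (TOTAL should not exist before the first run) */
  %do i=1 %to 1000;
    data tmp;
      set temp nobs=nobs;
      retain _cnt_ 0;
      if &sampsize > _cnt_ and ranuni(0) * (nobs + 1 - _N_) < (&sampsize - _cnt_) then do;
        _cnt_+1;
        output;
      end;
      drop _cnt_;
    run;
    proc append base=total data=tmp force;
    run;
  %end;
%mend;
%test;
/* how often each value of i was drawn */
proc freq data=total;
  table i;
run;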

                   The FREQ Procedure

                                 Cumulative    Cumulative
   i    Frequency     Percent     Frequency      Percent
   1          116       11.60           116        11.60
   2           93        9.30           209        20.90
   3          116       11.60           325        32.50
   4          102       10.20           427        42.70
   5          105       10.50           532        53.20
   6           99        9.90           631        63.10
   7           95        9.50           726        72.60
   8          102       10.20           828        82.80
   9           93        9.30           921        92.10
  10           79        7.90          1000       100.00
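
To put a number on "fairly distributed", one option (not part of the original post) is a chi-square test of equal proportions. If all ten values are equally likely, each has an expected count of 100, and the table above gives a statistic of (16² + 7² + 16² + 2² + 5² + 1² + 5² + 2² + 7² + 21²)/100 = 1110/100 = 11.10 on 9 degrees of freedom, below the 5% critical value of about 16.92. The sketch below assumes the data set total from the simulation still exists and asks PROC FREQ for the same test:

```sas
proc freq data=total;
  /* for a one-way table, CHISQ tests equal proportions across the levels of i */
  tables i / chisq;
run;
```

On this evidence the 1000 draws are consistent with each value being selected with probability 1/10.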

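shiyiming · posted on 2006-6-7 12:18:54

xic, the sampling we are talking about here is all simple random sampling. Naturally, the individual observations do not come from a uniform distribution. But we are sampling from within a sample: this is a second-stage draw from data that has already been collected, and it is only meaningful if that draw is uniform, so whether it is uniform is not really in question. :oops: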
