SAS中文论坛 (SAS Chinese Forum)

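Title: Help: How do I do sampling in SAS?

Author: shiyiming Time: 2004-4-13 16:59
Title: Help: How do I do sampling in SAS?
For example, to draw 1 million records from a data set of 100 million records, what method should I use in code? Can it be done through the SAS menus? Thanks!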

Author: shiyiming Time: 2004-4-13 17:29
Title: ^&^
1. Programming: proc sql;

2. Menus: analyst, insight

Author: shiyiming Time: 2004-4-14 02:36
You have to clarify whether you want roughly one million records or exactly one million records. The former is comparatively easier.

data sample;
set total;
if ranuni(0)<0.01 then output;
run;

This gives you a random sample of about one percent of the total population. The result is not necessarily exactly one million records, just close to it.
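
For the "exactly one million" case, one common approach is sequential selection: pick each record with probability (records still needed) / (records still left). The following is only a sketch under assumptions; the data set name sample_exact and the variables k, n_total and n_left are made up for illustration, and total is the same input data set as above.

```sas
data sample_exact;
  retain k 1000000;               /* hypothetical: records still needed */
  if k = 0 then stop;             /* sample is complete, stop reading */
  set total nobs=n_total;         /* n_total = number of observations in total */
  n_left = n_total - _n_ + 1;     /* records not yet examined, including this one */
  if ranuni(0) < k / n_left then do;
    output;                       /* keep this record */
    k = k - 1;
  end;
  drop k n_left;
run;
```

When the number still needed equals the number still left, the probability becomes 1, so the output has exactly the requested size as long as total has at least that many records. It still reads the data set sequentially, so it is no faster than the one percent version.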

Author: shiyiming Time: 2004-4-14 11:49
I am sorry, I forgot to mention something. A data set of this size needs some special handling; otherwise it will take a long time to run. The code in my previous message is simple but not efficient. Please check out some other papers to get an idea. The key is to move the pointer to the beginning of a record without reading it and go through the randomizing step. If the record is picked, read in the rest of the data; otherwise, skip it and go to the next record.
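
If the source is a raw text file rather than a SAS data set, the idea above might be sketched with a trailing @ on the INPUT statement, which holds the record without reading any fields. The file name total.txt and the fields id and amount are hypothetical, chosen only for illustration.

```sas
data sample;
  infile 'total.txt' truncover;   /* hypothetical raw input file */
  input @;                        /* hold the record, read no fields yet */
  if ranuni(0) < 0.01;            /* not picked: go on to the next record */
  input id amount;                /* picked: now read the fields (hypothetical layout) */
run;
```

The saving comes from not converting the fields of the records that are skipped. For a SAS data set, direct access with the POINT= option of the SET statement plays a similar role.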

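Author: shiyiming Time: 2004-4-14 14:55
The efficiency issue xic raised is very important. In practice, the population to be sampled is often a massive amount of data, so sampling efficiency has to be considered.

Generally, efficiency can be improved by using WHERE instead of IF: set total(where=(ranuni(0)< &sampleRatio )), because records that fail the condition do not have to be read into the PDV (program data vector). The difference is not noticeable in every situation, though.

proc surveyselect is available only if you have licensed SAS/STAT. The reason to use it is that it has many options and fairly complete functionality. The reason not to use it is that it is very inefficient, much slower even for the SRS (simple random sampling) method.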
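
As a small illustration of those options, here is a sketch that asks proc surveyselect for an exact sample size with a fixed seed so the run can be repeated. The output name sample_srs, the sample size and the seed value are arbitrary choices for the example; total is the test data set built by the program in this post.

```sas
proc surveyselect data=total
                  method=srs      /* simple random sampling without replacement */
                  n=600           /* exact number of records to draw */
                  seed=20040414   /* fixed seed for a reproducible sample */
                  out=sample_srs;
run;
```

Use sampRate= instead of n= to ask for a fraction of the population rather than a fixed count.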

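The sampling function in SAS data mining (Enterprise Miner, EM) is implemented by a DATA step wrapped behind the scenes rather than by Proc Surveyselect, and for stratified sampling it uses the metadata in the DMDB to help, so it is also more efficient than Surveyselect. However, it cannot be used without EM.

The program below helps compare the efficiency of several sampling methods. Adjust totalSize and sampleRatio to see the efficiency of
1) (where=(ranuni(0)< &sampleRatio ));
2) if ranuni(0)< &sampleRatio
3) surveyselect
Note that each method should be tested independently; otherwise file caching will lead to wrong conclusions.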

%let totalSize = 600000;

%let sampleRatio = 0.001;

data total;
array v(*) v1 - v125;
do i=1 to &totalSize;
do j=1 to dim(v);
v(j) = i;
end;
output;
end;
drop i j;
run;

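/* Method 1: WHERE= data set option */
data sample;
set total(where=(ranuni(0)< &sampleRatio ));
run;

/* Method 2: subsetting with IF */
data sample;
set total;
if ranuni(0)< &sampleRatio then output;
run;

/* Method 3: proc surveyselect, using the same rate as the other two methods */
proc surveySelect data=total sampRate=&sampleRatio out=sample;
run;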

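Author: shiyiming Time: 2004-4-18 00:23
Use the POINT= option of the SET statement for random access.

Syntax:
SET SAS-data-set POINT = point-variable;

point-variable

is a temporary numeric variable that holds the number of the observation to read;
must be assigned a value before the SET statement executes;
must be a variable, such as X; it cannot be a constant, such as 12.

The POINT= option reads data by direct access and does not detect end of file. To keep the DATA step from looping forever, it has to be used together with the STOP statement.

Syntax:
STOP;

Example: creating a systematic (equal-interval) sample.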

data ia.subset;
  do PickIt = 100 to 500 by 100;
    set ia.sale2000 point = PickIt;
    output;
  end;
  stop;
run;

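If the number of observations is not known in advance, you can use the NOBS= option of the SET statement to find the number of observations in the data set.

Syntax:
SET SAS-data-set NOBS = variable;
The NOBS= option creates a temporary variable:

its value is the number of observations in the input data set
it is assigned at compile time
its value is retained
it cannot be modified at execution time

For example: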
data ia.subset;
  do PickIt = 100 to TotObs by 100;
    set ia.sale2000 point = PickIt nobs = TotObs;
    output;
  end;
  stop;
run;

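NOBS= returns the count of all observations, including those marked for deletion. Because TotObs is assigned at compile time, it can be referenced before the SET statement.

Creating a random sample

Several random functions return random numbers; the one commonly used is RANUNI.

Syntax:
RANUNI(seed)

seed is an integer less than 2^31 - 1. If seed is 0 or negative, the clock time is used to initialize the random number stream.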
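
Putting RANUNI together with POINT= and NOBS=, a random sample might be sketched as follows. This is an illustration only: the sample size of 5 and the loop counter i are arbitrary, and because each draw is independent, the same observation can be picked more than once (sampling with replacement).

```sas
data ia.subset;
  do i = 1 to 5;                               /* arbitrary sample size */
    PickIt = ceil(ranuni(0) * TotObs);         /* random observation number from 1 to TotObs */
    set ia.sale2000 point = PickIt nobs = TotObs;
    output;
  end;
  stop;                                        /* POINT= never detects end of file */
  drop i;
run;
```

Replace ranuni(0) with a positive seed such as ranuni(12345) if the same sample has to be reproduced on every run.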

Welcome to SAS中文论坛 (http://mysas.net/forum/)