2. Menu: Analyst, Insight
Author: shiyiming  Time: 2004-4-14 02:36
You need to clarify whether you want roughly one million records or exactly one million records; the former is comparatively easier.
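data sample;
  set total;
  /* ranuni(0) draws a uniform (0,1) number seeded from the system clock; */
  /* each record is kept with probability 0.01                            */
  if ranuni(0) < 0.01 then output;
run;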
This gives you a random sample of about one percent of the total population. The result will not necessarily be exactly one million records, just close to it.
Author: shiyiming  Time: 2004-4-14 11:49
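For the "exactly one million" case mentioned in the first reply, one common approach is sequential selection sampling: each record is picked with probability (records still needed) / (records still unread), which always ends with the requested count. The following is only a sketch under assumptions, not code from the original posts: the names k_left, n_left and n_total are made up for illustration and are assumed not to clash with variables already in total.

```sas
data sample;
  /* k_left: records still to pick; n_left: records not yet examined */
  retain k_left 1000000 n_left;
  set total nobs=n_total;            /* n_total = number of records in total */
  if _n_ = 1 then n_left = n_total;
  /* pick this record with probability k_left / n_left */
  if ranuni(0) < k_left / n_left then do;
    output;
    k_left = k_left - 1;
  end;
  n_left = n_left - 1;
  if k_left = 0 then stop;           /* quota filled, no need to read further */
  drop k_left n_left;
run;
```

If SAS/STAT is available, PROC SURVEYSELECT with METHOD=SRS and SAMPSIZE=1000000 should give the same kind of fixed-size simple random sample without hand-written logic.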
Sorry, I forgot to mention something. A data set of this size needs special handling; otherwise the job will take a long time to run. The code in my previous message is simple but not efficient. Please check out some other papers to get an idea. The key is to move the cursor to the beginning of a record without reading it and go through the randomizing step first: if the record is picked, read in the rest of the data; otherwise, skip it and go on to the next record.
Author: shiyiming  Time: 2004-4-14 14:55
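The efficiency issue that xic raised is very important. In practice, the population to be sampled is often a massive data set, so sampling efficiency must be taken into account.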
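The "decide first, read later" idea described above can be sketched for a raw text file with a trailing @ on the INPUT statement, which holds the record without parsing any fields. This is an illustration under assumptions rather than the code the poster had in mind: the file name total.txt and the fields id, amount and region are hypothetical.

```sas
data sample;
  infile 'total.txt';          /* hypothetical raw text file */
  input @;                     /* hold the record; no fields are parsed yet */
  if ranuni(0) < 0.01;         /* not picked: skip straight to the next record */
  input id amount region $;    /* hypothetical fields, parsed only when picked */
run;
```

The saving comes from not converting fields for the roughly 99 percent of records that are rejected. When the source is already a SAS data set rather than a raw file, the POINT= option of the SET statement is the usual way to jump directly to chosen record numbers instead of reading every record.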