SAS中文论坛

Title: Optimizing a sort over a very large number of observations?

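Author: shiyiming    Time: 2010-8-10 14:32
Title: Optimizing a sort over a very large number of observations?
I have a very large number of observations, roughly 200 million.
I sort them with PROC SORT: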

libname tem "d:\";

proc sort data=tem.ex out=tem.ex;
by num2 num1;
run;

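With 100 million observations there is no problem; with 200 million the error below appears. According to the attached log, I need more free space on the C: drive.

The real issue is that this needs optimizing: besides the "No disk space" error, the sort takes a very long time. Note also that tem.ex contains duplicate records, and they must be kept.

What should I do to save time and space? Would a hash work?

Log attached: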
ERROR: No disk space is available for the write operation.  Filename = C:\DOCUME~1\sss\LOCALS~1\Temp\SAS Temporary
Files\SAS_util000100000F7C_sxlion\ut0F7C000004.utl.
ERROR: Failure while attempting to write page 3390 of sorted run 4.
ERROR: Failure while attempting to write page 20511 to utility file 2.
ERROR: Failure merging sorted runs from utility file 1 to utility file 2 during merge pass 1.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.
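
The path in the log is where SAS wrote its sort utility files, and it sits on C: even though the library itself is on D:. A minimal sketch for checking this, assuming a SAS version that has the UTILLOC system option; the directory `d:\sasutil` is a hypothetical placeholder, not something from this thread:

```sas
/* Show where SAS currently keeps WORK data sets and utility files. */
proc options option=work;
run;

proc options option=utilloc;
run;

/* UTILLOC can only be set when SAS starts, for example on the
   command line or in the configuration file:
       -utilloc "d:\sasutil"
   (d:\sasutil is a hypothetical directory on a drive with more
   free space). */
```

Pointing the utility location at a larger drive only addresses the "No disk space" error; it does not by itself make the sort faster.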
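Author: shiyiming    Time: 2010-8-10 21:15
Title: Re: Optimizing a sort over a very large number of observations?
Two approaches:
1. Segment the data: split it into several pieces, sort each piece, then put them back together.
2. Avoid producing this kind of time-consuming, disk-eating monster data set in the first place; nip it in the bud.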
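
The post does not say how the pieces were split, so the following is only one possible sketch of approach 1. It assumes the data can be divided on the leading BY variable `num2`; the cut-off value 1000 and the data set names `ex_lo`, `ex_hi`, and `ex_sorted` are hypothetical.

```sas
libname tem "d:\";

/* Split on the leading sort key so the two pieces do not overlap.
   Missing num2 values compare low and therefore land in ex_lo. */
data tem.ex_lo tem.ex_hi;
  set tem.ex;
  if num2 < 1000 then output tem.ex_lo;   /* 1000 is a hypothetical cut-off */
  else output tem.ex_hi;
run;

/* Each piece is smaller, so each sort needs less utility space. */
proc sort data=tem.ex_lo;
  by num2 num1;
run;

proc sort data=tem.ex_hi;
  by num2 num1;
run;

/* The pieces are disjoint and ordered on num2, so concatenating
   them in order yields data sorted by num2 num1. Duplicates are kept. */
data tem.ex_sorted;
  set tem.ex_lo tem.ex_hi;
run;
```

This trades temporary sort space for extra copies of the data on D:, so it only helps when the library drive has room for them.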
Author: shiyiming    Time: 2010-10-22 23:00
Title: Re: Optimizing a sort over a very large number of observations?
to sxlion
Do you have many variables? Use the NOTAG option,
or use an INDEX.
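Author: shiyiming    Time: 2010-10-23 21:24
Title: Re: Optimizing a sort over a very large number of observations?
INDEX
With duplicate observations, that probably won't work, I'd guess.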
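
For reference, an index created without the UNIQUE option accepts duplicate key values. A minimal sketch, assuming a composite index on the two BY variables; the index name `num2_num1` and the output name `ex_sorted` are hypothetical, and nobody in this thread has timed it on 200 million observations:

```sas
libname tem "d:\";

/* Build a composite, non-unique index on the sort keys. */
proc datasets library=tem nolist;
  modify ex;
  index create num2_num1 = (num2 num1);
quit;

/* A BY statement on an unsorted but indexed data set lets SAS read
   the observations in key order through the index. Writing them out
   gives a physically ordered copy, duplicates included. */
data tem.ex_sorted;
  set tem.ex;
  by num2 num1;
run;
```

Reading a whole data set through an index means many non-sequential reads, so this may well be slower than a sort even though it avoids the sort utility files.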
Author: Qiong    Time: 2010-10-26 17:08
Title: Re: Optimizing a sort over a very large number of observations?
If you are running this on a local PC, try MEMLIB:
http://support.sas.com/resources/papers/proceedings10/070-2010.pdf
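
A rough sketch of what the MEMLIB suggestion could look like, assuming SAS for Windows, where the LIBNAME statement accepts a MEMLIB option. The libref `mem` and the directory `c:\memlib` are hypothetical, and whether 200 million observations fit in the available memory is untested here:

```sas
libname tem "d:\";

/* Assumption: SAS for Windows. MEMLIB asks SAS to process this
   library in memory; c:\memlib is a hypothetical directory. */
libname mem "c:\memlib" memlib;

/* Write the sorted output to the memory-based library. */
proc sort data=tem.ex out=mem.ex_sorted;
  by num2 num1;
run;
```

The linked paper is the place to check the details and memory limits before relying on this.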
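Author: shiyiming    Time: 2010-10-27 10:06
Title: Re: Optimizing a sort over a very large number of observations?
to oloolo
Could you explain in more detail how these two are used and what they do? Thanks!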
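Author: shiyiming    Time: 2010-10-27 15:27
Title: Re: Optimizing a sort over a very large number of observations?
Thanks, everyone, for the replies and the attention. At the time I got around the problem with method 1 from the second post, but I doubt that is the final solution; SAS surely has a better trick.

As for vick's idea of reading the data into memory, I suspect it might speed things up but would do little for the space problem. That is only a guess, though.

oloolo's NOTAG option is new to me and I have not tested it yet; I have not tested the index either. I will report back once I have.

P.S. Busy now! :(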
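Author: shiyiming    Time: 2010-10-27 18:12
Title: Re: Optimizing a sort over a very large number of observations?
My mistake: it is the TAGSORT option. My head was fuzzy and I only remembered the "tag" part, sorry about that. On a PC, I sorted a data set with 3 billion observations in roughly 60 minutes. Each record was about 100 bytes long, with 4 numeric variables and 3 character variables, and most of the length came from the character variables. Your records are probably quite long, which is exactly the case TAGSORT suits.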

The TAGSORT option in the PROC SORT statement is useful in sorts when there may not be enough disk space to sort a large SAS data set. When you specify TAGSORT, the sort is a single-threaded sort. Do not specify TAGSORT if you want SAS to use multiple threads to sort.

When you specify the TAGSORT option, only sort keys (that is, the variables specified in the BY statement) and the observation number for each observation are stored in the temporary files. The sort keys, together with the observation number, are referred to as tags. At the completion of the sorting process, the tags are used to retrieve the records from the input data set in sorted order. Thus, in cases where the total number of bytes of the sort keys is small compared with the length of the record, temporary disk use is reduced considerably. You should have enough disk space to hold another copy of the data (the output data set) or two copies of the tags, whichever is greater. Note that while using the TAGSORT option may reduce temporary disk use, the processing time may be much higher. However, on PCs with limited available disk space, the TAGSORT option may allow sorts to be performed in situations where they would otherwise not be possible.
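
Applied to the sort from the first post, the option would look roughly like this. The output name `ex_sorted` is a hypothetical choice made so that the input data set is not overwritten:

```sas
libname tem "d:\";

/* TAGSORT keeps only the BY variables and the observation number
   in the temporary files, then retrieves the full records from the
   input data set in sorted order. Duplicate records are kept. */
proc sort data=tem.ex out=tem.ex_sorted tagsort;
  by num2 num1;
run;
```

As the quoted documentation notes, this can reduce temporary disk use at the cost of a longer run time, so it is worth timing on a subset first.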



