标题: Summary Numerical Data in a Rolling Window [打印本页] 作者: shiyiming 时间: 2010-10-29 16:49 标题: Summary Numerical Data in a Rolling Window From oloolo's blog on SasProgramming
<p><a href="http://feedads.g.doubleclick.net/~a/QHRut4NR0qlk4lplXB4pnGRTkhU/0/da"><img src="http://feedads.g.doubleclick.net/~a/QHRut4NR0qlk4lplXB4pnGRTkhU/0/di" border="0" ismap="true"></img></a><br/>
<a href="http://feedads.g.doubleclick.net/~a/QHRut4NR0qlk4lplXB4pnGRTkhU/1/da"><img src="http://feedads.g.doubleclick.net/~a/QHRut4NR0qlk4lplXB4pnGRTkhU/1/di" border="0" ismap="true"></img></a></p>Obtain summary statistics over a rolling window for a given data, usually on a time dimension, is not quit easy in SAS, especially the rolling window may contain different number of records and the maximum number is unknown without pass the data once. For example, given a transaction data over several days, a business analyst wants to summarize the data for each 24 hour period. This is actually a recent question on SAS-L.<br />
<br />
There are several approaches. Typically people use an array that is large enough to handle a reasonable guess of maximum number of records within the given interval. Or use a hash table so to manage the data cells dynamically. The coder basically needs to build a stack which is first in first out. Both of these methods are not easy to code and error prone for a beginner.<br />
<br />
This rolling window problem can be efficiently solved by using MultiLabel Format, a feature tend to be ignored. Here are two example. We want to summarize for a rolling window of size 5. Pay attention to the second example.<br />
<pre style="background-color: #ebebeb; border: 1px dashed rgb(153, 153, 153); color: #000001; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>
data fmt;
retain fmtname 'rollwindow' type 'n' hlo 'M';
do start=1 to 10;
end=start+5;
label=cats('time', start);
output;
end;
hlo='O'; label='Out-Of-Bound';
output;
run;
data dsn;
do time=1 to 20;
x=rannor(0);
y=ranuni(0);
output;
end;
run;
proc format cntlin=fmt; run;
proc means data=dsn noprint;
class time /preloadfmt mlf;
format time rollwindow.;
var x y;
output out=summary_roll mean(x y)= std(x y)= /autoname;
run;
data dsn2;
do time=1 to 20;
k=ranpoi(10, 10);
do j=1 to k;
time=time+j/(k+1);
x=rannor(0); y=ranuni(0);
output;
end;
end;
drop k j;
run;
proc means data=dsn2 noprint;
class time/preloadfmt mlf exclusive;
format time rollwindow.;
var x y;
output out=summary_roll2 mean()= std()=/autoname;
run;
</code></pre><br />
When the data is getting large, there will be some computing difficulty and insufficient resource. To solve this problem, simply sort and divide the original data into smaller pieces with an overlap equals to the size of rolling window. When combine the summarized pieces, you should discard the overlap part from the rest pieces. <br />
<br />
The multilabel approach has an overhead that is smaller overall to the other methods and can be easily changed to accommandate other sizes of rolling windows.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/29815492-8224701217907225470?l=www.sas-programming.com' alt='' /></div><img src="http://feeds.feedburner.com/~r/SasProgramming/~4/NqmgAIBcq6E" height="1" width="1"/>