标题: 练手系列6: OLS regression in a rolling window [打印本页] 作者: shiyiming 时间: 2010-12-22 04:40 标题: 练手系列6: OLS regression in a rolling window Can't input Chinese at workstation, but here is an interesting question. I have a solution I think that is the fastest so far, but I would like to see how you guys come up with noval solutions:
1. Easy One:
I have a dataset with 500K observations, one independent variable X and one dependent variable Y. I want to conduct an OLS regression sequentially in a 20-observation rolling window and obtain the slope coefficient for X, say \beta. Note that this Linear Normal regression model should have an intercept, but I am not interested in the intercept term.
2. Difficult One:
I have a dataset with 500K observations, M independent variables X1-Xm where m ranges from 2 to 10 and one dependent variable Y. I want to conduct a OLS regression in a 20 observation rolling window and obtain the slope coefficients for X1-Xm, a.k.a \beta_1 to \beta_m. Note that your linear normal regression model should have an intercept term but it is of no interets to me.
I will post my solutions to both questions after 5 followup posts or after this weekend, whichever comes first.
All solutions will be gauged on my PC or Shiyiming's machine based on total real time and memory consumption. Folks, fire up you knowledge on Stat and SAS! <!-- s:D --><img src="{SMILIES_PATH}/icon_biggrin.gif" alt=":D" title="Very Happy" /><!-- s:D -->作者: shiyiming 时间: 2010-12-22 08:20 标题: Re: 练手系列6: OLS regression in a rolling window 好难啊,怎么又扯上我的机器了,哈哈哈。 <!-- s:lol: --><img src="{SMILIES_PATH}/icon_lol.gif" alt=":lol:" title="Laughing" /><!-- s:lol: -->作者: shiyiming 时间: 2010-12-22 08:30 标题: Re: 练手系列6: OLS regression in a rolling window 第一个还好吧,我给了两个方法,第一种方法在我公司的机器上总计用时2秒,在我家的烂PC上总计用时7秒;第二种方法应该是一个比较有经验的SAS programmer能想到的,在我公司的机器上总计用时71秒,在我家总计用时104秒。当然还有用macro来慢慢loop的,那就慢慢等上几个小时吧。。。两种方法都不需要很难的coding。
第二个题需要有点线性代数的底子,coding稍微难点作者: shiyiming 时间: 2010-12-22 10:32 标题: Re: 练手系列6: OLS regression in a rolling window 50万的记录即使读入数据在我的机子上都要超过7秒。差不多50万次运算(regression),7秒真是难以想象。我是2g的内存,cpu速度忘了,用时5.5×60秒。作者: shiyiming 时间: 2010-12-22 10:56 标题: Re: 练手系列6: OLS regression in a rolling window to jingju11
1.86G的C2D啊,3GB内存,不过2GB也够了,这个方法用了53MB的内存,比较耗。
还有更快的,就是要多写点代码,数据只要扫一遍。今晚调一下这个代码。
你的硬盘估计不是很给力作者: shiyiming 时间: 2010-12-22 11:00 标题: Re: 练手系列6: OLS regression in a rolling window 你指的是moving window,而不是每20个记录,right?也就是说如果一个一个的fit model要差不多500k个是吧?作者: shiyiming 时间: 2010-12-22 11:06 标题: Re: 练手系列6: OLS regression in a rolling window to jingju11
是moving window,也就是比如说记录1-20算一组,做OLS回归; 记录2-21算一组,做OLS回归,记录3-23算一组,做回归,以此类推。
所以如果要每一组都fit model的话,确实要499981个回归
我弄这个练手系列6就是要说明搞统计的,一定要对算法有了解
你那个5.5X60秒是怎么来的?
下面是我的log。 我故意把notes和源代码都隐藏了。
2053 %let t0=%sysfunc(datetime(), datetime24.);
2054 options nosource nonotes;
2057 %let t1=%sysfunc(datetime(), datetime24.);
2058 %put Start @ &t0;
Start @ 21DEC2010:22:09:14
2059 %put End @ &t1;
End @ 21DEC2010:22:09:21作者: shiyiming 时间: 2010-12-22 11:48 标题: Re: 练手系列6: OLS regression in a rolling window to oloolo
能用OLS的计算公式计算吗? 这个是最快的,一遍搞定~~一个array{20,2}. FIFO处理方式。
原来学计量的时候都是用gauss矩阵算所有的。作者: shiyiming 时间: 2010-12-22 11:51 标题: Re: 练手系列6: OLS regression in a rolling window to jeozu
当然可以,不用IML,直接在data step里面写
等你的代码了作者: shiyiming 时间: 2010-12-22 12:04 标题: Re: 练手系列6: OLS regression in a rolling window to oloolo
逻辑都全部说出来了也就不用再写了吧~~
对了, IML实现那种算法反而很蹩脚~~
我现在挑灯夜战写开题报告.一大把年纪回到学校上课不容易,更不容易的是还要跟一群小青年比体力~~
献个丑,1.8G, 900M, 1CPU 虚拟机 0.49m
[code:22kfj5cg]
%let N=50;
data a;
do i=1 to 500000;
X=ranuni(0);
Y=0.25*X+rannor(0)+1;
output;
end;
run;
%macro A(t);
ifn(_n_<=&N, 0, &t{i})
%mend;
options mprint ;
data b;
array R{0:%eval(&N-1)} _temporary_;
array L{0:%eval(&N-1)} _temporary_;
set a;
i=mod(_n_,&N);
sum_X+X-%A(R);
sum_Y+Y-%A(L);
sum_XY+X*Y-%A(R) * %A(L);
sum_X2+X**2-%A(R)**2;
if _n_ >=&N then do;
a=(&n*sum_XY-sum_X*sum_Y)/(&n*sum_X2-sum_X**2);
b=sum_Y/&N -a*sum_X/&N;
output;
end;
R{i}=X;
L{i}=Y;
keep a b i;
run;
proc sql noprint;
create table xx as
select mean(a) as a, mean(b) as b
from b;
quit;
[/code:22kfj5cg]作者: shiyiming 时间: 2010-12-22 12:06 标题: Re: 练手系列6: OLS regression in a rolling window 如果y=x仅一个变量我可以做到
real time 1.71s
cpu time 0.76s
我只运算coefficient, that is, beta1
如果是老老实实做regression(将近500k次),可以报告se, p等等,我差不多可以控制在5分钟之内
电脑cpu2.33G, memory4.0G, network processing作者: shiyiming 时间: 2010-12-22 12:11 标题: Re: 练手系列6: OLS regression in a rolling window to jingju11
好,比我强
代码是啥?作者: shiyiming 时间: 2010-12-22 12:13 标题: Re: 练手系列6: OLS regression in a rolling window to jeozu
要的就是代码啊作者: shiyiming 时间: 2010-12-22 12:18 标题: Re: 练手系列6: OLS regression in a rolling window [code:1y9thflp]data had;
keep x1 y;
array x[10];
do i =1 to 5e5;
do j =1 to 10;
x[j] =round(ranuni(111)*10);
end;
y =round(rannor(11),0.01);
output;
end;
run;
**创建数据的时间应该不算吧;
sasfile had load;
data have;
set had;
array t1[0:19] _temporary_; array t2[0:19] _temporary_; array t3[0:19] _temporary_; array t4[0:19] _temporary_;
_dim =mod(_n_, 20);
t1[_dim] =x1; t2[_dim] =y; t3[_dim] =x1*y; t4[_dim] =x1*x1;
if _n_ >=20 then b1 =(sum(of t3[*])-sum(of t1[*])*sum(of t2[*]) /20) /(sum(of t4[*])-20*mean(of t1[*])**2);
run;
sasfile had close;[/code:1y9thflp]作者: shiyiming 时间: 2010-12-22 12:22 标题: Re: 练手系列6: OLS regression in a rolling window 本想用多维数组,最后没有找到reference的方法。所以程序很难看。
如果做regression的话,白天写了个程序用by processing,时间大概5.5分钟。程序懒的再编写了。白天network比较拥挤。如果是现在。我想应该在5分钟之内。作者: shiyiming 时间: 2010-12-22 12:31 标题: Re: 练手系列6: OLS regression in a rolling window to jingju11
嗯,直接用公式。扩展到多变量就把第二个解决了
多维数组:
def: array _x{4,20} _temporary_;
ref: _x[3,2]=....
fun: dim(_x, 2) ...作者: shiyiming 时间: 2010-12-22 12:41 标题: Re: 练手系列6: OLS regression in a rolling window 说实话。我只能依靠_temporary_数组的记忆功能。如果要loop的话,我也没有哪个本事写出来。
我想不明白如何推广到多变量的形式。不过既然每个x形成的面是互相垂直的,说不定简单的相加即可。到此为止。
我现在的兴趣是探讨nlmixed过程的converge问题。希望有兴趣的同志多多探讨一下。谢谢。
京剧作者: shiyiming 时间: 2010-12-22 12:59 标题: Re: 练手系列6: OLS regression in a rolling window 好吧,既然大家没啥兴趣,就算了吧
[code:2krjs26y]
/* Method 1.*/
data rest0;
set test;
array _x{4} _temporary_;
array _a{2,20} _temporary_;
m=mod(_n_-1, 20)+1;
_a[1, m]=x1; _a[2,m]=y;
link filler;
m2=mod(_n_-20, 20)+1;
if _n_>=20 then do;
if _n_>20 then do;
link deduct;
end;
beta=(_x[2]-_x[1]*_x[4]/20)/(_x[3]-_x[1]**2/20);
intercept=_x[4]/20 - beta*_x[1]/20;
keep seq intercept beta ;
output;
end;
return;
filler:
_x[1]+x1;
_x[2]+x1*y;
_x[3]+x1**2;
_x[4]+y;
return;
deduct:
_x[1]=_x[1]-_a[1,m2];
_x[2]=_x[2]-_a[1,m2]*_a[2,m2];
_x[3]=_x[3]-_a[1,m2]**2;
_x[4]=_x[4]-_a[2,m2];
return;
run;
/* Method 2.*/
%macro wrap;
%let window=20;
data testv/view=testv;
set test;
xy=x1*y;
run;
data test2v/view=test2v;
set test;
array _x{2, 20} _temporary_ (20*0 20*0);
k=mod(_n_-1, 20)+1;
_x[1, k]=x1; _x[2, k]=y;
if _n_>=20 then do;
do j=1 to dim(_x, 2);
x=_x[1, j]; y=_x[2, j];
output;
keep time x y;
end;
end;
run;
ods select none;
proc reg data=test2v outest=res2(keep=time x intercept);
by time;
model y = x;
run;quit;
ods select all;