Title: Implement Randomized SVD in SAS. Author: shiyiming. Date: 2010-10-22 13:39. From oloolo's blog on SasProgramming.<br />
The 2010 SASware Ballot® listed a dedicated PROC for Randomized SVD among its options. An official SAS PROC will not be available in the immediate future, nor in older SAS releases, but the algorithm is fairly simple to implement with existing SAS/STAT procedures.<br />
<br />
Randomized SVD is useful for large-scale, high-dimensional data mining problems such as text mining. SAS/Base and SAS/STAT lack sparse matrix operations, which puts any serious text mining task, such as one using the LSI or NMF algorithms, at the edge of infeasibility. Randomized SVD offers an economical alternative: it sacrifices a little accuracy, and the error is bounded under each of the three sampling schemas proposed by the authors [1]. The code below demonstrates sampling schema 1.<br />
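<br />
For orientation, the quantity that the code below stores in <code>_rate_</code> can be written out explicitly. This is a sketch inferred from the code itself (<code>DISTANCE</code> as the row norm, <code>Frobenius_sqr</code> as the sum of the squared row norms). The symbols are introduced here for illustration and do not appear in the original post: A is the n-by-d input matrix (n = <code>&amp;nobs</code>, d = <code>&amp;dim</code>) and A_(i) is its i-th row.<br />

```latex
p_i = \frac{\lVert A_{(i)} \rVert_2^2}{\lVert A \rVert_F^2},
\qquad
\lVert A \rVert_F^2 = \sum_{i=1}^{n} \lVert A_{(i)} \rVert_2^2,
\qquad
\sum_{i=1}^{n} p_i = 1 .
```

Under these probabilities, rows with a larger norm are more likely to be drawn into the sample.<br />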
<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
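/* Randomized SVD with sampling schema 1. */
%let dim=2048;   /* number of columns (variables x1-x&dim) */
%let nobs=1e4;   /* number of rows in the test matrix */
%let s=256;      /* number of rows to sample */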
data matrix;
array _x{*} x1-x&dim;
do id=1 to &nobs;
do _j=1 to dim(_x); _x[_j]=sin(mod(id, _j))+rannor(id); end;
output;
drop _j;
end;
run;
%let datetime_start = %sysfunc(TIME()) ;
%let time=%sysfunc(datetime(), datetime.); %put &time;
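/* A single seed at the origin: with MAXITER=0 and MAXCLUSTERS=1, the   */
/* DISTANCE written by PROC FASTCLUS is intended to be each row's norm. */
data seed;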
array _x{*} x1-x&dim;
do _j=1 to dim(_x); _x[_j]=0; end;
output;
stop;
run;
proc fastclus data=matrix seed=seed out=norm(keep=ID DISTANCE)
maxiter=0 maxclusters=1 noprint replace=none;
var x1-x&dim;
run;
/* Sum the squared row norms to get the squared Frobenius norm. */
data normv/ view=normv;
set norm(keep=DISTANCE);
DISTANCE2=DISTANCE**2;
drop DISTANCE;
run;
proc means data=normv noprint;
var DISTANCE2;
output out=matrixnorm sum(DISTANCE2)=Frobenius_sqr;
run;
/* Sampling probability of each row: squared row norm / squared Frobenius norm. */
data prob;
set matrixnorm ;
retain Frobenius_sqr;
do until (eof);
set norm end=eof;
_rate_=DISTANCE**2/Frobenius_sqr;
keep ID _rate_;
output;
end;
run;
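/* Attach each row's sampling probability (one-to-one merge, no BY). */
data matrixv/view=matrixv;
merge matrix prob(keep=_rate_);
run;
/* Assumption: the step that creates MATRIXSAMP is missing from this    */
/* listing. Drawing &s rows with probability proportional to _rate_,    */
/* with replacement, is one way to supply it; it is not confirmed to be */
/* the author's original step.                                          */
proc surveyselect data=matrixv out=matrixsamp(drop=_rate_)
method=pps_wr sampsize=&s outhits noprint;
size _rate_;
run;
/* Transpose so that the &s sampled rows become columns col1-col&s. */
proc transpose data=matrixsamp out=matrixsamp;
var x1-x&dim;
run;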
/* NOINT COV: eigenvectors of the uncorrected covariance matrix (_TYPE_='USCORE'). */
proc princomp data=matrixsamp outstat=testv(where=(_type_ in ("USCORE")))
noint cov noprint;
var col1-col&s;
run;
/* Relabel the eigenvectors as TYPE=PARMS so PROC SCORE applies them as raw coefficients. */
data testV_t/view=testV_t;
retain _TYPE_ 'PARMS';
set testv(drop=_TYPE_);
run;
proc score data=matrixsamp score=testV_t type=parms
out=SW(keep=ID Prin:);
var col1-col&s;
run;
/* Zero seed for the Prin scores; the listing ends before the step that uses it. */
data seed;
array _s{*} prin1-prin&s;
do _j=1 to dim(_s); _s[_j]=0; end;
drop _j; output; stop;
run;
</code></pre><br />
Reference:<br />
[1] <strong><em>P. Drineas and M. W. Mahoney</em></strong>, "Randomized Algorithms for Matrices and Massive Data Sets", Proc. of the 32nd Annual Conference on Very Large Data Bases (VLDB), p. 1269, 2006.