"Entrywise" Norm calculation using PROC FASTCLUS

shiyiming · posted on 2010-10-22 13:39:49

From oloolo's blog on SasProgramming

In some data mining applications a matrix norm has to be calculated; see [1] for an example. A detailed explanation of matrix norms is available on Wikipedia <a href="http://en.wikipedia.org/wiki/Matrix_norm">here</a>.
 
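Instead of writing a custom routine in a DATA step, we can obtain the "entrywise" norm efficiently and accurately with PROC FASTCLUS. The trick is to supply a single cluster seed at the origin, so that the distance from each observation to that seed is the norm of the corresponding row.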
 
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; height: 658px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 98.32%;"><code>
data matrix;
 input X1-X5;
datalines;
1 2 4 5 6
7 8 9 0 1
2 3 4 5 6
3 4 5 6 7
7 8 9 0 2
2 4 6 8 0
;
run;

data seed;
 input X1-X5;
datalines;
0 0 0 0 0
;
run;

options nosource;
proc export data=matrix outfile='c:\matrix.csv' dbms=csv replace; run;
options source;

proc fastclus data=matrix seed=seed out=norm(keep=DISTANCE)
 maxiter=0 maxclusters=1 noprint ;
 var x1-x5;
run;

/*
In the output data set NORM, the variable DISTANCE is the Euclidean norm (2-norm) of each row, because the only seed is the origin. If the LEAST=P option is specified, the p-norm is calculated instead. In PROC FASTCLUS, you can specify p in the range [1, infinity].

What you have at this point is the vector norm of each row. Summing the squares of DISTANCE gives the squared Frobenius norm of the data matrix (take the square root to get the Frobenius norm itself), which is easily obtained through PROC MEANS on a data view:
*/
data normv/ view=normv;
 set norm(keep=DISTANCE);
 DISTANCE2=DISTANCE**2;
 drop DISTANCE;
run;
proc means data=normv noprint;
 var DISTANCE2;
 output out=matrixnorm sum(DISTANCE2)=Frobenius_sqr;
run;
</code></pre> 
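As a hand check of the step above, the relation between the row norms and the Frobenius norm can be written out for the 6 x 5 sample matrix. The symbols A (the data matrix) and a_i (its i-th row) are notation introduced here for illustration and do not appear in the original post. The total agrees with the norm value reported in the SAS log later in this post.

```latex
\[
\|A\|_F \;=\; \sqrt{\sum_{i=1}^{6}\sum_{j=1}^{5} a_{ij}^{2}}
        \;=\; \sqrt{\sum_{i=1}^{6} \|a_i\|_2^{2}}
\]
\[
\sum_{i=1}^{6} \|a_i\|_2^{2} \;=\; 82 + 195 + 90 + 135 + 198 + 120 \;=\; 820,
\qquad
\|A\|_F \;=\; \sqrt{820} \;\approx\; 28.6356
\]
```

Each term in the sum is the square of one DISTANCE value, so Frobenius_sqr should come out as 820 for this data.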
You can use the following R code to verify the results: 
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
mat <- read.csv('c:/matrix.csv', header=TRUE)
# verify the vector (row) norms against DISTANCE in NORM
vnorm <- apply(mat, 1, function(x){sqrt(sum(x^2))})
vnorm
# verify the Frobenius norm of the matrix: sqrt(trace(X'X))
x <- as.matrix(mat)
sqrt(sum(diag(t(x)%*%x)))
</code></pre> 
PS: 
1. Of course, the process above is designed for implementing the randomized SVD in [1]. If only the Frobenius norm of the matrix is of interest, you can also use the following code snippet: 
 
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; height: 380px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 98.32%;"><code>
data matrixv/view=matrixv;
 set matrix;
 array _x{*} x1-x5;
 array _y{*} y1-y5;
 do j=1 to dim(_x); _y[j]=_x[j]**2; end;
 keep y1-y5;
run;

proc means data=matrixv noprint;
 var y1-y5;
 output out=_var(drop=_TYPE_ _FREQ_) sum()=/autoname;
run;

data _null_;
 set _var;
 norm=sqrt(sum(of _numeric_));
 put norm=;
run;
/* --LOG WRITES:
norm=28.635642127
NOTE: There were 1 observations read from the data set WORK._VAR.
*/

</code></pre> 
2. With its built-in engine for computing Euclidean distance, PROC FASTCLUS is also a powerful tool for finding the data point in a main table that is closest to a record in a lookup table. This technique is shown <a href="http://www.sas-programming.com/2009/09/tweak-proc-fastclus-for-1-nearest.html">here</a> and in [2]. 
 
 
References: 
[1] P. Drineas and M. W. Mahoney, "Randomized Algorithms for Matrices and Massive Data Sets", Proc. of the 32nd Annual Conference on Very Large Data Bases (VLDB), p. 1269, 2006. 
 
[2] Dorfman, Paul M.; Vyverman, Koen; Dorfman, Victor P., "Black Belt Hashigana", Proc. of the 2010 SAS Global Forum, Seattle, WA, 2010.
