SAS中文论坛 (SAS Chinese Forum)

"Entrywise" Norm calculation using PROC FASTCLUS

Posted by the original poster on 2010-10-22 13:39:49

From oloolo's blog on SasProgramming


In some data mining applications a matrix norm has to be calculated; see [1] for an instance. A detailed explanation of matrix norms is available on Wikipedia <a href="http://en.wikipedia.org/wiki/Matrix_norm">here</a>.<br />
<br />
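Instead of a user-written routine in a DATA step, the "entrywise" norm can be obtained efficiently and accurately with PROC FASTCLUS: with a single all-zero seed and no iterations, the distance from each row to the seed is that row's vector norm.<br />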
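For reference, the entrywise p-norm treats an m-by-n matrix as one long vector. The notation below (A, a_ij, and a_i for the i-th row) is introduced here for illustration and does not appear in the original post. With p = 2 this is the Frobenius norm, which is why the per-row Euclidean norms are enough to recover it:<br />

```latex
\[
\|A\|_{p} = \Bigl(\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^{p}\Bigr)^{1/p},
\qquad
\|A\|_{F} = \Bigl(\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^{2}\Bigr)^{1/2}
          = \sqrt{\sum_{i=1}^{m} \|a_{i}\|_{2}^{2}}
\]
```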
<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; height: 658px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 98.32%;"><code>
data matrix;
     input X1-X5;
datalines;
1 2 4 5 6
7 8 9 0 1
2 3 4 5 6
3 4 5 6 7
7 8 9 0 2
2 4 6 8 0
;
run;

data seed;
     input X1-X5;
datalines;
0 0 0 0 0
;
run;

options nosource;
proc export data=matrix  outfile='c:\matrix.csv'  dbms=csv replace; run;
options source;

proc fastclus data=matrix  seed=seed      out=norm(keep=DISTANCE)
              maxiter=0    maxclusters=1  noprint  ;
     var x1-x5;
run;

/*
In the output data set NORM, the variable DISTANCE holds the Euclidean (2-) norm of each row, because the only seed is the zero vector and MAXITER=0 keeps it from being recomputed. If the LEAST=P option is specified, then the p-norm is calculated instead. In PROC FASTCLUS, you can specify p in the range of [1, \inf].

What you have now is the vector norm of each row. Summing the squares of DISTANCE gives the squared Frobenius norm of the data matrix (take the square root to get the norm itself), which can be easily obtained through PROC MEANS on a data view:
*/
data normv/ view=normv;
     set norm(keep=DISTANCE);
     DISTANCE2=DISTANCE**2;
     drop DISTANCE;
run;
proc means data=normv noprint;
     var DISTANCE2;
     output  out=matrixnorm  sum(DISTANCE2)=Frobenius_sqr;
run;
</code></pre><br />
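The PROC MEANS step stores the sum of the squared row norms in Frobenius_sqr, so one more step is needed to get the norm itself. A minimal sketch, assuming the MATRIXNORM data set created above; the variable name Frobenius is made up for illustration:<br />

```sas
/* Assumes WORK.MATRIXNORM from the PROC MEANS step above. */
data _null_;
     set matrixnorm;
     /* Frobenius_sqr is the sum of squared row norms,
        that is, the squared Frobenius norm */
     Frobenius = sqrt(Frobenius_sqr);
     put Frobenius_sqr= Frobenius=;
run;
```

For the sample matrix the squared entries sum to 820, so Frobenius should come out close to 28.6356.<br />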
You can use the following R code to verify the results:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
mat &lt;- read.csv('c:/matrix.csv', header=T)
# verify the vector (row) norms; compare with DISTANCE in NORM
vnorm &lt;- apply(mat, 1, function(x) sqrt(sum(x^2)))
print(vnorm)
# verify the Frobenius norm of the matrix: sqrt(trace(X'X))
x &lt;- as.matrix(mat)
sqrt(sum(diag(t(x) %*% x)))
</code></pre><br />
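The same row-norm check can probably be done without leaving SAS. The sketch below assumes a release that provides the EUCLID function (9.2 or later, as far as I know) and that the NORM data set from the PROC FASTCLUS step still exists; the data set name CHECK is hypothetical:<br />

```sas
/* Row norms computed directly: EUCLID returns the square root
   of the sum of squares of its nonmissing arguments. */
data check;
     set matrix;
     DISTANCE = euclid(of x1-x5);
     keep DISTANCE;
run;

/* Compare with the FASTCLUS output, tolerating tiny
   floating-point differences. */
proc compare base=norm  compare=check
             method=absolute  criterion=1e-10;
run;
```

PROC COMPARE may still report attribute differences (for example a label on DISTANCE); the point of interest is whether any values are flagged as unequal.<br />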
PS: <br />
1. The process above is designed for implementing the randomized SVD in [1]. If only the Frobenius norm of the matrix is of interest, you can also use the following code snippet:<br />
<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; height: 380px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 98.32%;"><code>
data matrixv/view=matrixv;
     set matrix;
     array _x{*}  x1-x5;
     array _y{*}  y1-y5;
     do j=1 to dim(_x);  _y[j]=_x[j]**2; end;
     keep y1-y5;
run;

proc means data=matrixv  noprint;
     var y1-y5;
     output  out=_var(drop=_TYPE_  _FREQ_)   sum()=/autoname;
run;

data _null_;
     set _var;  
     norm=sqrt(sum(of _numeric_));
     put norm=;
run;
/* --LOG WRITES:
norm=28.635642127
NOTE: There were 1 observations read from the data set WORK._VAR.
*/

</code></pre><br />
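If a single pass is preferred, the view and the PROC MEANS step can likely be folded into one DATA step. This is a swapped-in variant and not the author's code: it relies on the USS function (uncorrected sum of squares) and a sum statement, and the accumulator name ss is made up:<br />

```sas
data _null_;
     set matrix end=last;
     /* USS returns the sum of squares of its arguments;
        the sum statement retains ss across observations */
     ss + uss(of x1-x5);
     if last then do;
          norm = sqrt(ss);
          put norm=;
     end;
run;
```

The value written to the log should agree with the one shown in the log excerpt above.<br />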
2. With its built-in computing engine for Euclidean distance, PROC FASTCLUS is also a powerful tool for finding the data point in a main table that is closest to a record in a lookup table. This technique is shown <a href="http://www.sas-programming.com/2009/09/tweak-proc-fastclus-for-1-nearest.html">here</a> and in [2].<br />
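A rough sketch of that lookup idea, not taken from the linked post: every row of the main table is passed as a fixed seed, and each lookup record is assigned to its nearest seed. The LOOKUP data and the output name NEAREST are hypothetical, MATRIX from above stands in for the main table, and the sketch assumes all main-table rows are distinct and that cluster numbers follow the order of the seed rows:<br />

```sas
/* Hypothetical lookup table with the same variables as MATRIX. */
data lookup;
     input X1-X5;
datalines;
7 8 9 0 2.4
1 2 4 5 5
;
run;

/* Each of the 6 rows of MATRIX acts as a seed. MAXITER=0 assigns
   lookup records to the nearest seed without recomputing seeds;
   REPLACE=NONE keeps lookup records from replacing seeds. */
proc fastclus data=lookup  seed=matrix  out=nearest
              maxclusters=6  maxiter=0  replace=none  noprint;
     var x1-x5;
run;

proc print data=nearest;
     var x1-x5 CLUSTER DISTANCE;
run;
```

By hand, the first lookup record is closest to row 5 of MATRIX and the second to row 1, so under the numbering assumption CLUSTER should point at those rows, with DISTANCE giving the Euclidean distance to the match.<br />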
<br />
<br />
<strong><em>Reference:</em></strong><br />
[1], <strong>P. Drineas and M. W. Mahoney</strong>, "<em>Randomized Algorithms for Matrices and Massive Data Sets</em>", Proc. of the 32nd Annual Conference on Very Large Data Bases (VLDB), p. 1269, 2006.<br />
<br />
[2], <strong>Dorfman, Paul M.; Vyverman, Koen; Dorfman, Victor P.,</strong> "<em>Black Belt Hashigana</em>", Proc. of the 2010 SAS Global Forum, Seattle, WA, 2010