An Economic Approach for a Class of Dimensionality Reduction

shiyiming · 发表于 2010-10-22 13:39:51

From oloolo's blog on SasProgramming

<a href="http://feedads.g.doubleclick.net/~a/aA_miOlbV0dESAgnH2eH1fWgcvo/0/da"><img src="http://feedads.g.doubleclick.net/~a/aA_miOlbV0dESAgnH2eH1fWgcvo/0/di" border="0" ismap="true"></img></a> 
<a href="http://feedads.g.doubleclick.net/~a/aA_miOlbV0dESAgnH2eH1fWgcvo/1/da"><img src="http://feedads.g.doubleclick.net/~a/aA_miOlbV0dESAgnH2eH1fWgcvo/1/di" border="0" ismap="true"></img></a>Just back from KDD2010. In the conference, there are several papers that interested me. 
 
On the computation side, Liang Sun et al.'s paper [1], "A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques" caught my eyes. Liang proves that a class of dimension reduction techniques, such as CCA, OPLS, LDA, etc, that relies on general eigenvalue decomposition, can be computed in a much cheaper way by decomposing the original computation into a least square problem and a much smaller scale eigenvalue decomposition problem. The equivalence of their two stage approach and direct eigenvalue decomposition is rigourously proved. 
 
This technique is of particular interest to ppl like me that only have limited computing resources and I believe it would be good to implement their algorithm in SAS. For example, a Canonical Discriminant Analysis with above idea is demonstrated below. Note also that by specifing RIDGE= option in PROC REG, the regularized version can be implemented as well, besides, PROC REG is multi-threaded in SAS. Of course, the computing advantage is only appreciatable when the number of features is very large. 
 
The canonical analysis result from reduced version PROC CANDISC is the same as the full version. 
 
In fact, this exercise is the answer for Exercise 4.3 of The Elements of Statistical Learning [2] 
 
[1]. Liang Sun, Betul Ceran, Jieping Ye, "<a href="http://www.public.asu.edu/%7Elsun27/Publications/KDD_2010.pdf">A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques</a>", KDD2010, Washington DC. 
 
[2]. Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", 2nd Edition. 
 
<a href="http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576?ie=UTF8&tag=xie1978&link_code=bil&camp=213689&creative=392969" imageanchor="1" target="_blank"><img alt="The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)" src="http://ws.amazon.com/widgets/q?MarketPlace=US&ServiceVersion=20070822&ID=AsinImage&WS=1&Format=_SL160_&ASIN=0387848576&tag=xie1978" /></a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=xie1978&l=bil&camp=213689&creative=392969&o=1&a=0387848576" style="border: medium none ! important; margin: 0px ! important; padding: 0px ! important;" width="1" /> 
 
<pre style="background-color: #ebebeb; border: 1px dashed rgb(153, 153, 153); color: #000001; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>

proc format;
 value specname
 1='Setosa '
 2='Versicolor'
 3='Virginica ';
run;

data iris;
 title 'Fisher (1936) Iris Data';
 input SepalLength SepalWidth PetalLength PetalWidth
 Species @@;
 format Species specname.;
 label SepalLength='Sepal Length in mm.'
 SepalWidth ='Sepal Width in mm.'
 PetalLength='Petal Length in mm.'
 PetalWidth ='Petal Width in mm.';
 symbol = put(Species, specname10.);
 datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;
proc candisc data=iris out=outcan distance anova;
 class Species;
 var SepalLength SepalWidth PetalLength PetalWidth;
run;

 ods select none;
 proc glmmod data=iris outdesign=H(keep=COL:);
 class Species;
 model SepalLength=Species/noint;
 run;

 data H;
 merge H iris;
 run;

/**************************
for efficiency consideration, a view can also be used:
data H/view=H;
 set iris;
 array _S{*} Col1-Col3 (3*0);
 do j=1 to dim(_S); _S[j]=0; end;
 _S[Species]=1;
 drop j;
run;
****************************/
 proc reg data=H outest=beta;
 model Col1-Col3 = SepalLength SepalWidth PetalLength PetalWidth;
output out=P p=yhat1-yhat3;
 run;quit;
 ods select all;

 proc candisc data=P;
 class Species;
var yhat1-yhat3;
 run;

</code></pre><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/29815492-1890102299009695989?l=www.sas-programming.com' alt='' /></div><img src="http://feeds.feedburner.com/~r/SasProgramming/~4/z9cIg-wjxLc" height="1" width="1"/>

		自动登录	找回密码
密码			立即注册