SAS中文论坛

Subject: K-Nearest Neighbor in SAS

Author: shiyiming    Time: 2010-10-22 13:39
Subject: K-Nearest Neighbor in SAS
From oloolo's blog on SasProgramming


K-Nearest-Neighbor, aka KNN, is a widely used data mining tool and is often called a memory-based/case-based/instance-based method, as no model is fit. A good introduction to KNN can be found in [1] or on <a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">Wikipedia</a>. <br />
<br />
Typically, a KNN implementation relies on a sophisticated data structure called a kd-tree [2] to quickly find the closest points to a given point in high-dimensional space. While good pseudo-code for kd-tree construction and KNN search is available everywhere online, for example [3], it is not trivial to implement an efficient version yourself in SAS. For most analysts, the first idea is to turn to the <b>$200K-Initial-Fee-AND-$60K-Per-Year</b> Enterprise Miner(R) for this method; however, it turns out that PROC DISCRIM in SAS/STAT can do the same thing! Note that the annual license fee for SAS/STAT is a tiny fraction of that of EM, and most analysts have ready access to SAS/STAT.<br />
<br />
Simply ask PROC DISCRIM to use the nonparametric method via the options "METHOD=NPAR K=". Do not use the "R=" option at the same time; it corresponds to the radius-based nearest-neighbor method instead. Also pay attention to how PROC DISCRIM treats categorical data automatically. Sometimes you may want to convert categorical data into metric coordinates in advance.<br />
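As a sketch of that last point, a categorical predictor can be recoded into 0/1 dummy coordinates in a DATA step before handing the data to PROC DISCRIM, so that KNN distances treat it as a metric dimension. The data set and the <code>region</code> variable below are hypothetical, not from the original post:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
/* Sketch: recode a hypothetical categorical variable REGION into   */
/* 0/1 dummy coordinates so distances treat it as metric space      */
data train_metric;
   set train;                        /* assumed input data set      */
   region_east  = (region = 'EAST');
   region_west  = (region = 'WEST');
   region_north = (region = 'NORTH');
   /* the remaining level, SOUTH, is the reference (all dummies 0) */
run;
</code></pre><br />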
<br />
Because KNN is a memory-based method, when you score test data or new data in production, you have to supply the raw training table in the scoring process. Test different values of K using cross-validation to select the best one (you will need a macro loop for that).<br />
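The macro loop mentioned above might look like the following sketch: it runs PROC DISCRIM with the CROSSVALIDATE option for each candidate K and captures the leave-one-out error rates via ODS. The ODS table name ErrorCrossVal and the data/variable names are assumptions; verify the table name with ODS TRACE on your installation:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
%macro tune_k(data=, klist=1 3 5 7 9 11);
   %local i k;
   %let i = 1;
   %do %while(%scan(&amp;klist, &amp;i) ne %str());
      %let k = %scan(&amp;klist, &amp;i);
      ods select none;
      /* ErrorCrossVal is assumed to be the ODS table holding      */
      /* the cross-validation error rates; check with ODS TRACE    */
      ods output ErrorCrossVal=cv_err_&amp;k;
      proc discrim data=&amp;data method=NPAR k=&amp;k crossvalidate;
         class Species;
         var SepalLength SepalWidth PetalLength PetalWidth;
      run;
      ods select all;
      %let i = %eval(&amp;i + 1);
   %end;
%mend tune_k;

%tune_k(data=iris)
</code></pre><br />
Comparing the error rates collected in the cv_err_* data sets then points to the best K.<br />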
<br />
PROC DISCRIM uses memory roughly proportional to the square of the number of variables, and its time usage, excluding I/O time, is roughly proportional to log(N)*(N*P), where N is the number of observations and P is the number of variables, as it uses the tree-search algorithm in [4]. <br />
<br />
Therefore, time consumption is still a concern on large data sets. As an experiment, I ran KNN on data sets ranging from 5,000 to 50,000 records, scoring the same number of records each time, with 20 features and k=11. The results:<br />
<br />
<pre>
  obs      time (sec)    memory (KB)
 5000          1.03           1068
10000          3.90           1949
15000          8.65           2830
20000         15.48           3709
25000         34.58           4587
30000         70.26           5472
35000        109.83           6350
40000        161.71           7229
45000        208.80           8108
50000        263.58           8987
</pre>
<br />
Empirical time usage grows roughly quadratically in the number of 5,000-observation blocks, which means that to work on a data set with 300K observations and score the same number of records, SAS would need about 3.75 hours! A large value of K, though, only increases the time used by a very small fraction.<br />
Sample code using Iris data from SAS Online Doc for v9.1.3:<br />
<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
ods select none;                       /* suppress printed output      */
proc surveyselect data=iris  out=iris2
                  samprate=0.5  method=srs  outall;
                  /* 50/50 simple random split; OUTALL keeps all rows  */
                  /* and flags training rows with SELECTED=1           */
run;
ods select all;

%let k=5;                              /* number of nearest neighbors  */
proc discrim data=iris2(where=(selected=1))    /* training half        */
             test=iris2(where=(selected=0))    /* hold-out half        */
             testout=iris2testout
             method=NPAR k=&amp;k                  /* nonparametric KNN    */
             listerr crosslisterr;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
      title2 'Using KNN on Iris Data';
run;

/* confusion matrix: true Species vs predicted class (_INTO_) */
proc freq data=iris2testout;
     table Species*_INTO_;
run;
</code></pre><br />
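Because no model object is produced, scoring genuinely new records later repeats the same pattern: the raw training table goes into DATA= and the new records go into TEST=. In the sketch below, <code>newdata</code> is a hypothetical data set containing the same four measurement variables but no Species value:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
/* Sketch: score new, unlabeled records against the raw training   */
/* table; NEWDATA is a hypothetical data set with the same vars    */
proc discrim data=iris2(where=(selected=1))
             test=newdata  testout=newdata_scored
             method=NPAR k=&amp;k;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
run;
/* predicted class appears in _INTO_ in NEWDATA_SCORED */
</code></pre><br />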
References:<br />
[1]. <b>Trevor Hastie, Robert Tibshirani, Jerome Friedman</b>, "The Elements of Statistical Learning", 2nd ed., Springer, 2008;<br />
[2]. <b>J. L. Bentley</b>, "Multidimensional binary search trees used for associative searching", Communications of the ACM, 18(9):509-517, 1975;<br />
[3]. <a href="http://simsearch.yury.name/tutorial.html">http://simsearch.yury.name/tutorial.html</a>;<br />
[4]. <b>Friedman, J.H., Bentley, J.L., and Finkel, R.A.</b>, "An Algorithm for Finding Best Matches in Logarithmic Expected Time", ACM Transactions on Mathematical Software, 3, 209-226, 1977.<br />
<br />
<a href="http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576?ie=UTF8&amp;tag=xie1978&amp;link_code=bil&amp;camp=213689&amp;creative=392969" imageanchor="1" target="_blank"><img alt="The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)" src="http://ws.amazon.com/widgets/q?MarketPlace=US&amp;ServiceVersion=20070822&amp;ID=AsinImage&amp;WS=1&amp;Format=_SL160_&amp;ASIN=0387848576&amp;tag=xie1978" /></a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=xie1978&amp;l=bil&amp;camp=213689&amp;creative=392969&amp;o=1&amp;a=0387848576" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px! important; padding-left: 0px! important; padding-right: 0px! important; padding-top: 0px! important;" width="1" /><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/29815492-775463015665104754?l=www.sas-programming.com' alt='' /></div><img src="http://feeds.feedburner.com/~r/SasProgramming/~4/D7Ib6XRCVGA" height="1" width="1"/>




Welcome to SAS中文论坛 (https://mysas.net/forum/) Powered by Discuz! X3.2