Title: K-Nearest Neighbor in SAS
Author: shiyiming  Date: 2010-10-22 13:39
From oloolo's blog on SasProgramming

K-Nearest-Neighbor, aka KNN, is a widely used data mining tool, often called a memory-based/case-based/instance-based method because no model is fit. A good introduction to KNN can be found in [1] or at Wikipedia (http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm).

Typically, the KNN algorithm relies on a sophisticated data structure called the kd-tree [2] to quickly find the closest points to a given point in high-dimensional space. While good pseudo-code for kd-tree construction and KNN search can be found everywhere online, for example [3], it is not trivial to implement your own (I mean an efficient one) in SAS. For most analysts, the first idea is to turn to the $200K-initial-fee-and-$60K-per-year Enterprise Miner(R) for this method; however, it turns out that PROC DISCRIM in SAS/STAT can do the same thing! Note that the annual license fee for SAS/STAT is a tiny fraction of EM's, and most analysts already have access to SAS/STAT.

Simply ask PROC DISCRIM to use the nonparametric method via the option "METHOD=NPAR K=". Do not use the "R=" option at the same time; it corresponds to the radius-based nearest-neighbor method instead. Also pay attention to how PROC DISCRIM treats categorical data automatically; sometimes you may want to convert categorical variables into metric coordinates in advance, as sketched below.
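
For illustration, here is a minimal sketch of one way to do that conversion: dummy (0/1) coding, so a categorical variable enters the Euclidean distance as metric coordinates. The data set train and the variable region with levels EAST/WEST/NORTH/SOUTH are hypothetical stand-ins, not from the original post.

<pre><code>
/* Minimal sketch (hypothetical data): dummy-code a character      */
/* variable region so it can go on the VAR statement of PROC       */
/* DISCRIM as numeric coordinates. 'SOUTH' is the reference level. */
data train2;
   set train;
   region_east  = (region = 'EAST');
   region_west  = (region = 'WEST');
   region_north = (region = 'NORTH');
run;
</code></pre>

If you do this, consider rescaling the 0/1 columns so their contribution to the distance is balanced against the continuous variables.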

Because KNN is a memory-based method, the raw training table must be available whenever you score test data or new data in production. Test different values of K using cross-validation to select the best one; you will need a macro loop for that, along the lines of the sketch below.
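
Here is a minimal sketch of such a loop, assuming the iris2 training split and variable list from the example further down. The ODS table name ErrorCrossVal (the cross-validation error-count estimates) should be verified against your SAS release.

<pre><code>
/* Minimal sketch: loop over odd K, capture the cross-validation   */
/* error-count estimates from each PROC DISCRIM run, and stack     */
/* them so the K with the lowest error can be picked.              */
%macro knn_cv(kmax=15);
   %local k;
   /* start fresh (a warning on the first run is harmless) */
   proc datasets lib=work nolist; delete knn_cv_errors; run; quit;
   %do k=1 %to &kmax %by 2;
      ods select none;
      ods output ErrorCrossVal=_err;
      proc discrim data=iris2(where=(selected=1))
                   method=npar k=&k crossvalidate;
         class Species;
         var SepalLength SepalWidth PetalLength PetalWidth;
      run;
      ods select all;
      data _err; set _err; k=&k; run;     /* tag rows with this K */
      proc append base=knn_cv_errors data=_err force; run;
   %end;
%mend knn_cv;

%knn_cv(kmax=15);
</code></pre>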

PROC DISCRIM uses memory roughly quadratic in the number of variables, and its time usage, excluding I/O time, is roughly proportional to N*P*log(N), where N is the number of observations and P is the number of variables, as it uses the tree-search algorithm in [4].

Therefore, time consumption is still a concern on large data sets. As an experiment, I ran KNN on data sets ranging from 5,000 to 50,000 records, scoring the same number of records each time, using 20 features and k=11 (a sketch of the benchmark harness follows the table). We observe:

    obs    time (sec)    memory (KB)
   5000         1.03          1068
  10000         3.90          1949
  15000         8.65          2830
  20000        15.48          3709
  25000        34.58          4587
  30000        70.26          5472
  35000       109.83          6350
  40000       161.71          7229
  45000       208.80          8108
  50000       263.58          8987
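
A benchmark along these lines can be set up as below. This is my own assumed harness, not the author's original code: the simulated table, the 20 standard-normal features, and the two-level class are stand-ins for the unspecified experimental data. Elapsed time and memory can be read from the log with OPTIONS FULLSTIMER.

<pre><code>
/* Assumed benchmark sketch: N training rows and N scoring rows,   */
/* 20 features, k=11. Read time/memory from the log (FULLSTIMER).  */
options fullstimer;

%macro knn_bench(n=5000);
   data _train _score;
      call streaminit(20101022);
      array x[20];
      do i = 1 to 2*&n;
         do j = 1 to 20; x[j] = rand('normal'); end;
         y = (rand('uniform') > 0.5);   /* arbitrary 2-level class */
         if i <= &n then output _train;
         else output _score;
      end;
      drop i j;
   run;

   proc discrim data=_train test=_score testout=_scored
                method=npar k=11 noprint;
      class y;
      var x1-x20;
   run;
%mend knn_bench;

%knn_bench(n=50000);
</code></pre>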

Empirical time usage grows roughly quadratically with the number of observations (counted in multiples of 5,000), which means that to train on a data set with 300K observations and score the same number of records, SAS would need about 3.75 hours! A large K value only increases the time used by a very small fraction, though.

Sample code using the Iris data from the SAS OnlineDoc for v9.1.3:
<pre><code>
ods select none;

/* Split the Iris data 50/50 into training (Selected=1) and test halves */
proc surveyselect data=iris out=iris2
                  samprate=0.5 method=srs outall;
run;

ods select all;

%let k=5;

/* METHOD=NPAR with K= requests k-nearest-neighbor classification */
proc discrim data=iris2(where=(selected=1))
             test=iris2(where=(selected=0))
             testout=iris2testout
             method=npar k=&k
             listerr crosslisterr;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
   title2 'Using KNN on Iris Data';
run;

/* Confusion matrix: actual Species vs. predicted (_INTO_) */
proc freq data=iris2testout;
   tables Species*_INTO_;
run;
</code></pre>
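
For production scoring, the same call works with unlabeled data: keep the raw training half of iris2 on hand (KNN stores no model) and pass the new records through TEST=. The table new_data below is a hypothetical table with the four predictor columns.

<pre><code>
/* Hedged sketch: score unlabeled records. new_data is hypothetical; */
/* _INTO_ in the output holds the predicted Species.                 */
proc discrim data=iris2(where=(selected=1))
             test=new_data
             testout=new_scored
             method=npar k=&k noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
</code></pre>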

References:
[1] Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", 2nd ed., Springer, 2008.
[2] J. L. Bentley, "Multidimensional binary search trees used for associative searching", Communications of the ACM, 18(9):509-517, 1975.
[3] http://simsearch.yury.name/tutorial.html
[4] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time", ACM Transactions on Mathematical Software, 3, 209-226, 1977.