SAS中文论坛

Subject: K-Nearest Neighbor in SAS

Author: shiyiming    Time: 2010-10-22 13:39
Subject: K-Nearest Neighbor in SAS
From oloolo's blog on SasProgramming


K-Nearest-Neighbor, aka KNN, is a widely used data mining tool and is often called a memory-based/case-based/instance-based method, as no model is fit. A good introduction to KNN can be found in [1] or on <a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">Wikipedia</a>. <br />
<br />
Typically, a KNN implementation relies on a sophisticated data structure called a kd-tree [2] to quickly find the closest points to a given point in high-dimensional space. While good pseudo-code for kd-tree construction and KNN search is available everywhere online, for example [3], it is not trivial to implement an efficient version yourself in SAS. For most analysts, the first idea is to turn to the <b>$200K-Initial-Fee-AND-$60K-Per-Year</b> Enterprise Miner(R) for this method; however, it turns out that PROC DISCRIM in SAS/STAT can do the same thing! Note that the annual license fee for SAS/STAT is a tiny fraction of that of EM, and most analysts have ready access to SAS/STAT.<br />
<br />
Simply ask PROC DISCRIM to use the nonparametric method via the options "METHOD=NPAR K=". Do not use the "R=" option at the same time; it corresponds to the radius-based nearest-neighbor method instead. Also pay attention to how PROC DISCRIM treats categorical data automatically. Sometimes you may want to convert categorical data into metric coordinates in advance.<br />
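As a sketch of that last point, a categorical predictor can be recoded into 0/1 dummy coordinates in a DATA step before handing the data to PROC DISCRIM, so that KNN distances treat it as a metric dimension. The data set and the <code>region</code> variable below are hypothetical, not from the original post:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
/* Sketch: recode a hypothetical categorical variable REGION into   */
/* 0/1 dummy coordinates so distances treat it as metric space      */
data train_metric;
   set train;                        /* assumed input data set      */
   region_east  = (region = 'EAST');
   region_west  = (region = 'WEST');
   region_north = (region = 'NORTH');
   /* the remaining level, SOUTH, is the reference (all dummies 0) */
run;
</code></pre><br />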
<br />
Because KNN is a memory-based method, when you score test data or new data in production, you have to supply the raw training table in the scoring process. Test different values of K using cross-validation to select the best one (you will need a macro loop for that).<br />
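The macro loop mentioned above might look like the following sketch: it runs PROC DISCRIM with the CROSSVALIDATE option for each candidate K and captures the leave-one-out error rates via ODS. The ODS table name ErrorCrossVal and the data/variable names are assumptions; verify the table name with ODS TRACE on your installation:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
%macro tune_k(data=, klist=1 3 5 7 9 11);
   %local i k;
   %let i = 1;
   %do %while(%scan(&amp;klist, &amp;i) ne %str());
      %let k = %scan(&amp;klist, &amp;i);
      ods select none;
      /* ErrorCrossVal is assumed to be the ODS table holding      */
      /* the cross-validation error rates; check with ODS TRACE    */
      ods output ErrorCrossVal=cv_err_&amp;k;
      proc discrim data=&amp;data method=NPAR k=&amp;k crossvalidate;
         class Species;
         var SepalLength SepalWidth PetalLength PetalWidth;
      run;
      ods select all;
      %let i = %eval(&amp;i + 1);
   %end;
%mend tune_k;

%tune_k(data=iris)
</code></pre><br />
Comparing the error rates collected in the cv_err_* data sets then points to the best K.<br />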
<br />
PROC DISCRIM uses memory roughly proportional to the square of the number of variables, and its time usage, excluding I/O time, is roughly proportional to log(N)*(N*P), where N is the number of observations and P is the number of variables, as it uses the tree-search algorithm in [4]. <br />
<br />
Therefore, time consumption is still a concern on large data sets. As an experiment, I ran KNN on data sets ranging from 5,000 to 50,000 records, scoring the same number of records each time, with 20 features and k=11. The results:<br />
<br />
<pre>
  obs      time (sec)    memory (KB)
 5000          1.03           1068
10000          3.90           1949
15000          8.65           2830
20000         15.48           3709
25000         34.58           4587
30000         70.26           5472
35000        109.83           6350
40000        161.71           7229
45000        208.80           8108
50000        263.58           8987
</pre>
<br />
Empirical time usage grows roughly quadratically in the number of 5,000-observation blocks, which means that to work on a data set with 300K observations and score the same number of records, SAS would need about 3.75 hours! A large value of K, though, only increases the time used by a very small fraction.<br />
Sample code using Iris data from SAS Online Doc for v9.1.3:<br />
<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
ods select none;                       /* suppress printed output      */
proc surveyselect data=iris  out=iris2
                  samprate=0.5  method=srs  outall;
                  /* 50/50 simple random split; OUTALL keeps all rows  */
                  /* and flags training rows with SELECTED=1           */
run;
ods select all;

%let k=5;                              /* number of nearest neighbors  */
proc discrim data=iris2(where=(selected=1))    /* training half        */
             test=iris2(where=(selected=0))    /* hold-out half        */
             testout=iris2testout
             method=NPAR k=&amp;k                  /* nonparametric KNN    */
             listerr crosslisterr;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
      title2 'Using KNN on Iris Data';
run;

/* confusion matrix: true Species vs predicted class (_INTO_) */
proc freq data=iris2testout;
     table Species*_INTO_;
run;
</code></pre><br />
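Because no model object is produced, scoring genuinely new records later repeats the same pattern: the raw training table goes into DATA= and the new records go into TEST=. In the sketch below, <code>newdata</code> is a hypothetical data set containing the same four measurement variables but no Species value:<br />
<pre style="background-color: #ebebeb; border-bottom: #999999 1px dashed; border-left: #999999 1px dashed; border-right: #999999 1px dashed; border-top: #999999 1px dashed; color: #000001; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 100%;"><code>
/* Sketch: score new, unlabeled records against the raw training   */
/* table; NEWDATA is a hypothetical data set with the same vars    */
proc discrim data=iris2(where=(selected=1))
             test=newdata  testout=newdata_scored
             method=NPAR k=&amp;k;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
run;
/* predicted class appears in _INTO_ in NEWDATA_SCORED */
</code></pre><br />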
References:<br />
[1]. <b>Trevor Hastie, Robert Tibshirani, Jerome Friedman</b>, "The Elements of Statistical Learning", 2nd ed., Springer, 2008;<br />
[2]. <b>J. L. Bentley</b>, "Multidimensional binary search trees used for associative searching", Communications of the ACM, 18(9):509-517, 1975;<br />
[3]. <a href="http://simsearch.yury.name/tutorial.html">http://simsearch.yury.name/tutorial.html</a>;<br />
[4]. <b>Friedman, J.H., Bentley, J.L., and Finkel, R.A.</b>, "An Algorithm for Finding Best Matches in Logarithmic Expected Time", ACM Transactions on Mathematical Software, 3, 209-226, 1977.<br />
<br />
<a href="http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576?ie=UTF8&amp;tag=xie1978&amp;link_code=bil&amp;camp=213689&amp;creative=392969" imageanchor="1" target="_blank"><img alt="The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)" src="http://ws.amazon.com/widgets/q?MarketPlace=US&amp;ServiceVersion=20070822&amp;ID=AsinImage&amp;WS=1&amp;Format=_SL160_&amp;ASIN=0387848576&amp;tag=xie1978" /></a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=xie1978&amp;l=bil&amp;camp=213689&amp;creative=392969&amp;o=1&amp;a=0387848576" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px! important; padding-left: 0px! important; padding-right: 0px! important; padding-top: 0px! important;" width="1" /><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/29815492-775463015665104754?l=www.sas-programming.com' alt='' /></div><img src="http://feeds.feedburner.com/~r/SasProgramming/~4/D7Ib6XRCVGA" height="1" width="1"/>




Welcome to SAS中文论坛 (https://mysas.net/forum/) Powered by Discuz! X3.2