Multicollinearity and the solutions

Posted 2012-3-22 21:55:15

From Dapangmao's blog on sas-analysis

<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/--Zz3fr5Fjg8/T2lIOAYHVcI/AAAAAAAAA_o/sd5nmXvLXFA/s1600/SGPlot1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://4.bp.blogspot.com/--Zz3fr5Fjg8/T2lIOAYHVcI/AAAAAAAAA_o/sd5nmXvLXFA/s400/SGPlot1.png" width="400" /></a></div><br />
In his book (http://www.amazon.com/Regression-Methods-Statistics-textbooks-monographs/dp/0824766474), Rudolf Freund described a confounding phenomenon that arises when fitting a linear regression. In the small data set below there are three variables: a dependent variable (y) and two independent variables (x1 and x2). Using x2 alone to fit y, the estimated coefficient of x2 is positive, 0.78. Using x1 and x2 together to fit y, the coefficient of x2 becomes -1.29, which is hard to explain, since x2 and y clearly have a positive correlation.
data raw;
input y x1 x2;
cards;
2 0 2
3 2 6
2 2 7
7 2 5
6 4 9
8 4 8
10 4 7
7 6 10
8 6 11
12 6 9
11 8 15
14 8 13
;
run;

ods graphics on / border = off;
proc sgplot data = raw;
   reg x = x2 y = y;
   reg x = x2 y = y / group = x1 datalabel = x1;
run;
<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-JLtalNH5b40/T2lK8TwzAKI/AAAAAAAAA_w/usCX3SyF9s4/s1600/Presentation1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="http://3.bp.blogspot.com/-JLtalNH5b40/T2lK8TwzAKI/AAAAAAAAA_w/usCX3SyF9s4/s640/Presentation1.png" width="640" /></a></div>The reason is that x1 and x2 have strong correlation each other. Diagnostics are well when using x2 to fit y. However, counting x1 and x2 together into the regression model causes multicollinearity, and therefore demonstrates severe heteroskedasticity and a skewed distribution of the residuals, which violates the <a href="http://www.sasanalysis.com/2011/07/10-minute-tutorial-for-linear.html">assumptions</a> for OLS regressions. Shown in the top scatter plot, 0.78 is the slope of the regression line by y ~ x2 (the longest straight line), while -1.29 is&nbsp;actually&nbsp;the slope of the partial regression lines by y ~ x2|x1 (four short segments). <br />
proc reg data = raw;
   model y = x2;
   ods select parameterestimates diagnosticspanel;
run;

proc reg data = raw;
   model y = x1 x2;
   ods select parameterestimates diagnosticspanel;
run;
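
The original post reads the problem off the diagnostic plots. A numeric check, not in the original post, is to request variance inflation factors and collinearity diagnostics from PROC REG; a minimal sketch:

/* VIFs far above 10 and a large condition index signal multicollinearity */
proc reg data = raw;
   model y = x1 x2 / vif collin;
   ods select parameterestimates collindiag;
run;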
Solutions:
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-MSiPPghR-g4/T2lMjrq4JXI/AAAAAAAAA_4/FjthFobzbGs/s1600/DiagnosticsPanel2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://2.bp.blogspot.com/-MSiPPghR-g4/T2lMjrq4JXI/AAAAAAAAA_4/FjthFobzbGs/s320/DiagnosticsPanel2.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
1. Drop a variable
Standing alone, x1 seems to be a better predictor (higher R-square and lower MSE) than x2. The easiest way to remove the multicollinearity is to keep only x1 in the model.
proc reg data = raw;
   model y = x1;
   ods select parameterestimates diagnosticspanel;
run;
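
To see the comparison behind the claim that x1 is the better single predictor, a minimal sketch that fits each variable alone and prints the fit statistics for both models:

/* Compare R-square and root MSE of the two one-variable models */
proc reg data = raw;
   model y = x1;
   model y = x2;
   ods select fitstatistics;
run;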
[Figure: ScreePlot5.png — scree plot of the principal components; http://2.bp.blogspot.com/-XCWo81NBAdk/T2eOVsSWEPI/AAAAAAAAA_U/Bj_C83lilsY/s1600/ScreePlot5.png]
2. Principal component regression
If we want to keep both variables and avoid losing information, principal component regression is a good option. PCA transforms the correlated variables into orthogonal factors. In this case, the first eigenvector explains 97.77% of the total variance, which is more than enough for the subsequent regression. SAS's PLS procedure (http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_pls_sect004.htm) can also perform principal component regression, as sketched after the code below.
proc princomp data = raw out = pca;
   ods select screeplot corr eigenvalues eigenvectors;
   var x1 x2;
run;

proc reg data = pca;
   model y = prin1;
   ods select parameterestimates diagnosticspanel;
run;
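
As mentioned above, PROC PLS can run the principal component regression directly. A minimal sketch, assuming one factor is kept (nfac = 1) to match the single-component choice here:

/* METHOD = PCR requests principal component regression in PROC PLS */
proc pls data = raw method = pcr nfac = 1;
   model y = x1 x2;
run;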
</code></pre><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3256159328630041416-7982648040085775225?l=www.sasanalysis.com' alt='' /></div><img src="http://feeds.feedburner.com/~r/SasAnalysis/~4/BX2vQoC9hzs" height="1" width="1"/>