SAS Chinese Forum (SAS中文论坛)
Title:
SAS EM: Transform Variable Node
Author:
shiyiming
Date:
2010-10-22 22:51
From supersasmacro's blog on Sina
<div>SAS EM: Transform Variable Node</DIV>
<div><br /></DIV>
<div>SAS EM (Enterprise Miner): data mining node functions explained, with code implementations (Part 8)</DIV>
<div><br /></DIV>
<div>Please do not repost this article without the author's permission.</DIV>
<div><br /></DIV>
<div><br /></DIV>
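<div>The Transform Variable Node generates derived variables, for example by applying numeric transformations. It lets you create new variables by transforming variables that already exist in the data. For example, you can transform a variable to stabilize its variance, remove nonlinearity, or correct a non-normal distribution. Several types of transformation are available: </DIV>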
<div> <a href="http://blog.photo.sina.com.cn/showpic.html#url=http://static4.photo.sina.com.cn/orignal/5d3b177ch8d35a30ebf13" TARGET="_blank"><img SRC="http://static4.photo.sina.com.cn/middle/5d3b177ch8d35a30ebf13&690" WIDTH="500" HEIGHT="300" /></A></DIV>
<br />
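<div>The transformations fall into three groups:</DIV>
<div>Basic transformations:</DIV>
<div> Log: takes the natural logarithm.</DIV>
<div> Square root: takes the square root.</DIV>
<div> Inverse: takes the reciprocal.</DIV>
<div> Square: squares the value.</DIV>
<div> Exponential: raises e to the power of the value.</DIV>
<div> Standardize: subtracts the mean and divides by the standard deviation.</DIV>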
<div><br /></DIV>
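<div>As a rough illustration of the basic transformations listed above, here is a minimal Python sketch (standard library only). It is not what Enterprise Miner runs internally. The function name basic_transforms, the use of None for undefined results, and the sample values are all assumptions made for this example.</DIV>

```python
import math
import statistics


def basic_transforms(values):
    """Return a dict of transformed copies of `values`.

    None marks a result that is undefined (log or square root of a
    non-positive or negative number, reciprocal of zero).
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "log": [math.log(x) if x > 0 else None for x in values],
        "sqrt": [math.sqrt(x) if x >= 0 else None for x in values],
        "inverse": [1.0 / x if x != 0 else None for x in values],
        "square": [x ** 2 for x in values],
        "exp": [math.exp(x) for x in values],
        "standardize": [(x - mean) / std for x in values],
    }


# Hypothetical sample values, chosen only to keep the output easy to read.
sample = [1.0, 4.0, 9.0, 16.0]
result = basic_transforms(sample)
for name, column in result.items():
    print(name, column)
```

<div>The guards on log, sqrt and inverse play the same role as the "if AMOUNT > 0 then ... else ... = ." pattern in the SAS code further down: an undefined result becomes a missing value instead of an error.</DIV>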
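<div>Binning transformations (discretizing continuous data):</DIV>
<div> Bucket: splits the range of the data into n intervals of equal width; the number of observations in each interval usually differs.</DIV>
<div> Quantile: splits the data into n intervals that each contain approximately the same number of observations.</DIV>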
<div> Optimal binning for relationship to target: chooses the interval boundaries that best capture the relationship with the target.</DIV>
<div><br /></DIV>
<div>Best power transforms:</DIV>
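<div> Maximize normality: chooses the power transformation whose result is closest to a normal distribution.</DIV>
<div> Maximize correlation with target: chooses the power transformation with the highest correlation with the target.</DIV>
<div> Equalize spread with target levels: chooses the power transformation that makes the spread of the variable most similar across the levels of the target.</DIV>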
<div><br /></DIV>
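<div>To make the difference between the Bucket and Quantile methods concrete, here is a small Python sketch. The helper names, the tie rule (a value equal to a cut point goes to the lower bin) and the sample data are assumptions of this example, not Enterprise Miner internals. The range 19 to 75 used for AGE is also an assumption; it happens to give the cut points 33, 47 and 61 that appear in the PROC FORMAT step later in this post.</DIV>

```python
def bucket_edges(lo, hi, n):
    """Interior cut points for n equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n
    return [lo + width * i for i in range(1, n)]


def quantile_edges(values, n):
    """Interior cut points chosen so each bin holds about len(values) / n observations."""
    ordered = sorted(values)
    return [ordered[max(len(ordered) * i // n - 1, 0)] for i in range(1, n)]


def bin_counts(values, edges):
    """Count observations per bin; a value equal to a cut point goes to the lower bin."""
    counts = [0] * (len(edges) + 1)
    for x in values:
        counts[sum(1 for e in edges if x > e)] += 1
    return counts


# Hypothetical, right-skewed sample (not taken from DMAGECR).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100]

age_edges = bucket_edges(19, 75, 4)  # assumed AGE range of 19 to 75
bucket_counts = bin_counts(sample, bucket_edges(1, 100, 4))
quantile_counts = bin_counts(sample, quantile_edges(sample, 4))

print(age_edges)
print(bucket_counts)
print(quantile_counts)
```

<div>On this sample the equal-width buckets hold 10, 1, 0 and 1 observations, while the quantile bins hold 3 each, which is exactly the trade-off described above.</DIV>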
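<div>For each original or transformed variable, the node displays the following columns:</DIV>
<div> Name: the name of the original or transformed variable.</DIV>
<div> Keep: whether the variable is kept as an output variable.</DIV>
<div> Mean: the mean.</DIV>
<div> Std Dev: the standard deviation.</DIV>
<div> Skew: the skewness. A positive value means the distribution has a longer tail to the right of the mean; a negative value means it has a longer tail to the left.</DIV>
<div> Kurtosis: a measure of the shape of the distribution. A large value indicates that some observations lie far from the mean (heavy tails).</DIV>
<div> C.V.: the coefficient of variation (the standard deviation relative to the mean).</DIV>
<div> Formula: the transformation formula.</DIV>
<div> Format: the format of the variable.</DIV>
<div> Label: the label of the variable.</DIV>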
<div><br /></DIV>
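<div>The statistics in the list above can be computed by hand as follows. This Python sketch uses the plain moment-based definitions of skewness and excess kurtosis; statistical packages often apply small-sample corrections, so their values may differ slightly. The function name describe, the choice to report C.V. as a percentage, and the sample data are assumptions of this example.</DIV>

```python
import math


def describe(values):
    """Mean, sample standard deviation, skewness, excess kurtosis and C.V."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with an n denominator (used for skewness and kurtosis).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    std = math.sqrt(m2 * n / (n - 1))  # sample standard deviation (n - 1 denominator)
    return {
        "mean": mean,
        "std_dev": std,
        "skew": m3 / m2 ** 1.5,          # > 0: longer right tail, < 0: longer left tail
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis: 0 for a normal distribution
        "cv": 100.0 * std / mean,        # coefficient of variation, in percent
    }


# Hypothetical samples: one right-skewed, one symmetric.
skewed = describe([2, 4, 4, 4, 5, 5, 7, 9])
symmetric = describe([1, 2, 3, 4, 5])
print(skewed)
print(symmetric)
```

<div>The first sample has a longer right tail, so its skewness is positive; the symmetric sample has a skewness of zero.</DIV>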
<div>The Transform Variable Node</DIV>
<div> <a href="http://blog.photo.sina.com.cn/showpic.html#url=http://static5.photo.sina.com.cn/orignal/5d3b177ch8d35a487a534" TARGET="_blank"><img SRC="http://static5.photo.sina.com.cn/middle/5d3b177ch8d35a487a534&690" WIDTH="500" HEIGHT="300" /></A></DIV>
<br />
<div>Setting the target variable</DIV>
<div> <a href="http://blog.photo.sina.com.cn/showpic.html#url=http://static3.photo.sina.com.cn/orignal/5d3b177ch8d35a5c3a582" TARGET="_blank"><img SRC="http://static3.photo.sina.com.cn/middle/5d3b177ch8d35a5c3a582&690" WIDTH="690" HEIGHT="303" /></A></DIV>
<br />
<div>Transforming the variables</DIV>
<div> <a href="http://blog.photo.sina.com.cn/showpic.html#url=http://static7.photo.sina.com.cn/orignal/5d3b177ch74855d866ff6" TARGET="_blank"><img SRC="http://static7.photo.sina.com.cn/middle/5d3b177ch74855d866ff6&690" WIDTH="494" HEIGHT="207" /></A></DIV>
<br />
<div>Transformation results</DIV>
<div> <a href="http://blog.photo.sina.com.cn/showpic.html#url=http://static9.photo.sina.com.cn/orignal/5d3b177ch8d35ad45b3c8" TARGET="_blank"><img SRC="http://static9.photo.sina.com.cn/middle/5d3b177ch8d35ad45b3c8&690" WIDTH="690" HEIGHT="230" /></A></DIV>
<br />
<div>The code implementation is as follows:</DIV>
<div>%let DM_SEED = 12345;</DIV>
<div><br /></DIV>
<div>libname SAMPSIO list;</DIV>
<div> </DIV>
<div>data EMDATA.VIEW_KXX / view=EMDATA.VIEW_KXX;</DIV>
<div> set EMSAMPLE.DMAGECR;</DIV>
<div>run;</DIV>
<div> </DIV>
<div>data EMPROJ.SMP_VIIA /view=EMPROJ.SMP_VIIA;</DIV>
<div> set EMSAMPLE.DMAGECR;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>proc sql noprint;</DIV>
<div>  select count(*) into :_tmpa</DIV>
<div>  from sashelp.vstabvw</DIV>
<div>  where libname = "EMSAMPLE" and</DIV>
<div>        upcase(memname) = upcase("DMAGECR");</DIV>
<div>quit;</DIV>
<div><br /></DIV>
<div>data EMPROJ.SMP_XGPV/view=EMPROJ.SMP_XGPV;</DIV>
<div> set EMPROJ.SMP_VIIA;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>data EMDATA.TRNTSZ2K/view=EMDATA.TRNTSZ2K;</DIV>
<div> set EMDATA.VIEW_KXX;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>** Here the AMOUNT variable is transformed using the Maximize normality criterion;</DIV>
<div><a href="http://blog.photo.sina.com.cn/showpic.html#url=http://static3.photo.sina.com.cn/orignal/5d3b177ch8d35aefd78f2" TARGET="_blank"><img SRC="http://static3.photo.sina.com.cn/middle/5d3b177ch8d35aefd78f2&690" WIDTH="463" HEIGHT="355" /></A><br />
<br /></DIV>
<div>* The following candidate transformations are tried, and the one whose result is closest to a normal distribution is chosen as the final transformation: natural log, 1/4 power, square root, square, 4th power and exponential, i.e. log(x), x**0.25, sqrt(x), x**2, x**4, exp(x);</DIV>
<div>* AMOUNT ;</DIV>
<div>data _trntmp(keep=AMOUNT _logvar _rt4var _sqrtvar _sqrvar _pwr4var _expvar);</DIV>
<div>  set EMPROJ.SMP_VIIA;</DIV>
<div>  if AMOUNT + 0 > 0 then _logvar = log(AMOUNT + 0);</DIV>
<div>  else _logvar = .;</DIV>
<div>  _rt4var = (AMOUNT + 0) ** 0.25;</DIV>
<div>  _sqrtvar = sqrt((AMOUNT + 0));</DIV>
<div>  _sqrvar = (AMOUNT + 0)**2;</DIV>
<div>  _pwr4var = (AMOUNT + 0)**4;</DIV>
<div>  _expvar = exp((AMOUNT + 0)/184.24);</DIV>
<div>RUN;</DIV>
<div>** Standardize the candidates to mean 0 and standard deviation 1;</DIV>
<div>proc standard data=_trntmp out=_trnstd mean=0 std=1;</DIV>
<div>RUN;</DIV>
<div><br /></DIV>
<div>proc sort data=_trnstd;</DIV>
<div>  by AMOUNT;</DIV>
<div>run;</DIV>
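<div>** First generate the expected normal score for each rank (_n_ is the rank after sorting and 1000 is presumably the sample size);</DIV>
<div>data _trnstd;</DIV>
<div>  set _trnstd;</DIV>
<div>  normval = probit(_n_/(1000+1));</DIV>
<div>run;</DIV>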
<div>** Correlate each candidate transformation with the normal scores;</DIV>
<div>proc corr data=_trnstd outp=_indtrn noprint;</DIV>
<div>  var AMOUNT _logvar _rt4var _sqrtvar _sqrvar _pwr4var _expvar;</DIV>
<div>  with normval;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>* Score each candidate power: weighted average of its own correlation (weight 2) and the correlations of its neighbouring powers (weight 1), squared. _power = 0 stands for log and _power = 10 for exp;</DIV>
<div>data _modtmp(keep=_power _val);</DIV>
<div>  set _indtrn;</DIV>
<div>  if _type_ = 'CORR' then do;</DIV>
<div>    _power = 0;</DIV>
<div>    _val = ((2*_logvar + _rt4var)/3)**2;</DIV>
<div>    output;</DIV>
<div>    _power = .25;</DIV>
<div>    _val = ((_logvar + 2*_rt4var + _sqrtvar)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = .5;</DIV>
<div>    _val = ((_rt4var + 2*_sqrtvar + AMOUNT)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 1;</DIV>
<div>    _val = ((_sqrtvar + 2*AMOUNT + _sqrvar)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 2;</DIV>
<div>    _val = ((AMOUNT + 2*_sqrvar + _pwr4var)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 4;</DIV>
<div>    _val = ((_sqrvar + 2*_pwr4var + _expvar)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 10;</DIV>
<div>    _val = ((_pwr4var + 2*_expvar)/3)**2;</DIV>
<div>    output;</DIV>
<div>  end;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>proc sort data=_modtmp;</DIV>
<div>  by descending _val;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>%let _tmpa=1;</DIV>
<div>proc sql;</DIV>
<div>  reset noprint;</DIV>
<div>  select _power into :_tmpa</DIV>
<div>  from _modtmp;</DIV>
<div>quit;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>proc datasets lib=work nolist;</DIV>
<div>  delete _trntmp _modtmp _indtrn;</DIV>
<div>run;</DIV>
<div>quit;</DIV>
<div><br /></DIV>
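<div>The idea behind the Maximize normality steps above (sort the values, attach the expected normal score to each rank, then see which candidate transformation correlates best with those scores) can be sketched in Python as follows. This is a simplified illustration, not a port of the SAS code: it leaves out the scaled exponential candidate and the weighting with neighbouring powers, and it skips the PROC STANDARD step because the Pearson correlation does not change when a variable is standardized. The names most_normal_transform and CANDIDATES, and the synthetic data, are assumptions of this example.</DIV>

```python
import math
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Candidate transformations (a subset of those used in the SAS code).
CANDIDATES = {
    "log": math.log,
    "x**0.25": lambda x: x ** 0.25,
    "sqrt": math.sqrt,
    "identity": lambda x: x,
    "x**2": lambda x: x ** 2,
    "x**4": lambda x: x ** 4,
}


def most_normal_transform(values):
    """Pick the candidate most correlated with the normal scores.

    All values must be positive, otherwise log raises ValueError.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Expected normal score of rank i, same idea as probit(_n_/(n+1)) in SAS.
    scores = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    corr = {name: pearson([f(x) for x in ordered], scores)
            for name, f in CANDIDATES.items()}
    best = max(corr, key=lambda name: corr[name] ** 2)
    return best, corr


# Synthetic log-normal data: taking the log makes it exactly normal-shaped.
data = [math.exp(NormalDist().inv_cdf(i / 51)) for i in range(1, 51)]
best, corr = most_normal_transform(data)
print(best)
print(corr)
```

<div>For this synthetic log-normal sample the log transformation wins, which matches the transformation that ends up being applied to AMOUNT (AMOU_RQU=log(AMOUNT)) in the final data step of this post.</DIV>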
<div>proc format lib=WORK;</DIV>
<div>  value AGE_1BY_ low-33 ='0001:low-33'</DIV>
<div>                 33-47 ='0002:33-47'</DIV>
<div>                 47-61 ='0003:47-61'</DIV>
<div>                 61-high='0004:61-high';</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div> </DIV>
<div>** DURATION: candidate transformations for Equalize spread with target levels (the target is GOOD_BAD);</DIV>
<div>data _trntmp(keep=DURATION GOOD_BAD _logvar _rt4var _sqrtvar _sqrvar _pwr4var _expvar);</DIV>
<div>  set EMPROJ.SMP_VIIA;</DIV>
<div>  if DURATION + 0 > 0 then _logvar = log(DURATION + 0);</DIV>
<div>  else _logvar = .;</DIV>
<div>  _rt4var = (DURATION + 0) ** 0.25;</DIV>
<div>  _sqrtvar = sqrt((DURATION + 0));</DIV>
<div>  _sqrvar = (DURATION + 0)**2;</DIV>
<div>  _pwr4var = (DURATION + 0)**4;</DIV>
<div>  _expvar = exp((DURATION + 0)/1);</DIV>
<div>RUN;</DIV>
<div> </DIV>
<div>proc standard data=_trntmp out=_trnstd mean=0 std=1;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>* Standard deviation of each standardized candidate within each GOOD_BAD level;</DIV>
<div>proc summary data=_trnstd;</DIV>
<div>  class GOOD_BAD;</DIV>
<div>  var DURATION _logvar _rt4var _sqrtvar _sqrvar _pwr4var _expvar;</DIV>
<div>  output out=_indtrn std=;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>* Spread of those per-level standard deviations (the rows with _type_=1 are the per-level rows);</DIV>
<div>proc summary data=_indtrn;</DIV>
<div>  where _type_=1;</DIV>
<div>  var DURATION _logvar _rt4var _sqrtvar _sqrvar _pwr4var _expvar;</DIV>
<div>  output out=_indtrn std=;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>data _modtmp(keep=_power _val);</DIV>
<div>  set _indtrn;</DIV>
<div>  if _type_ = 0 then do;</DIV>
<div>    _power = 0;</DIV>
<div>    _val = ((2*_logvar + _rt4var)/3)**2;</DIV>
<div>    output;</DIV>
<div>    _power = .25;</DIV>
<div>    _val = ((_logvar + 2*_rt4var + _sqrtvar)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = .5;</DIV>
<div>    _val = ((_rt4var + 2*_sqrtvar + DURATION)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 1;</DIV>
<div>    _val = ((_sqrtvar + 2*DURATION + _sqrvar)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 2;</DIV>
<div>    _val = ((DURATION + 2*_sqrvar + _pwr4var)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 4;</DIV>
<div>    _val = ((_sqrvar + 2*_pwr4var + _expvar)/4)**2;</DIV>
<div>    output;</DIV>
<div>    _power = 10;</DIV>
<div>    _val = ((_pwr4var + 2*_expvar)/3)**2;</DIV>
<div>    output;</DIV>
<div>  end;</DIV>
<div>run;</DIV>
<div><br /></DIV>
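<div>* Smaller is better here, so the ascending sort puts the most equal spread first;</DIV>
<div>proc sort data=_modtmp;</DIV>
<div>  by _val;</DIV>
<div>run;</DIV>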
<div><br /></DIV>
<div>%let _tmpa=1;</DIV>
<div>proc sql;</DIV>
<div>  reset noprint;</DIV>
<div>  select _power into :_tmpa</DIV>
<div>  from _modtmp;</DIV>
<div>quit;</DIV>
<div>run;</DIV>
<div><br /></DIV>
<div>proc datasets lib=work nolist;</DIV>
<div>  delete _trntmp _modtmp _indtrn;</DIV>
<div>run;</DIV>
<div>quit;</DIV>
<div><br /></DIV>
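<div>The Equalize spread steps above can be summarized as: standardize each candidate transformation, compute its standard deviation within each target level, and prefer the candidate for which those per-level standard deviations differ the least. Here is a minimal Python sketch of that idea. It is a simplification (no exponential candidate and no weighting with neighbouring powers), and the function names, the candidate list and the toy data are assumptions of this example.</DIV>

```python
import math
from statistics import fmean, stdev

# Candidate transformations (a subset of those used in the SAS code).
CANDIDATES = {
    "log": math.log,
    "x**0.25": lambda x: x ** 0.25,
    "sqrt": math.sqrt,
    "identity": lambda x: x,
    "x**2": lambda x: x ** 2,
    "x**4": lambda x: x ** 4,
}


def spread_imbalance(values, levels, transform):
    """Std dev, across target levels, of the per-level std dev of the
    standardized transformed variable. Smaller means more equal spread."""
    transformed = [transform(x) for x in values]
    center, scale = fmean(transformed), stdev(transformed)
    groups = {}
    for t, level in zip(transformed, levels):
        groups.setdefault(level, []).append((t - center) / scale)
    return stdev([stdev(group) for group in groups.values()])


def most_equal_spread(values, levels):
    """Return the candidate with the smallest imbalance, plus all scores."""
    imbalance = {name: spread_imbalance(values, levels, f)
                 for name, f in CANDIDATES.items()}
    return min(imbalance, key=imbalance.get), imbalance


# Toy data: the "bad" group is the "good" group multiplied by 10, so the
# spread differs on the raw scale but is identical on the log scale.
values = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
levels = ["good"] * 5 + ["bad"] * 5
best, imbalance = most_equal_spread(values, levels)
print(best)
print(imbalance)
```

<div>On this toy data the log transformation is selected, the same transformation the final data step in this post applies to DURATION (DURA_BF9=log(DURATION)).</DIV>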
<div>** Apply the transformations here;</DIV>
<div>data EMDATA.TRNTSZ2K/view=EMDATA.TRNTSZ2K;</DIV>
<div>  set EMDATA.VIEW_KXX;</DIV>
<div>  drop DURATION;</DIV>
<div>  *;</DIV>
<div>  format DURA_BF9 BEST12.;</DIV>
<div>  label DURA_BF9='DURATION: Equalize spread among GOOD_BAD';</DIV>
<div>  if DURATION > 0 then</DIV>
<div>    DURA_BF9=log(DURATION);</DIV>
<div>  else DURA_BF9 = .;</DIV>
<div>  *;</DIV>
<div>  format DURA_QPD BEST12.;</DIV>
<div>  label DURA_QPD='standardize(DURATION)';</DIV>
<div>  DURA_QPD=(DURATION - 20.903) / 12.05881;</DIV>
<div>  drop AMOUNT;</DIV>
<div>  *;</DIV>
<div>  format AMOU_NQU BEST12.;</DIV>
<div>  label AMOU_NQU='square(AMOUNT)';</DIV>
<div>  AMOU_NQU=(AMOUNT)**2;</DIV>
<div>  *;</DIV>
<div>  format AMOU_8V9 BEST12.;</DIV>
<div>  label AMOU_8V9='inverse(AMOUNT)';</DIV>
<div>  AMOU_8V9=1/(AMOUNT);</DIV>
<div>  *;</DIV>
<div>  format AMOU_RQU BEST12.;</DIV>
<div>  label AMOU_RQU='AMOUNT: Maximize normality';</DIV>
<div>  if AMOUNT > 0 then</DIV>
<div>    AMOU_RQU=log(AMOUNT);</DIV>
<div>  else AMOU_RQU = .;</DIV>
<div>  drop AGE;</DIV>
<div>  *;</DIV>
<div>  format AGE_1BYU AGE_1BY_17.;</DIV>
<div>  label AGE_1BYU='Bucket(AGE)';</DIV>
<div>  AGE_1BYU=AGE;</DIV>
<div>run;</DIV>
<div><br /></DIV>
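<div>There is one more kind, the BIN (optimal binning) transformation. Its method is similar to the one used for variable selection, so I will leave it for you to explore on your own.</DIV>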
<div><br /></DIV>
<div><br /></DIV>
<div><br /></DIV>
<div>The SAS data set used in this article is dmagecr.sas7bdat. It can be downloaded from:</DIV>
<div><!-- m --><a class="postlink" href="http://ishare.iask.sina.com.cn/f/8641122.html">http://ishare.iask.sina.com.cn/f/8641122.html</a><!-- m --></DIV>
<div>All the data for this series can be downloaded from:</DIV>
<div><!-- m --><a class="postlink" href="http://iask.sina.com.cn/u/1564153724/ish">http://iask.sina.com.cn/u/1564153724/ish</a><!-- m --></DIV>
Welcome to the SAS Chinese Forum (https://mysas.net/forum/)