SAS中文论坛 (SAS Chinese Forum)

Title: A SAS programming question

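Author: shiyiming Time: 2011-8-9 21:01
Title: A SAS programming question
I would like to ask the experts for help with a piece of code. Thanks.
The requirements are as follows:
1. Format of the raw dataset: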
item_id,prop
1,A
1,B
1,C
1,D
1,E
1,F
2,A
2,C
2,G
2,H
3,C
3,D
3,J
...
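2. Requirement:
For every pair of items, compute how many prop values they share. For example, item_id=1 has props (A,B,C,D,E,F) and item_id=2 has props (A,C,G,H), so the number of overlapping props is 2 (A and C).
The expected result is shown below. Each pair must satisfy item2 > item1; item1_count is the number of props of item1, item2_count is the number of props of item2, and match_count is the number of props shared by item1 and item2.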
item_1,item_2,item1_count,item2_count,match_count
1,2,6,4,2
1,3,6,3,2
2,3,4,3,1
...

Author: shiyiming Time: 2011-8-10 00:09
Title: Re: A SAS programming question
data raw;
format item_id prop $1.;
infile datalines dlm=',';
input item_id $ prop $ ;
cards;
1,A
1,B
1,C
1,D
1,E
1,F
2,A
2,C
2,G
2,H
3,C
3,D
3,J
;
run;

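/* Number of distinct props: used as the length of the concatenated string */
proc sql;
select count(distinct prop) into:nump from raw;
quit;

/* One row per item_id: allp = its props concatenated, nump = its prop count */
/* (relies on every prop being a single character) */
data temp1;
   length allp $&nump..;
   set raw;
   by item_ID;
   retain allp;
   if first.item_ID then do; allp=''; nump=0; end;
   nump+1;
   allp=strip(allp)||prop;
   if last.item_id then output;
   drop prop;
run;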

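/* Pair each row with every later row through a second SET with POINT= */
data result;
   set temp1 nobs=nobs;
   item1=item_id;
   item1_count=nump;
   lagp=allp;                 /* props of item1 */
   do p=_n_+1 to nobs;
      set temp1 point=p;      /* item2: allp and nump now belong to row p */
      item2=item_id;
      Item2_count=nump;
      match_count=0;
      /* count the characters of item2's string that also occur in item1's string */
      do i=1 to &nump;
         if not missing(substr(allp,i,1)) then match_count+ index(lagp,substr(allp,i,1)) ne 0;
      end;
      output;
   end;
   keep item1: item2: mat:;
run;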

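Author: shiyiming Time: 2011-8-10 10:58
Title: Re: A SAS programming question
Thanks, sun.
Nice idea: using two set statements plus point= effectively joins the two tables.
What worries me now is performance:
1. do p=_n_+1 to nobs; -- nobs could be several million
2. do i=1 to &nump; -- &nump could be several hundred thousand

That is a lot of looping. Is there an algorithm with better performance?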

Author: shiyiming Time: 2011-8-10 12:12
Title: Re: A SAS programming question
See whether this is more efficient... it avoids the inner loop...
[code:2wu8g11e]data test;
infile cards dsd;
input item_id prop $;
cards;
1,A
1,B
1,C
1,D
1,E
1,F
2,A
2,C
2,G
2,H
3,C
3,D
3,J
;
data one(drop=prop);
  set test;
  by item_id notsorted;
  length id_prop $30;
  retain id_prop;
  if first.item_id then id_prop='';
  id_prop=trimn(id_prop)||prop;
  if last.item_id;
run;

data two(keep=item_1 item_2 item1_count item2_count match_count);
  set one nobs=obs;
  retain item_1;
  temp=lag(id_prop);
  item_1=lag(item_id);
  if _n_ ne 1 then do i=_n_ to obs;
    set one point=i;
    item_2=item_id;
    item1_count=length(temp);
    item2_count=length(id_prop);
    match_count=countc(strip(temp),strip(id_prop),'i');
    output;
  end;
run;[/code:2wu8g11e]

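Author: shiyiming Time: 2011-8-10 14:09
Title: Re: A SAS programming question
Thanks to 天性爱好者 for the idea.

One addition: in the real data, prop is a string, not a single character such as A, B, C or D.
So when concatenating the props, they can be joined with ";" in between, e.g. A;B;C...

match_count=countc(strip(temp),strip(id_prop),'i');
Using the countc function to compute the overlap of the two strings looks problematic here, because countc counts individual characters rather than whole props.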

Author: shiyiming Time: 2011-8-10 14:17
Title: Re: A SAS programming question
If they are strings, then try switching to a function such as countw...

Author: shiyiming Time: 2011-8-10 14:42
Title: Re: A SAS programming question
I did not consider strings... that function does not seem to work for this; you would have to write a function of your own... :(
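
For multi-character props joined with ";", one possible approach is to walk through the words of one list with COUNTW and SCAN and look each one up in the other list with FINDW. The following is only a minimal sketch: the variable names list1, list2, word and k and the sample values are made up for illustration, and it has not been tried on the real data.

```sas
data _null_;
  length list1 list2 $200 word $50;
  list1 = 'red;green;blue;dark red';   /* hypothetical props of item 1 */
  list2 = 'green;dark red;yellow';     /* hypothetical props of item 2 */
  match_count = 0;
  /* loop over the ";"-delimited props of list2 */
  do k = 1 to countw(list2, ';');
    word = scan(list2, k, ';');
    /* FINDW looks for a whole word; STRIP removes the trailing blanks   */
    /* that would otherwise keep the last prop of list1 from matching    */
    if findw(strip(list1), strip(word), ';') > 0 then match_count + 1;
  end;
  put match_count=;
run;
```

This still loops over the props of one item for every pair, so it addresses correctness for string props rather than the performance concern raised earlier in the thread.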

Author: shiyiming Time: 2011-8-11 05:42
Title: Re: A SAS programming question
looks like a basket analysis or social circle analysis
depending on the number of item_id and unique PROP, efficiency consideration is different
cartesian join is a good tool and on SQL servers, it is actually pretty fast

[code:3hl4svsa]
data test(index=(item_id));
input item_id prop $;
cards;
1 A
1 B
1 C
1 D
1 E
1 F
2 A
2 C
2 G
2 H
3 C
3 D
3 J
;
run;

proc sql;
   create table final as
   select a.item_id as item_id1, b.item_id as item_id2,
          count(distinct a.prop) as item_count1,
          count(distinct b.prop) as item_count2,
          sum(a.prop=b.prop) as match_count
   from test as a, test as b
   where  a.item_id<b.item_id
   group by a.item_id, b.item_id
   ;
quit;

[/code:3hl4svsa]
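
If the full cartesian product turns out to be too large, a possible variation is to join on prop itself, so that only item pairs sharing at least one prop are generated, and to compute the per-item counts separately. This is a sketch under two assumptions that are not stated in the thread: each (item_id, prop) combination occurs at most once, and pairs with no shared prop may be left out of the result. The table names counts, matches and final2 are made up for illustration.

```sas
data test;
  input item_id prop $;
  cards;
1 A
1 B
1 C
1 D
1 E
1 F
2 A
2 C
2 G
2 H
3 C
3 D
3 J
;
run;

proc sql;
  /* number of props per item */
  create table counts as
  select item_id, count(distinct prop) as prop_count
  from test
  group by item_id;

  /* joining on prop yields one row per shared prop of each item pair */
  create table matches as
  select a.item_id as item_1, b.item_id as item_2, count(*) as match_count
  from test as a inner join test as b
    on a.prop = b.prop and a.item_id < b.item_id
  group by a.item_id, b.item_id;

  /* attach the per-item counts to every pair */
  create table final2 as
  select m.item_1, m.item_2,
         c1.prop_count as item1_count,
         c2.prop_count as item2_count,
         m.match_count
  from matches as m, counts as c1, counts as c2
  where m.item_1 = c1.item_id and m.item_2 = c2.item_id
  order by m.item_1, m.item_2;
quit;
```

For the sample data this should give the three rows 1,2,6,4,2 / 1,3,6,3,2 / 2,3,4,3,1 from the original question. Whether it is actually faster depends on how many items share each prop.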

Author: shiyiming Time: 2011-8-12 00:02
Title: Re: A SAS programming question
[code:3f9jd6ql]data raw;
infile datalines dlm=',';
input item_id prop $;
datalines;
1,A
1,B
1,C
1,D
1,E
1,F
2,A
2,C
2,G
2,H
3,C
3,D
3,J
;
data raw;
set raw;
by item_id;
if last.item_id then group_flag=1;
run;
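data out;
length prop $8;
/* hash table holding the props of the current item_1 */
declare hash h(hashexp:4);
rc=h.defineKey('prop');
rc=h.defineData('prop');
rc=h.defineDone();
/* read one whole item_id group; _n_ ends up as the number of rows in it */
do _n_=1 by 1 until(last.item_id);
      set raw;
      by item_id;
      rc=h.add();
end;
start+_n_;   /* row number of the last observation read sequentially so far */
item_1=item_id; item1_count=_n_;
/* go through all later rows and look each prop up in the hash table */
do i=start+1 to last;
      set raw point=i nobs=last;
      item2_count+1;
      if h.find()=0 then n+1;
      if group_flag then do;
         item_2=item_id;
         output;
         /* reset to 0 rather than missing, so that pairs without overlap get n=0 */
         n=0; item2_count=0;
      end;
end;
rc=h.delete();
keep item_1 item_2 item1_count item2_count n;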
run;[/code:3f9jd6ql]

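Author: shiyiming Time: 2011-8-18 22:04
Title: Re: A SAS programming question
Could someone explain the following part of the code? How does SAS manage to keep reading the data through to the end, instead of reading the first Item_ID over and over?
Thanks!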
  do _n_=1 by 1 until(last.item_id);
      set raw;
      by item_id;
      rc=h.add();
end;
item1_count=_n_;

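Author: shiyiming Time: 2011-8-19 18:54
Title: Re: A SAS programming question
[quote="sun59338":2qgfuv1j]Could someone explain the following part of the code? How does SAS manage to keep reading the data through to the end, instead of reading the first Item_ID over and over?
Thanks!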
  do _n_=1 by 1 until(last.item_id);
      set raw;
      by item_id;
      rc=h.add();
end;
item1_count=_n_;[/quote:2qgfuv1j]
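Heh, sun59338 probably overlooked the set raw statement! When the loop finishes for the first time, the last observation of the first Item_ID (the one you mention above) has been read, and the data pointer moves on to the next unread observation, that is, the first observation of the second Item_ID. At that point the loop above ends and the remaining statements are executed, down to run;.

After run, control returns to the top of this data step. Because the raw dataset is still open and its pointer has not yet reached the last observation, the data step keeps iterating. When execution reaches the do loop again, it starts from the observation the pointer was left at, the first observation of the second Item_ID, rather than reading the first Item_ID again as you suggested. The raw dataset is still in the state it was in when first opened and has not been opened again, so reading resumes from there. This repeats until raw has been read to the end, or until a stop is hit along the way; only then does the data step stop iterating.

Also, the loop variable _n_ in the code hopewell gave above can simply be thought of as an ordinary variable such as i, j or k; it does not have to be thought of as the SAS automatic variable, because all this code does is control the pointer over the observations within each item_id. That is my understanding; you can also try it yourself.

In addition, to understand that code better, you can export the hash table at different stages into separate datasets; otherwise the exported hash table only shows the result of the last update. I think that should make it easier to follow how the code runs!!
My knowledge is limited, I have not studied this professionally, and my explanation is not very well organized, so please don't laugh!!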
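
The behaviour described above can be checked with a small experiment. This is only an illustrative sketch: the dataset demo and the counter cnt are made-up names, and the data values are arbitrary.

```sas
data demo;
  input item_id prop $;
  datalines;
1 A
1 B
2 C
3 D
3 E
3 F
;
run;

data _null_;
  /* the SET statement keeps its read position from one DATA step iteration */
  /* to the next, so every pass through this loop consumes the next BY group */
  do cnt = 1 by 1 until (last.item_id);
    set demo;
    by item_id;
  end;
  /* written once per DATA step iteration, i.e. once per item_id */
  put _n_= item_id= cnt=;
run;
```

The log should show one line per item_id, with cnt equal to the number of rows in that group, and the step ends when set demo runs out of observations.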

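Author: shiyiming Time: 2011-8-26 23:30
Title: Re: A SAS programming question
Thanks!
What I cannot figure out is how SAS tells this _n_ apart from its own system variable _N_. Perhaps I have misunderstood how the automatic variable _N_ is assigned and used.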
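
As far as I understand it, SAS does not tell them apart at all: variable names are not case-sensitive, so _n_ in that loop is the automatic variable _N_ itself. A program may assign to it, but SAS sets it back to the iteration count at the top of every data step iteration, and it is never written to the output dataset. A small sketch to check this (the dataset demo2 and its values are made up for illustration):

```sas
data demo2;
  input x;
  datalines;
10
20
30
;
run;

data _null_;
  set demo2;
  put 'top of iteration:  ' _N_=;
  _n_ = 100;   /* overwrites the automatic variable for the rest of this iteration */
  put 'after assignment:  ' _N_=;
run;
```

If this understanding is right, the first put line shows 1, 2 and 3 on successive iterations even though _n_ was set to 100 each time, which is why the loop do _n_=1 by 1 until(last.item_id) can borrow _N_ as a counter without disturbing the data step.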

Welcome to SAS中文论坛 (http://mysas.net/forum/)