Asking the experts: merging multiple ID columns with duplicates into a single ID

shiyiming · posted 2010-7-20 20:22:10

The data are as follows:
data a1;
input id1 id2 id3 id4;
cards;
1 11 21 31
1 12 22 31
2 11 22 32
2 13 21 32
3 14 23 32
4 11 22 33
;
run;

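Each record has four ids; whenever any two records share a value in one of them, they are treated as the same id. For example, records 1 and 2 have the same id1, and records 1 and 6 have the same id2, so records 1, 2 and 6 are the same id. Carrying this through, the whole data set is really a single id. Could the experts tell me how to use these four ids to create a new id that marks this relationship?

Note: be careful with array; the data set has several hundred thousand records.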
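Read as a graph problem, the question asks for connected components: records are nodes, and two records are linked when they share a value in the same id column. A minimal sketch of one standard way to label such components, union-find, is given below. It is written in Python rather than SAS purely to show the idea. The function name `assign_groups` is hypothetical, and it assumes values are compared column by column (an id2 of 22 does not match an id3 of 22), as in the example above.

```python
def assign_groups(records):
    """Return one group label per record; records that share a value
    in the same column (directly or through a chain) get the same label."""
    parent = list(range(len(records)))  # parent[i] == i means record i is a root

    def find(i):
        # Walk up to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (column index, value) -> first record carrying that value
    for rec, ids in enumerate(records):
        for col, val in enumerate(ids):
            other = first_seen.setdefault((col, val), rec)
            root_a, root_b = find(rec), find(other)
            if root_a != root_b:
                # Merge the two groups, keeping the smaller record number as root.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    # Relabel roots as 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(records))]


# The six records of data set a1 from the post above.
a1 = [
    (1, 11, 21, 31),
    (1, 12, 22, 31),
    (2, 11, 22, 32),
    (2, 13, 21, 32),
    (3, 14, 23, 32),
    (4, 11, 22, 33),
]
print(assign_groups(a1))  # [1, 1, 1, 1, 1, 1]
```

Each record is touched once per column, so the work grows roughly linearly with the number of records, which matters at several hundred thousand rows.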

shiyiming · posted 2010-7-21 01:08:23

Oh. I have thought about it for a long time and still have no idea. Experts welcome.

shiyiming · posted 2010-7-21 09:04:11

Still hoping to hear from the experts :)

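shiyiming · posted 2010-7-21 16:08:22

I have seen several similar questions already. Put plainly, it is the friend-circle problem from SNS sites. I think I answered one of them, but I cannot find the code by searching.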

shiyiming · posted 2010-7-21 17:11:19

Yes, it was sxlion who asked it.

shiyiming · posted 2010-7-21 17:52:35

I could not find it. Where is that thread?

shiyiming · posted 2010-7-21 20:49:34

This is a real problem from my work...

Ahuige, could you give a few more hints so that I can go and look for it? Many thanks :)

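shiyiming · posted 2010-7-21 21:52:05

I cannot think of a good method. If your data set is not very large, all you can do is match every column of each row against every column of the rows below it, and keep going like that.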
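The row-by-row matching described here can be sketched as repeated pairwise comparison: every pair of records is compared column by column, linked records take the smaller of their two labels, and the passes repeat until nothing changes. This is a Python illustration only, not SAS, and `label_by_pairwise_matching` is a hypothetical name. Each pass compares all pairs, so it is only practical for small data.

```python
def label_by_pairwise_matching(records):
    """Brute force: compare every pair of records and propagate the smaller
    label until a full pass makes no change."""
    labels = list(range(1, len(records) + 1))  # start with one label per record
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                # Two records are linked if any same-position id matches.
                shares = any(a == b for a, b in zip(records[i], records[j]))
                if shares and labels[i] != labels[j]:
                    low = min(labels[i], labels[j])
                    labels[i] = labels[j] = low
                    changed = True
    return labels


# The six records of data set a1 from the original post.
a1 = [
    (1, 11, 21, 31),
    (1, 12, 22, 31),
    (2, 11, 22, 32),
    (2, 13, 21, 32),
    (3, 14, 23, 32),
    (4, 11, 22, 33),
]
print(label_by_pairwise_matching(a1))  # [1, 1, 1, 1, 1, 1]
```

The resulting label is the smallest record number in each group, so labels are not necessarily consecutive.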

shiyiming · posted 2010-7-21 23:00:35

to gzgoon
This can be solved with hash tables, provided your observations are indexed by each of the four IDs SEPARATELY.

By the OP's description, as the search goes on, any subsequent observation that has 2 IDs in the current ID pool is identified as belonging to the same group, and any new IDs it carries are added to the pool. The main hash table is keyed by record number, which is the unique ID of each obs. Its other elements are 5 auxiliary hash lists: 4 hold the ID pool for each ID#, and the last holds all related records.

For example, starting from the first record, I search by the index on ID1 and find record #2. On examination it satisfies the criterion, so all of its IDs are added to the four ID-pool lists and _n_=2 is added to the list of related records. Our ID pool is now ID1={1}, ID2={11, 12}, ID3={21, 22}, ID4={31}. Iterate over these four auxiliary hash lists to find the next record that satisfies the criterion.

Now on ID2=11 we find _n_=6. On examination, its id3=22 \in ID3={21,22}, so this record is added to the hash table in the way described above.

Keep iterating until all four ID# lists are exhausted and no new record can be found.
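The search described above can be sketched as a breadth-first expansion over a per-column index. The sketch is in Python for illustration, with dictionaries standing in for SAS hash objects, and `groups_by_index_search` is a hypothetical name. It assumes the rule from the original post's example, where a single shared value in one column is enough to link two records; a stricter rule would need an extra check before a record is added.

```python
from collections import defaultdict, deque


def groups_by_index_search(records):
    """Grow each group from a seed record by looking up every id value
    it contains in a (column, value) -> record numbers index."""
    index = defaultdict(list)
    for rec, ids in enumerate(records):
        for col, val in enumerate(ids):
            index[(col, val)].append(rec)

    group = [0] * len(records)  # 0 means "not assigned yet"
    next_group = 0
    for start in range(len(records)):
        if group[start]:
            continue
        next_group += 1
        group[start] = next_group
        queue = deque([start])  # records whose ids still need to be searched
        searched = set()        # the ID pool: (column, value) keys already looked up
        while queue:
            rec = queue.popleft()
            for key in enumerate(records[rec]):  # key is a (column, value) pair
                if key in searched:
                    continue
                searched.add(key)
                for other in index[key]:
                    if not group[other]:
                        group[other] = next_group
                        queue.append(other)
    return group


# The six records of data set a1 from the original post.
a1 = [
    (1, 11, 21, 31),
    (1, 12, 22, 31),
    (2, 11, 22, 32),
    (2, 13, 21, 32),
    (3, 14, 23, 32),
    (4, 11, 22, 33),
]
print(groups_by_index_search(a1))  # [1, 1, 1, 1, 1, 1]
```

Because each (column, value) key is looked up at most once per group, the search stops as soon as the pool yields no new records.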

shiyiming · posted 2010-7-21 23:18:58

[code:313d781q]/* Number the observations; a1 is the data set from the original post */
data have;
  set a1;
  _obs + 1;
run;

/* Loop over the four ID columns, appending matched records to have0
   and refreshing the list of matched _obs values in &ns */
%macro aMcr;
  %global ns;
  %local i _i;
  %do k = 1 %to 4;
    data have12;
      set have1;
      array id_{500000} _temporary_; /* max 500,000 records */
      %if &k = 1 %then %do;
        if _n_ = 1 then id_[1] = id1;
        if id1 = id_[1] then output;
      %end;
      %else %do;
        %let i = 1;
        %do %while(%scan(&ns, &i) ne);
          %let _i = %scan(&ns, &i);
          if _obs = &_i then id_[&_i] = id&k;
          if id&k = id_[&i] then output;
          %let i = %eval(&i+1);
        %end;
      %end;
    run;
    proc append base = have0 data = have12 force nowarn; run;
    proc sql noprint;
      select distinct _obs into :ns separated by ' ' from have0;
    quit;
  %end;
%mend aMcr;

/* Empty shell that collects the final result */
data have_final; length id _obs id1-id4 8.; stop; run;
%let s = 0;

/* Repeat until every record is in have_final: take the records not yet
   grouped, run aMcr on them, and stamp the result with group id &s */
%macro bMcr;
  data _null_;
    call symputx('obs', obs);
    set have nobs = obs; stop;
  run;
  %let nobs = 0;
  %do %while(&nobs LT &obs);
    %let s = %eval(&s+1);
    data have0; length id _obs id1-id4 8.; stop; run;
    proc sql;
      create table have1 as
        select * from have
        where _obs notin (select distinct _obs from have_final);
    quit;
    %aMcr
    proc sort data = have0 nodupkey; by _obs; run;
    data have00;
      set have0;
      id = &s;
    run;
    proc append base = have_final data = have00 force nowarn; run;
    proc sql noprint;
      select count(*) into :nobs from have_final;
    quit;
  %end;
%mend bMcr;

%bMcr
[/code:313d781q]
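Once there are more than 10,000 records, the program starts to.... My algorithm is still too clumsy, and it is not necessarily correct.
Wow, the macro variable gets truncated because it is too long. So if the data set is large, the result is wrong. Forget it.
What about a recursive macro? Who will do it?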
