Asking the experts: merging multiple columns with duplicate IDs into one column

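shiyiming · posted 2010-7-25 13:32:25

猪头 got fooled again!
Also, I browsed master oloolo's blog and it was a real eye-opener! There is even Tensor in there. Is that the thing from tensor algebra? It actually has applications? I hope master oloolo will be generous enough to enlighten us all!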

shiyiming · posted 2010-7-25 14:13:49

To that darn 猪头:
I made my points crystal clear in my first response, but there is no guarantee that it is the correct one. I think you know Tensor as well, so please ZKSS. It could also be a good topic for SuperK, as Tensor is used for image recognition. SuperK, it is your turn now.

So you are in the US? W Va? What do you do right now? Student, faculty, or working professional? I am working for a US agency.....

shiyiming · posted 2010-7-25 14:27:49

You masters carry on talking; I'll get the code figured out first :lol:

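shiyiming · posted 2010-7-25 21:53:42

1. 猪头, the original poster was quite clear: rows must share any two IDs (that is, an edge), whereas you only considered one shared ID (that is, a single point). So your result does not meet the original poster's requirement.
2. 猪头, about the edges: an edge here is not a single edge but many coinciding edges (many observations share the same edge), so the number of edges is not 4 times the number of rows; it is the set of all edges that have two points in common.
3. 猪头, your DFS is well written, but it works on single points. As it happens, the code I posted above reduces the problem to one that only needs a single-point DFS, so I grafted the two programs together and that was enough. I tried it on 10,000 rows and got a result that should be correct.

Note that the code below works by converting the two-point problem into a single-point problem, so the DFS only has to handle single points. That makes the programming slightly easier and does solve the problem, but the price is very long CPU and I/O time. For speed you would still need, as oloolo said, the heavy weaponry of hash objects plus a suitable DFS algorithm. That would finish in a short time, and the code would be concise and clean, unlike the long-winded code below.

The hybrid code, 猪头's DFS grafted onto my edge construction (unoptimized):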
[code:2miq4v7r]data a1;
array id[1:4];
do _n_=1 to 700000;
   do _i_ = 1 to dim(id);
      id[_i_] = ceil(20000*ranuni(12345));
   end;
   call sortn(of id[*]);   /* sort the four IDs within each row */
   num1 + 1;               /* running row number */
   output;
end;
drop _i_;

run;

/* one record per (row, id): num2 is the row number, id is one of its four IDs */
data ex1;
   array ids [4] id1-id4;
   set a1;
   do i = 1 to 4;
      id = ids[i];
      num2 = num1;
      output;
   end;
   keep id num2;
run;

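data ex2;
   set a1 (rename=(id1=cid1 id2=cid2 id3=cid3 id4=cid4)) nobs=max;
   if _n_ = 1 then call symputx('max', max);   /* number of rows, reused later */
   array cid [4] cid1-cid4;
   /* scan every (row, id) record of ex1; sum counts the IDs that row num2 shares with the current row num1 */
   do i = 1 to max*4;
      set ex1 point=i;
      if mod(i,4) = 1 then sum = 0;            /* first record of a new row */
      do j = 1 to 4;
         if num1 ne num2 then sum + (id = cid[j]);
      end;
      /* last record of the row: keep each pair once, with num1 < num2.
         This replaces call sortn(num1,num2), which overwrote num1 in the middle of the scan. */
      if mod(i,4) = 0 and sum ge 2 and num1 < num2 then output;
   end;
   keep num1 num2;
run;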

proc sort  data=ex2 out=ex2 nodupkey;
by num1 num2;
run;
data a2(rename=(num1=id1 num2=id2));
set ex2;
run;

data ids(keep=x id);
set a2 nobs=nobs;
if _n_=1 then call symputx('n',nobs);
/* macro variable n = number of observations (edges) in a2 */
x = _n_;
array ids[*] id1-id2;
do _i_ = 1 to dim(ids);
   id = ids[_i_];
   output;
end;
drop _i_;

proc means data=ids(keep=id) nway noprint;
class id;
output out=distinctids(drop=_type_ _freq_);

data ctrl/view=ctrl;
retain fmtname 'id2seq' type 'i';
set distinctids(keep=id rename=(id=start)) nobs=nobs;
if _n_=1 then call symputx('m', nobs);
/* macro variable m = number of distinct IDs (block comment: no unmatched apostrophe) */
label = _n_+&n;

proc format cntlin=ctrl;

*construct edge set;
data edges;
set ids;
y = input(id, id2seq.);
output;
temp = x;
x=y;
y=temp;
output;
keep x y;

proc sort data=edges;
by x y;

*construct pointers to the incident edges @ each vertex;
data vertices;
do count=1 by 1 until(last.x);
   set edges;
   by x;
end;
lastobs+count;
firstobs=lastobs-count+1;
keep firstobs lastobs x;

*depth-first search;
data components(keep=x component);
array pointers[1:%eval(&n+&m),1:2] _temporary_;
array color[1:%eval(&n+&m)] _temporary_;
array stack[1:%eval(&n+&m)] _temporary_;
do _n_=1 by 1 until(eof);
   set vertices end=eof;
   pointers[_n_,1] = firstobs;
   pointers[_n_,2] = lastobs;
end;
component = 0;
do _n_=lbound(color) to hbound(color);
   if color[_n_] = . then do;
      component = component+1;
      stack_top = lbound(stack);
      stack[stack_top] = _n_;
      stack_size=1;
      color[_n_]=component;
      do until(stack_size=0);
         top_vertex = stack[stack_top];
         stack_top=stack_top-1;
         stack_size=stack_size-1;
         do _i_=pointers[top_vertex,1] to pointers[top_vertex,2];
            set edges point=_i_;
            if color[y] = . then do;
               stack_top = stack_top+1;
               stack[stack_top] = y;
               stack_size=stack_size+1;
               color[y]=component; /* the line missing from the earlier version */
            end; /* if color[y]=. */
         end; /* do _i_ over the incident edges */
      end; /* do until(stack_size=0) */
   end; /* if color[_n_]=. */
end; /* do _n_ over all vertices */

do x = 1 to &n;
   component = color[x];
   output;
end;
stop;

data a1_new;
merge a2 components(keep=component rename=(component=id));
run;

proc print;run;

data ex5;
set a1_new(keep=id1 id rename=(id1=num id=g)) a1_new(keep=id2 id rename=(id2=num id=g));
run;
proc sort nodupkey out=ex6;
by num;
run;

proc sql noprint;
select max(g) into :g from ex6;
quit;

data ex7;
do num= 1 to &max.;
  output;
end;
run;

data ex8;
merge ex6 ex7;
by  num;
if g =. then
   do;
      i+1;
      id= &g.+i;
   end;
else id=g;
drop i;
run;
proc print;
/* where g NE .;  a missing g ('.') marks an isolated point */
run;[/code:2miq4v7r]
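
For comparison, the hash-plus-DFS route that the post attributes to oloolo can be sketched outside SAS. The following is Python, not SAS, and is only an illustration under assumptions: a `dict` stands in for the hash object, the requirement is read as "two rows belong together when they share at least two distinct IDs", and the names `build_adjacency` and `label_components` are made up for this sketch. It is not oloolo's code, which does not appear in this part of the thread.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(rows):
    """Link rows that share at least two distinct IDs (assumed reading of the requirement)."""
    by_pair = defaultdict(list)          # (id_a, id_b) -> indices of rows containing both
    for r, ids in enumerate(rows):
        for pair in combinations(sorted(set(ids)), 2):
            by_pair[pair].append(r)
    adj = defaultdict(set)
    for members in by_pair.values():
        # chaining consecutive rows is enough to keep the whole bucket connected
        for a, b in zip(members, members[1:]):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def label_components(n_rows, adj):
    """Iterative DFS with an explicit stack; returns one group number per row."""
    group = [0] * n_rows                 # 0 means "not visited yet"
    current = 0
    for start in range(n_rows):
        if group[start]:
            continue
        current += 1
        group[start] = current
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj.get(v, ()):
                if not group[w]:
                    group[w] = current   # mark when pushed, as the SAS DFS does
                    stack.append(w)
    return group

rows = [
    (1, 2, 3, 4),
    (3, 4, 5, 6),      # shares 3 and 4 with row 0
    (5, 6, 7, 8),      # shares 5 and 6 with row 1
    (1, 9, 10, 11),    # shares only one ID with row 0, so it stays isolated
    (20, 21, 22, 23),  # shares nothing
]
groups = label_components(len(rows), build_adjacency(rows))
print(groups)
```

Each row contributes at most six ID pairs, so the dictionary is built in one pass over the data instead of comparing every row with every other row, which is where the SAS version above spends its time. Isolated rows simply receive a group number of their own.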

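shiyiming · posted 2010-7-25 22:30:27

Anyone with a faster CPU is welcome to try it. I tested it with data I constructed myself that requires backtracking, and the results were fine. A simpler version will have to wait until I have learned DFS properly.

From what I have read of other people's views, breadth-first search can be optimized with a hash, while depth-first search needs a stack.

Could someone explain the differences between DFS and BFS, especially in how they are implemented?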
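
On the implementation side, one way to see the difference is that both traversals can share the same loop and differ only in which end of the pending list is taken next. The Python sketch below is an illustration, not code from the thread; the function name `traverse` and the toy graph are made up. Vertices are marked when they are pushed, as in the SAS DFS above, and the `seen` set plays the part of the hash.

```python
from collections import deque

def traverse(adj, start, depth_first):
    """Return vertices in visiting order; adj maps each vertex to a list of neighbours."""
    frontier = deque([start])
    seen = {start}                       # hash-based "already discovered" check
    order = []
    while frontier:
        # DFS takes the newest vertex (stack); BFS takes the oldest (queue)
        v = frontier.pop() if depth_first else frontier.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return order

adj = {1: [2, 3], 2: [4], 3: [5], 4: [], 5: []}
dfs_order = traverse(adj, 1, depth_first=True)
bfs_order = traverse(adj, 1, depth_first=False)
print(dfs_order, bfs_order)
```

For the grouping problem in this thread only the set of vertices reached matters, not the order, so either traversal yields the same groups.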

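shiyiming · posted 2010-7-26 09:38:58

The 100,000 simulated rows contain 1,952 edges. The IDs that share common points were grouped into 877 groups, and everything else is an isolated point (98,557 of them).

With 700,000 rows there should be more groups and fewer isolated points left over.

My low-end machine took more than 5 hours on the 100,000 rows; 700,000 would probably take more than a week, which is too slow.

After some optimization, 500,000 rows ran in under 15 minutes.

After a further improvement: 700,000 rows, 25 minutes.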

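Qiong · posted 2010-7-26 18:03:29

Oh, I can't believe I missed such a good thread!
I haven't had time to read the earlier replies...
Would R be more convenient?
Think of a four-dimensional space in which each record is a point with coordinates a1, a2, a3, a4.
Define the distance between points a and b as min|ai-bi|, then cluster using that definition of distance? All points at distance 0 get clustered together.

Just a random thought... please don't laugh at me...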
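
Read literally, a distance of min|ai-bi| = 0 means that two records agree in at least one coordinate position, so "cluster everything at distance 0" becomes a connected-components problem. A minimal Python sketch of that reading follows (Python rather than R, purely for illustration; the name `cluster_zero_distance` and the sample records are made up). Note that this criterion joins records on a single shared value in the same column, which is looser than the two-shared-IDs requirement discussed earlier in the thread.

```python
def cluster_zero_distance(records):
    """Union-find: two records are joined when they match in some coordinate position."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    first_seen = {}                           # (position, value) -> first record index
    for r, rec in enumerate(records):
        for pos, value in enumerate(rec):
            key = (pos, value)
            if key in first_seen:
                parent[find(r)] = find(first_seen[key])
            else:
                first_seen[key] = r
    return [find(i) for i in range(len(records))]

records = [
    (1, 2, 3, 4),
    (1, 7, 8, 9),      # same a1 as record 0
    (5, 7, 6, 0),      # same a2 as record 1
    (40, 41, 42, 43),  # matches nothing
]
labels = cluster_zero_distance(records)
print(labels)
```

Records 0, 1 and 2 end up with the same label even though records 0 and 2 have no coordinate in common, because the clustering is transitive.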
