SAS中文论坛

标题: 如何统计出重复出现的ID及其次数? [打印本页]

作者: shiyiming    时间: 2011-10-18 11:59
标题: 如何统计出重复出现的ID及其次数?
有50个data:d1,d2,d3,...d50,这50个data有相同的变量,id,x1,x2,其中ID在单个data里不重复.
问题:我想统计出现在两个或以上data里的ID及其次数.

现在用其中3个data(d1,d2,d3)说明一下:
data d1;
input id $ x1 x2;
cards;
11 0 100
12 1 120
13 0 200
24 1 170
;
data d2;
input id $ x1 x2;
cards;
11 0 100
12 1 120
20 0 200
28 1 170
;
data d3;
input id $ x1 x2;
cards;
10 0 100
12 1 120
20 0 200
29 1 170
;
run;
那么,我想要的结果:
id freq
11 2
12 3
20 2
如有可能,能给出处就最好了:
id freq locate
11 2    d1 d2
12 3    d1 d2 d3
20 2    d2 d3
谢谢!
作者: shiyiming    时间: 2011-10-18 16:49
标题: Re: 如何统计出重复出现的ID及其次数?
merge by;
and use retain + in= data step option

[quote="gogotiger":2rj8cgdw]有50个data:d1,d2,d3,...d50,这50个data有相同的变量,id,x1,x2,其中ID在单个data里不重复.
问题:我想统计出现在两个或以上data里的ID及其次数.

现在用其中3个data(d1,d2,d3)说明一下:
data d1;
input id $ x1 x2;
cards;
11 0 100
12 1 120
13 0 200
24 1 170
;
data d2;
input id $ x1 x2;
cards;
11 0 100
12 1 120
20 0 200
28 1 170
;
data d3;
input id $ x1 x2;
cards;
10 0 100
12 1 120
20 0 200
29 1 170
;
run;
那么,我想要的结果:
id freq
11 2
12 3
20 2
如有可能,能给出处就最好了:
id freq locate
11 2    d1 d2
12 3    d1 d2 d3
20 2    d2 d3
谢谢![/quote:2rj8cgdw]
作者: shiyiming    时间: 2011-10-18 22:39
标题: Re: 如何统计出重复出现的ID及其次数?
[code:2kwzil5v]data d1;
input id $ x1 x2;
cards;
11 0 100
12 1 120
13 0 200
24 1 170
;
data d2;
input id $ x1 x2;
cards;
11 0 100
12 1 120
20 0 200
28 1 170
;
data d3;
input id $ x1 x2;
cards;
10 0 100
12 1 120
20 0 200
29 1 170
;
run;

proc sql;
  create table ids as
   select distinct id,count(*) as freq
     from (select * from d1 union all select * from d2 union all select * from d3)
        group by id
   having freq>1;
quit;[/code:2kwzil5v]
这个对于小数据还可以,但是现在有50个data,而且每个data都很大(平均obs>=10M),所以,请指教一下有何好办法?谢谢!
oloolo,能给个code吗?谢谢!
作者: shiyiming    时间: 2011-10-19 06:52
标题: Re: 如何统计出重复出现的ID及其次数?
for 50 data sets as large as 10M each, I think a doable solution rather than the speed-only may be more realistic. I think of a solution for this but I cannot verify that since i don't have either SAS or such tremendous data.
it is assumed that your data is sortable. usually, a data set with a single variable and 10 million records, can be sorted in a couple of minutes by the lowest efficiency.
if you want to try the following code, please use back-up data sets.
jingju

[code:b5z3fqru]proc sort data =a1(keep =id) out =aa; by id;
run;
data aa;
        length name $ 124;
        retain name "a1" n 1;
        set aa;
run;

%macro countm;
%do i =2 %to 50;
proc sort data =a&i(keep =id); by id;
run;
data aa;
        update aa a&i(in =ina);
        by id;
        if ina        =1 then do;
                n =sum(n, 1);
                name =catx(',', name, "a&i");
        end;
run;
%end;
%mend;

%countm
;[/code:b5z3fqru]




欢迎光临 SAS中文论坛 (http://mysas.net/forum/) Powered by Discuz! X3.2