SAS中文论坛

标题: 如何统计出重复出现的ID及其次数? [打印本页]

作者: shiyiming 时间: 2011-10-18 11:59
标题: 如何统计出重复出现的ID及其次数?
有50个data:d1,d2,d3,...d50,这50个data有相同的变量,id,x1,x2,其中ID在单个data里不重复.
问题:我想统计出现在两个或以上data里的ID及其次数.

现在用其中3个data(d1,d2,d3)说明一下:
data d1;
input id $ x1 x2;
cards;
11 0 100
12 1 120
13 0 200
24 1 170
;
data d2;
input id $ x1 x2;
cards;
11 0 100
12 1 120
20 0 200
28 1 170
;
data d3;
input id $ x1 x2;
cards;
10 0 100
12 1 120
20 0 200
29 1 170
;
run;
那么,我想要的结果:
id freq
11 2
12 3
20 2
如有可能,能给出处就最好了:
id freq locate
11 2 d1 d2
12 3 d1 d2 d3
20 2 d2 d3
谢谢!

作者: shiyiming 时间: 2011-10-18 16:49
标题: Re: 如何统计出重复出现的ID及其次数?
merge by;
and use retain + in= data step option

[quote="gogotiger":2rj8cgdw]有50个data:d1,d2,d3,...d50,这50个data有相同的变量,id,x1,x2,其中ID在单个data里不重复.
问题:我想统计出现在两个或以上data里的ID及其次数.

现在用其中3个data(d1,d2,d3)说明一下:
data d1;
input id $ x1 x2;
cards;
11 0 100
12 1 120
13 0 200
24 1 170
;
data d2;
input id $ x1 x2;
cards;
11 0 100
12 1 120
20 0 200
28 1 170
;
data d3;
input id $ x1 x2;
cards;
10 0 100
12 1 120
20 0 200
29 1 170
;
run;
那么,我想要的结果:
id freq
11 2
12 3
20 2
如有可能,能给出处就最好了:
id freq locate
11 2 d1 d2
12 3 d1 d2 d3
20 2 d2 d3
谢谢![/quote:2rj8cgdw]

作者: shiyiming 时间: 2011-10-18 22:39
标题: Re: 如何统计出重复出现的ID及其次数?
[code:2kwzil5v]data d1;
input id $ x1 x2;
cards;
11 0 100
12 1 120
13 0 200
24 1 170
;
data d2;
input id $ x1 x2;
cards;
11 0 100
12 1 120
20 0 200
28 1 170
;
data d3;
input id $ x1 x2;
cards;
10 0 100
12 1 120
20 0 200
29 1 170
;
run;

proc sql;
create table ids as
select distinct id,count(*) as freq
from (select * from d1 union all select * from d2 union all select * from d3)
group by id
having freq>1;
quit;[/code:2kwzil5v]
这个对于小数据还可以，但是现在有50个data，而且每个data都很大（平均obs>=10M），所以，请指教一下有何好办法？谢谢！
oloolo，能给个code吗？谢谢！

作者: shiyiming 时间: 2011-10-19 06:52
标题: Re: 如何统计出重复出现的ID及其次数?
for 50 data sets as large as 10M each, I think a doable solution rather than the speed-only may be more realistic. I think of a solution for this but I cannot verify that since i don't have either SAS or such tremendous data.
it is assumed that your data is sortable. usually, a data set with a single variable and 10 million records, can be sorted in a couple of minutes by the lowest efficiency.
if you want to try the following code, please use back-up data sets.
jingju

[code:b5z3fqru]proc sort data =a1(keep =id) out =aa; by id;
run;
data aa;
length name $ 124;
retain name "a1" n 1;
set aa;
run;

%macro countm;
%do i =2 %to 50;
proc sort data =a&i(keep =id); by id;
run;
data aa;
update aa a&i(in =ina);
by id;
if ina =1 then do;
n =sum(n, 1);
name =catx(',', name, "a&i");
end;
run;
%end;
%mend;

%countm
;[/code:b5z3fqru]

欢迎光临 SAS中文论坛 (http://mysas.net/forum/)