标题: [求助] 关于筛选相同记录的问题 [打印本页] 作者: shiyiming 时间: 2004-4-23 17:11 标题: [求助] 关于筛选相同记录的问题 各位大侠:
如果在一个dataset中存在两条或多条相同记录,该如何将他们选出来。
比如:
dataset: allobs中,含 a b c d e.........等多列。
我用proc sql写:
proc sql;
create table sameobs as select * from allobs
group by a, b, c, d, e ........
having count(a)>1;
我想知道,如果不用sql,用proc summary 怎么写?以前看到有人用proc summay写,忘了具体内容了。
bow!作者: shiyiming 时间: 2004-4-23 23:43
I am not sure if there is any advantage to use proc summary, but the most fundamental way to do it is the following:
proc sort data=allobs;
by a b c d e;
run;
data ck;
set allobs;
by a b c d e;
if first.e+last.e ne 2;
run;作者: shiyiming 时间: 2004-4-24 07:49
谢谢 xic。
你总是最快的回复。谢谢你!作者: shiyiming 时间: 2004-4-24 20:16
和你的SQL相对应的proc summary 写法可以是:
proc summary data=allobs nway;
class a b c d e;
output out=someobs(where=(_freq_>1));
run;
data total;
array v(*) v1 - v5;
do i=1 to &totalSize;
do j=1 to dim(v);
v(j) = 100 + 2 - rantbl(-1, (1 - &DupRate ));
end;
output;
end;
drop i j;
run;[/code:b6e99]
用sql,在有cache便宜可占的情况下,还用了4分钟
[code:b6e99]proc sql;
create table out3
as
select *
from total
group by v1,v2,v3,v4,v5
having count(v1)>1;
quit;[/code:b6e99]
[code:18996]
1 options fullstimer msglevel=i;
2
3 %let totalSize = 5000000;
4
5 %let DupRate = 0.8;
6
7 data total;
8 array v(*) v1 - v5;
9 do i=1 to &totalSize;
10 do j=1 to dim(v);
11 v(j) = 100 + 2 - rantbl(-1, (1 - &DupRate ));
12 end;
13 output;
14 end;
15 drop i j;
16 run;
NOTE: The data set WORK.TOTAL has 5000000 observations and 5 variables.
NOTE: DATA statement used:
real time 33.93 seconds
user cpu time 31.43 seconds
system cpu time 1.65 seconds
Memory 96k
NOTE: There were 5000000 observations read from the data set WORK.TOTAL.
NOTE: The data set WORK.OUT2 has 32 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used:
real time 7.07 seconds
user cpu time 6.29 seconds
system cpu time 0.43 seconds
Memory 121k
22
23 proc sql;
24 create table out3
25 as
26 select *
27 from total
28 group by v1,v2,v3,v4,v5
29 having count(v1)>1;
NOTE: Table WORK.OUT3 created, with 32 rows and 5 columns.
30 quit;
NOTE: PROCEDURE SQL used:
real time 3:13.98
user cpu time 20.98 seconds
system cpu time 4.21 seconds
Memory 417157k
[/code:18996]
NOTE: The data set WORK.TOTAL has 5000000 observations and 5 variables.
NOTE: DATA statement used:
real time 16.16 seconds
user cpu time 13.48 seconds
system cpu time 0.70 seconds
Memory 96k
15
16 proc sql;
17 create table out3
18 as
19 select *
20 from total
21 group by v1,v2,v3,v4,v5
22 having count(v1)>1;
NOTE: Table WORK.OUT3 created, with 32 rows and 5 columns.
23 quit;
NOTE: PROCEDURE SQL used:
real time 5:09.97
user cpu time 27.63 seconds
system cpu time 8.31 seconds
Memory 2299k
NOTE: There were 5000000 observations read from the data set WORK.TOTAL.
NOTE: The data set WORK.OUT2 has 32 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used:
real time 13.76 seconds
user cpu time 7.19 seconds
system cpu time 0.39 seconds
Memory 129k[/code:72d04]
IBM ThinkPad T23
PIII 1.13G 512M SDRAM作者: shiyiming 时间: 2004-4-29 12:16
[quote="shiyiming":8b942]我的结果和结论与SAS_Dream非常相似,机器刚重起,没有其他作业。
我把SQL和SUMMARY的位置对调了,结果还是一样。
运行完程序后,现在机器变得暴慢!
[/quote:8b942]
另外,老大可以调一调config里的sortsize, 调到和你的物理内存一样,这时你可以观察到SQL使用的内存在加大,速度也会不同。作者: shiyiming 时间: 2004-4-29 17:21
[quote:19a25]1 options fullstimer msglevel=i;
2
3 %let totalSize = 5000000;
4
5 %let DupRate = 0.8;
6
7 data total;
8 array v(*) v1 - v5;
9 do i=1 to &totalSize;
10 do j=1 to dim(v);
11 v(j) = 100 + 2 - rantbl(-1, (1 - &DupRate ));
12 end;
13 output;
14 end;
15 drop i j;
16 run;
NOTE: The data set WORK.TOTAL has 5000000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 9.73 seconds
user cpu time 4.43 seconds
system cpu time 1.06 seconds
Memory 80k
NOTE: Multiple concurrent threads will be used to summarize data.
NOTE: There were 5000000 observations read from the data set WORK.TOTAL.
NOTE: The data set WORK.OUT2 has 32 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.03 seconds
user cpu time 8.56 seconds
system cpu time 1.14 seconds
Memory 94k
22
23 proc sql;
24 create table out3
25 as
26 select *
27 from total
28 group by v1,v2,v3,v4,v5
29 having count(v1)>1;
NOTE: Table WORK.OUT3 created, with 32 rows and 5 columns.
30 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 18.60 seconds
user cpu time 18.04 seconds
system cpu time 2.48 seconds
Memory 316k
[/quote:19a25]
联想 P4 3.2G 内存2G DDR400 台式电脑作者: shiyiming 时间: 2004-4-29 17:23
[quote:bca75]1 options fullstimer msglevel=i;
2
3 %let totalSize = 5000000;
4
5 %let DupRate = 0.8;
6
7 data total;
8 array v(*) v1 - v5;
9 do i=1 to &totalSize;
10 do j=1 to dim(v);
11 v(j) = 100 + 2 - rantbl(-1, (1 - &DupRate ));
12 end;
13 output;
14 end;
15 drop i j;
16 run;
NOTE: The data set WORK.TOTAL has 5000000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 9.73 seconds
user cpu time 4.43 seconds
system cpu time 1.06 seconds
Memory 80k
NOTE: Multiple concurrent threads will be used to summarize data.
NOTE: There were 5000000 observations read from the data set WORK.TOTAL.
NOTE: The data set WORK.OUT2 has 32 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.03 seconds
user cpu time 8.56 seconds
system cpu time 1.14 seconds
Memory 94k
22
23 proc sql;
24 create table out3
25 as
26 select *
27 from total
28 group by v1,v2,v3,v4,v5
29 having count(v1)>1;
NOTE: Table WORK.OUT3 created, with 32 rows and 5 columns.
30 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 18.60 seconds
user cpu time 18.04 seconds
system cpu time 2.48 seconds
Memory 316k