SAS中文论坛 (SAS Chinese Forum)

Title: Asking about a "complex" data-processing method

Author: shiyiming Time: 2004-5-31 22:50
Title: Asking about a "complex" data-processing method
Suppose we have the following data set:
data tem;
input id x y z cat;
cards;
1 2.3 4.6 6.9 1
2 2.4 4.6 7.3 1
3 2.4 4.8 7.2 1
4 3.2 6.4 9.6 1
5 4.5 9 13.5 1
6 2.6 5.2 7.8 2
7 3.5 7 10.5 2
8 5.6 11.2 16.8 2
9 2.3 4.6 6.9 2
10 2.3 4.6 6.9 2
11 2.4 4.8 7.2 3
12 3.2 6.4 9.6 3
13 4.5 9 13.5 3
14 2.6 5.2 7.8 3
15 3.5 7 10.5 3
16 5.6 11.2 16.8 3
;
run;

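The goal is to derive the following data set from it:

id x y z cat Di1 Di2 Di3
1 2.3 4.6 6.9 1 [mean Euclidean distance to all objects with cat=1] [mean Euclidean distance to all objects with cat=2] [mean Euclidean distance to all objects with cat=3]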
2 2.4 4.6 7.3 1 … … …
3 2.4 4.8 7.2 1 … … …
4 3.2 6.4 9.6 1 … … …
5 4.5 9 13.5 1 … … …
6 2.6 5.2 7.8 2 … … …
7 3.5 7 10.5 2 … … …
8 5.6 11.2 16.8 2 … … …
9 2.3 4.6 6.9 2 … … …
10 2.3 4.6 6.9 2 … … …
11 2.4 4.8 7.2 3 … … …
12 3.2 6.4 9.6 3 … … …
13 4.5 9 13.5 3 … … …
14 2.6 5.2 7.8 3 … … …
15 3.5 7 10.5 3 … … …
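16 5.6 11.2 16.8 3 … … …
Here Di1 is the mean Euclidean distance, over the dimensions x, y, and z, between the given object and all objects with cat=1. Di2 and Di3 are defined analogously.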
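
Written out as a formula, the definition might read as follows. This is a sketch under two assumptions that are not in the post itself: the symbol $C_k$ (the set of objects with cat=$k$) is introduced here only for illustration, and the object itself is left out of its own category, which follows the poster's later remark about "all the other objects". Dropping the condition $j \ne i$ gives the variant that also counts the zero self-distance.

```latex
D_{ik} \;=\; \frac{1}{\lvert C_k \setminus \{i\} \rvert}
\sum_{\substack{j \in C_k \\ j \ne i}}
\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}

% One term of the sum, for the objects id=1 and id=2 of the sample data:
d(1,2) = \sqrt{(2.3-2.4)^2 + (4.6-4.6)^2 + (6.9-7.3)^2}
       = \sqrt{0.01 + 0 + 0.16} = \sqrt{0.17} \approx 0.412
```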

I think this is fairly compute-intensive. If the data set has 500,000 records, will it be very hard to process?
Many thanks!

Author: shiyiming Time: 2004-6-1 05:48
Title: try
For your data set, the following program will work.

[code:1b10c]proc sql;
create table center as
select cat, mean(x) as mx, mean(y) as my, mean(z) as mz
from tem
group by cat
order by cat;
quit;

proc sql;
create table temp as
select t.*,
   sqrt((t.x-m1.mx)**2+(t.y-m1.my)**2+(t.z-m1.mz)**2) as di1,
   sqrt((t.x-m2.mx)**2+(t.y-m2.my)**2+(t.z-m2.mz)**2) as di2,
   sqrt((t.x-m3.mx)**2+(t.y-m3.my)**2+(t.z-m3.mz)**2) as di3
from tem t
left join center m1
on m1.cat=1
left join center m2
on m2.cat=2
left join center m3
on m3.cat=3
order by id;
quit;[/code:1b10c]

I did not consider efficiency. If efficiency matters, the key is not the number of records but the number of categories you have. It is possible to drop the second step and do the same work in a DATA step instead. Please read the documentation of the SET statement; you may need two SET statements in one DATA step to combine the information from TEM and CENTER.
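
The DATA step variant with two SET statements could look roughly like the sketch below. It is only one possible reading of the suggestion: the output name `temp2`, the helper variables `p`, `j`, `ccat`, and `ncenter`, and the hard-coded array size of 3 (which assumes cat takes exactly the values 1, 2, and 3) are all assumptions made for illustration. Like the SQL version, it measures the distance to each category center, not the mean pairwise distance.

```sas
data temp2;
   /* Temporary arrays caching the category centers (retained automatically). */
   array cx[3] _temporary_;
   array cy[3] _temporary_;
   array cz[3] _temporary_;

   /* First SET: on the first iteration only, read every row of CENTER
      by direct access and store it under its category number. */
   if _n_ = 1 then do p = 1 to ncenter;
      set center(rename=(cat=ccat)) point=p nobs=ncenter;
      cx[ccat] = mx;
      cy[ccat] = my;
      cz[ccat] = mz;
   end;

   /* Second SET: read TEM sequentially; the step stops when TEM is exhausted. */
   set tem;

   array di[3] di1-di3;
   do j = 1 to 3;
      di[j] = sqrt((x - cx[j])**2 + (y - cy[j])**2 + (z - cz[j])**2);
   end;

   drop j ccat mx my mz;
run;
```

TEM is read only once here, and CENTER is read only on the first iteration, so the cost grows with the number of records times the number of categories.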

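Author: shiyiming Time: 2004-6-1 08:01
Thanks, xic.

Perhaps I did not express myself clearly:
Di1, Di2, and Di3 are obtained by computing the distance from each object to all the other objects and then averaging those distances by cat. They are not obtained by first computing the center of each cat and then measuring the distance from each object to that center, although that approach certainly needs far less computation :)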
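
The two quantities really do differ. As a short sketch (the symbols $p$, $q_j$, and $c$ are introduced here for illustration only): for an object $p$ and a category with members $q_1, \dots, q_n$ and center $c = \frac{1}{n}\sum_j q_j$, the triangle inequality shows that the distance to the center is only a lower bound on the mean pairwise distance.

```latex
\lVert p - c \rVert
  = \Bigl\lVert \frac{1}{n} \sum_{j=1}^{n} (p - q_j) \Bigr\rVert
  \;\le\; \frac{1}{n} \sum_{j=1}^{n} \lVert p - q_j \rVert

% One-dimensional example: p = 0, q_1 = -1, q_2 = 1.
% Center: c = 0, so the distance to the center is 0,
% while the mean pairwise distance is (1 + 1)/2 = 1.
```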

Author: shiyiming Time: 2004-6-1 23:27
Is nobody interested in my question? :(

Author: shiyiming Time: 2004-6-2 10:26
Title: This might help
The following program might help you.
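[code:2aa33]%macro fl;
/* Step 1: split TEM into one data set per category (tem1, tem2, ...). */
proc sql noprint;
  select max(cat) into :cat_max from tem;
%do i=1 %to &cat_max;
  create table tem&i as select * from tem where cat=&i;
%end;
  select count(id) into :id_count from tem;
quit;
/* Step 2: take the records one at a time.
   This assumes that id runs from 1 to the record count without gaps. */
%do i=1 %to &id_count;
  proc sql noprint;
  select id, x, y, z into :tid, :tx, :ty, :tz from tem where id=&i;
  quit;
  %do j=1 %to &cat_max;
/* Distances from the current record to every record of one category.
   The record itself is included, with distance 0. */
data h;
   set tem&j;
   di=sqrt((x-&tx)**2+(y-&ty)**2+(z-&tz)**2);
   tid=&tid; tx=&tx; ty=&ty; tz=&tz; cat=&j;
   keep tid tx ty tz di cat;
run;
/* Step 3: accumulate the results in RE (delete RE before a rerun). */
proc append base=re data=h;
run;
  %end;
%end;
%mend;
%fl[/code:2aa33]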
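My approach is:
1. Split the original data set by cat into several smaller data sets.
2. Take the records from the original data set one at a time and compute the distances category by category.
3. Store the result of each computation in a single data set.
4. Tidy up that final data set to get what you need (I have not given the code for this step).
The program above has time complexity O(n×n) and space complexity O(n); it is indeed very slow.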
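
Step 4, for which no code was given, could be sketched as follows. The data set names `re_mean`, `re_wide`, and `want` are hypothetical, and the sketch assumes that RE was built by the macro above, so it holds one row per pair of records (n×n rows in total) with the variables tid, cat, and di. Because the macro also pairs each record with itself, the zero self-distance is included in the mean for the record's own category here.

```sas
/* Average the distances per object (tid) and per category of the other objects. */
proc sql;
   create table re_mean as
   select tid as id, cat, mean(di) as di
   from re
   group by tid, cat
   order by id, cat;
quit;

/* Turn the categories into columns Di1, Di2, Di3 (one row per id). */
proc transpose data=re_mean out=re_wide(drop=_name_) prefix=Di;
   by id;
   id cat;
   var di;
run;

/* Attach the means to the original records; TEM is already in id order. */
data want;
   merge tem re_wide;
   by id;
run;
```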

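Author: shiyiming Time: 2004-6-2 21:04
Title: another solution
With 500,000 records the computation will probably take quite a long time. It depends on your machine's performance and on how well the database has been tuned.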

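[code:e1d61]proc sql;
/* All pairwise distances: a is the object, b supplies the category. */
create table pairs as
select a.id as id,b.cat as cat,sqrt((a.x-b.x)**2+(a.y-b.y)**2+(a.z-b.z)**2) as di
from tem as a cross join tem as b;

/* Average per object and category. A separate input table avoids the
   warning about a CREATE TABLE that reads its own target table. */
create table result as
select id,cat,avg(di) as di from pairs group by id,cat order by id,cat;
quit;

/* One row per id with the columns di1, di2, di3. */
proc transpose data=result out=result(drop=_name_) prefix=di;
by id;
id cat;
var di;
run;[/code:e1d61]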
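
One possible refinement, offered only as a sketch: compute the mean directly in a single query so that the full table of pairwise distances is never stored as a named data set, and leave out the pairing of a record with itself, as the original poster asked. The table name `result_long` is hypothetical. PROC SQL may still use utility files internally for the grouping, and the number of distance evaluations stays quadratic: for 500,000 records that is about 500,000 × 500,000 = 2.5 × 10^11 pairs.

```sas
proc sql;
   create table result_long as
   select a.id, b.cat,
          mean(sqrt((a.x-b.x)**2 + (a.y-b.y)**2 + (a.z-b.z)**2)) as di
   from tem as a, tem as b
   where a.id ne b.id          /* skip the distance of a record to itself */
   group by a.id, b.cat
   order by a.id, b.cat;
quit;
```

The PROC TRANSPOSE step shown above can then be pointed at `result_long` to obtain one row per id.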

Author: shiyiming Time: 2004-6-3 00:28
Title: How to optimize
willon, how can this code be optimized when it is run in SAS?

Welcome to SAS中文论坛 (http://mysas.net/forum/)