标题: A test to count missing values for large data [打印本页] 作者: shiyiming 时间: 2011-9-19 23:36 标题: A test to count missing values for large data From Dapangmao's blog on sas-analysis
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-9P4IRzHAraU/TnddcKSfvQI/AAAAAAAAAwI/MEs_cB2EoVo/s1600/Presentation1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://1.bp.blogspot.com/-9P4IRzHAraU/TnddcKSfvQI/AAAAAAAAAwI/MEs_cB2EoVo/s400/Presentation1.jpg" width="400" /></a></div>This morning Rick introduced <a href="http://blogs.sas.com/content/iml/2011/09/19/count-the-number-of-missing-values-for-each-variable/">how to count the missing frequencies for all variables in a dataset</a>, including character and numeric variables. He provided two solutions by either PROC FREQ or PROC IML. I have a petty macro based on PROC SQL’s nmiss() function to do the same job. In this big data era, I am interested in those SAS codes’ efficiencies to check large data. <br />
<br />
Then I simulated a 1e8 size dataset (5 numerical and 5 character variables) and tested the 3 methods. According to the results, PROC FREQ is slightly better than PROC SQL in both memory usage and speed. However, a macro may be needed to integrate a number of output tables by PROC FREQ for reporting <a href="http://support.sas.com/kb/44/124.html">(the demo macro</a> by SAS Knowledge Base is primitive). Sadly, PROC IML used out all the 250MB memories the server allocated to me and can’t finish this test. Similarly, the memory volume should be considered for other matrix languages like R and Matlab. <br />
<pre style="background-color: #ebebeb; border: 1px dashed rgb(153, 153, 153); color: #000001; font-size: 14px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>
**********************(1) SIMULATION STEP**************************************;
data test;
array _c{*} $ c1-c5;
array _n{*} n1-n5;
do i = 1 to 1e8;
do j = 1 to 5;
if ranuni(1234)>0.2 then _c[j]= 'A';
else call missing(_c[j]);
if ranuni(4321)>0.3 then _n[j] = rannor(0);
else call missing(_n[j]);
end;
output;
drop i j;
end;
run;
**********************(2) MODULE-BUILDING STEP*********************************;
%macro countmiss(data = );
ods listing close;
ods output variables = _varlist;
proc contents data = &data;
run;
proc sql;
alter table _varlist
add nmiss num
;
select count(*) into :nvar
from _varlist
;
select count(*) into :nobs
from &data
;
quit;
%do i = 1 %to &nvar;
proc sql;
select cats('nmiss(', variable, ')') into :miss_var
from _varlist
where num = &i
;
select &miss_var into: miss_num
from &data
;
update _varlist
set nmiss = &miss_num
where num = &i
;
quit;
%end;
proc sort data = _varlist;
by num;
run;
ods listing;
proc report data = _varlist nowd split = '*';
columns type num variable label nmiss pcnmiss;
define type / group;
define variable / width = 15;
define nmiss / display 'Missing*Number';
define pcnmiss / computed format = percent8.3
'Missing*Percentage' width = 10;
compute pcnmiss;
pcnmiss = nmiss / &nobs;
endcomp;
run;
%mend countmiss();
proc format;
value $missfmt ' '='Missing' other='Not Missing';
value missfmt . ='Missing' other='Not Missing';
run;