SAS中文论坛

Title: Writing Efficient SAS Code

Author: shiyiming Time: 2004-5-18 14:04
It is unfortunate that efficient computing gets less and less attention as CPUs become progressively faster and RAM and disk become progressively cheaper. For a small dataset, the difference between inefficient and efficient SAS code may be unnoticeable, but for a large dataset, efficient computing is still very important.

There are two aspects of efficiency:

Efficient use of computing resources: Given that two program segments produce the same results, the better program is the one that consumes fewer computing resources, which include CPU cycles, RAM, and disk storage. In some situations there is a trade-off between CPU and RAM on one side and storage on the other. For instance, if an index of the data is stored on the hard disk, sorting and searching take less CPU and RAM.

Efficient use of human resources: If two sets of code consume equal amounts of computing resources and produce the same results, but one requires less human effort (typing, modification, maintenance, etc.), that one is considered more efficient.

Usually a more compact program requires less computing power and may even require fewer human resources. For instance, Novell Netware has 10 million lines of source code whereas Windows 2000 has 40 to 45 million lines. Even if the two network operating systems carry the same features, the one with fewer lines of code is considered more desirable.
Besides shortening the program, there are other ways to achieve efficient computing. This write-up illustrates efficient computing with examples of SAS code.

[b:fcb0c]Logical branching and comparison[/b:fcb0c]
The first example is conditional branching. When a blocking factor such as "age" is used in computing an ANOVA model, conditional branching should be employed. Compare the following two sets of code:
[quote:fcb0c]/* Less efficient: every IF is evaluated for every observation */
length group $ 11;   /* long enough for "young adult" */
if age <= 10
      then group = "child";
if age >= 11 and age <= 19
      then group = "teenager";
if age >= 20 and age <= 29
      then group = "young adult";
if age >= 30 and age <= 45
      then group = "adult";
if age >= 46 and age <= 59
      then group = "middle age";
if age >= 60
      then group = "senior";[/quote:fcb0c]

[quote:fcb0c]/* More efficient: the chain stops at the first condition that is true */
length group $ 11;   /* long enough for "young adult" */
if age <= 10
      then group = "child";
else if 11 <= age <= 19
      then group = "teenager";
else if 20 <= age <= 29
      then group = "young adult";
else if 30 <= age <= 45
      then group = "adult";
else if 46 <= age <= 59
      then group = "middle age";
else group = "senior";[/quote:fcb0c]

Which set of code is more efficient? The second one, for two reasons:

1. The first program segment uses "if" instead of "else if" after the first if-then statement. For every observation, SAS must therefore evaluate all six if-then statements to classify the subject into the proper age group. In the second program, once an observation satisfies a condition, SAS skips the remaining "else if" branches for that observation. A child is classified by the first if-then statement and is never tested again, a teenager is tested only twice, and so forth.

2. The first program uses "AND" to combine two comparisons, whereas the second writes each range as a single compound comparison such as 11 <= age <= 19. This does not make a noticeable difference in computing resources, but it is shorter to type and easier to read.

[b:fcb0c]More efficient[/b:fcb0c] (the same branching written as a SELECT group)
[quote:fcb0c]length group $ 11;   /* long enough for "young adult" */
select;
when (age <= 10) group = "child";
when (11 <= age <= 19) group = "teenager";
when (20 <= age <= 29) group = "young adult";
when (30 <= age <= 45) group = "adult";
when (46 <= age <= 59) group = "middle age";
otherwise group = "senior";
end;[/quote:fcb0c]
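
To try the SELECT group end to end, a minimal self-contained sketch might look like the following. The dataset names ages and grouped and the sample values are made up for illustration. The LENGTH statement is there so that the longest label is not truncated to the length of the first one assigned.

[quote:fcb0c]/* Hypothetical sample data: one age per observation */
data ages;
   input age @@;
   datalines;
5 15 25 40 50 70
;
run;

data grouped;
   set ages;
   length group $ 11;          /* longest label is "young adult" */
   select;
      when (age <= 10)       group = "child";
      when (11 <= age <= 19) group = "teenager";
      when (20 <= age <= 29) group = "young adult";
      when (30 <= age <= 45) group = "adult";
      when (46 <= age <= 59) group = "middle age";
      otherwise              group = "senior";
   end;
run;

/* Inspect the result */
proc print data=grouped;
run;[/quote:fcb0c]

As with the "else if" chain, SAS stops at the first WHEN condition that is true for an observation, so the order of the conditions matters.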

[b:fcb0c]Overwriting the same dataset and variables[/b:fcb0c]

The following may go against common sense. On some occasions, it is advisable to overwrite the same dataset and variables even after you have changed them. Doing so frees SAS from holding too many datasets on disk. Take a look at the following two pieces of pseudocode:

[quote:fcb0c]data one; infile "c:\data.txt";
      /* define variables */
data two; set one;
      /* first program segment */
data three; set two;
      /* second program segment */
data four; set three;
      /* third program segment */
run;[/quote:fcb0c]

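[quote:fcb0c]data one; infile "c:\data.txt";
      /* define variables */
run;
data two; set one;
      /* first program segment */
run;
/* DELETE inside a data step drops observations, not datasets,
   so PROC DATASETS is used to remove the dataset itself */
proc datasets library=work nolist; delete one; quit;
data one; set two;
      /* second program segment */
run;
proc datasets library=work nolist; delete two; quit;
data one; set one;
      /* third program segment */
run;[/quote:fcb0c]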

The first program segment keeps all four datasets on disk the whole time, which may be unnecessary. If you will not reuse the temporary datasets, there is no need to keep all of them. The second program segment therefore reuses the same dataset names and removes each intermediate dataset once it is no longer needed.
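
The simplest form of this idea is to let each data step read and replace the same dataset. Here is a minimal sketch, in which the variables x, y, and z and the sample values are hypothetical:

[quote:fcb0c]/* Hypothetical sample data */
data one;
   input x @@;
   datalines;
1 2 3
;
run;

/* Each step reads WORK.ONE and replaces it, so only one dataset is kept */
data one;
   set one;
   y = x * 2;
run;

data one;
   set one;
   z = y + 1;
run;

/* List the datasets that remain in the WORK library */
proc contents data=work._all_ nods;
run;[/quote:fcb0c]

The trade-off is that an intermediate result cannot be inspected later, so this pattern suits steps you do not need to revisit.
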
Not only should you overwrite the same dataset, but you should also overwrite the same variables where appropriate. Compare the following two sets of SAS code:

[quote:fcb0c]/* Recoded values are stored in a second set of variables */
array a{10} a1-a10;
array b{10} b1-b10;
do i = 1 to 10;
      if a{i} = 7 then b{i} = 0;
      else if a{i} = 6 then b{i} = 1;
      else if a{i} = 5 then b{i} = 2;
      else if a{i} = 4 then b{i} = 3;
      else if a{i} = 3 then b{i} = 4;
      else if a{i} = 2 then b{i} = 5;
      else if a{i} = 1 then b{i} = 6;
      else if a{i} = 0 then b{i} = 7;
end;[/quote:fcb0c]

[quote:fcb0c]/* Recoded values are written back to the original variables */
array a{10} a1-a10;
do i = 1 to 10;
      if a{i} = 7 then a{i} = 0;
      else if a{i} = 6 then a{i} = 1;
      else if a{i} = 5 then a{i} = 2;
      else if a{i} = 4 then a{i} = 3;
      else if a{i} = 3 then a{i} = 4;
      else if a{i} = 2 then a{i} = 5;
      else if a{i} = 1 then a{i} = 6;
      else if a{i} = 0 then a{i} = 7;
end;[/quote:fcb0c]

It is common practice for researchers to recode data, and the preceding SAS code does just that. It is also not unusual for people to create a new set of variables to store the recoded data, as in the first program. The second program is more efficient because it writes the new values back to the original variables rather than creating new ones. At first glance, the second program looks as if it should not work: if the value "7" has been changed to "0" by the first if-then statement and written back to the variable, will the new "0" be turned back into "7" by the last statement? No, because "else if" is used instead of "if." Once an element matches a{i} = 7 and is changed, SAS skips the remaining "else if" branches for that element, so the new value is never tested again. This is another reason to use "else if" rather than "if."
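
Because this particular recode simply reverses a 0-to-7 scale, the whole chain could arguably be replaced by one arithmetic assignment. This is a swapped-in shortcut, not the method described above, and it assumes every non-missing value is an integer from 0 to 7. The dataset names survey and recoded and the sample row are hypothetical.

[quote:fcb0c]/* Hypothetical sample data: ten items scored 0 to 7, one missing */
data survey;
   input a1-a10;
   datalines;
7 6 5 4 3 2 1 0 7 .
;
run;

data recoded;
   set survey;
   array a{10} a1-a10;
   do i = 1 to 10;
      /* reverse the scale in place; leave missing values untouched */
      if not missing(a{i}) then a{i} = 7 - a{i};
   end;
   drop i;
run;

proc print data=recoded;
run;[/quote:fcb0c]

Unlike the "else if" chain, this version would also change values outside 0 to 7, so it is only appropriate when the data are known to be clean.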

[b:fcb0c]Using numeric variable names[/b:fcb0c]
This tip is very simple, but it is often overlooked: use numbers rather than letters at the end of variable names. Although the choice makes no difference in CPU usage, it does make a difference in human resources (typing and looking up field names)! Look at the following two sets of variable definitions:

[quote:fcb0c]Data one; input
      Q1 Q1b Q1c Q1d Q1other
      Time_SH Time_Wk
      Com_Ex Web_Ex Res_Ex
      Q4a Q4b Q5c Q5d;
cards;  [/quote:fcb0c]

[quote:fcb0c]Data one;
      input Q1-Q16;
cards;  [/quote:fcb0c]

In SAS you can refer to variables as "Q1-Q26," but you cannot refer to them as "Qa-Qz." With numbered variable names you save time on typing and on matching the names on the hard copy with the variable names on the screen. When you have many variables, letter-based names make referencing extremely difficult. When I was an inexperienced SAS programmer many years ago, I coded a survey with several hundred fields using character-based names. As a result...you know!

Further, when someday you want to rename the variables, using numeric names will be very convenient. For example, to rename Q1-Q100 as Question1-Question100, the code is: data new(rename=(q1-q100 = question1-question100));
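
The same numbered range list can be reused wherever SAS expects a variable list. Here is a small sketch with five hypothetical items (the dataset names, the variable total, and the sample values are made up for illustration):

[quote:fcb0c]/* Hypothetical sample data: five numbered survey items */
data one;
   input Q1-Q5;
   datalines;
1 2 3 4 5
5 4 3 2 1
;
run;

data two;
   set one;
   total = sum(of Q1-Q5);     /* the range list works inside functions */
run;

proc means data=two;
   var Q1-Q5 total;           /* and inside procedure statements */
run;[/quote:fcb0c]

With names such as Q1a, Q1b, and Q1other, each variable would have to be typed out in all three places.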

[b:fcb0c]Using a value list in a variable[/b:fcb0c]

This tip not only reduces the use of CPU and memory resources, but also saves you from tedious coding. The following two pieces of code perform the same task. The first repeats the same comparison using "or," whereas the second simply tests the variable against a list of values with the IN operator. If you know the concepts of array and list, you know that processing a list or an array of data is faster than processing data one by one.


[quote:fcb0c]If name = "Tom" or
  name = "Peter" or
  name = "Mary" or
  name = "Alex" or
  name = "Jane" or
  name = "Louis" then delete;[/quote:fcb0c]

[quote:fcb0c]If name in
("Tom","Peter","Mary",
"Alex","Jane","Louis")
then delete;[/quote:fcb0c]
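
A self-contained sketch of the IN version follows. The dataset names people and kept and the sample names are hypothetical.

[quote:fcb0c]/* Hypothetical sample data */
data people;
   length name $ 10;
   input name $ @@;
   datalines;
Tom Ann Peter Zoe Mary Louis
;
run;

data kept;
   set people;
   /* drop every observation whose name appears in the list */
   if name in ("Tom","Peter","Mary","Alex","Jane","Louis") then delete;
run;

/* Ann and Zoe are the only sample names not in the list */
proc print data=kept;
run;[/quote:fcb0c]

Note that the comparison is case-sensitive, so "tom" would not match "Tom".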

[b:fcb0c]PROC SQL vs. PROC SUMMARY[/b:fcb0c]

I once wrote an inefficient SAS program to extract user log data from a web server. Eldon pointed out that for parsing data, Structured Query Language (SQL) is more powerful than the regular data step method. For instance, the first piece of code below uses three steps (a data step and two PROCs) to rank webpages by the number of page accesses. The second, which uses SQL, does the job in one PROC. It also does not need to create an extra dataset, and thus avoids consuming further computer resources.


[quote:fcb0c]data two; set one;
count = 1;
proc summary data=two nway;   /* nway: keep only the per-page rows */
   class page; var count;
output out=new sum= ;
proc sort data=new; by descending count;
run;[/quote:fcb0c]

[quote:fcb0c]/* reads dataset one directly; no extra data step is needed */
proc sql;
select page, count(*) as count
from one
group by page
order by count desc;
quit;[/quote:fcb0c]
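
If the ranking needs to be kept for later steps rather than only printed, the same query can write a table. This is a sketch under stated assumptions: the page names, the column name hits, and the output dataset ranked are hypothetical.

[quote:fcb0c]/* Hypothetical sample data: one row per page access */
data one;
   length page $ 20;
   input page $ @@;
   datalines;
index.html about.html index.html faq.html index.html about.html
;
run;

proc sql;
   create table ranked as
   select page, count(*) as hits   /* number of accesses per page */
   from one
   group by page
   order by hits desc;             /* most visited page first */
quit;

proc print data=ranked;
run;[/quote:fcb0c]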

Welcome to SAS中文论坛 (http://mysas.net/forum/)