Title: Efficient SAS programming with Large Data
1. Efficient SAS programming with Large Data
- Aidan McDermott
- Computing Group, March 2007
2. Axes of Efficiency
- processing speed
  - CPU
  - real
- storage
  - disk
  - memory
- user
  - functionality
  - interface to other systems
  - ease of use
  - learning
- user development
  - methodologies
  - reusable code
  - facilitate extension, rewriting
  - maintenance
3. Dataset / Table
4.
- Datasets consist of three parts
5. General (and obvious) principles
- Avoid doing the job if possible
- Keep only the data you need to perform a particular task (use drop, keep, where, and subsetting if statements)
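A minimal sketch of these four tools together (dataset and variable names are hypothetical):

```sas
data work.subset(drop=length);                      /* drop= : discard a variable not needed downstream */
   set mylib.admits(keep=patientID admit length     /* keep= : read only the variables you need */
                    where=(admit >= '01JAN2006'd)); /* where= : filter rows as they are read */
   if length > 30 then delete;                      /* subsetting if, applied inside the step */
run;
```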
6. Combining datasets -- concatenation
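A sketch of concatenation with the SET statement (dataset names hypothetical):

```sas
/* Concatenation: all rows of admits2005 followed by all rows of admits2006 */
data work.alladmits;
   set work.admits2005 work.admits2006;
run;
```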
7. General (and obvious) principles
- Often, efficient methods have already been written to perform the required task; use them.
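One example of such a purpose-built method (dataset names hypothetical): PROC APPEND adds one dataset to the end of another without re-reading the base dataset, unlike a DATA step with SET:

```sas
/* Appends admits2007 to alladmits; only admits2007 is read */
proc append base=work.alladmits data=work.admits2007;
run;
```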
8. General (and obvious) principles
- Often, efficient methods have been written to perform other tasks; use them with caution.
- Write data-driven code
  - it's easier to maintain data than to update code
- Use length statements to limit the size of variables in a dataset to no more than is needed
  - you don't always know what size this should be, and you don't always produce your own data
- Use formatted data rather than the data itself
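A sketch of both ideas with hypothetical variables: LENGTH trims storage, and a FORMAT presents a compact stored value without storing a longer one:

```sas
data work.small;
   length gender $ 1 state $ 2;   /* character variables no wider than needed */
   length los 4;                  /* 4-byte numeric is safe for small integer counts */
   set work.big;
   format admit date8.;           /* store a numeric date, display it formatted */
run;
```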
9. Memory-resident datasets
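One way to hold a dataset in memory across steps is the SASFILE statement (SAS 9 and later; dataset name hypothetical):

```sas
sasfile mylib.ozone load;     /* read the dataset into memory once */

proc means data=mylib.ozone;  /* subsequent steps read from memory */
   var ozone;
run;

proc freq data=mylib.ozone;
   tables region;
run;

sasfile mylib.ozone close;    /* release the buffers */
```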
10. Compressing Datasets
- Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress before running each SAS job
  - delays execution, and you need to keep track of data and program dependencies
- Use a general-purpose compression utility and decompress within SAS for sequential access
  - system dependent (needs a named pipe), sequential dataset storage only
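On Unix, the second approach can be sketched with a PIPE fileref (file name and record layout hypothetical); the data can only be read sequentially:

```sas
/* gzip -dc decompresses to stdout; SAS reads the stream as it arrives */
filename gz pipe 'gzip -dc /data/ozone.csv.gz';

data work.ozone;
   infile gz dsd firstobs=2;
   input region $ siteid $ date :date9. ozone;
run;
```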
11. Compressing Datasets
12. SAS internal compression
- Allows random access to data and is very effective under the right circumstances; in some cases it doesn't reduce the size of the data by much.
- There is a trade-off between data size and CPU time.
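SAS internal compression is controlled with the COMPRESS option, set either session-wide or per dataset (library and dataset names hypothetical):

```sas
options compress=yes;              /* session-wide default (run-length encoding) */

data mylib.packed(compress=char);  /* or per dataset; compress=binary is also available */
   set mylib.big;
run;
```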
13.
- indata is a large dataset and you want to produce a version of indata without any observations
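An efficient way to do this is the OBS=0 dataset option, which copies the variable definitions without reading any rows:

```sas
data outdata;
   set indata(obs=0);   /* no observations are read; the structure is preserved */
run;
```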
14.
- The data step is a two-stage process
  - compile phase
  - execute phase
15. Data step logic
16. Data step logic
17. (No Transcript)
18. Data step
19.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV compile phase
Name type size drop retain format value
patientID C 6 n y
gender C 1 n y
admit N 8 n y date8.
length N 8 n y
discharge N 8 n n date8.
_N_
_ERROR_ 0
20.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8.
_N_ 1
_ERROR_ 0
21.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8. 15757
_N_ 1
_ERROR_ 0
22.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;   /* implicit output */
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8. 15757
_N_ 1
_ERROR_ 0
23.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8.
_N_ 2
_ERROR_ 0
24. Efficiency: suspend the PDV activities
25. General principles
- Use BY processing whenever you can
- Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.
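The task above can be sketched with BY processing (dataset and variable names assumed from the slide):

```sas
proc sort data=work.ozone out=work.sorted;
   by region siteid date;
run;

proc means data=work.sorted noprint;
   by region siteid date;            /* one pass over the sorted data */
   var ozone;
   output out=work.daily(drop=_type_ _freq_)
          mean=mean_ozone max=max_ozone;
run;
```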
26. General principles
27. General principles
- Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
- Combine multiple observations onto one line and then compute the statistics?
- Suppose you want the 10% trimmed mean?
- Suppose you want the second maximum?
- Use arrays to sort the data?
- Write your own function?
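For the 10% trimmed mean, one option is PROC UNIVARIATE's TRIMMED= option captured with ODS OUTPUT (a sketch; dataset names hypothetical, and the second maximum would still need separate handling):

```sas
ods select none;                       /* suppress printed output but keep ODS tables */
proc univariate data=work.sorted trimmed=0.1;
   by region siteid date;
   var ozone;
   ods output TrimmedMeans=work.trim;  /* trimmed means captured as a dataset */
run;
ods select all;
```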
28. (No Transcript)
29. (No Transcript)