Title: Efficient SAS programming with Large Data
1. Efficient SAS programming with Large Data
- Aidan McDermott
- Computing Group, March 2007
2. Axes of Efficiency
- processing speed
  - CPU
  - real
- storage
  - disk
  - memory
- user
  - functionality
  - interface to other systems
  - ease of use
  - learning
- user development
  - methodologies
  - reusable code
  - facilitate extension, rewriting
  - maintenance
3. Dataset / Table
4.
- Datasets consist of three parts
5. General (and obvious) principles
- Avoid doing the job if possible
- Keep only the data you need to perform a particular task (use drop, keep, where, and subsetting if statements)
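A minimal sketch of these four tools together (dataset and variable names are hypothetical):

```sas
data work.subset(drop=length);                      /* drop= : discard a variable not needed downstream */
   set mylib.admits(keep=patientID admit length     /* keep= : read only the variables you need */
                    where=(admit >= '01JAN2006'd)); /* where= : filter rows as they are read */
   if length > 30 then delete;                      /* subsetting if, applied inside the step */
run;
```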
6. Combining datasets -- concatenation
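A sketch of concatenation with the SET statement (dataset names hypothetical):

```sas
/* Concatenation: all rows of admits2005 followed by all rows of admits2006 */
data work.alladmits;
   set work.admits2005 work.admits2006;
run;
```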
7. General (and obvious) principles
- Often, efficient methods have already been written to perform the required task; use them.
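One example of such a purpose-built method (dataset names hypothetical): PROC APPEND adds one dataset to the end of another without re-reading the base dataset, unlike a DATA step with SET:

```sas
/* Appends admits2007 to alladmits; only admits2007 is read */
proc append base=work.alladmits data=work.admits2007;
run;
```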
8. General (and obvious) principles
- Often, efficient methods have been written to perform other tasks; use them with caution.
- Write data-driven code
  - it's easier to maintain data than to update code
- Use length statements to limit the size of variables in a dataset to no more than is needed
  - you don't always know what size this should be, and you don't always produce your own data
- Use formatted data rather than the data itself
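A sketch of both ideas with hypothetical variables: LENGTH trims storage, and a FORMAT presents a compact stored value without storing a longer one:

```sas
data work.small;
   length gender $ 1 state $ 2;   /* character variables no wider than needed */
   length los 4;                  /* 4-byte numeric is safe for small integer counts */
   set work.big;
   format admit date8.;           /* store a numeric date, display it formatted */
run;
```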
9. Memory-resident datasets
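One way to hold a dataset in memory across steps is the SASFILE statement (SAS 9 and later; dataset name hypothetical):

```sas
sasfile mylib.ozone load;     /* read the dataset into memory once */

proc means data=mylib.ozone;  /* subsequent steps read from memory */
   var ozone;
run;

proc freq data=mylib.ozone;
   tables region;
run;

sasfile mylib.ozone close;    /* release the buffers */
```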
10. Compressing Datasets
- Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress before running each SAS job
  - delays execution, and you need to keep track of data and program dependencies
- Use a general-purpose compression utility and decompress within SAS for sequential access
  - system dependent (needs a named pipe), sequential dataset storage only
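On Unix, the second approach can be sketched with a PIPE fileref (file name and record layout hypothetical); the data can only be read sequentially:

```sas
/* gzip -dc decompresses to stdout; SAS reads the stream as it arrives */
filename gz pipe 'gzip -dc /data/ozone.csv.gz';

data work.ozone;
   infile gz dsd firstobs=2;
   input region $ siteid $ date :date9. ozone;
run;
```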
11. Compressing Datasets
12. SAS internal compression
- Allows random access to data and is very effective under the right circumstances; in some cases it doesn't reduce the size of the data by much.
- There is a trade-off between data size and CPU time.
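SAS internal compression is controlled with the COMPRESS option, set either session-wide or per dataset (library and dataset names hypothetical):

```sas
options compress=yes;              /* session-wide default (run-length encoding) */

data mylib.packed(compress=char);  /* or per dataset; compress=binary is also available */
   set mylib.big;
run;
```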
13.
- indata is a large dataset and you want to produce a version of indata without any observations
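An efficient way to do this is the OBS=0 dataset option, which copies the variable definitions without reading any rows:

```sas
data outdata;
   set indata(obs=0);   /* no observations are read; the structure is preserved */
run;
```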
14.
- The data step is a two-stage process
  - compile phase
  - execute phase
15. Data step logic
16. Data step logic
17. (No Transcript)
18. Data step
19.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV compile phase
Name type size drop retain format value
patientID C 6 n y
gender C 1 n y
admit N 8 n y date8.
length N 8 n y
discharge N 8 n n date8.
_N_
_ERROR_ 0
20.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8.
_N_ 1
_ERROR_ 0
21.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8. 15757
_N_ 1
_ERROR_ 0
22.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;   /* implicit output */
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8. 15757
_N_ 1
_ERROR_ 0
23.
data admits;
   set admits;
   discharge = admit + length;
   format discharge date8.;
run;
PDV execute phase
Name type size drop retain format value
patientID C 6 n y 321C-4
gender C 1 n y M
admit N 8 n y date8. 15736
length N 8 n y 21
discharge N 8 n n date8.
_N_ 2
_ERROR_ 0
24. Efficiency: suspend the PDV activities
25. General principles
- Use BY processing whenever you can
- Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.
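The task above can be sketched with BY processing (dataset and variable names assumed from the slide):

```sas
proc sort data=work.ozone out=work.sorted;
   by region siteid date;
run;

proc means data=work.sorted noprint;
   by region siteid date;            /* one pass over the sorted data */
   var ozone;
   output out=work.daily(drop=_type_ _freq_)
          mean=mean_ozone max=max_ozone;
run;
```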
26. General principles
27. General principles
- Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
- Combine multiple observations onto one line and then compute the statistics?
- Suppose you want the 10% trimmed mean?
- Suppose you want the second maximum?
- Use arrays to sort the data?
- Write your own function?
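For the 10% trimmed mean, one option is PROC UNIVARIATE's TRIMMED= option captured with ODS OUTPUT (a sketch; dataset names hypothetical, and the second maximum would still need separate handling):

```sas
ods select none;                       /* suppress printed output but keep ODS tables */
proc univariate data=work.sorted trimmed=0.1;
   by region siteid date;
   var ozone;
   ods output TrimmedMeans=work.trim;  /* trimmed means captured as a dataset */
run;
ods select all;
```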
28. (No Transcript)
29. (No Transcript)