1
Efficient SAS programming with Large Data
  • Aidan McDermott
  • Computing Group, March 2007

2
Axes of Efficiency
  • processing speed
      • CPU
      • real
  • storage
      • disk
      • memory
  • user
      • functionality
      • interface to other systems
      • ease of use
      • learning
  • user development
      • methodologies
      • reusable code
      • facilitate extension, rewriting
      • maintenance

3
Dataset / Table
4
  • Datasets consist of three parts

5
General (and obvious) principles
  • Avoid doing the job if possible
  • Keep only the data you need to perform a
    particular task (use drop, keep, where and ifs)
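
A minimal sketch of this principle, assuming a hypothetical dataset work.admits that carries more variables and observations than the task needs (the variable names are borrowed from the later slides; the subset itself is illustrative):

```sas
/* Assumption: work.admits exists and contains at least
   patientID, gender, admit and length. */
data males;
  /* keep= and where= on the input dataset discard unwanted variables
     and observations as the data are read. gender must be in the
     keep= list because the where= clause refers to it. */
  set admits(keep=patientID gender admit length
             where=(gender = 'M'));
  discharge = admit + length;
  drop gender;   /* constant in this subset, so not written out */
run;
```

Filtering on the input side means the unwanted variables never enter the step at all, which is usually cheaper than reading everything and discarding it with a later subsetting if.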

6
Combining datasets -- concatenation
7
General (and obvious) principles
  • Often efficient methods were written to perform
    the required task; use them.
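
One example that may be what the concatenation slide had in mind is PROC APPEND. The sketch below assumes two hypothetical datasets with the same variables, a large work.admits_all and a small work.admits_new; the two alternatives are shown side by side and only one of them should be run.

```sas
/* Alternative 1: data step concatenation.
   Every observation of both datasets is read and rewritten. */
data admits_all;
  set admits_all admits_new;
run;

/* Alternative 2: PROC APPEND.
   Only admits_new is read; its observations are added to the end
   of the base dataset, which is not reprocessed. */
proc append base=admits_all data=admits_new;
run;
```

When the base dataset is large and the addition is small, the second form avoids most of the input and output.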

8
General (and obvious) principles
  • Often efficient methods were written to perform
    other tasks; use them with caution.
  • Write data-driven code
  • It's easier to maintain data than to update code
  • Use length statements to limit the size of
    variables in a dataset to no more than is needed.
  • You don't always know what size this should be, and
    you don't always produce your own data.
  • Use formatted data rather than the data itself
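
A minimal sketch of the last two points, assuming a hypothetical dataset work.visits with a one-character code variable gender and a small integer count nvisits (neither the dataset nor the format name comes from the slides):

```sas
/* Store the short code and attach a format, rather than storing
   the full label in every observation. */
proc format;
  value $sexfmt 'M' = 'Male'
                'F' = 'Female';
run;

data visits_small;
  /* The length statement precedes set so that it determines the
     stored length. Shortening a numeric variable is only safe for
     integers small enough to be represented exactly in that length. */
  length nvisits 4;
  set visits;
  format gender $sexfmt.;
run;
```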

9
Memory resident datasets
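
The slide content is not in the transcript. One facility SAS provides for this is the SASFILE statement; the sketch below assumes a hypothetical dataset work.lookup that is small enough to fit in memory and is read by several steps.

```sas
/* Load the whole dataset into memory once */
sasfile work.lookup load;

/* Subsequent steps read the in-memory copy instead of the disk */
proc means data=work.lookup;
run;

proc freq data=work.lookup;
run;

/* Release the memory when finished */
sasfile work.lookup close;
```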
10
Compressing Datasets
  • Compress datasets with a compression utility such
    as compress, gzip, winzip, or pkzip and
    decompress before running each SAS job.
      • This delays execution, and you need to keep track
        of data and program dependencies.
  • Use a general-purpose compression utility and
    decompress within SAS for sequential access.
      • This is system dependent (it needs a named pipe) and
        gives sequential dataset storage only.
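
A sketch of the second approach, under several assumptions: a Unix-like host with gzip available, a SAS session that is allowed to run host commands, and a hypothetical comma-delimited file /data/ozone.txt.gz. It uses an unnamed pipe on the filename statement rather than the named pipe the slide mentions, and it reads raw text rather than a SAS dataset.

```sas
/* The command's standard output becomes the input stream */
filename gzin pipe 'gzip -dc /data/ozone.txt.gz';

data ozone;
  infile gzin dlm=',' dsd truncover;
  /* Hypothetical record layout: region, site, date, ozone value */
  input region $ siteid $ date :yymmdd10. ozone;
  format date date9.;
run;
```

The data can only be read from start to finish this way, which is the sequential-access limitation noted above.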

11
Compressing Datasets
12
SAS internal Compression
  • SAS internal compression allows random access to the
    data and is very effective under the right
    circumstances. In some cases it does not reduce the
    size of the data by much.
  • There is a trade-off between data size and CPU
    time.
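
A minimal sketch using the COMPRESS= dataset option; the dataset names are hypothetical.

```sas
/* Compress a single output dataset */
data admits_c(compress=yes);
  set admits;
run;

/* Or request compression for every dataset created from here on */
options compress=yes;
```

The log reports how much the size changed, which helps in judging whether the extra CPU time spent compressing and decompressing is worth it for a particular dataset.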

13
  • indata is a large dataset and you want to produce
    a version of indata without any observations
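
Two common ways to do this are sketched below; the output names empty1 and empty2 are illustrative. Both rely on the variables being defined when the step is compiled, so no observations need to be read.

```sas
/* Option 1: read zero observations */
data empty1;
  set indata(obs=0);
run;

/* Option 2: the set statement is compiled but never executed,
   and stop ends the step before any observation is written */
data empty2;
  if 0 then set indata;
  stop;
run;
```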

14
  • The data step is a two-stage process
      • compile phase
      • execute phase

15
Data step logic
16
Data step logic
18
data step
19
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

PDV: compile phase
Name       type  size  drop  retain  format  value
patientID  C     6     n     y
gender     C     1     n     y
admit      N     8     n     y       date8.
length     N     8     n     y
discharge  N     8     n     n       date8.
_N_
_ERROR_                                      0
20
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

PDV: execute phase
Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          1
_ERROR_                                      0
21
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

PDV: execute phase
Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0
22
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;  /* implicit output */

PDV: execute phase
Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0
23
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

PDV: execute phase
Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          2
_ERROR_                                      0
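
The PDV states traced on these slides can be observed directly by writing the PDV to the log. This is a sketch, assuming the admits dataset from the slides exists; the output name admits2 and the label text are illustrative.

```sas
data admits2;
  set admits;
  /* _all_ writes every variable in the PDV, including _N_ and _ERROR_ */
  put 'after set:    ' _all_;
  discharge = admit + length;
  put 'after assign: ' _all_;
  format discharge date8.;
run;
```

At the start of each iteration discharge is reset to missing because it is not retained, while the variables read by set keep their values until the next observation overwrites them.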
24
Efficiency: suspend the PDV activities
25
General principles
  • Use by processing whenever you can
  • Given the data below, for each region, siteid,
    and date, calculate the mean and maximum ozone
    value.

26
General principles
  • Easy
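
The slide's own code is not in the transcript. A minimal sketch of the BY-processing solution, assuming a hypothetical dataset work.ozone with variables region, siteid, date and ozone:

```sas
/* BY processing requires the data to be sorted (or indexed) by the BY variables */
proc sort data=ozone;
  by region siteid date;
run;

proc means data=ozone noprint;
  by region siteid date;
  var ozone;
  /* One output observation per region, site and date */
  output out=daily(drop=_type_ _freq_)
         mean=mean_ozone max=max_ozone;
run;
```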

27
General principles
  • Suppose there are multiple monitors at each site
    and you still need to calculate the daily mean?
  • Combine multiple observations onto one line and
    then compute the statistics?
  • Suppose you want the 10% trimmed mean?
  • Suppose you want the second maximum?
  • Use arrays to sort the data?
  • Write your own function?
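
As one possible answer for the second maximum, BY processing in a data step avoids arrays and user-written functions altogether. This is a sketch, assuming a hypothetical dataset work.ozone with variables region, siteid, date and ozone; the names sorted, second_max and seq are illustrative.

```sas
/* Within each region, site and date, put the largest values first */
proc sort data=ozone out=sorted;
  by region siteid date descending ozone;
run;

data second_max;
  set sorted;
  by region siteid date;
  if first.date then seq = 0;   /* restart the counter for each group */
  seq + 1;                      /* sum statement: seq is retained */
  if seq = 2 then output;       /* second-largest observation in the group */
  drop seq;
run;
```

Groups with a single observation produce no output, and when the maximum is tied the second observation carries the same value as the first, so the definition of "second maximum" needs to be settled before relying on this.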
