Title: Haas MFE SAS Workshop Lecture 3:
1Haas MFE SAS WorkshopLecture 3
- Peng Liu http//faculty.haas.berkeley.edu/peliu/c
omputing -
Haas School of Business, Berkeley, MFE 2006
2Commonly used PROCeduresin Financial Economics
- Peng Liu http//faculty.haas.berkeley.edu/peliu/c
omputing -
Haas School of Business, Berkeley, MFE 2006
3SAS PROCs
- Basic Statistical Analysis
- Univariate statistics
- PROC MEANS
- PROC UNIVARIATE
- PROC FREQ
- Bivariate and Multivariate Statistics
- PROC CORR
- PROC NPAR1WAY
- PROC TTEST
- Financial Econometrics using SAS
- Linear Models (OLS, GLS and their variates)
- PROC REG
- PROC GLM (Skip)
- Logistic Regression
- PROC LOGISTIC
- PROC GENMOD
- Hazard Regression (Cox-P.H.)
- PROC PHREG
4Basic Statistical Analysis
- Univariate statistics
- PROC MEANS
- PROC UNIVARIATE
- PROC FREQ
- Bivariate and Multivariate Statistics
- PROC CORR
- PROC NPAR1WAY
- PROC TTEST
5Comparison of PROC MEANSand PROC UNIVARIATE
- PROC MEANS
- DESCRIPTIVE STATISTICS
- CLM CSS CV KURTOSIS LCLM MAX MEAN MIN N NMISS
RANGE SKEWNESS STD STDERR SUM SUMWGT UCLM USS VAR - QUANTILE STATISTICS
- MEDIANP50 Q1P25 Q3P75 P1 P5 P10 P90 P95 P99
RANGE - HYPOTHESIS TESTING
- PROBT T
- PROC UNIVARIATE
- DESCRIPTIVE STATISTICS
- CSS CV KURTOSIS MAX MEAN MIN MODE N NMISS RANGE
SKEWNESS STD STDMEAN SUM SUMWGT USS VAR - QUANTILE STATISTICS
- MEDIAN P1 P5 P10 P90 P95 P99 Q1 Q3 RANGE
- QUANTILE STATISTICS
- NORMAL PROBN MSIGN PROBM SIGNRANK PROBS T PROBT
- ROBUST STATISTICS
- GINI MAD QN SN STD_SINI STD_MAD STD_QN
STD_QRANGE STD_SN
6PROC MEANS
- PROC MEANS DATAmfe.loan
- VAR appraisal ltv
- CLASS state
- RUN
PROC MEANS DATAmfe.loan max min VAR appraisal
ltv OUTPUT OUTm maxmaxvalue maxltv
minminvalue minltv RUN
- The default output for PROC MEANS are variable
label N Mean Std Dev Min max - median min max clm alpha0.05 are examples of
options you can specify. - You can get summary statistics for many variables
- CLASS statements will produce summary stat for
each grouping class. - You can suppress print using NOPRINT option
- You can save the result in a self-defined sas
dataset.
7PROC UNIVARIATE
- PROC UNIVARIATE DATAmfe.loan
- VAR ltv ID id
- RUN
PROC UNIVARIATE DATAmfe.loan VAR ltv
HISTOGRAM QQPLOT /normal RUN
- Use VAR to specify which variable you want to
analyze, otherwise, this PROC will produce all
variables - Use ID to identify Extreme Observations, without
ID statement it will use observation number by
default - Can plot histogram, quantile-quantile plots etc.
- Can do twosided T test, etc.
8PROC FREQ
- PROC FREQ DATAmfe.loan
- TABLE term
- RUN
PROC FREQ DATAmfe.loan TABLE state
stateterm/nocol norow RUN
- One-way v.s two-way frequency table
- /CHISQ or /BINOMIAL option can be used to test
equal proportion - In one TABLE statement, you can produce more than
one frequency tables - You can suppress col percentage or/and row
percentage by option /nocol norow
9PROC CORR
- PROC CORR DATAmfe.loan
- VAR rate ltv fico_orig
- RUN
PROC CORR DATAmfe.loan COV SPEARMAN VAR rate
ltv fico_orig RUN
- The CORR procedure computes Pearson correlation
coefficients, three nonparametric measures of
association (Spearman rank-oder correlation,
Kendalls taub and Hoeffdings measure of
dependence D), and the probabilities associated
with these statistics for numeric variables - The default is Pearson correlation.
- COV option evolke the computation of covariance
10PROC TTEST
- DATA
- INPUT a b _at__at_
- DATALINES
- 51 55 64 61 75 74 86 90
- 95 93 68 71 73 72 90 95
-
- RUN
PROC TTEST PAIRED ab RUN
- DATA step will produce automatic dataset, if user
did not specify one. - _at__at_ in INPUT lets SAS continuously read from
datelines - DATALINES is a SAS statement followed by lines
of raw data. - Data are typed continuously separated by blank,
you can separated into a different line in the
way you like. - should be stand by itself
- PROC step will perform specified procedure on
current dataset in working directory if user did
not specify a particular dataset name - Paired T-Test
11PROC NPAR1WAY
- PROC NPAR1WAY DATAmfe.loan
- CLASS state
- VAR ltv
- RUN
- NONPARAMETRIC TEST FOR DIFFERENCE ACROSS ONE-WAY
CLASSIFICATION. - IF the normality assumption does not hold, we may
use some nonparametric tests. - PROC NPAR1WAY performs nonparametric tests for
location and scale differences across a one-way
classiication, based on the following scores
Wilcoxin, Median, Van Der Waerden, Savage,
Siegel-Tukey, Ansari-Bradley, Klotz, and Modd
Scores.
12Financial Econometrics using SAS
- Linear Models (OLS, GLS and their variates)
- PROC REG
- PROC GLM (Skip)
- Logistic Regression
- PROC LOGISTIC
- PROC GENMOD
- Hazard Regression (Cox-P.H.)
- PROC PHREG
13Linear Model Theory
- Data (yi, xi(xi1, xi2, xik)) for i1, , n
and yi ? R - Model yi ?0?-1xi1 ?kxik ?i for i1,,n
- For short yX??
- where
- Assumption ?i are i.i.d. normal N(0,?2)
- Ordinary Least Square Estimation
- ? (XTX)-1XTy
14PROC REG
- PROC REG is a SAS procedure for simple or
multivariate linear regression models with
continuous dependent variables. - Part of SAS/STAT
- Model fitting (parameters, residuals, confidence
limits, influential statistics, etc) - Model selection (forward, backward, stepwise,
,etc) - Hypothesis testing
- Model diagnostics
- Plotting
- Outputting estimates and statistics
15PROC REG Examples
- PROC REG DATAmfe.loan
- MODEL ltv rate
- PLOT ltv rate
- QUIT
MODEL ltv rate fico_orig OLSMODEL ltv term
rate fico_orig MODEL ltv rate fico_orig
term/SELECTIONF
- Begin with PROC REG end with QUIT
- Multiple independent , dependent variables are
separated by space - Label OLS is optional, useful for multiple
MODEL statement in one PROC REG - By default, a constant is included
- Use /Options to request additional stat or
specify model selection method - PLOT creates a scatter plot of your regression
data and automatically adds the regression line.
16Logistic Regression Theory
- Data (yi, xi(xi1, xi2, xik)) for i1, , n
and yi is a binary or ordinal response variable.
e.g. yi ? 0,1 - Model
-
- Maximum Likelihood estimate of ?
- Assumption binomial Variation
17Logistic Regression SAS procedure
- SAS has several procedures that performs logistic
regression, e.g. GENMOD, CATMOD and LOGISTIC - PROC LOGISTIC
- Works for binary or ordinal response variables
- Performs MLE using different optimization
algorithms - 4 model selection methods F, B, Stepwise, Score
- Outputs statistics to dataset
- Tests linear hypotheses of parameters
18PROC LOGISTIC Examples
- PROC LOGISTIC DATAmfe.loan
- CLASS state edu
- MODEL default ltv age edu term rate
state/LINKLOGIT - RUN
- Begin with PROC LOGISTIC end with QUIT
- /LINKLOGIT option can be ignored, other options
PROBIT, CLOGIT, CLOGLOG - Use CLASS statement to avoid creating dummy in
DATA step - /option can be used to request additional stat,
or specify selection method. - TEST statement
19Survival Analysis Background 1
20Survival Analysis Background 2
21Cox Proportional Hazard Regression
22PROC PHREG - Example
- PROC PHREG DATAmfe.loan
- MODEL loanageprepay(0) age edu race rate ltv
fico_orig state - RUN
- Use WHERE option to subset sample to want to
regress - You can define, group variables inside PHREG
after MODEL using IF THEN ELSE - Handling tied data /TIESEXACT, other option
DISCRETE - Run PHREG for different group, use BY option,
need to sort data. - Use CLASS statement to create dummy variables
23SAS SQL
- Peng Liu http//faculty.haas.berkeley.edu/peliu/c
omputing -
Haas School of Business, Berkeley, MFE 2006
24PROC SQL - What
- What does SQL can do?
- Selecting
- Ordering/sorting
- Subsetting
- Restructuring
- Creating table/view
- Joining/Merging
- Transforming variables
- Editing
25PROC SQL - Why
- The Advantage of using SQL
- Combined functionality
- Faster for smaller tables
- SQL code is more portable for non-SAS
applications - Not require presorting
- Not require common variable names to join on.
(need same type , length)
26Selecting Data
- PROC SQL
- SELECT DISTINCT rating FROM MFE.MOVIES
- QUIT
- The simplest SQL code, need 3 statements
- By default, it will print the resultant query,
use NOPRINT option to suppress this feature. - Begin with PROC SQL, end with QUIT not RUN
- Need at least one SELECT FROM statement
- DISTINCT is an option that removes duplicate rows
27Ordering/Sorting Data
- PROC SQL
- SELECT
- FROM MFE.MOVIES
- ORDER BY category
- QUIT
- Remember the placement of the SAS statements has
no effect so we can put the middle statement
into 3 lines - SELECT means we select all variables from
dataset MFE.MOVIES - Put ORDER BY after FROM.
- We sort the data by variable category
28Subsetting Data- Character searching in WHERE
- PROC SQL
- SELECT title, category
- FROM MFE.MOVIES
- WHERE category CONTAINS 'Action'
- QUIT
- Use comma (,) to separate selected variables
- CONTAINS in WHERE statement only for character
variables - Also try WHERE UPCASE(category) LIKE 'ACTION'
- Use wildcard char. Percent sign () with LIKE
operator.
29Subsetting Data- Phonetic Matching in WHERE
- PROC SQL
- SELECT title, category, rating
- FROM MFE.MOVIES
- WHERE category 'Drana'
- QUIT
- Always Put WHERE after FROM
- Sounds like operator
- Search movie title for the phonetic variation of
drama, also help possible spelling variations
30Case Logic - reassigning/recategorize
- PROC SQL
- SELECT title, rating,
- CASE rating
- WHEN 'G' THEN General'
ELSE 'Other' - END AS level
- FROM MFE.MOVIES
- QUIT
- The order of each statement is important
- CASE END AS should in between SELECT and FROM
- Note there is , after the variables you want to
select - Use WHEN THEN ELSE to redefine variables
- Rename variable from rating to level
31Creating New Data - Create Table
- PROC SQL
- CREATE TABLE ACTION AS
- SELECT title, category
- FROM MFE.MOVIES
- WHERE category CONTAINS 'Action'
- QUIT
- CREATE TABLE AS can always be in front of
SELECT FROM statement to build a sas file. - In SELECT, the results of a query are converted
to an output object (printing). Query results can
also be stored as data. The CREATE TABLE
statement creates a table with the results of a
query. The CREATE VIEW statement stores the query
itself as a view. Either way, the data identified
in the query can beused in later SQL statements
or in other SAS steps. - Produce a new dataset (table) ACTION in work
directory, no printing
32Creating New Data - Create View
- PROC SQL
- CREATE VIEW G_MOVIES AS
- SELECT title, category, rating
- FROM MFE.MOVIES
- WHERE rating 'G'
- ORDER BY title
- SELECT FROM G_MOVIES
- QUIT
- First step-creating a view,no output is produced
then display the desired output results - Use to separate two block of code inside of
proc sql - When a table is created, the query is executed
and the resulting data is stored in a file. When
a view is created, the query itself is stored in
the file. The data is not accessed at all in the
process of creating a view.
33Join Tables (Merge datasets) - Cartesian Join
- PROC SQL
- SELECT
- FROM MFE.CUSTOMERS, MFE.MOVIES
- QUIT
- Terminology Join (Merge) datasets (tables)
- No prior sorting required one advantage over
DATA MERGE - Use comma (,) to separate two datasets in FROM
- Without WHERE, all possible combinations of rows
from each tables is produced, all columns are
included - Turn on the HTML result option for better
display - Tool/Options/Preferences/Results/ check Create
HTML/OK
34Join Tables (Merge datasets) - Inner Join using
WHERE
- PROC SQL
- SELECT
- FROM MFE.MOVIES, MFE.ACTORS
- WHERE MOVIES.title ACTORS.title
- QUIT
- Use WHERE to specify connecting columns (title)
- table1.matchvar table2.matchvar
- Produce rows that have same movie title
- The matching variable can be of different name
different datasets
35Join Tables (Merge datasets) - Inner Join using
WHERE (Cont.)
- PROC SQL
- SELECT M.title, M.rating, A.actor_leading
- FROM MFE.MOVIES M, MFE.ACTORS A
- WHERE MOVIES.title ACTORS.title
- QUIT
- Short-cut for table names
- Can be used in SELECT and WHERE statements
- Need to be declared in FROM statement
36Join Tables (Merge datasets) - Join three tables
- PROC SQL
- SELECT C.cust_no,
- M.title,M.rating, M.category,
- A.actor_leading
- FROM MFE.CUSTOMERS C,
- MFE.MOVIES2 M,
- MFE.ACTORS A
- WHERE C.cust_no M.cust_no AND M.title
A.title - QUIT
- Use AND in WHERE statement to specify two
matching conditions - Produce rows that satisfies all the conditions
- Note We use MOVIES2 in this example
- Can join up to 32 tables in one SQL code
37Join Tables (Merge datasets) - Inner Joins
using ON
- PROC SQL
- SELECT M.title, rating,actor_leading
- FROM MFE.MOVIES M
- INNER JOIN MFE.ACTORS A
- ON M.TITLE A.TITLE
- QUIT
- Same result as using where
- WHERE is used to select rows from inner joins
- ON is used to select rows from outer or inner
38Join Tables (Merge datasets) - Left Outer Joins
- PROC SQL
- SELECT MOVIES.title, actor_leading, rating
- FROM MFE.MOVIES
- LEFT JOIN
- MFE.ACTORS
- ON MOVIES.title ACTORS.title
- QUIT
- Resulting output contains all rows for which the
SQL expression, referenced in the ON clause,
matches both tables and all rows from LEFT table
(MOVIES) that did not match any row in the right
(ACTORS) table. - Essentially the rows from LEFT table are
preserved and captured exactly as they stored in
the table itself, regardless if a match exists. - Need to specify a table name for the matching
variable in SELECT
39Join Tables (Merge datasets) - Right Outer Joins
- PROC SQL
- SELECT ACTORS.title, actor_leading, rating
- FROM MFE.MOVIES
- RIGHT JOIN
- MFE.ACTORS
- ON MOVIES.title ACTORS.title
- QUIT
- Resulting output contains all rows for which the
SQL expression, referenced in the ON clause,
matches both tables and all rows from RIGHT table
(ACTORS) that did not match any row in the right
(MOVIES) table. - Essentially the rows from RIGHT table are
preserved and captured exactly as they stored in
the table itself, regardless if a match exists.
40Join Tables (Concatenating) - Outer Union
- PROC SQL
- SELECT FROM MFE.CUSTOMERS
- OUTER UNION
- SELECT FROM MFE.MOVIES
- QUIT
- SQL performs OUTER UNION, similar to DATA steps
with a SET statement to Concatenate datasets. - The result contains all the rows produced by the
first table-expression followed by all the row
produced by the second table-expression.
41Transforming Data- Creating new Variables
- /Creating new variables/
- PROC SQL
- SELECT title, length, category, year, rating,
- 2006-year AS age
- FROM MFE.MOVIES
- QUIT
- You can create new variables within SELECT
statement, the name of new variable follows after
AS. - Note the order of the express is reversed
42Transforming Data- Summarizing Data using SQL
functions
- PROC SQL
- SELECT ,
- COUNT(title) AS notitle,
- MAX(year) AS most_recent,
- MIN(year) AS earliest,
- SUM(length) AS total_length,
- NMISS(rating) AS nomissing
- FROM MFE.MOVIES
- GROUP BY rating
- QUIT
- Simple summarization functions available
- All function can be operated in GROUPs
43Editing Data Insert observations.
- PROC SQL NOPRINT
- INSERT INTO MFE.CUSTOMERS
- VALUES(1 'Peng')
- INSERT INTO MFE.CUSTOMERS
- SET Cust_no2,Name'Sasha'
- QUIT
- There are two ways of inserting observations into
a table. Data type should be the same. - VALUES( ) new values are separated by space.
- SET column name newly assigned values,
delimited by commas.
44Editing Data Deleting rows and Dropping
columns
- /Deleting rows/
- PROC SQL
- DELETE
- FROM MFE.MOVIES
- WHERE length LE 100
- QUIT
- /Droping variables/
- PROC SQL
- CREATE TABLE NEW (DROPrating) AS
- SELECT
- FROM MFE.MOVIES
- QUIT
- Deleting columns can be done in SELECT or in DROP
on created table
45Editing Data Update observations
- /Updating Observation/
- PROC SQL NOPRINT
- UPDATE MFE.CUSTOMERS
- SET Name'Liu'
- WHERE Cust_no1
- QUIT
- UPDATE SET WHERE
- Find the observation and set new value
- If more than one observations satisfies the
condition, all are updated with the new data in
SET statement
46SAS IML
- Peng Liu http//faculty.haas.berkeley.edu/peliu/c
omputing -
Haas School of Business, Berkeley, MFE 2006
47PROC IML - What
- OUTLINES
- Defining Matrix
- Creating Matrix
- Combining Matrix
- Matrix Algebra (Function, Reduction Operators)
- Read-In SAS data sets
- Create SAS data sets
48Defining and Referencing matrices
- PROC IML
- RESET PRINT
- A3
- B1 2 3 row vector
- C1,2,3 column vector
- D1 2 3, 4 5 62 X 3 matrix
- QUIT
PROC IML D1 2 3, 4 5 6 E1D,2
E2D1, E3D2,3 E4D1 3,2 3 PRINT D
E1 E2 E3 E4 QUIT
- Begin with PROC IML end with QUIT
- RESET PRINT produce all output otherwise specify
what you want to PRINT , but you have to assign
first. - Use curly brackets to create vector or matrix
- Use square bracket to refer to elements
- E4D1 3,2 3 assigns submatrix to E4
consisting 1,3 rows and 2,3 columns of matrix D
49Creating matrices
- PROC IML
- RESET PRINT
- index15 rindex51
- vars'X1''X8'
- series do(0,84,12)
aa I(6) bb j(5,5,0) cc j(6,1) dd
diag(1 2 4) ee diag(1 2,3 4) QUIT
- Index operator creates a row vector of
consecutive integers or character strings - Do function creates series with any increment
- Identity matrix I(size)
- Constant matrix J(nrow, ncol, value)
- Diagonal matrices DIAG(VECTOR)
50Combining matrices
- PROC IML
- RESET PRINT
- b1 2 3 4
- btt(b)
- c1 2 3 4,2 4 6 8, 3 9 7 5 ,9 8 7 6
- i1c//bi2cbt
- QUIT
- Transpose operation using t()
- "//" and " concatenates matrices vertically and
horizontally respectively
51Matrix Algebra - Basics
52Matrix Algebra Matrix functions
53Matrix Algebra Matrix Reduction Operators
54Read SAS data sets
- PROC IML
- RESET PRINT
- USE MFE.LOAN
- READ ALL INTO m
- WHERE(state'CA')
- QUIT
- READ ALL VARid term rate INTO m
- USE specify which SAS data set is to be used
- READ ALL INTO converts SAS data set into a
matrix. - READ ALL only read numerical variables
- Use VARterm rate option to select variables
- Use WHERE to subset data.
55Create SAS data sets
- PROC IML
- RESET PRINT
- seedJ(3,3,9)
- r ranuni(seed)
- CREATE data FROM r
- APPEND FROM r
- QUIT
- J function build 3 by 3 matrix with seed9 for
each element - CREATE FROM opens the new SAS data set
- APPEND FROM writes to the SAS data set
- Output dataset has default column names COL1
COL2 You can rename by option COLNAME R