Title: Intro to Stata
1SIAP-SRTC Training Course on Sampling
Acceed Center, AIM, Makati, Philippines, 4 April 2002
2OUTLINE
- Statistical Computing Resources
- Data Management with Stata
- Table Generation
- Tab and Table Commands
- Survey Commands
4Computing Resources
- The Age of ICT has brought about a synergy of
computing and communications
- Implications
- More DATA collected
- More DATA stored
- More DATA accessible and distributed
5Computing Resources
- There is a host of statistical software packages
that provide pre-programmed analytical and data
management capabilities. These packages may be
classified according to use and cost.
6Computing Resources
- Types of Stat Software by usage
- General Purpose -- SAS, SPSS, R, S-plus,
Statistica, Stata
- Special Purposes -- econometric modeling
(Eviews), seasonal adjustment (X12), Bayesian
modeling (WINBUGS), survey data tabulation and
variance estimation (IMPS, CENVAR)
7Computing Resources
- Types of Stat Software by cost
- Commercial Software - SAS, SPSS, Stata, S-plus
- Freeware - R, IMPS, X12
8Computing Resources
- FOR SURVEY DATA
- Bascula from Statistics Netherlands.
- CENVAR (IMPS) from U.S. Bureau of the Census.
- CLUSTERS from University of Essex.
- Epi Info from Centers for Disease Control.
- Generalized Estimation System (GES) from
Statistics Canada.
- IVEWare (beta version) from University of
Michigan.
9Computing Resources
- FOR SURVEY DATA
- PCCARP from Iowa State University.
- SAS/STAT from SAS Institute.
- Stata from Stata Corporation.
- SUDAAN from Research Triangle Institute.
- VPLX from U.S. Bureau of the Census.
- WesVar from Westat, Inc.
10Computing Resources
- Lists of Statistical Software
- http://members.aol.com/johnp71/javasta2.html
- http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm
- http://www.fas.harvard.edu/stats/survey-soft/
- http://www.feweb.vu.nl/econometriclinks/software.html
11Computing Resources
- This afternoon, we will provide a demonstration
on how to use STATA for accomplishing some of the
most common tasks of data management, statistical
computing and analysis of survey data.
12Computing Resources
- Stata
- Estimation of means, totals, ratios, and
proportions
- linear regression, logistic regression, and
probit.
- Point estimates, associated standard errors,
confidence intervals, and design effects for the
full population or subpopulations are displayed.
13Computing Resources
- Stata
- Auxiliary commands display various information
for linear combinations (e.g., differences) of
estimators, and conduct hypothesis tests.
- New in Stata: contingency tables with Rao-Scott
corrections of chi-squared tests; new
survey-corrected regression commands including
tobit, interval, censored, instrumental
variables, multinomial logit, ordered logit and
probit, and Poisson
14Computing Resources
- Stata
- stratified designs
- cluster sampling
- FPCs can be calculated for simple random sampling
w/o replacement of sampling units within strata
- variance estimation for multistage sample data
carried out through the customary
between-PSU-squared-differences calculation.
15Computing Resources
- Stata
- Variance estimation is done thru Taylor-series
linearization in the survey analysis commands.
There are also commands for jackknife and
bootstrap variance estimation, but these are not
specifically oriented toward survey data.
16Computing Resources
- Note
- We will demonstrate the use of STATA version 6.
The current version is version 7, which also has a
Special Edition (SE) that can handle up to 32,766
variables, with strings up to 244 chars, and
matrices up to 11,000 x 11,000.
18Data Management
- STARTING UP
- Go to Start, Programs, Stata, Intercooled Stata
- Alternatively, from Windows Explorer, go to
folder
- c:\stata
- Double click
- wstata.exe
19Data Management
20Data Management
- CREATING A NEW DATASET
- Open the STATA spreadsheet editor
21Data Management
- CREATING A NEW DATASET
- Enter data into the editor; when done, close the
editor.
22Data Management
- CREATING A NEW DATASET
- In the STATA COMMAND window enter the command
- save newfile
23Data Management
- NOTE
- A STATA dataset will have extension name .dta.
That is, newfile is actually newfile.dta
- Public use files of some surveys, e.g. VLSS
(Vietnam Living Standards Survey), are in Stata
format.
24Data Management
- INSPECTING DATA BASE
- In the STATA COMMAND window enter the following
commands
- describe
- list
- summarize
25Data Management
- NOTE
- Stata is case sensitive.
- Stata commands may be abbreviated, e.g. d for
describe, sum for summarize, etc.
- We may use Page Up/Down keys or mouse for
re-selecting commands in the Review window.
26Data Management
- NOTE
- Commands and output are shown in Results window.
Windows may be re-sized.
- Commands and output may be logged into a log
file by pressing Open Log button.
27Data Management
- RENAMING VARIABLES
- ONE WAY (From Data Editor): Double click
anywhere in the variable's column to bring up a
dialogue box
28Data Management
- RENAMING VARIABLES
- SECOND WAY (In the STATA COMMAND window): enter
- rename var1 domain
- rename var2 hcn
- rename var3 age
- label variable age "HH head age"
- d
29Data Management
- SAVING EDITED DATABASE
- In the STATA COMMAND window enter the following
command
- save newfile, replace
- Note: typing only
- save newfile
- will result in an error message, since
newfile.dta already exists
30Data Management
- READING PRE-EXISTING STATA DATASET
- If dataset is in folder c:\fies2000 and filename
is fies00small.dta, enter
- clear
- set mem 64m
- cd c:\fies2000
- use fies00small
- NOTE: set mem is important for MEMORY MANAGEMENT
31Data Management
- IMPORTING DATA
- Suppose we have a dataset try.txt in c:\fies2000
folder
- NOTE: Missing data are coded as .
32Data Management
- IMPORTING DATA
- Suppose we have a dataset try.txt in c:\fies2000
folder
- Use the infile command with syntax
- infile variable-list using filename.raw
- In particular, enter
- cd c:\fies2000
- infile domain hcn age using try.txt, automatic
33Data Management
- TRIVIA ON STRING VARIABLES
- When using the infile command for character
(string) variables, we need to identify these
variables. For instance
- infile domain hcn str30 prov using tr.txt
- For more details regarding infile, enter
- help infile1
34Data Management
- IMPORTING DATA
- Suppose we have a dataset try2.txt in c:\fies2000
folder with the data in specific fields
- NOTE: Assumes the last line is a blank line
35Data Management
- IMPORTING DATA
- Suppose we have a dataset try2.txt in c:\fies2000
folder with the data in specific fields
- Use the infix command
- infix domain 1 hcn 2 age 3-4 using try2.txt, clear
36Data Management
- Thus, Stata can read text files with
- infile (if the data in text is separated by
spaces and does not have strings, or if strings
are just one word, or if all strings are enclosed
in quotes)
- infix (fixed format text)
- insheet (if text file was created by a
spreadsheet or db program)
37Data Management
- NOTE
- The commands infile, infix, insheet read data
from ASCII files. outfile is a way to save the
data in ASCII.
- There are third party programs, esp.
Stat/Transfer and DBMS/COPY, that perform
translations from one data format (e.g., dBASE,
Excel, SAS, SPSS, Stata) to another.
38Data Management
39Data Management
- OTHER USEFUL COMMANDS
- To sort the dataset by age
- sort age
- To get a listing of the dataset
- list
- To get a listing of the 2nd to 4th observations
- list in 2/4
40Data Management
- OTHER USEFUL COMMANDS
- To summarize the restricted dataset of HHs whose
head's age is less than/equal to 50
- summarize if age<=50
- HH head age between 35 and 50
- summarize if age<=50 & age>=35
41Data Management
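- Comparison operators
- > (greater than), >= (greater than or equal to)
- < (less than), <= (less than or equal to)
- == (equal to), != or ~= (not equal to)
- Logical operators
- & (and), | (or)
- ! or ~ (not)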
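The operators above can be tried on a small made-up dataset. This is only a sketch: the five records below are invented for illustration, and the variable names simply follow the earlier slides.

```stata
* Toy data: values are made up for illustration only
clear
input domain hcn age
1 1 28
1 2 35
2 3 42
2 4 50
3 5 63
end

* Heads aged 50 or younger
summarize age if age <= 50

* Heads aged 35 to 50 inclusive: both conditions must hold (&)
summarize age if age >= 35 & age <= 50

* Heads younger than 35 or older than 50: either condition (|)
count if age < 35 | age > 50

* Missing values rank above every number, so exclude them explicitly
count if age > 50 & !missing(age)
```

Note that a condition such as age > 50 is also true for missing ages, which is why the last line adds !missing(age).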
42Data Management
- OTHER USEFUL COMMANDS
- To tabulate domain
- tab domain
- To generate contingency tables
- tab domain hcn if age>35
- To get the correlation matrix
- correlate x y z
43Data Management
- GENERATING & REPLACING VARIABLES
- Suppose we want to obtain per capita income (pci)
of FIES 2000 households
- clear
- cd d:\fies00
- use fies00small
- gen pci=toinc/hsize
44Data Management
- GENERATING & REPLACING VARIABLES
- Now tag the household as poor (1) if pci < some
threshold, say 13823, and determine the percent of
HHs that are poor.
- gen poor=1 if pci<13823
- replace poor=0 if poor==.
- sum poor [aw=rfact]
- save fies00small, replace
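A possible variant of the tagging step, shown on made-up records (all values below are hypothetical, not FIES data). It builds the 0/1 indicator in one line and, unlike the two-step version, leaves poor missing when pci is missing. The weighted mean of a 0/1 variable is the estimated proportion of poor households.

```stata
* Toy data: incomes, sizes and weights are made up for illustration
clear
input toinc hsize rfact
 40000 5 100
 90000 4 150
 60000 6 120
120000 3 200
     . 4 110
end

gen pci = toinc/hsize

* 1 if below the threshold, 0 otherwise, missing when pci is missing
gen poor = (pci < 13823) if !missing(pci)

* Weighted mean of the 0/1 indicator = estimated proportion poor
summarize poor [aw=rfact]
display "Estimated percent poor: " 100*r(mean)
```

Whether a household with missing income should count as nonpoor or be left out is a judgment call; the sketch only makes that choice explicit.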
45Data Management
- NOTE
- Only a small portion of the FIES 2000 data set
was used. The Family Income and Expenditure Survey
(FIES) is conducted by the National Statistics
Office (NSO) every 3 years. Data may be purchased
through the NSO website
- www.census.gov.ph
46SIAP-SRTC Training Course on Sampling
Acceed Center, AIM, Makati, Philippines, 5 April 2002
47Data Management
- RECALL
- If we use our fies2000 data set, we enter
- set mem 64m
- cd c:\fies2000
- use fies00small
- sum poor [aw=rfact]
- Note: the poverty line we provided is a weighted
average of the variable poverty lines in the
Philippines (for urban-rural areas across the
different regions)
49Estimating Food Poverty Line
- Food poverty line estimated from low cost one day
menus (breakfast, lunch, supper & snack)
constructed for each urban-rural area of a region
by the Food and Nutrition Research Institute (FNRI),
which meet 100% sufficiency in energy and protein
requirements and 80% sufficiency of other
nutrients and vitamins.
- RDAs for energy: 2000 Kcal per person
- RDAs for protein: 50 grams per person
- 29 such menus constructed on the basis of the
1988 Food Consumption Survey
50Annual Per Capita Food Line Urban, by Region
51Annual Per Capita Food Line Rural, by Region
52Estimating Poverty Line
- Poverty Line = Food Threshold / Engel's Coefficient
- Engel's coefficient estimated by analyzing the
consumption pattern of families having incomes
within plus or minus 10 percentage points from
food threshold.
- Engel's coeff = Food Exp / Total Basic Exp
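The two formulas above can be checked with a few lines of arithmetic in Stata. This is a sketch only: the three input figures and the scalar names are hypothetical and do not come from FNRI or NSO tables.

```stata
* Hypothetical inputs, for illustration only (annual per capita pesos)
scalar food_threshold = 9000
scalar food_exp       = 6500
scalar total_basic    = 10000

* Engel's coeff = Food Exp / Total Basic Exp
scalar engel = food_exp/total_basic

* Poverty Line = Food Threshold / Engel's Coefficient
scalar poverty_line = food_threshold/engel

display "Engel's coefficient: " engel
display "Poverty line:        " poverty_line
```

Since Engel's coefficient is below 1, the poverty line always comes out above the food threshold.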
53Annual Per Capita Poverty Line Urban, by Region
54Annual Per Capita Poverty Line Rural, by Region
55Poverty Statistics (Family)
Measures           2000 (SE)    1997
Poverty Incidence  33.6 (0.3)   31.8
Poverty Gap        10.7 (0.1)   10.0
Severity Index      4.6 (0.1)    4.3
( ) Standard Error
56Poverty Incidence All Areas, by Region
57Small Area Poverty Stats?
- Stata has some add ons for generating SEs for
poverty stats
- If we wish to generate provincial poverty
statistics, we will find out that SEs are too
high, i.e. figures are unreliable
59Data Management
- RECALL
- If we use our fies2000 data set, we enter
- set mem 64m
- cd c:\fies2000
- use fies00small
- sum poor [aw=rfact]
- Note: the poverty line we provided is a weighted
average of the variable poverty lines in the
Philippines (for urban-rural areas across the
different regions)
60Data Management
- NOTE
- STATA uses several types of weights
- fw frequency weights
- aw analytic weights
- iw importance weights
- pw probability weights
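A small sketch of how the weight type changes what summarize reports, using three made-up observations (x and w are hypothetical names). Frequency weights treat each row as w identical records, while analytic weights give the same weighted mean but keep the number of observations. summarize does not accept pw; probability weights are mainly used by the survey commands.

```stata
* Toy data: values are made up for illustration only
clear
input x w
10 1
20 2
30 3
end

* Unweighted: 3 observations
summarize x

* Frequency weights: each row stands for w identical observations
summarize x [fw=w]

* Analytic weights: same weighted mean, N stays at 3
summarize x [aw=w]
```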
61Data Management
- NOTE
- Within the command generate or replace, we may
transform or create variables by using functions,
e.g.,
- generate loginc=ln(toinc)
- generate y=cos(x*_pi/180)
- replace newvar=normd(z)
- generate rvar=uniform()
62Data Management
- DELETING VARIABLES/DATA
- To drop a variable, say age
- drop age
- To drop some observations
- drop in 2/3
- Try also the command keep.
- To drop all data in memory
- clear
63Data Management
- NOTE
- So far we have used STATA interactively. We can
also do batch processing through the DO FILE
editor.
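A minimal sketch of what a do-file for batch processing might look like. The file name example.do and the log name are hypothetical, and the auto dataset is the example data that ships with recent Stata releases (an assumption; it is not the FIES data used in these slides). Save the lines in the DO FILE editor and run them with: do example

```stata
* example.do (hypothetical file name)
* Close any log left open, then start a new plain-text log
capture log close
log using example.log, replace text

* Load an example dataset shipped with Stata
sysuse auto, clear

* The same inspection commands used interactively earlier
describe
summarize
tab foreign

log close
```

Keeping the commands in a do-file means the whole analysis can be rerun and checked later, which is harder with commands typed one at a time.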
64Data Management
- NOTE
- The STATA toolbar has 13 buttons.
- The first three are to
- OPEN a Stata dataset
- SAVE to the disk the resident dataset
- PRINT a graph or log
65Data Management
- The next five are for
- Starting/stopping/suspending a LOG
- Bringing the Log to the Front
- Bringing the Dialog to Front
- Bringing the Results to Front
- Bringing the Graph to Front
66Data Management
- The last five are for
- Opening the DO FILE editor
- Opening the DATA editor
- Opening the DATA Browser
- Telling Stata to continue when it has paused
in the middle of long output
- Stopping the current task
67Exercise
- What is the average income of families that are
below or above the mean family expenditure?
68Exercise
- Compare the correlation of food expenditures (fexp)
and nonfood expenditures for families in rural
and urban areas.
69Extra
70Extra
- Now try
- sort urb
- graph food nfood, by(urb)
- graph food nfood, by(urb) total
71Extra
- Matrix plots
- graph toinc food nfood, matrix
73Table Generation w/ tab
- Earlier, we showed the use of the tab(ulate)
command. Try
- tab urb
- tab urb [aw=rfact]
- tab urb [iw=rfact]
- tab urb regn
74Tab
- The tab command has options for generating 1-way
tables of freqs
- tab urb, summ(toinc)
- and two way tables
- tab urb sex
- tab urb sex, row
- tab urb sex, row col chi2
- tab urb sex, all exact
75Table Generation w/ table
- Aside from the tab command, we can generate
tables of statistics with the table command.
Compare
- tab urb
- with
- table urb
76Table
- To generate the average (family) income and
average (family) expenditure across urban and
rural areas, enter
- table urb, c(mean toinc mean toexp)
- Using weights
- table urb [aw=rfact], c(mean toinc mean toexp)
77Table
- The contents option may specify at most five of
the ff statistics
- freq (for frequency)
- mean varname (for mean of varname)
- sd varname (for standard deviation)
- sum varname (for sum)
- rawsum varname (for sums ignoring optionally
specified weight)
- count varname (for count of nonmissing data)
78Table
- The contents option may specify at most five of
the ff statistics
- n varname (same as count)
- max varname (for maximum)
- min varname (for minimum)
- median varname (for median)
- p1 varname (for 1st percentile)
- p2 varname (for 2nd percentile)
- ...
- iqr varname (for interquartile range)
79Exercise Using Table
- Obtain the average and median per capita income
of households by sex of household head
- table sex, c(mean pci median pci)
- Obtain the weighted frequency of poor and
nonpoor households across regions
- table poor regn [iw=rfact]
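The table syntax above is the one demonstrated in this course; in much later releases (Stata 17 onward) the table command was redesigned and the contents option is written differently. As a hedged alternative, tabstat produces the same kind of summary. The sketch below uses made-up records, with the variable names following the slides.

```stata
* Toy data: values are made up for illustration only
clear
input sex pci
1 12000
1 18000
1 30000
2 15000
2 25000
end

* Mean and median per capita income by sex of household head
tabstat pci, by(sex) statistics(mean median)
```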
80Using Survey Commands
- STATA has designed a family of commands
especially for sample surveys. These commands
all begin with svy
- svyset -- setting variables
- svydes -- describe strata and PSUs
- svymean -- estimate popn & subpop means
- svytotal -- estimate popn & subpop totals
81Using Survey Commands
- Svy commands
- svyprop -- estimate popn & subpop props
- svyratio -- estimate popn & subpop ratios
- svytab -- for two way tables
- svyreg -- for regression
- svyivreg -- for instrumental variables reg
- svylogit -- for logit reg
- svyprobit -- for probit reg
82Using Survey Commands
- Svy commands
- svytest -- for hypothesis testing
- svylc -- for estimating linear combs
- svymlog -- for multinomial logistic reg
- svyolog -- for ordered logistic reg
- svyoprob -- for ordered probit reg
- svypois -- for poisson reg
- svyintrg -- for censored & interval reg
83Using Survey Commands
- Before issuing any svy estimation command, we
identify the weight, strata and PSU identifier
variables
- svyset pweight rfact
- svyset strata domain
- svyset psu hcn
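The svyset lines above follow the version 6 syntax used in this course. In more recent Stata releases (version 9 onward) the design is declared in a single svyset line and the estimation commands take an svy: prefix. A sketch on made-up data, with two strata and three PSUs each (all values are hypothetical; variable names follow the slides):

```stata
* Toy data: values are made up for illustration only
clear
input domain hcn rfact toinc toexp
1 1 100  80000  70000
1 2 120  95000  90000
1 3 110  60000  58000
2 4 200 150000 120000
2 5 180 130000 125000
2 6 190 170000 140000
end

* One-line design declaration: PSU first, then weight and strata
svyset hcn [pweight=rfact], strata(domain)
svydescribe

* Estimation commands take the svy: prefix
svy: mean toinc toexp
svy: total toinc toexp
```

Each stratum needs at least two PSUs for the standard errors to be computed, which is why the toy data has three per stratum.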
84Using Survey Commands
- To obtain the average family income & average
family expenditure
- svymean toinc toexp
- To obtain the total family income, total family
expenditure by region
- svytotal toinc toexp, by(regn)
85Using Survey Commands
- To obtain the per capita income & per capita
expenditure
- svyratio toinc/fsize toexp/fsize
- pci & pce by urban/rural
- svyratio toinc/fsize toexp/fsize, by(urb)
86Using Survey Commands
- Linear regression of ln(pci)
- gen loginc=ln(pci)
- svyreg loginc age fsize sex prov urb
- Compare the results with the regular regression
command
- reg loginc age fsize sex prov urb
87Using Survey Commands
- Two way tables
- svytab urb poor, row se
- compared with
- tab urb poor [aw=rfact], nofreq row
88Alternatives to STATA
89Learning More about Stata
- Online tutorial, type
- tutorial intro
- List of Tutorials
- Tutorial -- Description
- intro -- An introduction to Stata
- graphics -- How to make graphs
- tables -- How to make tables
- regress -- Estimating regression models, inc
2SLS
- anova -- Estimating one-, two- and N-way
ANOVA and ANCOVA models
90Learning More about Stata
- Tutorial -- Description
- logit -- Estimating maximum-likelihood logit
and probit models
- survival -- Estimating ML survival models
- factor -- Estimating factor and principal
component models
- ourdata -- Description of the data we provide
- yourdata -- How to input your own data into Stata
91Learning More about Stata
- Email distribution list. Send email to
- Majordomo@hsphsun2.harvard.edu
- In the body of your email message type
- subscribe statalist email@address
- or, for a daily summary,
- subscribe statalist-digest email@address
92- Maraming Salamat sa inyong pakikinig.
- (Thank you for your attention)