Title: STATA Lab: EP521
STATA Lab EP521: Learning by Doing
Session 1: Exploring Data
Ray Boston, boston_at_vet.upenn.edu
Room 604 Blockley, 610 925 6557
This six-session Stata Lab series will expose the
second-level functionality of Stata through
practical demonstrations and exercises relating
to your course, EP 521.
Commands used in this lab
use: use a Stata dataset earlier stored on disk (note the clear option)
inspect: inspect specific variables (note that missing value information is available here)
describe: describe variables in a Stata dataset (note the detail option)
summarize: summarize a Stata dataset (note that it works on individual variables)
codebook: report details of data coding for the indicated variable(s)
display: display a message, or a variable value (scalar or local)
label: label a categorical variable
encode: make a numeric variable from a string variable
generate: generate a new Stata variable
replace: replace the value of a variable
list: list specific variables (note that tables can also be generated)
table: tabulate an interval variable by a categorical variable
tabstat: tabulate some statistics for specific variables
tabulate: tabulate some information (note that this is the command for Fisher's exact test)
sort: sort the data
gsort: sort the dataset in a specified way (generalized sort)
Secondary commands used in this lab
for: loop over a series of objects (note that this is an out-of-date command)
collapse: reduce your dataset to summary statistics
scatter: produce a Stata 8 scatter plot
gr7: produce a Stata 7 graph
#d ; and #d cr: change the end-of-line delimiter (; and cr)
preserve: preserve a copy of the current Stata dataset in computer memory
restore: retrieve the preserved copy of the Stata dataset (note that the stored copy is then no longer available, and the dataset in memory is replaced)
scalar: generate a Stata scalar variable
local: generate a Stata local variable (also called a local macro)
cc: generate a case-control type of epi table
cs: generate a cohort-study type of epi table
logit: perform a logistic regression
poisson: perform a Poisson regression
more: hold up processing until the spacebar is pressed
input: enter the following data into Stata, terminating entry with end
We may return to these commands for specific
purposes in later labs.
Problem
Woodward presents the following table (Table 2.9, p. 48) relating sex to smoking
status in the Scottish Heart Health Study. Adapt
the information in this table for analysis with
STATA.
Variables: sex, smoker, count
Coding:
smoker: 0 = non-smoker, 1 = smoker
sex: 0 = female, 1 = male
count: actual cell count
The data can be entered into STATA via the data
editor.
Then, when the data for each variable have been entered,
we can name and label the variables. Usually
variable naming is all that we attempt here.
Double click on any cell associated with a
variable and the naming box pops up.
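As an alternative to the data editor, the same four cells can be typed in with the input command listed among the secondary commands. The following is a minimal sketch, assuming the coding given in the Problem statement (sex: 0 = female, 1 = male; smoker: 0 = non-smoker, 1 = smoker) and the cell counts from Woodward's table; the final check against the grand total is an illustrative addition, not part of the lab's own command sequence.

```stata
* Sketch only: enter the four cells of Woodward's Table 2.9 by hand.
* Columns are sex, smoker, count; cell counts only, no margins.
clear
input sex smoker count
0 1 1562
1 0 2241
0 0 2259
1 1 2279
end
list
* Cross-check: the counts should add up to the grand total of the table.
summarize count
display "Total subjects = " r(sum)
```

If the total displayed does not match the grand total in the printed table, a cell has probably been mistyped.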
Here we display the appropriately labeled
variables and their contents.
Label the values of sex and smoker so that our
tables make sense:
. label define smlabel 0 "Non smoker" 1 "Smoker"
. label define selabel 0 "Female" 1 "Male"
. label val sex selabel
. label val smoker smlabel
Note that cell counts, and NOT margins, are
entered into STATA.
Label the variable count:
. label var count "Cell count"
For some preliminary limbering up, let's explore the
data as it stands:
. list
        sex       smoker        count
  1.    Female    Smoker         1562
  2.    Male      Non smoker     2241
  3.    Female    Non smoker     2259
  4.    Male      Smoker         2279
Why wasn't count value labeled like sex and
smoker?
We should now save the table as a file:
. save "table 2_9 Woodward.dta", replace
Where was the data saved? (cd)
Why did we include the replace option? (pre-existence)
Why do we refer to replace as an option? (,)
Why did we use quotes (" ") around the file name? (space)
What format was the data saved in? (.dta)
Let's see a table of this data:
. table sex smoker [fwe=count], row col
                     smoker
  sex       Non smoker    Smoker     Total
  Female         2,259     1,562     3,821
  Male           2,241     2,279     4,520
  Total          4,500     3,841     8,341
Let's see how we recall the coding schemes. Why
would this be needed?
. codebook sex
sex                                        (unlabeled)
           type:  numeric (byte)
          label:  selabel
          range:  [0,1]              units:  1
  unique values:  2          coded missing:  0 / 4
     tabulation:  Freq.   Numeric  Label
                      2         0  Female
                      2         1  Male
We will explore this data using the Stata command
sequence which follows.
First, some EXTREMELY important points:
- In practice you will ALWAYS build your statistical exploration of data using command sequences such as we now demonstrate. Why?
- The nature of the commands in the command sequence is ALWAYS retained on your computer in a disk file, usually close to the dataset (table 2_9 Woodward.dta) for which it was developed. Why?
- Commands are stored as ordinary text in files called do files. Why?
- Stata has a special editor, the do file editor, for the creation and editing of do files. Why?
* Get the data into Stata
use "C:\Stata\EP521\Epi 521 04\Session 1\table 2_9 Woodward.dta", clear
* Information about the raw data correctness/screening
* (screening the input using list, describe, summarize, codebook, inspect, and table variations)
list
codebook
describe
summarize
summarize sex smoke [fwe=count]
label define smlabel 0 "Non smoker" 1 "Smoker"
label define selabel 0 "Female" 1 "Male"
label val sex selabel
label val smoker smlabel
list
* If we want to copy the table to Excel: Select, then Edit, copy table, and Paste the following table
list, nolabel noobs clean
codebook
inspect
describe
* Some tables describing the data
tabulate sex [fwe=count], su(smoke) mean
table sex [fwe=count], c(mean smoke freq) format(%7.2f)
tabulate sex smoke [fwe=count], chi
table sex smoke [fwe=count], row col
tabstat smoke [fwe=count], s(mean sd sem N) by(sex) long
* Present some simple graphs of this data
* Preparing to graph
preserve
collapse smoke [fwe=count], by(sex)
generate pos=3*(sex==1)
scatter smoke sex, c(l) ml(sex)
more
scatter smoke sex, c(l) ml(sex) mlabv(pos)
more
* Stata 8 graphing commands
* Now for adjustments required by Stata 8 graphics syntax
#d ;
scatter smoke sex, c(l) ms(Sh) mlabv(pos)
    xlabel(0 1, valuelabel)
    title("Smoking Proportion By Sex")
    ytitle(" ")
    ylabel(,angle(0)) ;
#d cr
more
* Stata 7 graphing command
* gr7 requests a Stata 7 type graph
* You establish Stata 7 graph preferences using 'oldgprefs'
gr7 smoke sex, c(l) s([sex]) xlabel(0 1) ylabel l1("Smoking Proportion By Sex")
more
* Manual rr calculation
* Let's determine the male:female risk ratio for smoking
display "Risk ratio = " max(smoke[1],smoke[2])/min(smoke[1],smoke[2])
restore
* Two other ways of determining the risk ratio (rr)
* Two alternate ways of looking at the data - Risk perspective
cs smoke sex [fwe=count]
poisson smoke sex [fwe=count], irr nolog ro
* Manual or calculation
* Using scalars let's calculate the male:female odds ratio for smoking
gsort sex -smoke
scalar prob_female=count[1]/(count[1]+count[2])
scalar odds_female=prob_female/(1-prob_female)
scalar prob_male=count[3]/(count[3]+count[4])
scalar odds_male=prob_male/(1-prob_male)
scalar odds_ratio=odds_male/odds_female
scalar list _all
* Two other ways of determining the odds ratio (or)
* Two alternate ways of looking at the data - Odds perspective
cc smoke sex [fwe=count]
logit smoke sex [fwe=count], or nolog
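As a cross-check on the manual calculations in this command sequence, the male:female risk ratio and odds ratio can be worked by hand from the cell counts listed earlier (females: 1562 smokers, 2259 non-smokers; males: 2279 smokers, 2241 non-smokers). The symbols p_F, p_M, RR and OR are notation introduced here for illustration only, and the results are rounded.

```latex
\begin{align*}
p_F &= \frac{1562}{1562 + 2259} = \frac{1562}{3821} \approx 0.409, \\
p_M &= \frac{2279}{2279 + 2241} = \frac{2279}{4520} \approx 0.504, \\
\mathrm{RR} &= \frac{p_M}{p_F} \approx 1.23, \\
\mathrm{OR} &= \frac{p_M/(1-p_M)}{p_F/(1-p_F)}
             = \frac{2279 \times 2259}{2241 \times 1562} \approx 1.47.
\end{align*}
```

The point estimates reported by cs and poisson (risk perspective) and by cc and logit (odds perspective) should agree with these hand-worked values to rounding.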
An exercise to get you started using Stata
productively on your own
The following table is from Kahn & Sempos (p. 81)
and reflects a distillation of some information
extracted from the Framingham study.
Ultimately we would like to use these numbers to
possibly tell us:
- to what degree blood pressure elevation disposes us to CHD
- what the overall risk for CHD is amongst study participants in the table
- how much the risk of CHD is elevated if we have high blood pressure
Getting the CHD data into STATA and naming the
variables again. What do we mean by naming the
variables?
Perform the following tasks:
- Screen the data entered to confirm its correctness. How could you generate the margins to add confidence here? Do it.
- Label the variables appropriately. What constitutes appropriate labeling?
- Save the Stata data file. Where did you save it? What format was used?
- Verify that you have indeed saved the Stata data file.
- Perform tests to verify that you have correctly prepared your data.
Tables
- Reproduce the table in which the problem is first introduced.
- Tabulate the proportion of subjects with CHD by blood pressure grouping.
- Add standard error estimates to this table.
- Are the proportions with CHD different by blood pressure group?
Graphs
- Collapse the data into proportions with CHD, by blood pressure group.
- Produce a simple Stata 8 graph of CHD proportion against blood pressure.
- Add features to your graph to make it publication ready.
- Produce a Stata 7 graph of the same data. Which was easier?
* clear the Stata work space
clear
* Alternate mode of data entry
input Blood_pressure CHD count
1 1 95
0 1 173
1 0 201
0 0 894
end
* list and tabulate the data
list
table Blood_pressure CHD [fwe=count], row col
* label the values of the 2 categorical variables
label define blabel 0 "lt 160" 1 "gt160"
label define clabel 0 "No CHD" 1 "CHD"
label value Blood_pressure blabel
label value CHD clabel
label var count "Count"
* save the data, then reuse the data
save "Blood_Pressure CHD Backup", replace
use "Blood_Pressure CHD Backup", clear
* Screen the data
list
table Blood_pressure CHD [fwe=count], row col
codebook Blood_pressure CHD
* tabulate the data
table Blood_pressure [fwe=count], c(mean CHD) format(%7.3f) row
tabstat CHD [fwe=count], by(Blood_pressure) s(mean sem) format(%7.3f)
* Compare CHD between the BP groups: what is our basis for comparison, odds or risks?
cs CHD Blood_pressure [fwe=count]
poisson CHD Blood_pressure [fwe=count], irr nolog
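To make the odds-versus-risks question concrete, the quantities can be worked by hand from the four cell counts entered with input (high blood pressure: 95 with CHD, 201 without; lower blood pressure: 173 with CHD, 894 without). The symbols r_high, r_low, r_all, RR and OR are notation introduced here for illustration only, and the results are rounded.

```latex
\begin{align*}
r_{\text{high}} &= \frac{95}{95 + 201} = \frac{95}{296} \approx 0.321, \\
r_{\text{low}}  &= \frac{173}{173 + 894} = \frac{173}{1067} \approx 0.162, \\
r_{\text{all}}  &= \frac{95 + 173}{296 + 1067} = \frac{268}{1363} \approx 0.197, \\
\mathrm{RR} &= \frac{r_{\text{high}}}{r_{\text{low}}} \approx 1.98, \\
\mathrm{OR} &= \frac{95 \times 894}{201 \times 173} = \frac{84930}{34773} \approx 2.44.
\end{align*}
```

The risk ratio is the quantity that cs and poisson estimate; the odds ratio is noticeably larger here, which is why the choice of basis for comparison matters.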
* collapse the data
collapse CHD [fwe=count], by(Blood)
* reassign the line end
#d ;
* Draw a Stata 8 graph ;
scatter CHD Blood, c(l) ms(Oh) clc(blue) clw(thick) clp(dash)
    xlabel(0 1, valuelabel)
    ylabel(,angle(0))
    ytitle("Proportion with CHD")
    xtitle("Blood Pressure")
    title("Proportion of Subjects with CHD by Blood Pressure") ;
more ;
* Draw a Stata 7 graph ;
label var Blood " " ;
gr7 CHD Blood, c(l) s([Blood]) xlabel(0 1) ylabel
    l1("Proportion with CHD")
    t1("Proportion of Subjects with CHD by Blood Pressure")
    b1("Blood Pressure") ;
#d cr
Some points to help clarify the above instruction
sequence
1. The data entry mode, input, was simplified for display purposes only.
2. Frequency weighted data is always manipulated, e.g. tabulated, with the assistance of [fweight=frequency_weight_number], e.g. [fwe=count].
3. Labeling the values of categoric variables requires 2 steps: defining the labels for the values, and applying the labels.
4. When saving a Stata dataset use the replace option, or else you may be advised that the file already exists, and processing will stall.
5. When loading a saved dataset use the clear option, or Stata will protect the current dataset, thus stalling the load.
6. When tabulating the data use format specifications to acquire neat tables.
7. When using the tabstat command with 2 categoric variables the longstub option assists in achieving clearer output.
8. The epitab command cs (cohort study) will, like poisson, produce risk ratios as opposed to odds ratios in describing the association of the outcome with exposure. Compare with cc and logit.
9. collapse reduces a dataset to a statistical summarization of same. Use the data editor to make sure you understand the consequences of data modification commands like collapse (see contract, expand, reshape).
Some points to help clarify the above instruction
sequence, cont.
10. Because some Stata commands become extremely long it is not practical to place the entire command on a single line of text. To break the command into multiple lines we replace the end-of-command terminator, normally the carriage return (cr), with the semicolon (;). Don't forget to restore the usual terminator (cr).
11. Stata 8 graphing commands are very verbose and need to be distributed over several lines using the semicolon command terminator.
12. The command more holds up execution within Stata until the spacebar is pushed. To ensure that plots are held on the screen it helps to include a more after each plot.
13. Within Stata 8 we can access the much simpler Stata 7 graphing services. To perform a Stata 7 plot using Stata 8 use the command gr7, and the appropriate syntax. You may need to set the Stata 7 graphing preferences, and this can be done with the command oldgprefs.
14. To void variable labels assign a null label to a variable. This will ensure that Stata's attempt to smarten your Stata 7 graphs doesn't take the plot appearance out of your control.
15. Stata commands can occasionally be shortened to ease their use.
16. Variable names can also be contracted to ease their access.
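The delimiter switch described in point 10 can be tried on a toy example. The following do-file sketch uses a hypothetical two-row dataset (the variables x and y are invented for illustration) and splits one display command over two lines; #delimit is the full spelling of the #d shorthand, and it only takes effect inside a do file, not at the interactive prompt.

```stata
* Sketch only: run from a do file.
* x and y are hypothetical variables used purely for illustration.
clear
input x y
0 1
1 2
end
* Switch the command terminator from carriage return to semicolon
#delimit ;
display "Sum of y over both rows = "
    y[1] + y[2] ;
* Restore the usual terminator ;
#delimit cr
list
```

While the semicolon is the terminator, every command (and every comment line) must end with one, which is why the last comment inside the delimited region carries its own semicolon.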
The Excel file, cardatarb.xls, contains some
recent (New Yorker, Jan 05, 2004) accident
statistics relating to indirect and direct road
deaths when a range of different car types were
involved. The purpose of the investigation
underpinning the data was to see if large
vehicles are associated with different types of
accidents than small cars. You are asked to
perform the following tasks:
- Get the data from Excel directly into Stata.
- Describe and summarize the data.
- Generate a neat table of all types of deaths (these are actually death rates per million vehicles of the indicated type) by vehicle type. Is there a suggestion of an association here?
- Make a numeric variable out of the car type variable. Confirm that the new variable you have created is indeed of the type sought.
- Label the numeric variable appropriately. Hint: you'll need codebook's help here.
- The vehicles are essentially of two classes, large and small. Create a new numeric variable which is 1 for large vehicles, and 0 for others. Label this variable appropriately.
- See if your data breaks down equitably by your new numeric variable.
- Tabulate a breakdown of deaths of the different types by your new size-related vehicle group variable. How could you actually detect a statistically significant difference here? (see nptrend)
* Load and screen the data: describe and summarize
version 8.2
cd "C:\Stata\EP521\Epi 521 04\Session 1"
use "Car Death Data.dta", clear
describe
summarize
* Create a neat table
tabstat Direct Indirect Total, by(Car_type) format(%7.2f)
* Now we need to combine the cars into small and large
* Create a numeric variable eCar_type from Car_type, then create a new binary form of eCar_type, size
encode Car_type, gen(eCar_type)
codebook eCar_type
gen size=1 if eCar==1 | eCar==3 | eCar==5 | eCar==6
replace size=0 if size==.
* Tabulate eCar_type vs size
table eCar size, row col
* Label the values of size and tabulate the observations against size
label def slabel 1 "Large" 0 "Small"
label val size slabel
tabstat Direct Indirect Total, by(size) format(%7.2f)
* See if there is a pattern in the observations with size
for var D - T: nptrend X, by(size)
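The encode step is easier to follow on a toy example. The sketch below uses three hypothetical vehicle names (not the lab's car data) to show that encode numbers the categories in alphabetical order of the strings, which is why codebook is needed before deciding which codes count as large. The grouping of codes 2 and 3 as large is an assumption made purely for illustration.

```stata
* Sketch only: hypothetical vehicle names, not the lab's dataset.
clear
input str10 Car_type
"Minivan"
"Compact"
"SUV"
end
* encode builds a labeled numeric variable from the string variable;
* codes follow the alphabetical order of the strings
encode Car_type, gen(eCar_type)
list Car_type eCar_type, nolabel
* Assumed grouping for illustration: codes 2 and 3 are "large"
gen size = 1 if eCar_type == 2 | eCar_type == 3
replace size = 0 if size == .
label define slabel 1 "Large" 0 "Small"
label values size slabel
tabulate eCar_type size
```

Listing with nolabel shows the underlying codes rather than the value labels, which makes it easy to confirm that the new variable really is numeric.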