Title: Stata Seminar
1Stata Seminar
- Session 1
- Francisco Jose Gonzalez Carreras
- fjg23_at_sussex.ac.uk
2Source
- This is the source used. The sessions will be:
- First session: Hands-on session (Chapter 1).
- Second session: Grammar of Stata (Chapter 3).
- Third session: Creating and changing variables (Chapter 5).
- Fourth session: Charts and linear regression (Chapters 6 and 8).
- This is the 2005 edition; a new one is forthcoming.
- Not in the library, but it can be borrowed by interlibrary loan (need to pay 2, though).
3Starting Stata
- Download the data (kk.zip file) from
- http://www.stata-press.com/data/kk.html
- To start the session: Start > All Programs > Intercooled Stata.
4Screen
Pop-Up menu
Past commands appear here
Results appear here
Working directory displayed here
Variable list displayed here
Commands typed appear here
5Stata Screen
- Change the default windows.
- Right-click in the Results window to change its font.
- You can also move windows around.
- If something was changed and you want to restore the original settings: Pop-up menu > Prefs > Manage Preferences > Load Preferences > Factory Settings. (In version 10, go to Pop-up menu > Edit > Preferences >; the rest is the same.)
6Analysis Input commands
- Type d in the command window and press Return.
- d is the abbreviation of describe, a command that describes the file in memory.
- The number of observations and variables is zero: we have not loaded any file.
- Memory: the share of working memory being used. Data are loaded into RAM.
- Sorted by: no sorting criteria.
- We have not loaded any data, so let's load a file.
7Analysis Directory
- Command cd: change directory. We have to move to the directory where the files are. Type cd c:\data\kk to move. This allows us to name the file without having to write its whole path.
- dir shows all the files contained in the directory.
- When -more- appears, press Enter to advance one line, or press the space bar to see the next screen. Press q to stop the results coming.
- You can use dir *.dta, which shows all the files with the dta extension.
8Analysing data loading data
- To use a file, type
- use data1
- This command loads a Stata file into the working memory (RAM).
- The default memory size is 1 megabyte, and sometimes you will need to set a bigger capacity. For big files, set memory is needed.
- Stata assumes it is a .dta file.
- Then type describe
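
The steps so far can be collected into one short sequence. This is only a sketch, assuming the kk.zip files were unpacked into c:\data\kk as on the earlier slide. The 10m value is an arbitrary illustration, and set memory matters only in older Stata releases (newer ones manage memory automatically, as far as we are aware).

```stata
* Move to the folder holding the unpacked kk.zip files (path as used above)
cd c:\data\kk
* List only Stata data files
dir *.dta
* Older releases only: enlarge working memory before loading a big file
* (10m is an arbitrary example value, not a recommendation)
set memory 10m
* Load data1.dta; the .dta extension is assumed, clear discards data in memory
use data1, clear
describe
```

The comments start with * at the beginning of a line, a form that also works when typing directly in the command window.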
9Analysing data Variables Observations
10Analysing data Variables Observations
- The file is a subsample of the German Socio-Economic Panel (GSOEP), a survey running since 1984 in which the same households, families, and individuals are interviewed once a year.
- On the screen:
- There are 3,340 observations and 47 variables. This means that 47 pieces of information are stored for each individual.
- persnr is the first variable and contains only the code that is unique to each individual. Sometimes you will need to create this from other information.
- Storage type has to do with the size of the variable and is important for saving resources (more on this in later sessions).
- Label is a brief description of the variable (more coming).
11Analysing data Looking at data
- We have too many variables, so we get rid of some. Type
- drop ymove-np9507 (drop gets rid of the range of variables named in the command; keep gets rid of the variables not named)
- To have a look at all observations, type list (then q to stop more screens!)
- Too much information, not operational. We will reduce it.
- Let's focus on the second observation: a man, born in 1971, household head, single...
- Missing value (.): the person was not asked the question or did not respond.
missing value
12Analysing data Looking more carefully
- Listing data in this fashion is not useful, so we will be more specific.
- We could list just a few variables. Type list gender income
- but again we have more than 3,000 observations!
- To narrow down our look, we first use the in qualifier. This qualifier restricts a command to observations by their position in the current sort order. Type
- sort income
- list gender income in 1/10
13Analysing data looking more carefully
- What do these commands do?
- sort income sorts the data in ascending order, so the person with the lowest income comes first. This establishes the order.
- list gender income in 1/10 lists the first 10 observations. It shows gender and income for the ten observations (respondents) with the lowest income (remember, we sorted in ascending order by income).
- What would this do?
- list gender income in 2/4
- It lists the individuals from the second to the fourth.
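
The in qualifier also accepts positions counted from the end of the data. A small sketch, assuming data1 is loaded and that persnr, gender and income are still in the file:

```stata
* Sort in ascending order of income
* (missing incomes sort after all numbers, so they end up at the very bottom)
sort income
* First five observations: the five lowest incomes
list persnr gender income in 1/5
* Last five observations: -5 counts back from the end, l stands for the last one
list persnr gender income in -5/l
```

Because missing values sort last, the final rows are likely to show missing incomes rather than the highest earners.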
14Analysing data Summary statistics
- To obtain summary statistics about income, type
- summarize income
- The output shows the number of observations used in the calculation, the arithmetic mean, the standard deviation, the minimum and the maximum. You have only 3,034 observations because some of them were set to missing (.) and are not taken into account in the calculations.
- You can summarize several variables simply by adding more to the list. If you want to summarize all the variables, just type summarize (the abbreviation sum also works).
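
summarize also leaves its results behind for later use. A minimal sketch, assuming data1 is loaded:

```stata
summarize income
* Show everything summarize stored
return list
* Reuse single results: number of non-missing observations and the mean
display r(N)
display r(mean)
```

The stored results are overwritten by the next command that stores its own, so they should be used straight away.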
15Analysing data if qualifier
- What if we wanted to have a look at the difference in income between women and men? We use the if qualifier and summarize the data only for observations that meet the if condition. Type
- summarize income if gender == 1
- summarize income if gender == 2
- The first command summarizes only the observations in which gender equals 1. In this survey, 1 refers to males and 2 to females. See the difference in income.
- The double equals sign is necessary; otherwise Stata reports invalid syntax.
16Analysing data missing values
- Men seem to earn more than women. But these averages are calculated including observations with income == 0. These might be more frequent among women, so to compare only individuals with positive income we can either type
- sum income if gender == 1 & income > 0
- sum income if gender == 2 & income > 0
- or recode incomes of 0 to missing, so that they are not taken into account when calculating the average. Type
- mvdecode income, mv(0=.a)
- sum income if gender == 1
- sum income if gender == 2
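
One caution when combining the if qualifier with missing values: Stata treats a missing value as larger than any number, so the condition income > 0 is also true for missing incomes. summarize ignores them anyway, but commands that count or list observations do not. A sketch, assuming data1 is loaded:

```stata
* Counts positive incomes AND missing incomes
count if income > 0
* Counts positive incomes only
count if income > 0 & !missing(income)
* Equivalent, because every missing value is larger than any number
count if income > 0 & income < .
```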
17Analysing data by prefix
- A prefix is a command written in front of the actual Stata command.
- It has two parts:
- the prefix itself, by
- a variable list, in our example only gender.
- The structure is
- prefix command: actual command
- In the case of by, the actual command is repeated for every category of the variables in the prefix list, or bylist.
- A condition is that the data have to be sorted by the variables in the bylist. Type
- sort gender
- by gender: summarize income
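
The sorting step and the by prefix can also be written together. A short sketch, assuming data1 is loaded:

```stata
* Sorts by gender and then runs summarize once per gender, in one line
bysort gender: summarize income
* The if qualifier still works inside the prefixed command
bysort gender: summarize income if income > 0
```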
18Analysing data missing recoding by prefix
Same mean
19Analysing data Command options
- Options are command specific, unlike the in and if qualifiers or the by prefix.
- They are written after the actual command, following a comma.
- In the case of summarize, the detail option gives much more information about the income distribution: skewness, kurtosis, ... Type
- sum income, detail
Median
Moments
20Analysing data Frequency tables
- The command that generates frequency tables is tabulate (or tab), which has to be followed by one or two variables, generating a one-way or a two-way frequency table. Type
- tabulate gender
- tabulate emp gender
- The first variable is the row variable; the second is the column variable.
- Options for this command include row and column, which return the row and column percentages.
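
A few ways of combining these options, as a sketch that assumes data1 is loaded:

```stata
* Column percentages alongside the frequencies
tabulate emp gender, column
* Row percentages only, suppressing the frequencies
tabulate emp gender, row nofreq
* Include missing values as a category of their own
tabulate emp, missing
```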
21Variable labels and value labels
- See the differences between the first and the second table. In the second we only have the values that correspond to the different types of employment status.
- The variable label is a brief description of the variable. Let's change it by typing (it does not matter if it already had one)
- label variable emp "Status employment in 97"
- Value labels are the labels for the different values. In gender, the label for value 1 was male and the label for value 2 was female. emp has seven different values. Let's label them. Type (without breaking the line)
- label define emplb 1 "Full time" 2 "Part time" 3 "Retraining" 4 "irregular" 5 "not working" 6 "military service" 7 "unemployed", modify
- label values emp emplb
- The labels are stored in emplb. They can be assigned to any other variable with the same values. Let's tabulate again to see the changes
- tab emp gender, column nofreq
22Variable labels and value labels
was Employment Status 1997
value labels created
23Analysing data Graphs
- Part-time employment or unemployment is more frequent among women. Maybe income differences are due to employment status.
- Graphing the distribution might give us a quick and very informative look at the data. Type
- graph box income, over(emp)
- This produces a box-and-whisker plot with one box for each group of emp (the over(emp) option).
- Outliers are the dots. Income is skewed for all subgroups. The median for full-time workers is higher than for the rest. If relatively more women work part time, we might think that income inequality is due to the division of labor within couples rather than to gender discrimination. We must first control for employment status.
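
Two variations on the same graph may make the comparison easier to read. This is a sketch, assuming data1 is loaded:

```stata
* Same box plot, but without plotting the outside values (the dots)
graph box income, over(emp) nooutsides
* One box per employment status within each gender
graph box income, over(emp) over(gender)
```

Hiding the outside values lets the boxes use the full height of the graph, which helps when a few very high incomes compress the rest.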
24Analysing data graphs
Outliers
Third quartile
Median
First quartile
employment status
25Getting help
- How do we find out about the effects of gender and employment status on income? Regression analysis. How to do it? Let's have a look at the help.
- The search command looks through all Stata resources for topics linked to your search. Type
- search Linear Regression
- search model
- search OLS
- You can also use the help command to get information about a command (now that we know that we should use regress). Type
- help regress
- You will find the syntax, an explanation, and a description of the available options (options, as said before, are command specific).
- Pop-up menu: Help > Search or Help > Stata command
26Getting Help
27Analysing data Recoding variables
- We have the dependent variable, income, and two independent variables: gender and employment status.
- Gender is a dichotomous variable. Such variables conventionally take the values 0 or 1 in regressions. We recode the variable to make men 1 and women 0. Type
- generate men = 1 if gender == 1
- replace men = 0 if gender == 2
- The first command creates a variable equal to 1 if gender == 1 and missing otherwise. The second replaces with 0 those missing values that meet the condition gender == 2.
- With employment status we will do something similar. This variable is not dichotomous. We will just do
- generate fulltime = 1 if emp == 1
- replace fulltime = 0 if emp == 2
- because the analysis will be limited to full time/part time.
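
The same indicator variables can be built in one line each, because a logical expression evaluates to 1 when true and 0 when false. This is a sketch of an alternative, not the approach used in the session. men2 and fulltime2 are hypothetical names chosen so that they do not clash with men and fulltime, and the first line matches the two-step version only if gender takes no values other than 1 and 2.

```stata
* 1 for men, 0 for any other non-missing gender, missing where gender is missing
generate men2 = (gender == 1) if !missing(gender)
* Restrict to full time (1) and part time (2); all other statuses become missing
generate fulltime2 = (emp == 1) if inlist(emp, 1, 2)
* Cross-check the new variables against the originals, showing missing values
tabulate gender men2, missing
tabulate emp fulltime2, missing
```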
28Analysing data Linear Regression
- We will run a basic linear regression with the data at hand. We saw that we need to use regress. We type regress followed by the dependent variable and then the independent variables. Type
- regress income men fulltime
- Interpretation:
- Average monthly income for individuals with men == 0 and fulltime == 0 (female part-time employees) is 965.
- Female full-time workers earn on average 806 more than female part-time workers.
- Independent of full time/part time, men earn on average 451 more than women.
- Therefore, income inequality cannot be explained solely by the higher proportion of female part-time workers in the data file.
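
The interpretation can be checked by adding up the stored coefficients for each combination of gender and employment status. A sketch, assuming men and fulltime have been created as described. With the rounded figures quoted above, the four values come to about 965, 1,771, 1,416 and 2,222.

```stata
regress income men fulltime
* Predicted income for each group, built from the stored coefficients
display "female, part time: " _b[_cons]
display "female, full time: " _b[_cons] + _b[fulltime]
display "male,   part time: " _b[_cons] + _b[men]
display "male,   full time: " _b[_cons] + _b[men] + _b[fulltime]
```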
29Analysing data Linear Regression
30Do files
- To reproduce the results of your session, use do files.
- A do file is a text file where you save your commands in order to store your work sessions. Type
- doedit
- This opens the do-file editor. The first line sets the version, so that the do file can be run with any future version. We have written the commands that were necessary to do the regression as above.
- Once you have copied the commands, File > Save as > an1.do in the current directory. Next type
- do an1.do
- You can run all the commands again!
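
As an illustration, an1.do might look roughly like the following. This is a reconstruction from the commands used in this session, not the original file. The version number is an assumption and should match the release you are running, and the file is assumed to be run from the directory that contains data1.dta.

```stata
* an1.do: a sketch of the session's regression as a do file
* Assumption: replace 9 with the version of Stata you are running
version 9
* Load the data, discarding anything already in memory
use data1, clear
* Treat incomes of 0 as missing
mvdecode income, mv(0=.a)
* Indicator variables: men (1/0) and fulltime (1 = full time, 0 = part time)
generate men = 1 if gender == 1
replace men = 0 if gender == 2
generate fulltime = 1 if emp == 1
replace fulltime = 0 if emp == 2
* The regression itself
regress income men fulltime
```

Starting from use data1, clear means the do file always begins from the original data, so it can be rerun as often as needed.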
31Do files
32Exiting Stata
- Once you have saved your session or your work in your do file, it is better to leave Stata without saving changes.
- Changes in do files are easy to undo; changes in the original database might not be possible to undo. ALWAYS HAVE MORE THAN ONE COPY OF THE ORIGINAL DATABASE, just in case.
- If you want to save changes, save them in a new file, typing, for instance
- save mydata
- Then you can exit Stata: exit, clear