Stata Seminar - PowerPoint PPT Presentation

Provided by: suss

1
Stata Seminar
  • Session 1
  • Francisco Jose Gonzalez Carreras
  • fjg23_at_sussex.ac.uk

2
Source
  • This is the source used. The sessions will be:
  • First session: hands-on session (Chapter 1).
  • Second session: grammar of Stata (Chapter 3).
  • Third session: creating and changing variables
    (Chapter 5).
  • Fourth session: charts and linear regression
    (Chapters 6 and 8).
  • This is the 2005 edition; a new one is
    forthcoming.
  • Not in the library, but it can be borrowed by
    interlibrary loan (need to pay 2, though).

3
Starting Stata
  • Download the data (the kk.zip file) from
  • http://www.stata-press.com/data/kk.html
  • To start the session: Start > All Programs >
    Intercooled Stata.

4
Screen
Pop-Up menu
Past commands appear here
Results appear here
Working directory displayed here
Variable list displayed here
Commands typed appear here
5
Stata Screen
  • Change the default windows
  • Right-click in the results window to change
    its font.
  • You can also move windows around
  • If something was changed and you want to restore
    the original settings: Pop-Up menu > Prefs >
    Manage Preferences > Load Preferences > Factory
    Settings (in version 10, go to Pop-Up menu > Edit >
    Preferences > and the rest is the same).

6
Analysis: Input commands
  • Type d in the command window and press Return.
  • d is the abbreviation of describe, a command that
    describes the file in memory.
  • The number of observations and variables is zero
    because we have not loaded any file.
  • Memory is the share of working memory being used.
    Data are loaded into RAM.
  • Sorted by: no sorting criteria.
  • We have not loaded any data, so let's load a file.

7
Analysis: Directory
  • Command cd: change directory. We have to move to
    the directory where the files are. Type cd
    c:\data\kk to move. This allows us to name a
    file without having to write its whole path.
  • dir will show all the files contained in the
    directory.
  • When --more-- appears, press Enter to advance
    one line or press the space bar to see the next
    screen. Press q to stop the output.
  • You can use dir *.dta, which will show you all
    the files with the .dta extension.

8
Analysing data: loading data
  • To use a file, type
  • use data1
  • This command loads a Stata file into working
    memory (RAM).
  • Default memory size is 1 megabyte, and sometimes
    you will need more. For big files, use set memory
    to increase it.
  • Stata assumes it is a .dta file.
  • Then type describe
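
The load-and-describe step can be tried even without the course files. This is a minimal sketch that assumes auto.dta, a small example dataset shipped with Stata, as a stand-in for data1.dta; with the kk.zip files unpacked in the working directory, use data1 would take its place.

```stata
* Assumption: auto.dta (installed with Stata) stands in for data1.dta
sysuse auto, clear

* Show the number of observations and variables, storage types and labels
describe
```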

9
Analysing data: Variables and Observations
10
Analysing data: Variables and Observations
  • The file is a subsample of the German
    Socioeconomic Panel (GSOEP), a survey running
    since 1984 in which the same households,
    families and individuals are interviewed
    once a year.
  • On the screen:
  • There are 3,340 observations and 47 variables.
    This means that 47 pieces of information are
    stored for each individual.
  • The first variable, persnr, contains only the
    code that is unique to each individual.
    Sometimes you will need to create
    this from other information.
  • Storage type has to do with the size of a variable
    and is important for saving resources (more on
    this in later sessions).
  • Label is a brief description of the variable
    (more coming).

11
Analysing data: Looking at data
  • We have too many variables, so we get rid of
    some. Type
  • drop ymove-np9507 (drop gets rid of the range of
    variables named in the command; keep gets rid
    of the variables not named)
  • To have a look at all observations, type list (then
    q to stop more screens!)
  • Too much information, not operational. We will
    reduce it.
  • Let's focus on the second observation: a man, born
    in 1971, household head, single...
  • Missing value (.): the person was not asked the
    question or did not respond.

missing value
12
Analysing data: Looking more carefully
  • Listing data in this fashion is not useful, so we
    will be more specific.
  • We can follow list with variable names to list
    only those variables. Type list gender income
  • but again we have more than 3,000
    observations!
  • To narrow down our look we will first use the in
    qualifier. This qualifier restricts a command by the
    position of the observations in the current sort
    order. Type
  • sort income
  • list gender income in 1/10

13
Analysing data: looking more carefully
  • What do these commands do?
  • sort income sorts the data in ascending order, so
    the person with the lowest income comes first.
    This establishes the order.
  • list gender income in 1/10 lists the first 10
    observations. It will show the gender and income
    of the ten observations
    (respondents) with the lowest income (remember,
    we sorted in ascending order by income).
  • What would this do?
  • list gender income in 2/4
  • It lists the individuals from the second to the fourth.
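
To see how sort and the in qualifier interact, here is a small sketch on made-up data. The five observations are hypothetical and only illustrate the mechanics; they are not taken from data1.

```stata
* Hypothetical toy data: gender (1 = male, 2 = female) and monthly income
clear
input gender income
1 2500
2  800
1    0
2 1200
2 3100
end

* Put the lowest income first
sort income

* The three lowest incomes, then positions 2 to 4 of the sorted data
list gender income in 1/3
list gender income in 2/4
```

Because in refers to positions in the current order, the same in 1/3 would pick different respondents if the data were sorted by another variable.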

14
Analysing data: Summary statistics
  • To obtain summary statistics about income, type
  • summarize income
  • The output shows the number of observations
    used in the calculations, the arithmetic mean, the
    standard deviation, the minimum and the maximum.
    You have only 3,034 observations because some of
    them are missing (.) and are not
    taken into account in the calculations.
  • You can summarize a list of variables simply by
    adding more to the list. If you want to summarize
    all the variables, just type summarize (the
    abbreviation sum also works).

15
Analysing data: if qualifier
  • What if we wanted to have a look at the
    difference in income between women and men? We
    use the if qualifier to summarize only the
    observations that meet the if
    condition. Type
  • summarize income if gender==1
  • summarize income if gender==2
  • Each command summarizes only the observations in
    which gender is equal to a particular value: 1
    refers to males and 2 to females in this survey.
    See the difference in income.
  • The double equals sign (==) is necessary; otherwise
    Stata will report invalid syntax.

16
Analysing data: missing values
  • Men seem to earn more than women. But these
    averages are calculated taking into account those
    observations with income==0. These might be
    more frequent among women, so in order to compare
    only individuals with positive income we can
    either type
  • sum income if gender==1 & income>0
  • sum income if gender==2 & income>0
  • or recode 0 incomes to .a (missing) so that
    they will not be taken into account when
    calculating the average. Type
  • mvdecode income, mv(0=.a)
  • sum income if gender==1
  • sum income if gender==2
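
Both routes can be compared side by side in a small sketch. The data are hypothetical; the point is only that the two approaches give the same mean.

```stata
* Hypothetical toy data with some zero incomes
clear
input gender income
1 2000
1    0
2 1500
2    0
2 1000
end

* Route 1: restrict the sample with a compound condition
summarize income if gender == 2 & income > 0

* Route 2: turn zeros into the extended missing value .a
mvdecode income, mv(0=.a)
summarize income if gender == 2
```

Stata treats missing values as larger than any number, so income > 0 is also true for missing incomes. That is harmless here because summarize skips missing values anyway.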

17
Analysing data: by prefix
  • A prefix is a command that is written in front of
    the actual Stata command.
  • It has two parts:
  • the prefix itself, by
  • a variable list, in our example only gender.
  • The structure is
  • prefix command: actual command
  • In the case of by, the actual command is repeated
    for each category of the variables in the prefix
    list, or bylist.
  • A condition is that the data have to be sorted by
    the variables in the bylist.
  • Type
  • sort gender
  • by gender: summarize income
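
A sketch of the by prefix on hypothetical data that start out unsorted. It also shows bysort, which is not mentioned on the slide but combines the sort and the prefix in one step.

```stata
* Hypothetical toy data, deliberately not sorted by gender
clear
input gender income
2 1500
1 2000
2 1000
1 3000
end

* by requires the data to be sorted by the bylist
sort gender
by gender: summarize income

* Same result in one line: bysort sorts first, then applies the prefix
bysort gender: summarize income
```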

18
Analysing data: missing recoding and by prefix
Same mean
19
Analysing data: Command options
  • Options are command specific, unlike the in and if
    qualifiers or the by prefix.
  • They are written after the actual command,
    following a comma.
  • In the case of summarize, the detail option
    gives much more information about the income
    distribution, such as skewness and kurtosis.
  • Type
  • sum income, detail
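
As an illustration of what detail adds, the sketch below uses five hypothetical incomes with one large value, so the distribution is right-skewed. The extra statistics are stored in r() alongside the usual ones.

```stata
* Hypothetical toy data with one large income
clear
input income
1000
1200
1500
2000
9000
end

summarize income, detail

* Median and skewness as stored by the detail option
display r(p50)
display r(skewness)
```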

Median
Moments
20
Analysing data: Frequency tables
  • The command that generates frequency tables is
    tabulate (or tab), which has to be followed by
    one or two variables, generating a one-way
    or a two-way frequency table.
  • Type
  • tabulate gender
  • tabulate emp gender
  • The first variable is the row variable, the second
    is the column variable.
  • Options for this command include row and column, which
    add the row and column percentages.

21
Variable labels and value labels
  • See the differences between the first and the
    second table. In the second we only have the
    numeric values that correspond to the different
    types of employment status.
  • The variable label is a brief description of
    the variable. Let's change it by typing (it does not
    matter if it already had one)
  • label variable emp "Status employment in 97"
  • Value labels are the labels for the different
    values. In gender, the label for value 1 was
    male and the label for value 2 was female.
    emp has seven different values. Let's
    label them. Type (without breaking the line)
  • label define emplb 1 "Full time" 2 "Part time" 3
    "Retraining" 4 "irregular" 5 "not working" 6
    "military service" 7 "unemployed", modify
  • label values emp emplb
  • The labels are stored in emplb. They can be assigned
    to any other variable with the same values. Let's
    tabulate again to see the changes
  • tab emp gender, column nofreq
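
The three labelling steps can be sketched on a toy dataset. The observations are hypothetical and only three of the seven emp categories appear; the label texts are the ones used on the slide.

```stata
* Hypothetical toy data
clear
input emp gender
1 1
2 2
1 2
5 2
end

* Step 1: describe the variable itself
label variable emp "Status employment in 97"

* Step 2: define a named set of value labels
label define emplb 1 "Full time" 2 "Part time" 5 "not working", modify

* Step 3: attach that set to the variable
label values emp emplb

tabulate emp gender, column nofreq
```

Keeping the definition (label define) separate from the assignment (label values) is what lets one set of labels serve several variables.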

22
Variable labels and value labels
was Employment Status 1997
value labels created
23
Analysing data: Graphs
  • Part-time employment or unemployment is more
    frequent among women. Maybe income differences
    are due to employment status.
  • Graphing the distribution might give us a quick and
    very informative look at the data. Type
  • graph box income, over(emp)
  • This produces a box-and-whisker plot, with one
    box for each group
    of emp (the over(emp) option).
  • Outliers are the dots. Income is skewed in all
    subgroups. The median for full time is higher than
    for the rest. If relatively more women work part
    time, we might think that
    income inequality is due to the division of
    labor within couples rather than to gender
    discrimination. We must first control for
    employment status.

24
Analysing data: graphs
Outliers
Third quartile
Median
First quartile
employment status
25
Getting help
  • How do we find out about the effects of gender and
    employment status on income? Regression
    analysis. How do we do it? Let's have a look at the
    help.
  • The search command looks through all Stata resources
    for topics that might be linked to your search. Type
  • search Linear Regression
  • search model
  • search OLS
  • You can also use the help command to get
    information about a command (now that we know
    that we should use regress). Type
  • help regress
  • You find the syntax, an explanation and a description
    of the available options (options, as said
    before, are command specific).
  • Pop-Up menu: Help > Search or Help > Stata
    Command

26
Getting Help
27
Analysing data: Recoding variables
  • We have the dependent variable, income, and two
    independent variables, gender and employment
    status.
  • Gender is a dichotomous variable. Such variables
    conventionally take the values 0 or 1 in
    regressions. We create a variable that is 1 for
    men and 0 for women. Type
  • generate men=1 if gender==1
  • replace men=0 if gender==2
  • The first command creates a variable equal to 1 if
    gender==1 and missing otherwise. The second replaces
    with 0 those missing values that meet the condition
    gender==2.
  • With employment status we will do something
    similar. This variable is not dichotomous. We
    will do just
  • generate fulltime=1 if emp==1
  • replace fulltime=0 if emp==2
  • because the analysis will be limited to full
    time/part time.

28
Analysing data: Linear Regression
  • We will run a basic linear regression with the
    data at hand. We saw that we need to use
    regress. We type regress followed by the
    dependent variable and then the independent
    variables. Type
  • regress income men fulltime
  • Interpretation
  • average monthly income for individuals with
    men==0 and fulltime==0 (part-time women
    employees) is 965.
  • female full-time workers earn on average 806
    more.
  • independent of full time/part time, men earn on
    average 451 more than women.
  • Therefore, income inequality cannot be explained
    by the higher proportion of female part-time
    workers in the data file alone.

29
Analysing data: Linear Regression
30
Do files
  • To reproduce the results of your session, use do
    files.
  • A do file is a text file where you save your
    commands in order to store your work sessions.
  • Type
  • doedit
  • This opens the do-file editor. The first line sets the
    version so that the do file can be run with any
    future version. We have written the commands that
    were necessary to do the regression as above.
  • Once you have copied the commands: File > Save As
    > an1.do in the current directory. Next type
  • do an1.do
  • You can run all the commands again!

31
Do files
32
Exiting Stata
  • Once you have saved your session or your work in
    your do file, it is better to leave Stata without
    saving changes to the data.
  • Changes in do files are easy to undo; changes in
    the original database might not be possible to
    undo. ALWAYS HAVE MORE THAN ONE COPY OF THE
    ORIGINAL DATABASE, just in case.
  • If you want to save changes, save them in a new
    file, typing, for instance
  • save mydata
  • Then you can exit Stata: exit, clear