Title: Automatisation in Stata
1Automatisation in Stata
- Jan Hagemejer
- Joanna Tyrowicz
2Plan
- Standard solutions
- Where do they not work?
- Usually there is more than one way to estimate: how to choose?
- Using loops and the global function together
- Generating resultssets for atypical estimations
- Difficulties with using bootstrap (and obtaining resultssets)
- Summary comments and some advice
3The standard route
- Problem: several estimations of similar form.
- Need to compare results.
- Three simple solutions:
- Solution 1: brute force (sit and type)
- Solution 2: use parmby/parmest (commands developed by Roger Newson) if the estimations run on simple categories in the data (limitations of the by command)
- Solution 3: use loops (see N. Cox's material from the previous SUGM)
- outreg/outreg2: nicely formatted tables,
- publication-ready,
- in many formats, even LaTeX.
- Note: if you need nice summary statistics, you can use outsum, either with by or within loops
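As an illustration of Solution 3, here is a minimal sketch that uses only built-in commands and Stata's shipped auto dataset. The variables and the stored names m_price, m_mpg and m_weight are assumptions chosen for the example, not part of the original talk.

```stata
* Sketch: run the same regression for several outcomes and compare them
sysuse auto, clear
foreach y in price mpg weight {
    quietly regress `y' length foreign
    estimates store m_`y'          // keep each set of results under its own name
}
* One table with coefficients, standard errors, N and R2 for all three models
estimates table m_price m_mpg m_weight, b(%9.3f) se(%9.3f) stats(N r2)
```

The user-written outreg2 or parmest can replace the last line when a publication-ready file is needed.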
4Where do the problems come from?
- The 2nd and 3rd solutions work only with regression-type estimations
- However, some procedures are incompatible with pre-cooked solutions
- Examples:
- Marginal effects:
- Use outreg2 in Stata 10 if you use dprobit/logit instead of probit/logit
- Use outreg2 in Stata 11 with margins and/or mfx2 (remember the replace option)
- Nice statistics:
- Use tempname and postfile syntax
- Rolling window on any of this type of analysis
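The tempname and postfile route can be sketched as follows. This is an illustrative example on the auto dataset; the posted variable names (varname, stat_mean, stat_sd) are assumptions made for the sketch.

```stata
* Sketch: collect summary statistics into a small results dataset with postfile
sysuse auto, clear
tempname memhold
tempfile results
postfile `memhold' str32 varname stat_mean stat_sd using `results'
foreach v of varlist price mpg weight {
    quietly summarize `v'
    post `memhold' ("`v'") (r(mean)) (r(sd))   // one row per variable
}
postclose `memhold'
use `results', clear
list
```

Each call to post adds one observation, so the same pattern works inside any loop, including a rolling window.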
5Not everything may be solved this way
- Reason 1: things are more complex than they seem (to come in a sec...)
- Reason 2: some things are not listed in the output
- Example: various versions of R2 or sample size in simple regressions
- outreg/parmest typically do not include them
- they can be included as additional locals
- you need to know what locals they are -> solution: the family of return list commands
- ret li -> results stored in r(), general commands
- eret li -> results stored in e(), estimation commands
- sret li -> results stored in s(), programming commands
- Practical example
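A minimal sketch of how the return list family is used in practice, again on the auto dataset (the local names avg and r2a are assumptions for the example):

```stata
* Sketch: inspect stored results, then copy the ones you need into locals
sysuse auto, clear
quietly summarize price
return list                     // r() results of a general command
local avg = r(mean)

quietly regress price mpg weight
ereturn list                    // e() results of an estimation command
local r2a = e(r2_a)

display "mean price = `avg', adjusted R2 = `r2a'"
```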
6Cookbook for simple problems
- Run the procedure
- Check with the return list family which statistics you need
- Add locals that should be generated after the procedure
- Add these statistics to the outreg2/parmest commands
    forvalues no = 1(1)10 {
        xi: xtreg x y z i.year i.month if g`no' == 1, fe robust
        local Between = e(r2_b)
        local Within = e(r2_w)
        local No_min = e(g_min)
        local No_max = e(g_max)
        * No_avg is used below, so it has to be defined as well
        local No_avg = e(g_avg)
        outreg2 using file.xls, bdec(4) title(Title) ctitle(`no') append excel ///
            addstat(R2 between, `Between', R2 within, `Within', No min, `No_min', ///
            No max, `No_max', No average, `No_avg')
    }
7Our problem is different: application to PSM
- Need to report:
- output of the procedure
- sample properties after matching
- balancing properties of matching
- Problem 1: actually, none of these is in the typical output
- Problem 2: we need it for many estimations looped over many variables, and each one of them takes a looooong time
8Detailed problem description
- Analyse the effects of privatisation
- Observe what happens before and after the event of privatisation, but time runs
- E.g. firm A may be one year before privatisation in 1999 and firm B in 2006, so the event is an anchor and time runs both ways.
- Effects may be observed in many spheres
- E.g. profits, investments, international competitiveness, employment
- Effects may be due to self-selection
- E.g. only better firms are privatised, so the difference in performance is not due to the privatisation
- Effects may be largely due to self-selection
- Heckman correction will tell about the statistical significance but not about the economic relevance
- Propensity score matching is the best solution
9Detailed problem description
- Run logistic regression
- Dependent variable: Y = 1 if participate, Y = 0 otherwise.
- Choose appropriate conditioning (instrumental) variables.
- Obtain the propensity score: predicted probability (p) or log[p/(1 - p)].
- Match each participant to one or more nonparticipants on the propensity score
- Choose an adequate metric
- Compare outcome variables
- Example: test equality of means in the treated and control groups
- In PSM obtaining pscore is irrelevant, but matching is key
- To verify if matching is ok, need to run some diagnostics
- Example: compare the balancing properties after matching (so-called bias reduction thanks to matching)
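The first steps of this recipe can be sketched with built-in commands only. The example below uses the auto dataset with foreign as a stand-in treatment indicator, which is purely an assumption for illustration; it stops before the matching itself, which the talk does with the user-written psmatch2.

```stata
* Sketch: estimate a propensity score and its log-odds transformation
sysuse auto, clear
logit foreign weight mpg                     // Y = 1 if "treated", 0 otherwise
predict double ps, pr                        // propensity score p
generate double logodds = ln(ps / (1 - ps))  // log[p/(1 - p)]

* Compare the score across the two groups before any matching
ttest ps, by(foreign) unequal
```

Matching on logodds rather than ps spreads out scores that are bunched near 0 or 1, which is one reason the transformation is listed as an option.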
10Detailed problem description
- Thus, in our case:
- Many time periods (for each time-to-anchor a separate estimation)
- Many variables (for each variable separate outcomes, but within one period the same balancing properties)
- Two ways of estimating: regular and bootstrapping (especially the latter made things complex)
- Each estimation roughly 1.5-3.5 hours
- Over a hundred estimations
- Additional pitfalls:
- We needed some statistics for all estimations and they were not in the return list
- More precisely: the procedure computes them to be able to produce output, but they were not added to the return list by the authors
11Summary of the problems
- Our problem was quite specific BUT consisted of many general problems
- Loops take a lot of time: need to find efficient ways
- Some things cannot be obtained fast -> even more reasons to run it automatically
- Obtaining datasets of the variables we need (so-called resultssets)
- Getting visible data if they are not an output
- Using invisible data
- Getting around with bootstrap
12The structure of our estimations
13Using pscore or psmatch?
14Using pscore or psmatch?
Event loop
- Typical psmatch syntax:
    psmatch2 treat treatment_determinants, out(outcomes) options
- Alternative:
- Estimate pscore first:
    pscore treatment treatment_determinants, pscore(name)
- Run:
    psmatch2 treatment pscore, out(outcomes) options
- How to choose?
- If you want to bootstrap, pscore estimated once will save you time
- If you want to introduce a data-fitted caliper into the options, pscore first is a must
15How can the global function be useful?
16Using the global function for estimations
Event loop
- Our application: observe the same firms back and forth from the moment of the privatisation (event)
- Events happen in different years
- But we can only match on one dimension: has or has not had the event
- Conceptual solution: use lags and forwards to get the time dimension
- Technical problem: many outcome variables and de facto many loops
- Technical solution: define matching variables and output variables separately
    global in "cut remoteness eksporter energia obrot klratio roa ros indebtedness wsk_plynnosci net_income_efficiency klratio_new roa_new indebtedness_new wsk_plynnosci_new"
    global out "te_new redukcja wzrost_zatr share_export lewar s_eff"
    global outf1 "ff1_te_new ff2_te_new ff3_te_new ff4_te_new ff5_te_new ff1_redukcja ff2_redukcja ff3_redukcja ff4_redukcja ff5_redukcja ff1_wzrost_zatr ff2_wzrost_zatr ff3_wzrost_zatr ff4_wzrost_zatr ff5_wzrost_zatr"
    global outf2 "ff1_share_export ff2_share_export ff3_share_export ff4_share_export ff5_share_export ff1_lewar ff2_lewar ff3_lewar ff4_lewar ff5_lewar ff1_s_eff ff2_s_eff ff3_s_eff ff4_s_eff ff5_s_eff"
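To show how such globals (global macros, in Stata's own terminology) are consumed inside a loop, here is a small self-contained sketch on the auto dataset. The macro names covars and outcomes and their contents are assumptions for the example.

```stata
* Sketch: keep variable lists in global macros and reuse them in a loop
sysuse auto, clear
global covars "weight length"
global outcomes "price mpg"

foreach y of global outcomes {
    quietly regress `y' $covars          // $covars expands to the stored list
    display "`y': R2 = " %5.3f e(r2)
}
```

Changing a variable list then means editing one line at the top of the do-file instead of every command that uses it.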
17The beginning of the estimations so far
Event loop
    forvalues d = 6(1)18 {
        use data, clear
        capture log close
        capture drop our_pscore* caliper* mean* diff* ttest* se_after* se_before* treated nontreated
        log using priv_caliper_`d', text replace
        * estimate the propensity score once per event period
        pscore d`d' $in, pscore(our_pscore_`d')
        ttest our_pscore_`d', by(d`d') unequal
        capture drop sd_nontreated sd_treated
        gen sd_nontreated = `r(sd_1)'
        gen sd_treated = `r(sd_2)'
        * data-fitted caliper: pooled standard deviation of the pscore
        gen caliper_`d' = ((sd_treated^2 + sd_nontreated^2)/2)^0.5
        sum caliper_`d'
        local c_real = `r(mean)'
        hist our_pscore_`d', by(d`d')
        * graph export (not graph save) is what writes a .png file
        graph export our_pscore_d`d'.png, replace
        psmatch2 d`d' our_pscore_`d', out($out $outf1 $outf2) common add mahalanobis(nace) caliper(`c_real')
        * the event loop stays open: the code on the following slides runs inside it
18Getting from results to resultssets
19Why (and what) do we need (in) the resultssets?
- Why?
- Most importantly, without resultssets we cannot:
- analyse the changes over time
- decompose the observed differentials
- If we do not do it automatically, it would have to be copied manually from logs: many estimations, many variables, etc.
- What? Step 1: find out the reality
- Size of each of the three groups: treated, total and control (matched)
- Averages in all three groups (medians, etc.)
- Knowledge of whether they are in fact different (test of statistical significance based on the difference and the standard error of this difference)
- What? Step 2: find out how good the findings are statistically
- Balancing properties!
20Our solution to step 1
Variables loop
    * summarize and ttest overwrite r(), so keep a copy of the psmatch2 results
    capture _return drop psm
    _return hold psm
    foreach out in $out $outf1 $outf2 {
        _return restore psm, hold
        local se_after = r(seatt_`out')
        gen se_after_`out' = `se_after'
        local diff_after = r(att_`out')
        gen diff_after_`out' = `diff_after'
        sum `out' if d`d'==0 & _support==1
        local mean_nontreated = r(mean)
        gen mean_nontreated_`out' = `mean_nontreated'
        sum `out' if d`d'==1 & _support==1
        local mean_treated = r(mean)
        gen mean_treated_`out' = `mean_treated'
        ttest `out' if _support==1, by(d`d') unequal
        local se_before = r(se)
        gen se_before_`out' = `se_before'
        local mean_before = r(mu_2)-r(mu_1)
        gen diff_before_`out' = `mean_before'
        gen ttest_before_`out' = diff_before_`out'/se_before_`out'
        gen ttest_after_`out' = diff_after_`out'/se_after_`out'
21Our solution to step 1 - continued
Variables loop
        foreach type in before after {
            label var se_`type'_`out' "Standard error of difference `type' matching"
            label var diff_`type'_`out' "Difference `type' matching"
            label var ttest_`type'_`out' "T-test of difference"
        }
        label var mean_treated_`out' "Mean of treated companies"
        label var mean_nontreated_`out' "Mean of non-treated companies (before matching)"
    }
    count if d`d'==1 & _support==1
    local treated = r(N)
    gen treated = `treated'
    label var treated "No of treated companies"
    count if d`d'==0 & _support==1
    local nontreated = r(N)
    gen nontreated = `nontreated'
    label var nontreated "No of control companies"
22Our solution to step 2
Variables loop
    pstest $in
    * loop over the matching variables (loop macro renamed from "in" for readability)
    foreach var in $in {
        capture local bias_reduction = r(bired_`var')
        capture local pvalue_bef = r(pbef_`var')
        capture local pvalue_after = r(paft_`var')
        capture gen b_red_`var' = `bias_reduction'
        capture gen pval_bef_`var' = `pvalue_bef'
        capture gen pval_aft_`var' = `pvalue_after'
    }
    outsheet b_red* pval* using stats_priv_`d', replace
    psgraph
    graph save priv_support_`d', replace
    graph export priv_support`d'.png, replace
    drop b_red* pval*
23Missing statistics
24Solving problem of missing statistics
- Look into the ado file you are using (procedure)
- Throughout the file, there are commands such as:
    return scalar x = `somelocal'
- Sometimes, for clarity, scalars are dropped at the end of the procedure
- Your preferred statistic (if it is in the output, it has to be at least a local) would simply have to have a local like that too
- If it does not, you can always generate it based on your preferences and the available locals
- -> Modify the original ado file
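The mechanism can be tried out on a toy program before touching a real ado file. The sketch below defines a hypothetical r-class program, mymeans, which is not part of psmatch2 or any other package; it only shows how a statistic computed inside a program reaches the return list.

```stata
* Sketch: a toy r-class program that exposes its intermediate statistics
capture program drop mymeans
program define mymeans, rclass
    syntax varname, by(varname)
    tempname m0 m1
    quietly summarize `varlist' if `by' == 0, meanonly
    scalar `m0' = r(mean)
    quietly summarize `varlist' if `by' == 1, meanonly
    scalar `m1' = r(mean)
    * without these lines the scalars would be computed but never visible
    return scalar mean0 = `m0'
    return scalar mean1 = `m1'
    return scalar diff  = `m1' - `m0'
end

sysuse auto, clear
mymeans price, by(foreign)
return list
```

Adding a statistic to an existing ado file follows the same pattern: find the temporary scalar or local that already holds the number and add one return scalar line for it.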
25Solving problem of missing statistics example 1
- Original ado file, line 380:
    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        sum `v' if _treated==1, mean
        scalar `u1u' = r(mean)
        sum `v' if _treated==0, mean
        scalar `u0u' = r(mean)
        sum `v' if _treated==1 & _support==1, mean
        scalar `m1t' = r(mean)
        local n1 = r(N)
        sum _`v' if _treated==1 & _support==1, mean
        scalar `m0t' = r(mean)
        scalar `att' = `m1t' - `m0t'
        scalar `dif0' = `u1u' - `u0u'
        return scalar att = `att'
        return scalar att_`v' = `att'
- Modified ado file, line 380:
    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        /* all the same as earlier, plus: */
        return scalar diff = `dif0'
        return scalar diff_`v' = `dif0'
        return scalar mean0 = `u0u'
        return scalar mean0_`v' = `u0u'
        return scalar mean1 = `u1u'
        return scalar mean1_`v' = `u1u'
26Solving problem of missing statistics example 2
- Original ado file, line 440:
    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]
- Modified ado file, line 440:
    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]
    return scalar seols = `seols'
    return scalar seols_`v' = `seols'
27Problems with bootstrap
28Problems with bootstrap
- Why did we need bootstrap?
- After estimation the s.e.s were relatively large (heterogeneous sample)
- When we tried bootstrapping, the reduction in the size of the s.e.s was roughly 50% while the estimators were essentially unaffected
- What problems with bootstrap?
- Need to run it separately for each variable (it bootstraps only one standard error at a time)
- Output is given in a totally different form
- It takes a looong time
- New piece of code for just the BS standard errors -> new variable loops within each time loop
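The "totally different form" of the output is easiest to see on a small example. The sketch below bootstraps a difference in means on the auto dataset with the built-in ttest standing in for psmatch2; the statistic name diff, the seed and the number of replications are assumptions for illustration.

```stata
* Sketch: bootstrap one statistic, then pull the estimate and its s.e. from e()
sysuse auto, clear
bootstrap diff = (r(mu_2) - r(mu_1)), reps(50) seed(12345) nodots: ///
    ttest price, by(foreign)

* bootstrap is an estimation command: results are in e(), not in r()
matrix res = e(b), e(se)
matrix list res
```

Because the results arrive as matrices in e(), they have to be converted (for example with svmat) before they can be saved as a resultsset.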
29Problems with bootstrap
    foreach out in $out $outf1 $outf2 {
        use data, clear
        sum caliper_`d'   /* this is where the initial pscore comes in useful */
        local c_real = r(mean)

        bootstrap r(att): psmatch2 d`d' our_pscore_`d', out(`out') common add mahalanobis(nace) caliper(`c_real')

        matrix mat = e(b), e(se)   /* without this, no resultssets */
        mat li mat
        svmat mat
        rename mat1 a`d'_diff_after_bs_`out'
        rename mat2 a`d'_se_after_bs_`out'
        gen time_of_event = `d'
        keep se* diff* ttest* mean* time_of_event a*
        drop if _n > 1
        save priv_bs_`out'`d', replace
    }
30Final steps
- Merge files obtained from bootstrap on event (to have a complete resultsset within each event period)
- Merge bootstrap resultssets with
- Append the files for event periods
- Organise the data
- Produce tables and graphs (again in loops)
- Write paper
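The append step can be sketched as below. The collapse call stands in for a hypothetical per-period resultsset, and the three periods and variable names are assumptions made for the example.

```stata
* Sketch: build one resultsset per period and stack them into a single file
sysuse auto, clear
tempfile all
forvalues d = 1/3 {
    preserve
    collapse (mean) mean_price = price (sd) sd_price = price
    generate time_of_event = `d'
    if `d' > 1 {
        append using `all'       // add the periods collected so far
    }
    save `all', replace
    restore
}
use `all', clear
sort time_of_event
list
```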
31The resulting graphs (1)
- There are 6x3 figures altogether
32The resulting graphs (2)
- There are 6x2 figures altogether
33The resulting graphs (3)
- There are 6x3 figures altogether
34Some advice we did not take at the right time
- Save your computer's time (your wasted time is your problem)
- Use sample 10 for testing your procedures: saves a lot of time
- Leaving a mess is not useful if you ever want to come back
- Your memory lasts shorter than that of saved files: describing do-files really helps
- Loops are better than copy-paste and less messy too
- Stata is not that complicated: modifying ado-files is really easy if you know what you want
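The sample 10 tip amounts to a few lines at the top of a do-file while it is being developed. A minimal sketch on the auto dataset (the seed value is an arbitrary assumption):

```stata
* Sketch: test procedures on a 10% random subsample first
sysuse auto, clear
set seed 2010        // makes the subsample reproducible
sample 10            // keep a random 10% of the observations
count
```

Once the loops run cleanly on the subsample, comment out the sample line and rerun on the full data.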
35Thank you for your attention!