Title: Welcome to Statistics 111
1. Welcome to Statistics 111
The goal of this course is to develop basic tools
for data analysis, probability and statistical
methods. Key topics covered in the course include
exploratory data analysis, regression,
probability, estimation, and hypothesis testing.
2. Syllabus notes: website
- All handouts will be available on the website
- http://stat.wharton.upenn.edu/braunsf/stat111.html
- Website also contains my contact information
- Link on website for getting Wharton class account
if you are not a Wharton student
- Helpful if you want to use Wharton computer labs
3. Syllabus notes: Homeworks
- Homeworks will be handed out at the beginning of
every week
- 5 homeworks in all
- Homeworks will be submitted at the beginning of
class on Mondays
- You are encouraged to work together on homework,
but homeworks are to be completed separately and
handed in individually.
- Do not copy from another person.
- No late homeworks will be accepted!!
- Late homeworks will get a score of zero, without
exception
- Your lowest homework grade is not included in
final grade
4. Syllabus Notes: Midterm Exam
- Midterm is held on the following date:
- Monday, June 15th (in class)
- No makeup midterm examination!
- A missing midterm exam counts as a zero score
- Consider taking this class in the fall or spring
if you cannot attend the midterm!
5. Student Questionnaire
- Fill out a questionnaire and hand it in before
the break
- I will try to incorporate some of the subjects
that interest you into future lectures
6. Course Overview
[Diagram: course flowchart. Stages in order: Collecting Data, Exploring Data, Probability Intro., Inference. Branches: Comparing Variables, Relationships between Variables. Topics: Means, Proportions, Regression, Contingency Tables]
7. Out in public: "You do statistics?!?"
- I hated that class in college!
- That was the most boring class ever!
- Lame.
8. Big Picture Ideas
- Statistics is all about uncertainty
- Focus as much on what we don't know (or haven't
observed) as on what we know
- Formulating the question that we want to answer
is often the most difficult part
- Statistics is part mathematics, part
roll-up-your-sleeves-and-get-thinking.
9. Science and Skepticism
- We always need to be cautious about conclusions
based on data
- Possible sources of bias and confounding?
- How might things have gone wrong?
- A little bit of skepticism is a good thing!
10. Statistical Modeling
- Inference: using mathematical models of
uncertainty to answer questions
- Connect probability concepts to our data
- Cannot make claims without using models and
making assumptions
- Are the assumptions reasonable?
11. After the break
- Collecting Data: Design of Experiments
- Sections 3.1-3.2 in Moore, McCabe and Craig
- First couple of classes will not involve much
math at all, but we will get into lots of data
analysis after that!
12. Break!
- Hand in questionnaire
- 5 minutes
13. Outline for Second Half of Lecture
- Introduction to Experiments
- Sources of Bias in Experiments
- Techniques for Avoiding Bias
- Matching
- Randomization
- Block Designs
- Blinding and Double-Blinding
- Experiments vs. Observational Studies
- Association vs. Causation
14. Experiments
- Used to address a specific question
- Often used to examine causal effects
- E.g., medical trials, education interventions
[Diagram: experimental units (1, 2, 3, 4) are drawn from the population and split into a treatment group (treatment, then result) and a control group (no treatment, then result)]
- Can we just look at difference in results to get
the causal effect of the treatment?
- Depends on whether the experiment was done well
- Many possible sources of bias in design of
experiments
15. Sources of Bias
- An experiment or study is biased if it
systematically favors a particular outcome
- Subjects are not representative of the population
- Treatment and control groups are inherently
different on some lurking or confounding variable
- Subjects are influenced by knowing they are in
treatment or control groups
- Evaluator of outcomes is influenced by knowing
which subjects are in treatment or control groups
[Diagram: population, experimental units (1, 2, 3, 4), treatment group (treatment, then result), control group (no treatment, then result)]
16. Bias 1: Non-representative units
- If your subjects are not representative of the
population, you won't be able to generalize the
results even if the experiment is well done
- Here are two examples
- Treatment group: High Level NICUs
- Control group: Low Level NICUs
- Problem: classification of NICU is different from
state to state, so a hospital that might qualify
as a high level NICU in one state might not in
another
- Observed differences between the groups cannot
be generalized from one state to another
17. Bias 2: Confounding/Lurking Variables
- Treatment group and control group are different
on some variable that also influences the outcome
- A confounding variable means that we can't
attribute difference in outcomes to just the
treatment
- Part of the difference may be due to the
confounding variable, not the treatment
- Simple example: a breast cancer drug trial where
only women receive the treatment and only men
receive the control
- Gender becomes a confounding variable
- Are treatment vs. control outcomes different due
to the treatment or gender differences between
groups?
18. Bias 3: Subject knows treatment assignment
- A subject's outcome is influenced by knowing that
he/she is in a treatment or control group
- E.g., drug trials: patients improve just because
they think they are receiving the drug
- Solution: blinded experiment with placebo
- Placebo appears to be the treatment, so all
subjects (treatment and control) don't know their
true treatment assignment
- Controls' outcomes may improve slightly: this is
often called the placebo effect
19. Bias 4: Evaluator knows treatment assignment
- Person evaluating outcome (e.g., doctor in drug
trial) may also be influenced by knowing who
receives treatment
- Not a problem if outcome is something
indisputable, such as death!
- This is a problem for more subjective measures
like pain reduction or results from social
programs
- Solution: double-blinded experiment where neither
subjects nor evaluators know treatment
assignments
20. Association vs. Causation
- In the presence of a confounding variable, we can
only conclude there is an association between
treatment and outcome, not causation
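A small simulation can make this point concrete. The sketch below is illustrative only: the binary confounder `z`, the 0.8/0.2 treatment probabilities, and the effect size 2.0 are all made-up assumptions, and the treatment is given no true effect at all. When the confounder influences who gets treated, the groups still differ in outcome; when treatment is assigned by coin flip, the difference is close to zero.

```python
import random
from statistics import mean


def simulate(n=20000, randomized=False, seed=0):
    """Return mean(outcome | treated) - mean(outcome | control).

    The treatment has NO true effect in this toy model: the outcome
    depends only on a hypothetical confounder z plus random noise.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.random() < 0.5  # hypothetical binary confounder
        if randomized:
            t = rng.random() < 0.5  # coin flip, independent of z
        else:
            # assumed: z makes receiving the treatment more likely
            t = rng.random() < (0.8 if z else 0.2)
        y = 2.0 * z + rng.gauss(0, 1)  # outcome depends on z only
        (treated if t else control).append(y)
    return mean(treated) - mean(control)


if __name__ == "__main__":
    print("confounded difference:", round(simulate(randomized=False), 2))
    print("randomized difference:", round(simulate(randomized=True), 2))
```

In the confounded case the expected difference under these assumptions is 2.0 * (0.8 - 0.2) = 1.2, even though the treatment does nothing.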
21. Examples: Reporters are stupid
- Children who watch many hours of TV get lower
grades in school on average than those who watch
less TV
- Does this mean that TV causes poor grades?
- What are potential confounding variables?
- People who use artificial sweeteners in place of
sugar tend to be heavier than people who use
sugar
- Does this mean that sweeteners cause weight gain?
- What is probably happening here?
22. One solution: Matching
- Make sure that treatment and control groups are
very similar on observed variables like race,
gender, age etc.
- Block designs: divide subjects into blocks with
similar observed variables before dividing them
into treatment vs. control
- Special case: Matched Pairs
- Subjects are matched up into pairs, then one
member of each pair gets treatment and the
other gets control
- Example: Dandruff experiment
- Treatment applied to one side and control
to other side of head
- No reason to expect difference
in sides except for treatment
23. Another Solution: Randomization
- Problem with matching is that you cannot usually
match on unobserved characteristics (e.g.,
genetics)
- E.g., cholesterol drug trial: can't match
treatment and control groups on genetic
predisposition for high cholesterol
- Randomly assign subjects to treatment or control
- Random assignment should lead to groups that are
similar or balanced on both observed and
unobserved confounding variables
- Example: student questionnaire earlier in class.
Each form you filled out was randomly assigned
either a 1 or 2
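Random assignment is easy to sketch in code. The example below is one possible approach, not necessarily how the questionnaire forms were numbered: it shuffles the subjects and splits the shuffled list in half. The function name `randomize` and the subject labels are hypothetical.

```python
import random


def randomize(subjects, seed=None):
    """Randomly split subjects into (treatment, control) groups.

    Shuffles a copy of the subjects and cuts it in half, so every
    subject has the same chance of landing in either group.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # 20 hypothetical subject IDs
    treatment, control = randomize(range(20), seed=1)
    print("treatment:", sorted(treatment))
    print("control:  ", sorted(control))
```

Passing a `seed` makes the assignment reproducible, which helps when someone else needs to check the design.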
24. Randomization of In-Class Survey
- Check to see if groups are balanced
- There are differences, but are they
significant?
- Later on in the course, we will be able to answer
questions like this
- Of course, we can't check the balance for
unobserved variables; we just have to trust the
randomization process
- This is why good science needs to be replicable
25. Even Better: Randomization and Matching
- Randomization generally leads to treatment and
control groups that are evenly balanced, but you
can still get unlucky and get unbalanced groups
- Example: randomly placing 20 people (10 males, 10
females) into treatment and control groups.
- How many males will end up in treatment group?
- Ideally, we would have 5 males in treatment
group, and 5 males in control group (balanced)
- However, there is a chance of getting 9 males in
treatment and 1 male in control group (unbalanced)
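How big is that chance? Assuming the 20 people are split into two groups of 10 each (the slide does not state the group sizes, so this is an assumption), the number of males in the treatment group follows a hypergeometric distribution. The sketch below computes it with `math.comb`; the function name is hypothetical.

```python
from math import comb


def prob_males_in_treatment(k, males=10, females=10, group_size=10):
    """P(exactly k males in a randomly chosen treatment group).

    Counts the ways to pick k males and (group_size - k) females,
    divided by all ways to pick group_size people.
    """
    ways = comb(males, k) * comb(females, group_size - k)
    return ways / comb(males + females, group_size)


if __name__ == "__main__":
    for k in (5, 9):
        print(k, "males in treatment:", prob_males_in_treatment(k))
```

Under these assumptions a perfectly balanced 5/5 split happens only about 34% of the time, while the 9/1 split has probability 100/184756, roughly 0.0005.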
26. Even Better: Randomization and Matching
- Randomized Blocks: randomize within blocks of
observed variables
- Example:
- Divide up subjects into males and females first,
then randomly assign treatment or control to
subjects in each group separately
- Guarantees that equal number of males end up in
treatment group and control group (same with
females)
- Randomized Matched Pairs: randomly decide which
member of each pair gets treatment vs. control
- Example:
- For each head in dandruff experiment, randomly
assign which side of head to get dandruff shampoo
vs. control
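The randomized block idea can be sketched as follows: group subjects by the observed variable, then shuffle and split within each group. This is a minimal illustration, and the names `block_randomize` and `block_of` and the made-up subject list are assumptions, not part of the course materials.

```python
import random
from collections import defaultdict


def block_randomize(subjects, block_of, seed=None):
    """Randomize to treatment/control separately within each block.

    block_of maps a subject to its block label (e.g. gender). Within
    each block, half the subjects go to treatment and the rest to
    control, so the blocks are balanced across the two groups.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    treatment, control = [], []
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        treatment.extend(members[:half])
        control.extend(members[half:])
    return treatment, control


if __name__ == "__main__":
    # 10 hypothetical males and 10 hypothetical females
    subjects = [("M", i) for i in range(10)] + [("F", i) for i in range(10)]
    treatment, control = block_randomize(subjects, lambda s: s[0], seed=0)
    print("males in treatment:", sum(1 for s in treatment if s[0] == "M"))
    print("males in control:  ", sum(1 for s in control if s[0] == "M"))
```

With an even number of subjects in every block, each group always receives exactly half of each block, so the unlucky 9 vs. 1 split cannot occur.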
27. Experiments vs. Observational Studies
- Often, we want the causal effect of some
treatment, but our data are from an observational
study
- Observational studies examine effects of some
variable but without the advantages of a
controlled experiment
- No treatment is applied in observational studies
- Example: health effects of smoking
- Unethical to randomly impose a treatment
- Could there be some confounding variable that
explains health differences between smokers and
non-smokers?
- Very risky to make causal statements from
observational data, since we cannot avoid bias!
28. Health Effects of Chocolate
- Report to European Society of Sexual Medicine
- 153 Italian women filled out sexual function
questionnaires
- Intriguing correlation: sexual function/desire
significantly greater among chocolate-eaters
- Observational study: association does not imply
causation!
- Confounding: average age is 35 among frequent
chocolate-eaters, compared with 40.4 in
non-chocolate group
29. Next Class - Lecture 2
- Collecting Data
- Surveys and Sampling
- Graphical summaries of a single variable
- Moore, McCabe and Craig Sections 3.3 and 1.1