Title: P1252122010NtFbp
1Essentials of STATISTICS
3rd EDITION
MARIO F. TRIOLA
Statistics for Information Science (Statistiek voor Informatiekunde)
Lecturer: Frits de Vries; assistant: Joris Stegeman
2Programme for today
- 1st hour
- Welcome and introductions
- Organisation and set-up of the course
- 2nd hour
- Why statistics?
- Preview of the material in chapters 1, 2 and 3
31. Welcome and introductions
- Lecturer and assistant
- Mix of 2nd-year students and bridging-programme students
- Attendance list
- Housekeeping announcement: decree (oekaze) from the Executive Board (CvB)
42. Organisation and set-up (1)
- Tutorial groups - attendance list
- Course website
- Introduction
- Literature
- Assessment and deadlines
- Links
- Practice exam
- Timetable
5Course website
6Book
- Literature
- Mario Triola
- Essentials of Statistics, 3rd edition
- Addison-Wesley Higher Education, 2007
72. Organisation and set-up (2)
- Course website (continued)
- Rules!
- Schedule of the exercises
- Exam material
- Assignments for week 2: chapters 1, 2 and 3
- Book: copy of the material for chapters 1, 2 and 3
8Organisation
- No lectures
- question hour based on submitted questions
- a great deal of practice material
- Compulsory tutorials
- Doing the exercises is essential and therefore
compulsory.
- Always hand in your worked solutions to the assigned
exercises, in duplicate, before the tutorial.
- Tutorial group supervision
- group 1: Wednesday 1
- group 2: Wednesday 2
- group 3: Friday
- Computer lab sessions?
93. Why statistics?
- Reading and writing articles in the field of information science (IK)
- Example: article from MIS Quarterly
- Reading and writing in everyday life
- Example: table from a neighbourhood action committee
- Basic prerequisite: logical thinking and reasoning
- Example: the Monty Hall problem
- Example: doping use
10Table (1): MIS Quarterly article
11Table (2): MIS Quarterly article
12Table: neighbourhood committee
13Intuition is difficult
- Quiz with a grand prize
- You may choose from 3 doors
- You choose a door
- What is your chance of winning the grand prize?
14But
- Suppose the quizmaster, AFTER YOUR CHOICE, opens one of
the two remaining doors and shows that there is
nothing behind it.
- You may now still switch doors.
- Do you?
- Yes!! Because switching increases your chance!!!
15Analysis
- Suppose the grand prize is behind door 1
- You chose door 1 (car). The quizmaster opens another
door with nothing behind it. Switching loses.
- You chose door 2 (empty). The quizmaster opens door 3,
which has nothing behind it. Switching wins the grand prize!
- You chose door 3 (empty). The quizmaster opens door 2,
which has nothing behind it. Switching wins the grand prize!
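The three cases above can be checked mechanically. The sketch below is illustrative only: the function names, the seed and the number of trials are assumptions, not part of the course material. It first enumerates the three equally likely first choices, as on the slide, and then repeats the game many times at random.

```python
import random
from fractions import Fraction

def switch_wins(choice, prize):
    """Switching wins exactly when the first choice was an empty door."""
    return choice != prize

# Exact enumeration, as on the slide: the prize is behind door 1.
prize_door = 1
outcomes = [switch_wins(choice, prize_door) for choice in (1, 2, 3)]
p_switch = Fraction(sum(outcomes), len(outcomes))   # chance of winning when switching
p_stay = 1 - p_switch                               # chance of winning when staying

def simulate_switching(trials, seed=0):
    """Play the game `trials` times, always switching; return the fraction won."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randint(1, 3)
        choice = rng.randint(1, 3)
        # The quizmaster opens an empty door that is not the chosen one.
        opened = rng.choice([d for d in (1, 2, 3) if d != choice and d != prize])
        # The player switches to the one door that is neither chosen nor opened.
        switched = next(d for d in (1, 2, 3) if d != choice and d != opened)
        wins += switched == prize
    return wins / trials

print("switch:", p_switch, "stay:", p_stay)
print("simulated switch win rate:", simulate_switching(20_000))
```

Switching wins in two of the three cases, so the chance rises from 1/3 to 2/3; the simulated win rate should land close to that value.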
16Break
17Triola, chapter 1
- Important definitions for use in
statistics
18Section 1.1 Important definitions
- Data
- Statistics
- Population
- Census
- Sample
19Definition of statistics
- a collection of methods for
- planning studies and experiments,
- obtaining data,
- and then organizing, summarizing, presenting, analyzing,
interpreting,
- and drawing conclusions based on the data
20Chapter Key Concepts
- Sample data must be collected in an
appropriate way, such as through a process
of random selection.
- If sample data are not collected in an
appropriate way, the data may be so completely
useless that no amount of statistical torturing
can salvage them.
21Section 1.2 Types of data
- Definitions
- Population parameter versus sample statistic
- Quantitative versus qualitative data
- Discrete versus continuous data
- Levels of measurement: nominal, ordinal, interval, ratio
22Levels of Measurement
- Nominal - categories only
- Ordinal - categories with some order
- Interval - differences but no natural starting
point
- Ratio - differences and a natural starting point
23Section 1.3 Critical thinking
- Misuse, inexpert use and incorrect use
of statistics
24Misuse 1- Bad Samples
- Voluntary response sample (or self-selected sample)
- one in which the
respondents themselves decide whether to be
included. In this case, valid conclusions can be
made only about the specific group of people who
agree to participate.
25Misuse 2- Small Samples
Conclusions should not be based on samples that
are far too small. Example: basing a school
suspension rate on a sample of only three students.
26Misuse 3- Graphs
To correctly interpret a graph, you must analyze
the numerical information given in the graph, so
as not to be misled by the graph's shape.
27Misuse 4- Pictographs
Part (b) is designed to exaggerate the difference
by increasing each dimension in proportion to the
actual amounts of oil consumption.
28Misuse 5- Percentages
Misleading or unclear percentages are sometimes
used. For example, if you take 100% of a
quantity, you take it all. 110% of an effort
does not make sense.
29Other Misuses of Statistics
- Loaded Questions
- Order of Questions
- Refusals
- Correlation and Causality
- Self Interest Study
- Precise Numbers
- Partial Pictures
- Deliberate Distortions
30Section 1.4 Design of experiments
- Types of studies
- Observational
- Experimental
- Retrospective
- Prospective (longitudinal, cohort)
31Definition
- Confounding
- occurs in an experiment when the experimenter
is not able to distinguish between the effects
of different factors
32Controlling Effects of Variables
- Blinding
- subject does not know he or she is receiving a
treatment or placebo
- Blocks
- groups of subjects with similar characteristics
- Completely Randomized Experimental Design
- subjects are put into blocks through a process
of random selection
- Rigorously Controlled Design
- subjects are very carefully chosen
33Samples
34Definitions
- Random Sample
- members of the population are selected in such
a way that each individual member has an equal
chance of being selected
- Simple Random Sample (of size n)
- subjects selected in such a way that
every possible sample of the same size n has the
same chance of being chosen
35Methods of Sampling
- Random
- Systematic
- Convenience
- Stratified
- Cluster
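The slide only names the sampling methods. The sketch below illustrates one common reading of each: systematic means every k-th member after a random start, stratified means a random draw from every subgroup, and cluster means keeping all members of randomly chosen groups. The population, the strata, the clusters and all function names are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n has the same chance of being chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def convenience_sample(population, n):
    """Take whatever is easiest to reach; here simply the first n members."""
    return population[:n]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a random sample from every stratum (subgroup)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)                   # fixed seed, only for repeatability
population = list(range(1, 101))          # hypothetical member IDs 1..100
strata = {"second-year": population[:60], "bridging": population[60:]}
clusters = {f"group {i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
convenience = convenience_sample(population, 10)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)
```

The convenience sample is deterministic and clearly not representative, which is why it appears in the list mainly as a warning.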
36Saunders, chapter 6
37Triola, chapter 2
- Statistics for summarizing and displaying
data
38Section 2.1 Overview: Important Characteristics of
Data (CVDOT)
- 1. Center: A representative or average value
that indicates where the middle of the data set
is located.
- 2. Variation: A measure of the
amount that the values vary among themselves.
- 3. Distribution: The nature or shape of the
distribution of data (such as bell-shaped,
uniform, or skewed).
- 4. Outliers: Sample values
that lie very far away from the vast majority of
other sample values.
- 5. Time: Changing
characteristics of the data over time.
39Section 2.2 Frequency distributions
- Plain (straight) tally of values in a table
- Grouping values into categories (classes)
40Frequency Distribution: Ages of Best Actresses
[Tables: Original Data; Frequency Distribution]
41Related definitions
- Lower class limits
- Upper class limits
- Class boundaries
- Class midpoints
- Class width
- Relative frequencies
- Cumulative frequencies
- (cumulative percentages)
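The terms in this list fit together as follows. The sketch below builds a frequency table for whole-number data. The ages are made up for illustration and are not the data from the book. The class boundaries are taken halfway between neighbouring class limits, which is the usual convention for integer data and is an assumption here.

```python
def frequency_table(values, first_lower, width, n_classes):
    """Build one row per class for whole-number data."""
    rows = []
    cumulative = 0
    for i in range(n_classes):
        lower = first_lower + i * width       # lower class limit
        upper = lower + width - 1             # upper class limit
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "midpoint": (lower + upper) / 2,
            "frequency": freq,
            "relative": freq / len(values),   # relative frequency
            "cumulative": cumulative,         # cumulative frequency
        })
    return rows

# Hypothetical ages, invented for this example.
ages = [21, 24, 25, 29, 30, 33, 34, 35, 38, 41, 42, 47, 52, 58, 60, 74]
rows = frequency_table(ages, first_lower=21, width=10, n_classes=6)

for row in rows:
    lo, hi = row["limits"]
    print(f"{lo}-{hi}  f={row['frequency']}  "
          f"rel={row['relative']:.1%}  cum={row['cumulative']}")
```

The class width is the difference between two consecutive lower class limits (10 here), not the difference between the upper and lower limit of one class.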
42Frequency Tables
43Section 2.3 Histograms
- Graphical display of distributions
44Histogram
A bar graph in which the horizontal scale
represents the classes of data values and the
vertical scale represents the frequencies
45Relative Frequency Histogram
Has the same shape and horizontal scale as a
histogram, but the vertical scale is marked with
relative frequencies instead of actual frequencies
46Critical Thinking: Interpreting Histograms
One key characteristic of a normal distribution
is that it has a bell shape. The histogram
below illustrates this.
47Section 2.4 Statistical graphics
- Other forms of visual display
- Polygon
- Ogive
- Dot plot
- Stemplot
- Pareto chart
- Pie chart
- Scatter plot
- Time series
48Ogive
A line graph that depicts cumulative frequencies
[Figure 2-6, page 58]
49Dot Plot
Consists of a graph in which each data value is
plotted as a point (or dot) along a scale of
values
50Other Graphs
51Triola, chapter 3
- Statistics for describing, exploring and
comparing data
52Section 3.1 Overview
- Descriptive Statistics
- summarize or describe the important
characteristics of a known set of data
- Inferential Statistics
- use sample data to make inferences (or
generalizations) about a population
53Section 3.2 Measures of center
- Mean
- Of a sample and of a population (mu)
- Median (x-tilde)
- Mode
- Midrange
- Weighted mean
54Notation
- x̄ is pronounced x-bar and denotes the mean of a
set of sample values
- µ is pronounced mu and denotes the mean of all
values in a population
55Round-off Rule for Measures of Center
- Carry one more decimal place than is present in
the original set of values.
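The measures of center listed in this section, together with the round-off rule, can be tried on a small data set. The sketch below uses Python's standard `statistics` module. The scores, the grades and the weights are hypothetical, and `midrange` and `weighted_mean` are helper names introduced here for illustration.

```python
import statistics

def midrange(values):
    """Value midway between the maximum and the minimum."""
    return (max(values) + min(values)) / 2

def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [2, 3, 3, 5, 7, 11]      # hypothetical whole-number data

# Round-off rule: the data have no decimals, so results carry one decimal.
mean = round(statistics.mean(scores), 1)
median = round(statistics.median(scores), 1)
mode = statistics.mode(scores)    # most frequent value; no rounding needed
mid = round(midrange(scores), 1)

# Hypothetical grades with course weights 3, 2 and 1.
wmean = round(weighted_mean([7, 8, 6], [3, 2, 1]), 1)

print(mean, median, mode, mid, wmean)
```

The mean (5.2) lies above the median (4.0) here because the single large value 11 pulls the mean upward, while the median only depends on the middle of the ranked data.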
56Mean from a Frequency Distribution
- use class midpoint of classes for variable x
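When only a frequency table is available, the mean is approximated by treating every value in a class as if it sat at the class midpoint: x̄ ≈ Σ(f · x) / Σf, with x the class midpoint and f the class frequency. A minimal sketch follows, assuming hypothetical classes 21-30, 31-40, ..., 71-80 with made-up frequencies.

```python
def mean_from_frequency_table(midpoints, frequencies):
    """Approximate the mean as sum(f * x) / sum(f), with x the class midpoint."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(midpoints, frequencies)) / total

# Hypothetical classes 21-30, 31-40, ..., 71-80 and their frequencies.
midpoints = [25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
frequencies = [5, 4, 3, 3, 0, 1]

grouped_mean = mean_from_frequency_table(midpoints, frequencies)
print(grouped_mean)
```

The result is an approximation: it equals the true mean only if the values in each class happen to average out to the class midpoint.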
57Best Measure of Center
58Skewness
59Section 3.3 Measures of variation
- Range
- Standard deviation
- sample and population (sigma)
- Variance
- Coefficient of variation (CV)
60Key Concept
Because this section introduces the concept of
variation, which is something so important in
statistics, this is one of the most important
sections in the entire book.
Place a high priority on how to interpret values
of standard deviation.
61Definition
The standard deviation of a set of sample values
is a measure of variation of values about the
mean.
62Sample Standard Deviation Formula
63Rationale for using n-1 versus n
The end of Section 3-3 has a detailed explanation
of why n - 1 rather than n is used. The student
should study it carefully.
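The sample formula itself did not survive in this transcript. The standard definition, assumed here, is $s = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$. The sketch below computes it by hand next to the version that divides by n, and checks the result against Python's `statistics.stdev`. The three data values are chosen only to give round numbers.

```python
import math
import statistics

def sample_sd(values):
    """Standard deviation with divisor n - 1 (sample formula)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def population_sd(values):
    """Standard deviation with divisor N (population formula)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

data = [1, 3, 14]                 # illustrative values; the mean is 6

s = sample_sd(data)               # squared deviations 25 + 9 + 64 = 98; 98 / 2 = 49
sigma = population_sd(data)       # 98 / 3, so a smaller result

print(round(s, 1), round(sigma, 1))
print(statistics.stdev(data))     # library version, also uses n - 1
```

Dividing by n - 1 always gives a slightly larger value than dividing by n; the difference shrinks as the sample grows.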
64Standard Deviation - Important Properties
- The standard deviation is a measure of
variation of all values from the mean.
- The value of the standard deviation s is
usually positive.
- The value of the standard deviation s can
increase dramatically with the inclusion of one
or more outliers (data values far away from all
others).
- The units of the standard deviation s are the
same as the units of the original data values.
65Population Standard Deviation
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
This formula is similar to the previous formula,
but instead, the population mean and population
size are used.
66Variance - Notation
The variance is the standard deviation squared.
Notation
- $s^2$: sample variance
- $\sigma^2$: population variance
67Estimation of Standard Deviation Range Rule of
Thumb
For estimating a value of the standard deviation
s, use $s \approx \text{range}/4$, where
range = (maximum value) - (minimum value).
68Estimation of Standard Deviation Range Rule of
Thumb
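For interpreting a known value of the standard
deviation s, find rough estimates of the minimum
and maximum usual sample values by using
- minimum usual value = (mean) - 2 × (standard deviation)
- maximum usual value = (mean) + 2 × (standard deviation)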
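Both uses of the range rule of thumb are simple arithmetic. The sketch below assumes the usual statement of the rule: s is roughly range / 4, and usual values lie within 2 standard deviations of the mean. The function names and the example numbers are hypothetical.

```python
def estimate_sd_from_range(values):
    """Rough estimate of s: (maximum - minimum) / 4."""
    return (max(values) - min(values)) / 4

def usual_value_limits(mean, sd):
    """Rough minimum and maximum 'usual' values: mean -/+ 2 standard deviations."""
    return (mean - 2 * sd, mean + 2 * sd)

# Use 1: estimate s from the data alone (illustrative values).
rough_s = estimate_sd_from_range([1, 3, 14])

# Use 2: interpret a known s (hypothetical mean 100, standard deviation 15).
low, high = usual_value_limits(100, 15)

print(rough_s, low, high)
```

The estimate is deliberately crude: for the three illustrative values it gives 3.25, while the exact sample standard deviation is 7.0, so the rule is best reserved for quick plausibility checks.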
69The Empirical Rule
70Definition
The coefficient of variation (or CV) for a set of
sample or population data, expressed as a
percent, describes the standard deviation
relative to the mean.
Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
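Because the coefficient of variation is a percentage without units, it allows variation to be compared between data sets measured on different scales. A minimal sketch, assuming the sample version CV = s / x̄ × 100% and two made-up data sets:

```python
import statistics

def coefficient_of_variation(values):
    """Sample CV in percent: standard deviation relative to the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

heights_cm = [160, 170, 180]      # hypothetical heights
weights_kg = [60, 70, 80]         # hypothetical weights

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"heights: {cv_height:.1f}%  weights: {cv_weight:.1f}%")
```

Both samples have the same standard deviation (10), but relative to their means the weights vary more than the heights.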
71Section 3.4 Measures of relative standing
- z scores
- Quartiles
- Percentiles
72Key Concept
This section introduces measures that can be used
to compare values from different data sets, or to
compare values within the same data set. The
most important of these is the concept of the z
score.
73Definition
- z Score (or standardized value)
- the number of standard deviations that a given
value x is above or below the mean
74Measures of Position z score
Round z to 2 decimal places
75Interpreting Z Scores
Whenever a value is less than the mean, its
corresponding z score is negative Ordinary
values z score between 2 and 2 Unusual
Values z score lt -2 or z score gt 2
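Following the definition, a z score is computed as z = (x - mean) / (standard deviation) and rounded to 2 decimal places. The sketch below applies that and the rule for unusual values; the mean of 100 and standard deviation of 15 are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return round((x - mean) / sd, 2)

def is_unusual(z):
    """A value is unusual when its z score is below -2 or above 2."""
    return z < -2 or z > 2

# Hypothetical scale with mean 100 and standard deviation 15.
z_high = z_score(130, 100, 15)
z_low = z_score(60, 100, 15)
z_mid = z_score(92, 100, 15)

for z in (z_high, z_low, z_mid):
    print(z, "unusual" if is_unusual(z) else "ordinary")
```

A z score of exactly 2 or -2 still counts as ordinary under this rule.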
76Quartiles
Q1, Q2, Q3 divide ranked scores into four
equal parts
77Percentiles
Just as there are three quartiles separating data
into four parts, there are 99 percentiles denoted
P1, P2, . . . P99, which partition the data into
100 groups.
78Section 3.5 Exploratory data analysis (EDA)
- Outliers
- Boxplot
79Important Principles
- An outlier can have a dramatic effect on the
mean.
- An outlier can have a dramatic effect on the
standard deviation.
- An outlier can have a dramatic effect on the
scale of the histogram so that the true nature
of the distribution is totally obscured.
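The first two principles are easy to see numerically. The sketch below compares the mean and the sample standard deviation of a small made-up data set with and without one extreme value; the median is included for contrast.

```python
import statistics

values = [1, 2, 3, 4, 5]              # hypothetical data without an outlier
with_outlier = values + [100]         # the same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
sd_before = statistics.stdev(values)
sd_after = statistics.stdev(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"s:      {sd_before:.1f} -> {sd_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

One extreme value moves the mean from 3 to about 19 and inflates the standard deviation more than twentyfold, while the median barely changes.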
80Definitions
- For a set of data, the 5-number summary consists
of the minimum value; the first quartile, Q1; the
median (or second quartile, Q2); the third
quartile, Q3; and the maximum value.
- A boxplot (or box-and-whisker diagram) is a
graph of a data set that consists of a line
extending from the minimum value to the maximum
value, and a box with lines drawn at the first
quartile, Q1; the median; and the third quartile,
Q3.
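The 5-number summary is everything a boxplot needs. The sketch below computes it with `statistics.quantiles` from Python's standard library. Textbooks and software differ in how they locate quartiles, so the library's default method is an assumption here and may give slightly different values than the procedure in the book. The data are made up.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)   # three cut points, four parts
    return (min(values), q1, q2, q3, max(values))

# Hypothetical data; 40 lies far away from the other values.
data = [2, 4, 5, 7, 8, 10, 12, 13, 15, 18, 40]

summary = five_number_summary(data)
percentiles = statistics.quantiles(data, n=100)      # 99 cut points: P1 ... P99

print(summary)
print(len(percentiles))
```

In a boxplot of these data the box would run from Q1 to Q3 with a line at the median, and the line to the maximum would be stretched far to the right by the value 40.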
81Boxplots
82Boxplots - cont
83End of preview of chapters 1, 2 and 3