Title: Data analysis
1Data analysis
2- The first step in any data analysis strategy is
to calculate summary measures to get a general
feel for the data. Summary measures for a data
set are often referred to as descriptive
statistics. Descriptive statistics fall into
three main categories:
- measures of position (or central tendency)
- measures of variability
- measures of skewness
3- The purpose of descriptive statistics is to
describe the data.
- The type of data will determine which descriptive
statistic is appropriate.
- Specifically, one can only calculate a mean with
interval or ratio data, whereas a mode can be
calculated with nominal, ordinal, interval or
ratio data.
4Measures of Position
- Measures of position (or central tendency)
describe where the data are concentrated.
- Mean: The mean is simply the arithmetic average
of the data.
- The mean provides you with a quick way of
describing your data, and is probably the most
used measure of central tendency.
- However, the mean is greatly influenced by
outliers. For example, consider the following
set: 1, 1, 2, 4, 5, 5, 6, 6, 7, 150.
- While the mean for this data set is 18.7, it is
obvious that nine out of ten of the observations
lie below the mean because of the large final
observation.
- Consequently, the mean is not always the best
measure of central tendency.
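As a quick check of the example above, the following sketch recomputes the mean of the ten observations and counts how many fall below it. The comparison with the outlier dropped is an added illustration, not part of the original slide.

```python
from statistics import mean

# Data set from the text; 150 is the outlying observation.
data = [1, 1, 2, 4, 5, 5, 6, 6, 7, 150]

data_mean = mean(data)
below_mean = sum(1 for x in data if x < data_mean)

# Illustration only (assumption): the same data with the outlier removed.
mean_without_outlier = mean(data[:-1])

print(f"mean = {data_mean}, observations below the mean = {below_mean}")
print(f"mean without the outlier = {mean_without_outlier:.2f}")
```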
5- Median: The median is the middle observation in a
data set. That is, 50% of the observations are
above the median and 50% are below the median
(for sets with an even number of observations, the
median is the average of the middle two
observations).
- The median is often used when a data set is not
symmetrical, or when there are outlying
observations.
- For example, median income is generally reported
rather than mean income because of the outlying
observations.
6- To get the median, first put your numbers in
ascending or descending order. Then check
which of the following two rules applies:
- Rule One. If you have an odd number of numbers,
the median is the center number (e.g., three is
the median for the numbers 1, 1, 3, 4, 9).
- Rule Two. If you have an even number of numbers,
the median is the average of the two innermost
numbers (e.g., 2.5 is the median for the numbers
1, 2, 3, 7).
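The two rules can be sketched as a small function. The name `median_by_rules` is made up for this illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median_by_rules(values):
    """Return the median using Rule One (odd count) or Rule Two (even count)."""
    ordered = sorted(values)          # step 1: put the numbers in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # Rule One: odd count, take the center number
        return ordered[middle]
    # Rule Two: even count, average the two innermost numbers
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = median_by_rules([1, 1, 3, 4, 9])
even_example = median_by_rules([1, 2, 3, 7])
print(odd_example, even_example)
```

Applied to the outlier data set used for the mean (1, 1, 2, 4, 5, 5, 6, 6, 7, 150), the same function returns 5.0, far below the mean of 18.7.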
7- Mode: The mode is the value around which the
greatest number of observations are concentrated,
or quite simply the most common observation.
- The mode is often used with nominal data, but is not
the preferred measure for other types of data.
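A minimal sketch of finding the mode of nominal data follows. The list of favorite colors is hypothetical and only serves to show that a mode exists even when a mean or median would be meaningless.

```python
from collections import Counter

# Hypothetical nominal data: favorite colors reported by six respondents.
colors = ["red", "blue", "blue", "green", "blue", "red"]

counts = Counter(colors)
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value} (observed {mode_count} times)")
```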
8- The mean, median, and mode are affected
differently by skewness (i.e., lack of symmetry)
in the data.
9- When a variable is normally distributed, the
mean, median, and mode are the same number.
10- When the variable is skewed to the left (i.e.,
negatively skewed), the mean is pulled to the
left the most, the median is pulled to the left
the second most, and the mode is the least affected.
- Therefore, mean < median < mode.
11- When the variable is skewed to the right (i.e.,
positively skewed), the mean is pulled to the
right the most, the median is pulled to the right
the second most, and the mode is the least affected.
- Therefore, mean > median > mode.
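The ordering of the three measures can be illustrated with a small sketch. Both samples are hypothetical and were chosen to show the typical pattern; the left-skewed sample is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]
# Mirror image of the same sample, which is skewed to the left.
left_skewed = [21 - x for x in right_skewed]

for label, sample in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    print(f"{label}: mean = {mean(sample)}, median = {median(sample)}, mode = {mode(sample)}")
```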
12Measures of Variability
- While measures of position describe where the
data points are concentrated, measures of
variability measure the dispersion (or spread) of
the data set.
- Range
- The range is the difference between the largest
and the smallest observations in the data set.
However, this is a limited measure because it
depends on only two of the numbers in the data
set. Using the above data set again, the range is
149, but that does not provide any information
regarding the concentration of the data at the
low end of the scale. Another limitation of range
is that it is affected by the number of
observations in the data set.
- Generally, the more observations there are, the
more spread out they will be. One use of range in
everyday life is in newspaper stock market
summaries, which give the day's high and low
numbers.
13- Measures of Variability
- Measures of variability tell you how "spread out"
or how much variability is present in a set of
numbers.
- For example, which set of the following numbers
appears to be the most spread out?
- Set A. 93, 96, 98, 99, 99, 99, 100
- Set B. 10, 29, 52, 69, 87, 92, 100
- Right! The numbers in set B are more "spread
out."
- One crude indicator of variability is the range
(i.e., the difference between the highest and
lowest numbers).
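The range of the two sets above can be computed directly. The helper name `value_range` is chosen for this sketch only.

```python
def value_range(values):
    """Return the difference between the highest and lowest numbers."""
    return max(values) - min(values)


set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]

range_a = value_range(set_a)
range_b = value_range(set_b)
print(f"range of Set A = {range_a}, range of Set B = {range_b}")
```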
14- Two commonly used indicators of variability are
the variance and the standard deviation.
- Variance: Unlike the range, the variance takes into
consideration all the data points in the data
set. If all the observations are the same, the
variance would be zero. The more spread out the
observations are, the larger the variance.
- The variance tells you the average squared
deviation from the mean, in "squared units."
15- Standard Deviation: The standard deviation is the
positive square root of the variance, and is the
most common measure of variability. The standard
deviation indicates how far the
numbers tend to vary from the mean. The larger
the standard deviation, the more variation there
is in the data set.
- (If the standard deviation is 7, then the
numbers tend to be about 7 units from the mean.
If the standard deviation is 1500, then the
numbers tend to be about 1500 units from the
mean.)
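The sketch below computes the variance and standard deviation for Set A and Set B from the earlier slide. It assumes the population formulas (dividing by N); the slides do not say whether the population or the sample version (dividing by N - 1, available as `statistics.variance` and `statistics.stdev`) is intended.

```python
from math import isclose
from statistics import pstdev, pvariance

set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]
constant = [3, 3, 3, 3, 3, 3]   # no variability at all

for label, sample in (("Set A", set_a), ("Set B", set_b), ("constant", constant)):
    var = pvariance(sample)      # average squared deviation from the mean
    sd = pstdev(sample)          # positive square root of the variance
    print(f"{label}: variance = {var:.2f}, standard deviation = {sd:.2f}")
```

Set B should show the larger values on both indicators, and the constant data should give zero for both.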
16- Virtually everyone in education is already
familiar with the normal curve.
- An easy rule applying to data that follow the
normal curve is the "68, 95, 99.7 percent rule."
That is . . .
- Approximately 68% of the cases will fall within
one standard deviation of the mean.
- Approximately 95% of the cases will fall within
two standard deviations of the mean.
- Approximately 99.7% of the cases will fall within
three standard deviations of the mean.
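The percentages in the rule can be checked against the standard normal curve. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of cases within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in coverage.items():
    print(f"within {k} SD of the mean: {proportion:.1%}")
```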
17- Higher values for both of these indicators stand
for a larger amount of variability. Zero stands
for no variability at all (e.g., for the data 3,
3, 3, 3, 3, 3, the variance and standard
deviation will equal zero).
18- Frequency Distributions
- One useful way to view information in a variable
is to construct a frequency distribution (i.e.,
an arrangement in which the frequencies, and
sometimes percentages, of the occurrence of each
unique data value are shown).
- When a variable has a wide range of values, you
may prefer using a grouped frequency distribution
(i.e., where the data values are grouped into
intervals, 0-9, 10-19, 20-29, etc., and the
frequencies of the intervals are shown).
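Both kinds of frequency distribution can be sketched with `collections.Counter`. The scores are hypothetical, and the interval width of 10 follows the 0-9, 10-19, 20-29 grouping mentioned above.

```python
from collections import Counter

# Hypothetical scores for ten respondents.
scores = [3, 7, 12, 15, 15, 21, 24, 24, 24, 28]

# Ungrouped: frequency and percentage of each unique value.
frequencies = Counter(scores)
for value in sorted(frequencies):
    count = frequencies[value]
    print(f"{value:>2}: {count} ({count / len(scores):.0%})")

# Grouped: map each score to the lower bound of its interval of width 10.
grouped = Counter((score // 10) * 10 for score in scores)
for start in sorted(grouped):
    print(f"{start}-{start + 9}: {grouped[start]}")
```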
19- Graphic Representations of Data
- Another excellent way to clearly describe your
data (especially for visually oriented learners)
is to construct graphical representations of the
data (i.e., pictorial representations of the data
in two-dimensional space).
- A bar graph uses vertical bars to represent the
data. The height of the bars usually represents
the frequencies for the categories shown on the X
axis (i.e., the horizontal axis). (By the way, the
Y axis is the vertical axis.)
20- A line graph uses one or more lines to depict
information about one or more variables.
- A simple line graph might be used to show a trend
over time (e.g., with the years on the X axis and
the population sizes on the Y axis).
- Line graphs are used for many different purposes
in research. For example, a line graph can show a
frequency distribution (with GPA on the X axis
and frequency on the Y axis).
21- A scatterplot is used to depict the relationship
between two quantitative variables.
- Typically, the independent or predictor variable
is represented by the X axis (i.e., on the
horizontal axis) and the dependent variable is
represented by the Y axis (i.e., on the vertical
axis).
22- The relationship is not always positive.
- Correlation coefficients range between -1 and 1.
- Interpretation of Pearson r
- Close to 1: highly positively correlated
- Close to -1: highly negatively correlated
- Close to zero: no linear correlation
23- Correlation does not necessarily indicate
causation.
- r = .82 tells us that a person with an average score
on one test will probably obtain an average
score on the other test.
24- How to Interpret the Values of Correlations.
- The correlation coefficient (r) represents the
linear relationship between two variables. If the
correlation coefficient is squared, then the
resulting value (r², the coefficient of
determination) will represent the proportion of
common variation in the two variables (i.e., the
"strength" or "magnitude" of the relationship).
- In order to evaluate the correlation between
variables, it is important to know this
"magnitude" or "strength" as well as the
significance of the correlation.
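A minimal sketch of computing Pearson's r and the coefficient of determination from raw scores follows. The data and the function name `pearson_r` are hypothetical; the formula is the usual sum of cross-products divided by the square root of the product of the sums of squares.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data: hours studied and quiz score for five students.
hours = [1, 2, 3, 4, 5]
quiz = [2, 4, 5, 4, 5]

r = pearson_r(hours, quiz)
r_squared = r ** 2   # proportion of common variation
print(f"r = {r:.3f}, r squared = {r_squared:.2f}")
```

For these numbers r is about .77, so roughly 60% of the variation is shared between the two variables.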
25- Outliers.
- Outliers are atypical (by definition), infrequent
observations.
- Outliers have a profound influence on the slope
of the regression line and consequently on the
value of the correlation coefficient.
- A single outlier is capable of considerably
changing the slope of the regression line and,
consequently, the value of the correlation, as
demonstrated in the following example.
26(No Transcript)
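The slide's own example is not available in the transcript, but the effect can be sketched with hypothetical numbers. Here one added point, (15, 0), reverses the sign of both the regression slope and the correlation; the data and the function name are assumptions made for illustration.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return the least-squares slope and Pearson r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sum_xx, sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data with a clear positive relationship.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope_before, r_before = slope_and_r(x, y)

# The same data plus a single outlying observation.
slope_after, r_after = slope_and_r(x + [15], y + [0])

print(f"without outlier: slope = {slope_before:.2f}, r = {r_before:.2f}")
print(f"with outlier:    slope = {slope_after:.2f}, r = {r_after:.2f}")
```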
27- Analyses for Comparison
- Nominal data: Chi-square
- Interval data: t-test
- Interval data: One-way ANOVA
- Interval data: Factorial ANOVA
- Analyses for Association
- Interval data: Pearson product-moment correlation (r)
- Nominal data: Phi coefficient
- Ordinal data: Spearman rank-order correlation
28Parametric Methods | Nonparametric Methods
t-test for independent samples | Mann-Whitney U test
ANOVA/MANOVA (multiple groups) | Kruskal-Wallis analysis of ranks and the Median test
t-test for dependent samples (two variables measured in the same sample) | Sign test and Wilcoxon's matched pairs test
29- t-test for independent samples
- Purpose, Assumptions.
- The t-test is the most commonly used method to
evaluate the differences in means between two
groups.
- For example, the t-test can be used to test for a
difference in test scores between a group of
patients who were given a drug and a control
group who received a placebo.
- Theoretically, the t-test can be used even if the
sample sizes are very small (e.g., as small as
10; some researchers claim that even smaller n's
are possible), as long as the variables are
normally distributed within each group and the
variation of scores in the two groups is not
reliably different.
30- The normality assumption can be evaluated by
looking at the distribution of the data (via
histograms) or by performing a normality test.
- The equality of variances assumption can be
verified with the F test, or you can use the more
robust Levene's test.
- If these conditions are not met, then you can
evaluate the differences in means between two
groups using one of the nonparametric
alternatives to the t-test.
31- Independent sample t test
Group statistics (IV: stress condition; DV: Talk)
Talk          Mean    N    Std. Deviation   Std. Error Mean
Low stress    45.20   15   24.97            6.45
High stress   22.07   15   27.14            7.01
The Std. Error Mean is the standard deviation of the sample means: Sx = SD/√15.
Independent samples test
Levene's test for equality of variance: F = .023, Sig. = .881
Talk                          t       df       Sig. (2-tailed)   Mean diff   Std. error diff
Equal variance assumed        2.430   28       .022              ...         ...
Equal variance not assumed    2.430   27.808   .022              ...         ...
Levene's test for equality of variance is tested at α = .05.
Here you want the variances to be equal, so you want a small F.
The larger the F value, the more dissimilar the
variances are.
In this case, the variances are similar.
32- An independent-samples t test was conducted to evaluate the
hypothesis that students talk differently (amount
of talking) under different stress conditions. The
test was significant, t(28) = 2.43, p = .022.
Students in the high-stress condition talked less
(M = 22.07, SD = 27.14) than students in the
low-stress condition (M = 45.20, SD = 24.97).
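The reported t value can be reproduced from the summary statistics in the write-up above. This is a sketch that assumes the equal-variances (pooled) form of the test, which matches the reported df of 28; it stops at the t statistic and does not compute the p value.

```python
from math import sqrt

# Summary statistics from the write-up (low-stress and high-stress groups).
n_low, mean_low, sd_low = 15, 45.20, 24.97
n_high, mean_high, sd_high = 15, 22.07, 27.14

# Standard error of each mean: SD / sqrt(N).
se_low = sd_low / sqrt(n_low)
se_high = sd_high / sqrt(n_high)

# Pooled variance and standard error of the difference (equal variances assumed).
df = n_low + n_high - 2
pooled_variance = ((n_low - 1) * sd_low**2 + (n_high - 1) * sd_high**2) / df
se_difference = sqrt(pooled_variance * (1 / n_low + 1 / n_high))

t_statistic = (mean_low - mean_high) / se_difference
print(f"SE low = {se_low:.2f}, SE high = {se_high:.2f}")
print(f"t({df}) = {t_statistic:.2f}")
```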
33- t-test for dependent samples (paired-sample
t-test)
- The two groups of observations (that are to be
compared) are based on the same sample of
subjects who were tested twice (e.g., before and
after a treatment).
            Mean   N    Std. Deviation   Std. Error Mean
PAY         5.67   30   1.49             .27
SECURITY    4.50   30   1.83             .33
The Std. Error Mean is the standard deviation of the sample means: Sx = SD/√30.
34Paired differences (Pay - Security)
Mean   Std. Dev.   Std. Err.   Lower   Upper   t       df   Sig. (2-tailed)
1.17   2.26        .41         .32     2.01    2.827   29   .008
- A paired-sample t test was conducted to evaluate
whether employees were more concerned with pay or
job security. The results indicated that the mean
concern for pay (M = 5.67, SD = 1.49) was
significantly greater than the mean concern for
security (M = 4.50, SD = 1.83), t(29) = 2.83, p =
.008.
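The paired t statistic is the mean of the differences divided by its standard error. The sketch below uses the rounded values from the output above, so it lands near the reported 2.827 rather than exactly on it.

```python
from math import sqrt

# Paired differences (Pay - Security), as reported in the output.
n = 30
mean_difference = 1.17
sd_difference = 2.26

# Standard error of the mean difference: SD / sqrt(N).
se_difference = sd_difference / sqrt(n)

t_statistic = mean_difference / se_difference
df = n - 1
print(f"SE = {se_difference:.2f}, t({df}) = {t_statistic:.2f}")
```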
35- It was suggested (Marija J. Norusis) that:
- When reporting your results, give the exact
observed significance level. It will help the
reader evaluate your findings.
- E.g., p = .008: there are 8 chances in 1000 that you would
observe a difference this large between the two samples
if there were no real difference.
- E.g., p = .08: there are 8 chances in 100, but you have set
that you will only accept a result if it is 5 chances in
100 or fewer.
36- Pearson Chi-square.
- The Pearson Chi-square is the most common test
for significance of the relationship between
categorical variables.
- This measure is based on the fact that we can
compute the expected frequencies in a two-way
table (i.e., frequencies that we would expect if
there was no relationship between the variables).
- For example, suppose we ask 20 males and 20
females to choose between two brands of jeans
(brands A and B).
- If there is no relationship between preference
and gender, then we would expect about an equal
number of choices of brand A and brand B for each
sex.
- The Chi-square test becomes increasingly
significant as the numbers deviate further from
this expected pattern; that is, the more this
pattern of choices for males and females differs.
37- The Goodness of Fit test is used to find out whether the
population under study follows a hypothesized
distribution.
- H0: the population distribution is uniform, that
is, each brand of cola drink is preferred by an
equal percentage of the population.
- Ha: the population distribution is not uniform,
that is, the brands of cola drinks are not all
preferred by an equal percentage of the population.
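38Brand   O     E     O-E   (O-E)²   (O-E)²/E
A         50    60    -10   100      1.67
B         65    60    5     25       0.42
C         45    60    -15   225      3.75
D         70    60    10    100      1.67
E         70    60    10    100      1.67
Total     300   300                  9.18
χ² (df = 4) = 9.18. For a goodness-of-fit test, df = k - 1, where k is
the number of categories (here 5 - 1 = 4); the formula df = (r-1)(c-1)
applies to a two-way table. Let's say the critical value
is 9.49. Because 9.18 is smaller than 9.49, H0 cannot be rejected, and we cannot
say that the cola brands are preferred by unequal
percentages of the population.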
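The goodness-of-fit statistic for the cola table can be recomputed in a few lines. The critical value 9.488 is the standard chi-square table entry for df = 4 at the .05 level (df = number of categories minus one); it is supplied here by hand rather than computed. Summing unrounded terms gives about 9.17, while summing the terms rounded to two decimals gives 9.18.

```python
# Observed counts for the five cola brands.
observed = {"A": 50, "B": 65, "C": 45, "D": 70, "E": 70}

total = sum(observed.values())
expected = total / len(observed)      # uniform H0: 300 / 5 = 60 per brand

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1                # goodness of fit: categories minus one

critical_value = 9.488                # chi-square table, df = 4, alpha = .05
reject_h0 = chi_square > critical_value

print(f"chi-square({df}) = {chi_square:.2f}, critical value = {critical_value}")
print("reject H0" if reject_h0 else "do not reject H0")
```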
39- Test of independence (we can test the
relationship between nominal variables)
- The data are obtained from a random sample
- We use count data (frequencies)
- We want to test whether perception of life is
independent of gender, that is, whether men and women find life
equally exciting
40Life excitement   Male   Female   Total
Excited             300    384      684
Not excited         296    481      777
Total               596    865      1461
Chi-square = 4.76, df = 1, p = .0290
What can you conclude?
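As a sketch of how the statistic is obtained, the code below derives the expected counts from the row and column totals and computes the chi-square both with and without the Yates continuity correction. That the reported value includes the correction is an assumption: the corrected result (about 4.77) is close to the 4.76 shown, while the uncorrected result is about 5.00. The critical value 3.841 is the standard table entry for df = 1 at the .05 level.

```python
# Observed counts: rows = excited / not excited, columns = male / female.
table = [[300, 384],
         [296, 481]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [(table[i][j], expected[i][j]) for i in range(2) for j in range(2)]
chi_square = sum((o - e) ** 2 / e for o, e in cells)
chi_square_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

df = (len(table) - 1) * (len(table[0]) - 1)
critical_value = 3.841                # chi-square table, df = 1, alpha = .05

print(f"chi-square({df}) = {chi_square:.2f} (uncorrected)")
print(f"chi-square({df}) = {chi_square_yates:.2f} (Yates corrected)")
print(f"exceeds critical value {critical_value}: {chi_square_yates > critical_value}")
```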