Title: Methodology II
Sampling/Data Collection
- Research is almost always directed at characterizing and understanding a segment of the world, a population, on the basis of observing a smaller segment, or sample.
- By definition, a population is the entire group of interest in a research study (e.g., all learning-disabled children in the U.S.).
- Populations are defined not by nature but by rules of membership invented by investigators (e.g., all children with mental handicaps currently residing in NW PA).
- In contrast, a sample is some subset of a population.
- It is a select group chosen to represent the population.
- A sample can be any size, as long as it contains fewer members than the total population being observed.
- The most accurate information about a population will come from a sample that is representative of the population from which it is selected.
- To get an accurate picture of the population as a whole, all of its characteristics must be represented in the sample in appropriate proportions.
- A sample is said to be biased when it is not representative of the entire population to which an investigator wants to generalize.
- A representative sample is achieved through sampling methods.
Random Sampling
- A sample is random when:
  - Every member of the population has an equal chance of being selected for the sample, and
  - The selection of any one member of the population does not influence the chances of selecting any other member.
- One very simple way to obtain a random sample is to put the names or code numbers of all members of the population into a hat, shake them up, and, without looking, draw out enough for your sample.
- This is usually the way winning lottery tickets are selected, and the procedure gives each ticket an equal chance of winning.
- Another procedure frequently used to obtain a random sample is a table of random numbers.
- The numbers in the table are generated by a computer so that every digit is as likely to appear as every other.
- If you want to select a sample of 100 cases from a population of 500, you would first assign every member of the population a number.
- Then you would enter the table at any point and read numbers until you had found 100 numbers between 1 and 500.
- Any number higher than 500 you would ignore.
- Random sampling does not eliminate all possibility of error, but it does guard against any systematic bias slipping into the selection of the sample.
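The table-of-random-numbers procedure above can be sketched in a few lines of Python; the function name and the seed value are illustrative, not part of the original text.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of size n without replacement.

    This mirrors the table-of-random-numbers procedure: every member
    is numbered, and distinct numbers are drawn until n are found.
    """
    rng = random.Random(seed)
    return rng.sample(list(population), n)

# Hypothetical example: sample 100 of 500 numbered cases.
population = range(1, 501)
sample = simple_random_sample(population, 100, seed=42)
print(len(sample))       # 100
print(len(set(sample)))  # 100 distinct members, no repeats
```

Because `random.sample` draws without replacement, no member can appear twice, matching the requirement that every member have an equal, independent chance of selection.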
Systematic Random Sampling
- A slight variation on simple random sampling is systematic random sampling.
- Subjects are selected from a population listing (e.g., a phone book) in a systematic way (e.g., every 10th name).
- This method is fast and easy, but it is accurate only if the listing of the population is not biased in any way (e.g., against people without phones or with unlisted numbers).
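The every-kth-name selection can be sketched as follows; the random starting point within the first interval is a common refinement, and all names here are hypothetical.

```python
import random

def systematic_sample(listing, n, seed=None):
    """Select every k-th entry from a population listing.

    k is the sampling interval (population size // sample size); a
    random start within the first interval avoids always beginning
    at the top of the list.
    """
    k = len(listing) // n
    start = random.Random(seed).randrange(k)
    return listing[start::k][:n]

# Hypothetical example: every 10th name from a 500-entry listing.
names = [f"name_{i}" for i in range(500)]
chosen = systematic_sample(names, 50, seed=1)
print(len(chosen))  # 50
```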
Stratified Sampling
- Stratified sampling involves a family of sophisticated sampling techniques.
- These methods use known characteristics of the population, and a sample is selected (proportionately or disproportionately) based upon these known characteristics.
- Say, for example, you had a population of 100 people: 70 male and 30 female.
- If you took a proportionate stratified sample of 10, you would have 7 males and 3 females in your sample.
- The variable of gender would be the stratifying variable upon which the sample is proportionately controlled.
- Proper representation of the population in the sample on this stratifying variable has now been ensured.
- Complex strata can be simultaneously controlled in any one sample.
- If the strata of income, age, and gender for a population are already known, a sample can be selected to ensure proportional representation on all three of these strata simultaneously.
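The gender example above can be sketched as a proportionate stratified sample; the function name, rounding rule, and data are illustrative assumptions.

```python
import random
from collections import defaultdict

def proportionate_stratified_sample(population, strata_of, n, seed=None):
    """Sample n members so each stratum keeps its population proportion.

    `strata_of` maps a member to its stratum (e.g., gender); per-stratum
    counts are rounded, so very small strata may be under-represented.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for member in population:
        groups[strata_of(member)].append(member)
    sample = []
    for stratum, members in groups.items():
        share = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, share))
    return sample

# Hypothetical example: 70 males and 30 females, sample of 10.
people = [("M", i) for i in range(70)] + [("F", i) for i in range(30)]
sample = proportionate_stratified_sample(people, lambda p: p[0], 10, seed=0)
print(sum(1 for s in sample if s[0] == "M"),
      sum(1 for s in sample if s[0] == "F"))  # 7 3
```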
- In rough terms, statistical tests require about 30 subjects per group for category analysis.
- When using a multistage stratified sampling method to represent a large population, the number of subjects in each category of each stratum must still be at or around a minimum of 30.
- So, for the stratifying variable of gender, a minimum of 60 total subjects would be required: 30 males and 30 females.
- The most common stratified sampling error involves the use of too many stratifying variables on inadequate sample sizes.
Sample Size Determination
- A common sampling question is: how many subjects need to be sampled to accurately reflect the population?
- There are numerous statistical and non-statistical approaches to this issue.
- Unfortunately, the statistical processes for estimating sample size are not used frequently in professional research.
- Sample sizes are usually derived from similar studies, advisor recommendations, or the researcher's own common sense.
- The nature of the study often dictates a specific sample size.
- For example, in aphasiology, size is usually determined by the availability of subjects; usually 15 subjects per group (e.g., left CVAs, right CVAs, and normals) are recommended.
- When collecting data on normals (e.g., African American adults), 50 members per subject grouping has been recommended, with a minimum of 10 members per stratified variable (e.g., SES).
- All initial sample size estimates should be adjusted upward to compensate for attrition (e.g., subject drop-out), subject refusal to participate, or other similar circumstances.
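One conventional way to make the upward adjustment for attrition is to inflate the target by the expected loss rate; the formula below is a standard convention, not something stated in the original text.

```python
import math

def adjust_for_attrition(target_n, expected_loss_rate):
    """Inflate a target sample size so the expected number of
    completers still meets the target after drop-out and refusals."""
    return math.ceil(target_n / (1 - expected_loss_rate))

# Hypothetical example: 15 subjects per group, expecting 20% attrition.
print(adjust_for_attrition(15, 0.20))  # 19
```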
Data Collection Methods
- Once the sample is determined, one needs to consider how to collect the data.
- The three most common data collection methods are group administration, mail administration, and in-person administration.
- The group-administered method of data collection is relatively accurate, low in cost, and traditionally accepted, particularly in education (classroom) research.
Group Administration
- Very commonly, data are collected by the researcher, classroom teacher, or other professional within a group setting.
- Regardless of who collects the data (e.g., researcher, classroom teacher, field worker), caution must be exercised to ensure that biases do not result.
- To protect the quality of the data collected, the instructions and timelines for the project must be implemented exactly as designed.
- Alterations in these areas can introduce serious invalidity into the research design.
Mail Administration
- Mail questionnaires have the advantages of subject privacy and convenience.
- They are also relatively low in cost to implement.
- Mail surveys have notoriously low response rates of approximately 25%, resulting in time-consuming follow-up of non-respondents.
- If only interested persons return the survey, serious biasing could result.
- Mailing lists used for sampling may be inherently biased, depending upon the source of the list.
- Concerning the instrument itself, instructions must be absolutely clear on mailed questionnaires, since the respondent completes the form privately.
- Question wording must be unambiguous, and response scales, if used, must facilitate ease of answering.
- The key to success in using mail questionnaires is meticulous planning of the instructions, questions, response types, form layout, and follow-up activities.
In-Person Administration
- Personal interviewing or individual test administration is often used in behavioral research.
- The researcher or other qualified field worker administers the test or interviews each subject to gain information regarding test performance or in-depth reactions and/or attitudes specific to a research topic.
- Examiner bias must be carefully controlled.
- Audio- or video-taping is frequently used, as scoring and response reliability issues must often be addressed.
Instrumentation and Testing
- Instruments are used to collect data.
- An instrument may be a published or original test or survey form, an unobtrusive measuring device, or another type of measuring tool.
- All instruments must be selected for use based on their validity, reliability, and practicality.
Validity
- All research instruments must first be considered in terms of their validity.
- Validity simply refers to the question: does the instrument measure what it is supposed to measure?
- Test or survey items must be germane to the subject area under investigation.
- Three formal methods for evaluating instrument validity are content validity, criterion-related validity, and construct validity.
Content Validity
- Content validity is the extent to which a measurement reflects the specific intended domain of content (Carmines & Zeller, 1991).
- It relies on logical thought and judgment as the method for deriving valid test or survey items.
- There is no quantitative evidence to objectively and scientifically demonstrate the instrument's validity, only the researcher's opinion.
Criterion-related Validity
- Criterion-related validity is used to demonstrate the accuracy of a measuring procedure by comparing it with another procedure which has been demonstrated to be valid.
- Also referred to as instrumental validity, it is a method by which correlation coefficients are established for the instrument.
- There are two common types of criterion-related validity: concurrent validity and predictive validity.
- Both of these validity types use statistical correlation to arrive at a value, or validity coefficient, for the instrument.
- The correlation may range from 0.0 to 1.00.
- The closer the correlation coefficient is to 1.00, the stronger the criterion-related validity.
- Concurrent validity is the extent to which a test yields the same results as other measures of the same phenomenon.
- A researcher may initially use content validity to develop test items.
- The validity of a researcher-made instrument can be assessed concurrently by a correlation with a standardized published instrument, the criterion.
- Subjects are tested twice, using both the researcher-made test and a standardized test instrument.
- If the researcher-made test is in fact a valid test, scores from subjects on this test should be closely related, i.e., statistically correlated, to test scores derived from the standardized published instrument.
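The concurrent validity coefficient described above is simply a Pearson correlation between the two sets of scores; a minimal sketch follows, with entirely hypothetical scores for five subjects.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two score lists, e.g. scores on a
    researcher-made test and on a standardized criterion test."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

# Hypothetical scores for five subjects on both instruments.
researcher_made = [12, 15, 9, 20, 17]
standardized = [50, 52, 44, 66, 60]
print(round(pearson_r(researcher_made, standardized), 2))  # a high coefficient, ~0.98
```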
- Predictive validity is the extent to which a measure accurately forecasts how a person will think, act, and feel in the future.
- Consider the situation in which the same researcher collects data over a series of months and notices that subjects who do well on the researcher-made test tend also to do well on another test.
- The closer the predictive correlation coefficient is to 1.00, the higher, and better, the predictive validity of the instrument.
- Reasonable validity coefficient ranges for criterion-related validity measures are as follows:
  - 1.00 to .90 (excellent)
  - .89 to .85 (good)
  - .84 to .80 (fair)
  - .79 or less (poor).
- The quality of all criterion-related validity measures relies heavily upon the quality of the criterion measure.
- In other words, a high concurrent validity between a researcher-made test and a published standardized test could indicate that both tests are equally poor as opposed to equally good!
Construct Validity
- The highest form of validity is construct validity.
- It seeks agreement between a theoretical concept and a specific measuring device, such as observation.
- Construct validity utilizes multivariate factor analysis to develop factors (constructs) within each test or survey instrument.
- For example, in an instrument measuring self-esteem, 150 original survey questions may be factor-analyzed into five specific factors of self-esteem: general, personal, social, academic, and professional.
- In other words, all the instrument items could be categorized under five constructs.
- Construct validity is a powerful and sophisticated approach to instrument validity.
- The factor analytic procedure generally gives the researcher new insights into the quality of the test items and the interrelations between questions.
- The approach is best used when numerous questions are involved (all having the same response scale) and many subjects are used in the research.
Reliability
- Reliability evaluates the consistency of the measurements.
- Reliability measurements are presented as correlation coefficients.
- The higher the correlation value, the more reliable the instrument.
- As with validity, correlations may range from 0.0 to 1.00.
- A correlation of 1.00 represents perfect reliability within an instrument.
- A correlation of .20 would reflect a quite unreliable instrument.
- Three techniques are used to assess reliability: test-retest, split-half, and equivalent forms.
- Test-retest reliability is established by administering a test or a survey twice to the exact same group of subjects with a short time lapse between testings.
- The correlation can be done either by item or, more commonly, by total test score.
- A correlation coefficient is calculated to measure the amount of relationship between subjects' first and second test answers or test scores.
- Theoretically, the subjects should receive identical scores both times if the test is consistent (i.e., reliable).
- Practice effects may create spurious results, so test-retest is only recommended for use when other reliability methods are not feasible.
- Split-half reliability is an improved variation on test-retest reliability.
- Test items are put in order of difficulty (if a cognitive test) or by subject matter (if an attitudinal test), and then the test items are split: version A with the odd questions, version B with the even questions.
- The theory is that if the total test is reliable, subjects should have highly correlated scores between the two versions, even and odd.
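The odd/even split can be sketched as follows. This is a minimal illustration with hypothetical 0/1 item scores; in practice the half-test correlation is usually also adjusted with the Spearman-Brown correction, which the text does not cover.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def split_half_scores(item_scores):
    """Split one subject's item scores into odd-item and even-item totals."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ... (version A)
    even = sum(item_scores[1::2])  # items 2, 4, 6, ... (version B)
    return odd, even

# Hypothetical item scores (1 = correct, 0 = incorrect) for four subjects.
subjects = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
halves = [split_half_scores(s) for s in subjects]
odd_totals = [h[0] for h in halves]
even_totals = [h[1] for h in halves]
print(round(pearson_r(odd_totals, even_totals), 2))  # ~0.82
```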
- Split-half reliability is a reasonable method of evaluating reliability, depending upon how equally the total test can be divided.
- It is not an appropriate method for timed tests.
- Another reliability method involves developing equivalent forms: two completely separate but equal tests are created.
- The subject group is tested twice, once with each form of the test (e.g., the PPVT, Form L and Form M).
- A correlation coefficient calculated from both test scores for all subjects will indicate the reliability of the tests.
- The success of this method depends greatly on the true equivalency of the two test versions.
- Writing test items to match so closely can be much more difficult than it sounds.
- Also, equivalent forms reliability requires two separate administrations of the instrument, which takes time and money.
Item Analysis
- Item analysis is a powerful evaluative tool that can be applied to either cognitive or attitudinal instruments for recognizing instrument weaknesses, for test scoring, and for calculating internal consistency reliability measures.
- It is done to determine each test item's ability to discriminate between high-scoring and low-scoring subjects.
- This analysis involves a correlation calculation between the total test score and the item score.
- Computing correlation coefficients for each test item allows the researcher to evaluate each item's effectiveness and consistency in relation to the total test.
- A high-scoring respondent should answer test questions consistently in a high-scoring direction.
- So if the correlation between scoring high on the total test and scoring high on one particular question is strong, the question must be a good one.
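The item-total correlation described above can be sketched as follows, with hypothetical 0/1 responses; a weak or negative coefficient flags an item that fails to discriminate.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def item_total_correlations(responses):
    """For each item, correlate that item's scores with total test scores.

    `responses` is a list of per-subject item-score lists.
    """
    totals = [sum(subject) for subject in responses]
    n_items = len(responses[0])
    return [pearson_r([subject[i] for subject in responses], totals)
            for i in range(n_items)]

# Hypothetical 0/1 responses for four subjects; item 3 runs against the total.
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
coeffs = item_total_correlations(responses)
print([round(c, 2) for c in coeffs])  # [1.0, 0.58, -0.58]
```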
Instrument Selection
- When collecting data, you need to consider whether to select a published instrument or develop your own.
- You might select a published instrument because of its professional acceptance and because validity, reliability, and perhaps even item analysis data have already been established and acknowledged in the test manual.
- The instrument has probably also been piloted and revised throughout the years.
- To review published instruments, go directly to a tests-in-print source such as the Buros Institute of Mental Measurements.
- Read the test review to see if the test is one that might be appropriate for your study.
- Also consider reviewing how a published instrument is perceived in professional journals.
- The time spent investigating the instrument must be measured against the time involved in developing an original test, which is generally the only remaining alternative.
- Many times numerous published instruments fit closely to the topic being studied.
- In such cases, each instrument must be evaluated individually across a consistent array of parameters defining the ideal instrument.
- Only consider published instruments that have validity and reliability measures available.
- If such data are unavailable, be suspicious.
- If a published instrument can be located for your topic of interest, it is worth the effort to consider it strongly, but not blindly.
- Although rare, a research project may be so unusual, creative, or innovative that few, if any, published instruments are appropriate for use.
- In this case, the researcher must devise an original instrument.
- The following steps may help you develop a scientific, original instrument:
- Review the most similar existing instruments.
- Available instruments may not even measure the specific topic of your study, but they may yield some new ideas regarding question form, response scales, test length, calibration, etc.
- In writing original items, start by first listing one to five major (autonomous) concepts that are to be investigated by the instrument.
- Weigh each major concept in importance; assign numerical values to these categories if possible.
- Decide on the total number of questions desired.
- If in doubt, use a conservative estimate, usually 20 to 50 items.
- Remember to consider the respondent's interest level, attention span, and fatigue when deciding on a questionnaire's or test's length.
- Estimate, using the weights assigned in step c), how many questions need to be developed for each major concept within the instrument. The more important concepts require more questions.
- Develop response scales for each item. Try to stay consistent in the types of response scales utilized.
- If using a Likert-type (e.g., five-point) response scale, use it consistently throughout the instrument or test section.
- Categorize all test items and read them to a peer.
- Make corrections to eliminate ambiguous wording; combine similar items which ask the same or similar questions.
- Do not overestimate or underestimate the respondent's reading aptitude.
- Overestimation of respondent reading skills can cause item misunderstanding, misinterpretation, and lowered validity and reliability.
- Underestimation of respondent reading skills can cause levity, insult, or even resentment toward the item and even the entire test.
- Refine the number of items down to the original estimate of 20-50.
- Consider reversing the wording of a few (10-20) final items to reduce the halo effect.
- At this point, your instrument should possess at least defensible content validity.
- Informally test the items for clarity with a very small group similar to the project respondent group.
- Formally pilot test the instrument.
- Typically, in cognitive studies, open-ended (write-in) answers, true/false questions, or multiple-choice formats are utilized.
- If questions are attitudinal, Likert-type scales are commonly used.
- Likert-type scales generally have five to seven response choices in degrees of progressive feeling (e.g., 1 = strongly agree, 2 = agree, 3 = neutral, 4 = disagree, etc.).
Hazards in Testing
- After item writing and response scale selection, there are some specialized hazards to consider with your research design.
- The "good subject" syndrome and the "self-fulfilling prophecy" are hazards encountered with attitudinal surveys.
- The "good subject" is the respondent who is genuinely attempting to help the researcher by answering an attitudinal question as it "should be" for the research's sake, and not as he/she really feels.
- Watch for responses which are aimed at satisfying the research or project goals instead of providing accurate and sincere evaluative data.
- The self-fulfilling prophecy occurs when the respondent answers questions the way he/she would like to see him/herself, instead of how he/she really sees him/herself.
- This can be hard or impossible to detect for certain, but be aware of its possibility.
- In studies involving psychological motivations or controversial subjects (e.g., sex, religion, politics), the self-fulfilling prophecy can emerge easily and weaken the data tremendously.
- The halo effect can be a common problem in studies which involve long checklists of evaluative questions.
- The respondent may get into the habit of evaluating all items as "agree" regardless of his/her attitude toward the question.
- Do not design an instrument in which the respondent will need to assess numerous attitudes over a very large number of questions using an identical response scale.
- The problem can usually be remedied by reversing the wording on various items at strategic locations in the instrument.
- Also, use subparts within the instrument or allow short rest periods to help break up the test administration.
- The "Hawthorne effect" involves how subjects react in a study if they know they are being watched.
- In an experiment years ago, workers demonstrated different skills and abilities simply by virtue of the fact that they were being studied.
- The experiment was conducted at the Hawthorne plant of the Western Electric Company, where the effect was first recognized.
- Particularly in behavioral research, respondents may alter their normal pattern of responses due merely to their knowledge that they are being studied, and not to the experimental treatment.
Statistical Analysis
- Statistical analysis provides an objective tool for researchers to use in measuring their findings and comparing them to their previous expectations.
- The first step in locating the right statistic for your research hypothesis and design is to consider the nature of the data you are eliciting/collecting.
Data
- Continuous data are comprised of ongoing, varying values.
- Among the many examples are number of years at a residence, age, distances, test scores, scaled scores, IQs, yearly income, height, and weight.
- Continuous data permit assessments of mean, range, standard deviation, and variance, as well as other statistical options.
- Categorical (discrete, discontinuous) data are data which fall into groupings or divisions.
- Common examples of categorical data include gender, political affiliation, blood type, favorite color, etc.
- Continuous data can later be transformed into categorical data, but categorical data can never be made continuous.
Measurement Scales
- Data fall into one of four measurement scales:
  - nominal
  - ordinal
  - interval, or
  - ratio.
- Remember the acronym NOIR for the correct order from the lowest and weakest measurement scale (nominal) to the highest and strongest measurement scale (ratio).
- Nominal and ordinal measurements are common in the social and behavioral sciences.
- Data measured by either nominal or ordinal scales must be analyzed by nonparametric methods.
- Data measured on the interval or ratio scales may be analyzed by parametric methods if the statistical model is valid for the data.
Nominal Data
- A nominal variable is simply a named category.
- For example, the psychiatric system of diagnostic groups constitutes a nominal scale.
- When a diagnostician identifies a person as "schizophrenic," "paranoid," or "neurotic," s/he is using a categorical label to represent the class of people to which the person belongs.
- Measurement at its weakest level exists when naming is used simply to classify an object, person, or characteristic.
- In a nominal scale, the scaling operations partition a given class into a set of mutually exclusive subclasses.
- Whenever a sample of data is collected in such a way that each observation is assigned to a category (e.g., number of "no" responses versus number of "yes" responses), frequency counts are involved.
- Nominal data have no intrinsic measure of quantity attached to them.
- We could calculate the percentage of "no"s in the sample and the percentage of "yes"es.
- We could report which category had the largest frequency, but we could not add the "no" and the "yes" categories to form a third category, since the responses would no longer fall into a unique subclass.
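The frequency counts and percentages that are legitimate for nominal data can be sketched as follows, with hypothetical yes/no responses.

```python
from collections import Counter

# Hypothetical nominal responses; the only meaningful summaries are
# frequency counts and percentages per category.
responses = ["yes", "no", "no", "yes", "no",
             "no", "yes", "no", "no", "no"]
counts = Counter(responses)
percentages = {cat: 100 * n / len(responses) for cat, n in counts.items()}
print(counts["no"], counts["yes"])            # 7 3
print(percentages["no"], percentages["yes"])  # 70.0 30.0
```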
Ordinal Data
- Ordinal data, as the name implies, set categories into some rank order, from highest to lowest.
- It may happen that the objects in one category of a scale are not only different from the objects in other categories but also stand in some kind of relation to them.
- Ordinal measurements communicate the relative standings of categories, but not the amount of the differences among them.
- For example, letter grades are usually assigned A, B, C, D, and F.
- These constitute an ordering of performance: A is better than B, which is better than C, which is better than D, which is better than F.
- Any numbers may be assigned to these letter grades (A = 4, B = 3, C = 2, D = 1, and F = 0), as long as they preserve the intended order, or as long as we assign a higher number to the member of the class which is greater or more preferred.
Interval Data
- When a scale has all the characteristics of an ordinal scale, and when the distances or differences between any two numbers on the scale have meaning, then measurement is considerably stronger.
- An interval scale is characterized by a common and constant unit of measurement which assigns a number to all pairs of objects in the ordered set.
- For an interval scale, the zero point and the unit of measurement are arbitrary.
- Temperature is measured on an interval scale.
- If you're Canadian, you use the Celsius scale, but if you're American you use the Fahrenheit scale.
- The unit of measurement and the zero point in measuring temperature are arbitrary; they are different for the two scales.
- For instance, freezing occurs at 0 degrees on the Celsius scale but at 32 degrees on the Fahrenheit scale, while boiling occurs at 100 degrees Celsius and at 212 degrees Fahrenheit.
- However, both scales contain the same amount and the same kind of information because they are linearly related.
- That is, a reading on one scale can be transformed to the equivalent reading on the other scale by means of a linear transformation.
- The operations and relations which give rise to the structure of an interval scale are such that numbers associated with the positions of the objects on the interval scale can be manipulated arithmetically.
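The linear transformation between the two temperature scales is the familiar conversion formula, sketched here to make the Celsius/Fahrenheit example concrete.

```python
def celsius_to_fahrenheit(c):
    """Linear transformation between the two interval scales:
    F = (9/5)C + 32."""
    return 9 * c / 5 + 32

print(celsius_to_fahrenheit(0))    # 32.0 (freezing)
print(celsius_to_fahrenheit(100))  # 212.0 (boiling)
```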
- Thus, the interval scale is the first truly quantitative scale we have encountered.
- Means, standard deviations, correlations, etc., are applicable to interval scale data.
Ratio Data
- Ratio data, the highest measurement scale, also set a true quantity value on numbers, but now with regard to a true zero point.
- Common examples of ratio data include age, weight, height, and most test scores.
- On a classroom test of 10 questions, a student getting 7 correct answers receives a score of 7.
- This score is of the ratio measurement type, since a true zero point of zero correct does in fact exist and is the fundamental base from which the score of 7 is derived.
Descriptive vs. Inferential Statistics
- Descriptive statistics summarize data.
- Data are described using standard methods to determine the average value, the range of data around the average, and other characteristics.
- Examples of descriptive statistics include the mean, mode, median, standard deviation, variance, and response percentages.
- Often, graphs and charts are presented alongside descriptive data to assist in explaining the statistics.
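The descriptive statistics listed above can be computed with Python's standard `statistics` module; the scores below are hypothetical.

```python
import statistics

# Hypothetical standardized-test scores for a small group of subjects.
scores = [72, 85, 85, 90, 68, 77, 85, 93]

print(statistics.mean(scores))    # mean: 81.875
print(statistics.median(scores))  # median: 85.0
print(statistics.mode(scores))    # mode: 85
print(statistics.stdev(scores))   # sample standard deviation
print(max(scores) - min(scores))  # range: 25
```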
- Suppose you have the scores on a standardized test for 500 subjects.
- Instead of presenting a list of the 500 scores in a research report, you might present an average score, which describes the performance of the typical subject.
- A set of data does not always consist of scores.
- For instance, you might have data on the political affiliations of the residents of a community.
- To summarize these data, you might count how many are Democrats, Republicans, Independents, etc., and then calculate the percentage of each.
- A percentage is a descriptive statistic that indicates how many units per 100 have a certain characteristic.
- Thus, if 42% of a group of people are Democrats, 42 out of each 100 people in the group are Democrats.
- The summaries provided by descriptive statistics are usually much more concise than the original data set (e.g., an average is much more concise than a list of 500 scores).
- In addition, descriptive statistics help us interpret sets of data (e.g., an average helps us understand what is typical of a group).
- The objective of descriptive statistics is simply to communicate the results without attempting to generalize beyond the sample of individuals to any other group.
- Descriptive statistics assume that data have an underlying normal distribution.
- However, the properties of nominal and ordinal data do not correspond with the arithmetic system.
- Moreover, descriptive tools such as averages and percentages for population data should be called parameters, not statistics.
- A standard normal curve has the mean, median, and mode equal to each other.
- The range measures the entire width of the distribution.
- The kurtosis measures the flatness or peakedness of the curve.
- The skewness addresses the amount of imbalance between the right and left halves of the distribution.
- Inferential statistics are tools that tell us how much confidence we can have when we generalize from a sample to a population.
- The goal of inferential statistics is to determine the likelihood that observed differences could have occurred by chance as a result of the combined effects of unforeseen variables not under the direct control of the experimenter.
- An inferential test of a null hypothesis yields, as its final result, a probability that the null hypothesis is true.
- The symbol for probability is a lower-case p.
- Thus, if we find that the probability that the null hypothesis is true in a given study is less than 5 in 100, this result would be expressed as p < .05.
- In other words, the null hypothesis would only be true five percent of the time, so it is probably not true.
- There is always some probability that the null hypothesis is true, so researchers have settled on the .05 level as the level at which it is appropriate to reject the null hypothesis.
78Descriptive vs. Inferential Statistics
- When an alpha of .05 is used, we are, in effect,
willing to be wrong 5 times in 100 in rejecting
the null hypothesis. - We are taking a calculated risk that we might be
wrong 5% of the time. - This type of error is known as a Type I Error: the
error of rejecting the null hypothesis when it is
correct. - When the probability is low that the null
hypothesis is correct, we reject the null
hypothesis by declaring the result to be
statistically significant.
79Descriptive vs. Inferential Statistics
- You will also frequently see p values of less
than .05 reported. - The most common are p < .01 (less than 1 in 100)
and p < .001 (less than 1 in 1000). - When a result is statistically significant at
these levels, investigators can be more confident
of not committing a Type I error.
80Descriptive vs. Inferential Statistics
- To review:
- .06 level: not significant; do not reject the H0.
- .05 level: significant; reject the H0.
- .01 level: more significant; reject the H0 with
more confidence than at the .05 level. - .001 level: highly significant; reject the H0
with even more confidence than at the .01 or .05
levels.
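The decision rule reviewed above can be sketched as a small Python helper. The function name `decide` and the default alpha are illustrative assumptions, not a standard API:

```python
# Hypothetical helper illustrating the decision rule:
# reject H0 whenever the obtained p value falls below alpha.
def decide(p, alpha=0.05):
    return "reject H0" if p < alpha else "fail to reject H0"

decide(0.06)   # fail to reject H0
decide(0.04)   # reject H0
decide(0.001)  # reject H0, with even more confidence
```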
81Descriptive vs. Inferential Statistics
- Should you decide to use some level other than
.05, you should decide that in advance of
examining the data. - When you require a lower probability before
rejecting the null hypothesis (e.g., .01 instead
of .05), you are increasing the likelihood that
you will make a Type II Error. - A Type II error is the error of failing to reject
the null hypothesis when it is false. - This type of error can have serious consequences.
82Descriptive vs. Inferential Statistics
- Although either decision we make about the null
hypothesis (reject or fail to reject) may be
wrong, by using inferential statistics and
reporting the probability level, we are
informing others of the likelihood that we are
incorrect when we decide to reject the null
hypothesis.
83Parametric Statistics
- The nature of the data, continuous versus
categorical, is an important consideration in
deciding whether to use parametric or
nonparametric statistical tests. - A parametric statistical test specifies certain
conditions about the distribution of responses in
the population from which the research sample was
drawn. - Specifically, the data must satisfy the following
assumptions:
84Parametric Statistics
- The assumption of normality: that the samples upon
which the research is done were selected from
populations which are normally distributed
- The assumption of homogeneity of variance: that
the spread (variance or standard deviation) of
the dependent variable (e.g., score) within the
groups tested must be statistically equal. That
is, the shape of each group's distributional
curve should be equal and
- The assumption that the nature of the data is
continuous.
85Parametric Statistics
- If the data collected satisfy all three
assumptions, parametric procedures are
recommended. - If any of these three assumptions are violated by
the data, then non-parametric statistical tests
should be used.
86t-Test
- t-tests are used to compare the means of two
samples for statistical significance. - The t-test can be used to test two groups on a
pre-test only; two groups on a post-test only;
one group on a pre-test versus post-test; or two
groups on gain scores (e.g., post-test minus
pre-test). - The t-test rests on the
following principles: - The larger the sample, the less likely the
difference between two means was created by
sampling errors
87t-Test
- The larger the difference between the two means,
the less likely that the difference was created
by sampling errors and - The smaller the variance among the subjects,
the less likely that the difference between two means
was created by sampling errors. - There are two types of t tests.
- One is for independent (uncorrelated) data and
the other is for dependent (correlated) data. - Independent data are obtained when there is no
matching or pairing of subjects across groups;
dependent data are obtained when each score in
one set of scores is paired with a score in
another set.
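A minimal sketch of the independent-samples t statistic, using a pooled variance and hypothetical scores. Converting t to a p value would require a t distribution table or a statistics library, which is omitted here:

```python
import math
import statistics

def independent_t(a, b):
    """t statistic for two independent samples, using a pooled variance."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))  # standard error of the difference
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical post-test scores for two groups
group1 = [12, 14, 11, 13, 15]
group2 = [9, 10, 8, 11, 12]
t = independent_t(group1, group2)  # larger |t| -> less likely due to sampling error
```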
88ANOVA
- Closely related to the t test is analysis of
variance (ANOVA). - ANOVA is the most traditional and widely
accepted form of statistical analysis. - ANOVA is used to test the difference(s) among two
or more means utilizing a single statistical
operation. - ANOVA accomplishes its statistical testing by
comparing the variance between groups to the
variance within groups. - A resulting F-ratio (variance between groups
divided by variance within groups) and an
associated significance level are found.
89ANOVA
- One-way, or single-factor, ANOVA is used when
subjects are classified according to only one
categorical variable (e.g., drug group or method of
instruction). - Two-way, or two-factor, ANOVA is used when each
subject is classified in two ways such as 1) drug
group and 2) gender. - A two-way ANOVA examines two main effects (drug
level and gender) and one interaction (drug level
x gender). - This is done by computing three values of F (one
for each of the three null hypotheses) and
determining the probability associated with each.
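The F-ratio computation described for one-way ANOVA can be sketched directly; the three groups of scores below are hypothetical:

```python
import statistics

def one_way_f(*groups):
    """F ratio: mean square between groups / mean square within groups."""
    k = len(groups)                                # number of groups
    n = sum(len(g) for g in groups)                # total number of subjects
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)              # variance between groups
    ms_within = ss_within / (n - k)                # variance within groups
    return ms_between / ms_within

# Hypothetical scores for three instruction methods
f = one_way_f([3, 4, 5], [5, 6, 7], [7, 8, 9])
```

The resulting F would then be compared against an F distribution with (k - 1, n - k) degrees of freedom to obtain the significance level.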
90Pearson r
- The Pearson product-moment linear correlation
coefficient r is a very popular parametric
statistical measure of the relationship between
two continuous data variables. - Pearson r is used when the researcher wishes to
study how a change in one variable may tend to be
related to a change in a second variable. - Since Pearson r is a measure of relationship,
data on two variables are collected from the same
group of subjects and paired.
91Pearson r
- A resulting r value and an associated
significance level would assess both the
direction (+, or direct, versus -, or inverse) and the
strength (between 0 and 1.00) of the relationship
between the two variables.
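A minimal sketch of Pearson r from its definition (the sum of cross-products of deviations divided by the product of the deviation sums of squares), using hypothetical paired data from the same subjects:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired continuous variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cross / (sx * sy)

# Hypothetical paired data collected from the same group of subjects
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
r = pearson_r(hours, scores)  # perfectly direct (positive) relationship
```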
92Non-Parametric Statistics
- A non-parametric statistical test is based on a
model that specifies only very general conditions
and none regarding the specific form of the
distribution from which the sample was drawn. - Non-parametric tests need none of the three
parametric assumptions satisfied for their proper
application. - Non-parametric tests can be applied in almost any
research situation. - Usually, the non-parametric methods serve as the
statistical tests for nominal or ordinal scaled
data.
93Chi-Square
- The most popular of all non-parametric
inferential statistical methods is the chi-square
(χ²). - χ² tests for differences between categorical
variables (nominal or ordinal). - Such data do not permit the computation of means
and standard deviations.
subjects who were found in each category (the
frequency) and the corresponding proportions or
percentages.
94Chi-Square
- There are both one-way and two-way chi-square
procedures. - A one-way ?2 (also known as a goodness of fit
chi-square) is used if one categorical variable
is involved, say political affiliation. - The one-way chi-square would test for differences
in popularity between the political party
candidates - Candidate Smith n 110
- Candidate Doe n 90
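Using the candidate counts above, and taking equal popularity as the null hypothesis (an expected frequency of 100 votes per candidate), the one-way chi-square works out as follows:

```python
def chi_square_gof(observed, expected):
    """One-way (goodness-of-fit) chi-square: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [110, 90]    # Candidate Smith, Candidate Doe
expected = [100, 100]   # equal popularity under the null hypothesis
chi2 = chi_square_gof(observed, expected)  # (10**2 + 10**2) / 100 = 2.0
```

The obtained χ² of 2.0 would then be compared against a critical value with 1 degree of freedom to decide significance.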
95Chi-Square
- The two-way chi-square is used when two
categorical variables are to be compared
(political candidate and gender):
- Candidate Jones / Candidate Black
- Males: n = 80 / n = 120
- Females: n = 120 / n = 80
- There are two types of chi-square tests.
- A chi-square test of homogeneity involves two or
more populations, as above, on one outcome
variable.
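For the two-way table above, each expected frequency comes from the row and column totals, and the chi-square sums the (O - E)²/E terms over all four cells. A sketch:

```python
def chi_square_2way(table):
    """Two-way chi-square from a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected frequency = (row total * column total) / grand total
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Candidate-by-gender counts from the slide
table = [[80, 120],    # males: Jones, Black
         [120, 80]]    # females: Jones, Black
chi2 = chi_square_2way(table)  # every expected cell is 100, so chi2 = 16.0
```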
96Chi-Square
- A chi-square test of independence involves one
population, classified in two ways. - For example, a random sample of college students
was asked whether they think that IQ tests
measure innate intelligence and whether they had
taken a tests and measurements course. - The data would render two categories of
information (opinion on innateness and course taking).
97Wilcoxon Matched Pairs Sign Test
- The Wilcoxon Matched Pairs Signed-Ranks test is a
commonly used non-parametric analog of the
paired t test that utilizes information about
both the magnitude and direction of difference
for pairs of scores. - In the behavioral sciences, it is the most commonly
used non-parametric test of the significance of a
difference between dependent samples. - This test is appropriate for studies involving
repeated measures, as in the pre-test and
post-test designs in which the same subjects
serve as their own controls or in cases which use
matched pairs.
98Wilcoxon Matched Pairs Sign Test
- Suppose we wish to determine whether preschool
children with impairments in both grammar and
phonology will make more speech sound errors when
imitating grammatically complete sentences than
when imitating relatively simple sentences that
are comparable in length. - We are using one set of children and comparing
speech sound errors across the two levels of
grammatical complexity.
99Wilcoxon Matched Pairs Sign Test
- When testing dependent or correlated samples, the
Wilcoxon matched pairs sign test will determine
which member of a pair of scores is larger or
smaller than the other (as denoted by + or -,
respectively), and the ranking of such size
differences. - Paired scores are organized into a table, and the
difference is found (+ if the first of the pair is
larger, - if the first in the pair is smaller). - The sign of a number has no real mathematical
significance; it just serves to mark the
direction of the difference between the pairs of
scores.
100Wilcoxon Matched Pairs Sign Test
- Then the differences are ranked according to
their relative magnitude, assigning an average
rank score to each tie irrespective of whether
the sign is positive or negative. - Zero difference scores between pairs (d = 0) are
dropped from the analysis. - Therefore, only the total number of signed ranks (n)
is used in determining the criterion for
rejecting the null hypothesis. - Finally, the ranks of the difference scores
having the less frequent sign are summed (T).
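The ranking-and-summing procedure above can be sketched as follows. The pre/post scores are hypothetical, and T is taken as the smaller of the two signed rank sums:

```python
def wilcoxon_t(pre, post):
    """Wilcoxon matched pairs signed-ranks T: rank nonzero |d| (average
    ranks for ties), then take the smaller sum of same-signed ranks."""
    # differences, dropping d = 0 pairs from the analysis
    diffs = [a - b for a, b in zip(pre, post) if a != b]
    diffs.sort(key=abs)  # order by magnitude, irrespective of sign
    ranks = {}
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        avg = (i + 1 + j) / 2          # average rank for the tied block
        for k in range(i, j):
            ranks[k] = avg
        i = j
    pos = sum(r for k, r in ranks.items() if diffs[k] > 0)
    neg = sum(r for k, r in ranks.items() if diffs[k] < 0)
    return min(pos, neg)

# Hypothetical pre-test and post-test scores for the same subjects
T = wilcoxon_t([10, 12, 9, 14, 11], [8, 12, 11, 10, 7])
```

T is then compared against a tabular critical value for n (the number of nonzero differences) to decide whether to reject the null hypothesis.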
101Mann-Whitney U Test
- The Mann-Whitney U test looks at whether the
distribution of scores for one random sample is
significantly different from the distribution of
another independent random sample. - It is concerned with the equality of medians
rather than means. - It is commonly used when the parametric t test's
assumptions of normality and homogeneity of
variance are violated.
102Mann-Whitney U Test
- Suppose we are interested in knowing whether the
physical status of newborns is related to their
subsequent development of receptive language. - For this purpose, we conduct a prospective study
in which Apgar scores are collected on a random
sample. - Such scores are used to denote the general
condition of the infant shortly after birth based
on five physical indices including skin color,
heart rate, respiratory effort, muscle tone, and
reflex irritability.
103Mann-Whitney U Test
- The maximum score of 10 is indicative of
excellent physical condition. - Using these numerical values as our independent
variables, we divide our sample into two groups
10 children with high Apgar scores (greater than
6) and 8 children with low Apgar scores (less
than 4). - Composite language scores, obtained from these
same children at ages 3 to 3.5 years on the
appropriate subtests of the CELF-P serve as the
dependent variable.
104Mann-Whitney U Test
- Our research hypothesis is that there is a
difference in the receptive language ability of
children who scored high on the Apgar scale
versus those who scored low. - Data is organized by category (language score
high Apgar group and language score low Apgar
group) and then the rank of those language scores
is completed, just like with the Wilcoxon matched
pairs sign test.
105Mann-Whitney U Test
- This time, though, the ranks are summed for each
category. - A calculation is performed and the smaller of U1
and U2 serves as the observed value, which is
compared to a tabular critical value for
rejecting or retaining the null hypothesis.
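A sketch of the U computation from rank sums, using the common formulas U1 = R1 - n1(n1 + 1)/2 and U2 = n1·n2 - U1. The two small samples below are hypothetical stand-ins rather than the Apgar data:

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U from the rank sums of two independent samples."""
    # pool both samples, tagging each value with its group (0 = a, 1 = b)
    combined = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg = (i + 1 + j) / 2          # average rank for tied values
        for k in range(i, j):
            ranks[k] = avg
        i = j
    r1 = sum(r for r, (v, g) in zip(ranks, combined) if g == 0)
    n1, n2 = len(a), len(b)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return min(u1, u2)                 # smaller U serves as the observed value

# Hypothetical scores for two independent groups
U = mann_whitney_u([3, 4, 2, 6], [9, 7, 5, 10])
```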