Title: IT 223
1IT 223
2About me
- Nedjla Ougouag
- Office: Room 701, CTI building
- Phone: (312) 362-5166
- Email: ntiouririne_at_cs.depaul.edu
- Homepage: http://condor.depaul.edu/ntiourir/Homepage.html
- Office hours
  - Without appointment: Tuesday 3:30-5:00
  - With appointment: after class
3About this course
- The course will discuss simple statistical methods and basic concepts of probability theory.
- The topics of the course are:
  - Descriptive statistics and representing data using graphs.
  - Linear regression models.
  - Sampling and experimental design.
  - An introduction to statistical inference: confidence intervals and hypothesis testing.
- We will use the statistical package SAS (Statistical Analysis System).
4About this course
- The statistical software SAS runs
  - on UNIX (accounts on Hawk are available to students)
  - on PCs with Windows (available in the computer labs)
5About this course
- Required textbook
  - D.S. Moore and G.P. McCabe, "Introduction to the Practice of Statistics", Fourth Edition, 2003. ISBN 0-7167-9657-0
- Optional
  - Michael Evans, "SAS Manual for Moore and McCabe's Introduction to the Practice of Statistics", Third Edition, 1999. ISBN 0-7167-3657-8
6Grading
- Homework assignments: 40%
- Midterm: 25%
- Final: 35%
- Assignment grading is based on meeting the goals defined as well as on effort.
- Syllabus (online)
7Assignment rules
- Only legible, organized homework will be graded. Always include your name, date, section, and homework number at the top of your assignment.
- How to submit your assignments:
  - via the DLWeb (collate into ONE Word file).
  - in class (staple pages together).
- No e-mail submissions will be accepted.
8Class time
- Attendance: my notes are not enough to get by. Plan on attending to do well.
- Since you're here, participate!
- Your homework assignments are the most important way for me to view your progress.
9Important information
- Update your email on Campus Connect.
- Go to ID Services to apply for your student computer account if you do not already have one. For more information, go to http://is.depaul.edu/communication/web/personal_students.asp
10Course material
- All lectures will be posted on the Distance Learning Web (DLWeb): https://dlweb.cti.depaul.edu/login/login.asp
- Assignments and grades will be posted on the DLWeb.
- A class discussion forum is available for you on the DLWeb.
11About you
- Your contact info: name, e-mail
- Major. Concentration if CS major. Graduate or undergraduate?
- Have you ever taken a statistics or probability course? If so, which ones and how long ago?
- Have you ever used any statistical software like SAS before? If so, which tool(s) and how familiar are you with it?
- What do you hope to get from this class?
- How familiar are you with Unix?
This questionnaire is online. Fill it out and e-mail it to me.
12Lecture 1: Exploratory Data Analysis
13Outline
- Quick Math assessment/review
- Exploratory data analysis (Sec. 1.1, 1.2)
- Discovering information from the data through
graphs and numbers.
14Math review
- 300 is what % of 2,000?
- 100,000 families -> 0.1 of 1% of these have income greater than $75K. How many families is that?
- There are 100 million eligible voters in the US. The Gallup poll interviews 5,000 of them. This is equivalent to 1 out of every ?
- In the US (hypothetically), 1 in 500 is in the Army and 3 in 10,000 are officers. What % of Army personnel are officers?
- 1 out of 1,500 people is a Marine and 1/10 of Marines are officers. What % of the population are officers in the Marines?
- Without a calculator: Sqrt(100,000) is closest to 30, 100, 300, or 1,000?
- Is Sqrt(0.5) < 0.5?
- Is Sqrt(2) < 2?
15- 9. Solve for x and y:
- x 3y 1
- 2x y -3
- A quart of vodka is 40% alcohol. A drink is made of OJ and vodka (V quarts of vodka for J quarts of OJ). What is the % alcohol in the drink? (Give a formula as a function of V and J.)
- Tom's mother is 3 times as old as he is. Next year, their ages will add up to 50. How old is Tom?
- Probability
  - Throw one die. What is the chance of getting a 1?
  - Throw two dice. What is the chance of getting two 1's?
- Consider two situations:
  - A) A coin is tossed 100 times. If it comes up heads 60 times or more, you win.
  - B) It is tossed 1,000 times. If it comes up heads 600 times or more, you win.
  - Which is better, A or B?
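The coin-tossing comparison can be checked numerically. The sketch below is not part of the original slides; it assumes a fair coin and independent tosses, and the function name `prob_at_least_heads` is made up for illustration. It adds up the exact binomial counts for "at least k heads in n tosses".

```python
from math import comb

def prob_at_least_heads(k: int, n: int) -> float:
    """Chance of k or more heads in n tosses of a fair coin (assumed fair)."""
    # Each of the 2**n equally likely sequences counts once;
    # comb(n, i) is the number of sequences with exactly i heads.
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return favourable / 2**n

p_a = prob_at_least_heads(60, 100)     # situation A
p_b = prob_at_least_heads(600, 1000)   # situation B

print(f"A: {p_a:.4g}")
print(f"B: {p_b:.4g}")
print("Better choice:", "A" if p_a > p_b else "B")
```

Both situations ask for at least 60% heads, but with more tosses the percentage of heads tends to stay closer to 50%, so the larger experiment is much less likely to reach 60%.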
16Summary of course content
- Statistics: the science of assembling, organizing, and analyzing data
- First gather data (available or to be sampled)
- Then analyze the data (graphs, numerical analysis)
- Probability theory
- Distributions
- Making predictions: inference
- Assessing prediction accuracy: hypothesis testing
17Exploratory Data Analysis
- The goal of statistics is to gain information from the data.
- First: data are collected.
- Data come from several sources:
  - Available data
    - Census data, federal agencies, governmental statistical offices (www.fedstats.gov), General Social Survey at the University of Chicago's NORC.
    - Several databases are available on the Internet or at the DePaul library!
  - New data
    - Sampling from the population of interest: observational studies
    - Conducting statistical experiments: medical trials, controlled experiments. When well designed, these provide the most reliable source of information!
18Next step after data collection?
- Long listings of data are of little value.
- Statistical methods come to help us.
- Exploratory data analysis: a set of methods to display and summarize the data.
- In this course we will deal with data on one variable at a time.
- The distribution of the observations is analyzed by
  - Displaying the data in a graph that shows overall patterns and unusual observations (histogram, box plot, density curve)
  - Computing descriptive statistics that summarize specific aspects of the data (center and spread).
19To designate data with values: Random variables
- Data contain information about a group of individuals / subjects.
- A variable is a characteristic of an observed individual which takes different values for different individuals.
- Quantitative variable (continuous): takes numerical values.
  - Ex. Height, weight, age, income, measurements
- Qualitative/categorical variable: classifies an individual into categories or groups.
  - Ex. Sex, religion, occupation, age (in classes, e.g. 10-20, 20-30, 30-40)
- The distribution of a variable tells us what values it takes and how often it takes those values.
- Different statistical methods are used to analyze quantitative or categorical variables.
20Graphs for categorical variables
- The values of a categorical variable are labels.
- The distribution of a categorical variable lists
the count or percentage of individuals in each
category.
Counts: 212, 168, 20 (a sample of 400 wireless internet users).
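As a small illustration of turning category counts into percentages, the sketch below uses the three counts from the wireless internet sample. The category names did not survive in the transcript, so the labels in the code are placeholders, not the real categories.

```python
# Counts from the sample of 400 wireless internet users.
# The category labels are placeholders; the real names are not in the transcript.
counts = {"category 1": 212, "category 2": 168, "category 3": 20}

total = sum(counts.values())

# Percentage of individuals in each category.
percentages = {label: 100 * c / total for label, c in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]} of {total} = {pct:.1f}%")
```

The percentages add up to 100, which is a quick check that every individual was placed in exactly one category.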
22Example: On the morning of April 10, 1912, the Titanic sailed from the port of Southampton (UK), bound for New York. Altogether there were 2,201 passengers and crew members on board. This is the table of the survivors of the famous tragic accident.
Define the categorical variables.
23Bar chart representing the data in the table
above (in percentages)
24Graphs for quantitative variables: the histogram
Example: CEO salaries. Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data extracted are the age and annual salary of the chief executive officer for the first 60 ranked firms. (Data at http://lib.stat.cmu.edu/DASL/DataArchive.html)
Salary of chief executive officer (including bonuses), in thousands of dollars:
145 621 262 208 362 424 339 736 291 58
498 643 390 332 750 368 659 234 396 300
343 536 543 217 298 1103 406 254 862 204
206 250 21 298 350 800 726 370 536 291
808 543 149 350 242 198 213 296 317 482
155 802 200 282 573 388 250 396 572
25Drawing a histogram
- Construct a distribution table:
  - Define class intervals or bins (choose intervals of equal width!).
  - Count the percentage of observations in each interval.
  - End-point convention: the left endpoint of the interval is included and the right endpoint is excluded, i.e. [a, b).
- Draw the horizontal axis.
- Construct the blocks:
  - Height of block: percentages!
  - The total area under a histogram must be 100%.
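The distribution-table steps can be sketched in a few lines. This is only an illustration: the bin width of 200 (thousand dollars) is an assumption chosen for the example and is not necessarily the binning used in the lecture's figure, and `distribution_table` is a made-up helper name. The data are the CEO salaries listed in the example.

```python
# CEO salaries in thousands of dollars (the 59 values listed in the example).
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58,
    498, 643, 390, 332, 750, 368, 659, 234, 396, 300,
    343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291,
    808, 543, 149, 350, 242, 198, 213, 296, 317, 482,
    155, 802, 200, 282, 573, 388, 250, 396, 572,
]

BIN_WIDTH = 200  # assumed width, chosen only for illustration

def distribution_table(data, width):
    """Return (left, right, count, percent) rows for equal-width bins [left, right)."""
    n_bins = max(data) // width + 1
    counts = [0] * n_bins
    for x in data:
        # Integer division puts x in [left, right): left endpoint included,
        # right endpoint excluded.
        counts[x // width] += 1
    return [(i * width, (i + 1) * width, c, 100 * c / len(data))
            for i, c in enumerate(counts)]

table = distribution_table(salaries, BIN_WIDTH)
for left, right, count, pct in table:
    print(f"[{left:>4}, {right:>4})  count={count:>2}  {pct:5.2f}%")
```

Note that a salary of exactly 200 lands in the bin [200, 400), following the end-point convention. Changing `BIN_WIDTH` shows how the shape of the table depends on the choice of class intervals.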
27Block percentages labeled in the histogram: 30.50, 23.73, 3.39, 1.70
The area of each block represents the percentage of cases in the corresponding class interval (or bin).
28Remarks
- A histogram represents percent by area. The area of each block represents the percentage of cases in the corresponding class interval.
- The total area under a histogram is 100%.
- There is no fixed choice for the number of classes in a histogram:
  - If class intervals are too small, the histogram will have spikes.
  - If class intervals are too large, some information will be missed.
  - Use your judgment!
- Typically statistical software will choose the class intervals for you, but you can modify them.
- Let's try various binning levels.
29Example: Smoking
In a Public Health Service study, a histogram was plotted showing the number of cigarettes smoked per day by each subject (male current smokers), as shown below. The density is marked in parentheses. The class intervals include the left endpoint, but not the right.
- The percentage who smoked less than two packs a day but at least a pack is around (there are 20 cigarettes in a pack): 1.5%, 15%, 30%, 50%
- The percent who smoked at least a pack a day is around: 1.5%, 15%, 30%, 50%
- The percent who smoked at least 3 packs a day is around: 0.25 of 1%, 0.5 of 1%, 10%
- The percent who smoked 20 cigarettes a day is around: 0.35 of 1%, 0.5 of 1%, 1.5%, 3.5%, 10%
30Answers
- The percentage who smoked less than two packs a day but at least a pack (note: there are 20 cigarettes in a pack) is given by the area of the third block: 1.5% x (40 - 20) = 1.5% x 20 = 30%.
- The percent who smoked at least a pack a day is given by the area of the third and fourth blocks: 30% + 0.5% x 40 = 50%.
- The percent who smoked at least 3 packs a day is the area of the block for number of cigarettes greater than or equal to 60. This is half of the fourth block: 10%.
- The percent who smoked 20 cigarettes a day: use the left endpoint convention, so 20 belongs to the third block. The answer is 1.5%.
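The "percent = area" arithmetic in these answers can be reproduced with a short sketch. The block boundaries and densities below (third block from 20 to 40 at 1.5% per cigarette, fourth block from 40 to 80 at 0.5% per cigarette) are inferred from the answers themselves, since the figure is not in the transcript; the first two blocks are unknown and left out. `percent_between` is a hypothetical helper name.

```python
# (left, right, density in % per cigarette), inferred from the worked answers.
# Blocks below 20 cigarettes are not recoverable from the transcript and are omitted.
blocks = [
    (20, 40, 1.5),  # third block
    (40, 80, 0.5),  # fourth block
]

def percent_between(lo, hi):
    """Area under the histogram over [lo, hi): sum of density x overlapping width."""
    total = 0.0
    for left, right, density in blocks:
        overlap = max(0, min(hi, right) - max(lo, left))
        total += density * overlap
    return total

one_to_two_packs = percent_between(20, 40)      # at least a pack, less than two
at_least_one_pack = percent_between(20, 80)     # third and fourth blocks
at_least_three_packs = percent_between(60, 80)  # half of the fourth block

print(one_to_two_packs, at_least_one_pack, at_least_three_packs)
```

The percent smoking exactly 20 cigarettes is simply the density of the block containing 20, which under the left-endpoint convention is the third block.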
31Using histograms for comparisons
Fuel economy for model year 2001 compact and two-seater cars (Table 1.8, pg. 38): city consumption and highway consumption.
34Describing distributions with numbers
- A distribution can be described through the measures of its center and of its spread.
- Measuring the center
  - The most common measures are the mean (or average) and the median.
- The mean or average
  - To calculate the average of a set of observations, add their values and divide by the number of observations.
Data: number of home runs hit by Babe Ruth as a Yankee: 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22. The mean number of home runs hit in a year is (54 + 59 + ... + 22)/15 = 659/15, or about 43.9.
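The mean of the Babe Ruth data can be checked directly from the definition (add the values, divide by the number of observations). A minimal sketch, using only the numbers listed on the slide:

```python
# Home runs hit by Babe Ruth in each of his 15 seasons as a Yankee.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

total = sum(ruth)      # add the values
n = len(ruth)          # number of observations
mean = total / n       # divide the sum by the count

print(f"sum = {total}, n = {n}, mean = {mean:.1f}")
```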
35The median
- The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger.
- To find the median:
  - Sort all the observations in order of size, from smallest to largest.
  - If the number of observations n is odd, the median M is the center observation in the ordered list, i.e. M is the (n+1)/2-th observation.
  - If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.
Example 1: Ordered list of home run hits by Babe Ruth: 22 25 34 35 41 41 46 46 46 47 49 54 54 59 60. n = 15, median = 46 (the 8th observation).
Example 2: Ordered list of home run hits per season by Roger Maris (including his 61 in 1961): 8 13 14 16 23 26 28 33 39 61. n = 10, median = (23 + 26)/2 = 24.5.
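The rule for finding the median (sort, then take the center observation if n is odd, or the mean of the two center observations if n is even) can be written out as a short sketch and applied to both examples. The function name `median` is just a label for this illustration.

```python
def median(values):
    """Median by the sort-and-pick-the-middle rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1)/2-th observation, which is index mid when counting from 0.
        return ordered[mid]
    # Even n: mean of the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

ruth_median = median(ruth)
maris_median = median(maris)

print("Ruth:", ruth_median)
print("Maris:", maris_median)
```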
36Symmetric distribution
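- The mean and median of a symmetric distribution are close together.
(Figure: symmetric distribution, with 50% of the area on each side of the median; the mean and the median coincide.)
- In skewed distributions, the mean is farther out in the long tail than is the median. The mean is more sensitive to extreme values.
(Figures: a left-skewed and a right-skewed distribution, each with 50% of the area on either side of the median and with the mean and the median marked.)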
37Mean or median?
- The mean is a good measure for the center of a symmetric distribution.
- The median is a resistant measure and should be used for skewed distributions. Its value is only slightly affected by the presence of extreme observations, no matter how large these observations are.
38Example: Shopping in a supermarket
A marketing consultant observed 50 consecutive shoppers at a supermarket. The histogram below shows how much each shopper spent in the store.
Summary statistics: Mean = $34.70, Median = $27.855
The mean does not say much. The median says that about 50% of the shoppers spent less than 28 dollars. What else would you like to know?
39Spread of a Distribution
Two measures of spread:
1. The quartiles. First quartile Q1: the value such that 25% of the observations fall at or below it (Q1 is often called the 25th percentile). Third quartile Q3: the value such that 75% of the observations fall at or below it (Q3 is often called the 75th percentile). Typically used if the distribution of the observations is skewed.
The inter-quartile range IQR is defined as the distance between the two quartiles: IQR = Q3 - Q1.
(Figure: number line marking Q1, M, and Q3, with the IQR spanning from Q1 to Q3.)
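A sketch of computing the quartiles and the IQR on the Babe Ruth home run data used earlier. Conventions for quartiles differ between textbooks and software packages; the rule assumed here (Q1 is the median of the observations below the position of the overall median, Q3 is the median of those above it) is one common hand-calculation rule, so other tools may give slightly different values. The helper names are made up for this example.

```python
def median(values):
    """Median of a list: center value, or mean of the two center values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # observations below the median's position
    upper = ordered[(n + 1) // 2:]   # observations above the median's position
    return median(lower), median(upper)

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

q1, q3 = quartiles(ruth)
iqr = q3 - q1  # distance between the two quartiles

print(f"Q1 = {q1}, M = {median(ruth)}, Q3 = {q3}, IQR = {iqr}")
```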
40Example: Shopping in a supermarket (continued)
A marketing consultant observed 50 consecutive shoppers at a supermarket. The histogram below shows how much each shopper spent in the store.
Summary statistics: Mean = $34.70, Median = $27.855, Q1 = $19.27, Q3 = $45.40, IQR = 45.40 - 19.27 = $26.13
About 50% of the shoppers spent less than 28 dollars, 25% spent less than 20 dollars, and 25% of the customers of the store spent more than 45 dollars. Moreover, 50% of the customers spent between 20 and 45 dollars! Extreme values for purchases: > Q3 + 1.5 x IQR = 84.59
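The cutoff for extreme purchases follows from the summary statistics alone. The sketch below recomputes it from Q1 and Q3; the two purchase amounts at the end are hypothetical values used only to show how the cutoff would be applied, since the raw shopping data are not given.

```python
# Summary statistics from the shopping example (dollars).
q1 = 19.27
q3 = 45.40

iqr = q3 - q1                   # inter-quartile range
upper_cutoff = q3 + 1.5 * iqr   # purchases above this are flagged as extreme

def is_extreme(purchase):
    """True if a purchase exceeds Q3 + 1.5 x IQR."""
    return purchase > upper_cutoff

print(f"IQR = {iqr:.2f}, cutoff = {upper_cutoff:.3f}")

# Hypothetical purchase amounts, for illustration only.
print(is_extreme(90.00))
print(is_extreme(30.00))
```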