Title: Welcome to IST 380 !
1Welcome to IST 380 !
Data Science Programming
We don't have strong enough words to describe
this class.
- US News and Course Report
When the course was over, I knew it was a good
thing.
an advocate of concrete computing and HMC's
mascot
- New York Times Review of Courses
We give this course two thumbs!
- Ebert and Roeper
3About myself
Who
Zach Dodds
Where
Harvey Mudd College
What
Research includes robotics and computer vision
When
Mondays 7-10pm here in ACB 119
Contact Information
dodds_at_cs.hmc.edu, 909-607-0867
HMC Beckman B111
Office Hours
Friday mornings, 9-11 am, or set up a time...
4TMI?
fan of low-tech games
fan of low-level AI
5IST 380 the big picture
What is it?
Why me?
6IST 380 the big picture
What is it?
Data Science Venn Diagram
Hmmm where am I on this diagram?
7Data?!
- Neighbor's name
- A place they consider home
- Are they working at a company now?
- How many U.S. states have they visited?
- Their favorite unhealthy food ?
- Do they have any "Data Science"
background? (statistics, machine learning, CS)
Where?
8state reminders
9Data!
- Neighbor's name: Zachary Dodds
- A place they consider home: Pittsburgh, PA
- Are they working at a company now? Where? Harvey Mudd
- How many U.S. states have they visited? 44
- Their favorite unhealthy food? M&Ms
- Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me
10Data!
This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field.
Be sure to set up your login profile for the submission site.
11Data Science concerns
Is "Data Science" important or just trendy?
12Data Science concerns
Hmmm
13the companies are expanding as fast as the data!
14There's certainly a lot of it!
Data, data everywhere
Data produced each year (logarithmic scale)
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB
For comparison
100 years of HD video + audio: 60 PB
Human brain's capacity: 14 PB
Also marked on the chart: 120 PB
Scale marks: 1 Petabyte, 1 Exabyte, 1 Zettabyte
1 Petabyte = 1000 TB
1 TB = 1000 GB
References
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
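To get a feel for how fast these yearly figures grow, here is a rough R sketch using only the chart's own numbers. It assumes the usual decimal prefixes (1 ZB = 1000 EB, in line with the slide's 1 Petabyte = 1000 TB) and steady exponential growth between the first and last points; the variable names are illustrative.

```r
# Yearly data volumes from the chart, converted to exabytes
# (assumption: 1 ZB = 1000 EB).
year <- c(2002, 2006, 2009, 2011, 2015)
exabytes <- c(5, 161, 800, 1800, 8000)

# Overall growth factor between the first and last points.
total_growth <- exabytes[length(exabytes)] / exabytes[1]

# Average yearly growth factor, assuming steady exponential growth.
span_years <- year[length(year)] - year[1]
yearly_growth <- total_growth^(1 / span_years)

print(total_growth)             # 1600
print(round(yearly_growth, 2))  # about 1.76

# On a logarithmic scale, exponential growth shows up as a roughly straight line.
log10_exabytes <- log10(exabytes)
print(round(log10_exabytes, 2))
```

Under those assumptions, the amount of data produced per year grows by roughly three quarters each year.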
15I'd call it data, not information
wisdom
knowledge
information
data
16Big Data?
I agree with this
17Make data easier to use by using it!
It may be true that Data Science isn't a science
but that doesn't mean it's not useful!
18IST 380 the big picture
What?
Why?
Data Science Programming
Data Rules
All of our insights, large and small, permanent
and ephemeral, natural and artificial, come
about through the integration of lots of
data. Data Science simply recognizes that the
rules and skills behind those insights are widely
applicable.
19A few examples
Make3d
Andrew Ng, Computers and Thought award, 2009
How is this being done?
and how do we succeed?
Data Science is at the heart of computer science
20A few examples
Learning to Powerslide
Stanford's Autonomous Vehicles project (Thrun et
al.)
Data Science is at the heart of computer science
21A few examples
Learning ground from obstacles
"my summer was finding that red line"
Data Science is at the heart of computer science
22A few examples
classification
segmentation
Learning ground from obstacles
23Insights beyond science
24Marketing
25Visualization
Motivation
27Recommender Systems
predicting movie ratings
28Netflix Prize
Bob Bell, winner of the "Netflix prize"
(I don't know this guy)
Napoleon Dynamite: 1.22
Batman Begins: .75
Finding Nemo: ??
Lord of the Rings: ??
Some films are difficult to predict
29Netflix Prize
Bob Bell, winner of the "Netflix prize"
(I don't know this guy)
Napoleon Dynamite: 1.22
Batman Begins: .75
Finding Nemo: .67
Lord of the Rings: .42
Some films are difficult to predict and others
are easier!
30Why IST 380 ?
Specific skills
R statistical environment (and the S programming
language)
Experience with several statistical analyses
(descriptive statistics)
Experience with predictive statistics (modeling)
and machine learning algorithms
31Why IST 380 ?
Broad background
Final project open-ended with datasets of your
choice
You'll be confident and capable with whatever
datasets you encounter in the future on your
own or as part of a team.
32About IST 380
33Details
Web Page
http://www.cs.hmc.edu/dodds/IST380
Assignments, online text, necessary files, and
lecture slides are linked
First week's assignment: Getting started with R
Textbook
An Introduction to Data Science
jsresearch.net/groups/teachdatascience/
freely available online
and many online resources
Programming
R
www.r-project.org/
Grab both of these now
34Homepage
Go to the course page
http://www.cs.hmc.edu/dodds/IST380/
Grab R and the text from these two links
35Homework
Assignments
2-5 problems/week, 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
1 week + 1 day
36Homework
Working on programs
On your own or in groups of 2.
Divide the work at the keyboard evenly!
Submitting programs
at the submission website
Today's Lab
install software, ensure accounts are working
try out R - the first HW is officially due on 2/5
37Outline
using R
approximate!
descriptive statistics
Weeks 1-5
predictive statistics
probability distributions
"Data Science"
statistical modeling
support vector machines (SVMs)
Weeks 6-10
nearest neighbors (NN)
random forests
"Machine Learning"
No breaks?!
k-means algorithm
Weeks 11-15
Final Project
38Grading
Grades
Based on points percentage
if score > 0.95: grade "A"
if score > 0.90: grade "A-"
if score > 0.86: grade "B"
see the course syllabus for the full list...
800 points for assignments
400 points for the final project
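The grade cutoffs can be sketched as a small R function. This is only an illustration: `letter_grade` is a hypothetical helper, it uses just the three cutoffs shown on the slide (the rest are in the syllabus, so lower scores come back as NA instead of a guess), and it assumes the 800 assignment points and 400 project points make up the whole total.

```r
# Hypothetical helper: maps a points fraction to a letter grade using only
# the three cutoffs shown on the slide. Cutoffs below 0.86 are not listed
# there, so those scores return NA here.
letter_grade <- function(score) {
  if (score > 0.95) {
    "A"
  } else if (score > 0.90) {
    "A-"
  } else if (score > 0.86) {
    "B"
  } else {
    NA_character_
  }
}

total_points <- 800 + 400   # assignments + final project (assumed to be everything)
earned_points <- 1100       # made-up example value
score <- earned_points / total_points

print(round(score, 3))
print(letter_grade(score))
```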
Final project
- the last 4 weeks will work towards a larger,
final project
- there will be a short design phase and a short
final presentation
- choose your own problem to study (I'll have some
suggestions, too.)
- I'd encourage you to connect R and our Data
Science techniques to other datasets or projects
that you use/need/like, etc.
39Academic Honesty
- This course operates under CGU's (and all of
Claremont Schools') Academic Honesty policies
- Your work must be your own. This must be true for
the whole team, if you're working in a pair.
- Consulting with others (other than team members or
myself) is encouraged, but has to be limited to
discussion and debugging of problems. Sharing of
written, electronic, or verbal solutions/files/code
is a violation of CGU's academic honesty policy.
- A reasonable guideline: Work is your own if you
could delete all of it and recreate it yourself.
40Thoughts?
41Getting to know R
42Getting to know R
R is the programmer's toolkit for statistics
SAS, Stata, SPSS are preferred by those in
business intelligence
http://lang-index.sourceforge.net/categ
43Getting to know R
Free and very well supported online
44Getting to know R
R is responsive, up-to-date, and flexible
Data Science vs. Statistics
45Getting to know R
Try it!
1) Find the IST 380 course webpage
www.cs.hmc.edu/dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the
prompt
6 7
rnorm(10)
x <- 380
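A first session at the prompt might look roughly like the sketch below. The arithmetic expression and the variable names `product` and `draws` are illustrative choices, not taken from the slide.

```r
# Arithmetic at the prompt: R evaluates the expression.
# (The multiplication here is an illustrative choice.)
product <- 6 * 7
print(product)

# rnorm(10) draws 10 random values from a standard normal distribution,
# so the numbers differ on every run.
draws <- rnorm(10)
print(draws)

# Assignment uses the <- arrow; typing a name alone at the prompt
# prints its value.
x <- 380
x
```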
46Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started
4) When you finish, save your session and submit it!
This is problem 1 this week
47Saving your session
1) Create a folder named hw1, perhaps on your
desktop
2) Use the Save to file (Windows) or Save as
(Mac) in order to save your current console
session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file
in order to confirm it contains your whole
session!
This is problem 1 this week
48Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the
submission site link.
3) Choose a submission site login name and let me
know!
4) Once your account is made, login, change your
password to something you know, and submit
hw1.zip
5) You can submit again; all copies are saved
Troubles? Email me!
This webserver can be spacey -- I should know!
You've completed Problem 1!
49Reflection
Assignment?
Creating a vector?
Printing?
Average and standard deviation?
Comments?
Comments?
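As a quick recap of those reflection points, here is a minimal sketch with made-up numbers; `heights`, `avg`, and `spread` are illustrative names only.

```r
# A comment starts with # and is ignored by R.
heights <- c(61, 64, 67, 70, 73)  # assignment, and creating a vector (made-up data)
print(heights)                     # printing

avg <- mean(heights)               # average
spread <- sd(heights)              # sample standard deviation (divides by n - 1)

print(avg)
print(spread)
```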
50R types
You can use mode() to view the type of a variable.
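For instance, a small sketch of `mode()` on a few basic values (the variable names are only for illustration):

```r
count <- 380
course <- "IST 380"
is_fun <- TRUE

# mode() reports the storage mode of each value as a string.
count_mode <- mode(count)    # "numeric"
course_mode <- mode(course)  # "character"
fun_mode <- mode(is_fun)     # "logical"

print(c(count_mode, course_mode, fun_mode))
```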
51Where's the big data?
c( ) means "concatenate"
Vectors are R's sequences of a single type of element
52Where's the big data?
c( ) means "concatenate"
the colon also creates vectors
Vectors are R's sequences of a single type of element
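A short sketch of both ways of building vectors, with made-up values. The last line shows what "a single type of element" means in practice: mixing numbers and text makes R convert everything to text.

```r
# c() concatenates values into a vector.
states <- c(44, 12, 30)

# c() also concatenates existing vectors with new values.
more_states <- c(states, 7)

# The colon builds a vector of consecutive integers.
one_to_ten <- 1:10

# Mixing types forces a single common type: here everything becomes character.
mixed <- c(1, "two", 3)

print(more_states)
print(one_to_ten)
print(mode(mixed))
```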
53Analyzing vectors try these
Square brackets can "subset" (or "slice")
vectors
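A few ways to slice a vector by position, as a sketch with made-up values:

```r
v <- c(10, 20, 30, 40, 50)

first <- v[1]          # R indexes from 1, so this is 10
middle <- v[2:4]       # a range of positions: 20 30 40
picked <- v[c(1, 5)]   # specific positions: 10 50
dropped <- v[-1]       # a negative index drops that position: 20 30 40 50

print(middle)
print(picked)
print(dropped)
```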
54Analyzing vectors
you can use a boolean vector to subset another
vector
Square brackets can "subset" (or "slice")
vectors
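Subsetting with a boolean vector can be sketched like this (values and names are illustrative): a comparison produces one TRUE/FALSE per element, and the square brackets keep only the TRUE positions.

```r
v <- c(10, 20, 30, 40, 50)

# A comparison yields a logical (boolean) vector, one entry per element.
is_big <- v > 25
print(is_big)                     # FALSE FALSE TRUE TRUE TRUE

# Using it inside square brackets keeps only the TRUE positions.
big_values <- v[is_big]
print(big_values)                 # 30 40 50

# The condition can also be written inline.
multiples_of_20 <- v[v %% 20 == 0]
print(multiples_of_20)            # 20 40
```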
55NA
R uses NA to represent data that is "not
available"
The function is.na( ) tests for NA
What is going on here?
56NA
This uses subsetting to remove NA values!
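The slide's own code is not in this transcript, so the following is a guess at the idiom it refers to, using made-up data: `is.na()` marks the missing entries, and negating it inside square brackets keeps the rest.

```r
visited <- c(44, NA, 12, 30, NA)

# is.na() returns TRUE wherever a value is missing.
missing <- is.na(visited)
print(missing)

# Any arithmetic involving NA gives NA, so the plain mean is NA.
raw_mean <- mean(visited)
print(raw_mean)

# Subsetting with !is.na() keeps only the available values.
clean <- visited[!is.na(visited)]
print(clean)

# Many functions can also skip NA values directly with na.rm = TRUE.
clean_mean <- mean(visited, na.rm = TRUE)
print(clean_mean)
```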
57Data frames
R's fundamental data structures are data frames
The next tutorial will introduce them
58Irises
setosa
virginica
data() yields many built-in data files. This is
iris
59Subsetting iris data
df[rows, cols]
As with vectors, you can "subset" data frames.
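Using the built-in iris data, row and column subsetting might look like the sketch below; the result names are illustrative. Leaving the row or column slot empty means "all of them".

```r
# Load the built-in iris data frame (150 rows, 5 columns).
data(iris)
print(dim(iris))

# Rows 1 to 5, all columns.
first_rows <- iris[1:5, ]

# All rows, two columns selected by name.
two_cols <- iris[, c("Sepal.Length", "Species")]

# Rows chosen by a boolean condition, just as with vectors.
setosa <- iris[iris$Species == "setosa", ]

print(first_rows)
print(nrow(setosa))
```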
60Lab
The 2nd part of each class meeting is dedicated to
lab work.
I welcome you to stay for the lab, but it is not
required.
Today's lab
Work through Santorico and Shin's Tutorial for
the R Statistical Package and submit the console
sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt,
pr2_4.txt, and pr2_5.txt.
This is a nice reinforcement of vectors,
introduction to data frames, and a look at the
graphics that R supports.
61Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and
data-analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early
chapters
This is a fuller background on R and the field of
data science.
(submit your console session for both of these)
62Lab !
63CS vs. IS and IT ?
greater integration system-wide issues
smaller details machine specifics
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
64CS vs. IS and IT ?
Where will IS go?
65CS vs. IS and IT ?
66IT ?
Where will IT go?
67IT ?
69The bigger picture
Weeks 10-12 Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15 Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
73Data!
This class is truly seminar-style: we're
developing expertise in this field together.
Be sure to set up your login profile for the
submission site.