Welcome to IST 380!

Slides: 74
Provided by: hmc84
Learn more at: https://www.cs.hmc.edu
Transcript and Presenter's Notes

Title: Welcome to IST 380 !


1
Welcome to IST 380 !
Data Science Programming
We don't have strong enough words to describe this class.
- US News and Course Report
When the course was over, I knew it was a good thing.
- New York Times Review of Courses
We give this course two thumbs!
- Ebert and Roeper
an advocate of concrete computing and HMC's mascot
2
Welcome to IST 380 !
Data Science Programming
an advocate of concrete computing and HMC's
mascot
3
About myself
Who
Zach Dodds
Where
Harvey Mudd College
What
Research includes robotics and computer vision
When
Mondays 7-10pm here in ACB 119
Contact Information
dodds_at_cs.hmc.edu 909-607-0867
Office Hours
Friday mornings, 9-11 am
or set up a time...
HMC Beckman B111
4
TMI?
fan of low-tech games
fan of low-level AI
5
IST 380: the big picture
What is it?
Why me?
6
IST 380: the big picture
What is it?
Data Science Venn Diagram
Hmmm where am I on this diagram?
7
Data?!
  • Neighbor's name
  • A place they consider home
  • Are they working at a company now?
  • How many U.S. states have they visited?
  • Their favorite unhealthy food ?
  • Do they have any "Data Science"
    background? (statistics, machine learning, CS)

Where?
8
state reminders
9
Data!
  • Neighbor's name
  • A place they consider home
  • Are they working at a company now?
  • How many U.S. states have they visited?
  • Their favorite unhealthy food ?
  • Do they have any "Data Science"
    background? (statistics, machine learning, CS)

Zachary Dodds
Pittsburgh, PA
Harvey Mudd
Where?
44
M&Ms
mostly CS for me
10
Data!
  • Neighbor's name
  • A place they consider home
  • Are they working at a company now?
  • How many U.S. states have they visited?
  • Their favorite unhealthy food ?
  • Do they have any "Data Science"
    background? (statistics, machine learning, CS)

Zachary Dodds
Pittsburgh, PA
This class is truly seminar-style: I'm here, as
you are, in order to gain insights into this very
new field.
Harvey Mudd
Where?
44
M&Ms
mostly CS for me
be sure to set up your login profile for the
submission site
11
Data Science concerns
Is "Data Science" important or just trendy?
12
Data Science concerns
Hmmm
13
the companies are expanding as fast as the data!
14
There's certainly a lot of it!
Data, data everywhere
Data produced each year (logarithmic scale)
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB
Reference points
Human brain's capacity: 14 PB
100 years of HD video + audio: 60 PB
120 PB
Axis marks: 1 Petabyte, 1 Exabyte, 1 Zettabyte
1 Petabyte = 1000 TB
1 TB = 1000 GB
References
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(life in video) 60 PB in 4320p resolution,
extrapolated from 16MB for 121 of 640x480 video
(w/sound); almost certainly a gross
overestimate, as sleep can be compressed
significantly!
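To get a feel for the scale, here is a small R sketch that puts the yearly figures cited above into one vector and compares them. It assumes decimal (SI) units, in line with the slide's "1 Petabyte 1000 TB"; the variable names are made up for illustration.

```r
# Decimal (SI) byte units; an assumption that matches "1 Petabyte = 1000 TB".
PB <- 1e15
EB <- 1e18
ZB <- 1e21

# Data produced each year, as cited in the references above.
produced <- c("2002" = 5 * EB, "2006" = 161 * EB, "2009" = 800 * EB,
              "2011" = 1.8 * ZB, "2015" = 8 * ZB)

# How many times more data in 2015 than in 2002?
growth <- produced[["2015"]] / produced[["2002"]]
growth             # about 1600

# A logarithmic scale (as on the chart) makes the steps comparable.
log10(produced)
```

On a log10 scale each unit step is a factor of ten, which is why the chart can show petabytes and zettabytes together.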
15
I'd call it data, not information
wisdom
knowledge
information
data
16
Big Data?
I agree with this
17
Make data easier to use by using it!
It may be true that Data Science isn't a science
but that doesn't mean it's not useful!
18
IST 380: the big picture
What?
Why?
Data Science Programming
Data Rules
All of our insights, large and small, permanent
and ephemeral, natural and artificial, come
about through the integration of lots of
data. Data Science simply recognizes that the
rules and skills behind those insights are widely
applicable.
19
A few examples
Make3d
Andrew Ng: Computers and Thought award, 2009
How is this being done?
and how do we succeed?
Data Science is at the heart of computer science
20
A few examples
Learning to Powerslide
Stanford's Autonomous Vehicles project (Thrun et
al.)
Data Science is at the heart of computer science
21
A few examples
Learning ground from obstacles
"my summer was finding that red line"
Data Science is at the heart of computer science
22
A few examples
classification
segmentation
Learning ground from obstacles
23
Insights beyond science
24
Marketing
25
Visualization
Motivation
26
(No Transcript)
27
Recommender Systems
predicting movie ratings
28
Netflix Prize
Bob Bell, winner of the "Netflix prize"
(I don't know this guy)
Napoleon Dynamite: 1.22
Batman Begins: .75
Finding Nemo: ??
Lord of the Rings: ??
Some films are difficult to predict
29
Netflix Prize
Bob Bell, winner of the "Netflix prize"
(I don't know this guy)
Napoleon Dynamite: 1.22
Batman Begins: .75
Finding Nemo: .67
Lord of the Rings: .42
Some films are difficult to predict and others
are easier!
30
Why IST 380 ?
Specific skills
R statistical environment (and the S programming
language)
Experience with several statistical analyses
(descriptive statistics)
Experience with predictive statistics (modeling)
and machine learning algorithms
31
Why IST 380 ?
Specific skills
R statistical environment (and the S programming
language)
Experience with several statistical analyses
(descriptive statistics)
Experience with predictive statistics (modeling)
and machine learning algorithms
Broad background
Final project open-ended with datasets of your
choice
You'll be confident and capable with whatever
datasets you encounter in the future on your
own or as part of a team.
32
About IST 380
33
Details
Web Page
http://www.cs.hmc.edu/dodds/IST380
Assignments, online text, necessary files, and
lecture slides are linked
First week's assignment: Getting started with R
Textbook
An Introduction to Data Science
jsresearch.net/groups/teachdatascience/
freely available online
and many online resources
Programming: R
www.r-project.org/
Grab both of these now
34
Homepage
Go to the course page
http://www.cs.hmc.edu/dodds/IST380/
Grab R and the text from these two links
35
Homework
Assignments
2-5 problems/week, 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
1 week + 1 day
36
Homework
Assignments
2-5 problems/week, 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
On your own or in groups of 2.
Working on programs
Divide the work at the keyboard evenly!
Submitting programs at the submission website
Today's Lab
install software, ensure accounts are working
try out R - the first HW is officially due on 2/5
37
Outline
using R
approximate!
descriptive statistics
Weeks 1-5
predictive statistics
probability distributions
"Data Science"
statistical modeling
support vector machines (SVMs)
Weeks 6-10
nearest neighbors (NN)
random forests
"Machine Learning"
No breaks?!
k-means algorithm
Weeks 11-15
Final Project
38
Grading
Grades
Based on points percentage
if score > 0.95, grade "A"
if score > 0.90, grade "A-"
if score > 0.86, grade "B"
see the course syllabus for the full list...
800 points for assignments
400 points for the final project
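The grade rule above reads naturally as R code. This is only a sketch: the function name letter_grade is hypothetical, only the three cutoffs listed on the slide are included, and the fallback string stands in for the rest of the list in the syllabus.

```r
# Map a points percentage (0 to 1) to a letter grade.
# Only the cutoffs shown on the slide are encoded here.
letter_grade <- function(score) {
  if (score > 0.95) {
    "A"
  } else if (score > 0.90) {
    "A-"
  } else if (score > 0.86) {
    "B"
  } else {
    "see the course syllabus"   # placeholder: lower cutoffs are not on the slide
  }
}

# Example: 1130 points earned out of 800 + 400 possible.
score <- 1130 / (800 + 400)
letter_grade(score)
```

Because the checks run from the highest cutoff down, each score gets the first (best) grade it qualifies for.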
Final project
  • the last 4 weeks will work towards a larger,
    final project
  • there will be a short design phase and a short
    final presentation
  • choose your own problem to study (I'll have some
    suggestions, too.)
  • I'd encourage you to connect R and our Data
    Science techniques to other datasets or projects
    that you use/need/like, etc.

39
Academic Honesty
  • This course operates under CGU's (and all of
    Claremont Schools') Academic Honesty policies
  • Your work must be your own. This must be true for
    the whole team, if you're working in a pair.
  • Consulting with others (except team members or
    myself) is encouraged, but has to be limited to
    discussion and debugging of problems. Sharing of
    written, electronic, or verbal solutions/files/code
    is a violation of CGU's academic honesty
    policy.
  • A reasonable guideline: work is your own if you
    could delete all of it and recreate it yourself.

40
Thoughts?
41
Getting to know R
42
Getting to know R
R is the programmer's toolkit for statistics
SAS, Stata, SPSS are preferred by those in
business intelligence
http://lang-index.sourceforge.net/categ
43
Getting to know R
Free and very well supported online
44
Getting to know R
R is responsive, up-to-date, and flexible
Data Science vs. Statistics
45
Getting to know R
Try it!
1) Find the IST 380 course webpage
www.cs.hmc.edu/dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the
prompt
6 + 7
rnorm(10)
x <- 380
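A first session at the prompt might look like the sketch below. It assumes the first command is simple arithmetic; set.seed() and the name draws are additions for illustration, so that the random numbers repeat from run to run.

```r
# Basic commands to try at the R prompt.
6 + 7                 # arithmetic: R prints the result right away

set.seed(380)         # assumption: fix the seed so the draws are repeatable
draws <- rnorm(10)    # ten draws from a standard normal distribution
draws

x <- 380              # assignment uses the <- arrow
x                     # typing a name prints its value
```

Typing a bare expression or variable name at the prompt prints it; inside scripts and functions, print() makes that explicit.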
46
Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started
4) When you finish, save your session and submit it!
This is problem 1 this week
47
Saving your session
1) Create a folder named hw1, perhaps on your
desktop
2) Use the Save to file (Windows) or Save as
(Mac) in order to save your current console
session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file
in order to confirm it contains your whole
session!
This is problem 1 this week
48
Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the
submission site link.
3) Choose a submission site login name and let me
know!
4) Once your account is made, log in, change your
password to something you know, and submit
hw1.zip
5) You can submit again; all copies are saved.
Troubles? Email me!
This webserver can be spacey -- I should know!
You've completed Problem 1!
49
Reflection
Assignment?
Creating a vector?
Printing?
Average and standard deviation?
Comments?
Comments?
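The reflection questions map onto a handful of R idioms. Here is a short recap sketch; the vector values are made up for illustration and are not taken from the tutorial session.

```r
# Assignment and creating a vector: c() builds it, <- names it.
x <- c(5, 12, 13)

# Printing: print() or just the bare name at the prompt.
print(x)

# Average and standard deviation.
x_mean <- mean(x)   # arithmetic mean
x_sd <- sd(x)       # sample standard deviation (divides by n - 1)
x_mean
x_sd

# Comments: everything after a # on a line is ignored by R.
```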
50
R types
You can use mode() to view the type of a variable.
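For instance, a minimal sketch of mode() on a few common values (the variable names are illustrative):

```r
# mode() reports the basic storage type of a value or variable.
m_num <- mode(380)           # "numeric"
m_chr <- mode("IST 380")     # "character"
m_lgl <- mode(TRUE)          # "logical"
m_vec <- mode(c(1, 2, 3))    # "numeric": a vector has the mode of its elements

m_num
m_chr
m_lgl
m_vec
```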
51
Where's the big data?
c: concatenate
Vectors are R lists of a single type of element
52
Where's the big data?
c: concatenate
the colon also creates vectors
Vectors are R lists of a single type of element
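A short sketch of both ways to build vectors; the numbers and names are hypothetical. The last line shows what "a single type of element" means in practice.

```r
# c() concatenates values (or whole vectors) into one vector.
states <- c(44, 12, 30)

# The colon creates a vector of consecutive integers.
one_to_ten <- 1:10

# c() also joins existing vectors end to end.
combined <- c(states, one_to_ten)
length(combined)    # 13

# Mixing types forces everything to one common type (here, character).
mixed <- c(1, "two", 3)
mode(mixed)         # "character"
```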
53
Analyzing vectors try these
Square brackets can "subset" (or "slice")
vectors
54
Analyzing vectors
you can use a boolean vector to subset another
vector
Square brackets can "subset" (or "slice")
vectors
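A few things to try, as a sketch with a made-up vector v:

```r
v <- c(10, 20, 30, 40, 50)

second <- v[2]         # one element (R counts from 1): 20
middle <- v[2:4]       # a range of positions: 20 30 40
no_first <- v[-1]      # negative index drops a position: 20 30 40 50

# A boolean (logical) vector keeps the positions that are TRUE.
keep <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
picked <- v[keep]      # 10 30 50

# A comparison produces such a boolean vector, so this filters by value.
big <- v[v > 25]       # 30 40 50
```

The expression v > 25 is itself a vector of TRUE/FALSE values, which is what makes the last line work.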
55
NA
R uses NA to represent data that is "not
available"
The function is.na( ) tests for NA
What is going on here?
56
NA
R uses NA to represent data that is "not
available"
The function is.na( ) tests for NA
What is going on here?
This uses subsetting to remove NA values!
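Here is a sketch of the idea with a hypothetical vector of survey answers, two of them missing:

```r
visited <- c(44, NA, 12, NA, 30)

missing <- is.na(visited)    # FALSE TRUE FALSE TRUE FALSE

mean(visited)                # NA: one missing value makes the result NA

# Subsetting with the negated test keeps only the available values.
clean <- visited[!is.na(visited)]
clean                        # 44 12 30
clean_mean <- mean(clean)

# Many functions can also skip NA values themselves.
same_mean <- mean(visited, na.rm = TRUE)
```

Note that visited == NA does not work as a test, since any comparison with NA is itself NA; that is why is.na() exists.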
57
Data frames
R's fundamental data structures are data frames
The next tutorial will introduce them
58
Irises
setosa
virginica
data() yields many built-in data files. This is
iris
59
Subsetting iris data
df[rows, cols]
As with vectors, you can "subset" data frames.
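A sketch using the built-in iris data frame; the name setosa for the result is just an illustrative choice.

```r
data(iris)            # load the built-in iris data frame
dim(iris)             # 150 rows, 5 columns

first_rows <- iris[1:3, ]          # rows 1 to 3, all columns
species <- iris[, "Species"]       # all rows, one column (by name)

# Rows chosen by a boolean test, columns chosen by name.
setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Petal.Length")]
nrow(setosa)          # 50
mean(setosa$Sepal.Length)
```

Leaving the row or column slot empty means "all of them", so the comma is always needed for data frames.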
60
Lab
The 2nd part of each class meeting is dedicated to
lab work.
I welcome you to stay for the lab, but it is not
required.
Today's lab
Work through Santorico and Shin's Tutorial for
the R Statistical Package and submit the console
sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt,
pr2_4.txt, and pr2_5.txt.
This is a nice reinforcement of vectors,
introduction to data frames, and a look at the
graphics that R supports.
61
Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and
data-analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early
chapters
This is a fuller background on R and the field of
data science
(submit your console session for both of these)
62
Lab !
63
CS vs. IS and IT ?
greater integration system-wide issues
smaller details machine specifics
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
64
CS vs. IS and IT ?
Where will IS go?
65
CS vs. IS and IT ?
66
IT ?
Where will IT go?
67
IT ?
68
(No Transcript)
69
The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
70
Data?!
  • Neighbor's name
  • A place they consider home
  • Are they working at a company now?
  • How many U.S. states have they visited?
  • Their favorite unhealthy food ?
  • Do they have any "Data Science" (statistics,
    machine learning, CS) background?

Where?
71
state reminders
72
Data!
  • Neighbor's name
  • A place they consider home
  • Are they working at a company now?
  • How many U.S. states have they visited?
  • Their favorite unhealthy food ?
  • Do they have any "Data Science" (statistics,
    machine learning, CS) background?

Zachary Dodds
Pittsburgh, PA
Harvey Mudd
Where?
44
M&Ms
mostly CS for me
73
Data!
  • Neighbor's name
  • A place they consider home
  • Are they working at a company now?
  • How many U.S. states have they visited?
  • Their favorite unhealthy food ?
  • Do they have any "Data Science" (statistics,
    machine learning, CS) background?

Zachary Dodds
Pittsburgh, PA
Harvey Mudd
Where?
44
M&Ms
This class is truly seminar-style: we're
developing expertise in this field together.
mostly CS for me
be sure to set up your login profile for the
submission site