Title: Introduction to Data Mining
1Introduction to Data Mining
- Jiang Li
- Department of Computer Science & Information Technology
- Austin Peay State University
2Outline
- Data Collected
- Knowledge Discovery: An Iterative Process
- Data Mining Examples
- Data Mining Functions and Algorithms
3Data Collected
- Business
- Wal-Mart
- 20 million transactions a day
- Mobil Oil Corporation
- A 100-terabyte data warehouse
- Science
- The human genome database project
- Gigabytes of data
- NASA Earth Observing System (EOS)
- 50 gigabytes of data per hour
- Radio, Television, and Film Studios
- Multimedia databases
- WWW: the infinite resources
- Email: huge digital libraries
4Data vs. Knowledge
- Technology is available to help us collect data
- Bar codes, cameras, scanners, radars, satellites, etc.
- Technology is available to help us store data
- Databases, data warehouses, a variety of repositories
- We are swamped by data that pours in on us
- We need to interpret this data in search of new knowledge
"We are drowning in information, but starving for knowledge." (John Naisbitt)
Our need is to extract interesting knowledge
(rules, regularities, patterns, constraints) from
data in large collections.
5Evolution of Database Technology
- 1960s
- Data collection, database creation (hierarchical and network models)
- 1970s
- Relational data model, relational DBMS implementation
- 1980s
- Ubiquitous RDBMS, advanced data models (extended-relational, object-oriented, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s
- Data mining and data warehousing, multimedia databases, and Web-based database technology
6Knowledge Discovery
7Data Mining
- In theory, data mining is a step in the knowledge discovery process. It is the extraction of implicit information from a large dataset.
- In practice, data mining and knowledge discovery are becoming synonyms.
- KDD: Knowledge Discovery and Data Mining
Notice the misnomer in "data mining". Shouldn't it be "knowledge mining"?
8Steps of a KDD Process
- Learning the application domain
- Relevant prior knowledge and goals of the application
- Gathering and integrating data
- Cleaning and preprocessing data (may take 60% of effort!)
- Reducing and projecting data
- Finding useful features, dimensionality/variable reduction, etc.
- Choosing mining functions and algorithms
- Summarization, classification, regression, association, etc.
- Data mining: search for patterns of interest
- Evaluating results
- Interpretation: analysis of results
- Visualization, alteration, removing redundant patterns, etc.
- Use of discovered knowledge
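The steps above can be sketched as a small chain of functions. This is only an illustrative toy, not a prescribed implementation: the records, the function names, and the `min_buyers` threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, item, price); None marks a missing value.
raw_records = [
    (1, "bread", 2.5),
    (1, "butter", 3.0),
    (2, "bread", None),   # missing price: dropped during cleaning
    (2, "milk", 1.2),
    (3, "bread", 2.5),
    (3, "butter", 3.0),
    (3, "bread", 2.5),    # exact duplicate: dropped during cleaning
]

def clean(records):
    """Cleaning and preprocessing: drop rows with missing values and duplicates."""
    seen, result = set(), []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        result.append(row)
    return result

def reduce_and_project(records):
    """Reducing and projecting: keep only the features the mining step needs."""
    return [(customer, item) for customer, item, _price in records]

def mine(pairs):
    """Data mining: a simple summarization, counting distinct buyers per item."""
    return Counter(item for _customer, item in set(pairs))

def evaluate(counts, min_buyers=2):
    """Evaluating results: keep patterns seen for at least min_buyers customers."""
    return {item: n for item, n in counts.items() if n >= min_buyers}

patterns = evaluate(mine(reduce_and_project(clean(raw_records))))
print(patterns)
```

With this toy data, only bread and butter survive the evaluation step, each bought by two distinct customers. In a real project each stage would be revisited several times, which is why the process is described as iterative.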
9Data Mining On What Kind of Data?
- Flat Files
- Generic Data
- Relational and Object-Relational Databases
- Object-Oriented Databases
- Multimedia Data
- Text Databases
- Audio, Image, and Video Databases
- Business Data
- Transactional Databases
- Engineering Data
- Spatial databases
- Temporal and Time-series databases
- WWW Data
10Data Mining Examples
- Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations.
- It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics.
- It also enables them to determine the impact on sales, customer satisfaction, and corporate profits.
- Finally, it enables them to "drill down" into summary information to view detailed transactional data.
11Data Mining Examples
- With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history.
- By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.
- Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers.
- American Express can suggest products to its cardholders based on analysis of their monthly expenditures.
12Data Mining Examples
- Wal-Mart is pioneering massive data mining to transform its supplier relationships.
- Wal-Mart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5-terabyte Teradata data warehouse.
- Wal-Mart allows more than 3,500 suppliers to access data on their products and perform data analyses.
- These suppliers use this data to identify customer buying patterns at the store display level.
- They use this information to manage local store inventory and identify new merchandising opportunities.
13Business Data Mining Examples
- The NBA is exploring a data mining application that can be used in conjunction with image recordings of basketball games.
- The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies.
- For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the guard position, John Williams attempted four jump shots and made each one!
- A coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage.
- Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.
14Data Mining Functions and Algorithms
- Association Rules
- Data can be mined to identify associations.
- The butter -> bread example is an example of associative mining.
- To find rules like inside(x, city) -> near(x, highway).
- Classification and Prediction
- Classify data based on the values in a classifying attribute, e.g.,
- classify countries based on climate
- classify cars based on gas mileage
- Stored data is used to locate data in predetermined groups.
- A restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
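As a rough illustration of how a rule such as butter -> bread might be scored, the sketch below computes support and confidence, two measures commonly used for association rules but not named on these slides. The transactions and the helper names `support` and `confidence` are made up for the example.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"butter", "bread", "milk"},
    {"butter", "bread"},
    {"bread", "jam"},
    {"butter", "bread", "jam"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)

# Score the rule butter -> bread.
rule_support = support({"butter", "bread"}, transactions)
rule_confidence = confidence({"butter"}, {"bread"}, transactions)
print(rule_support, rule_confidence)
```

Here butter and bread appear together in three of the five baskets, and every basket containing butter also contains bread, so the rule has high confidence on this toy data. Real association mining algorithms search for all rules above chosen support and confidence thresholds rather than scoring one rule by hand.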
15Data Mining Functions and Algorithms
- Clustering
- Data items are grouped according to logical relationships or consumer preferences.
- Data can be mined to identify market segments or consumer affinities.
- To cluster houses to find distribution patterns.
- Sequential patterns
- Data is mined to anticipate behavior patterns and trends.
- An outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
- To find and characterize similar sequences and deviation data, e.g., stock analysis.
- To find segment-wise or total cycles or periodic behaviors in time-related data.
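To make the house-clustering example concrete, here is a minimal sketch using k-means, one common clustering algorithm that the slides do not name. The coordinates, the choice of two clusters, and the hand-picked starting centroids are all assumptions for illustration.

```python
import math

# Hypothetical house locations as (x, y) coordinates.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Starting centroids are picked by hand to keep the run deterministic.
centroids, clusters = k_means(houses, centroids=[houses[0], houses[3]])
print(centroids)
print(clusters)
```

On this toy data the two groups of houses are far apart, so the assignment settles after the first pass. Production code would normally choose starting centroids more carefully and stop once assignments no longer change.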
16Data Mining Linear Classification
- A simple linear classification boundary for the loan data set; the shaded region denotes the class "no loan"
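A linear boundary of this kind can be sketched as a weighted sum of attributes compared against zero. The two attributes used here (income and debt), the applicants, and the weights are all hypothetical; the slide does not give the actual features or coefficients of the loan data set.

```python
# Hypothetical applicants: (income, debt), both in thousands of dollars.
applicants = {
    "A": (80.0, 10.0),
    "B": (30.0, 25.0),
    "C": (55.0, 20.0),
}

# Assumed boundary: W_INCOME * income + W_DEBT * debt + BIAS = 0 (made-up weights).
W_INCOME, W_DEBT, BIAS = 1.0, -2.0, -10.0

def classify(income, debt):
    """Return 'loan' on the positive side of the boundary, 'no loan' otherwise."""
    score = W_INCOME * income + W_DEBT * debt + BIAS
    return "loan" if score > 0 else "no loan"

decisions = {name: classify(*features) for name, features in applicants.items()}
print(decisions)
```

In practice the weights would be learned from labeled historical loan records rather than set by hand, and the straight-line boundary is only a good fit when the two classes are roughly linearly separable.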
17Data Mining - Confluence of Multiple Disciplines