Title: data mining
1 Data Mining-PART I By
M.Dhilsath Fathima
2 Topics to cover
- Introduction
- Types of Data
- Data Mining Functionalities
- Interestingness of Patterns
- Classification of Data Mining Systems
- Data Mining Task Primitives
- Integration of a Data Mining System with a Data
Warehouse
- Issues
- Data Preprocessing
3 What is a Database?
A database is any organized collection of data.
4 Examples
Co-workers
5 Examples
Patient Information
6 Examples
Airline reservation system
7 DATABASE
- Database: A shared collection of logically related
data (and a description of this data), designed
to meet the information needs of an organization.
- Database Management System: A software system
that enables users to define, create, and
maintain the database and that provides
controlled access to this database.
8 Who does it, and how?
- A Database Management System (DBMS) does this job.
- Software tools include Access, FileMaker, Lotus
Notes, Oracle, and SQL Server.
- A DBMS includes tools to add, modify, or delete data
from the database, ask questions (or queries)
about the data stored in the database, and produce
reports summarizing selected contents.
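The operations listed above (add, modify, delete, query, report) can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the `staff` table and its rows are hypothetical and are not taken from the slides.

```python
import sqlite3

# In-memory database; the "staff" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")

# Add data.
conn.executemany(
    "INSERT INTO staff (name, dept) VALUES (?, ?)",
    [("Asha", "Sales"), ("Ravi", "Sales"), ("Meena", "Support")],
)
# Modify data.
conn.execute("UPDATE staff SET dept = ? WHERE name = ?", ("Support", "Ravi"))
# Delete data.
conn.execute("DELETE FROM staff WHERE name = ?", ("Asha",))

# Query: who works in Support?
support_staff = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM staff WHERE dept = ? ORDER BY name", ("Support",)
    )
]

# Report: head count per department.
report = dict(conn.execute("SELECT dept, COUNT(*) FROM staff GROUP BY dept"))
print(support_staff, report)
```

Any of the DBMS products named on this slide would expose the same kinds of operations through its own interface; sqlite3 is used here only because it ships with Python.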
9 Why do we need a database?
- Keep records of our clients, staff, and volunteers
- Keep a record of activities
- Keep sales records
- Develop reports
- Perform querying
10 Data vs. Information
- What is data?
- Data is unprocessed information.
- What is information?
- Information is data that has been organized and
communicated in a logical and meaningful manner.
11 Purpose of a Database System / Stages of a Database
System
- Data is converted into information, and
information is converted into knowledge.
- Knowledge: information evaluated and organized so
that it can be used purposefully.
The purpose is to transform:
Data (unprocessed information)
-> Information (processed data)
-> Knowledge (information evaluated using measures)
-> Action (data analysis, future prediction)
12 Data Mining works with Warehouse Data
- Data warehousing provides the enterprise with a
memory.
- Data mining provides the enterprise with
intelligence.
13 Data Mining works with Warehouse Data
14 What is Data Mining?
- Nowadays, huge data sets have become available
due to advances in technology.
- As a result, there is increasing interest in
various scientific communities in exploring the use
of emerging data mining techniques for the
analysis of these large data sets.
- Data mining is the extraction of implicit,
previously unknown, and potentially useful
information, patterns, and associations from data.
- Data mining is the exploration and analysis, by
automatic or semi-automatic means, of large
quantities of data in order to discover
meaningful patterns.
15 WHO USES DATA MINING?
- Banking: future prediction
- Amazon.com (online stores): recommendation
- Facebook: predicting how active a user will be after 3
months
16 Data Mining Is
17 DATA MINING IS NOT
- Data warehousing
- SQL / ad hoc queries / reporting
- Online Analytical Processing (OLAP)
- Data visualization
DATA MINING IS
- Exploring data
- Finding patterns
- Performing prediction
18 KDD Process
- Knowledge discovery in databases (KDD) is a multi-step
process of finding useful information and
patterns in data.
- Data mining is the use of algorithms to extract
information and patterns derived by the KDD
process.
- Many texts treat KDD and data mining as the same
process, but it is also possible to think of data
mining as the discovery part of KDD.
19 Steps of KDD Process
20 STEPS OF KDD PROCESS
- 1. Selection
- Data extraction: obtaining data from
heterogeneous data sources such as databases, data
warehouses, the World Wide Web, or other information
repositories.
- 2. Preprocessing
- Data cleaning: incomplete, noisy, or
inconsistent data are cleaned. Missing data may
be ignored or predicted; erroneous data may be
deleted or corrected.
- 3. Transformation
- Data integration: combines data from
multiple sources into a coherent store. Data can
be encoded in common formats, normalized, or reduced.
21 Steps of KDD Process
- 4. Data mining
- Apply algorithms to the transformed data and
extract patterns.
- 5. Pattern interpretation/evaluation
- Pattern evaluation: evaluate the
interestingness of the resulting patterns, or apply
interestingness measures to filter the discovered
patterns.
- Knowledge presentation: present the mined
knowledge; visualization techniques can be used.
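The five KDD steps can be sketched as a chain of small functions. This is a toy sketch under stated assumptions: the two sources, their records, the pair-counting "algorithm", and the `min_support` threshold are all hypothetical choices made for illustration, not a method prescribed by the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a missing item list.
source_a = [{"id": 1, "items": ["Milk", "bread"]}, {"id": 2, "items": None}]
source_b = [{"id": 3, "items": ["milk", "Bread", "eggs"]}, {"id": 4, "items": ["eggs"]}]

def select(*sources):
    """1. Selection: gather records from several sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """2. Preprocessing: ignore records whose item list is missing."""
    return [r for r in records if r["items"]]

def transform(records):
    """3. Transformation: encode items in one common format (lowercase sets)."""
    return [frozenset(item.lower() for item in r["items"]) for r in records]

def mine(transactions):
    """4. Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts

def evaluate(pair_counts, n_transactions, min_support):
    """5. Evaluation: keep only pairs whose support meets the threshold."""
    return {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }

transactions = transform(preprocess(select(source_a, source_b)))
patterns = evaluate(mine(transactions), len(transactions), min_support=0.5)
print(patterns)
```

Knowledge presentation is reduced here to a `print` call; in practice a table or chart would take its place.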
22 Types of Data / What Kind of Data Can Be Mined
- Data mining should be applicable to any kind of
information repository. However, algorithms and
approaches may differ when applied to different
types of data.
- Relational databases
- Data warehouses
- Transaction databases
- Advanced DB systems and information repositories
- Spatial databases
- Time-series data
- Multimedia databases
- WWW
23 Relational Databases
- A relational database consists of a set of
tables containing either values of entity
attributes or values of attributes from entity
relationships.
- Tables have columns and rows, where columns
represent attributes and rows represent tuples.
- A tuple in a relational table corresponds to
either an object or a relationship between
objects and is identified by a set of attribute
values representing a unique key.
24 Data Warehouse
- A data warehouse, as a storehouse, is a repository
of data collected from multiple data sources
(often heterogeneous) and is intended to be used
as a whole under the same unified schema. A data
warehouse makes it possible to analyze data from
different sources under the same roof.
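A small sketch of what "the same unified schema" means in practice: rows from two sources with different field names and units are mapped into one common layout before analysis. Both sources, their field names, and the target schema (`product_id`, `amount_usd`, `source`) are hypothetical.

```python
# Two hypothetical sources describing sales with different field names and units.
store_sales = [{"sku": "A1", "amount_usd": 20.0}, {"sku": "B2", "amount_usd": 5.0}]
web_sales = [{"product": "A1", "amount_cents": 1500}]

def from_store(row):
    """Map a store row to the unified schema."""
    return {"product_id": row["sku"], "amount_usd": row["amount_usd"], "source": "store"}

def from_web(row):
    """Map a web row to the unified schema, converting cents to dollars."""
    return {
        "product_id": row["product"],
        "amount_usd": row["amount_cents"] / 100,
        "source": "web",
    }

# The "warehouse": all rows under one schema, regardless of origin.
warehouse = [from_store(r) for r in store_sales] + [from_web(r) for r in web_sales]

# Analysis across sources is now a single pass over uniform rows.
totals = {}
for row in warehouse:
    totals[row["product_id"]] = totals.get(row["product_id"], 0.0) + row["amount_usd"]
print(totals)
```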
25 Transaction Databases
- A transaction database is a set of records
representing transactions, each with a timestamp,
an identifier, and a set of items.
Descriptive data for the items could also be
associated with the transaction files.
- Transactions are usually stored in flat files or
in two normalized transaction tables, one
for the transactions and one for the transaction
items.
- Applications: airline reservation, railway
reservation, log records, etc.
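The two storage layouts mentioned above can be sketched as follows. The transaction identifiers, timestamps, and items are hypothetical; the point is only how one flat record splits into a transactions table and a transaction-items table, and how the item sets can be rebuilt from the latter.

```python
# Hypothetical flat-file transactions: (transaction id, timestamp, items).
flat = [
    ("T100", "2024-01-05 10:15", ["bread", "milk"]),
    ("T200", "2024-01-05 10:20", ["milk"]),
]

# Table 1: one row per transaction.
transactions = [(tid, ts) for tid, ts, _ in flat]

# Table 2: one row per (transaction, item) pair.
transaction_items = [(tid, item) for tid, _, items in flat for item in items]

# Rebuild each transaction's item set from the normalized item table.
baskets = {}
for tid, item in transaction_items:
    baskets.setdefault(tid, set()).add(item)
print(transactions, transaction_items, baskets)
```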
26 MULTIMEDIA DATABASE
- Multimedia databases include video, images,
audio, sound clips, and text data. They can be
stored in extended object-relational or
object-oriented databases, or simply on a file
system.
- Ex: digital music players, social media,
electronic publishing.
27 Spatial Databases
- A spatial database is a database that is enhanced
to store and access spatial data that defines a
geometric space.
- These data are often associated with geographic
locations and features, or constructed features
like cities. Data in spatial databases are stored
as coordinates, points, lines, polygons, and
topology.
- Ex: storing geographical information such as maps and
global or regional positioning.
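As an illustration of working with points and polygons, the sketch below tests whether a point lies inside a polygon using the standard ray-casting technique. The `city_boundary` rectangle and the test points are hypothetical, and a real spatial database would provide such queries itself.

```python
def point_in_polygon(point, polygon):
    """Ray casting: toggle 'inside' each time a rightward ray crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical city boundary stored as a polygon of (x, y) coordinates.
city_boundary = [(0, 0), (4, 0), (4, 3), (0, 3)]

print(point_in_polygon((1, 1), city_boundary))
print(point_in_polygon((5, 1), city_boundary))
```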
28 Time Series Database
- A time-series database is a database that
contains data for each point in time.
- Examples: weather data, stock market data,
browser activity logs, ocean tides.
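A time series can be represented as a list of (time point, value) pairs. The sketch below smooths such a series with a simple moving average; the temperature readings and the window size are hypothetical.

```python
# Hypothetical daily temperature readings, one value per time point.
readings = [
    ("2024-01-01", 20.0),
    ("2024-01-02", 22.0),
    ("2024-01-03", 27.0),
    ("2024-01-04", 23.0),
]

def moving_average(series, window):
    """Average each value with the (window - 1) values that precede it."""
    values = [value for _, value in series]
    return [
        (series[i][0], sum(values[i - window + 1 : i + 1]) / window)
        for i in range(window - 1, len(series))
    ]

smoothed = moving_average(readings, window=2)
print(smoothed)
```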
29 Time Series Database - Example
30 World Wide Web
- The World Wide Web is the most heterogeneous and
dynamic repository available.
- Data in the World Wide Web is organized in
interconnected documents. These documents can be
text, audio, video, raw data, and even
applications.
31 Typical Architecture of a Data Mining System
32 Integration of a Data Mining System with a
Database/Data Warehouse System
- The integration schemes are as follows:
- No coupling: In this scheme, the data mining
system does not utilize any of the database or
data warehouse functions. It fetches the data
directly from a particular source and processes
that data using some data mining algorithms. The
data mining result is stored in another file. (Ex:
data collected directly from a transactional
database.)
- Loose coupling / semi-tight coupling: In this
scheme, the data mining system may use some of
the functions of the database and data warehouse
system. It fetches the data from the data
repository managed by these systems and performs
data mining on that data, or fetches data directly from
particular sources. (Ex: data taken from a transactional
database or data warehouse.)
- Tight coupling: In this scheme, the data mining
system is smoothly integrated into the database
or data warehouse system. The data mining
subsystem is treated as one functional component
of an information system.
33 Integrated Architecture of Data Mining with a
DWH / An OLAM System Architecture
34 Data Mining Task Primitives
- We can specify a data mining task in the form of
a data mining query.
- This query is input to the system.
- A data mining query is defined in terms of data
mining task primitives.
- Note: These primitives allow us to communicate
in an interactive manner with the data mining
system. The data mining task primitives are:
- Kind of knowledge to be mined.
- Set of task-relevant data to be mined.
- Representation for visualizing the discovered
patterns.
- Background knowledge to be used in the discovery
process.
- Interestingness measures and thresholds for
pattern evaluation.
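One way to picture the five primitives is as the fields of a single query object. The sketch below is a hypothetical container, not an API of any data mining system; the class name `MiningQuery`, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """Hypothetical container holding the five task primitives."""
    kind_of_knowledge: str                      # e.g. characterization, clustering
    task_relevant_data: list                    # attributes or dimensions of interest
    presentation: str = "table"                 # rules, tables, charts, ...
    background_knowledge: dict = field(default_factory=dict)        # e.g. concept hierarchies
    interestingness_thresholds: dict = field(default_factory=dict)  # measure -> threshold

query = MiningQuery(
    kind_of_knowledge="characterization",
    task_relevant_data=["age", "item_type", "place_made"],
    presentation="table",
    background_knowledge={"address": ["city", "state", "country"]},
    interestingness_thresholds={"noise_threshold": 0.05},
)
print(query)
```

Keeping the primitives in one object makes it explicit that a mining task is incomplete until each of the five has either a value or a default.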
35 Data Mining Task Primitives - Example of a Data
Mining Query
use database AllElectronics_db
use state_location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID
  and P.method_paid = "AmEx" and B.address = "Canada"
  and I.price >= 100
with noise threshold = 5%
display as table
36 Data Mining Task Primitives (cont.)
- 1. Kind of knowledge to be mined
- This refers to the kind of functions to be
performed. These functions are:
- Characterization
- Association and correlation analysis
- Classification
- Prediction
- Clustering
- Outlier analysis
- 2. Set of task-relevant data to be mined
- This is the portion of the database in which the user
is interested. This portion includes the
following:
- Database attributes
- Data warehouse dimensions of interest
37 Data Mining Task Primitives (cont.)
- 3. Representation for visualizing the discovered
patterns
- This refers to the form in which discovered
patterns are to be displayed. These
representations may include the following:
- Rules
- Tables
- Charts
- Graphs
- Decision trees
- Cubes
38 Data Mining Task Primitives (cont.)
- 4. Background knowledge
- Background knowledge allows data to be mined
at multiple levels of abstraction. For example,
concept hierarchies are one form of background
knowledge that allows data to be mined at
multiple levels of abstraction.
- 5. Interestingness measures and thresholds for
pattern evaluation
- These are used to evaluate the patterns that are
discovered by the process of knowledge discovery.
There are different interestingness measures for
different kinds of knowledge.
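The last two primitives can be illustrated together. The sketch below uses a hypothetical city-to-country concept hierarchy to roll purchases up to a higher level of abstraction, then applies support and confidence as interestingness measures with thresholds. Support and confidence are chosen here as common measures for association rules; the slides do not name specific measures, and the data and threshold values are assumptions.

```python
# Hypothetical concept hierarchy: city -> country.
city_to_country = {"Chennai": "India", "Mumbai": "India", "Toronto": "Canada"}

# Hypothetical purchase records.
purchases = [
    {"city": "Chennai", "items": {"laptop", "mouse"}},
    {"city": "Mumbai", "items": {"laptop"}},
    {"city": "Toronto", "items": {"laptop", "mouse"}},
    {"city": "Toronto", "items": {"mouse"}},
]

# Background knowledge: roll up from city level to country level.
sales_by_country = {}
for p in purchases:
    country = city_to_country[p["city"]]
    sales_by_country[country] = sales_by_country.get(country, 0) + 1

def support(itemset):
    """Fraction of purchases that contain every item in the itemset."""
    return sum(itemset <= p["items"] for p in purchases) / len(purchases)

def confidence(antecedent, consequent):
    """Of the purchases containing the antecedent, the fraction also containing the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Interestingness: evaluate the rule laptop -> mouse against assumed thresholds.
rule_support = support({"laptop", "mouse"})
rule_confidence = confidence({"laptop"}, {"mouse"})
is_interesting = rule_support >= 0.5 and rule_confidence >= 0.6
print(sales_by_country, rule_support, rule_confidence, is_interesting)
```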
39 Classification of Data Mining Systems
40 Classification of Data Mining Systems (cont.)
- Data to be mined
- Relational, data warehouse, transactional,
stream, object-oriented/relational, active,
spatial, time-series, text, multimedia,
heterogeneous, legacy, WWW
- Knowledge to be mined
- Characterization, discrimination, association,
classification, clustering, trend/deviation,
outlier analysis, etc.
- Multiple/integrated functions and mining at
multiple levels
- Techniques utilized
- Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
- Applications adapted
- Retail, telecommunication, banking, fraud
analysis, bio-data mining, stock market analysis,
text mining, Web mining, etc.