Title: Unit Five The nature and sources of data
- Data: items about things, events, activities, and transactions that are recorded, classified, and stored but are not organized to convey any specific meaning.
- Data items can be numeric, alphanumeric, figures, sounds, or images.
- Information: data that have been organized in a manner that gives them meaning for the recipient.
- Information confirms things that the recipient knows, or may have surprise value by revealing something not known.
- Knowledge: consists of data items and/or information organized and processed to convey understanding, experience, accumulated learning, and expertise that are applicable to a current problem or activity.
- Knowledge can be the application of data and information in making decisions.
- Internal data
- Stored in one or more places; they are about people, products, services, and processes (for example, student data are stored in the university DB).
- External data
- Come from many sources: commercial DBs and data collected by sensors and satellites.
- Available on CD, DVD, and the Internet, and from statistical bureaus, banks, and chambers of commerce.
- Data collection problems and quality
- The need to collect data from internal and external sources makes building an MSS complicated.
- In some cases it is necessary to collect raw data in the field.
- In other cases it is necessary to get data from people or to find them on the Internet.
- Data must be validated and filtered.
- Methods for collecting raw data
- 1- Manually: observations, questionnaires, interviews, soliciting information from experts.
- 2- Sensors and scanners for biometrics.
- Data problems
- 1- Data are not correct: generated carelessly or entered inaccurately.
- 2- Data are not timely: the methods for generating data are not fast enough to meet the need for data.
- 3- Data are not measured properly: gathered inconsistently with the purposes of the analysis.
- 4- Needed data do not exist: no one ever stored the data needed now.
- Data quality
- Quality determines the usefulness of data as well as the quality of the decisions based on them.
- Data quality problems
- 1- Contextual DQ: relevancy, timeliness, completeness.
- 2- Intrinsic DQ: accuracy, objectivity, believability, reputation.
- 3- Accessibility DQ: access security.
- 4- Representation DQ: interpretability, ease of understanding, consistent representation.
Data Integrity
- Older filing systems may lack integrity. If a change is made in the file in one place, it may not be made in the file in another place or department, which results in conflicting data.
- Data integrity considers the following issues
- 1- Uniformity: during data capture, uniformity checks ensure that the data are within specified limits.
- 2- Version: version checks are performed when the data are transformed through the use of metadata, to ensure that the format of the original data has not been changed.
- 3- Completeness: a completeness check ensures that the summaries are correct and that all values needed to create the summary are included.
- 4- Conformity: a conformity check ensures that, during data analysis and reporting, correlations are run between the value reported and previous values for the same numbers.
- Sudden changes can indicate a basic change in the business, an analysis error, or bad data.
- 5- Genealogy: a genealogy check (drill down) traces back to the data source through its various transformations.
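The uniformity and conformity checks listed above can be illustrated with a minimal Python sketch. The function names, the sample values, the range limits, and the 50% change threshold are all assumptions made for illustration; in particular, the conformity check here compares a reported value with the mean of previous values rather than running a full correlation.

```python
def uniformity_check(values, low, high):
    """Return the values that fall outside the inclusive range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def conformity_check(reported, previous, max_change=0.5):
    """Return True when the reported value deviates from the mean of the
    previous values by more than max_change (relative change)."""
    baseline = sum(previous) / len(previous)
    change = abs(reported - baseline) / abs(baseline)
    return change > max_change


# Hypothetical captured ages; the limits 0..120 are an assumed business rule.
ages = [21, 34, -5, 47, 230]
out_of_range = uniformity_check(ages, low=0, high=120)

# Hypothetical monthly sales history and two newly reported values.
monthly_sales = [100.0, 110.0, 90.0]
sudden_jump = conformity_check(400.0, monthly_sales)
steady = conformity_check(105.0, monthly_sales)

print(out_of_range, sudden_jump, steady)
```

A flagged value is only a prompt for investigation: as the slide notes, a sudden change may reflect a real change in the business rather than bad data.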
Data Access and Integration
- How can data be reached in its storage area?
- Data can be accessed from sources such as relational database tables, XML documents, electronic data messages, COBOL records, and the Internet, which has thousands of databases all over the world accessible through the Web.
- Commercial data banks are online database services that sell access to specialized databases.
- They can add external data to an MSS in a timely manner and at a reasonable cost. An example is the GIS (geographic information system).
DBMS
- Supplements the standard operating system by allowing for greater integration of data, complex file structures, quick retrieval and changes, and better data security.
- It is a set of software programs for adding information to a DB and for updating, deleting, manipulating, storing, and retrieving information.
DB types
- 1- Relational 2- Hierarchical 3- Network
- 4- Object-oriented DB
- MSS applications may require access to complex data, which may include pictures that cannot be handled by the previous types.
- An object-oriented DB (OODB), which uses a graphical representation, may be used to handle pictures.
- It is based on OOP, combining characteristics of OOP (such as UML) with mechanisms for data storage and access.
- An OODBMS allows data to be analyzed at a conceptual level that emphasizes the natural relationships between objects, using encapsulation and inheritance.
- An OODBMS defines data as objects and encapsulates data along with their relevant structure and behavior.
- 5- Multimedia-based DB
- An MMDBMS manages data in a variety of formats in addition to text and numbers.
- Other formats include images such as digitized photographs, forms of bit-mapped graphics such as maps or .PIC files, hypertext images, video clips, sounds, and virtual reality (multidimensional images).
Data Warehousing
- A data warehouse is one or several databases that contain the information needed for tactical or strategic decisions. It is a collection of data designed to support decision making, and it contains data that present a coherent picture of business conditions at a single point in time.
- Data warehousing can be used for
- 1- Supporting decision making.
- 2- Analyzing large amounts of data from various sources to provide fast results that support critical processes.
- Organizations (public and private) continuously collect data, information, and knowledge and store them in computerized systems.
- As the amount of data increases
- 1- Updating, retrieving, using, and removing information becomes complicated.
- 2- The number of data users increases as a result of improved reliability and availability of network access.
- The warehouse gets data from external and internal sources, organized consistently with the organization's needs.
- The data WH has access to all information relevant to the organization, which can come from internal or external sources.
Data Warehouse characteristics
- 1- Subject-oriented
- Data are organized by detailed subjects (for example, customer or policy type in an insurance company).
- Data contain only information relevant for decision support.
- Enables users to determine how their business is performing and why it is performing that way.
- It provides a more comprehensive view of the organization than an operational DB, which is oriented toward products and handles transactions.
- 2- Integrated
- Data at different source locations may be encoded differently. For example, a person's gender may be encoded as 0 or 1 in one place and as F or M in others. In the data warehouse they are scrubbed (cleaned) into one format, which makes them standard and consistent.
- 3- Time variant (time series)
- Data do not provide the current status.
- Data are kept for several years and are used for trends, forecasting, and comparisons.
- Time is the one important dimension that all data warehouses must support.
- Data for analysis from multiple sources contain multiple time points (daily, weekly, and monthly views).
- 4- Nonvolatile
- Once entered into the warehouse, data are read-only; they cannot be changed or updated.
- Obsolete data are discarded, and changes are recorded as new data.
- 5- Summarized: operational data are aggregated into summaries.
- 6- Not normalized
- Data in a data warehouse are not normalized and are highly redundant.
- 7- Sources: all data are present, both internal and external.
- 8- Metadata: data about data are included in the data warehouse.
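The scrubbing step described under the "Integrated" characteristic (item 2 above) can be sketched in a few lines of Python. The mapping table is hypothetical: which of 0 and 1 stands for F or M depends on the source system, and the choice below is only an assumption for illustration.

```python
# Hypothetical mapping from source-specific codes to the single warehouse code.
# Assumption: in the numeric source system, 0 means female and 1 means male.
GENDER_CODES = {"0": "F", "1": "M", "F": "F", "M": "M"}


def scrub_gender(raw):
    """Normalize one raw gender code; return None for an unknown code."""
    key = str(raw).strip().upper()
    return GENDER_CODES.get(key)


source_a = [0, 1, 1]          # system that stores 0/1
source_b = ["f", "M ", "x"]   # system that stores letters, with some noise
cleaned = [scrub_gender(v) for v in source_a + source_b]

print(cleaned)
```

Unknown codes come back as `None` so that they can be reviewed instead of being silently loaded into the warehouse.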
Metadata
- Describes the structure of the data and some of their meaning, which affects their effective or ineffective use.
- The key to making users comfortable with the technology.
- Involves knowledge; capturing it and making it accessible throughout the organization has become an important success factor.
- With metadata and a metadata repository, organizations can improve their use of information and their application development processes.
- Businesses benefit from metadata as follows
- 1- Reduction of IT-related problems.
- 2- Increased system value to the business.
- 3- Improved business decision making.
- Business metadata comprise information that increases our understanding of traditional (structured) reported data.
- Their primary purpose is to provide context to the data, enriching information and leading to knowledge. Context does not have to be the same for all users.
- They assist in the conversion of data and information into knowledge.
- Data about data.
- Metadata describe how, when, and by whom a particular set of data was collected, and how the data are formatted. Metadata are essential for understanding information stored in data warehouses and have become increasingly important in XML-based Web applications.
Data Warehouse architecture and process
- Could be of one, two, or three layers.
- A DWH can be divided into three parts
- 1- The DWH itself, which contains the data and associated SW.
- 2- Data acquisition SW, which extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the DWH.
- 3- Client SW, which allows users to access and analyze data in the warehouse.
- The three-layer architecture contains
- 1- The operational systems, which contain the data and the SW for data acquisition, in one server (layer).
- 2- The DWH is another layer.
- 3- The third layer includes the decision support/business intelligence/business analytics engine and the client. The advantage is that it separates the functions of the data WH, eliminating resource constraints and making it possible to create data marts easily.
- In the two-layer architecture
- The DSS engine is on the same platform as the WH, which makes it more economical than the three-layer structure.
- Some issues to consider when selecting an architecture
- 1- Which DBMS to use? Most DWHs are built using a relational DBMS such as Oracle or SQL Server, which support client-server and Web-server architectures.
- 2- Will parallel processing or partitioning be utilized?
- Parallel processing enables multiple CPUs to process data WH query requests at the same time. Partitioning splits the DB tables into smaller ones to improve access efficiency.
- 3- Will data migration tools be used to load the DWH?
- 4- What tools will be used to support data retrieval and analysis?
Data Warehouse Development
- The process of migrating data to a DWH involves extracting data from all relevant sources.
- Data sources consist of the following
- 1- Files extracted from online transaction processing (OLTP) systems.
- 2- Spreadsheets.
- 3- Personal DBs (MS Access).
- 4- External files.
- A DWH contains a number of business rules that define the following
- 1- How the data will be used.
- 2- Summarization rules.
- 3- Standardization of encoded attributes.
- 4- Calculation rules.
- Data quality issues need to be corrected before the data are loaded into the DWH.
- One of the well-defined DWH benefits is that these rules can be stored in a metadata repository and applied to the DWH centrally.
- In OLTP, rules are scattered all over the system.
- The load process into the DWH can be performed either by
- 1- Data transformation tools, which provide a graphical user interface (GUI) to help in the development and maintenance of business rules.
- 2- Developing programs or utilities to load the data WH using programming languages such as PL/SQL, C, or .NET.
- Several issues affect whether to build a data transformation tool or buy one
- 1- The cost of the transformation tool.
- 2- Tools may take time to learn.
- 3- It is difficult to measure how the IT organization is doing until it has learned to use the tool.
- Benefits of transformation tools
- 1- Simplifying the maintenance of the organization's DWH.
- 2- Effective detection and scrubbing (removal) of bad data.
Star Schema
- DWH design is based on dimensional modeling.
- Dimensional modeling is a retrieval-based model; it supports a high volume of query access.
- The star schema is how dimensional modeling is implemented.
- A star schema contains a central fact table, which contains
- 1- The attributes needed to perform decision analysis.
- 2- Descriptive attributes used for query reporting.
- 3- Foreign keys to link to the dimension tables.
- Decision analysis attributes consist of
- A- Performance measures
- B- Operational metrics
- C- Aggregate measures
- D- Other metrics needed to analyze the organization's performance.
- The fact table addresses what the DWH supports for decision analysis.
- Dimension tables contain attributes that describe the data contained in the fact table.
- Dimension tables address how data will be analyzed.
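As a rough illustration of the fact table and dimension tables described above, the following Python sketch builds a tiny star schema in an in-memory SQLite database using the standard `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), the columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: attributes that describe the facts (how data are analyzed).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Central fact table: measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES
    (1, 'TV set', 'Electronics'),
    (2, 'DVD player', 'Electronics'),
    (3, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 2, 800.0),
    (2, 1, 5, 250.0),
    (3, 2, 1, 120.0),
    (1, 2, 1, 400.0);
""")

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate a measure by dimension attributes.
rows = conn.execute("""
SELECT p.category, d.month, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()

print(rows)
```

The measures (`units`, `revenue`) live only in the fact table, while the descriptive attributes used for grouping live in the dimension tables, which is what gives the schema its star shape.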
- The grain of a data WH defines the highest level of detail supported; the grain indicates whether the DWH is highly summarized or includes detailed transaction data.
- If the grain is defined too high, the WH may not support detailed requests to drill down into the data.
- Drill-down analysis is the process of probing beyond a summarized value to investigate each of the detail transactions that make up the summary.
- Low-level granularity results in more data being stored in the DWH.
- Larger amounts of detail may affect query performance, making response times longer.
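Drill-down only works when the grain keeps the detail rows. The sketch below, with hypothetical transaction records and helper names, summarizes by month and then drills down from one monthly total to the transactions behind it.

```python
from collections import defaultdict

# Hypothetical transaction-level (fine-grain) records: (month, order_id, amount).
transactions = [
    ("2024-01", "A100", 250.0),
    ("2024-01", "A101", 800.0),
    ("2024-02", "A102", 120.0),
]


def summarize(rows):
    """Aggregate amounts to the coarser monthly grain."""
    totals = defaultdict(float)
    for month, _order_id, amount in rows:
        totals[month] += amount
    return dict(totals)


def drill_down(rows, month):
    """Return the detail transactions that make up one monthly summary value."""
    return [row for row in rows if row[0] == month]


summary = summarize(transactions)
january_detail = drill_down(transactions, "2024-01")

print(summary)
print(january_detail)
```

If the warehouse stored only `summary`, the call to `drill_down` would have nothing to work on; that is the trade-off between a coarse grain (less storage, faster queries) and a fine grain (more detail).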
Implementing a Data Warehouse
- DWH projects can be identified as either data centric or application centric.
- Data-centric WH
- - Based on a data model that is independent of any application.
- - Designed to support a variety of user needs and applications.
- - Supports flexibility, since the organization's information needs constantly change. The more dynamic the business, the more its data needs will change.
- Application-centric WH
- - Designed to support a single initiative or a small set of initiatives.
- - Preferred for independent data mart development.
- - Provides a more focused scope, increasing the chance of success of the DWH implementation.
- Its disadvantage is that critical data needs may be missed during the initial development; therefore, multiple iterations are necessary.
- The factors that play a big role in the successful implementation of a DWH can be categorized into organizational issues, project issues, and technical issues. The factors are
- 1- Management support 2- Champion
- 3- Resources 4- User participation
- 5- Team skills 6- Source systems
- 7- Development technology
- Implementing a Web-based DWH (Webhousing) makes it easier to access large amounts of data, but it is difficult to determine the hard benefits of a DWH.
- Hard benefits: organizational benefits that can be expressed in monetary terms (organizations have priorities when it comes to money).
- A project champion helps ensure that the DWH project will receive the resources necessary for successful implementation.
- Resources can be costly: a DWH requires powerful processors and a large increase in direct-access storage devices, and a Web-based WH has special security requirements.
- User participation
- - Participation in data modeling and access modeling.
- During data modeling, expertise is required to determine the following
- 1- What data are needed?
- 2- What are the business rules for the data?
- 3- What aggregations and calculations are needed?
- Access modeling is needed to determine
- 1- How data are to be retrieved from the DWH.
- 2- The physical definition of the WH, by helping to determine which data need indexing.
- 3- Whether data marts are needed to facilitate information retrieval.
- Team skills require in-depth knowledge of DB technologies and development tools.
- Source systems and development technology refer to the many inputs and processes used to load and maintain the DWH.
Best practices for DWH implementation
- The project must fit with corporate strategy and business objectives.
- There must be complete buy-in to the project (executives, managers, users).
- Manage expectations.
- The DWH must be built incrementally.
- The project must be managed by both IT and business managers.
- Load only cleaned data of known quality.
- Do not overlook training requirements.
DWH Risks
- There are many risks in WH projects; they are serious because DWHs are large-scale and expensive projects. Some risks are
- The quality of the source data is not known.
- Skills are not in place.
- Inadequate budget.
- Lack of supporting SW.
- Weak sponsor or loss of sponsor.
- Users are not computer literate.
- Unrealistic user expectations.
- Key people may leave the project.
- Too much new technology.
- Team geography, language, and culture.
Mistakes to avoid in developing a successful DWH
- Be aware of the following problems
- 1- Setting expectations that you cannot meet.
- 2- Loading the WH with data just because they are available.
- 3- Choosing a technology-oriented DWH manager; DWH managers must be user oriented, not technology oriented.
- 4- Focusing on ad hoc data mining and periodic reporting instead of alerts.
- The natural progression in a DWH is
- 1- Extract data from legacy systems, clean them, and load them into the WH.
- 2- Support ad hoc reporting until you learn what people want.
- 3- Convert the ad hoc reports into regularly scheduled reports.
Massive data WH and scalability
- A DWH needs to be flexible and to support scalability.
- Scalability deals with
- - The amount of data in the WH.
- - How quickly the WH is expected to grow.
- - The number of concurrent users.
- - The complexity of user queries.
- A DWH grows as a function of data growth.
- The WH also needs to expand to support new business functions.
- Data size is measured in terabytes and petabytes; huge data sizes need powerful computers and smart indexing and searching methods.
User capabilities and benefits
- Users of a DWH are
- Managers, analysts, executives, administrative assistants, and professionals.
- A DWH solution should
- - Provide ready access to critical data.
- - Insulate operational DBs from ad hoc processing.
- - Provide high-level summary information and drill-down capabilities.
- These capabilities
- - Improve business knowledge.
- - Provide competitive advantage.
- - Enhance customer service and satisfaction.
- - Facilitate decision making.
- - Improve worker productivity.
- - Help streamline business processes.
Data Marts
- A subset of the DWH.
- Consists of a single subject area (marketing, sales, operations, etc.).
- It can be either dependent or independent.
- Dependent data marts
- A subset created directly from the data WH.
- Advantages
- - Use a consistent data model.
- - Provide quality data.
- - Support the concept of a single enterprise-wide data model.
- - Ensure that the end user is viewing the same version of the data that is accessible to all other DWH users.
- Limitation: because of the high cost of a DWH, their use is limited to large organizations.
- Independent data marts
- A lower-cost, scaled-down version of a DWH.
- Designed for strategic business units (departments).
- Advantages of data marts
- - Low cost.
- - Shorter time to implement.
- - Controlled locally.
- - Contain less information, which decreases response time.
- - Allow the building of a private DSS without relying on a centralized IS department.
Business Intelligence
- Describes the basic components of a business intelligence environment, ranging from business process modeling and data modeling to business rules systems, data profiling, information compliance and data quality, data WH, and data mining.
- It involves acquiring data and information from a wide variety of sources and utilizing them in decision making.
- Business analytics deals with models and solution methods.
- Business intelligence methods
- - Provide charts and graphs of multidimensional data.
- - Access data from the DWH and bring them to a local DB system, as OLAP methods do.
- OLAP methods allow users to slice and dice data and to observe graphs and tables that reflect the dimension being observed.
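"Slice and dice" can be shown on a toy data cube. The sketch below is a minimal plain-Python illustration, not how an OLAP product is implemented; the cube contents, the dimension names, and the helper functions `slice_cube` and `dice_cube` are hypothetical.

```python
# Hypothetical cube: units sold, keyed by (region, product, quarter).
cube = {
    ("East", "TV", "Q1"): 10,
    ("East", "DVD", "Q1"): 4,
    ("West", "TV", "Q1"): 7,
    ("West", "TV", "Q2"): 9,
    ("East", "DVD", "Q2"): 6,
}
DIMENSIONS = ("region", "product", "quarter")


def slice_cube(cells, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return {key: v for key, v in cells.items() if key[i] == value}


def dice_cube(cells, **allowed):
    """Dice: keep cells whose coordinates fall inside the allowed value sets."""
    def keep(key):
        return all(key[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    return {key: v for key, v in cells.items() if keep(key)}


q1 = slice_cube(cube, "quarter", "Q1")
east_q2 = dice_cube(cube, region={"East"}, quarter={"Q2"})

print(q1)
print(east_q2)
```

A slice fixes a single dimension (here, the quarter), while a dice restricts several dimensions at once to produce a smaller sub-cube.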
Data Mining Methods
- Apply statistical and deterministic models and AI methods to data to identify hidden relationships or discover knowledge among various data or text elements.
- Dashboards
- - Provide managers with exactly the information they need, in the correct format, at the correct time.
- - The foundation of dashboards is business intelligence.
- Provide real-time views of data (daily, weekly, monthly).
- Dashboard is also the name of a user-interface feature Apple
introduced with the release of Mac OS X 10.4
Tiger. It allows access to all kinds of "widgets"
that show the time, weather, stock prices, phone
numbers, and other useful data. With the Tiger
operating system, Apple included widgets that do
all these things, plus a calculator, language
translator, dictionary, address book, calendar,
unit converter, and iTunes controller. Besides
the bundled widgets, there are also hundreds of
other widgets available from third parties that
allow users to play games, check traffic
conditions, and view sports scores, just to name
a few.
Online Analytical Processing (OLAP)
- IT has concentrated on building mission-critical systems that
- - Support the organization's transaction processing.
- - Are fault tolerant.
- - Provide fast response.
- OLTP (online transaction processing) is an example of such systems.
- It concentrates on a distributed relational DB environment.
- OLAP refers to a variety of activities usually performed by end users in online systems, such as
- - Generating queries.
- - Requesting ad hoc reports and graphs.
- - Conducting statistical analyses and building DSS and multimedia applications.
- In OLAP, users ask specific, open-ended questions.
- OLAP tools
- - Query tools, spreadsheets, data mining tools, data visualization tools.
OLAP Tools
- Four types of processing are performed by analysts within an organization
- 1- Categorical analysis is static analysis based on historical data. It is based on the premise that past performance is an indicator of the future.
- 2- Exegetical analysis is also based on historical data, with the added ability to drill down (to query further into the data to determine the detail data used to determine a derived value).
- 3- Contemplative analysis allows a user to change a single value to determine its impact.
- 4- Formulaic analysis permits changes to multiple variables.
Data Mining
- A term used to describe knowledge discovery in databases.
- It is a process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases.
- It is the process of finding mathematical patterns in usually large sets of data. Patterns can be rules, affinities, correlations, trends, or prediction models.
- Used when relationships between system variables cannot be expressed mathematically and modeling is not possible.
Data Mining Activities
- 1- Knowledge extraction
- 2- Data archaeology
- 3- Data exploration
- 4- Data pattern processing
- 5- Data dredging
- 6- Information harvesting
DM characteristics and objectives
- 1- Data are often buried deep within very large DBs, which sometimes contain data from several years; the data are cleaned and consolidated in a DWH.
- 2- The DM environment is usually a client-server or Web-based architecture.
- 3- New tools help to extract the information "ore" buried in files or archived records. Finding it requires synchronizing the data to get the right results.
- 4- The miner is an end user who uses powerful query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill.
- 5- DM tools are combined with spreadsheets, which makes analyzing and processing data easy and quick.
- Because of the large amounts of data and the massive search efforts, it is necessary to use parallel processing for DM.
How DM works
- Intelligent DM discovers information within a DWH that queries and reports cannot effectively reveal.
- DM tools find patterns in data and may infer rules from them.
- Ways to identify patterns in data
- 1- Simple models: SQL-based queries, OLAP, human judgment.
- 2- Intermediate models: regression, decision trees, clustering.
- 3- Complex models: neural networks.
Data Mining application classes
- Each class is supported by a set of algorithmic approaches to extract the relevant relationships in the data. These classes are
- 1- Classification: infers the defining characteristics of a certain group. Example: customers who have been lost to competitors.
- 2- Clustering: identifies groups of items that share certain characteristics (no predefined characteristics are given). Example: classes of customers with certain needs to be met.
- 3- Association: identifies relationships between events that occur at one time. Example: which products sell together, and to what degree.
- 4- Sequencing: identifies relationships between events that occur over a period of time. Example: repeat visits to a supermarket.
- 5- Regression: used to map data to a prediction value using linear and nonlinear techniques. It is a form of estimation.
- 6- Forecasting: estimates future values based on patterns within large sets of data.
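The association class (item 3 above) is commonly quantified with two measures, support and confidence. The following is a minimal sketch on hypothetical shopping baskets; the basket contents and function names are assumptions for illustration only.

```python
# Hypothetical market baskets: each set holds the items bought in one visit.
baskets = [
    {"TV", "DVD player"},
    {"TV", "DVD player", "cables"},
    {"TV"},
    {"DVD player", "cables"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


tv_and_dvd = support({"TV", "DVD player"}, baskets)
tv_implies_dvd = confidence({"TV"}, {"DVD player"}, baskets)

print(tv_and_dvd, tv_implies_dvd)
```

Support answers "how often do these products sell together?", and confidence answers "to what degree does buying one go with buying the other?", which matches the wording of the example in the list.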
- DM can be either hypothesis driven or discovery driven.
- Hypothesis-driven data mining begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition. For example, a marketing manager may begin with the proposition "Are DVD player sales related to TV set sales?"
- Discovery-driven data mining finds patterns and relationships among the data. It can uncover facts that were unknown to the organization.
- DM uses many tools to discover patterns and relationships in data in order to make accurate predictions.
- Steps of data mining
- 1- Define the business problem.
- 2- Build (find or acquire) the DM database.
- 3- Explore the data.
- 4- Prepare the data for modeling.
- 5- Build or find models.
- 6- Evaluate the models.
- 7- Act on the results.
Data Mining misconceptions
- The results of DM can increase revenues, decrease costs, identify fraud, identify opportunities, and offer new competitive advantage. But there are a number of misconceptions about DM
- 1- DM provides instant, crystal-ball predictions. In fact, it is a multistep process that requires deliberate, proactive design and use.
- 2- DM is not available for business applications. In fact, it is ready to go for any business.
- 3- DM requires a separate, dedicated database. In fact, a dedicated DB is not required, although it is desirable.
- 4- DM requires professionals. In fact, any trained user can use it.
- 5- DM is only for large firms with huge amounts of data. In fact, if the data are accurate, DM can be used by any firm regardless of its size.
Data Mining Tools and techniques
- 1- Statistical methods: these include linear and nonlinear regression and point estimation.
- 2- Decision trees: used in classification and clustering methods.
- 3- Genetic algorithms: work on the principle of expansion of possible outcomes. Given a fixed number of possible outcomes, they seek to define new and better solutions.
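As a small illustration of the statistical methods in item 1 above, the sketch below fits a straight line by ordinary least squares in plain Python. The data points and variable names are hypothetical, and real DM tools would normally rely on a statistics library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: advertising spend versus sales.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # estimate sales for a spend of 5.0

print(slope, intercept, predicted)
```

The fitted line maps an input value to a predicted value, which is the sense in which regression is described earlier as a form of estimation.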
Text Mining
- It is the application of DM to nonstructured text files. DM takes advantage of the infrastructure of stored data to extract additional useful information. For example, by data mining a customer DB, an analyst may find that customers who buy product A also buy products B and C, but six months later.
- Text mining helps organizations to
- 1- Find the hidden content of documents and additional useful information.
- 2- Relate documents across previously unnoticed divisions. Example: finding that customers in two different product divisions have the same characteristics.
- 3- Group documents by common subjects. Example: all customers of an insurance firm who have the same complaints and cancel their policies.
Data Visualization
- It refers to technologies that support the visualization and interpretation of data and information at several points.
- It includes digital images, geographic information systems, graphical user interfaces, virtual reality, tables and graphs, multidimensional views, and animations.