Title: Metadata Management in Data Warehousing
1Metadata Management in Data Warehousing
- Prof. Hwan-Seung Yong
- Dept. of CSE, Ewha Womans Univ.
- http//dblab.ewha.ac.kr/hsyong
2Data Environment
- Data is becoming an abundant commodity
- we can get it anywhere and everywhere
- ???? ???? ??? ??? ????? ?? ??/?? ??
- ??? ???? ?? ??? ?!
- like trying to find one specific grain of sand on
a stretch of beach - Good business practice
- First, acquiring quality raw data
- second, combining and integrating the data to
make useful information, and then - analyzing the information and making high quality
decisions - The torrent of raw data has added more choice,
and therefore complexity, to the process. - What we need to do is put the data in context,
give the data meaning, relevance, and purpose,
and make it complete and accurate
3Data From Everywhere
- Corporations are coupling together in webs of
suppliers, partners, and customers, exchanging a
myriad of information through a spectrum of
technologies such as - Electronic Document Interchange (EDI) systems,
- Electronic Funds Transfer (EFT) systems,
- email,
- a host of other data acquisition and networking
applications - existing legacy systems within enterprises
continue to generate - data on orders, sales, revenues, employee
information, manufacturing schedules, inventory,
fleet status, and every other parameter
imaginable.
4What we know about the data generated by these
systems
- Each department, division, group, branch, section
or any other subdivision is today capable of
generating its own unique caches of data. - data in one group can have a different meaning in
another group in the same organization - This disparate data is exacerbated by
- readily available CASE tools,
- rapid application development tools,
- application and code generators,
- underutilized data models and definitions,
- database products,
- spreadsheets,
- other client friendly products,
- a lack of leadership in management.
5Approach to solve this disparity
- The concept of a data warehouse has emerged as a
technology by which management can get a single
comprehensive view of the state of the
organization. - a requirement to under-stand and manage the
properties of the data - Metadata attributes about the data, or data
about data. - data that sits behind the operational data,
- describes its origin, meaning, derivation, etc
- ?) What is gross sales?
- Dollars or French Francs,
- quarterly or annualized,
- what system does it come from,
- when is it extracted, etc.?)
- ??? ???? ?? ?? (? ??? ?? ?? ??) ?? ??? ????? ????
?????? ?! - range from a conceptual overview of the real
world to detailed physical specifications for a
particular database management system.
6What is Metadata
- 'chain of abstraction concept
- Ex) an actual chair is a pretty concrete item.
- Any specific chair has a host of characteristics
- it is made of wood,
- it has two arms,
- it is a particular shape,
- it has two cushions, and so on
- one step in the chain of abstraction
- Software is abstraction piled on abstraction
- LET BALANCE BALANCE CREDIT
- The source code statement in turn, is an
abstraction of a banking transaction. - Metadata is an abstraction from data.
- It is high-level data that describes lower-level
data.
7What is Metadata
- Software is full of metadata for example,
- Record descriptions in a COBOL program.
- Logical view descriptions in a data server's
catalog. - SQL Create statements.
- Entity-relationship diagrams in a CASE tool's
repository. - METADATA IS THE KEY TO DATA
- Metadata is instrumental in transforming raw data
into knowledge. - ?1) a given stream of bits can be interpreted as
- a customer's address,
- part of a photographic image,
- a code fragment in a given computer's machine
language. - ?2) In Windows 9x, file name extension is another
metadata - Should the metadata get mixed up or out of
alignment, none of the data will make sense.
8Metadata Status in an Organization
- rushed implementation
- little thought given to coordinating data
elements with other groups - so the problem continues
- the metadata situation is worse than the data
situation - the disparate data arriving from multiple sources
from within and outside the corporation, - multiple tools creating metadata in a variety of
formats. - Most companies
- want to rapidly implement data warehousing
systems - then discover themselves in a metadata dilemma.
9Metadata Management Historical Approaches
- Early in the days of Information Technology
- all data was defined and maintained within the
computer program itself. - Because the files' metadata was embedded in the
application's data definitions, no application
could make sense of another application's files - There was no need to share data between systems
- In the late 1960's and early '70s, when databases
were introduced - one of the greatest advantages they brought was
that the metadata was stored in the database
catalog - data "dictionaries" that were shared by programs
- early implementers of data warehouses
- the idea of a metadata repository was created,
similar to a data dictionary - The awareness of the need to manage metadata has
been an offshoot of the growth of data
warehousing.
10Growth of Data Warehousing
- Growth across platforms.
- spans from Microsoft NT, through the UNIX domain,
and on up into the giants such as NCR, DEC, and
IBM. - Growth across tools.
- Success has many parents, and failure is an
orphan. - From legacy data extraction tools to maintenance
and scheduling tools, and yes, even tools that
purported to handle metadata, vendors and
products proliferated. - Growth across departments.
- Success also breeds demand, which is exactly what
happens when department A gets a new data mart
and starts showing colleagues in department B how
easily they can now access consistent data. Soon,
department B has its own data mart, and
departments C, D, and E are not far behind.
11Metadata Management Problem in DW
- the subject of metadata management remained
fractured and dispersed - metadata is spread across different components of
the warehouse, - in the data extraction/cleansing tools
- in the loading tools,
- in the OLAP tools, which need to present metadata
to users in order to navigate. - Business rules are separated from technical
metadata, - they are kept by different systems in different
formats with different user interfaces. - In the case of multiple data marts spread across
the enterprise, - if there is a need for a user in department A to
use a data mart created by department B, then in
most cases, that user has to relearn the metadata
navigation for that system.
12METADATA USED IN THE DATA WAREHOUSE
- Results from the Data Warehouse
- 1. 739516 13238350 426615 800441 4912313
- 2. "An Inverness Group report dated 4/7/95 states
that the European market for repository tools
expanded by 33 in 1994". - 3. "Leading gadget vendors Protz Group 48,
Harris Goods 29, Zymurgy Inc. 13" - How we can interpret this?
- In the first case, The numbers may be
- a firm's sales for different regions,
- population of towns,
- the number of hairs on certain individuals'
heads. - a sequence of machine code for some computer.
- In the second case, seems to be self-describing
- the date is ambiguous. Does it mean '4th July
1995' (British convention) or '7th April 1995'
(US convention)?
13METADATA USED IN THE DATA WAREHOUSE
- Results from the Data Warehouse
- 3. "Leading gadget vendors Protz Group 48,
Harris Goods 29, Zymurgy Inc. 13" - The third example contains some metadata, but not
enough. - Stipulating that 'gadget' has a precise meaning
to someone in the gadget industry, we still do
not know whether the market referred to is US,
world, European, - what time period is referred to
- how the data was collected or even the source of
the information. - Under the circumstances, this example should be
filed in the data warehouse with a metadata - note that says 'source unknown, reliability
unknown, no further details'.
14METADATA USEDIN THE DATA WAREHOUSE
- In the operational environment,
- metadata is mostly valuable to software
developers and database administrators. - Operational databases are only accessed by
transaction processing applications. - The end-users - people like bank clerks, travel
agents and hospital staff - - do not need to know how information is held in
the database. - simply interact with the forms and screens
provided by the applications they use in the
course of their everyday work. - The decision support environment
- data analysts and executives are looking for
useful facts and correlations - they need to understand its structure and
meaning. - ?) A driver with a complete set of large-scale
and small-scale maps - A data warehouse without adequate metadata is
like a filing cabinet stuffed with papers, but
without any folders or labels.
15METADATA REQUIREMENTS
- What tables, attributes, and keys does the data
warehouse contain? - Where did each set of data come from?
- What transformation logic was applied in loading
the data? - How has the metadata changed over time?
- What aliases exist and how are they related to
each other? - What are the cross-references between technical
and business terms? - For instance, the field name XVT-351J presumably
meant something to a COBOL programmer in 1965,
but what does it mean today?) - How often does the data get reloaded?
- How much data is there?
- This helps end-users to avoid submitting
unrealistic queries. Given some means of
determining the size of tables, staff can tell
the end-users 'You can do anything you like with
15,000 rows, but if it turns out to be 15 million
rows - back off and ask for help!'
16METADATA COMPONENTS
- The mapping information records how data from
operational sources is transformed on its way
into the warehouse. - Identification of source fields.
- Simple attribute-to-attribute mapping.
- Attribute conversions.
- Physical characteristic conversions.
- Encoding/reference table conversions.
- Naming changes.
- Key changes.
- Defaults.
- Logic to choose from among multiple sources.
- Algorithmic changes.
17METADATA COMPONENTS (Contd)
- EXTRACT HISTORY
- Whenever historical information is analysed,
meticulous update records have to be kept. - The metadata history is a good place to start any
time-based report. - MISCELLANEOUS
- Aliases can make the warehouse much more
use-friendly - They also come in useful when different
departments wish to use their own names to refer
to the same underlying data. - Status information can be used to keep
information for instance, tables might be
classified 'in-design', 'in-test', 'inactive' or
'active'. - Often, parts of the same data warehouse may be in
different stages of development - Volumetrics could usefully include such
information as number of rows, growth rate, usage
characteristics, indexing, and byte
specifications. - T he criteria and timescales for purging old
data.
18METADATA COMPONENTS (Contd)
- SUMMARISATION ALGORITHMS
- The algorithms for summarising the detail data
- RELATIONSHIP ARTIFACTS AND HISTORY
- Data warehouses implement relationships in a
different way from production databases. - Metadata pertaining to related tables,
constraints, and cardinality with text
descriptions and ownership records. - OWNERSHIP/STEWARDSHIP
- to identify the originator of each set of data,
so that inquiries and corrections can be made to
the proper group. - It is useful to distinguish between 'ownership'
of data in the operational environment and
'stewardship' in the data warehouse. - ACCESS PATTERNS
- patterns of access to the warehouse to optimize
and tune performance. - Less frequently used data can be migrated to
cheaper storage media,
19METADATA COMPONENTS (Contd)
- REFERENCE TABLES/ENCODED DATA
- Reference data is data that is stored in an
external table and contains commonly used
translations of values. - To recover the original unencoded data, together
with 'effective from' and 'effective to' dates. - DATA MODEL - DESIGN REFERENCE
- Metadata describing the mapping between the data
model and the physical design - This allows an ambiguities or uncertainties to be
resolved.
20A Repository
- the vehicle of metadata where information
(metadata) about an organization's information
systems components (objects, table definitions,
fields, business rules and so on) is held. - also contains tools to facilitate the
manipulation and query of the metadata. - a repository can
- aid in the integration of the views of disparate
systems by helping understand how the data used
by those systems are related - support rapid change and assistance in building
systems quickly by impact analysis and provision
of standardized data - facilitate reuse by using object concepts and
central accessibility - assist in implementation for data warehousing
- support software development teams.
21Requirements for Enterprise Repository
- Nonproprietary Relational Database Management
System - should use an industry standard DBMS which
provides significant advantages over
vendor-developed DBMSs. - advanced tools and utilities for database
management (such as backups and performance
tuning) - maintainablity and accessibility are enhanced by
an "open" system. - Fully Extensible Meta Model
- a complete self-defining, extensible repository
based on a common entity/relationship diagram. - The repository should support the following meta
model extensions - adding or modifying an entity type,
- adding or modifying a linkage between entity
types (associations or relationships), - adding user views to entities or relationships,
- adding, deleting, or modifying attributes of
relationships or entities, - modifying the list of allowable values for an
attribute type, - adding or modifying commands or user exits,
22Requirements for Enterprise Repository(Contd)
- An API access to the repository
- can provide with the flexibility needed to create
a metadata management system which suits their
unique needs. - Central Point of Metadata Control
- The repository serves as a central point of
control for data - a single place of record about information assets
across the enterprise. - documents where the data is located,
- who created and maintains the data,
- what application processes it drives,
- what relationship it has with other data,
- how it should be translated and transformed.
- Impact Analysis Capability
- to provide the total impact of any change.
23Requirements for Enterprise Repository (Contd)
- Naming Standards Flexibility
- should provide a detailed map of data definitions
and elements, - allowing an organization to evaluate redundant
definitions and elements - decide which ones should be eliminated,
translated, or converted. - By enforcing naming standards,
- assists in reducing data redundancies and
increasing data sharing, - making the application development process more
efficient and therefore less costly. - Versioning Capabilities
- "versioning" can have many different definitions.
- version control as in test vs. production
(lifecycle phasing) - versions as unique occurrences
- versioning by department or business unit and
- version by aggregate or workstation ID.
- facilitate the application lifecycle development
process by allowing developers to work with the
same object concurrently.
24Requirements for Enterprise Repository (Contd)
- Robust Query and Reporting
- The end user tool should seamlessly pass queries
to its own tool or third party products for
automatic query generation and execution. - business users should be able to create detailed
reports from these tools, - Data Warehousing Support
- The repository provides information about the
location and nature of operational data which is
critical in the construction of a data warehouse.
- It acts as a guide to the warehouse data, storing
information necessary to define the migration
environment, mappings of sources to targets,
translation requirements, business rules, and
selection criteria to build the warehouse
25References
- Putting Metadata to Work in the Warehouse,
http//www.cai.com/products/platinum/wp/wp_meta.ht
m - ComputerWire Data Warehousing Tools Bulletin
Briefing Paper What is Metadata,
http//www.computerwire.com/bulletinsuk/212e_1a6.h
tm