Title: Data and Databases
1Data and Databases
2The Data Basics
- Data
- facts concerning things such as people, objects,
or events
- Information
- data that have been processed and presented in a
form suitable for human interpretation
- Database
- a collection of interrelated, shared, and
controlled data
3Drawbacks of the Traditional Database System
- Data Redundancy
- Program-Data Dependence
- Inflexibility
- Poor Data Security
- Lack of Data Sharing
- Lack of Data Standards
4Modern Database Systems
[Figure: Accounting, Finance, and Sales application
programs all reach one Integrated Database through
the DBMS]
5Advantages of Modern Database Environments
- Minimal data redundancy
- Data consistency
- Integration of data
- Data sharing
- Ease of application development
- Security, privacy, and integrity controls
- Data accessibility and responsiveness
- Data independence
- Reduced program maintenance
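A minimal sketch of the "data independence" and "reduced program maintenance" points, using Python's built-in sqlite3 module as a stand-in DBMS. The table, column names, and the `sales_report` function are hypothetical, chosen only for illustration. The idea shown: a program that asks the DBMS for named columns keeps working when the stored structure grows.

```python
import sqlite3

# An in-memory database stands in for the integrated database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (name) VALUES ('Ada')")

def sales_report(c):
    """A hypothetical application program: it names the columns it needs."""
    return c.execute("SELECT name FROM customer ORDER BY id").fetchall()

before = sales_report(conn)

# Another department adds a column; the report program is not edited.
conn.execute("ALTER TABLE customer ADD COLUMN credit_limit REAL")

after = sales_report(conn)
print(before == after)
```

In a file-based system the record layout would typically be coded into each program, so the same change would mean editing every program that reads the file.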
6Drawbacks of Modern Database Environments
- Need for new specialized personnel
- Need for explicit backup
- because of minimal data redundancy
- Interference with shared data
- concurrent access is a problem
- Organizational conflict
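The "interference with shared data" drawback can be illustrated with a deterministic replay of a lost update. This is only a sketch: the two "programs" and the balance figures are made up, and a real DBMS would use locking or transactions to prevent this interleaving.

```python
# Shared record that two hypothetical programs update concurrently.
shared = {"balance": 100}

# Both programs read the same starting value before either one writes.
read_by_a = shared["balance"]
read_by_b = shared["balance"]

shared["balance"] = read_by_a + 50   # program A records a deposit of 50
shared["balance"] = read_by_b - 30   # program B records a withdrawal of 30,
                                     # silently overwriting A's update

lost_update_result = shared["balance"]
serial_result = 100 + 50 - 30        # what one-after-the-other would give

print(lost_update_result, serial_result)
```

The two results differ because B's write is based on a value that A has already changed.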
7Data Storage
8Data Representation
- Binary digit (bit)
- String of bits (Byte)
- EBCDIC vs. ASCII
- Picture Element (Pixel)
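As a small illustration of bits, bytes, and the EBCDIC vs. ASCII point, the sketch below encodes the same character under both schemes. It assumes Python's "cp037" codec as the EBCDIC example; that is one EBCDIC code page among several, not the only one.

```python
# Encode one character under two character codes and show the bits.
text = "A"

ascii_bytes = text.encode("ascii")    # ASCII
ebcdic_bytes = text.encode("cp037")   # one EBCDIC code page (assumption)

def to_bits(data: bytes) -> str:
    """Render each byte as a string of 8 bits."""
    return " ".join(format(b, "08b") for b in data)

print("ASCII :", ascii_bytes.hex(), to_bits(ascii_bytes))
print("EBCDIC:", ebcdic_bytes.hex(), to_bits(ebcdic_bytes))
```

The same letter is stored as a different byte under each code, which is why data moved between systems may need conversion.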
9Data Storage
- In the Web era, data is piling up quickly; space
is at a premium
- Storage solutions
- Server-hosted storage
- SCSI Arrays
- Network Attached Storage (NAS)
- Storage Area Networks (SAN)
10Server-hosted storage
- Both applications and storage on same server
- Advantage
- Server, OS, and storage all from the same vendor
- Easy to replicate
- Disadvantages
- Expansion limited by server architecture (may
need to replace existing media)
- Free space on one server not easily accessed by
another server
- Maintenance affects server and storage (CPUs
become obsolete before storage)
11SCSI Arrays: Small Computer System Interface
(pronounced "scuzzy")
- SCSI interfaces allow for faster data
transmission than traditional serial and
parallel ports.
- In a survey by InfoWorld, 67% were using SCSI
arrays for storage
- Often used with RAID
- Advantages
- Embedded computer to manage configuration and
monitor performance
- Can be made fault-tolerant
- SCSI cable offers good throughput
- Disadvantages
- Expansion difficult once space is used
- Significant costs of layout (SCSI cable limited
in distance)
12Network Attached Storage (NAS)
- A server that is dedicated to nothing more than
file sharing.
- Devices can be plugged into LAN using standard
network cables and accessed by client PCs via a
NAS gateway
- Advantages
- Easiest and cheapest
- Pre-configured with OS tailored for data handling
- Can be few GB to several TB
- Easy to connect
- Faulty components can be changed without downtime
- Disadvantages
- Adds burden to LAN traffic
- Access speed limited by bandwidth
- Each NAS device has to be managed independently
13Storage Area Networks (SAN)
- Dedicated network of servers and storage devices
- Uses hubs and switches
- No limit to number of storage servers
- Uses fiber: can extend long distances, good
bandwidth
- Easy to set up, but needs special adaptors
- Works with any OS
- Easy migration from old systems
14Storage Virtualization
- pooling of physical storage from multiple network
storage devices into what appears to be a single
storage device that is managed from a central
console. (source: whatis.com)
- For more information, visit http://www.storage.com
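The pooling idea can be sketched in a few lines. Everything here is hypothetical (the `Device` and `StoragePool` names, the capacities, and the first-fit placement rule); real virtualization products work at the block or file level and are far more involved.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    capacity_gb: int
    used_gb: int = 0

class StoragePool:
    """Presents several physical devices as one logical store."""

    def __init__(self, devices):
        self.devices = list(devices)

    @property
    def capacity_gb(self):
        return sum(d.capacity_gb for d in self.devices)

    @property
    def free_gb(self):
        return sum(d.capacity_gb - d.used_gb for d in self.devices)

    def allocate(self, size_gb):
        """Fill devices in order, spilling over to the next when one is full."""
        if size_gb > self.free_gb:
            raise ValueError("pool is full")
        placements = []
        remaining = size_gb
        for d in self.devices:
            take = min(remaining, d.capacity_gb - d.used_gb)
            if take > 0:
                d.used_gb += take
                placements.append((d.name, take))
                remaining -= take
            if remaining == 0:
                break
        return placements

pool = StoragePool([Device("nas-1", 100), Device("nas-2", 200)])
placements = pool.allocate(150)   # the caller never picks a device
print(pool.capacity_gb, pool.free_gb, placements)
```

The caller sees a single capacity figure and a single `allocate` call, while the pool decides where the space physically lives.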
15Data Warehouses and Data Mining
16Data Requirements
- Organizations need access to
- operational data
- historical data
- legacy data
- subscription databases
- internet data
- Organizations need to
- combine data, slice and dice, do complex
analysis...
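A rough sketch of "combine data, slice and dice", using plain Python lists. The rows, field names, and the `total_by` helper are made up for illustration. Here "slice" means fixing one dimension and "dice" means restricting several dimensions at once.

```python
from collections import defaultdict

# Made-up rows standing in for two sources an organization might combine.
operational = [
    {"year": 2024, "region": "East", "product": "desk", "amount": 200.0},
    {"year": 2024, "region": "West", "product": "lamp", "amount": 40.0},
]
historical = [
    {"year": 2023, "region": "East", "product": "desk", "amount": 180.0},
    {"year": 2023, "region": "East", "product": "lamp", "amount": 30.0},
]
combined = operational + historical

def total_by(rows, *dimensions):
    """Sum 'amount' for each combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)

# Slice: fix one dimension (region = East).
east = [r for r in combined if r["region"] == "East"]
east_by_year = total_by(east, "year")

# Dice: restrict two dimensions (region = East, product = desk).
east_desks = [r for r in east if r["product"] == "desk"]
east_desks_by_year = total_by(east_desks, "year")

print(east_by_year)
print(east_desks_by_year)
```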
17Data Warehouses
- Aimed at supporting all levels of analysis and
information formats
- DSS have existed for many years
- Labeled "data warehouse" in the 1990s, and top
executives began to pay notice
- Many different definitions (some relating to
data, others to people or processes)
18Simple Definition
A data warehouse is a collection of integrated,
subject-oriented databases designed to support
the decision support function, where each unit of
data is relevant to some moment in time.
19Four Defining Concepts
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
20Concepts....
- Subject-oriented
- requires database design
- revolves around specific business entities
- many companies simply pull together old files
- Integrated data
- data warehouse database designed using a proper
methodology
- consistency in naming conventions for keys,
relationships, etc.
- warehouses require large design effort
21Concepts...
- Time-variant
- data warehouse design organizes data by different
time periods
- fundamental to temporal analysis
- usually years or quarters or months
- Non-volatile
- not updated in real-time
- staged into warehouse on a nightly/weekly basis
- users cannot update the data (in DW) directly
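The time-variant and non-volatile ideas can be sketched with Python's built-in sqlite3 module. The `sales_fact` table, the `nightly_load` function, and the sample rows are assumptions made for illustration. Data arrives only in staged batches, is never updated in place, and is analyzed by time period.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)")

def nightly_load(rows):
    """Append one staged batch; existing rows are never changed."""
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    dw.commit()

# Two batches staged from operational systems (made-up data).
nightly_load([("2023-01-15", "desk", 200.0), ("2023-02-20", "lamp", 40.0)])
nightly_load([("2023-04-03", "desk", 220.0)])

# Time-variant analysis: total sales per quarter.
by_quarter = dw.execute(
    """
    SELECT substr(sale_date, 1, 4) || '-Q' ||
           ((CAST(substr(sale_date, 6, 2) AS INTEGER) + 2) / 3) AS quarter,
           SUM(amount)
    FROM sales_fact
    GROUP BY quarter
    ORDER BY quarter
    """
).fetchall()
print(by_quarter)
```

Users would only run queries like the last one; the load function is the single path by which rows enter the warehouse.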
22Data Mining
True genius resides in the capacity for
evaluation of uncertain, hazardous, and often
conflicting information - Sir Winston Churchill
23What is data mining?
- Large databases can be searched for relationships,
patterns, and trends, which prior to the search
were not known to exist.
- Data mining is the process of asking a processing
engine to show answers to questions that we do
not know how to ask.
24Data Mining techniques
- Four major types of processing algorithms (or
rules)
- associations
- clustering
- classification
- sequential patterns
251. Associations (Link Analysis)
- Find correlations between one set of items or
events and another such set
- e.g., 78% of all people who buy a desktop PC will
also buy add-ons
- e.g., a large percentage of buyers will buy potato
chips if they are stacked near the beverages
aisle...
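A figure such as "X% of desktop PC buyers also buy add-ons" is the confidence of an association rule. The sketch below computes support and confidence over a handful of made-up baskets; the item names and the `rule_stats` helper are illustrative assumptions, not data from any survey.

```python
# Made-up transactions; each set is one customer's basket.
baskets = [
    {"desktop PC", "mouse"},
    {"desktop PC", "printer"},
    {"desktop PC"},
    {"laptop", "mouse"},
]
add_ons = {"mouse", "printer"}

def rule_stats(baskets, antecedent, consequents):
    """Support and confidence for 'antecedent -> any item in consequents'."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if b & consequents]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

support, confidence = rule_stats(baskets, "desktop PC", add_ons)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Support says how often the combination occurs at all; confidence says how often the second part holds given the first.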
26Clustering
- Used to discover hitherto unknown or unsuspected
classes of data
- Defect Analysis or Group affinity analysis
- e.g., some particular common characteristic among
good customers who cancel their own credit cards
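One common way to find groups that were not defined in advance is k-means clustering. The sketch below is a tiny one-dimensional version on made-up monthly spending figures; the choice of k-means, the starting centers, and the data are all assumptions for illustration.

```python
def kmeans_1d(values, centers, rounds=10):
    """Tiny k-means on numbers: assign to nearest center, then re-average."""
    centers = list(centers)
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Made-up monthly card spending for six customers.
monthly_spend = [20, 25, 30, 400, 420, 450]
centers, groups = kmeans_1d(
    monthly_spend, centers=[min(monthly_spend), max(monthly_spend)]
)
print(groups)
```

No labels are supplied; the two groups emerge from the data, and an analyst then looks for what the members of each group have in common.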
27Classification
- Identifies the process and must discover the
rules that determine whether an item belongs to a
particular subset of data (a subtype)
- e.g., credit card approval
- do a variety of customer characteristics put
him/her in a subset of customers who can charge?
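Discovering a rule from past cases can be sketched with the simplest possible classifier, a single cut-off on one characteristic. The income figures, the approval labels, and the `learn_threshold` helper are made up; real credit scoring uses many characteristics and more capable models.

```python
# Made-up history: (income in thousands, was the customer approved?)
history = [(20, False), (35, False), (50, True), (65, True), (80, True)]

def learn_threshold(examples):
    """Pick the income cut-off that misclassifies the fewest past cases."""
    candidates = sorted(x for x, _ in examples)

    def errors(t):
        # Rule under test: approve when income >= t.
        return sum((x >= t) != label for x, label in examples)

    return min(candidates, key=errors)

threshold = learn_threshold(history)

def classify(income):
    """Apply the discovered rule to a new applicant."""
    return income >= threshold

print(threshold, classify(60), classify(30))
```

Unlike clustering, the subsets (approved, not approved) are known in advance; the algorithm's job is to find the rule that separates them.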
28Sequential Patterns
- Mostly used for pattern analysis
- uses historical data store of all transactions in
a warehouse
- e.g., buyers who purchase window coverings and then
buy linens within three months will purchase
furniture within the next 12 months (new
residence furnishings buying pattern)
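The furnishings example can be checked against a transaction history with a short script. This is a sketch on made-up purchases; it approximates "three months" as 90 days and "12 months" as 365 days, and the helper names are hypothetical.

```python
from datetime import date

# Made-up purchase history per customer: (date, item).
purchases = {
    "c1": [(date(2023, 1, 10), "window coverings"),
           (date(2023, 2, 20), "linens"),
           (date(2023, 9, 1), "furniture")],
    "c2": [(date(2023, 1, 5), "window coverings"),
           (date(2023, 8, 1), "linens")],
    "c3": [(date(2023, 3, 1), "linens"),
           (date(2023, 4, 1), "furniture")],
    "c4": [(date(2023, 5, 1), "window coverings"),
           (date(2023, 6, 1), "linens")],
}

def linens_after_coverings(events, max_days=90):
    """Date of a linens purchase within max_days after window coverings."""
    for d1, item1 in events:
        if item1 != "window coverings":
            continue
        for d2, item2 in events:
            if item2 == "linens" and 0 < (d2 - d1).days <= max_days:
                return d2
    return None

def furniture_after(events, start, max_days=365):
    """True if furniture was bought within max_days after the start date."""
    return any(item == "furniture" and 0 < (d - start).days <= max_days
               for d, item in events)

# Customers who show the first two steps of the sequence.
prefix = {}
for customer, events in purchases.items():
    linens_date = linens_after_coverings(events)
    if linens_date is not None:
        prefix[customer] = linens_date

# Of those, how many went on to buy furniture?
completed = [c for c, d in prefix.items() if furniture_after(purchases[c], d)]
confidence = len(completed) / len(prefix) if prefix else 0.0
print(sorted(prefix), completed, confidence)
```

The order and timing of events matter here, which is what separates sequential patterns from plain associations.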