Title: Data Warehouse and the Web
1Chapter 10
Data Warehouse Fundamentals
- Data Warehouse and the Web
2Chapter - Objectives
- Understand what Web-enabled data warehouse means
and examine the reasons for doing so - Appreciate the importance of the convergence of
Web technology and those of the Data Warehouse - Probe into all the facets of Web-based
information delivery - Study how OLAP and the Web connect and learn the
different approaches to connecting them - Examine the steps for building a Web-enabled data
warehouse
2
3Three Information Delivery Mechanisms
- Internet Low cost, however security concerns
must be addressed. - Intranet - A Web site or group of sites belonging
to an organization, accessible only by the
members of an organization. - Extranet - An intranet that is partially
accessible to authorized outsiders. - Whereas intranet resides behind firewall and
is accessible only to people who are members of
same organization, extranet provides various
levels of accessibility to outsiders.
4Why the Web?
- World-Wide Web (Web, WWW, or W3) possibly most
popular and powerful networked information system
to date. - As architecture of Web was designed to be
platform-independent, can significantly lower
deployment and training costs. - Organizations using Web as strategic platform for
innovative business solutions, in effect becoming
Web-centric.
4
5Why the Web?
- Universal browsers will run on any systems.
- It is easy to keep the internet/extranet updated
so that there will be one source of information. - Open up your data warehouse to your business
partners over the extranet fosters and
strengthens the partnership.
5
6Why the Web?
- Worldwide collection of interconnected networks.
- Began in late 60s in ARPANET, a US DOD project,
investigating how to build networks that could
withstand partial outages. - Starting with a few nodes, Internet estimated to
have over 100 million users in 1997, and over 270
million users in over 100 countries in 1998, with
one million new users joining each month. - May be 199 million users of Web by year 2000.
6
7The Web as a Data Source
- Hypermedia-based system that provides a simple
point and click means of browsing information
on the Internet using hyperlinks. - Information presented on Web pages, which can
contain text, graphics, pictures, sound, and
video. - Can also contain hyperlinks to other Web pages,
which allow users to navigate in a non-sequential
way through information. - Web documents written using HTML.
8
8The Web
- Web consists of network of computers that can act
in two roles - as servers, providing information
- as clients (browsers), requesting information.
- Protocol that governs exchange of information
between Web server and browser is HTTP and
locations within documents identified as a URL. - Much of Webs success is due to its simplicity
and platform-independence.
9
9Browser Technology for the Data Warehouse
- At present, four technologies are commonly used
to - build Web-based user interfaces.
- HTML
- Java
- ADO
- Plug-In
10Architecture of a Web-enabled Data Warehouse
MDDB
Firewall
Internet
Internet Client
OLAP Server
Web Server
Network Backbone
Database Server
Source Systems
DW
11Web Processing Model
Web Server
HTML
CGI
Structured Data Contents
Web Browser
SQL
HTML
Unstructured Data Contents
Query Engine
12Basic Components of Web Environment
10
13HyperText Transfer Protocol (HTTP)
- Protocol used to transfer Web pages through
Internet. - Based on request-response paradigm
- Connection - Client establishes connection with
Web - server.
- Request - Client sends request to Web server.
- Response - Web server sends response (HTML
- document) to client.
- Close - Connection closed by Web server.
11
14HyperText Transfer Protocol (HTTP)
- HTTP/1.0 is stateless protocol - each connection
is closed once server provides response. - This makes it difficult to support concept of a
session that is essential to basic DBMS
transactions.
12
15HyperText Markup Language (HTML)
- Document formatting language used to design most
Web pages. - A simple, yet powerful, platform-independent
document language. - HTML is an application of Standardized
Generalized Markup Language (SGML), a system for
defining structured document types and markup
languages to represent instances of those
document types.
13
16HyperText Markup Language (HTML)
14
17HyperText Markup Language (HTML)
15
18Uniform Resource Locators (URLs)
- String of alphanumeric characters that
represents location or address of a resource on
Internet and how that resource should be
accessed. - Defines uniquely where documents (resources) can
be found. - Uniform Resource Identifiers (URIs) - generic set
of all Internet resource names/addresses. - Uniform Resource Names (URNs) - persistent,
location-independent name. Relies on name lookup
services.
16
19Uniform Resource Locators (URLs)
- URL consists of three basic parts
- protocol used for the connection,
- host name,
- path name on host where resource stored.
- Can optionally specify
- port through which connection to host should be
made, - query string.
- http//www.w3.org/WWW/MarkUp.html
17
20Static and Dynamic Web Pages
- HTML document stored in file is static Web page.
- Content of dynamic Web page is generated each
time it is accessed. - Thus, dynamic Web page can
- respond to user input from browser.
- be customized by and for each user.
- Requires hypertext to be generated by servers.
- Need scripts that perform conversions from
different data formats into HTML on-the-fly.
18
21Requirements for Web-DBMS Integration
- Ability to access valuable corporate data in a
secure manner. - Data and vendor independent connectivity to allow
freedom of choice in DBMS selection. - Ability to interface to database independent of
any proprietary browser or Web server. - Connectivity solution that takes advantage of all
the features of an organizations DBMS.
19
22Requirements for Web-DBMS Integration
- Open-architecture to allow interoperability with
a variety of systems and technologies. For
example - different Web servers
- Microsoft's (Distributed) Common Object Model
(DCOM/COM) - CORBA/IIOP (Internet Inter-ORB protocol)
- Java/Remote Method Invocation.
20
23Requirements for Web-DBMS Integration
- Cost-effective solution that allows for
scalability, growth, and changes in strategic
directions, and helps reduce applications
development costs. - Support for transactions that span multiple HTTP
requests. - Support for session- and application-based
authentication. - Acceptable performance.
21
24Requirements for Web-DBMS Integration
- Minimal administration overhead.
- Set of high-level productivity tools to allow
applications to be developed, maintained, and
deployed with relative ease and speed.
22
25Two-Tier Client-Server Architecture
23
26Three-Tier Client-Server Architecture
- Client side presented two problems preventing
true scalability - Fat client, requiring considerable resources on
clients computer to run effectively. - Significant client side administration overhead.
- By 1995, three layers proposed, each potentially
running on a different platform.
24
27Three-Tier Client-Server Architecture
- Advantages
- Thin client, requiring less expensive hardware.
- Application maintenance centralized.
- Easier to modify or replace one tier without
affecting others. - Separating business logic from database functions
makes it easier to implement load balancing. - Maps quite naturally to Web environment.
25
28Three-Tier Client-Server Architecture
26
29Advantages of Web-DBMS Approach
- DBMS advantages
- Simplicity
- Platform independence
- Graphical User Interface
- Standardization
- Cross-platform support
- Transparent network access
- Scalable deployment
- Innovation
27
30Disadvantages of Web-DBMS Approach
- Reliability
- Security
- Cost
- Scalability
- Limited functionality of HTML
- Statelessness
- Bandwidth
- Performance
- Immaturity of development tools
28
31Approaches to Integrating Web and DBMSs
- Common Gateway Interface (CGI).
- Server-Side Includes.
- HTTP Cookies.
- Extending the Web Server.
- Java and JDBC, JSQL and JRB.
- Scripting Languages (JavaScript and VBScript).
- Microsoft Active Platform ASP and ADO.
- Oracle Network Computing Architecture.
29
32Common Gateway Interface (CGI)
- Specification for transferring information
between a Web server and a CGI program. - Server only intelligent enough to send documents
and to tell browser what kind of document it is. - But server also knows how to launch other
programs. - When server sees that URL points to a program
(script), it executes script and sends back
scripts output to browser as if it were a file.
30
33CGI - Environment
31
34Common Gateway Interface (CGI)
- CGI defines how scripts communicate with Web
servers. - A CGI script is any script designed to accept and
return data that conforms to the CGI
specification. - Before server launches script, prepares number of
environment variables representing current state
of the server, who is requesting the information,
and so on. - Script picks this up and reads STDIN.
32
35Common Gateway Interface (CGI)
- Then performs necessary processing and writes its
output to STDOUT. - Script responsible for sending MIME header, which
allows browser to differentiate between
components. - CGI scripts can be written in almost any
language, provided it supports reading and
writing of an operating systems environment
variables.
33
36Common Gateway Interface (CGI)
- Four primary methods for passing information from
browser to a CGI script - Passing parameters on the command line.
- Passing environment variables to CGI programs.
- Passing data to CGI programs via standard input.
- Using extra path information.
34
37CGI - Passing Parameters on Command Line
35
38CGI - Advantages
- CGI is the de facto standard for interfacing Web
servers with external applications. - Possibly most commonly used method for
interfacing Web applications to data sources. - Advantages
- simplicity,
- language independence,
- Web server independence,
- wide acceptance.
36
39CGI - Disadvantages
- Communication between client and database server
must always go through Web server. - Lack of efficiency and transaction support, and
difficulty validating user input inherited from
statelessness of HTTP protocol. - HTTP never intended for long exchanges or
interactivity. - Server has to generate a new process or thread
for each CGI script. - Security.
37
40Server-Side Includes (SSI)
- Allows a program to be executed, like CGI, and to
incorporate its output into the document. - Generally, end result is a text document.
- SSI is not governed by an Internet RFC or other
standard. Each server vendor is free to implement
SSI on an ad-hoc basis, if at all. - Most servers follow NCSAs specification.
- All SSI commands are embedded within regular HTML
comments, making the HTML portable. - Security risks of SSI are similar to those of CGI.
38
41HTTP Cookies
- Cookies can make CGI scripts more interactive.
- Cookies are small text files stored on Web
client. - CGI script creates cookie and has Web server send
it to clients browser to store on hard disk. - Later, when client revisits Web site and uses a
CGI script that requests this cookie, clients
browser sends information stored in the cookie. - However, not all browsers support cookies.
39
42Extending the Web Server
- To overcome limitations of CGI, many servers
provide an API that adds functionality to server.
- Two of main APIs are Netscapes NSAPI and
Microsofts ISAPI. - Scripts are loaded in as part of the server,
giving back-end applications full access to all
the I/O functions of server. - One copy of application is loaded and shared
between multiple requests to server.
40
43Extending the Web Server
- Approach more complex than CGI, possibly
requiring specialized programmers. - Can provide very flexible and powerful solution.
- API extensions can provide same functionality as
a CGI program, but as API runs as part of the
server, API approach can perform significantly
better than CGI. - Extending Web server is potentially dangerous,
since server executable is being changed.
41
44Comparison of CGI and API
- CGI and API both extend capabilities of server.
- CGI scripts run in environment created by Web
server program. - Scripts only execute once Web server interprets
request from browser, then returns results back
to the server. - API approach not nearly so limited in its ability
to communicate. - API-based extensions are loaded into same address
space as Web server.
42
45Java
- Proprietary language developed by Sun and
currently marketed by JavaSoft. - Originally intended to support environment of
networked machines and embedded systems. - Now, Java is rapidly becoming de facto language
for Web computing. - Interesting because of its potential for building
Web applications (applets) and server
applications (servlets).
43
46Java
- Java is a simple, object-oriented, distributed,
interpreted, robust, secure, architecture
neutral, portable, high-performance,
multi-threaded and dynamic language. - Has a machine-independent target architecture,
the Java Virtual Machine (JVM). - Since almost every Web browser vendor has already
licensed Java and implemented an embedded JVM,
Java applications can currently be deployed on
most end-user platforms.
44
47Java
45
48Java
- Before Java application can be executed, it must
first be loaded into memory. - Done by Class Loader, which takes .class
file(s) containing bytecodes and transfers it
into memory. - Class file can be loaded from local hard drive or
downloaded from network. - Finally, bytecodes must be verified to ensure
that they are valid do not violate Javas
security restrictions.
46
49Java
- Loosely speaking Java is a safe C.
- Safety features include strong static type
checking, automatic garbage collection, and
absence of machine pointers at language level. - Safety is central design goal ability to safely
transmit Java code across Internet. - Security is also integral part of Javas design -
sandbox ensures untrusted application cannot gain
access to system resources.
47
50JDBC
- Modeled after ODBC, JDBC API supports basic SQL
functionality. - With JDBC, Java can be used as host language for
writing database applications. - On top of JDBC, higher-level APIs can be built.
- Currently, two types of higher-level APIs
- An embedded SQL for Java.
- A direct mapping of relational database tables to
Java classes.
48
51JDBC
- JDBC API consists of two main interfaces an API
for application writers, and a lower-level driver
API for driver writers. - Applications and applets can access databases
using - JDBC API with pure Java JDBC drivers,
- ODBC drivers and existing database client
libraries.
49
52JDBC - Advantages/Disadvantages
- Advantage of using ODBC drivers is that they are
a de facto standard for PC database access, and
are available for many DBMSs, for very low price.
- Disadvantages with this approach
- Non-pure JDBC driver will not necessarily work
with a Web browser. - Currently downloaded applet can connect only to
database located on host machine. - Deployment costs increase.
50
53JDBC
51
54JSQL
- Another JDBC-based approach uses Java with static
embedded SQL. - JSQL comprises a set of clauses that extend Java
to include SQL constructs as statements and
expressions. - JSQL translator transforms JSQL clauses into
standard Java code that accesses database through
a CLI.
52
55Java Relational Binding (JRB)
- Middleware product that bridges from Java to
RDBMSs. - Provides orthogonal persistence through
three-stage process - database creation,
- an import program,
- JRB API.
53
56Java Relational Binding (JRB)
54
57Java Relational Binding (JRB)
- API is set of public classes used by programmer,
which includes methods to connect to database
server, open database, start/end transactions,
create/update/read objects. - JRB relies on security and integrity of
underlying RDBMS - references between objects implemented as FKs
- OIDs modeled as system generated PKs.
55
58Java Relational Binding (JRB)
- JRB also provides notion of class extents.
- Can associate predicate with class extent and
retrieve objects based on their contents (in a
select-from-where style). - The runtime system is implemented on top of a
JDBC-compliant interface layer.
56
59Scripting Languages (JavaScript and VBScript)
- Scripting languages can be used to extend browser
and Web server with database functionality. - As script code is embedded in HTML, it is
downloaded every time page is accessed. - Updating browser is simply a matter of changing
Web document on server. - Two popular scripting languages are JavaScript
and VBScript. - Both are interpreted languages, not compiled,
making it easy to create small applications.
57
60Server-Side JavaScript for Database Access
58
61Microsoft Active Platform
- Microsoft Active Platform is an open,
standards-based software architecture for
delivering applications over the Internet and
intranets. - Contains HTML, scripting languages, and
components (Java, ActiveX). - On client machine called an Active Desktop.
- On Web server called an Active Server.
- Active Platform is encompassing term given to
these related technologies.
59
62Object Linking and Embedding for DataBases (OLE
DB)
- Microsoft has defined set of data objects,
collectively known as OLE DB. - Allows OLE-oriented applications to share and
manipulate sets of data as objects. - OLE DB is an object-oriented specification based
on C API. - Components can be treated as data consumers and
data providers. Consumers take data from OLE DB
interfaces and providers expose OLE DB
interfaces.
60
63OLE DB
61
64Active Server Pages (ASP)
- ASP is programming model that allows dynamic,
interactive Web pages to be created on server. - ASP provides flexibility of CGI, without
performance overhead discussed previously. - ASP runs in-process with the server, and is
optimized to handle large volume of users. - When an .asp file is requested, Web server
calls ASP, which reads requested file, executes
any commands, and sends generated HTML page back
to browser.
62
65Active Server Pages (ASP)
63
66Active Data Objects (ADO)
- Programming extension of ASP supported by
Microsoft IIS for database connectivity. - Supports following key features
- Independently-created objects.
- Support for stored procedures.
- Support for different cursor types.
- Batch updating.
- Support for limits on number of returned rows.
- Designed as an easy-to-use interface to OLE DB.
64
67Microsoft Internet Database Connector (IDC)
- Similar approach to ASP, again specific to
Microsoft Internet Information Server. - IDC is an ISAPI that reads an .idc file that
contains SQL commands. - IDC communicates with a DBMSs ODBC driver to
retrieve necessary data from database and format
it using information in an .htx file.
65
68Microsoft Internet Database Connector (IDC)
66
69Oracle Network Computing Architecture(NCA)
- NCA aimed at providing extensibility for
distributed environments. - It is three-tier architecture based on industry
standards such as - OMGs CORBA 2.0 technology.
- HTTP and HTML for Web enablement.
- IIOP for object interoperability.
- OMGs IDL for language neutral interfaces.
67
70Oracle NCA
68
71Oracle NCA
69
72Security
- All Internet traffic travels in the clear and
anyone who monitors traffic can read it. - Need to ensure with communication that
- It is inaccessible to anyone but sender and
receiver (privacy). - It has not been changed during transmission
(integrity). - Receiver can be sure it came from sender
(authenticity).
70
73Security
- Sender can be sure receiver is genuine
(non-fabrication). - Sender cannot deny he or she sent it
(non-repudiation). - Must also protect information once it has reached
Web server.
71
74Security
- Download may have executable content, which can
perform following malicious actions - Corrupt data or execution state of programs.
- Reformat complete disks.
- Perform a total system shutdown.
- Collect and download confidential data.
- Usurp identity and impersonate user.
- Lock up resources.
- Cause non-fatal but unwelcome effects.
72
75Security
- Look at
- Proxy Servers. - Firewalls
- Message Digest - Digital
- Algorithms Signatures.
- Digital Certificates. - Kerberos.
- SSL and S-HTTP. - SET and SST.
- Java Security. - ActiveX Security.
73
76Proxy Servers
- Proxy server is computer that sits between
browser and Web server. - It intercepts all requests to Web server to try
to fulfil requests itself. - Has two main purposes
- improve performance
- filter requests.
74
77Firewalls
- Designed to prevent unauthorized access to/from a
private network. - Can be implemented in both hardware and software,
or a combination of both. - Several types of firewall techniques
- Packet filter.
- Application gateway.
- Circuit-level gateway.
- Proxy server.
75
78Message Digest Algorithms
- Message digest algorithm takes an arbitrary-sized
string (message) and generates fixed-length
string (digest or hash). - A digest has following characteristics
- It should be computationally infeasible to find
another message that will generate same digest. - Digest does not reveal anything about message.
76
79Digital Signatures
- Digital signature consists of two parts
- string of bits computed from data being signed
- private key of individual or organization wishing
the signature. - Can be used to verify data comes from this
individual or organization.
77
80Digital Signatures
- Digital signature has many useful properties
- Authenticity can be verified, using public key.
- Cannot be forged (assuming private key is kept
secret). - Function of data signed and cannot be claimed to
be signature for any other data. - Signed data cannot be changed or signature will
no longer verify data as being authentic.
78
81Digital Certificates
- Attachment to electronic message used for
security purposes (e.g. verify user sending
message), and provide receiver with means to
encode reply. - Sender applies for certificate from Certificate
Authority (CA). - CA issues encrypted certificate containing
applicants public key and other identification
information.
79
82Digital Certificates
- CA makes its own public key readily available.
- Recipient uses CAs public key to decode
certificate attached to message, verifies it as
issued by CA, and obtains senders public key and
identification information held within
certificate. - With this information, recipient can send an
encrypted reply. - CAs role is critical, acting as go-between in
relationship between two parties.
80
83Kerberos
- A server of secured user names and passwords.
- Provides one centralized security server for all
data and resources on network. - Database access, login, authorization control,
and other security features are centralized on
trusted Kerberos servers. - Has similar function to that of Certificate
server to identify and validate a user.
81
84Secure Sockets Layer (SSL)
- Encryption protocol for transmitting private
documents - Designed to prevent eavesdropping, tampering, and
message forgery. - Works by using private key to encrypt data that
is transferred over SSL connection. - Layered between application-level protocols such
as HTTP and TCP/IP transport-level protocol. - Thus, may be used for other application-level
protocols such as FTP and NNTP.
82
85Secure-HTTP (S-HTTP)
- Protocol for securely transmitting individual
messages over Web. - Both SSL and S-HTTP use techniques such as
encryption, and digital signatures, and - Allow browsers and servers to authenticate each
other. - Allow controlled access to Web site.
- Ensure data exchanged between browser and server
is secure and reliable.
83
86Secure Electronic Transactions
- Open, interoperable standard for processing
credit card transactions over Internet, in simple
and secure way. - Transaction is split in such a way that merchant
has access to information about - what is being purchased,
- how much it costs,
- whether payment is approved,
- but no information on what payment method
customer is using.
84
87SET
- Card issuer (e.g. Visa) has access to purchase
price, but no information on type of merchandise
involved. - Certificates are heavily used by SET, both for
certifying cardholder and for certifying that
merchant has relationship with financial
institution.
85
88SET
86
89Secure Transaction Technology (SST)
- Protocol designed to handle secure bank payments
over Internet. - Uses DES encryption of information, RSA
encryption of bankcard information, and strong
authentication of all parties involved in
transaction.
87
90Java Security
- Sandbox ensures untrusted application cannot gain
access to system resources. - Involves three components
- class loader,
- bytecode verifier,
- security manager.
- Safety features provided by language and JVM, and
enforced by compiler and runtime system. - Security is a policy built on top of safety layer.
88
91Class Loader
- Allocates a (hierarchically structured) namespace
for each class. - Never allows class from less protected
namespace to replace class from more protected
namespace. - Thus, I/O primitives, defined in local Java
class, cannot be invoked or overridden by classes
from out with local machine. - As browsers and Java applications can provide
their own class loader, this may be viewed as
weakness in security.
89
92Bytecode Verifier
- JVM verifies bytecode instructions before
allowing application/applet to run - Typical checks include verifying
- Compiled code is correctly formatted.
- Internal stacks will not overflow/underflow.
- No illegal data conversions will occur.
- Bytecode instructions are appropriately typed.
- All class member accesses are valid.
90
93Security Manager
- Each Java application defines and implements its
own security policy. - A Java-enabled browser contains its own Security
Manager, and any applets it downloads are subject
to its policies. - Generally, downloaded applets are prevented from
- Reading and writing files on clients file
system. - Making network connections to machines other than
host. - Starting other programs on the client.
91
94Security Manager
- Loading libraries.
- Defining method calls.
- These restrictions apply to applets downloaded
over Internet/intranet. - Also do not apply to applets on clients local
disk and in directory on CLASSPATH. - Local applets are loaded by file system loader
and can read and write files, exit JVM, and are
not passed through bytecode verifier.
92
95ActiveX Security
- ActiveX security model places no restrictions on
what a control can do. - Instead, each ActiveX control can be digitally
signed by its author using system called
Authenticode. - Digital signatures are then certified by CA.
- This security model places responsibility for the
computers security on the user.
93
96HTTP/1.1
- Number of new features added. Look at two
- Persistent connections become default behavior.
While open, client can send synchronous or
asynchronous messages, and server can respond to
them in order. - Digest authentication provided as replacement for
basic authentication. Password remains secret
between client and server. Client and server
compute digest value using the MD5 and digest is
sent across network.
94
97XML (eXtensible Markup Language)
- XML is new standard that could preserve general
application independence that makes HTML portable
and powerful. - Pared-down version of SGML, designed especially
for Web documents. - Designers can create their own customized tags to
provide functionality not available with HTML.
95
98XML (eXtensible Markup Language)
- SGML allows document to be logically separated
into two - Document Type Definition (DTD)
- other containing the text itself.
- Useful features include
- Database Schema Definition.
- Linking to relative objects or elements.
- Support for bi-directional links.
- Simplicity may be lost with move to XML.
96
99Put the Pieces Together
- The data warehouse configuration is beyond
client/server computing. Therefore, consider a
distributed component architecture. - Strive for minimum administration on the client
side. Thru thin-client technology like Java is
likely to provide a zero-administration client
setup. - Ensure complete metadata integration. Both IT and
variety of user types will benefit from unified
metadata
100Put the Pieces Together (contd)
- Select the right database to support the
distributed environment. Thus, a RDBMS with the
Java engine in the database would prove useful. - In many Web applications, the HTTP server becomes
a point of congestion as all data from a session
is fed to the browser through this server.
Implementing a CORBA model would provide
distributed object computing and scalability. - The Web interface consists of a browser, search
engine, groupware, push technology, home pages,
hypertext links, and downloaded Java and ActiveX
applets.
101Put the Pieces Together (contd)
- Tools supporting HTML can be universally
deployed. However, for complex analysis, HTML is
cumbersome. Use HTML as much as possible and
reseve Java or plug-ins for complex ad hoc
analysis. - Ensure that you pay enough attention to the
administration and maintenance of fact and
dimension tables.