Title: Data-Driven Digital Library Applications: The UC Berkeley Environmental Digital Library
1Data-Driven Digital Library Applications -- The
UC Berkeley Environmental Digital Library
- University of California, Berkeley
- School of Information Management and Systems
- SIMS 257 Database Management
2Lecture Outline
- Final Project
- Review
- ORDBMS Feature
- JDBC Access to DBMS
- Data-Driven Digital Library Applications
- Berkeley's Environmental Digital Library
4Final Project Requirements
- See WWW site
- http://sims.berkeley.edu/courses/is257/f05/index.html
- Report on personal/group database including:
- Database description and purpose
- Data Dictionary
- Relationships Diagram
- Sample queries and results (Web or Access tools)
- Sample forms (Web or Access tools)
- Sample reports (Web or Access tools)
- Application Screens (Web or Access tools)
5Final Presentations and Reports
- Specifications for final report are on the Web
Site under assignments
- Reports due on December 14.
- Presentations on December 5 & 7, 9:00-10:30
6Lecture Outline
- Final Project
- Review
- ORDBMS Feature
- JDBC Access to DBMS
- Data-Driven Digital Library Applications
- Berkeley's Environmental Digital Library
7Object Relational Data Model
- Class, instance, attribute, method, and integrity
constraints
- OID per instance
- Encapsulation
- Multiple inheritance hierarchy of classes
- Class references via OID object references
- Set-Valued attributes
- Abstract Data Types
8PostgreSQL
- All of the usual SQL commands for creating,
searching and modifying classes (tables) are
available, with some additions:
- Inheritance
- Non-Atomic Values
- User defined functions and operators
9Inheritance
    CREATE TABLE cities (
        name text,
        population float,
        altitude int -- (in ft)
    );
    CREATE TABLE capitals (
        state char(2)
    ) INHERITS (cities);
10Non-Atomic Values - Arrays
- Postgres allows attributes of an instance to be
defined as fixed-length or variable-length
multi-dimensional arrays. Arrays of any base type
or user-defined type can be created. To
illustrate their use, we first create a class
with arrays of base types.
    CREATE TABLE SAL_EMP (
        name text,
        pay_by_quarter int4[],
        schedule text[][]
    );
11PostgreSQL Extensibility
- Postgres is extensible because its operation is
catalog-driven
- RDBMS store information about databases, tables,
columns, etc., in what are commonly known as
system catalogs. (Some systems call this the data
dictionary.)
- One key difference between Postgres and standard
RDBMS is that Postgres stores much more
information in its catalogs
- not only information about tables and columns,
but also information about its types, functions,
access methods, etc.
- These classes can be modified by the user, and
since Postgres bases its internal operation on
these classes, this means that Postgres can be
extended by users
- By comparison, conventional database systems can
only be extended by changing hardcoded procedures
within the DBMS or by loading modules
specially written by the DBMS vendor.
12User Defined Functions
- CREATE FUNCTION allows a Postgres user to
register a function with a database.
Subsequently, this user is considered the owner
of the function.
    CREATE FUNCTION name ( [ ftype [, ...] ] )
        RETURNS rtype
        AS SQLdefinition
        LANGUAGE 'langname'
        [ WITH ( attribute [, ...] ) ]
    CREATE FUNCTION name ( [ ftype [, ...] ] )
        RETURNS rtype
        AS obj_file , link_symbol
        LANGUAGE 'C'
        [ WITH ( attribute [, ...] ) ]
13External Functions
- This example creates a C function by calling a
routine from a user-created shared library. This
particular routine calculates a check digit and
returns TRUE if the check digit in the function
parameters is correct. It is intended for use in
a CHECK constraint.
    CREATE FUNCTION ean_checkdigit(bpchar, bpchar) RETURNS bool
        AS '/usr1/proj/bray/sql/funcs.so' LANGUAGE 'c';
    CREATE TABLE product (
        id char(8) PRIMARY KEY,
        eanprefix char(8) CHECK (eanprefix ~ '[0-9]{2} [0-9]{5}')
            REFERENCES brandname(ean_prefix),
        eancode char(6) CHECK (eancode ~ '[0-9]{6}'),
        CONSTRAINT ean CHECK (ean_checkdigit(eanprefix, eancode))
    );
14Creating new Types
- CREATE TYPE allows the user to register a new
user data type with Postgres for use in the
current database. The user who defines a type
becomes its owner. typename is the name of the
new type and must be unique within the types
defined for this database.
    CREATE TYPE typename ( INPUT = input_function, OUTPUT = output_function
        , INTERNALLENGTH = { internallength | VARIABLE }
        [ , EXTERNALLENGTH = { externallength | VARIABLE } ]
        [ , DEFAULT = "default" ]
        [ , ELEMENT = element ] [ , DELIMITER = delimiter ]
        [ , SEND = send_function ] [ , RECEIVE = receive_function ]
        [ , PASSEDBYVALUE ] )
15Rules System
    CREATE RULE name AS ON event
        TO object [ WHERE condition ]
        DO [ INSTEAD ] [ action | NOTHING ]
- Rules can be triggered by any event (select,
update, delete, etc.)
16Views as Rules
- Views in Postgres are implemented using the rule
system. In fact there is absolutely no difference
between a
    CREATE VIEW myview AS SELECT * FROM mytab;
- compared against the two commands
    CREATE TABLE myview (same attribute list as for mytab);
    CREATE RULE "_RETmyview" AS ON SELECT TO myview
        DO INSTEAD SELECT * FROM mytab;
17GiST Approach
- A generalized search tree. Must be:
- Extensible in terms of queries
- General (B-tree, R-tree, etc.)
- Easy to extend
- Efficient (match specialized trees)
- Highly concurrent, recoverable, etc.
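The "extensible in terms of queries" and "general" requirements can be pictured as a small plug-in interface that the tree calls for every key type. The sketch below is illustrative only and is not PostgreSQL's actual GiST API: the names GistSketch, KeyMethods, Interval and IntervalMethods are hypothetical, and only three of the key methods described in the GiST literature (Consistent, Union, Penalty) are shown; Compress, Decompress and PickSplit are omitted. With integer intervals as keys, a tree driven by these methods would behave roughly like a B-tree over ranges.

```java
import java.util.Arrays;
import java.util.List;

public class GistSketch {

    /** Hypothetical extension interface: what a key type must supply to the tree. */
    interface KeyMethods<P, Q> {
        /** Returns false only if nothing stored below predicate p can satisfy query q. */
        boolean consistent(P p, Q q);

        /** Returns a predicate that covers everything the given predicates cover. */
        P union(List<P> predicates);

        /** Cost of placing an entry with predicate added under predicate existing. */
        double penalty(P existing, P added);
    }

    /** Closed integer interval [lo, hi], used here as both key predicate and query. */
    static final class Interval {
        final int lo;
        final int hi;

        Interval(int lo, int hi) {
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        public String toString() {
            return "[" + lo + ", " + hi + "]";
        }
    }

    /** Key methods for intervals (an assumed example key type). */
    static final class IntervalMethods implements KeyMethods<Interval, Interval> {
        @Override
        public boolean consistent(Interval p, Interval q) {
            // The subtree may hold matches whenever the two intervals overlap.
            return p.lo <= q.hi && q.lo <= p.hi;
        }

        @Override
        public Interval union(List<Interval> predicates) {
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            for (Interval i : predicates) {
                lo = Math.min(lo, i.lo);
                hi = Math.max(hi, i.hi);
            }
            return new Interval(lo, hi);
        }

        @Override
        public double penalty(Interval existing, Interval added) {
            // How much the existing interval has to widen to cover the new one.
            Interval grown = union(Arrays.asList(existing, added));
            return (double) (grown.hi - grown.lo) - (existing.hi - existing.lo);
        }
    }

    public static void main(String[] args) {
        IntervalMethods m = new IntervalMethods();
        Interval node = m.union(Arrays.asList(new Interval(1, 3), new Interval(5, 9)));
        System.out.println("node predicate: " + node);
        System.out.println("search [4, 4] descends here: " + m.consistent(node, new Interval(4, 4)));
        System.out.println("penalty: " + m.penalty(new Interval(1, 3), new Interval(5, 9)));
    }
}
```

Note that consistent may answer true for a node that turns out to hold no match (the query [4, 4] against a node covering [1, 3] and [5, 9]); it only has to be safe, never pruning a subtree that could contain a result.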
18Java and JDBC
- Java is probably the high-level language used in
most software development today; one of the
earliest enterprise additions to Java was JDBC
- JDBC is an API that provides mid-level access
to DBMS from Java applications
- Intended to be an open cross-platform standard
for database access in Java
- Similar in intent to Microsoft's ODBC
19JDBC
- Provides a standard set of interfaces for any
DBMS with a JDBC driver, using SQL to specify
the database operations.
20JDBC Simple Java Implementation
    import java.sql.*;
    import oracle.jdbc.*;

    public class JDBCSample {
        public static void main(java.lang.String[] args) {
            try {
                // this is where the driver is loaded
                // Class.forName("jdbc.oracle.thin");
                DriverManager.registerDriver(new OracleDriver());
            } catch (SQLException e) {
                System.out.println("Unable to load driver Class");
                return;
            }
21JDBC Simple Java Impl.
            try {
                // All DB access is within the try/catch block...
                // make a connection to ORACLE on Dream
                Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@dream.sims.berkeley.edu:1521:dev",
                    "mylogin", "myoraclePW");
                // Do an SQL statement...
                Statement stmt = con.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT NAME FROM DIVECUST");
22JDBC Simple Java Impl.
                // show the Results...
                while (rs.next())
                    System.out.println(rs.getString("NAME"));
                // Release the database resources...
                rs.close();
                stmt.close();
                con.close();
            } catch (SQLException se) {
                // inform user of errors...
                System.out.println("SQL Exception: " + se.getMessage());
                se.printStackTrace(System.out);
            }
        }
    }
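For comparison, here is a minimal sketch of the same query written with try-with-resources (Java 7 and later), so the connection, statement and result set are closed even when an exception is thrown partway through. This is a variation, not the code from the slides: the class name JdbcSketch and the method fetchNames are hypothetical, a PreparedStatement is swapped in for the plain Statement, and the JDBC URL and credentials are passed as arguments rather than hard-coded. Running it against a real database assumes a suitable JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class JdbcSketch {

    /**
     * Returns the NAME column of every row in DIVECUST
     * (table and column names are taken from the slides).
     */
    static List<String> fetchNames(String url, String user, String password)
            throws SQLException {
        List<String> names = new ArrayList<>();
        // Resources declared here are closed automatically, in reverse order.
        try (Connection con = DriverManager.getConnection(url, user, password);
             PreparedStatement stmt = con.prepareStatement("SELECT NAME FROM DIVECUST");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                names.add(rs.getString("NAME"));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: java JdbcSketch <jdbc-url> <user> <password>");
            return;
        }
        try {
            for (String name : fetchNames(args[0], args[1], args[2])) {
                System.out.println(name);
            }
        } catch (SQLException se) {
            // inform user of errors
            System.err.println("SQL Exception: " + se.getMessage());
        }
    }
}
```

Since JDBC 4.0, drivers found on the classpath register themselves with DriverManager, so the explicit registerDriver call is usually unnecessary with a current driver.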
23Lecture Outline
- Final Project
- Review
- ORDBMS Feature
- JDBC Access to DBMS
- Data-Driven Digital Library Applications
- Berkeley's Environmental Digital Library
24Berkeley DL Project
- Object Relational Database Applications
- The Berkeley Digital Library Project
- Slides from RRL and Robert Wilensky, EECS
- Use of DBMS in DL project
- Note that MOST of these things no longer work on
the DL web site
25Overview
- What is a Digital Library?
- Overview of Ongoing Research on Information
Access in Digital Libraries
26Digital Libraries Are Like Traditional
Libraries...
- Involve large repositories of information
(storage, preservation, and access)
- Provide information organization and retrieval
facilities (categorization, indexing)
- Provide access for communities of users
(communities may be as large as the general
public or as small as the employees of a particular
organization)
27Traditional Library System
28But Digital Libraries Are Different From
Libraries...
- Not a physical location with local copies;
objects held closer to originators
- Decoupling of storage, organization, access
- Enhanced Authoring (origination, annotation,
support for work groups)
- Subscription, pay-per-view supported in addition
to free browsing.
- Integration into user tasks.
29A Digital Library Infrastructure Model
30UC Berkeley Digital Library Project
- Focus: Work-centered digital information
services
- Testbed: Digital Library for the California
Environment
- Research: Technical agenda supporting
user-oriented access to large distributed
collections of diverse data types.
- Part of the NSF/NASA/DARPA Digital Library
Initiative (Phases 1 and 2)
31UCB Digital Library Project Research
Organizations
- UC Berkeley: EECS, SIMS, CED, IST
- UCOP/CDL
- Xerox PARC's Document Image Decoding group and
Work Practices group
- Hewlett-Packard
- NEC
- SUN Microsystems
- IBM Almaden
- Microsoft
- Ricoh California Research
- Philips Research
32Testbed: An Environmental Digital Library
- Collection: Diverse material relevant to
California's key habitats.
- Users: A consortium of state agencies,
development corporations, private corporations,
regional government alliances, educational
institutions, and libraries.
- Potential: Impact on state-wide environmental
system (CERES)
33The Environmental Library - Users/Contributors
- California Resources Agency, California
Environment Resources Evaluation System (CERES)
- California Department of Water Resources
- The California Department of Fish & Game
- SANDAG
- UC Water Resources Center Archives
- New Partners: CDL and SDSC
34The Environmental Library - Contents
- Environmental technical reports, bulletins, etc.
- County general plans
- Aerial and ground photography
- USGS topographic maps
- Land use and other special purpose maps
- Sensor data
- Derived information
- Collection databases for the classification and
distribution of the California biota (e.g.,
SMASCH)
- Supporting 3-D, economic, traffic, etc. models
- Videos collected by the California Resources
Agency
35The Environmental Library - Contents
- As of late 2002, the collection represents over
one terabyte of data, including over 183,000
digital images, about 300,000 pages of
environmental documents, and over 2 million
records in geographical and botanical databases.
36Botanical Data
- The CalFlora Database contains taxonomical and
distribution information for more than 8000
native California plants. The Occurrence Database
includes over 600,000 records of California plant
sightings from many federal, state, and private
sources. The botanical databases are linked to
the CalPhotos collection of California plants,
and are also linked to external collections of
data, maps, and photos.
37Geographical Data
- Much of the geographical data in the collection
has been used to develop our web-based GIS
Viewer. The Street Finder uses 500,000 TIGER
records of S.F. Bay Area streets along with
70,000 records from the USGS GNIS database.
California Dams is a database of information
about the 1395 dams under state jurisdiction. An
additional 11 GB of geographical data represents
maps and imagery that have been processed for
inclusion as layers in our GIS Viewer. This
includes Digital Ortho Quads and DRG maps for the
S.F. Bay Area.
38Documents
- Most of the 300,000 pages of digital documents
are environmental reports and plans that were
provided by California state agencies. This
collection includes documents, maps, articles,
and reports on the California environment
including Environmental Impact Reports (EIRs),
educational pamphlets, water usage bulletins, and
county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game
(DFG), San Diego Association of Governments
(SANDAG), and many other agencies. Among the most
frequently accessed documents are County General
Plans for every California county and a survey of
125 Sacramento Delta fish species.
39Testbed Success Stories
- LUPIN: CERES Land Use Planning Information
Network
- California County General Plans and other
environmental documents.
- Enter at Resources Agency Server, documents
stored at and retrieved from UCB DLIB server.
- California flood relief efforts
- High demand for some data sets only available on
our server (created by document recognition).
- CalFlora: Creation and interoperation of
repositories pertaining to plant biology.
- Cloning of services at Cal State Library, FBI
40Research Highlights
- Documents
- Multivalent Document prototype
- Page images, structured documents, GIS data,
photographs
- Intelligent Access to Content
- Document recognition
- Vision-based Image Retrieval: stuff, thing, scene
retrieval
- Natural Language Processing: categorizing the
web, Cheshire II, TileBar Interfaces
41Multivalent Documents
- MVD Model
- radically distributed, open, extensible
- behaviors and layers
- behaviors conform to a protocol suite
- inter-operation via IDEG
- Applied to enlivening legacy documents
- various nice behaviors, e.g., lenses
42Document Presentation
- Problem: Digital libraries must deliver digital
documents -- but in what form?
- Different forms have advantages for particular
purposes
- Retrieval
- Reuse
- Content Analysis
- Storage and archiving
- Combining forms (Multivalent documents)
43Spectrum of Digital Document Representations
Adapted from Fox, E.A., et al., "Users, User
Interfaces and Objects: Envision, an Electronic
Library", JASIS 44(8), 1993
44Document Representation: Multivalent Documents
- Primary user interface/document model for UCB
Digital Library (Wilensky & Phelps)
- Goal: An approach to new document representations
and their authoring.
- Supports active, distributed, composable
transformations of multimedia documents.
- Enables sophisticated annotations, intelligent
result handling, user-modifiable interface,
composite documents.
45Multivalent Documents
48MVD availability
- The MVD Browser is now available as open source
on SourceForge
- http://multivalent.sourceforge.net
- See also
- http://elib.cs.berkeley.edu
49GIS in the MVD Framework
- Layers are georeferenced data sets.
- Behaviors are
- display semi-transparently
- pan
- zoom
- issue query
- display context
- spatial hyperlinks
- annotations
- Written in Java
50GIS Viewer Features
- Annotation and saving
- points, rectangles (w. labels and links), vectors
- saving of annotations as separate layer
- Integration with address, street finding,
gazetteer services
- Application to image viewing: tilePix
- Castanet client
54GIS Viewer Example
http://elib.cs.berkeley.edu/annotations/gis/buildings.html
55Geographic Information Plans and Ideas
- More annotations, flexible saving
- Support for large vector data sets
- Interoperability
- On-the-fly
- conversion of formats
- generation of catalogs
- Via OGDI/GLTP
- Experimenting with various CERES servers
56Documents: Information from scanned documents
- Built document recognizers for some important
documents, e.g. Bulletin 17, TR-9.
- Recognized document structure, with an
order of magnitude better OCR.
- Automatically generated a 1395-item relational
database of dams.
- Enabled access via forms, map interfaces.
- Enabled interoperation with image DB.
60Document Recognition: Ongoing Work
- Document recognizers for a dozen document types
- Development and integration of mathematical OCR
and recognition.
- Eventually produce a document recognizer generator,
i.e., make it easier to write recognizers.
61Vision-Based Image Retrieval
- Stuff-based queries: blobs
- Basic blobs: colors, sizes, variable number
- demonstrated utility for interesting queries
- Blobworld: above plus texture, applied to
- retrieving similar images
- successful learning scene classifier
- Thing-finding: successfully deployed detectors;
adding body plans (adding shape, geometry and
kinematic constraints)
62Image Retrieval Research
- Finding Stuff vs Things
- BlobWorld
- Other Vision Research
63(Old stuff-based image retrieval Query)
64(Old stuff-based image retrieval Result)
65Blobworld: use regions for retrieval
- We want to find general objects, so represent
images based on coherent regions
68(Thing-based image retrieval using body
plans Result)
69Natural Language Processing: Automatic Topic Assignment
- Developed automatic categorization/disambiguation
method to the point where topic assignment (but not
disambiguation) appears feasible.
- Ran controlled experiment
- Took Yahoo as ground truth.
- Chose 9 overlapping categories; took 1000 web
pages from Yahoo as input.
- Result: 84% precision, 48% recall (using top 5
of 1073 categories)
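Precision and recall for this kind of topic assignment can be computed per page by comparing the assigned categories with the ground-truth categories. The sketch below shows one common way to do that and is not the project's actual evaluation code: the class name TopicEval, the method names and the category labels are made up for illustration, and the slides do not say how scores were averaged across pages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TopicEval {

    /** Number of categories that appear in both sets. */
    static int overlap(Set<String> assigned, Set<String> truth) {
        Set<String> common = new HashSet<>(assigned);
        common.retainAll(truth);
        return common.size();
    }

    /** Fraction of the assigned categories that are correct. */
    static double precision(Set<String> assigned, Set<String> truth) {
        return assigned.isEmpty() ? 0.0 : (double) overlap(assigned, truth) / assigned.size();
    }

    /** Fraction of the ground-truth categories that were assigned. */
    static double recall(Set<String> assigned, Set<String> truth) {
        return truth.isEmpty() ? 0.0 : (double) overlap(assigned, truth) / truth.size();
    }

    public static void main(String[] args) {
        // Made-up page: five categories assigned, three in the ground truth.
        Set<String> assigned = new HashSet<>(
                Arrays.asList("Rivers", "Dams", "Fish", "Maps", "Plants"));
        Set<String> truth = new HashSet<>(
                Arrays.asList("Rivers", "Dams", "Hydrology"));
        System.out.println("precision = " + precision(assigned, truth));
        System.out.println("recall    = " + recall(assigned, truth));
    }
}
```

In the made-up example two of the five assigned categories are correct (precision 2/5) and two of the three true categories are found (recall 2/3).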
70Further Information
- Berkeley DL web site
- http://elib.cs.berkeley.edu