Digital Library Technologies - PowerPoint PPT Presentation

Provided by: comminfo
1
Digital Library Technologies
  • Text formats and storage
  • Searching text
  • Images
  • Speech
  • Multimedia
  • Networking

2
Text formats
  • ASCII: simple, no formatting, accessible
  • HTML: simple, moderate formatting, accessible
  • Word processors: formatting, access limited
  • PDF: formatted, complex, access limited
  • TEI: formatted, open, very complex

<sp who="Oph"><speaker><hi rend="i">Oph.</hi></speaker>
<p><lb n="46"/> Pray let's have no words of this, but when<lb n="47"/> they ask you what it means, say you this</p>
<stage> <hi rend="i">Song.</hi></stage>
<lg part="M" type="song">
<l n="48">"To-morrow is <rs key="StValentine">Saint Valentine's</rs> day,</l>
<l n="49">All in the morning betime,</l>
<l n="50"> And I a maid at your window, </l>
3
Control of the format
  • ASCII: user has complete control of display
  • HTML: user has considerable control of display
  • PDF: publisher has all the control
  • Authors and readers disagree on who should decide things like column layout, type size, etc.
  • Over time, more and more Web documents have the format nailed down.
4
Text compression
  • Basic strategies: statistics or dictionaries
  • Statistics: as in Morse code, the more frequent letters get shorter codes. Huffman coding is the traditional method here, but lengthening the alphabet (coding letter groups or whole words as single symbols) will give better results.
  • Dictionaries: Lempel-Ziv or LZW. Find repeated strings and list them in a dictionary (stored at the beginning, or built up adaptively during coding as in LZW).
  • Questions: Is the code instantaneously decodable? Is a factor of 2 worth the trouble?
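To make the statistical strategy concrete, here is a minimal Huffman coding sketch in Python. It is an illustration under simple assumptions (single characters as symbols, codes kept as strings of "0" and "1" rather than packed bits), not the method of any particular compressor. The function names are made up for this example. The decoder emits a symbol as soon as a complete code has been read, which is what "instantaneously decodable" means for a prefix code.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    freq = Counter(text)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(text, codes):
    """Concatenate the code of every symbol in text."""
    return "".join(codes[ch] for ch in text)


def huffman_decode(bits, codes):
    """Decode bit by bit; no code is a prefix of another, so no lookahead."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

For example, `huffman_codes("aaaaaaaabbbc")` gives the frequent letter `a` a one-bit code and the rarer `b` and `c` two-bit codes, so the twelve letters encode in 16 bits instead of 96 at eight bits per character.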
5
Searching text files
  • Linear scan (grep): not for very big collections; no update problem
  • Inverted files: tries, or just divide by blocks. May wish to compress occurrence lists, index by both ends, allow fielded searching, and keep frequency information.
  • Signature files: electronic edge-notched cards, trading space for false drops
  • Bitmaps: best for very common words; add to inverted files
  • Clustering: for complex searching, summarizing results
  • Case folding, suffixing, stop lists.
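A small inverted file can be sketched in a few lines of Python. This is only an illustration: the stop list, the crude suffix stripper, and all function names are assumptions made for the example, and a real system would use a proper stemmer and compressed occurrence lists. Each term maps to the documents containing it and the word positions inside each document, so frequency information comes for free.

```python
from collections import defaultdict

# Illustrative stop list; real lists are longer.
STOP_WORDS = {"a", "and", "is", "of", "the", "to"}


def normalize(word):
    """Case-fold, strip punctuation, and remove one common suffix (crudely)."""
    word = word.lower().strip(".,;:!?\"'()")
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each term to {doc_id: [word positions]} for a dict of documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, raw in enumerate(text.split()):
            term = normalize(raw)
            if term and term not in STOP_WORDS:
                index[term].setdefault(doc_id, []).append(pos)
    return index


def search(index, query):
    """Return sorted ids of the documents that contain every query term."""
    terms = [normalize(w) for w in query.split()]
    terms = [t for t in terms if t and t not in STOP_WORDS]
    if not terms:
        return []
    result = set(index.get(terms[0], {}))
    for t in terms[1:]:
        result &= set(index.get(t, {}))
    return sorted(result)
```

With `docs = {1: "Searching text files", 2: "Inverted files and signature files"}`, the query `"file"` matches both documents because `files` is suffix-stripped to `file` at indexing time, while a query made only of stop words returns nothing.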
6
Grab: an example compromise
  • Grab was an attempt to balance between the speed of inversion and the compactness of linear search.
  • Bitmap vectors on hashed words, compressed from 10 bits to 4 bits.
  • Go back later and cast out false drops.
  • For 5% extra space, get a 90% speedup on linear search.
  • Never caught on. Space is too cheap today, and files are too big. Might as well use full inversion.
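The general idea of a bitmap on hashed words, followed by a pass that casts out false drops, can be sketched as follows. This is not Grab's actual design: the signature width, the use of CRC-32 as the hash, and the function names are all assumptions chosen for illustration, and the compression step is omitted. Each block of text gets one bit vector; a query word can only be in blocks whose vector has the word's bit set, and a linear scan of those few candidates removes the false matches.

```python
import zlib

SIG_BITS = 64  # assumed signature width; fewer bits means more false drops


def word_bit(word, bits=SIG_BITS):
    """Hash a word to a single bit position inside the signature."""
    return 1 << (zlib.crc32(word.lower().encode("utf-8")) % bits)


def signature(text, bits=SIG_BITS):
    """OR together the bits of every word in a block of text."""
    sig = 0
    for w in text.split():
        sig |= word_bit(w, bits)
    return sig


def search(blocks, word, bits=SIG_BITS):
    """Return (candidates, confirmed) lists of block indexes for word."""
    bit = word_bit(word, bits)
    # In practice the signatures would be computed once and stored.
    sigs = [signature(b, bits) for b in blocks]
    candidates = [i for i, s in enumerate(sigs) if s & bit]
    # Cast out false drops by scanning only the candidate blocks.
    confirmed = [i for i in candidates
                 if word.lower() in blocks[i].lower().split()]
    return candidates, confirmed
```

Shrinking `bits` shows the trade-off directly: with `bits=1` every word hashes to the same bit, every block becomes a candidate, and the search degenerates into a full linear scan.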
7
Why not a DBMS?
Why don't text retrieval systems use a DBMS underneath?
  • Few numerical entries, and vast numbers of items
  • Special needs, such as index browsing and truncation searching
  • Input is not neatly structured into records, and variable-length items may have to be retrieved
  • Not much updating.
  • Parallel searching is just coming into vogue.
8
What do the search engines do?
  • Very large inverted files and parallel search engines on a great many machines (thousands).
  • Big caches. They may search only in the cache and avoid all disk delays.
  • They are willing to give different results depending on what data is in the cache.
9
Collaborative ranking and filtering
  • Google is the best-known search engine; it derives from BackRub at the Stanford digital library project.
  • See http://www-db.stanford.edu/~backrub/google.html: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Sergey Brin and Lawrence Page.
  • Simply put, pages pointed to by a lot of other people are probably better.
  • Other work, from Jon Kleinberg at Cornell, has looked at links in both directions, and this is all related to collaborative filtering.
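The intuition that heavily linked pages are probably better can be sketched as a simple iterative link-ranking computation in the spirit of PageRank. This is a toy illustration, not Google's implementation: the damping value of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the function name are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a dict {page: [pages it links to]} by power iteration."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iterations):
        # Every page gets a small baseline share on each round.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank
```

For the hypothetical graph `{"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` ends up with the highest rank because three pages point to it, and `a` outranks `b` and `d` because its single incoming link comes from the highly ranked `c`.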
10
Image formats
  • There are a great many image formats. The best known are GIF and JPG.
  • Why are there so many? Images are bulky. The best compression is lossy, and one can choose what kinds of things to lose.
  • GIF loses color space; it is perfect on black-and-white. JPG is more general.
  • You can do non-lossy compression, e.g. TIFF G4: just run ordinary Huffman-like compression on the signal.
  • Wavelet and fractal compression are coming along. JPEG2000 may replace JPEG someday.
  • DjVu is particularly interesting: oriented toward text, it divides the page into background and foreground, and does wavelets on the background and dictionary compression on the foreground.
11
Sound formats
Some technology, but mostly commerce.
  • a) Digitization rates: You can do speech at 8 kHz, but for music you ought to do better; CD music is 44.1 kHz.
  • b) Compression: You can get speech down to 2400 bits/sec or so, and compress music by a factor of 10 (MP3 is the current favorite).
  • Commerce: Real vs. Microsoft (WMP). Digital rights management.
  • Unlike text, few people can write sound manipulation software, and so everyone is dependent on one vendor or another.
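The rates on this slide can be checked with a little arithmetic. The sketch below assumes 8-bit mono samples for telephone-quality speech and 16-bit stereo samples for CD audio; those sample sizes and channel counts are standard figures but are not stated on the slide. It also treats the slide's 2400 figure as bits per second.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed (PCM) bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed formats: 8 kHz, 8-bit, mono speech; 44.1 kHz, 16-bit, stereo CD.
speech_pcm = pcm_bitrate(8_000, 8, 1)      # 64,000 bit/s
cd_pcm = pcm_bitrate(44_100, 16, 2)        # 1,411,200 bit/s

# Compression figures taken from the slide.
speech_ratio = speech_pcm / 2400           # how much smaller 2400 bit/s is
music_compressed = cd_pcm / 10             # music compressed by a factor of 10

print(f"speech: {speech_pcm} bit/s raw, about {speech_ratio:.1f}x smaller at 2400 bit/s")
print(f"CD: {cd_pcm} bit/s raw, about {music_compressed:.0f} bit/s after 10x compression")
```

Under these assumptions a factor of 10 brings CD audio to roughly 141 kbit/sec, which is in the range of common MP3 bit rates.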
12
Video formats
  • Video is extremely bulky. With 24 frames/second (movies) or 30 (TV), an hour of video is easily a gigabyte even with minimal resolution on each image.
  • But there is enormous scene-to-scene redundancy.
  • MPEG: a sequence of key frames and then differentially coded frames; JPEG-like coding on individual frames; prediction of moving objects.
  • MPEG-1: 1.5 Mbit/sec
  • MPEG-2: 4-9 Mbit/sec
  • MPEG-4: mixing synthetic (animation) with camera video
  • MPEG-7: metadata
  • The next real improvement is going to have to be longer-term storage and segmentation, e.g. separating the background from a scene and keeping it for many frames.
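The key-frame-plus-differences idea can be illustrated with a toy coder. This is deliberately far simpler than MPEG: frames are flat lists of pixel values of equal length, differences are stored as (position, new value) pairs, and there is no motion prediction or JPEG-like transform coding. The function names and the key-frame interval are made up for the example.

```python
def encode_video(frames, key_interval=4):
    """Store every key_interval-th frame whole, the rest as pixel changes."""
    encoded = []
    prev = None
    for i, frame in enumerate(frames):
        if prev is None or i % key_interval == 0:
            encoded.append(("key", list(frame)))
        else:
            # Keep only the pixels that differ from the previous frame.
            delta = [(j, v) for j, (v, old) in enumerate(zip(frame, prev))
                     if v != old]
            encoded.append(("delta", delta))
        prev = frame
    return encoded


def decode_video(encoded):
    """Rebuild all frames by applying each delta to the previous frame."""
    frames = []
    current = None
    for kind, payload in encoded:
        if kind == "key":
            current = list(payload)
        else:
            current = list(current)
            for j, v in payload:
                current[j] = v
        frames.append(current)
    return frames
```

When successive frames barely change, most delta lists are short or empty, which is where the savings come from. For scale, at the MPEG-1 rate of 1.5 Mbit/sec an hour of video comes to 1.5 × 3600 / 8 = 675 megabytes.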
13
Image searching
  • QBIC: color, texture, some shape
  • Color histogram is easiest; beware any demo of sunsets
  • Current work at Berkeley: better at segmentation, labeling
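A color-histogram comparison is easy to sketch, which also shows why sunset demos deserve suspicion: any two images with the same color mix match, whatever they depict. This is not QBIC's algorithm; the bin count, the histogram-intersection similarity measure, and the function names are assumptions chosen for illustration. Images are given as non-empty lists of (r, g, b) tuples with values from 0 to 255.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram; bins_per_channel should divide 256."""
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel
        idx += b // step
        hist[idx] += 1
    total = len(pixels)
    return [count / total for count in hist]


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical color mixes, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Two different pictures that are both three-quarters orange and one-quarter dark score 1.0 against each other, while an all-blue picture scores 0.0 against either, even though the histogram knows nothing about shapes or objects.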
14
Image labeling
David Forsyth and Jitendra Malik, Berkeley
15
Sound searching
  • Speech: speech recognition, speaker identification
  • Music: we now have hum-search software. See Bill Birmingham, U. of Michigan; Donald Byrd, U. Mass.
16
Video searching
  • See Informedia, Howard Wactlar, CMU.
  • Combination of closed-captioning, speech recognition, face recognition, OCR of on-screen text, and some image searching
  • Also a great deal of work on presentation.
17
Summary
There are lots of things in digital libraries today. And there are more to come: 3-D objects, scientific data, software. All of this will have to be stored, organized and searched.