Title: Nomadic Digital Library Research at Cornell
1. Automated Digital Libraries
William Y. Arms
Department of Computer Science
Cornell University
2. Two Questions
3. Before Digital Libraries
Access to scientific, medical, and legal information
In the United States
-- excellent if you belonged to a rich organization (e.g., a major university)
-- very poor otherwise
In many countries of the world
-- very poor for everybody
4. Question 1
Must access to scientific and professional information be expensive?
5. Research Libraries are Expensive
library materials
buildings and facilities
staff
6. The Potential of Digital Libraries
7. Question 2
How effectively can computers be used for the skilled tasks of professional librarianship?
-- Time horizon: 5 to 20 years
-- All materials in digital form
8. Automated Library Services
9. Skilled Librarianship
People are skilled at judgment, understanding, discrimination, etc.
-- selection
-- cataloguing, indexing
-- seeking information
-- evaluating information
-- reference service
Can computers provide equivalent services?
10. Equivalent Services
Example: Cataloguing rules
-- Application of cataloguing rules to monographs is skilled
-- It is hard to imagine a computer system with these skills
but ...
-- Catalogs and cataloguing rules are the means, not the end
11. Equivalent Services
Information discovery
Why are web search services the most widely used information discovery tools in universities today?
12. Conventional Criteria
Web search services have many weaknesses
-- selection is arbitrary
-- index records are crude
-- no authority control
-- duplicate detection is weak
-- search precision is deplorable
yet they clearly satisfy important requirements ...
13. Effectiveness of Web Search
Inspec v. Google
Google is usually superior for general computing and computer science questions
-- Broader coverage
-- Adequate indexing records
-- Better ranking
14. Simple Algorithms + Immense Computing Power
15. History: Licklider
J. C. R. Licklider, Libraries of the Future, 1965
-- envisaged digital libraries for scientists at their place of work
-- listed desiderata for a digital library
-- studied construction of fully automated digital libraries
-- put emphasis on artificial intelligence and natural language processing
16. History: Licklider
Licklider's predictions for digital libraries were remarkably good, but ...
-- over-optimistic about progress in artificial intelligence
-- underestimated what can be done by brute force computing
17. Brute Force Computing
Few people can appreciate the power of Moore's Law
-- Computing power doubles every 18 months
-- Increases 100 times in 10 years
-- Increases 10,000 times in 20 years
Simple algorithms + immense computing power may outperform human intelligence
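The growth figures quoted for Moore's Law follow from compounding one doubling every 18 months. A minimal sketch of that arithmetic, assuming an exact and constant doubling period (the function name `growth_factor` is illustrative and not from the talk):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Return the multiplicative growth after `years`,
    assuming capacity doubles once every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    for years in (10, 20):
        # 10 years is about 6.7 doublings; 20 years is about 13.3 doublings.
        print(f"{years} years: roughly {growth_factor(years):,.0f} times")
```

Under this assumption the factors come out slightly above 100 and slightly above 10,000, which matches the rounded figures on the slide.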
18. Brute Force Computing
Example: Creators of the world champion chess program (Deep Thought, later Deep Blue)
-- moderate chess players
-- simple tree-search algorithm
-- very, very fast computer hardware
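To illustrate what a simple tree-search algorithm looks like, here is a hedged sketch of exhaustive game-tree search (negamax) on a toy game rather than chess. The game is an assumption chosen for brevity: players alternately take 1 to 3 stones, and whoever takes the last stone wins. This is not the Deep Thought or Deep Blue code; real chess programs add depth limits, evaluation functions, and pruning.

```python
def best_outcome(stones: int) -> int:
    """Return +1 if the player to move can force a win, -1 otherwise.

    Plain negamax: try every legal move, search the resulting position
    from the opponent's point of view, and negate the opponent's result.
    """
    if stones == 0:
        # The previous player took the last stone, so the mover has lost.
        return -1
    return max(-best_outcome(stones - take)
               for take in (1, 2, 3) if take <= stones)


def best_move(stones: int) -> int:
    """Return the number of stones to take that leads to the best outcome."""
    legal = [take for take in (1, 2, 3) if take <= stones]
    return max(legal, key=lambda take: -best_outcome(stones - take))


if __name__ == "__main__":
    for pile in (4, 5, 7):
        print(pile, best_outcome(pile), best_move(pile))
```

The search encodes no knowledge of strategy; it simply enumerates positions, which is the point of the slide: the algorithm is simple, and the strength comes from how many positions the hardware can examine.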
19. An Anecdote
The question (Marvin Minsky)
-- How would you design a computer system that can answer questions such as, "Why was the space station a bad idea?"
The answer (Danny Hillis)
-- Design much more powerful computers!
20. Examples of Automated Digital Library Services
21. Web Search
Brute force indexing and retrieval
-- retrieve every page on the web
-- index every word
-- repeat every month
Getting better all the time
-- improved algorithms
-- faster computers and networks
-- analysis of users
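The "index every word" step can be sketched as an inverted index that maps each word to the set of pages containing it. This is a minimal illustration, assuming the pages have already been retrieved into memory; the URLs, page texts, and function names (`build_index`, `search`) are hypothetical, and a real service would add crawling, ranking, and far larger storage.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of URLs whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages standing in for the output of a crawl.
pages = {
    "http://example.org/a": "Digital libraries for scientists",
    "http://example.org/b": "Automated digital library services",
    "http://example.org/c": "Chess programs and tree search",
}
index = build_index(pages)

if __name__ == "__main__":
    print(sorted(search(index, "digital libraries")))
```

Rebuilding the index from a fresh crawl corresponds to the "repeat every month" step; nothing in the index requires human cataloguing.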
22. Web Search
Ranking algorithms
Closeness of match
-- vector space and statistical methods (Salton, et al., c. 1970)
Importance of digital object
-- Google ranks web pages by how many other pages link to them, and gives greater weight to links from higher-ranking pages.
(NSF/DARPA/NASA Digital Libraries Initiative)
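The link-based idea of importance can be sketched with a simplified iterative rank computation in the style of PageRank. This is an illustration under stated assumptions, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the treatment of pages without outgoing links, and the tiny link graph are all assumptions, and `link_rank` is a hypothetical name.

```python
def link_rank(links: dict[str, list[str]],
              damping: float = 0.85,
              iterations: int = 50) -> dict[str, float]:
    """Score pages so that a link from a high-scoring page counts for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "a" and "d" each have exactly one inbound link,
# but "d" is linked from the heavily linked page "c".
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"], "e": ["a"]}
rank = link_rank(links)

if __name__ == "__main__":
    for page, score in sorted(rank.items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

In this graph "d" outranks "a" even though each has a single inbound link, because the link to "d" comes from a page that is itself highly ranked. The closeness-of-match side (vector space methods) is a separate computation and is not shown here.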
23. Archiving and Preservation
Internet Archive
-- Monthly, a web crawler gathers every open access web page with associated images
-- Web pages are preserved for future generations
-- Files are available for scholarly research
not perfect ...
-- HTML pages and images; no Java applets or style sheets
-- materials are dumped with no organization or indexing
-- access for scholars is rudimentary
24. Reference Linking
Web of Science (ISI)
-- input: combination of automatic means and skilled people
-- limited number of journals
-- very expensive
ResearchIndex (a.k.a. CiteSeer, a.k.a. ScienceIndex) (NEC)
-- fully automatic
-- all open access material in computer science
-- a free service
25. Beyond Text
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news. Algorithms for
-- dividing raw video into discrete items
-- generating short summaries
-- indexing the sound track using speech recognition
-- recognizing faces
-- searching using natural language processing
(NSF/DARPA/NASA Digital Libraries Initiative)
26. Costs and Benefits
27. Costs of Catalogs and Indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals
-- only available for certain categories of material (e.g., monographs, scientific journals)
-- contain limited fields of information (e.g., no contents page)
-- restricted to static information
High costs reduce effectiveness and access
28. Costs of Automated Digital Libraries
The Google company
-- 5.5 million searches daily
-- 85 people (half technical, 14 with Ph.D. in computing)
-- 2,500 PCs running Linux, with 80 terabytes of disk
The Internet Archive
-- 7 people with support from Alexa
(March 2000)
29. Overall
If you are rich ...
-- Research libraries, using commercial information services, provide excellent service at very high cost to a favored few
-- Automated digital libraries are a long way from providing the personal reference service available to a faculty member at a well-endowed university
but ...
30. The Model T Library
The Model T Ford, with mass production, brought car travel to the masses ...
-- Automated digital libraries, with open access materials, can already provide good service at low cost
-- In the future, automated digital libraries can bring scientific, scholarly, medical and legal information to everybody at negligible cost
31. A Footnote
32. Library Expertise
The future of scientific and professional information is tied to computing, but ...
-- automated digital libraries need small teams of highly skilled people
-- development of automated digital libraries is bypassing libraries (Google, ResearchIndex, Informedia, Internet Archive)
The level of computing expertise in U.S. research libraries is depressingly low
33. Further reading
William Y. Arms, "Automated digital libraries." To be submitted to D-Lib Magazine, July/August 2000.
William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm