Title: LIS510 lecture 12
1LIS510 lecture 12
- Thomas Krichel
- 2006-12-13
2today
- Leftovers from last time.
- I discuss some elements of Bill Arms's book on Digital Libraries.
- It is an introductory book that is general, but smartly written.
- It is not a book to teach someone to become a digital librarian.
- LIS650 and LIS651 are for that. They really deal with the introduction to digital information.
- I also talk generally about understanding some digital contents.
3definition
- An informal definition of a digital library is a managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network.
- "Managed" is the key word here.
4benefits of digital libraries
- The digital library brings the library to the user.
- Computer power is used for searching and browsing.
- Information can be shared.
- Information is easier to keep current.
- The information is always available.
- New forms of information become possible.
5costs
- Non-digital libraries are very expensive.
- Digital libraries are also expensive. Many publishers charge more for online editions than for traditional print.
- However, the cost of the infrastructure is dropping.
- And there is potential for changes in the way information is supplied in digital libraries.
6technical change
- Electronic storage is becoming cheaper than paper.
- Personal computer displays are becoming more pleasant to use.
- High-speed networks are becoming widespread.
- Computers have become portable.
7libraries adapt
- Libraries get wired
- They offer electronic access, even to the home user.
- Other actions depend on the library type
- Some shift from information access to community center.
- Some adopt digital reference with 24/7 asynchronous help.
- Some get involved in digital archiving of institutional assets.
8digital library cost
- The digital library material will cost more initially because publishers want to see a return on the extra functionality they have developed.
- In the longer run, digital library costs may be lower than in print
- lower storage cost
- less risk to the items
- fewer (but differently trained) staff required
9classic roles for the library with digital
material
- Investigation of what to buy
- Negotiation of the purchase
- Acquisition of access to a service
- Installation of access devices
- Training of users
- Maintenance: update, migrate, replace
10beyond the library
- The classic roles will at best be a stagnating, if not declining, source of work for information professionals.
- The rise of open access will mean that fewer assets than before will have to be purchased. Today's example: http://dme.mozarteum.at
- Training needs of users decline as digital media are getting easier to use.
11new roles for information professionals
- The information age does not happen without information professionals.
- There is a huge demand for tech-savvy information professionals out there. Examples include
- web site maintenance
- digital archiving
12impact of technology on staff
- Information professionals who are technologically savvy will thrive better than those who are not.
- Fortunately the Palmer School offers LIS508, LIS650, LIS651.
- It still does not have a system administration class, but that may come as well.
13impact of technology on staff
- Constant computer use can cause serious health problems.
- Problem areas are
- bad posture at the desk
- eye strain
- The use of the mouse is particularly bad. Learn how to avoid using it.
- Injuries take a long time to heal.
14digital libraries are hard
- In digital libraries, terminology is a serious problem. Basic concepts are hard to define.
- These definition problems also hurt efforts to build sophisticated information systems by semi-automated means.
- We live in the age of the brute-force calculation, not the age of artificial intelligence.
15data and metadata
- Metadata is data about data. The distinction between data and metadata often depends on the context.
- Metadata is often divided into
- descriptive metadata
- structural metadata
- administrative metadata
16what's in the digital library?
- Items ?
- Material ?
- Documents ?
- Objects?
- Digital Items ?
- Digital Material ?
- Digital Documents ?
- Digital Objects ?
17storage and dissemination
- Items are stored in digital format in a way we can call the stored form of the item.
- When the item is shown to the user, it is shown as a presentation or dissemination. This is the way the object leaves the server.
- When it arrives at the users' machines, they have to render the presentation.
18users and clients
- A user is someone who uses a digital library. Many times, the user is anonymous and cannot be identified.
- A client is software that the user runs to use the digital library. Sometimes this is called a user agent. Many times people refer to it as a browser.
19work and contents
- These are difficult things to discuss. Look at the example of the song Der Lindenbaum. It could mean
- song as sound and words
- score
- performance
- recording
- mp3 file containing the recording
20repositories
- This is a general term used to talk about a computer system that has primarily the function of storing contents.
- When long-run storage is involved a repository becomes an archive.
- A server is a computer that is switched on constantly to provide services to the public.
21an example of terminology
- A data model is an abstraction (or an extra level of indirection) for digital objects such that each digital object can be seen as an instance of the class defined by the data model.
- A surrogate is a transmittable serialization or representation of a digital object that can be passed back and forth so we can do things with it. Possible serialization techniques include XML and RDF/XML.
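To make the idea of a surrogate concrete, here is a minimal sketch in Python. The object, its field names and the element name `object` are all made up for illustration; they do not come from any particular digital library system. The sketch only shows that a digital object can be turned into XML text, sent somewhere, and rebuilt.

```python
import xml.etree.ElementTree as ET

# Hypothetical digital object; the field names are illustrative only.
digital_object = {
    "identifier": "obj-001",
    "title": "Der Lindenbaum",
    "format": "audio/mpeg",
}

def to_surrogate(obj):
    """Serialize a dict-based digital object into an XML string."""
    root = ET.Element("object")
    for name, value in obj.items():
        child = ET.SubElement(root, name)
        child.text = value
    return ET.tostring(root, encoding="unicode")

def from_surrogate(xml_text):
    """Rebuild the dict from the XML surrogate."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

surrogate = to_surrogate(digital_object)
print(surrogate)
print(from_surrogate(surrogate) == digital_object)
```

The round trip back to the same dictionary is what makes the XML text usable as a stand-in for the object.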
22a digital library from scratch
- Much of the data that is stored in digital libraries is text.
- Most other material that is not textual in nature, such as
- sound files
- graphics
- needs textual metadata in order to be found.
- Current technology is not able to find it otherwise.
23Information
- Information is best understood as what it takes to answer a question.
- The simplest question has a yes or no answer. Therefore a bit is the natural measure of information.
- Term first used by John Tukey in 1946.
- Contraction of "binary digit".
24Usage of bits
- Computers are sometimes classified by the number of bits they can process at one time, e.g. "32 bit processor".
- Graphics are also often described by the number of bits used to represent each dot.
25bits and bytes
- A bit can take the values 0 or 1, thus it can describe 2 possibilities.
- Two bits can take the values 00, 01, 10, 11, thus they can describe four (2 power 2) possibilities.
- n bits can encode 2 power n possibilities.
- The first chips used to process 8 bits at a time. It became customary to refer to 8 bits as a byte. It can encode 2 power 8 possibilities.
- We can use binary numbers just as decimal numbers.
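The counting rule can be checked with a few lines of Python. This is only a sketch; the function name `possibilities` is made up for the example.

```python
def possibilities(n_bits):
    """Number of distinct values that n_bits can encode."""
    return 2 ** n_bits

for n in (1, 2, 8):
    print(n, "bits ->", possibilities(n), "possibilities")

# All patterns of two bits: 00, 01, 10, 11
two_bit_patterns = [format(i, "02b") for i in range(possibilities(2))]
print(two_bit_patterns)

# Binary numbers work like decimal ones: 101 in binary is 4 + 0 + 1.
five = int("101", 2)
print(five, bin(five))
```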
26application of bytes
- IP (Internet Protocol) numbers are used as the addresses of computers on the Internet.
- In IP version 4 (the one that is most commonly used), each IP number has 4 bytes.
- It is represented as x.x.x.x where x is a number between 0 and 255 (why?)
- How many computers can there be on the Internet at any one time?
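The two questions can be explored with Python's standard `ipaddress` module. The address below is an arbitrary example picked for illustration. The count at the end is only an upper bound on distinct addresses, since in practice not every address is available for a computer.

```python
import ipaddress

# Example address chosen for illustration only.
addr = ipaddress.IPv4Address("192.0.2.17")

# The address packs into exactly 4 bytes, each between 0 and 255.
octets = list(addr.packed)
print(octets)

# One byte has 2 power 8 = 256 values, so the largest one is 255.
max_octet = 2 ** 8 - 1

# Four bytes are 32 bits, which bounds the number of distinct addresses.
total_addresses = 2 ** 32
print(max_octet, total_addresses)
```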
27Many bytes
- Larger units are
- Kilobyte is 2 power 10 bytes (1024 bytes)
- Megabyte is 2 power 20 bytes
- Gigabyte is 2 power 30 bytes
- Terabyte is 2 power 40 bytes
- From ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo dates back to the French revolution; the larger prefixes were adopted later.
28Hex numbers
- A byte is often represented by two hex digits.
- Each hex digit can encode 16 values
- Written 0 to 9, then A B C D E F. F is 15.
- Conventionally prefixed with 0x
- Use the Microsoft calculator in scientific view to convert.
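Any programming language can do the same conversion as the calculator. A short Python sketch, with the byte value F5 chosen arbitrarily:

```python
value = 0xF5            # the 0x prefix marks a hex literal
print(value)            # the same byte as a decimal number

# One byte always fits in two hex digits.
as_hex = format(value, "02X")
print(as_hex)

# Convert a hex string back to a number.
back = int("F5", 16)

# A single hex digit covers 16 values, 0 through F (15).
digits = "0123456789ABCDEF"
print(len(digits), int("F", 16))
```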
29applications of hex numbers
- Media Access Control (MAC) addresses of hardware that allows access to computer networks. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 006008F520A9
- Character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint.
- Color codes on web pages use 6 hex digits.
- 000000 is black
- FFFFFF is white
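Both applications come down to reading pairs of hex digits as bytes. A minimal Python sketch follows; the helper name `hex_color_to_rgb` is made up, and it assumes a color code is read as one byte each for red, green and blue.

```python
mac = "006008F520A9"          # the MAC address example from the slide

# Split the 12 hex digits into 6 bytes.
mac_bytes = bytes.fromhex(mac)
print(len(mac_bytes), list(mac_bytes))

def hex_color_to_rgb(code):
    """Turn a 6-digit hex color code into (red, green, blue)."""
    raw = bytes.fromhex(code)
    return raw[0], raw[1], raw[2]

print(hex_color_to_rgb("000000"))  # black: no light in any channel
print(hex_color_to_rgb("FFFFFF"))  # white: full light in every channel
```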
30Information in a computer file
- A file is a piece of data stored on a computer.
- Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101
- For a computer to make sense of a file, it has to know what type of file it is.
31executable files
- Files that are executable are files that make the computer do something. For example the file starts a program, say PowerPoint. An executable on one computer may not run on another one.
- Non-executable files hold data that is used by an executable file. We will call them data files. Example: a PowerPoint slides file.
32Characters
- Much of the information processed by computers is in the form of characters.
- From Wikipedia:
- A character is a unit of information that roughly corresponds to a grapheme, or written symbol, of a natural language, such as a letter, numeral, or punctuation mark.
- A character is not a grapheme because there are ligatures.
33control characters
- The concept also includes control characters, which do not correspond to natural language symbols but to other bits of information used to process texts of the language, such as instructions to printers or other devices that display such texts.
- An example for such a control character is the newline character.
34text files
- Many data files contain textual data.
- Textual data is a sequence of characters.
- A character is an elementary symbol that has some meaning
- alphabet letter
- hieroglyph
- Example: an email file
- Text files can be read by many computer programs.
35non-text files
- Examples for non-text files are
- graphics files
- movie files
- sound files
- Non-text files are of minor significance in library settings
- There is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate.
- Traditional library material is textual
- We will talk about this later.
36Representing characters
- Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set.
- Examples for characters are
- a
- c
- ë
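The correspondence can be seen directly in Python, which exposes the number behind each character. The numbers shown are Unicode numbers, which is an assumption of this sketch because Python uses Unicode natively; other character sets may assign different numbers.

```python
# Each character corresponds to a number, and each number to a character.
examples = ["a", "c", "\u00eb"]   # "\u00eb" is the letter e with diaeresis
for ch in examples:
    print(ch, "->", ord(ch))

# The mapping works in both directions.
code_a = ord("a")
char_back = chr(code_a)
code_e_diaeresis = ord("\u00eb")
print(code_a, char_back, code_e_diaeresis)
```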
37Legacy character sets
- In the early days, computers were a lot less powerful than they are today.
- They could only deal with the characters that are most commonly used.
- Such sets are
- ASCII
- ISO-8859-1
- cp1252
38ASCII
- American Standard Code for Information Interchange
- 7-bit character set. There is no such thing as 8-bit ASCII
- 95 printable symbols
- 33 control characters (0-31, 127)
- http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127
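The counts can be verified in a few lines of Python. This is a sketch that simply enumerates the 128 positions; the variable names are made up.

```python
# ASCII uses 7 bits, so it has 2 power 7 = 128 code positions.
ascii_size = 2 ** 7

# Control characters sit at positions 0-31 and at 127.
control = [i for i in range(ascii_size) if i < 32 or i == 127]

# Everything from 32 (space) to 126 (~) is a printable symbol.
printable = [i for i in range(ascii_size) if 32 <= i <= 126]

print(len(control), len(printable), len(control) + len(printable))
print("".join(chr(i) for i in printable))
```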
39some ASCII control characters
- CR (13, ^M) is the carriage return
- LF (10, ^J) is the linefeed
- FF (12, ^L) is the form feed (new page)
- BS (8, ^H) is the backspace
- DEL (127, ALT-127) is delete
- ESC (27, ^[) is escape
40ISO-8859-1
- ISO-8859-1, aka ISO-latin-1, extends ASCII with characters that are commonly used by the western European languages.
- It is the default character set of HTML.
- Positions 128 to 159 are not used.
- Cp1252 fills these with graphic chars. It is a Microsoft character set.
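The difference between the two sets shows up when the same byte is decoded both ways. A small Python sketch: note that Python's `latin-1` codec maps positions 128 to 159 to invisible control codes rather than rejecting them, so "not used" here means "no graphic character".

```python
# Byte 0xEB (235) is the same letter in both character sets.
shared = bytes([0xEB])
latin1_char = shared.decode("latin-1")
cp1252_char = shared.decode("cp1252")
print(latin1_char, cp1252_char)

# Byte 0x80 (128) lies in the 128-159 range where the sets differ.
differs = bytes([0x80])
in_cp1252 = differs.decode("cp1252")    # a graphic character, the euro sign
in_latin1 = differs.decode("latin-1")   # an invisible control code
print(repr(in_cp1252), repr(in_latin1))
```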
41This is not enough
- There are around 6800 different languages.
- Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones!
- Setting up a character set for all languages is almost impossible.
42ISO 10646-1
- Defines the Universal Character Set (UCS)
- UCS contains the characters used by many known languages, even the likes of Oriya, Telugu, Bopomofo, Runic.
- ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits.
- Not finished.
43Unicode
- ISO is an inter-government agency. Slow and bureaucratic.
- Industry has come together to work on Unicode, originally designed as a 2-byte character set.
- With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS.
- Much better documented standard.
44Unicode and legacy sets
- The first 128 characters are identical to those in ASCII
- The next 128 characters are identical to ISO 8859-1 (Latin-1).
- Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.
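The overlap with the legacy sets can be checked mechanically. In this Python sketch, `chr(i)` gives the Unicode character with number i, and the `ascii` and `latin-1` codecs decode the legacy bytes.

```python
# Every ASCII byte decodes to the Unicode character with the same number.
ascii_bytes = bytes(range(128))
unicode_first_128 = "".join(chr(i) for i in range(128))
ascii_match = ascii_bytes.decode("ascii") == unicode_first_128

# The same holds for all 256 positions of ISO 8859-1 (Latin-1).
latin1_bytes = bytes(range(256))
unicode_first_256 = "".join(chr(i) for i in range(256))
latin1_match = latin1_bytes.decode("latin-1") == unicode_first_256

print(ascii_match, latin1_match)
```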
45Beyond characters
- There is more to text than a string of characters.
- There is layout
- titles
- abstracts
- mathematical formula spacing
46Layout
- Layout can be conveyed by additional text that has special meaning. Examples:
- LaTeX
- HTML
- PostScript
- Another way is to do non-textual layout by adding some other digital signals. Examples:
- DVI
- MS Word
- MS PowerPoint
- These cannot be shown in these slides!
47Example LaTeX
- \bigskip\textbf{Class structure}
- Classes will be held in the computer lab in the Palmer School between 18:15 and 20:45. An optional practice session will last until 21:15.
- \begin{tabular}{@{}llll@{}}
- 0 & 2006--09--12 & introduction to the course \\
- 1 & 2006--09--19 & libraries and food \\
- 2 & 2006--09--26 & introduction to shushing \\
48Example HTML
- <p><strong>Class structure</strong><p>Classes will be held in the computer lab in the Palmer School between 18:15 and 20:45. An optional practice session will last until 21:15.<p>Class details
- <p><center><table width=100% border=1>
- <tr><td align=left> 0 </td><td align=left> 2006&#8211;09&#8211;12 </td><td align=left><a href="lis510w06a-00.ppt">introduction to the course</a> </td></tr><tr><td align=left> 1 </td><td align=left> 2006&#8211;09&#8211;19 </td><td align=left><a href="lis510w06a-01.ppt">libraries and food</a> </td>
49Example PostScript
- Fc(Class)g(structur)o(e)-104 3956 y
Fd(Classes)26 b(will)g(be)e(held)g(in)h(the)f(computer)f(lab)i(in)f(the)h(P)o(almer)f(School)g(between)f(18:15)h(and)g(20:45.)36 b(An)25
b(optional)e(practice)h(session)-104 4055
y(will)d(last)g(until)f(21:15.)-104 4155
y(Class)i(details)-104 4307 y(0)141
b(2003\22609\22623)94 b(introduction)18
b(to)i(the)h(course)-104 4407 y(1)141
b(2002\22609\22630)94 b(bits)21
b(bytes)f(and)g(characters)-104 4507 y(2)141
b(2003\22610\22607)94 b(databases)20
b(and)g(markup)e(languages)-
50DVI (rendition, "class structure")
- 1659: fntnum27 current font is ptmb8t
- 1660: setchar67 h:=-820459+473168=-347291, hh:=-22
- 1661: setchar108 h:=-347291+182183=-165108, hh:=-10
- 1662: setchar97 h:=-165108+327680=162572, hh:=11
- 1663: setchar115 h:=162572+254928=417500, hh:=27
- 1664: setchar115 h:=417500+254928=672428, hh:=43
- 1665: right3 163840 h:=672428+163840=836268, hh:=53
- 1669: setchar115 h:=836268+254928=1091196, hh:=69
- 1670: setchar116 h:=1091196+218232=1309428, hh:=83
- 1671: setchar114 h:=1309428+290976=1600404, hh:=101
- 1672: setchar117 h:=1600404+364376=1964780, hh:=124
- 1673: setchar99 h:=1964780+290976=2255756, hh:=142
- 1674: setchar116 h:=2255756+218232=2473988, hh:=156
- 1675: setchar117 h:=2473988+364376=2838364, hh:=179
- 1676: setchar114 h:=2838364+290976=3129340, hh:=197
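The numbers after setchar in this listing are character numbers, so the text can be recovered from the dump. A small Python sketch, assuming the font uses ASCII positions for these letters and treating the "right3" move as a word space:

```python
# Character numbers from the setchar lines; None marks the "right3"
# command, which moves right and so acts as a word space here.
codes = [67, 108, 97, 115, 115, None,
         115, 116, 114, 117, 99, 116, 117, 114]

text = "".join(" " if c is None else chr(c) for c in codes)
print(text)
```

The listing stops one character short, which is why the last letter of the title is missing from the decoded text.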
51XML
- XML is the Extensible Markup Language. It has become the lingua franca for structured textual data.
- It is also increasingly used on the web.
52Databases
- Databases are collections of data with some organization to them.
- The classic example is the relational database.
- But not all databases need to be relational databases.
53Relational databases
- A relational database is a set of tables. There may be relations between the tables.
- Each table has a number of records. Each record has a number of fields.
- When the database is being set up, we fix
- the size of each field
- relationships between tables
54Example Movie database
- ID | title | director | date
- M1 | Gone with the wind | F. Ford Coppola | 1963
- M2 | Room with a view | Coppola, F Ford | 1985
- M3 | High Noon | Woody Allan | 1974
- M4 | Star Wars | Steve Spielberg | 1993
- M5 | Alien | Allen, Woody | 1987
- M6 | Blowing in the Wind | Spielberg, Steven | 1962
- Single table
- No relations between tables, of course
55Problem with this database
- All the data is wrong, but this is just for illustration.
- Names are recorded inconsistently. There is no way to find films by Woody Allen without having to go through all spelling variations.
- Mistakes are difficult to correct. We have to wade through all records, a masochist's pleasure.
56Better movie database
- ID | title | director | year
- M1 | Gone with the wind | D1 | 1963
- M2 | Room with a view | D1 | 1985
- M3 | High Noon | D2 | 1974
- M4 | Star Wars | D3 | 1993
- M5 | Alien | D2 | 1987
- M6 | Blowing in the Wind | D3 | 1962
- ID | director name | birth year
- D1 | Ford Coppola, Francis | 1942
- D2 | Allan, Woody | 1957
- D3 | Spielberg, Steven | 1942
57Relational database
- We have a one-to-many relationship between directors and films
- Each film has one director
- Each director may have directed many films
- Here it becomes possible for the computer
- To know which films have been directed by Woody Allen
- To find which films have been directed by a director born in 1942
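The two lookups can be tried out with SQLite, which ships with Python. This is a sketch, not a recommended design: the table names, column names and helper functions are made up for the example, and the rows are the deliberately wrong illustration data of the better movie database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE film (
    id TEXT PRIMARY KEY,
    title TEXT,
    director_id TEXT REFERENCES director(id),  -- the "one" side
    year INTEGER
);
""")
con.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
con.executemany("INSERT INTO film VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])

def films_by_director(director_id):
    """Titles of all films pointing at one director record."""
    rows = con.execute(
        "SELECT title FROM film WHERE director_id = ? ORDER BY id",
        (director_id,))
    return [row[0] for row in rows]

def films_by_birth_year(year):
    """Titles of films whose director was born in the given year."""
    rows = con.execute(
        "SELECT film.title FROM film "
        "JOIN director ON film.director_id = director.id "
        "WHERE director.birth_year = ? ORDER BY film.id",
        (year,))
    return [row[0] for row in rows]

print(films_by_director("D2"))
print(films_by_birth_year(1942))
```

Because each director's name is stored once, a spelling mistake is corrected in one record instead of in every film.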
58Many-to-many relationships
- Each film has one director, but many actors star in it. The relationship between actors and films is a many-to-many relationship.
- Here are a few actors
- ID | sex | actor name | birth year
- A1 | f | Brigitte Bardot | 1972
- A2 | m | George Clooney | 1927
- A3 | f | Marilyn Monroe | 1934
59Actor/Movie table
- actor id | movie id
- A1 | M4
- A2 | M3
- A3 | M2
- A1 | M5
- A1 | M3
- A2 | M6
- A3 | M4
- as many lines as required
60SQL
- Once we have the relational database, we can ask sophisticated questions
- Which director has had the most female actors working for him?
- In which years were films shot that starred actors born between 1926 and 1935?
- Such questions can be encoded in a language known as Structured Query Language or SQL. All relational database vendors implement a dialect of SQL.
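Both questions can be written down in SQL and run with SQLite from Python. This is a sketch under assumptions: the table and column names are invented for the example, the SQL is in SQLite's dialect, and the rows are the illustration data of the preceding slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE film (id TEXT PRIMARY KEY, title TEXT,
                   director_id TEXT, year INTEGER);
CREATE TABLE actor (id TEXT PRIMARY KEY, sex TEXT,
                    name TEXT, birth_year INTEGER);
-- The link table carries the many-to-many relationship.
CREATE TABLE actor_film (actor_id TEXT, film_id TEXT);

INSERT INTO director VALUES
  ('D1', 'Ford Coppola, Francis', 1942),
  ('D2', 'Allan, Woody', 1957),
  ('D3', 'Spielberg, Steven', 1942);
INSERT INTO film VALUES
  ('M1', 'Gone with the wind', 'D1', 1963),
  ('M2', 'Room with a view', 'D1', 1985),
  ('M3', 'High Noon', 'D2', 1974),
  ('M4', 'Star Wars', 'D3', 1993),
  ('M5', 'Alien', 'D2', 1987),
  ('M6', 'Blowing in the Wind', 'D3', 1962);
INSERT INTO actor VALUES
  ('A1', 'f', 'Brigitte Bardot', 1972),
  ('A2', 'm', 'George Clooney', 1927),
  ('A3', 'f', 'Marilyn Monroe', 1934);
INSERT INTO actor_film VALUES
  ('A1', 'M4'), ('A2', 'M3'), ('A3', 'M2'), ('A1', 'M5'),
  ('A1', 'M3'), ('A2', 'M6'), ('A3', 'M4');
""")

# Which director has had the most female actors working for him?
top_director = con.execute("""
    SELECT director.name, COUNT(DISTINCT actor.id) AS n
    FROM director
    JOIN film ON film.director_id = director.id
    JOIN actor_film ON actor_film.film_id = film.id
    JOIN actor ON actor.id = actor_film.actor_id
    WHERE actor.sex = 'f'
    GROUP BY director.id
    ORDER BY n DESC
    LIMIT 1
""").fetchone()

# In which years were films shot that starred actors born 1926-1935?
years = [row[0] for row in con.execute("""
    SELECT DISTINCT film.year
    FROM film
    JOIN actor_film ON actor_film.film_id = film.id
    JOIN actor ON actor.id = actor_film.actor_id
    WHERE actor.birth_year BETWEEN 1926 AND 1935
    ORDER BY film.year
""")]

print(top_director)
print(years)
```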
61databases in libraries
- Relational databases dominate the world of structured data
- But they are not so popular in libraries
- Slow on very large databases (such as catalogs)
- Library data has nasty ad-hoc relationships, e.g.
- Translation of the first edition of a book
- CD supplement that comes with the print version
- These are difficult to deal with in a system where all relations and fields have to be set up at the start and cannot be changed easily later.
62http://openlib.org/home/krichel
- Thank you for your attention!