Title: LIS618 lecture 6
1LIS618 lecture 6
- Thomas Krichel
- 2004-03-14
2Structure
- Google
- news
- interfaces to non-web sources
- Usenet
- ODP
- relational databases
- OpenURL
- file sharing
3Google news
- Is a gathering of top stories from news sources.
- The entire page is built by computer. Which stories make it to the top depends on
- how prominently the stories appear on news sites
- which sites the stories appear on
- when the articles were published
- how many articles cover the same story
- Note the side bar with stories from different topic sections.
4special syntax for news I
- source: gives news from one source only
- example source:cnn works
- examples source:bbc, source:nytimes, source:"new york times" don't seem to get anywhere.
- location: gives a location. Can be a two-letter state code or a country
- location:ny
- location:russia
5special syntax for news II
- allintitle: searches for words in the title of the article (not of the page)
- example allintitle: dead injured
- allintext: searches for words in the text
- example allintext: saarland government
- allinurl: searches in article URLs
- example allinurl:bbc Wales
- Restrictions
- Only one allin??? special syntax per query.
- It must come first in the query.
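The two restrictions can be made concrete with a small query builder. This is only a sketch: the function name `build_news_query` and its parameters are made up for illustration, and the assumption that source: and location: may follow an allin operator is mine, not something the slides confirm.

```python
# Operators named on the slides; the builder accepts at most one of them.
ALLIN_OPERATORS = {"allintitle", "allintext", "allinurl"}


def build_news_query(words, allin=None, source=None, location=None):
    """Assemble a news query string (hypothetical helper, not a Google API)."""
    if allin is not None and allin not in ALLIN_OPERATORS:
        raise ValueError(f"unknown allin operator: {allin}")
    text = " ".join(words)
    parts = []
    if allin is not None:
        # Restriction: the single allin operator must come first in the query.
        parts.append(f"{allin}:{text}")
    elif text:
        parts.append(text)
    if source:
        parts.append(f"source:{source}")
    if location:
        parts.append(f"location:{location}")
    return " ".join(parts)


print(build_news_query(["dead", "injured"], allin="allintitle"))
print(build_news_query(["election"], source="cnn", location="ny"))
```

Taking the allin operator as a single optional argument makes it impossible to pass two of them, which is the simplest way to respect the first restriction.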
6Google interfaces to 3rd party data
- Google Groups is an interface to Usenet news.
- Google directory is an interface to the Open Directory Project.
- In both cases Google is dependent on the quality of the underlying data source.
7Usenet news
- Usenet is a collection of user-submitted notes on various subjects that are posted to servers on a worldwide network. Each subject collection of posted notes is known as a newsgroup.
- A newsgroup is a discussion about a particular subject consisting of notes written to a networked site and distributed through Usenet.
- Newsgroups are hierarchical. Hierarchical levels are separated by dots, for example comp.text.tex.
- alt, news, info, biz, rec, comp, sci, humanities, soc, misc, talk are the classic world-wide top-level hierarchies.
- alt stands for "alternative"; a popular joke expands it to "anarchists, lunatics and terrorists".
8Usenet history
- The idea of network news was born in 1979 when two graduate students, Tom Truscott and Jim Ellis, thought of using UUCP to connect machines for the purpose of information exchange among users. They set up a small network of three machines in North Carolina.
- UUCP is "UNIX to UNIX copy", a protocol that is used to copy files between machines running some flavor of UNIX, without the need for the IP protocol. Usenet is older than the Internet.
9decline of Usenet
- essentially open to all (peer-to-peer system)
- used by spammers for
- posting
- gathering addresses
- steady decline of quality of contribution
- steady decline of quantity of contributions
10Usenet worth checking out
- independent reviews of products, often written by experts.
- Example: interpretation of Beethoven sonatas by Wilhelm Kempff.
- Sorting by date reveals that the newsgroup rec.music.classical.recordings is still active. On a good day, you will find no finer guide to records.
11special syntax for Google Groups
- group: limits postings to a certain group
- intitle: limits to titles of postings
- author: searches for author name or email address
- Mixing syntaxes works well.
- Example intitle:kempff group:rec.music.classical.recordings
12the open directory project
- The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors.
- They claim a historic precedent in the Oxford English Dictionary.
- Formerly known as "GnuHoo", then "NewHoo", then acquired by Netscape, and called "dmoz".
13dmoz.org
- dmoz is maintained by volunteer "net-citizens". No special qualifications are required, but the editors are claimed to be experts.
- There are about 30,000 volunteers (they claim).
- Powers the core directory services for the Web's largest and most popular search engines and portals
- Netscape Search, AOL Search
- Google, Lycos
- HotBot, DirectHit
- Headquarters run by Netscape.
14Appearance of ODP
- If Google finds a relevant category it puts it into the result.
- Remember a Google response is a list of results.
- Each result has
- title
- snippet
- URL
- Some results optionally have a category attached. Following such categories is a winner if your information need is broad.
15full-text databases
- These databases have an emphasis on providing full-text information in a web environment.
- Their particular strength is the aggregation of material from a range of publishers.
- This especially concerns scholarly publishing, where the source material is distributed among a large number of sources.
16Access
- Some of the access is arranged via the Brooklyn LIU campus. We can use the on-campus access here.
- The databases have some full text, but not a lot.
17Proquest
- go into the database selection, delete everything and then use the research library.
- we can search for Paul Levine. It appears that
- not all articles have full text
- there is no distinction between different Paul Levines
- Otherwise it appears straightforward to use
18aggregators
- Proquest and ebsco work as aggregators. They put different scholarly journals together in one database, so you don't have to deal with publishers' different interfaces.
- Publishers are reluctant to join and impose moving-wall embargoes on full-text release.
- So you cannot always access the full text via them. But your library may have the text somewhere.
19the library as aggregator
- typically, a library buys holdings from a publisher, as well as cross-publisher abstract and indexing data.
- when users find a reference in an abstract and want to access the full text, they are stuck
- Herbert Van de Sompel has been working on this problem.
20special effects (SFX)
- Herbert's idea was to equip the interface with a "special effects" button.
- When users press the button, the interface would transmit metadata such as
- author name
- journal name
- title
- date
- to a special database, called a resolver.
21resolver
- The resolver examines the metadata and makes a decision on what to show to the user.
- if the journal is subscribed to and the date is recent, it may formulate a query to the publisher's database and fetch the record and/or full text there.
- if the journal is not held, suggest ILL
- etc
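The decision the resolver makes can be sketched as a few rules over the incoming metadata. Everything named here is hypothetical: the `Citation` fields, the `HOLDINGS` table, the journal names and the third fallback rule stand in for what a local library would actually configure.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    author: str
    journal: str
    title: str
    year: int


# Hypothetical local holdings: journal name -> first year with online access.
HOLDINGS = {"Journal A": 1997}


def extended_services(citation, holdings=HOLDINGS):
    """Return the services a resolver might offer for one citation."""
    first_online_year = holdings.get(citation.journal)
    if first_online_year is None:
        # Journal not held at all: suggest interlibrary loan.
        return ["request via ILL"]
    if citation.year >= first_online_year:
        # Subscribed and recent enough: point to the publisher.
        return ["full text at publisher", "record at publisher"]
    # Assumed extra rule for older volumes; the slide only says "etc".
    return ["check print holdings in the catalog"]


recent = Citation("Levine", "Journal A", "An example article", 2003)
print(extended_services(recent))
```

Note that the function returns a menu of services for the user to choose from, not a single answer to a query.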
22configuring the resolver
- librarians, who know the local setting, will configure the server so that users are given the appropriate extended services given the local circumstance.
- Note that what is returned is a set of extended services, not the response to a specific query.
23Bison Futé model
- This refers to further work by Herbert to generalize the idea.
- On a web page, you find a link. It has been made by the provider of the web page.
- But this link may not be appropriate. There may be better technology that allows you to move in the same direction but with your own link.
- In other words we talk about context-sensitive linking.
24OpenURL
- This is now a draft standard with NISO to standardize the special effects request.
- The OpenURL is a transport architecture for context objects.
- Context objects unite descriptions of
- the reference found
- the context in which it was found
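As a rough illustration, the metadata sent by the special effects button can be packed into a link to the resolver. The resolver address is a placeholder, and the key names (`genre`, `aulast`, `title`, `atitle`, `date`, `sid`) follow my understanding of the early OpenURL key/value syntax; treat them as assumptions and check the NISO draft before relying on them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical address of the library's resolver.
RESOLVER_BASE = "http://resolver.example.edu/menu"


def make_openurl(author_last, journal, article_title, year,
                 sid=None, base=RESOLVER_BASE):
    """Build a resolver link describing one journal article."""
    params = {
        "genre": "article",       # kind of item referenced
        "aulast": author_last,    # author name
        "title": journal,         # journal name
        "atitle": article_title,  # article title
        "date": str(year),        # publication date
    }
    if sid is not None:
        # Where the reference was found (the context), e.g. the database.
        params["sid"] = sid
    return base + "?" + urlencode(params)


url = make_openurl("Levine", "Journal A", "An example article", 2003)
print(url)
print(parse_qs(urlsplit(url).query)["aulast"])
```

The same reference yields a different menu at every library, because only the base address changes while the description of the reference stays the same.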
25implications for information retrieval
- The implications for the library world are already important.
- many library system software packages already implement OpenURLs and provide resolvers
- But the impact could be wider and could cover a whole new structure for the web, replacing static links with on-the-fly dynamic ones.
26Databases
- Databases are collections of data with some organization to them.
- The classic example is the relational database.
- But not all databases need to be relational databases.
27Relational databases
- A relational database is a set of tables. There may be relations between the tables.
- Each table has a number of records. Each record has a number of fields.
- When the database is being set up, we fix
- the size of each field
- relationships between tables
28Example Movie database
- ID  title                director           date
- M1  Gone with the wind   F. Ford Coppola    1963
- M2  Room with a view     Coppola, F Ford    1985
- M3  High Noon            Woody Allan        1974
- M4  Star Wars            Steve Spielberg    1993
- M5  Alien                Allen, Woody       1987
- M6  Blowing in the Wind  Spielberg, Steven  1962
- Single table
- No relations between tables, of course
29Problem with this database
- I made up all the data. It is just for illustration.
- Names are recorded inconsistently. There is no way to find films by Woody Allen without having to go through all spelling variations.
- Mistakes are difficult to correct. We have to wade through all records, a masochist's pleasure.
30Better movie database
- ID  title                director  year
- M1  Gone with the wind   D1        1963
- M2  Room with a view     D1        1985
- M3  High Noon            D2        1974
- M4  Star Wars            D3        1993
- M5  Alien                D2        1987
- M6  Blowing in the Wind  D3        1962
- ID  director name          birth year
- D1  Ford Coppola, Francis  1942
- D2  Allen, Woody           1957
- D3  Spielberg, Steven      1942
31Relational database
- We have a one-to-many relationship between directors and films
- Each film has one director
- Each director has directed many films
- Here it becomes possible for the computer, and then the user
- To know which films have been directed by Woody Allen
- To find which films have been directed by a director born in 1942
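A minimal sketch of these two lookups, using Python's built-in sqlite3 module and the made-up data from the two tables. The table and column names are my own choice, and the director is spelled "Allen, Woody" here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (
    id TEXT PRIMARY KEY,
    title TEXT,
    director_id TEXT REFERENCES director(id),  -- the one-to-many link
    year INTEGER
);
""")
conn.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allen, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])

JOIN = ("SELECT m.title FROM movie m "
        "JOIN director d ON m.director_id = d.id ")


def films_by_director(name):
    """Titles of all films whose director has the given name."""
    rows = conn.execute(JOIN + "WHERE d.name = ? ORDER BY m.id", (name,))
    return [title for (title,) in rows]


def films_by_director_birth_year(year):
    """Titles of all films whose director was born in the given year."""
    rows = conn.execute(JOIN + "WHERE d.birth_year = ? ORDER BY m.id", (year,))
    return [title for (title,) in rows]


print(films_by_director("Allen, Woody"))   # ['High Noon', 'Alien']
print(films_by_director_birth_year(1942))
```

Because each name is stored once in the director table, correcting a misspelling means changing a single record.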
32Many-to-many relationships
- Each film has one director, but many actors star in it. The relationship between actors and films is a many-to-many relationship.
- Here are a few actors
- ID  sex  actor name       birth year
- A1  f    Brigitte Bardot  1972
- A2  m    George Clooney   1927
- A3  f    Marilyn Monroe   1934
33Actor/Movie table
- actor id movie id
- A1 M4
- A2 M3
- A3 M2
- A1 M5
- A1 M3
- A2 M6
- A3 M4
- as many lines as required
34SQL
- Once we have the relational database, we can ask sophisticated questions
- Which director has had the most female actors working for him?
- In which years were films shot that starred actors born between 1926 and 1935?
- Such questions can be encoded in a language known as structured query language or SQL. All relational database vendors implement a dialect of SQL.
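Both questions can be written out against the made-up tables, again as a sketch with Python's sqlite3 module. The table names (`actor_movie` for the actor/movie table), the column names and the reading of "most female actors" as the number of distinct actresses are my assumptions.

```python
import sqlite3


def build_db():
    """Load the made-up movie, director, actor and actor/movie tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE movie (id TEXT PRIMARY KEY, director_id TEXT, year INTEGER);
    CREATE TABLE actor (id TEXT PRIMARY KEY, sex TEXT, name TEXT,
                        birth_year INTEGER);
    CREATE TABLE actor_movie (actor_id TEXT, movie_id TEXT);
    """)
    conn.executemany("INSERT INTO director VALUES (?, ?)", [
        ("D1", "Ford Coppola, Francis"), ("D2", "Allen, Woody"),
        ("D3", "Spielberg, Steven")])
    conn.executemany("INSERT INTO movie VALUES (?, ?, ?)", [
        ("M1", "D1", 1963), ("M2", "D1", 1985), ("M3", "D2", 1974),
        ("M4", "D3", 1993), ("M5", "D2", 1987), ("M6", "D3", 1962)])
    conn.executemany("INSERT INTO actor VALUES (?, ?, ?, ?)", [
        ("A1", "f", "Brigitte Bardot", 1972),
        ("A2", "m", "George Clooney", 1927),
        ("A3", "f", "Marilyn Monroe", 1934)])
    conn.executemany("INSERT INTO actor_movie VALUES (?, ?)", [
        ("A1", "M4"), ("A2", "M3"), ("A3", "M2"), ("A1", "M5"),
        ("A1", "M3"), ("A2", "M6"), ("A3", "M4")])
    return conn


def director_with_most_actresses(conn):
    """Director name and number of distinct female actors in their films."""
    return conn.execute("""
        SELECT d.name, COUNT(DISTINCT a.id) AS n
        FROM director d
        JOIN movie m ON m.director_id = d.id
        JOIN actor_movie am ON am.movie_id = m.id
        JOIN actor a ON a.id = am.actor_id
        WHERE a.sex = 'f'
        GROUP BY d.id, d.name
        ORDER BY n DESC
        LIMIT 1""").fetchone()


def years_with_actors_born_between(conn, first, last):
    """Years of films starring an actor born in the given range."""
    rows = conn.execute("""
        SELECT DISTINCT m.year
        FROM movie m
        JOIN actor_movie am ON am.movie_id = m.id
        JOIN actor a ON a.id = am.actor_id
        WHERE a.birth_year BETWEEN ? AND ?
        ORDER BY m.year""", (first, last))
    return [year for (year,) in rows]


db = build_db()
print(director_with_most_actresses(db))              # ('Spielberg, Steven', 2)
print(years_with_actors_born_between(db, 1926, 1935))  # [1962, 1974, 1985, 1993]
```

The actor/movie table appears in both queries as the bridge that turns the many-to-many relationship into two one-to-many joins.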
35importance of relational databases
- Relational databases dominate the world of structured information. Examples
- employment and payroll in a company
- stock management
- e-commerce
- There are quite easy ways to get relational
databases to work with web interfaces. Some are
freely available. The most common one is the LAMP
(Linux Apache MySQL PHP) architecture.
36relational databases in libraries
- A 2004 LITA enquiry revealed that many respondents most regretted not having learned more about relational databases in library school.
- But there are problems with relational databases in libraries
- Slow on very large databases (such as catalogs)
- Library data has nasty ad-hoc relationships, e.g.
- Translation of the first edition of a book
- CD supplement that comes with the print version
- These are difficult to deal with in a system where all relations and fields have to be set up at the start and cannot be changed easily later.
37off-web Internet information retrieval
- Under this heading, I principally think about activities known as file-sharing.
- They concern the (mostly illegal) exchange of files between users. Such files may encode
- music
- films
- There is a lot of it going on, but we are not
sure how much.
38Napster
- Napster was the first prominent file-sharing service.
- Napster ran a central server. You connected to that server and announced what files you had to share.
- Every search was conducted on the dataset assembled at the central server.
- Connections to download files were done between peer machines only.
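The division of labour can be sketched as a toy index: peers announce their file names, searches run against the central dataset, and the answer only says which peer to contact. The class and method names are hypothetical and this is not Napster's actual protocol.

```python
class CentralIndex:
    """Toy central server that knows file names but never holds files."""

    def __init__(self):
        self.files_by_peer = {}

    def announce(self, peer, filenames):
        # A peer connects and announces what it has to share.
        self.files_by_peer[peer] = set(filenames)

    def disconnect(self, peer):
        self.files_by_peer.pop(peer, None)

    def search(self, term):
        # Searches only touch the dataset assembled at the server.
        term = term.lower()
        hits = []
        for peer, names in sorted(self.files_by_peer.items()):
            for name in sorted(names):
                if term in name.lower():
                    hits.append((peer, name))
        return hits


index = CentralIndex()
index.announce("peer-1", ["kempff_sonata.mp3", "notes.txt"])
index.announce("peer-2", ["Kempff_live.mp3"])
# The result names the peers; the download itself would go peer to peer.
print(index.search("kempff"))
```

If the index object goes away, no peer can find any other peer, which is the weakness the next slide describes.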
39end of Napster
- Napster argued that since it was only involved in collecting the information about files available, it was legal.
- Napster never shared any illegal file.
- The courts thought otherwise.
- It was shut down.
- The Napster network died without its central machine.
- To enable true piracy, we need a truly
distributed system.
40gnutella protocol
- This protocol underlies much of the current file-sharing activity on the Internet.
- It enables a peer-to-peer network between machines. There every machine is both a client and a server, and is called a servent accordingly.
- To connect to a gnutella network, you need the IP address of one single machine that is already part of the network.
41connection to the gnutella network
- Once you establish a connection to the first servent, you announce your presence.
- The first servent will pass on that message to all the servents that it is connected to, and so on.
- This quickly adds up to a lot of traffic!
42time to live
- Every gnutella message has a time to live (TTL). It is decremented every time it passes through a servent.
- The TTL is usually quite small. It can be arbitrarily reduced by servents.
- Therefore you only talk to servents that are close to you. But your software will determine which servents to try to contact first. That usually depends on previous query results.
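How the TTL limits the reach of a message can be simulated on a small made-up network. This is a simplified model, not the gnutella wire protocol: the `network` dictionary and servent names are invented, and dropping messages a servent has already seen is an assumption of the sketch.

```python
from collections import deque


def flood(network, start, ttl):
    """Return the set of servents reached by a message sent from start."""
    reached = {start}
    queue = deque([(start, ttl)])
    while queue:
        servent, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL used up: the message is not passed on
        for neighbour in network[servent]:
            if neighbour not in reached:  # already-seen messages are dropped
                reached.add(neighbour)
                # Each hop decrements the TTL by one.
                queue.append((neighbour, remaining - 1))
    return reached - {start}


# Hypothetical network: a - b - c - d, with e also attached to b.
network = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}

print(sorted(flood(network, "a", 1)))  # ['b']
print(sorted(flood(network, "a", 2)))  # ['b', 'c', 'e']
```

With a TTL of 2, servent d is never reached even though it is part of the network, which is why you only talk to servents that are close to you.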
43searches
- When you do a search, it is passed on from servent to servent through the p2p network.
- Servents have their own rules for how to respond to queries.
- Most of the time search strings are matched against a file name.
- Some may try to match against the directory name.
- Some general queries may be rejected.
- Some result sets may be truncated.
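One possible set of response rules, as a sketch: match every query word against the file name, reject queries that are too short to be specific, and cut off long result lists. The thresholds and the function name `respond` are invented, since each servent is free to choose its own rules.

```python
MAX_RESULTS = 3       # hypothetical cap on the size of a result set
MIN_QUERY_LENGTH = 3  # hypothetical cut-off for "too general" queries


def respond(query, shared_files, max_results=MAX_RESULTS):
    """Return the shared file names this servent reports for a query."""
    words = query.lower().split()
    if not words or len(query.strip()) < MIN_QUERY_LENGTH:
        return []  # general query rejected
    hits = [name for name in shared_files
            if all(word in name.lower() for word in words)]
    return hits[:max_results]  # result set truncated


shared = [
    "kempff_beethoven_sonata_01.mp3",
    "kempff_beethoven_sonata_02.mp3",
    "kempff_schubert.mp3",
    "beethoven_symphony_5.mp3",
    "kempff_beethoven_sonata_03.mp3",
    "kempff_beethoven_sonata_04.mp3",
]

print(respond("kempff beethoven", shared))
print(respond("a", shared))  # []
```

Four files match the first query, but only three are reported, so the searcher cannot tell how much a servent really holds.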
44downloading
- If you see a file that you would like to have, you can try to download it.
- To implement downloads the servents use http. Thus everyone who is connected to a file-sharing network runs a web server!
- However, there usually is a tight limit on how many downloads a server will accept.
- Modern servents have the ability to download from several servents.
45ease to infringe
- Clearly all the traffic on gnutella, with current technology, can be observed.
- But the infringement is so massive that it appears difficult to clamp down on.
- The ease of infringement is technological.
- The RIAA has sued. They reach only the very tip of the iceberg, in the hope of dissuading others.
46http://openlib.org/home/krichel
- Thank you for your attention!