Title: LIBR 557 Advanced Information Retrieval
1How Information is Organized
- Databases and the Structure of Information
2The structure of information
- Understanding the structure of how something is
organized is the first huge step in effective
retrieval and controlling the information anxiety
beast.
  -- Richard Saul Wurman, Information Anxiety guru
3Class Objective
- to introduce the organization of information in
the traditional information retrieval model
4Class Outline
- What is a database?
- Structure of a simple database
- Making a database searchable
51. What is a Database?
- A computerized collection of information that is
arranged in a way that makes it easy to retrieve
information
- A collection of machine-readable information
accessible through a computer
6Why study databases?
- Often contain unique information
- Information is often reliable
- Cover hundreds of publications in a single search
request
- Detailed, human indexing
- Search syntax is sophisticated and powerful
72. Structure of a Simple Database
- Data
- raw information
- Records
- discrete units of information
- Fields
- distinct part or section of a record
8The repetitive lyrics database
- 3 songs with highly irritating, repetitive lyrics
9Repetitive lyrics database song 1
- Oh yeah, I'll tell you something,
- I think you'll understand.
- When I say that something
- I want to hold your hand,
- I want to hold your hand,
- I want to hold your hand.
10Repetitive lyrics database song 2
- I, I will always, always love you
- I will always love you
- I will always love you
- I will always love you
- (Note: change this to something with more words,
something tricky like "I'll", "you'll")
11Repetitive lyrics database song 3
- I'll tell you what I want, what I really really
want,
- So tell me what you want, what you really really
want,
- I wanna, I wanna, I wanna, I wanna, I wanna
really really really wanna zigazig ah
12Structure of a Simple Database
- Data
- raw information
- Records
- discrete units of information
- Fields
- distinct part or section of a record
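As a rough illustration of these three terms, the sketch below stores the three songs as records, each with two fields. The field labels ID and LY, the shortened lyrics, and the use of Python dictionaries are all assumptions made for this example; they are not taken from any particular database system.

```python
# Data: the raw lyrics. Record: one song. Field: a labelled part of a record.
# The field labels "ID" and "LY" are hypothetical, chosen for illustration.
lyrics_database = [
    {"ID": 1, "LY": "I want to hold your hand, I want to hold your hand"},
    {"ID": 2, "LY": "I will always love you, I will always love you"},
    {"ID": 3, "LY": "I'll tell you what I want, what I really really want"},
]


def get_field(database, record_id, field_label):
    """Return one field of one record, or None if the record is missing."""
    for record in database:
        if record["ID"] == record_id:
            return record.get(field_label)
    return None


print(len(lyrics_database))                 # number of records
print(get_field(lyrics_database, 2, "LY"))  # lyrics field of record 2
```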
13Types of Electronic Databases
- Bibliographic databases
- Full-text databases
- Numeric databases
- Directory databases
14Typical Fields in a Bibliographic Database
- Unique identifier
- Title
- Author
- Date published
- Journal name
- Publisher
- Document Type (e.g., book review)
- Subjects or descriptors
- Abstract
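One possible way to picture such a record is as a small data class with one attribute per field. This is only a sketch: the attribute names and all sample values are invented for illustration and do not come from a real database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BibliographicRecord:
    """One record; each attribute corresponds to a field in the list above."""
    unique_id: str
    title: str
    authors: List[str]
    date_published: str
    journal_name: str = ""
    publisher: str = ""
    document_type: str = ""
    descriptors: List[str] = field(default_factory=list)
    abstract: str = ""


# Hypothetical sample values, invented purely for illustration.
sample = BibliographicRecord(
    unique_id="000123",
    title="An Example Article Title",
    authors=["Doe, Jane"],
    date_published="1999-01",
    journal_name="Example Journal",
    document_type="book review",
    descriptors=["Information retrieval", "Databases"],
)
print(sample.title, sample.descriptors)
```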
15Types of Electronic Databases
- Bibliographic databases
- Full-text databases
- Numeric databases
- Directory databases
163. Making a Database Searchable
- Which fields will be searchable?
- What type of indexing will be used for each
searchable field?
- Are there any words that shouldn't be indexed?
17Inverted Indexing
- In a back-of-the-book index, entries point to a
specific page or paragraph in the book
- In an inverted index, entries point to a specific
record in the database
- Inverted indexing is done by software
18Advantages of Inverted Indexing
- Speed of Retrieval
- Word Position
- Field searching
- Proximity and phrase searching
- Word Frequency
19Steps in Inverted Indexing
- Provide unique identifier for each document if
none exists
- Analyze each record for significant words
- Generate alphabetical list of significant words
with a pointer to the unique identifier
20Inverted Indexing, Step 2
- What is a Significant Word?
- Significant words are all words except stop words
(words like AND, AN, FOR, TO, THE)
- Stop words slow down the indexing and searching
process
- Position of stop words is marked to enable proximity
searching
- Stop words can still be indexed as part of a phrase
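The stop-word test can be sketched in a few lines. The stop list below contains only the five words named above, and the sample sentence and function names are hypothetical. Real systems define their own stop lists.

```python
import re

# The five stop words named on the slide; real systems define their own lists.
STOP_WORDS = {"and", "an", "for", "to", "the"}


def is_significant(word):
    """A word is significant if it is not a stop word (case-insensitive)."""
    return word.lower() not in STOP_WORDS


def significant_words(text):
    """Split text into words and keep only the significant ones."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [w.lower() for w in words if is_significant(w)]


print(significant_words("Search strategies for the reference desk"))
```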
21Inverted Indexing, Step 2
- Analyze record for significant words
- Divide record into fields
- Label field (e.g., AU, TI)
- Note position of each word in the field
- Significant words are identified
- Position of stop words also noted
22Inverted Indexing, Step 2 Quotations Database
- Record 123
- Quote field (QU)
- To be or not to be, that is the question
23Inverted Indexing, Step 2 Quotations Database
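- Word Position
- To (stop word, position noted)
- be QU002
- or QU003
- not QU004
- to (stop word, position noted)
- be QU006
- that QU007
- is QU008
- the (stop word, position noted)
- question QU0010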
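The position labels for Record 123 can be reproduced with a short sketch. It assumes the five stop words named earlier in the lecture and mirrors the slide's label notation (field label, then "00", then the word position). The function name is hypothetical.

```python
import re

STOP_WORDS = {"and", "an", "for", "to", "the"}  # stop words named in the lecture


def tag_positions(text, field_label):
    """Return (word, position_label) pairs for the significant words.

    Every word, stop word or not, advances the position counter, so the
    gaps left by stop words are preserved for proximity searching.
    """
    tagged = []
    for position, word in enumerate(re.findall(r"[A-Za-z0-9]+", text), start=1):
        if word.lower() in STOP_WORDS:
            continue  # the position is consumed but no index entry is made
        # Label format (field + "00" + position) mirrors the slide's notation.
        tagged.append((word.lower(), f"{field_label}00{position}"))
    return tagged


quote = "To be or not to be, that is the question"
for word, label in tag_positions(quote, "QU"):
    print(word, label)
```

Because "To", "to" and "the" still consume positions 1, 5 and 9, the remaining words keep their true distance from one another.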
24Inverted Indexing, Step 3
- Generate (parse) a list of significant words with
a pointer to a record's unique identifier
- Sort these words alphabetically
- Remove duplicates
25Inverted Indexing, Step 3 Generate List of
Significant Words
- Word File Position
- be 123 QU002
- or 123 QU003
- not 123 QU004
- be 123 QU006
- that 123 QU007
- is 123 QU008
- question 123 QU0010
26Inverted Indexing, Step 3 Sort List
Alphabetically
- Word File Position
- be 123 QU002
- be 123 QU006
- is 123 QU008
- not 123 QU004
- or 123 QU003
- question 123 QU0010
- that 123 QU007
27Inverted Indexing, Step 3 Remove Duplicates
- Word File Frequency Position
- be 123 2 QU002,QU006
- is 123 1 QU008
- not 123 1 QU004
- or 123 1 QU003
- question 123 1 QU0010
- that 123 1 QU007
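The sort and merge in Step 3 can be sketched as follows, starting from the unsorted entries for Record 123. The tuple layout and the function name are assumptions for illustration; an actual retrieval system would store its index files differently.

```python
from collections import defaultdict

# Unsorted entries as generated in Step 3: (word, record id, position label).
entries = [
    ("be", 123, "QU002"), ("or", 123, "QU003"), ("not", 123, "QU004"),
    ("be", 123, "QU006"), ("that", 123, "QU007"), ("is", 123, "QU008"),
    ("question", 123, "QU0010"),
]


def merge_entries(entries):
    """Sort entries alphabetically and merge duplicates of (word, record)."""
    merged = defaultdict(list)
    for word, record_id, position in entries:
        merged[(word, record_id)].append(position)
    # Each row: word, record id, frequency, list of positions.
    return [
        (word, record_id, len(positions), positions)
        for (word, record_id), positions in sorted(merged.items())
    ]


for word, record_id, frequency, positions in merge_entries(entries):
    print(word, record_id, frequency, ",".join(positions))
```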
28Inverted Indexing: Word and Phrase Indexing
- Individual words indexed in fields like abstract,
title
- Phrases indexed in fields like author, journal
name
- Both words and phrases can be indexed in the same
field
- Word fragments (in scientific databases)
29Word and Phrase Indexing in Dialog
- Remember
- A word in Dialog is a set of alphabetical or
numeric characters surrounded by either
punctuation or a space
- A phrase in Dialog is an entire entry in a
field
- British General Electric Co.
- Basch, Reva
- Basch, R.
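The difference between the two definitions can be shown with a small sketch applied to the three example entries. This only approximates the definitions above and is not Dialog's actual implementation; lowercasing the terms is an added assumption, and the function names are hypothetical.

```python
import re


def word_index_terms(entry):
    """Words: runs of letters or digits bounded by punctuation or spaces."""
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", entry)]


def phrase_index_term(entry):
    """Phrase: the entire field entry, kept as a single index term."""
    return entry.strip().lower()


for entry in ["British General Electric Co.", "Basch, Reva", "Basch, R."]:
    print(word_index_terms(entry), "|", phrase_index_term(entry))
```

Under phrase indexing, "Basch, Reva" and "Basch, R." become two different index terms, so a search on one form would not be expected to retrieve records entered under the other.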
30Inverted Indexing in the Repetitive Lyrics
Database
- Step 1: Unique identifier: colour of post-it
- Step 2: Analyze each record for significant words
- Step 3: Generate alphabetical list of significant
words with a pointer to the unique identifier
31Searching Inverted Indexes
- Presence of term
- Boolean logic
- Position of term
- Field searching
- Absolute position
- Proximity
- Post-coordinated phrase searching
- Frequency of term
- Ranking
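The search types listed above can be sketched against a tiny inverted index. The three records are shortened, hypothetical versions of the lyrics; stop-word removal is left out for brevity; and the function names (postings, near, rank_by_frequency) are invented for this example rather than taken from any search system.

```python
import re
from collections import defaultdict

# Three tiny hypothetical records, loosely based on the lyrics database.
records = {
    1: "I want to hold your hand",
    2: "I will always love you",
    3: "tell me what you want what you really really want",
}

# index[word][record_id] -> list of word positions (1-based)
index = defaultdict(dict)
for record_id, text in records.items():
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower()), 1):
        index[word].setdefault(record_id, []).append(position)


def postings(word):
    """Set of record ids containing the word (presence of term)."""
    return set(index.get(word, {}))


def near(word_a, word_b, max_gap):
    """Records where word_b follows word_a within max_gap positions."""
    hits = set()
    for record_id in postings(word_a) & postings(word_b):
        pos_a, pos_b = index[word_a][record_id], index[word_b][record_id]
        if any(0 < b - a <= max_gap for a in pos_a for b in pos_b):
            hits.add(record_id)
    return hits


def rank_by_frequency(word):
    """Record ids ordered by how often the word occurs, most frequent first."""
    counts = index.get(word, {})
    return sorted(counts, key=lambda rid: len(counts[rid]), reverse=True)


print(postings("want") & postings("you"))   # Boolean AND
print(postings("want") | postings("love"))  # Boolean OR
print(postings("you") - postings("want"))   # Boolean NOT
print(near("hold", "your", 1))              # adjacent words, i.e. a phrase
print(rank_by_frequency("want"))            # frequency ranking
```

Boolean searching needs only the record identifiers. Proximity and phrase searching rely on the stored word positions, and ranking relies on the frequency counts.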
32What Next?
- Dialog lab
- Last week's practice exercises
- Expand, Thesaurus, Multiple Files
- Next week's lecture
- Search strategies and tactics
- Database publishing trends
- February 1 class
- Management of library databases
- More about database publishing trends