Title: Vocabulary
1 Vocabulary and languages in searching
- Connection
- indexing
- searching
2 Basic assertion
- Indexing and searching are inexorably connected
- you cannot search for what was not first indexed
in some manner or other
- documents or objects are indexed so that they
can be searched
- there are many ways to do indexing
- to index, one needs an indexing language
- there are many indexing languages
- even taking every word in a document as an index
term amounts to an indexing language
- Knowing searching is knowing indexing
3 General definitions
- Vocabulary (Encarta Dictionary)
- 1. words known
- LANGUAGE - all the words used by or known to a
particular person or group, or contained in a
language as a whole
- Language
- 1. speech of group
- the speech of a country, region, or group of
people, including its diction, syntax, and
grammar
- 2. system of communication
- a system of communication with its own set of
conventions or special words
4 From general to specific
- These general definitions also apply to
indexing and searching, where they define
- index terms
- indexing vocabulary
- indexing language
- search terms
- search vocabulary
- query (request, search) language
5 Specific
- Index term
- a word or phrase that denotes (describes) a
concept and connotes (implies) a class
- e.g. the index term "table" denotes a table
and connotes many kinds of tables,
for which, if desired, we may have more specific
index terms
6 Specific ...
- Indexing vocabulary
- a set of index terms used in a domain or for a
set of documents or objects
- the set could even cover a single document or
object, e.g. a book
- Indexing language
- an indexing vocabulary together with rules
(syntax, grammar) for its application and use
7 Specific ...
- Search terms
- a counterpart to index terms, also denoting a
concept and connoting a class, but for a search
- Search vocabulary
- a set of search terms in a domain or available in
a system
- Query language
- a search vocabulary together with rules for its
use in searching
8 More
- An index language is the language used to
describe documents and requests.
- The elements of the index language are index
terms, which may be derived from the text of the
document to be described, or may be arrived at
independently.
- The vocabulary of an index language may be
controlled or uncontrolled.
- (van Rijsbergen, 1979)
9 Controlled vocabulary
- Predetermined: indicates which terms are to be
used in indexing
- may show definitions of and relations between
terms
- examples: thesaurus, subject heading list,
classification
- Also indicates terms that may be selected for
searching
- An indexing AND a searching tool
- Constructed by humans
- and costly to construct and use
10 Uncontrolled vocabulary
- Derived from documents
- nowadays automatically
- using various methods or algorithms
- a constant issue: which way is better
- Used to construct inverted indexes
- a concordance, such as one of the Bible, indicating
the place and position of each word mentioned in
the text, is an inverted index
- monks compiled them by hand in the 13th century,
computers do it today
- Inverted indexes are used for free text
searching
11 Controlled vs. free text searching
- Endless source of debate and controversy
- But each has its place, depending on the
circumstance and retrieval goal
- Each has strengths and weaknesses
- can you list or find a list comparing them?
- Users mostly use free text searching
- Professional searchers use both as warranted
- As an option
- KNOW THY CONTROLLED VOCABULARY
12 Inverted indexes
- Useful to know how they function in order to
understand search and retrieval. Steps:
- Each document is indexed
- every word in a document is taken as an index
term, with the exception of stop words
- the position of each word in the text is noted
- Indexes for all documents are merged
- index terms are arranged alphabetically in the
bowels of the system
- under each index term are the numbers of the
documents in which it appears and its position in
the text of each document
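The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular retrieval system is implemented: the two sample documents, the stop list, and the name build_inverted_index are all assumptions made for the example, and words are split on whitespace only.

```python
from collections import defaultdict

# Assumed stop list, for illustration only.
STOP_WORDS = {"a", "of", "in"}

def build_inverted_index(docs):
    """Map each index term to {document number: [word positions]}."""
    index = defaultdict(dict)
    for doc_no, text in docs.items():
        # Positions count every word (stop words included), starting at 1.
        for position, word in enumerate(text.lower().split(), start=1):
            if word in STOP_WORDS:
                continue  # stop words are not taken as index terms
            index[word].setdefault(doc_no, []).append(position)
    # Merge step: index terms arranged alphabetically.
    return dict(sorted(index.items()))

# Hypothetical one-sentence documents.
docs = {
    1: "a digital library of maps",
    2: "libraries in a digital age",
}

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Under each term the sketch keeps the document numbers and, per document, the positions of the term, which is the information the later search steps rely on.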
13 So, when you search
- for digital AND libraries
- the computer takes all documents under digital
- and all documents under libraries
- compares them to see which documents have both
terms and then
- provides you with the list of those documents in
a default format, or you may choose a format
- This is also called coordinate indexing
- coordination is done at the time of searching
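As a rough sketch of the comparison step, an AND search can be treated as a set intersection over the document numbers listed under each term. The postings below and the function name search_and are hypothetical, chosen only to show the idea.

```python
# Hypothetical postings: index term -> set of document numbers.
postings = {
    "digital": {1, 2, 4},
    "libraries": {2, 3, 4},
}

def search_and(postings, *terms):
    """Return the sorted document numbers that contain every term."""
    doc_sets = [postings.get(term, set()) for term in terms]
    if not doc_sets:
        return []
    # Coordination happens here, at search time.
    return sorted(set.intersection(*doc_sets))

print(search_and(postings, "digital", "libraries"))
```

A term that is not in the index contributes an empty set, so the whole AND search then retrieves nothing.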
14 Variation when you search
- for digital (WITH) libraries or
- "digital libraries", i.e. as a phrase
- the computer goes through the same steps as
before, but then also
- looks for documents where digital is positioned
right before libraries
- remember, the computer knows the position of each
term in each document and each sentence
- So searching for a phrase is a form of searching
for terms connected with AND, but in a given
sequence
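A phrase search can be sketched as the AND step followed by a position check. The positional index below is invented for illustration, and search_phrase is a hypothetical name; real systems differ in how they count positions and treat stop words.

```python
# Hypothetical positional index: term -> {document number: [word positions]}.
index = {
    "digital": {1: [2], 2: [1, 6]},
    "libraries": {1: [5], 2: [7]},
}

def search_phrase(index, first, second):
    """Return documents where `second` occurs immediately after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    # Step 1: same as AND, keep only documents containing both terms.
    for doc_no in sorted(first_postings.keys() & second_postings.keys()):
        following = set(second_postings[doc_no])
        # Step 2: require adjacent positions in the given order.
        if any(pos + 1 in following for pos in first_postings[doc_no]):
            hits.append(doc_no)
    return hits

print(search_phrase(index, "digital", "libraries"))
```

Document 1 contains both words but three positions apart, so only document 2, where digital is 6th and libraries is 7th, is retrieved.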
15 Example of inverted file
For simplicity, documents have one sentence. Stop
words: a, of, in.
A search for slow AND truck gets as results
documents 1 and 3, since both contain slow and
truck.
A search for slow (w) truck retrieves only
document 3, in which slow is 7th and truck is 8th,
so they are right next to each other. Doc 1 has
both words, but not next to each other, thus it is
not retrieved.
16 Thesaurus
- Good old Peter Mark Roget had a most useful idea
and did a great job
- Following this idea, the thesaurus became THE
major tool for controlled vocabulary in
information retrieval (IR)
- from the 1950s to this day, many IR thesauri
have been developed
- all have a similar structure and function
- but they are difficult and costly to construct
17 What is a thesaurus?
- For writers, it is a tool like Roget's: one
with words grouped and classified to help select
the best word to convey a specific nuance of
meaning.
- For indexers and searchers, it is an information
storage and retrieval tool: a listing of words
and phrases authorized for use in an indexing
system, together with relationships, variants and
synonyms, and aids to navigation through the
thesaurus.
- (Milstead, 2000)
18 more
- A thesaurus to an information scientist is a
controlled set of the terms used to index
information in a database, and therefore also to
search for information in that database so the
same concepts are represented by the same term.
- (Batty, 1998)
19 Basic thesaurus components
- For each entry, a thesaurus has a classification
grid
- Descriptor (DE): an index term that has
- Scope note (SN): context in which it is used
- Broader terms (BT): higher in a hierarchy
- Narrower terms (NT): lower in a hierarchy
- Related terms (RT): other connected descriptors
- Used for (UF): synonyms that are not
descriptors
- Note: not all of these may be present for every
descriptor
- A searcher or indexer can use these as a guide
for selection or rejection, and for browsing to
get ideas
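One way to picture such an entry is as a small record with one field per component. The sketch below is an assumption for illustration: the class name ThesaurusEntry, the helper preferred_term, and the sample entry are invented and are not copied from ERIC or any other real thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    descriptor: str                                # DE
    scope_note: str = ""                           # SN
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    used_for: list = field(default_factory=list)   # UF

# Invented entry, for illustration only.
entry = ThesaurusEntry(
    descriptor="libraries",
    scope_note="Institutions that collect and provide access to documents",
    broader=["information services"],
    narrower=["academic libraries", "public libraries"],
    related=["librarians"],
    used_for=["book collections"],
)

def preferred_term(word, entries):
    """Map a word to its descriptor via UF references; None if unknown."""
    for e in entries:
        if word == e.descriptor or word in e.used_for:
            return e.descriptor
    return None

print(preferred_term("book collections", [entry]))
```

Every field except the descriptor defaults to empty, reflecting the note that not all components are present for every descriptor.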
20 Examples of thesauri
- Thesauri have been constructed for a great many
domains, from A to Z
- here are some lists
- international multilingual thesauri
- online thesauri
- among them the ERIC Thesaurus (we use it as an
example)
- BUT different thesauri may and do treat the same
descriptor (index term) differently
- having different, more or fewer narrower,
broader, related terms
- thus it is dangerous to use them interchangeably
21 Standard structure
With variations on the theme, thesauri have a
similar conceptual structure to guide the searcher
or indexer
Note: not every descriptor has to have all
of these
22 Same thesaurus but
- Examples of the ERIC (Educational Resources
Information Center) thesaurus as used differently
in different systems
- ERIC's own system
- ERIC file on DIALOG (begin 1)
- ERIC file on OVID (accessible through RUL)
- Notice how each handles thesaurus display and
search in its own way, but the principles are
still the same
- Oh well
23 ERIC online thesaurus on ERIC
- Allows for
- searching for words that are included in
descriptors, by category or in all categories
- browsing alphabetically
- browsing in one of about 40 categories
- A search for library in all categories found 76
descriptors that include library
- Out of these, we selected library education
24 ERIC online thesaurus on ERIC: descriptor library
education
25 ERIC thesaurus on DIALOG
- In a convoluted way, the ERIC thesaurus (and other
ones) can be displayed on DIALOG (and on other
vendors, such as OVID)
- How?
- begin in file 1 (ERIC)
- then expand a desired term; here we used the term
library
- you will see under R that certain terms have
related terms, meaning that these are thesaurus
entries
- then expand on one of those to see related terms
- then you can browse and choose which ones to use
in search
- And here are Print Screens of the process
26 going
Expand library
27 going
28 going
We now chose the descriptor LIBRARY ADMINISTRATION
and expand on that one
Neat trick:
you can expand on an expand to get related terms
29 going
These are now R terms of various types
14 related terms for this one are listed
Can expand on this one to see other RT
You can also select any of these to search
30 going
We have now selected r10, library expenditures
31 going
Now we can view some items in a chosen format,
or we can further modify this search (add,
refine, etc.)
32 gone
This is one of the items we got
33 ERIC thesaurus on OVID (accessed through RUL)
For library, ask to map it as a thesaurus term
34 going
35 going
36 going
Retrieved, ready to display
37 gone
38 Relevance feedback
- Method for using information in items judged
relevant to further refine or change the search
- e.g. in relevant items we can browse titles,
descriptors, identifiers, and abstracts to get
leads for further search terms and tactics
- in some advanced systems this may be done
automatically
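A very simple way to automate part of this is to count the descriptors attached to the items judged relevant and propose the most frequent ones that are not yet in the query. This is only one possible sketch, not the method of any specific system; the items, the descriptor "budgets", and the name suggest_terms are assumptions for the example.

```python
from collections import Counter

def suggest_terms(query_terms, relevant_items, top_n=3):
    """Rank descriptors of relevant items that are not yet in the query."""
    counts = Counter()
    for item in relevant_items:
        for descriptor in item["descriptors"]:
            if descriptor not in query_terms:
                counts[descriptor] += 1
    # Most frequent descriptors first.
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical items that the searcher judged relevant.
relevant_items = [
    {"title": "Item A",
     "descriptors": ["library administration", "library expenditures"]},
    {"title": "Item B",
     "descriptors": ["library administration", "budgets"]},
    {"title": "Item C",
     "descriptors": ["library expenditures", "library administration"]},
]

print(suggest_terms({"library administration"}, relevant_items))
```

The searcher still decides which of the suggested terms to add; the count only orders the leads.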
39 Query expansion
- Method for adding, modifying, or changing search
terms in a query
- to broaden, narrow, focus, or change terms
- Many sources can be used
- relevance feedback, thesauri, dictionaries,
textbooks, documents, catalogs, people (users,
colleagues), your own mind and experience
- Some systems suggest terms for query expansion
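As an illustration of thesaurus-based expansion, the sketch below broadens each query term with its narrower terms and renders the result as a generic Boolean string. The thesaurus fragment is invented, the names expand_query and to_boolean are hypothetical, and the output syntax is not that of DIALOG, OVID, or any other vendor.

```python
# Hypothetical thesaurus fragment: descriptor -> relationships.
thesaurus = {
    "libraries": {
        "NT": ["academic libraries", "public libraries"],
        "RT": ["librarians"],
    },
}

def expand_query(terms, thesaurus, relations=("NT",)):
    """Return one OR-group per query term: the term plus its thesaurus terms."""
    groups = []
    for term in terms:
        group = [term]
        for relation in relations:
            for extra in thesaurus.get(term, {}).get(relation, []):
                if extra not in group:
                    group.append(extra)
        groups.append(group)
    return groups

def to_boolean(groups):
    """Render the groups as a generic Boolean query string."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)

print(to_boolean(expand_query(["digital", "libraries"], thesaurus)))
```

Passing relations=("NT", "RT") would broaden the search further, while a term with no thesaurus entry, such as digital here, is left unchanged.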
40 Conclusion
- At the base of all searching are
- terms
- vocabularies
- languages
- but a variety exists
- In reality, in searching there is no completely
controlled or uncontrolled vocabulary
- it is a matter of degree
- most importantly, a matter of mastery
41 symbolically: controlled and free vocabulary
42 thank you!