Title: ISYS 300 Week 2 Document Representation
1ISYS 300 - Week 2: Document Representation
- Dr. Xia Lin
- Associate Professor
- College of Information Science and Technology
- Drexel University
2QUIZ
- Question 1: What are two major components of IR systems?
- Question 2: What are the two abstraction principles of IR?
3Reviews of Last Week
- Information retrieval systems
- The user requests information
- The system processes queries
- An IR system matches two abstractions
- abstraction of information needs
- abstraction of data from text
- Differences between
- Data, information, knowledge
4Documents
- Logical Units of Text
- Units of records (text and other components)
- Units that can be stored, retrieved, and displayed as a unique entity
- Units of semantic entity
- units of text grouped together for a purpose
- Units of unformatted text
- Text as written by authors of documents.
5Document Representation
- Documents need to be represented in concise and identifiable formats/structures.
- Not every word of the text is meaningful for searching/retrieval.
- Documents themselves do not have identifiable attributes such as author and title.
6Document Representation
- Document representation helps users identify and receive information from the system.
- identify authors and titles
- identify subjects
- provide summaries/abstracts
- classify subject categories
- Document Citation
- A set of information that makes it easy to identify a document.
7Document Surrogates
- Each document should have a unique identifier
- Accession (sequential) number
- Classification number
- Barcodes
- ISBN number
- Good for the computer but not enough for the user?
- Go to the bookstore and get the book 0-471-14338-3.
- Do you want to have 200737-103146 for dinner?
8Document Indexing
- Computerized Indexing
- Indexing based on citations
- Indexing based on full text
- Subject indexing
- Creating a set of controlled vocabularies (thesaurus or subject headings) to represent documents
- Assigning terms from the controlled vocabularies to documents
9Computerized Indexing
- The computer creates indexing files based on document surrogates
- to improve access speed
- to increase access points
- to improve precision
- to reduce false drops
- to identify similar documents
- How?
10Computerized Indexing
- Title indexing
- Sort all the titles alphabetically
- Ignore a leading "a" or "the"
- Convert all letters to uppercase.
- Matching always starts from the beginning of the title (not from individual words).
- Most early IR systems (such as library catalogs) used title indexing
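The title-indexing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the first three titles come from the course examples later in the deck, and the fourth title is hypothetical, added to show how a leading article is handled.

```python
import bisect

# Leading articles to ignore; the slide names "a" and "the" only.
ARTICLES = ("A ", "THE ")

def title_key(title):
    """Return the sort key: uppercase, single-spaced, leading article dropped."""
    key = " ".join(title.upper().split())
    for article in ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key

def build_title_index(titles):
    """Return (key, original title) pairs sorted alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)

def search_title(index, query):
    """Return titles whose key starts with the query (match from the start only)."""
    prefix = title_key(query)
    keys = [key for key, _ in index]
    hits = []
    for key, title in index[bisect.bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        hits.append(title)
    return hits

titles = [
    "Introduction to information systems",
    "Human computer interaction",
    "Information retrieval systems",
    "The Art of Indexing",  # hypothetical title, added to show article handling
]
index = build_title_index(titles)
print([key for key, _ in index])
print(search_title(index, "information"))  # found: the title starts with this word
print(search_title(index, "retrieval"))    # empty: the word is not at the start
```

The last call shows the main limitation of title indexing: a word in the middle of a title cannot be found, which is what keyword indexing addresses.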
11Keyword indexing
- Parsing every individual word from documents
- First decision: What is a word?
- Are digits words?
- How about letter and digit combinations such as B6 and B12?
- Is F-16 one word or two words?
- Hyphens
- Online, on-line, on line?
- F-16
- List all the words alphabetically with pointers back to the documents: inverted indexing.
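To see how much the "what is a word?" decision matters, here is a small Python sketch of a tokenizer with two switches. The function name, the parameter names, and the regular expressions are assumptions made for this illustration; real systems make these choices in many different ways.

```python
import re

def tokenize(text, keep_digits=True, split_hyphens=False):
    """Split text into lowercase words under two configurable decisions."""
    text = text.lower()
    if split_hyphens:
        # Treat a hyphen like a space: "on-line" becomes "on" and "line".
        text = text.replace("-", " ")
    if keep_digits:
        # Letters and digits both count; hyphenated parts stay together.
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    else:
        # Only letters count; digits act as separators.
        pattern = r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text)

sample = "F-16 and B12 on-line"
print(tokenize(sample))                      # hyphenated words kept whole
print(tokenize(sample, split_hyphens=True))  # hyphens split words apart
print(tokenize(sample, keep_digits=False))   # digits are not words
```

The same text yields three different word lists, so the indexer and the query parser must apply the same rules or searches will miss documents.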
12Phrase indexing
- There is no safe way to parse phrases out of titles or the full text of documents.
- One way to do phrase indexing is by positions: if two words are used next to each other, they are (potentially) a phrase.
- Most phrase indexes are done manually.
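The position-based approach can be sketched as follows. This is a minimal Python illustration, not a real phrase indexer: the function name is made up, and the titles are the course titles used in the later inverted-indexing example.

```python
from collections import Counter

def candidate_phrases(text):
    """Return every pair of adjacent words as a potential two-word phrase."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

titles = [
    "Introduction to information systems",
    "Human computer interaction",
    "Information retrieval theories and systems",
]

# Count how often each adjacent pair occurs across all titles.
counts = Counter(pair for title in titles for pair in candidate_phrases(title))
for pair, n in sorted(counts.items()):
    print(" ".join(pair), n)
```

The output contains useful candidates such as "information retrieval" but also pairs such as "to information", which is why the slide calls adjacent words only potentially a phrase.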
13Inverted Indexing
- Purpose
- Preparing documents for search engines to search
- Objective
- Create a sorted list of words with pointers indicating in which documents, and WHERE in them, the words appear.
- Process the list in many different ways to meet the retrieval needs
14Inverted Indexing
- An inverted index consists of an ordered list of indexing terms; each indexing term is associated with some document identification numbers.
- Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate the documents.
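The two-stage retrieval described above can be sketched in Python with a binary search over the ordered term list. The data below is copied from the merged index built in the worked example later in the deck; the variable and function names are assumptions for this illustration.

```python
import bisect

# Ordered list of indexing terms and, in parallel, their document numbers.
TERMS = ["computer", "human", "information", "interaction",
         "introduction", "retrieval", "systems", "theories"]
POSTINGS = [["ISYS110"], ["ISYS110"], ["ISYS102", "ISYS300"], ["ISYS110"],
            ["ISYS102"], ["ISYS300"], ["ISYS102", "ISYS300"], ["ISYS300"]]

DOCUMENTS = {
    "ISYS102": "Introduction to information systems",
    "ISYS110": "Human computer interaction",
    "ISYS300": "Information retrieval theories and systems",
}

def lookup(term):
    """Stage 1: binary-search the ordered term list for the indexing term."""
    i = bisect.bisect_left(TERMS, term.lower())
    if i < len(TERMS) and TERMS[i] == term.lower():
        return POSTINGS[i]
    return []

def retrieve(term):
    """Stage 2: use the document numbers to locate the documents."""
    return [DOCUMENTS[doc_id] for doc_id in lookup(term)]

print(lookup("systems"))
print(retrieve("Retrieval"))
print(lookup("database"))  # a term that is not in the index
```

Because the term list is sorted, the search needs only about log2(n) comparisons, which is one reason the index is kept in order.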
15Examples
- ISYS102 Introduction to information systems
- Info110 Human computer interaction
- info300 Information retrieval systems
16Step 1: Generate a list of all the words
- ISYS102 Introduction to information systems
- ISYS110 Human computer interaction
- ISYS300 Information retrieval theories and
systems
- ISYS102 Introduction
- ISYS102 to
- ISYS102 information
- ISYS102 systems
- ISYS110 human
- ISYS110 computer
- ISYS110 interaction
- ISYS300 information
- ISYS300 retrieval
- ISYS300 theories
- ISYS300 and
- ISYS300 systems
17Step 2: Remove stop words
- ISYS102 Introduction
- ISYS102 to (stop word, removed)
- ISYS102 information
- ISYS102 systems
- ISYS110 human
- ISYS110 computer
- ISYS110 interaction
- ISYS300 information
- ISYS300 retrieval
- ISYS300 theories
- ISYS300 and (stop word, removed)
- ISYS300 systems
18Step 3: Invert the list
- Introduction ISYS102
- Information ISYS102
- Systems ISYS102
- Human ISYS110
- Computer ISYS110
- Interaction ISYS110
- Information ISYS300
- Retrieval ISYS300
- Theories ISYS300
- Systems ISYS300
19Step 4: Sort the list
- Computer ISYS110
- Human ISYS110
- Information ISYS102
- Information ISYS300
- Interaction ISYS110
- Introduction ISYS102
- Retrieval ISYS300
- Systems ISYS102
- Systems ISYS300
- Theories ISYS300
20Step 5: Merge same words in the list
- Computer ISYS110
- Human ISYS110
- Information ISYS102, ISYS300
- Interaction ISYS110
- Introduction ISYS102
- Retrieval ISYS300
- Systems ISYS102, ISYS300
- Theories ISYS300
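The five steps can be traced in a short Python sketch. It uses the three course titles from Step 1 and treats "to" and "and" as the stop words, since those are the two words that disappear between Step 2 and Step 3. The function name and the lowercase normalization are assumptions made for this illustration.

```python
STOP_WORDS = {"to", "and"}  # the two words dropped in the worked example

DOCS = {
    "ISYS102": "Introduction to information systems",
    "ISYS110": "Human computer interaction",
    "ISYS300": "Information retrieval theories and systems",
}

def build_inverted_index(docs, stop_words=STOP_WORDS):
    # Step 1: generate a list of (document, word) pairs for all the words.
    pairs = [(doc_id, word.lower())
             for doc_id, text in docs.items()
             for word in text.split()]
    # Step 2: remove stop words.
    pairs = [(doc_id, word) for doc_id, word in pairs if word not in stop_words]
    # Step 3: invert the list so that the word comes first.
    inverted = [(word, doc_id) for doc_id, word in pairs]
    # Step 4: sort the list alphabetically by word, then by document.
    inverted.sort()
    # Step 5: merge entries for the same word into one line.
    index = {}
    for word, doc_id in inverted:
        doc_ids = index.setdefault(word, [])
        if doc_id not in doc_ids:
            doc_ids.append(doc_id)
    return index

index = build_inverted_index(DOCS)
for word, doc_ids in index.items():
    print(word, ", ".join(doc_ids))
```

The printed list matches the merged list of Step 5, apart from capitalization, because the sketch lowercases every word before sorting.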
21Example: Retrieving in an inverted file
- computer 110TI02
- design 110TX01
- human 110TI01
- information 102TI02, 102TX01, 300TI01, 300TX02
- interaction 110TI03
- interface 110TX03
- introduction 102TI01
- management 102TX03
- retrieval 300TI02, 300TX03
- systems 102TI03, 300TI03, 300TX04, 102TX02
- text 300TX01
- user 110TX02
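The entries above appear to encode three things in each pointer: a document number, a field, and a word position. For example, 300TI02 would be document 300, title field, second word. Under that reading, which is an assumption here (including TX as a text field), the Python sketch below shows how such a file can answer Boolean, field-restricted, and phrase queries. All function names are made up for this illustration.

```python
import re

INDEX = {
    "computer": ["110TI02"],
    "design": ["110TX01"],
    "human": ["110TI01"],
    "information": ["102TI02", "102TX01", "300TI01", "300TX02"],
    "interaction": ["110TI03"],
    "interface": ["110TX03"],
    "introduction": ["102TI01"],
    "management": ["102TX03"],
    "retrieval": ["300TI02", "300TX03"],
    "systems": ["102TI03", "300TI03", "300TX04", "102TX02"],
    "text": ["300TX01"],
    "user": ["110TX02"],
}

def parse(code):
    """Split a pointer such as '300TI02' into (document, field, position)."""
    m = re.fullmatch(r"(\d{3})([A-Z]{2})(\d{2})", code)
    return m.group(1), m.group(2), int(m.group(3))

def docs_with(term, field=None):
    """Documents containing the term, optionally only in one field."""
    return {doc for doc, fld, _ in map(parse, INDEX.get(term, []))
            if field in (None, fld)}

def and_query(*terms):
    """Documents containing every term (Boolean AND)."""
    return set.intersection(*(docs_with(t) for t in terms))

def phrase_query(first, second):
    """Documents where 'second' directly follows 'first' in the same field."""
    firsts = {parse(code) for code in INDEX.get(first, [])}
    return {doc for doc, fld, pos in map(parse, INDEX.get(second, []))
            if (doc, fld, pos - 1) in firsts}

print(and_query("information", "systems"))     # both words anywhere
print(phrase_query("information", "systems"))  # adjacent words only
print(docs_with("retrieval", field="TI"))      # title field only
```

The AND query returns documents 102 and 300, while the phrase query returns only 102, because in document 300 the word "retrieval" sits between "information" and "systems".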
22Second Examples
- 1 TI Cats and dogs Best friends of Kate
- DE Cats Dogs fiction
- 2 TI New methods of feeding cats and dogs
- DE cats dogs feeding behaviors
- 3 TI Canine mandibular structure
- DE Dogs Anatomy Musculature skeleton
- 4 TI It rains like cats and dogs last night
- DE mystery fiction
23The Inverted Indexing File
- anatomy 03DE02
- behaviors 02DE04
- canine 03TI01
- cats 01DE01, 01TI01, 02DE01, 02TI04
- cultural 01DE04
- dogs 01DE02, 01TI02, 02DE02, 02TI05, 03DE01
- enemies 01TI04
- feeding 02DE03, 02TI03
- mandibular 03TI02
- methods 02TI02
- misunderstood 01TI06
- mortal 01TI03
- musculature 03DE03
- new 02TI01
- simply 01TI05
- skeleton 03DE04
- structure 03TI03
- studies 01DE05
24Example: Create an inverted index for the following
25Unix Basics
- Unix
- A powerful operating system
- Multitasking, multithreaded, multi-user OS
- Excellent host for IR systems and databases as well as web servers.
- Command-based access
26Subject Indexing
- A human analytic process for identifying, selecting, and representing document concepts
- Create indexing languages
- Using standardized, limited vocabularies for indexing purposes.
- Assign indexing terms to documents
- Using only the terms in the index language selected.
27Second Examples
- 1 TI Cats and dogs Best friends of Kate
- DE Cats Dogs fiction
- 2 TI New methods of feeding cats and dogs
- DE cats dogs feeding behaviors
- 3 TI Canine mandibular structure
- DE Dogs Anatomy Musculature skeleton
- 4 TI It rains like cats and dogs last night
- DE mystery fiction
28Second Examples
- 1 TI Cats and dogs Best friends of Kate
- 2 TI New methods of feeding cats and dogs
- 3 TI Canine mandibular structure
- 4 TI It rains like cats and dogs last night
29Metadata
- Metadata are data about data
- to describe features of the data (digital objects)
- Content: what the object is about
- Context: the who, what, why, where, and how aspects associated with the object
- Structure: associations within or among individual objects
30Example: Identify content, context, and structure in the following
- Author: Arms, William Y.
- Title: Digital libraries / William Y. Arms.
- Imprint: Cambridge, Mass. MIT Press, c2000.
- CALL: Z692.C65 A76 2000
- Description: x, 287 p. ill. 24 cm.
- Series: Digital libraries and electronic publishing
- Note: Includes index.
- Subject:
- Libraries -- United States -- Special collections -- Electronic information resources.
- Digital libraries -- United States.
- ISBN: 0262011808 (alk. paper)
31Why Metadata?
- Metadata is a key to ensuring that resources will survive and continue to be accessible into the future.
- Standards
- Structures and organization
- Content and context
32Functions of Metadata
- To help organize resources
- To facilitate resource discovery
- To facilitate interoperability
- To support digital identification
- To support archiving and preservation
33Types of Metadata
- Descriptive
- Title, abstract, keywords
- Administrative
- Who created it and how
- Rights management
- Structural
- Relationships among objects
34Attributes of Metadata
- Source of metadata
- Nature of metadata
- Structure
- Conform to a standard
- Semantics
- Controlled vocabulary or not
- Level
- How detailed the metadata are.
35Metadata Schemes
- A metadata scheme provides a formal structure
designed to identify the knowledge structure of a
given discipline and to link that structure to
the information of the discipline through the
creation of an information system that will
assist the identification, discovery and use of
information within that discipline.
36- Schemes are sets of metadata elements to describe a resource
- Semantics: definitions and meanings of the metadata elements
- Contents: values given to metadata elements
- Content rules: what values should be used and how the values should be formulated.
37XML
- XML stands for eXtensible Markup Language
- Designed to separate content, context, and presentation (style) in the web environment
- Designed to deploy content-specific tags for content indexing and retrieval.
- Designed as a subset of SGML
38Example
<?xml version="1.0" encoding="utf-8" ?>
<book isbn="0836217462">
  <title>Being a Dog Is a Full-Time Job</title>
  <author>Charles M. Schulz</author>
  <character>
    <name>Snoopy</name>
    <friend-of>Peppermint Patty</friend-of>
    <since>1950-10-04</since>
    <qualification>extroverted beagle</qualification>
  </character>
  <character>
    <name>Peppermint Patty</name>
    <since>1966-08-22</since>
    <qualification>bold, brash and tomboyish</qualification>
  </character>
</book>
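Content-specific tags make it easy for a program to pull out exactly the fields it wants to index. The Python sketch below parses the book example with the standard library's xml.etree.ElementTree module. The function name and the choice of fields are assumptions for this illustration, and the XML declaration is left out of the string for brevity.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book isbn="0836217462">
  <title>Being a Dog Is a Full-Time Job</title>
  <author>Charles M. Schulz</author>
  <character>
    <name>Snoopy</name>
    <friend-of>Peppermint Patty</friend-of>
    <since>1950-10-04</since>
    <qualification>extroverted beagle</qualification>
  </character>
  <character>
    <name>Peppermint Patty</name>
    <since>1966-08-22</since>
    <qualification>bold, brash and tomboyish</qualification>
  </character>
</book>"""

def extract_fields(xml_text):
    """Collect the tagged values that an indexer might use as access points."""
    root = ET.fromstring(xml_text)
    return {
        "isbn": root.get("isbn"),  # attribute on the root element
        "title": root.findtext("title"),
        "author": root.findtext("author"),
        "characters": [c.findtext("name") for c in root.findall("character")],
    }

record = extract_fields(BOOK_XML)
for field, value in record.items():
    print(field, "=", value)
```

Each extracted field could feed its own inverted index, so that a search for "Snoopy" as a character is kept separate from a search for "Snoopy" in a title.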
39XML is an industry in itself
- All the major software companies have implemented some type of XML-related software
- XML-related standards are under continual development.
- XSL: Extensible Stylesheet Language
- XSLT: Extensible Stylesheet Language Transformations
- XSLT enables and empowers interoperability
- XLink: XML Linking Language
- Assign meanings to links
- RDF: Resource Description Framework
40XML Example (www.XML.com)
<?xml version="1.0"?>
<artistinfo>
  <surname>Modigliani</surname>
  <name>Amadeo</name>
  <born>July 12, 1884</born>
  <died>January 24, 1920</died>
  <biography>
    <p>In 1906, Modigliani settled in Paris, where ...</p>
  </biography>
</artistinfo>
41Example
<?xml version="1.0"?>
<period>
  <city>Paris</city>
  <country>France</country>
  <timeframe begin="1900" end="1920"/>
  <title>Paris in the early 20th century (up to the twenties)</title>
  <end>Amadeo</end>
  <description>
    <p>During this period, Russian, Italian, ...</p>
  </description>
</period>
42
<environment xmlns:xlink="http://www.w3.org/1999/xlink"
             xlink:type="extended">
  <!-- The resources involved in our link are the artist -->
  <!-- himself, his influences and the historical references -->
  <artist xlink:type="locator" xlink:label="artist"
          xlink:href="modigliani.xml"/>
  <influence xlink:type="locator" xlink:label="inspiration"
             xlink:href="cezanne.xml"/>
  <influence xlink:type="locator" xlink:label="inspiration"
             xlink:href="lautrec.xml"/>
  <influence xlink:type="locator" xlink:label="inspiration"
             xlink:href="rouault.xml"/>
  <history xlink:type="locator" xlink:label="period"
           xlink:href="paris.xml"/>
  <history xlink:type="locator" xlink:label="period"
           xlink:href="kisling.xml"/>
</environment>
43Discussion
- Differences between XML and HTML?
- Relationships between XML and metadata?
44XML/Metadata Tools
- Reggie
- a metadata editor
- Output: RDF, HTML
- a Java application
- URL: http://metadata.net/dstc/
45DC DOT
- http://www.ukoln.ac.uk/metadata/dcdot/
- Exercises
- Add Dublin Core Headings to the class Web page.
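As a rough idea of what the exercise produces, the Python sketch below generates Dublin Core headings as HTML meta elements, following the common convention of a "DC." prefix on the element name plus a link element that declares the Dublin Core namespace. The function name is made up, and the record values are taken from the title slide of this deck; the subject value is an assumption chosen for illustration.

```python
from html import escape

def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML <link>/<meta> headings."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        content = escape(value, quote=True)  # protect quotes and angle brackets
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)

# Values from the title slide; "subject" is an illustrative assumption.
class_page = {
    "title": "ISYS 300 - Week 2: Document Representation",
    "creator": "Xia Lin",
    "publisher": "Drexel University",
    "subject": "Document representation",
}

headings = dublin_core_meta(class_page)
print(headings)
```

The generated lines belong inside the head element of the Web page, where a tool such as DC-dot or a search engine can read them.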
46Writing an XML Document
- An XML document must be well formed
- A root element is required.
- Closing tags are required.
- Elements must be properly nested.
- Case matters.
- Entity references other than the five predefined ones must be declared in a DTD.
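Each rule above can be checked by handing a small document to an XML parser and seeing whether it is accepted. The Python sketch below uses the standard library's xml.etree.ElementTree module, which raises ParseError on input that is not well formed. The function name and the tiny test documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One tiny document per rule on the slide.
examples = {
    "well formed": "<a><b>text</b></a>",
    "two root elements": "<a/><b/>",
    "missing closing tag": "<a><b>text</a>",
    "improper nesting": "<a><b>text</a></b>",
    "case mismatch": "<A>text</a>",
    "undeclared entity": "<a>&nbsp;</a>",
    "predefined entity": "<a>&amp;</a>",
}

for label, text in examples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first and last examples pass. Note that this checks well-formedness only; validating a document against a DTD or schema is a separate step.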
47XML document content
<title>NASA Image Exchange</title>
<site>http://nix.nasa.gov/</site>
<metadata>
  <repository-name>NASA Image Exchange</repository-name>
  <category>
    <label>CATEGORY</label>
    <data>images</data>
  </category>
48XML Scheme
49XML Document Headings
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="http://project.cis.drexel.edu/classes/isys300/XML/repository.css" ?>
<repository xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="http://project.cis.drexel.edu/classes/info653/XML/DLRepository.xsd">
50Style Sheet
repository { display: block; font-size: large; color: Maroon }
title { display: block; font-size: large; text-align: center }
site { display: block; text-align: center }
metadata { float: right; clear: right; width: 225px; border: thin solid Teal; padding: 10px }
repository-name { display: block; font-size: medium; background: Navy; color: Yellow; text-align: center }
label { display: block; font-size: medium }
data { display: block; font-size: small; color: blue; position: relative; left: 9px }
description { display: block }
review { display: block; color: black }
name { display: block; text-align: right; color: Blue; font-size: small }
term { display: none }
51Controlled Vocabulary
- Goals
- To permit easy location of documents by topic.
- To define topic areas, and hence relate one document to another.
- To provide multiple access points to documents
- To enforce uniformity throughout an information retrieval system
52Controlled Vocabulary
- Formats
- Hierarchical Classified list
- hierarchical subject descriptors
- associative cross references
- classification notation (codes)
- Alphabetical list
- include both descriptors and other lead-in terms
53Main Components in a Controlled Vocabulary
- Keyword/Descriptor
- Broader Term
- Narrower Term
- Related Term
- Synonymous Term
54Example
[Diagram: the descriptor "Cancer" surrounded by its broader terms (Diseases, Neoplasms) and by its related terms, synonyms, and narrower terms: Abdominal Neoplasms, Hyperplasia, Seminoma, Malignancy, Malignant tumor, Cancer morphology, Malignant neoplasm of skin, Breast Cancer, Primary malignant neoplasm of liver]
55Controlled Vocabulary
- Examples
- Case studies (Descriptor)
- SN: Detailed analyses, usually focusing on a particular problem of an individual, group, or organization (note: do not confuse with medical case histories)
- NT
- Cross sectional studies
- Longitudinal studies
56Examples (Case Studies)
- BT
- Evaluation methods
- Research
- RT
- Case records
- Counseling
- Qualitative research
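The "Case studies" entry can be stored as a small data structure, and its relationships can then be used to widen or narrow a search. The Python sketch below is only an illustration: the dictionary layout and the function name are assumptions, while the terms and the SN, BT, NT, and RT labels come from the two slides above.

```python
THESAURUS = {
    "case studies": {
        "SN": "Detailed analyses, usually focusing on a particular problem "
              "of an individual, group, or organization",
        "BT": ["evaluation methods", "research"],
        "NT": ["cross sectional studies", "longitudinal studies"],
        "RT": ["case records", "counseling", "qualitative research"],
    },
}

def expand_query(term, relations=("NT",)):
    """Return the term plus the thesaurus terms linked by the given relations."""
    key = term.lower()
    entry = THESAURUS.get(key, {})
    expanded = [key]
    for relation in relations:
        expanded.extend(entry.get(relation, []))
    return expanded

print(expand_query("Case studies"))                # add narrower terms
print(expand_query("Case studies", ("BT",)))       # broaden the search
print(expand_query("Case studies", ("NT", "RT")))  # narrower and related terms
```

Searching with the expanded list is one way a controlled vocabulary helps the searcher find related terms without knowing them in advance.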
57Advantages of Subject Indexing
- Facilitates concept search
- search by topics/subjects, not just by words
- link related documents by subject terms
- Makes implicit information explicit
- Provides a standard terminology to index and search documents.
- Uses a small indexing vocabulary
- Helps the searcher find related terms
58Disadvantages of Subject Indexing
- Expensive manual operations
- To construct the controlled vocabulary
- To assign terms to documents
- Difficult to keep up to date
- Terminology changes very fast
- New terms are added daily.
- Inconsistent process of human indexing
- The same documents are assigned different indexing terms by different indexers
- The user may not use the same terms to find documents as the indexer would use to index them.
59Two Examples of Document Representation
- Controlled Vocabulary
- human-based indexing
- subject-based indexing
- Inverted indexing
- computer-based indexing
- statistical-based indexing
60Considerations of Document Representation
- Discriminating power
- to identify a document uniquely
- to reduce ambiguity
- Examples
- ISBN number for book
- bar codes for products
61- Descriptiveness
- describe all the information as completely as possible
- full text
- abstracts
- extracts
- reviews
- Completeness and correctness
62Considerations of DR
- Similarity Identification
- to group similar documents
- keywords or subject indexing
- book classification numbers
- Difficult for the computer to assign keywords, subject descriptors, or classification numbers to documents
63Considerations of DR
- Conciseness
- simple and clear
- reduce process time and storage space
- Examples
- authors and titles
- Needed by both the computer and the user
64Relationships of the four considerations
- Higher discriminating power may lower the capability of identifying similarities among documents.
- Good descriptiveness may defeat conciseness.
- What's good for the computer may not always be good for the user.
- A good representation should seek a balance among the four and take both the computer and the user into consideration.