Title: MGT2201 Information Management
1 MGT2201 Information Management
- Module 6 - Classification and Indexing for Retrieval
2 Records retrieval
- Every minute of delay in finding a record is costly - in user or requester waiting time and in filer searching time - to say nothing of possible loss of business as an ultimate result
  - Smith and Kallaus, 1997
- Effective retrieval requires a knowledge of classification and indexing techniques and a thorough understanding of the organisation's activities
3 Classification Systems
- ..systems (which) reflect the business of the organisation from which they derive and are normally based on an analysis of the organisation's business activities. The systems can be used to support a variety of records management processes
  - AS ISO 15489 Part 1 (Section 9)
- Classification groups records to limit the searching process
- AS ISO 15489 recommends the use of a business classification scheme
4 How business classification assists in the management of records
- Providing linkages between individual records which accumulate to provide a continuous record of activity
- Ensuring records are named in a consistent manner over time
- Assisting in the retrieval of all records relating to a particular function or activity
- Determining security protection and access appropriate for sets of records
- Allocating user permissions for access to, or action on, particular groups of records
- Distributing responsibility for management of particular sets of records
- Distributing records for action
- Determining appropriate retention periods and disposition actions for records
5 Steps in developing a business classification scheme
- 1 Gather documentary information and conduct interviews
- 2 Understand overall mission/objectives of the organisation
- 3 Derive and list the functions needed to achieve objectives
- 4 Identify hierarchies of activities which support each function
- 5 Identify the transactions which operationalise each activity
- 6 Identify processes/activities common across functions
- 7 Produce a map of the hierarchies for each function
6 REQUIRED ATTRIBUTES OF BUSINESS CLASSIFICATION SCHEMES
- Sufficient classes and subclasses (keywords and descriptors) for all business functions and activities
- Terminology derived from functions/activities, not organisational unit names
- Unambiguous terminology
- Discrete classes (keywords)
- Hierarchical - from most general to most specific concept
- Specific to the organisation
- Devised in consultation with users
- Maintained to reflect changing business needs
7 Business v knowledge-base classification
- Business classification aims to
  - Create a scheme for arrangement and retrieval
  - Provide a basis for determining
    - Which documents to capture for evidential purposes
  - Determine retention periods
  - Define and assign security levels
- Knowledge-base classification aims to provide a basis for arrangement and retrieval of records only
- Business classification is based on the broad core functions and activities of the organisation
- Knowledge-based classification is based on either
  - literary warrant (document content or subject matter) or
  - user warrant (the needs of the user group)
- Business classification is limited to activities which have accountability requirements; knowledge-based systems do not make this distinction
8 Series, files and documents
- Classification provides a basis for arranging and retrieving records
- Classification takes place at the level of record series
- Series .. a group of identical or related records that are normally used and filed as a unit, and which permits evaluation as a unit for retention scheduling purposes
- Records series are divided into
  - Subseries (may be more than one)
  - Files
  - Documents
9 The record series hierarchy
There may be a number of sub-series
SERIES - HOUSE SALES
SUB-SERIES - Toowoomba
FILE - Sale to G Bass
DOCUMENT - Contract of Sale
10 Defining series and sub-series
- Series - a group of identical or related records that are normally used and filed as a unit, and which permits evaluation as a unit for retention scheduling purposes
- Sub-series - a sub-division of a record series to provide additional precision in arranging and retrieving records
Activity 6.1
11 File based v document based systems
- Individual retrieval units may be either files or documents
- A file is a group of related documents located within a file cover or folder
12 Establishing new files
- Creation of a new file needs to be properly authorised
- A new file should only be created when no file previously existed on that activity or subject
- When information in one file also relates to an issue dealt with in another file, cross references should be included in the indexing system
13 Registration of records
- ..the act of giving a record a unique identifier on its entry into a system
  - (AS ISO 15489, Part 1, 3.18)
- The purpose of registration is to provide evidence that a record has been created or captured in a records system
  - (AS ISO 15489, Part 2, 4.3.3)
- A record may be registered at the file or document level depending on assessment of evidence requirements
14 Variation in the registration process
- Registration in paper based manual systems
  - A register is normally a separate document
- Registration in computerised (automated) systems
  - A register is likely to be a combination of data elements
- Registration in electronic records systems
  - Register may include classification and determination of disposition and access status
  - Can register records automatically without the intervention of a records management practitioner
  - Metadata required for registration can often be automatically derived from the computing and business environment from which the record originates
Registration should be unalterable, with any changes able to be tracked through an audit trail
15 Minimum metadata required at registration
- Unique identifier
- Date and time of registration
- Title or abbreviated description
- Author (person or corporate body), sender or recipient
Actual metadata required will depend on evidence requirements and type of technology used
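The minimum registration metadata, together with the requirement that registration be unalterable and changes be tracked through an audit trail, can be sketched in Python. This is only an illustration: the `Register` class, its method names and the year/sequence identifier format are assumptions made for the example, not part of AS ISO 15489 or of any particular records system.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from itertools import count


@dataclass(frozen=True)  # frozen: fields cannot be changed in place
class RegistrationEntry:
    identifier: str          # unique identifier
    registered_at: datetime  # date and time of registration
    title: str               # title or abbreviated description
    author: str              # author, sender or recipient


class Register:
    """Hypothetical in-memory register; a real system would persist this."""

    def __init__(self):
        self._entries = {}
        self.audit_trail = []      # (when, identifier, action, old, new)
        self._sequence = count(1)  # running number, assumed format year/number

    def register(self, title, author, when=None):
        when = when or datetime.now()
        identifier = f"{when.year}/{next(self._sequence)}"
        entry = RegistrationEntry(identifier, when, title, author)
        self._entries[identifier] = entry
        self.audit_trail.append((when, identifier, "registered", None, title))
        return entry

    def amend_title(self, identifier, new_title, when=None):
        # The stored entry is never edited: a new entry replaces it and the
        # old and new values are written to the audit trail.
        when = when or datetime.now()
        old = self._entries[identifier]
        self._entries[identifier] = replace(old, title=new_title)
        self.audit_trail.append(
            (when, identifier, "title amended", old.title, new_title)
        )
        return self._entries[identifier]
```

Using a frozen dataclass means an accidental in-place change raises an error, while deliberate amendments go through a method that always records the old and new values.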
16 Information which may be included in the record's unique identifier (1)
- Document name or title
- Text description or abstract
- Date of creation
- Date and time of communication and receipt
- Incoming, outgoing or internal
- Author (with his/her affiliation)
- Sender (with his/her affiliation)
- Recipient (with his/her affiliation)
- Physical form
- Classification according to the classification scheme
- Links to related records documenting the same sequence of business activity or relating to the same person or case, if the record is part of a case file
17 Information which may be included in the record's unique identifier (2)
- Business system from which the record was captured
- Application software and version under which the record was created or in which it was captured
- Standard with which the record's structure complies (eg Standard Generalised Markup Language (SGML), Extensible Mark-up Language (XML))
- Details of embedded document links including the applications software and version under which the linked record was created
- Templates required to interpret document structure
- Access
- Retention period
- Other structural and contextual information useful for management purposes
18 Registration at document or file level
- Registration may take place at document or file level
- Even within a file-based system, important documents may still be registered
- File based registration process (see next slide)
- Document based registration process (correspondence management system)
  - Documents registered as discrete items and given their own number and/or classification terms
  - Each document also usually labelled with number of file in which it is stored
19 Registration at file level
Start
Document arrives to be classified
Does a file exist on activity or subject?
- No - Create new file
- Yes - Does file title still reflect activity/subject accurately?
  - No - Modify file title or classification terms
  - Yes - continue
Attach document to file
End
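The file-level registration flow on this slide can be read as a small decision procedure. The Python sketch below is one possible reading: it assumes that a newly created file and a retitled file both lead on to "Attach document to file", and the names `File`, `file_document`, `title_is_accurate` and `revise_title` are hypothetical. In practice the two judgement steps are made by a person, so they are passed in as functions.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    title: str
    documents: list = field(default_factory=list)


def file_document(document, subject, files, title_is_accurate, revise_title):
    """Attach a document to the file for its activity/subject.

    files             -- dict mapping activity/subject to File
    title_is_accurate -- callable(file, document) -> bool (human judgement)
    revise_title      -- callable(file, document) -> str  (human judgement)
    """
    file = files.get(subject)
    if file is None:
        # Does a file exist on activity or subject? No: create new file.
        file = File(title=subject)
        files[subject] = file
    elif not title_is_accurate(file, document):
        # File exists but its title no longer reflects the activity/subject.
        file.title = revise_title(file, document)
    # Both branches end with the document attached to the file.
    file.documents.append(document)
    return file
```

Keeping the decisions as parameters makes it clear which steps a system can automate (lookup, attachment) and which remain classification judgements.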
20 Indexing
- ..the process of establishing and applying terms or codes to particular records by which they may be retrieved.
- Appropriate allocation of indexing terms allows retrieval of records across classifications or categories.
  - (AS 4390-1996, Part 4, 8.1, p10)
- appropriate allocation of index terms extends the possibilities of retrieval of records across classifications, categories and media
  - (AS ISO 15489, Part 2, 4.3.4.3)
21 Deriving indexing terms
- Indexing terms can be derived from the document by computer or assigned manually using pre-established categories or indexing terms such as a thesaurus
- Indexing terms are commonly derived from
  - The format or nature of the record
  - The title or main heading of the record
  - The subject content of the record, usually in accord with the business activity
  - The abstract of a record
  - Dates associated with transactions recorded in the record
  - Names of clients or organisations
  - Particular handling or processing requirements
  - Attached documentation not otherwise identified or
  - The uses of the record
  - (AS ISO 15489, Part 1, 4.3.4.3)
22 File titling
- Titles need to be representative of a record's context as well as its content
- The file title, possibly a set of index terms, acts as a label for the file
- File titles aim to achieve two objectives
  - to help minimise confusion over what file to place a document in
  - to aid retrieval
Automated retrieval software uses sequential numbering, making titles even more important. Each word in a title is searchable.
23 File title structures (1)
- The words used in titles / indexes may be
  - selected from a thesaurus or
  - natural language terms (taken from document itself not from a list of allowed terms)
- A number of options exist for file titling structures eg
- OPTION 1 controlled vocabulary in hierarchical order followed by a number of natural language terms, eg
  - Controlled Vocab: Sales - Houses - Toowoomba
  - Natural language terms: Sale to B Lloyd
- OPTION 2 natural language summary statement eg
  - Sale of house at 143 Taylor Street to B Lloyd
24 File title structures (2)
- OPTION 3 a natural language summary statement followed by set of controlled vocabulary terms in no particular order eg
  - Sale of 143 Taylor Street
  - Toowoomba, Sales, Houses
- OPTION 4 Keyword and descriptors in no particular order
- OPTION 5 Linton's Keyword System of keyword followed by up to four descriptors in hierarchical order from general to specific, eg
  - keyword: Sales
  - descriptors: Houses, Toowoomba, Lloyd
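As a rough illustration of Option 5, the sketch below assembles a title from one keyword followed by up to four descriptors, given in order from general to specific. The function name `keyword_title`, the " - " separator (borrowed from the Option 1 example) and the error handling are assumptions for the example, not rules taken from the keyword system itself.

```python
MAX_DESCRIPTORS = 4  # "up to four descriptors" under Option 5


def keyword_title(keyword, descriptors=(), separator=" - "):
    """Build a hierarchical file title: keyword, then descriptors.

    Descriptors are expected in order from general to specific; the
    function keeps the order given and does not try to sort them.
    """
    keyword = keyword.strip()
    descriptors = [d.strip() for d in descriptors if d.strip()]
    if not keyword:
        raise ValueError("a keyword is required")
    if len(descriptors) > MAX_DESCRIPTORS:
        raise ValueError(f"at most {MAX_DESCRIPTORS} descriptors are allowed")
    return separator.join([keyword, *descriptors])


title = keyword_title("Sales", ["Houses", "Toowoomba", "Lloyd"])
```

Rejecting a fifth descriptor rather than silently dropping it keeps titling consistent across indexers, which is the point of a controlled structure.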
25 Keyword AAA Thesaurus
- Used widely in public sector organisations
- complies with AS 4390-1996 i.e.
  - based on business classification rather than knowledge-base classification
  - tight hierarchical structure employing three levels of terms i.e.
    - keyword
    - activity descriptor (may be more than one)
    - subject descriptor/free text (may be more than one)
- can be used with electronic or paper records
- http://www.unimelb.edu.au/CSD/image/execserv/keyintro.htm
26 Advantages and disadvantages of hierarchical file titling
- Hierarchical file titling allows
  - Browsing/printing of alphabetical listings with file titles grouped together within broad class terms (keywords) and activities
  - Broad searching (at level of keyword) or
  - Very specific searching (at level of free text)
- Possible disadvantages
  - Need to prespecify as many hierarchies as possible
  - Tendency to force each title into an inappropriate hierarchical order
27 Metadata and Electronic Records
- Metadata .. A description or profile of a document or other information object which may contain data about its context, form and content.
- A vital ingredient of electronic recordkeeping because the risk of loss of electronic documents is much higher than for paper records
- addition of metadata can be automated by records management software programs
  - http://www.gmb.com.au/products/button/intro.htm
- Overcomes inconsistency in naming electronic documents
- Essential to include the location of electronic records in the classification and indexing process
28 Steps in the Indexing Process (also involves classification)
- 1 Examine the document in an attempt to classify and find suitable indexing terms - look for
  - title
  - names of originating persons or organisations
  - opening and closing paragraphs
  - groups of words underlined or printed in different typefaces
- 2 Identify useful retrieval concepts by asking questions such as
  - Does the document/file record a transaction?
  - Does the document/file record an activity or course of action?
  - Does the document/file refer to methods for accomplishing a course of action?
  - Does the document/file deal with a particular product, organisation, or condition?
  - Does the subject of the document/file contain an action concept ie an operation or process?
29 Appropriate search elements (retrieval keys)
- Subject terms eg Sale of Houses
- Proper names eg G Lloyd
- Document types eg contract of sale
- Identifying numbers eg 2000/10
- The number of indexing terms employed will be determined by
  - file titling structures
  - user needs
  - available software
30 Steps in the Indexing Process (cont)
- 1 Examine the document
- 2 Identify useful retrieval concepts
- 3 Translate concepts into the indexing vocabulary. Issues to be considered include
  - controlled and/or natural language
  - method of indexing proper names
  - pre-coordinate or post-coordinate method
  - how specific index headings will be
  - how to achieve consistency in indexing
31 Controlled v Natural Language vocabulary
- Controlled vocabulary
  - indexer translates identified concepts into the standardised or authorised allowed terms in an alphabetical thesaurus
- Natural language
  - non-thesaurus terms and phrases assigned by the indexer in an extra field eg Narrative
  - often include proper names
- Summaries or abstracts can also be used as index terms where terms in any field are searchable online
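The split between controlled and natural language terms can be sketched as a lookup against a thesaurus. In this minimal Python example the thesaurus entries are invented for illustration (only "Sales" and "Houses" echo terms used on earlier slides), and the function name `index_terms` is hypothetical: concepts found in the thesaurus are translated to the authorised term, and anything else is kept as a natural language term for a free-text field.

```python
# Hypothetical thesaurus: entry term (lower case) -> authorised term
THESAURUS = {
    "sales": "Sales",
    "selling": "Sales",
    "houses": "Houses",
    "homes": "Houses",
    "dwellings": "Houses",
}


def index_terms(concepts, thesaurus=THESAURUS):
    """Return (controlled_terms, natural_language_terms) for the concepts."""
    controlled, natural = [], []
    for concept in concepts:
        concept = concept.strip()
        authorised = thesaurus.get(concept.lower())
        if authorised is None:
            # Not in the thesaurus: keep as a natural language term,
            # for example a proper name.
            natural.append(concept)
        elif authorised not in controlled:
            controlled.append(authorised)
    return controlled, natural


controlled, natural = index_terms(["selling", "homes", "B Lloyd"])
```

Two indexers who describe the same record as "selling homes" and "sales of dwellings" end up with the same controlled terms, which is the consistency benefit a thesaurus is meant to deliver.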
32 Methods of Indexing proper names
- Personal, organisational and other proper names are common indexing terms
- file titles may consist of a name only or a name plus several additional indexing terms
- consistency in name indexing is difficult
  - names often have various forms
  - composed of elements that can be cited in different orders
  - have a tendency to change over years
- Directory method is most common method of indexing proper names
33 Pre-coordinate and post-coordinate indexing (1)
- Pre-coordinate indexing - terms in a compound topic are pre-combined into a single subject heading, eg
  - Sales - Houses - 143 James Street
- necessary to use cross referencing, eg
  - 143 James Street - See Under Sales - Houses - 143 James Street
- often used in on-line indexing systems as search can be conducted on single words within a heading
34 Pre-coordinate and post-coordinate indexing (2)
- Post-coordinate indexing - each term in a multi-aspect topic is entered as an individual indexing unit eg
  - Sales
  - Houses
  - 143 James Street
- terms may be entered in any order
- can be searched using Boolean operators
- better suited to online searching
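The difference between the two methods can be made concrete with a short Python sketch. It is a simplified illustration, not a description of any particular indexing product: the function and class names are hypothetical, the file numbers follow the 2000/10 pattern used elsewhere in this module, and the second address is invented. A pre-coordinate heading is a single combined string that needs cross references, while a post-coordinate index stores each term separately and combines them at search time.

```python
from collections import defaultdict


def precoordinate_heading(terms):
    """Combine the terms of a compound topic into one subject heading."""
    return " - ".join(terms)


def cross_references(terms):
    """'See Under' entries for every term that is not the lead term."""
    heading = precoordinate_heading(terms)
    return [f"{term} - See Under {heading}" for term in terms[1:]]


class PostCoordinateIndex:
    """Each term is an individual indexing unit pointing to file numbers."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, file_id, terms):
        for term in terms:  # order of entry does not matter
            self._postings[term.lower()].add(file_id)

    def search_all(self, *terms):
        """Boolean AND: files indexed under every one of the terms."""
        sets = [self._postings.get(term.lower(), set()) for term in terms]
        return set.intersection(*sets) if sets else set()


index = PostCoordinateIndex()
index.add("2000/10", ["Sales", "Houses", "143 James Street"])
index.add("2000/11", ["Houses", "Sales", "27 Example Road"])  # invented file
```

With the post-coordinate index no cross references are needed: a search on "143 James Street" alone, or on "Sales" and "Houses" together, finds the file whichever order the terms were entered in.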
35 Specificity of index headings
- Keywords are examples of broad class terms
- proper names are examples of specific indexing terms
- indexing terms should include some keywords (broad terms) plus descriptors (more specific) and possibly free text (most specific eg proper names)
36 Consistency in Indexing
- Use of thesaurus, naming rules, guidelines on translation lead to
  - consistent indexing, predictability in retrieval
- If no use of thesaurus etc
  - problems with scattered files,
  - poor retrieval rates,
  - incomplete files, problems with efficient retention and disposal
37 Impact of technology on indexing and retrieval
- Increasingly sophisticated indexing software
- Increasingly sophisticated search engines
- Increasingly sophisticated navigational mechanisms
38 New concepts for classification and indexing in technological environments
GROUPWARE - Digitally based team collaboration. An organisational intranet could be regarded as a very simple example of groupware
WORKFLOW SOFTWARE - Automates the flow of tasks and information around an organisation
COMPOUND DOCUMENTS - Digital documents which may be a combination of text, audio or graphic objects with elements not necessarily stored together on one server but brought together through hypertext links
39 Indexing and Search Methods for Full Text Databases and Networked information
- The nature and extent of human classification and indexing required will depend on the storing, indexing, and searching software capabilities of the system
- LANs and intranets allow the requesting of information by a client from document collections stored on a server
- Geographically dispersed organisations can access corporate documents stored at different points on the network
40 RETRIEVAL IN NEW TECHNOLOGICAL ENVIRONMENT
Successful retrieval depends on
- Systematic Directory Structures
- Standardised Naming Conventions
41 Evaluating retrieval performance
RECALL - the proportion of relevant documents in the collection that are retrieved
PRECISION - the proportion of retrieved documents found to be relevant
[Figure: recall and precision]
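Recall and precision are usually calculated as ratios: recall compares the relevant documents retrieved with all relevant documents in the collection, and precision compares them with everything retrieved. The Python sketch below works through one invented example; the document labels and the function name are assumptions for illustration.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved -- documents the search returned
    relevant  -- documents in the collection that actually answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Invented example: 4 documents retrieved, 3 relevant in the collection,
# 2 of the retrieved documents are relevant.
recall, precision = recall_and_precision(
    retrieved=["A", "B", "C", "D"],
    relevant=["A", "B", "E"],
)
```

Here recall is 2 out of 3 and precision is 2 out of 4. Broadening a search tends to raise recall while lowering precision, which is why the two measures are reported together.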
42 Indexing and Searching Technologies
- Free text searching - computer searches for word or phrase in one or more database fields or document full text
- Free text scanning - computer sequentially scans terms in each document or a database to find a match
- N-grams or suffix arrays - index stores word fragments on which matching takes place
- Pattern recognition - index stores binary representations - overcomes need for correct spelling but reduces precision - great for sound, video and images
- Document clustering - document assigned a theme which is used as the index value
- Hypertext systems - nodes or chunks of information (including text, images or sound) are stored and connected by means of links or pathways
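To show what "index stores word fragments" can mean for n-grams, here is a minimal Python sketch of a trigram index. The choices made in it are assumptions for illustration only: three-letter fragments, a `$` marker padding each end of a word, and a Jaccard overlap threshold of 0.5. Real products differ in all three.

```python
from collections import defaultdict


def ngrams(word, n=3):
    """Set of n-letter fragments of a word, padded with '$' at both ends."""
    padded = f"${word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NGramIndex:
    def __init__(self, n=3):
        self.n = n
        self._postings = defaultdict(set)  # fragment -> words containing it

    def add(self, word):
        for gram in ngrams(word, self.n):
            self._postings[gram].add(word)

    def candidates(self, query, min_overlap=0.5):
        """Indexed words whose fragments overlap enough with the query."""
        query_grams = ngrams(query, self.n)
        shared = defaultdict(int)
        for gram in query_grams:
            for word in self._postings.get(gram, ()):
                shared[word] += 1
        scored = []
        for word, count in shared.items():
            union = len(query_grams | ngrams(word, self.n))
            score = count / union  # Jaccard similarity of fragment sets
            if score >= min_overlap:
                scored.append((score, word))
        return [w for _, w in sorted(scored, key=lambda p: (-p[0], p[1]))]


index = NGramIndex()
for term in ["Toowoomba", "Townsville", "Lloyd"]:
    index.add(term)
matches = index.candidates("Toowomba")  # misspelt query
```

Because matching is on fragments rather than whole words, the misspelt query still shares most of its trigrams with "Toowoomba" and retrieves it, while "Townsville" shares only one fragment and falls below the threshold.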
43 Search Approaches (1)
- Boolean searches
  - and - Loans and Students
  - or - Loans or Students
  - not - Loans not Students
- Wildcard searching (* to search for word where some letter/s are missing at beginning, in middle of or at end of word)
  - McG* (* more than one letter) Truncation example
  - McGra@y (@ just one letter)
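The Boolean and wildcard searches on this slide can be tried out with a few lines of Python. The sketch uses an invented three-document collection and Python's standard `fnmatch` conventions, in which `*` stands for any run of letters and `?` for exactly one letter. These symbols are an assumption of the example, since retrieval packages use different wildcard characters.

```python
from fnmatch import fnmatchcase

# Invented collection: document number -> index terms
DOCUMENTS = {
    1: {"loans", "students"},
    2: {"loans", "staff"},
    3: {"students", "enrolment"},
}


def docs_with(term, documents=DOCUMENTS):
    """Documents indexed under a single term."""
    return {doc for doc, terms in documents.items() if term.lower() in terms}


def boolean_and(a, b):
    return docs_with(a) & docs_with(b)   # both terms present


def boolean_or(a, b):
    return docs_with(a) | docs_with(b)   # either term present


def boolean_not(a, b):
    return docs_with(a) - docs_with(b)   # first term without the second


def wildcard_match(pattern, names):
    """Names matching a pattern, ignoring case."""
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]


truncated = wildcard_match("McG*", ["McGrath", "McGregor", "Lloyd"])
single = wildcard_match("McGra?y", ["McGrady", "McGrath", "McGraty"])
```

AND narrows the result to document 1, OR widens it to all three documents, and NOT leaves only document 2. The truncated pattern matches both "McG" names, while the single-letter pattern matches only names with exactly one letter between "McGra" and "y".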
44 Search approaches (2)
- Proximity operators - used to stipulate that terms must be adjacent, in same sentence, in same paragraph etc
  - Student ADJ Loans
- Fuzzy logic - search specifications made more vague than that input by researcher
  - Eg Include OR search if only AND is requested, but rank OR results lower
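Both ideas on this slide can be sketched briefly in Python. The example makes two assumptions that real systems may not share: ADJ is read as "immediately followed by, in that order", and the fuzzy search ranks documents simply by how many of the requested terms they contain. The sample documents and function names are invented.

```python
import re


def tokens(text):
    """Lower-case words of a text, in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def adjacent(text, first, second):
    """Proximity test: is `first` immediately followed by `second`?"""
    words = tokens(text)
    pairs = zip(words, words[1:])
    return any(a == first.lower() and b == second.lower() for a, b in pairs)


def fuzzy_and(documents, terms):
    """AND search relaxed to OR: full matches first, partial matches after."""
    wanted = {t.lower() for t in terms}
    scored = []
    for doc_id, text in documents.items():
        matched = len(wanted & set(tokens(text)))
        if matched:  # documents with none of the terms are left out
            scored.append((matched, doc_id))
    return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = {
    "a": "Student loans policy",
    "b": "Loans to staff",
    "c": "Student enrolment",
    "d": "Annual report",
}
ranked = fuzzy_and(docs, ["student", "loans"])
```

Document "a" satisfies the original AND request and is ranked first; "b" and "c" satisfy only the relaxed OR request and follow it; "d" contains neither term and is not returned.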
45 Relevance Ranked Searching
- Documents found in response to a query ranked from most to least relevant
- 2 approaches
  - term summing - computer counts how many times each term in the query occurs in the document
  - weighted term summing - summing based on frequency of occurrence and weighted value which is dependent on the uniqueness of the term
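The two ranking approaches can be compared with a small Python sketch. Term summing is a straight count. For the weighted version, the slide does not say how "uniqueness" is measured, so this example assumes a common choice: the logarithm of the collection size divided by the number of documents containing the term (often called inverse document frequency). The collection and function names are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def term_sum(query, text):
    """Count how many times each query term occurs in the document."""
    counts = Counter(tokens(text))
    return sum(counts[term] for term in tokens(query))


def weighted_term_sum(query, text, collection):
    """Frequency of each query term times an assumed uniqueness weight."""
    counts = Counter(tokens(text))
    doc_terms = [set(tokens(doc)) for doc in collection]
    score = 0.0
    for term in tokens(query):
        containing = sum(1 for terms in doc_terms if term in terms)
        if containing:
            # Rarer terms get a larger weight; a term found in every
            # document gets a weight of zero.
            score += counts[term] * math.log(len(collection) / containing)
    return score


collection = [
    "student loans and student fees",
    "loans to staff",
    "student enrolment",
]
query = "student enrolment"
plain = [term_sum(query, doc) for doc in collection]
weighted = [weighted_term_sum(query, doc, collection) for doc in collection]
```

With plain term summing the first and third documents tie, because each contains two occurrences of query terms. With weighting, the third document ranks higher because "enrolment" occurs in only one document and so counts for more than the commoner "student".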
46 Internet search engines
- Web search engines eg Google...
- Use computer programs to move through Web addresses, titles/headers and words on web pages to collect words and addresses and place them in an index and rank relevance of sites to query