Title: Slide 1
1- INIST Machine-aided indexing
Abdelmajid Khayari, Stéphane Schneider
INIST/CNRS, France
NFAIS Forum, New York. April 22, 2005
2- INIST
- Institute for Scientific and Technical Information.
- A service of the French CNRS.
- Activities: collection, analysis and dissemination of the results and findings of worldwide research.
- Fields covered: science, technology, medicine, humanities and social sciences.
- Leading scientific and technical document supplier in France.
- Producer of multilingual, multidisciplinary bibliographic databases (PASCAL, FRANCIS and ISD) covering the core worldwide scientific literature.
INIST Machine-Aided Indexing. NFAIS forum. New
York. April 22, 2005.
3- INIST (continued)
- Provider of customized services to the scientific community (portals, current awareness, training, etc.).
- Partner in open access initiatives.
- Research partner in the information and communication technologies (NTIC) community.
4- Aims and scope
- Introduction of a degree of automation into the indexing process
  - To what extent can the process be automated?
  - Which approach is suitable?
  - What are the prerequisites?
- Evaluation of the final result
  - Is there support from the indexers?
  - Does it meet the expectations of the indexers?
  - Are the results acceptable?
5- Current indexing practices
- About 70 internal and external specialized indexers.
- Documents in diverse languages (mainly English, French, Spanish, German).
- Semi-manual allocation of descriptors and classification codes.
- Use of controlled vocabularies and classification schemes, supplemented with free keywords.
- Multilingual descriptors.
6- Workstation
- Development of an in-house platform, accessed by INIST indexers via our intranet since 2000.
- Setup and fine-tuning of indexing programs.
- Collaborative work on terminology resources.
- About 2000 input records processed each night.
- Use of fully automated indexing for periodicals that are not manually analyzed (May 2004).
7- Two indexing programs
- Lexical method. Uses equivalence rules gathered in subject terminological resources to assign descriptors and classification codes to documents.
- Statistical approach (lexical collocation). A two-stage process:
  - Training stage: a corpus of human-indexed citations is processed to create association dictionaries.
  - Indexing stage: using the association dictionaries, controlled vocabulary descriptors and classification codes are assigned to the incoming documents.
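The two stages of the statistical approach can be illustrated with a minimal sketch. It assumes the association dictionary is a plain word-to-descriptor co-occurrence count, which is a simplification chosen for illustration and not necessarily how the INIST program weights associations. The function names `train` and `index`, the `top_n` parameter and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Training stage: build an association dictionary (word -> descriptor counts).

    corpus is a list of (text, descriptors) pairs taken from human-indexed citations.
    """
    assoc = defaultdict(Counter)
    for text, descriptors in corpus:
        for word in set(text.lower().split()):
            for descriptor in descriptors:
                assoc[word][descriptor] += 1
    return assoc


def index(text, assoc, top_n=3):
    """Indexing stage: sum the association counts of the words found in the text."""
    scores = Counter()
    for word in set(text.lower().split()):
        scores.update(assoc.get(word, {}))
    return [descriptor for descriptor, _ in scores.most_common(top_n)]


# Toy human-indexed corpus, invented for illustration only.
corpus = [
    ("chronic pain in rat model", ["Pain", "Animal"]),
    ("pain treatment in patients", ["Pain", "Treatment"]),
    ("rat behaviour study", ["Animal", "Behavior"]),
]
assoc = train(corpus)
suggested = index("pain in rat", assoc, top_n=2)
print(suggested)
```

A real system would probably normalize the counts (for example against overall word frequency) so that common words such as "in" do not dominate the scores.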
8- Lexical method
- Text processing of bibliographical records (titles, abstracts, author keywords)
  - Parsing text into phrases and phrases into words.
  - Lemmatization.
- Matching with subject terminology resources
  - Searching for terms that correspond to descriptors and classification codes.
  - Searching intervals for compound terms (2 to 6 words).
- Ordering of candidate keywords and classification codes: semantic categories are used to construct the indexing grid.
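The text processing and matching steps above can be sketched as follows. This is a minimal illustration under several assumptions: the terminology resource `TERMS` and the lemma table `LEMMAS` are invented stand-ins, compound terms are matched as contiguous word windows only (the slides do not say whether the search interval allows gaps), and the function name `candidate_descriptors` is hypothetical.

```python
import re

# Hypothetical terminology resource: lemmatized term -> descriptor.
TERMS = {
    ("rat",): "Rat",
    ("pointing", "task"): "Pointing task",
    ("motor", "control"): "Motor control",
}

# Toy lemma table standing in for a real lemmatizer.
LEMMAS = {"rats": "rat", "tasks": "task", "controls": "control"}

MAX_TERM_WORDS = 6  # upper bound for compound terms given in the slides


def lemmatize(word):
    return LEMMAS.get(word, word)


def candidate_descriptors(record_text):
    """Return descriptors whose terms occur in the text, in order of first match."""
    found = []
    # Split the text into phrases on punctuation, then phrases into words.
    for phrase in re.split(r"[.;:,!?()]+", record_text.lower()):
        words = [lemmatize(w) for w in re.findall(r"[a-z0-9-]+", phrase)]
        for start in range(len(words)):
            for size in range(1, MAX_TERM_WORDS + 1):
                window = tuple(words[start:start + size])
                if len(window) < size:
                    break  # window ran past the end of the phrase
                descriptor = TERMS.get(window)
                if descriptor and descriptor not in found:
                    found.append(descriptor)
    return found


text = "Pointing tasks in rats: effects on motor control."
matches = candidate_descriptors(text)
print(matches)
```

Restricting windows to a single phrase keeps a compound term from being matched across a punctuation boundary.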
9- Lexical method (continued)
- Generation of additional descriptors and codes: each keyword or code may trigger another one using association rules in a cascade-like manner
  - Rat -> Animal / Acropulpitis -> Finger.
  - Pointing task -> Manual task -> Motor control
  - Pointing task + Vision -> Visuomotor integration
- Ranking of keywords and codes.
- Filtering of the assigned elements: the number of occurrences for each category is used as a filter to set the desired number of candidate descriptors.
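The cascade of association rules can be sketched as a simple forward-chaining loop. The rule set below reuses the examples from the slide; the data structure, the function name `expand`, and the reading of the last rule as a combination of two descriptors are assumptions made for illustration.

```python
# Hypothetical rule format: (set of required descriptors, triggered descriptor).
RULES = [
    ({"Rat"}, "Animal"),
    ({"Acropulpitis"}, "Finger"),
    ({"Pointing task"}, "Manual task"),
    ({"Manual task"}, "Motor control"),
    ({"Pointing task", "Vision"}, "Visuomotor integration"),
]


def expand(descriptors, rules=RULES):
    """Apply the rules repeatedly until no rule adds a new descriptor."""
    result = set(descriptors)
    changed = True
    while changed:
        changed = False
        for required, triggered in rules:
            # A rule fires when all of its required descriptors are present.
            if required <= result and triggered not in result:
                result.add(triggered)
                changed = True
    return result


expanded = expand({"Pointing task", "Vision"})
print(sorted(expanded))
```

Because a descriptor is only ever added once, the loop terminates even if the rules contain cycles.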
10- Lexical method (continued)
- Check-up by human indexers: candidate descriptors and codes are validated and completed.
- Continuous feedback on terminological resources is carried out in parallel by introducing new equivalence rules, adding new data (synonyms) to existing rules, or deleting noise-producing rules.
- Feedback on the indexing program (changing parameters and the combination of terminology resources).
11- Initial version
- Deployment of a basic version using the two methods.
- A semi-automated process is used to construct the subject terminology resources.
  - Bibliographic databases are used to extract corpora dealing with a specific subject (e.g. pain).
  - Corpora are processed to extract a ranked list of descriptors, which is run against the controlled vocabulary to extract synonyms, translations, semantic categories, etc.
  - This core thematic dictionary is enriched with new concepts and new data.
  - During an iterative indexing and re-indexing process, the performance of the dictionary is improved.
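The construction of a core thematic dictionary can be sketched as follows: descriptors from a subject corpus are ranked by frequency, then looked up in the controlled vocabulary to pull in synonyms, translations and semantic categories. The vocabulary entries, field names, the `min_count` threshold and the function name `build_core_dictionary` are all hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical controlled vocabulary entries, invented for illustration.
VOCABULARY = {
    "Pain": {"synonyms": ["Ache"], "fr": "Douleur", "category": "Symptom"},
    "Morphine": {"synonyms": [], "fr": "Morphine", "category": "Drug"},
}


def build_core_dictionary(corpus_descriptors, vocabulary=VOCABULARY, min_count=2):
    """Rank descriptors by frequency and keep those found in the vocabulary."""
    counts = Counter(d for record in corpus_descriptors for d in record)
    core = []
    for descriptor, count in counts.most_common():
        if count >= min_count and descriptor in vocabulary:
            entry = {"descriptor": descriptor, "count": count}
            entry.update(vocabulary[descriptor])
            core.append(entry)
    return core


# Descriptors of four records extracted for the subject "pain" (toy data).
records = [["Pain", "Morphine"], ["Pain", "Rat"], ["Pain", "Morphine"], ["Rat"]]
core = build_core_dictionary(records)
for entry in core:
    print(entry)
```

Descriptors that are frequent but absent from the vocabulary ("Rat" in this toy data) are the kind of gap that the manual enrichment step would then address.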
12- Initial version (continued)
- Sharing of thematic resources ("Someone has the dictionary of diseases or of geographical names I need.").
- Access to full-text articles (OCRized and directly from publishers).
- Direct feedback to administrators/developers incorporated.
- Evaluation of performance on each citation or collection of citations. Final indexing is compared to the initial one as proposed by the program.
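The comparison between the program's proposal and the final indexing can be sketched with set operations. Precision and recall are used here as the measures; the slides do not name the statistics actually computed, so this choice, like the function names `evaluate` and `evaluate_collection`, is an assumption.

```python
def evaluate(proposed, final):
    """Compare program-proposed descriptors with the final, human-revised set."""
    proposed, final = set(proposed), set(final)
    kept = proposed & final
    return {
        "kept": sorted(kept),
        "rejected": sorted(proposed - final),
        "added_by_indexer": sorted(final - proposed),
        "precision": len(kept) / len(proposed) if proposed else 0.0,
        "recall": len(kept) / len(final) if final else 0.0,
    }


def evaluate_collection(pairs):
    """Micro-averaged precision and recall over (proposed, final) pairs."""
    kept = sum(len(set(p) & set(f)) for p, f in pairs)
    n_proposed = sum(len(set(p)) for p, _ in pairs)
    n_final = sum(len(set(f)) for _, f in pairs)
    return {
        "precision": kept / n_proposed if n_proposed else 0.0,
        "recall": kept / n_final if n_final else 0.0,
    }


# Toy citation: four proposed descriptors, four in the final record.
report = evaluate(
    ["Pain", "Rat", "Animal", "Finger"],
    ["Pain", "Rat", "Animal", "Analgesic"],
)
print(report)
```

The "rejected" and "added_by_indexer" lists are the items an administrator would inspect when deciding whether a rule is producing noise or a term is missing from the resources.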
13- Evaluation
- Indexers' support for MAI was not easy to obtain
  - Significant psychological reluctance at the beginning ("the machine will never be able to perform a highly intellectual task: abstraction").
  - The crucial need to formalize specialist knowledge is becoming well understood.
- Many concerns about fully automated indexing, since the reference standard is human-produced indexing (i.e. a candidate descriptor which is not inaccurate per se will be considered wrong because the human indexer did not include it in the final record).
14- Evaluation (continued)
- The lexical method is predominantly used by indexers.
- The statistical one is used mainly for determination of classification codes.
- Statistics are obtained by comparing machine indexing with the final record after human revision.
- Performance is proportional to the degree of improvement of terminology resources (in pilot subject fields, up to 80% accurate candidate descriptors can be obtained).
- Unsuccessful machine indexing triggers feedback on computer programs and on terminology resources.
15- Evaluation (continued)
- During the deployment phase, time savings are not always achieved because feedback on terminology resources is time-consuming.
- Nevertheless, benefits are real in terms of:
  - Indexing consistency (less intra- and inter-individual variation).
  - Indexers' expertise and knowledge acquired during the abstraction-indexing processes are integrated into an organizational resource (knowledge capitalization and sharing).
16- Future trends
- Indexing programs improvement
  - Improvement of textual pattern extraction (genes, etc.).
  - Introduction of advanced natural language processing
    - Extraction of concepts.
    - Extraction of relationships between concepts.
- Improvement of citation pre-classification in order to be able to assign the right combination of subject resources.
- Constitution of a unified terminology database.