Title: Tatyana N. Yudina
1Research Computing Center of Moscow State University
NCO Center for Information Research
Tatyana N. Yudina yudina_at_mail.cir.ru
University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical consortium) www.cir.ru
2Plan
- UIS RUSSIA. General
- Thesaurus
- ALTP
- Bilingual Information Retrieval
- Text categorization
- Examples
3University Information System RUSSIA Collections
1 500,000/ 17.5Gb (www.cir.ru)
4NLP technology in UIS RUSSIA
holdings
convertors
Automatic Linguistic Text Processing/Linguistic Processors
.POD
.OUT
.LEM
.HDR
.HTM
WEB www.cir.ru (Apache OAS)
ORACLE
Administrator.
5UIS RUSSIA
- Collections of documents in English
-- RePEc (Research Papers in Economics, www.repec.org) abstracts and full texts
-- Council of Europe/European Court of Human Rights documents
-- UNESCO documents
12Thesaurus
13Sociopolitical Thesaurus
29,000 concepts, 75,000 terms, 110,000 conceptual relations
- constructed specially as a tool for automatic text processing
- contains terms from economic, financial, political, military, social, legislative and cultural domains
- a set of relations is adapted to information-retrieval applications
- regularly tested during automatic text processing
14THESAURUS for Information Retrieval in Sociopolitical Domain
-- Thesaurus provides for query refinement: reformulation/expansion
-- Terminology of Thesaurus covers 95-98% of words and terms of Russian government publications, academic papers and mass media texts from 1991 onward
-- Thesaurus is the main element of ALTP (automatic linguistic text processing) technology.
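Query expansion through a thesaurus can be sketched as follows. This is only an illustration under assumptions: the toy `THESAURUS` dictionary, its concept names and the `expand_query` helper are hypothetical and are not taken from UIS RUSSIA.

```python
from collections import deque

# Hypothetical toy thesaurus: concept -> its terms and its narrower concepts.
THESAURUS = {
    "TAX": {"terms": ["tax", "taxation"], "narrower": ["INCOME TAX", "VAT"]},
    "INCOME TAX": {"terms": ["income tax"], "narrower": []},
    "VAT": {"terms": ["value added tax", "VAT"], "narrower": []},
}


def expand_query(concept, thesaurus, max_depth=1):
    """Collect the terms of `concept` and of its narrower concepts.

    Walks the hierarchy breadth-first, down to `max_depth` levels below
    the starting concept, and returns the terms in visiting order.
    """
    seen = {concept}
    terms = []
    queue = deque([(concept, 0)])
    while queue:
        current, depth = queue.popleft()
        entry = thesaurus.get(current)
        if entry is None:
            continue  # concept not described in the thesaurus
        terms.extend(entry["terms"])
        if depth < max_depth:
            for child in entry["narrower"]:
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return terms
```

With `max_depth=0` the query is only reformulated with synonyms of the concept itself; larger values also pull in terms of narrower concepts.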
15Sociopolitical Thesaurus vs. Legislative
Indexing Vocabulary (LIV)
16General Structure of Thesaurus
17Query Refinement
18Navigation in Thesaurus
19ALTP
- Automatic Linguistic Text Processing
-- Conceptual Indexing
-- Automatic Coherent Summarisation
-- Automatic Text Categorisation
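A minimal sketch of conceptual indexing, assuming it means mapping the terms found in a text to thesaurus concepts. The `TERM_TO_CONCEPT` table and the `concept_index` function are hypothetical; the sketch uses plain lowercase tokens and longest-match lookup, and does not model lemmatization or disambiguation.

```python
import re
from collections import Counter

# Hypothetical term table: token sequence -> thesaurus concept.
TERM_TO_CONCEPT = {
    ("income", "tax"): "INCOME TAX",
    ("tax",): "TAX",
    ("court",): "COURT",
}
MAX_LEN = max(len(key) for key in TERM_TO_CONCEPT)


def concept_index(text):
    """Count concept occurrences in `text`, preferring the longest matching term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate term first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            concept = TERM_TO_CONCEPT.get(tuple(tokens[i:i + n]))
            if concept:
                counts[concept] += 1
                i += n  # skip the tokens consumed by the matched term
                break
        else:
            i += 1  # no term starts at this token
    return counts
```

Longest-match lookup keeps "income tax" from also being counted as the more general concept TAX at the same position.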
20Term Extraction for Russian Official Documents
(RF Government Regulation N604 26.06.1995)
21Thematic Lines of Thesaurus Terms (RF
Government Regulation N604 26.06.1995)
22Network of Thematic Nodes (RF Government
Regulation N604 26.06.1995)
23Network of Thematic Nodes (RF Government
Regulation N604 26.06.1995)
24Structure of Thematic Representation
Main Thematic Nodes
Specific Thematic Nodes
25Structural Thematic Summary (RF Government Regulation N604 26.06.1995)
26Bilingual Information Retrieval
27English-Russian Sociopolitical Thesaurus
- Hierarchical conceptual net of 65 thousand English terms
- Manual work
- Use of general and special English-Russian dictionaries
- Study of conventional American and British dictionaries and information-retrieval thesauri
- Cross-checking of translations. Addition of multiword variants. Internet checks.
28English-Russian Sociopolitical Thesaurus testing
and use in new applications
- Automatic text categorization of economic papers and abstracts using JEL subject headings (700 categories) (supported by Ford Foundation, USA)
- Automatic text processing of statistical tables (in cooperation with Berkeley University, USA)
- Automatic text processing of European documents (European Court of Human Rights, Council of Europe, European Union): problems of harmonization of Russian legislation
29Thesaurus Terminology in Sociopolitical Domain
30Adding languages to Sociopolitical Thesaurus
- It is a challenge to develop a multilingual Sociopolitical Thesaurus that describes terms of the sociopolitical domain from different languages in the same hierarchical net
- A project is under discussion to add the Tatar language to the bilingual thesaurus. Tatars are the second-largest ethnic group in Russia
31Approach to Organization of Bilingual Search in
UIS RUSSIA
- Development of a bilingual ontology in
sociopolitical domain based on Russian
Sociopolitical Thesaurus for automatic text
processing
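One way to picture search over a bilingual ontology: each concept carries terms in both languages, so a query term in one language is mapped to its concept and then to the terms of the other language. The `CONCEPTS` table and `translate_query` below are hypothetical illustrations, not the UIS RUSSIA data or API.

```python
# Hypothetical bilingual concept table: concept -> terms per language.
CONCEPTS = {
    "COURT": {"en": ["court", "tribunal"], "ru": ["суд", "трибунал"]},
    "TAX": {"en": ["tax"], "ru": ["налог"]},
}


def translate_query(term, source="en", target="ru"):
    """Map a query term to target-language terms through shared concepts."""
    term = term.lower()
    translated = []
    for entry in CONCEPTS.values():
        if term in entry[source]:
            translated.extend(entry[target])
    return translated
```

Because the mapping goes through concepts rather than word pairs, all synonyms attached to a concept in the target language are returned together.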
34Use of Thesaurus in Information Retrieval
applications
- Flexible knowledge-based categorization systems (9 systems)
-- NEW: Automatic text categorization of Russian legislation (200 000 documents), 3000 categories
- Knowledge-based text summarization system
-- SUMMAC conference
- Thesaurus-based information retrieval
-- a specially constructed thesaurus can significantly improve efficiency of information retrieval (3-point average precision)
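The slide does not define the measure. A common reading of 3-point average precision is the mean of interpolated precision at three recall levels; the sketch below assumes the levels 0.2, 0.5 and 0.8, which is an assumption and may differ from the evaluation actually used. The function name is hypothetical.

```python
def three_point_average_precision(ranked_relevance, total_relevant,
                                  levels=(0.2, 0.5, 0.8)):
    """Mean interpolated precision at the given recall levels.

    ranked_relevance: booleans for the ranked result list, best first.
    total_relevant:   number of relevant documents in the collection.
    levels:           recall levels (assumed here, not given in the source).
    """
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    precisions = []
    for level in levels:
        # Interpolated precision: best precision at any recall >= level.
        reachable = [p for r, p in points if r >= level]
        precisions.append(max(reachable) if reachable else 0.0)
    return sum(precisions) / len(levels)
```

For a ranking where the first and third results are the only two relevant documents, the interpolated precisions are 1, 1 and 2/3, giving an average of 8/9.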
35Bilingual Search in UIS RUSSIA
36www.cir.ru/is4/
37Text Categorization
40Text Categorization Using Thematic Representation
Systems of Subject Headings
-- RF Subject Headings System for Legal Acts (RF President Decree N511, 15.03.200 1169 items, 4 levels)
-- RF Central Election Committee Legal Subject Headings (450 items, 4 levels)
-- 80 Top Terms of Legislative Indexing Vocabulary (LIV), Congressional Research Service of the US Congress
42Subject Heading as Formulae of Support Concepts
43Full Representation of Subject Heading (expansion of support concepts)
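A sketch of how a subject heading could be written as a formula of support concepts and then expanded. It assumes the formula is a disjunction of conjunctions and that expansion adds all narrower concepts from the thesaurus; the `NARROWER` table, the `HEADING` formula and both helper functions are hypothetical.

```python
# Hypothetical thesaurus fragment: concept -> narrower concepts.
NARROWER = {
    "ELECTION": ["PRESIDENTIAL ELECTION", "REFERENDUM"],
    "FINANCING": ["CAMPAIGN FUND"],
}

# Hypothetical heading "Election financing" as a formula in disjunctive
# normal form: a list of conjunctions, each a list of support concepts.
HEADING = [["ELECTION", "FINANCING"]]


def expand(concept):
    """Return the concept together with all of its narrower concepts."""
    result = {concept}
    stack = [concept]
    while stack:
        for child in NARROWER.get(stack.pop(), []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def matches(doc_concepts, formula):
    """True if some conjunction has every support concept (or a narrower
    concept from its expansion) present among the document's concepts."""
    doc = set(doc_concepts)
    return any(all(expand(c) & doc for c in conjunction)
               for conjunction in formula)
```

A document indexed with REFERENDUM and CAMPAIGN FUND satisfies the heading through the expansions, while a document with REFERENDUM alone does not.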
44Examples
45http://www.cir.ru/docs/ips/techno/gmtpod/index_e.jsp
46Results of Text Categorization
Info
47Known terms
Word/Phrase Sense Disambiguation T_M ok
M not
48Related Terms for judgment
49Thematic Summary
Link between two thematic lines DISCHARGE
ORIGINATE DISMISSAL WAGE LABOR
and HUNGARY FORINT HUNGARIANS BUDAPEST
STATE, COUNTRY
50Two thematic lines JUDICIAL TRIAL and HUNGARY in text
51Automatic Summary
52Russian Text English Terms
53Russian Text English Thematic Summary
54Bilingual Text Categorization
Support of Subject Heading