Tatyana N. Yudina - PowerPoint PPT Presentation

About This Presentation
Title:

Tatyana N. Yudina

Description:

Russian-European Center for Economic Policy; Fiscal Policy Center ... gazeta; Izvestia; ... Mass media. 55,000. 1998-... State Statistics Agency; ...


Transcript and Presenter's Notes

Title: Tatyana N. Yudina


1
Research Computing Center of Moscow State University
NCO Center for Information Research
Tatyana N. Yudina   yudina_at_mail.cir.ru
University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical consortium)
www.cir.ru
2
Plan
  • UIS RUSSIA. General
  • Thesaurus
  • ALTP
  • Bilingual Information Retrieval
  • Text categorization
  • Examples

3
University Information System RUSSIA Collections
1 500,000 / 17.5 GB (www.cir.ru)
4
NLP technology in UIS RUSSIA
[Diagram labels: holdings; converters; Automatic Linguistic Text Processing / Linguistic Processors; file types .POD, .OUT, .LEM, .HDR, .HTM; ORACLE; WEB www.cir.ru (Apache OAS); Administrator]
5
UIS RUSSIA
  • Collections of documents in English
    - RePEc (Research Papers in Economics, www.repec.org) abstracts and full texts
    - Council of Europe/European Court of Human Rights documents
    - UNESCO documents

12
Thesaurus
13
Sociopolitical Thesaurus
29,000 concepts, 75,000 terms, 110,000 conceptual relations
  • constructed specially as a tool for
    automatic text processing
  • contains terms from economic, financial,
    political, military, social, legislative
    and cultural domains
  • a set of relations is adapted to
    information-retrieval applications
  • regularly tested during automatic text
    processing

14
THESAURUS for Information Retrieval in
Sociopolitical Domain
  • The Thesaurus provides for query refinement
    (reformulation/expansion)
  • The terminology of the Thesaurus covers 95-98% of
    the words and terms of Russian government
    publications, academic papers and mass media
    texts from 1991 onward
  • The Thesaurus is the main element of the ALTP
    (automatic linguistic text processing) technology
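
The query expansion mentioned above can be illustrated with a minimal sketch. It assumes a toy thesaurus in which each concept has a list of terms and a list of narrower concepts; the concept names, the `NARROWER` and `TERMS` tables, and the `expand_query` function are hypothetical and are not taken from UIS RUSSIA.

```python
from collections import deque

# Hypothetical toy thesaurus (not the real Sociopolitical Thesaurus).
# NARROWER maps a concept to its narrower concepts; TERMS maps a concept to its terms.
NARROWER = {
    "TAX": ["INCOME TAX", "VALUE-ADDED TAX"],
    "INCOME TAX": [],
    "VALUE-ADDED TAX": [],
}
TERMS = {
    "TAX": ["tax", "taxation"],
    "INCOME TAX": ["income tax"],
    "VALUE-ADDED TAX": ["value-added tax", "VAT"],
}


def expand_query(concept, max_depth=1):
    """Collect the terms of a concept and of its narrower concepts up to max_depth."""
    seen = {concept}
    queue = deque([(concept, 0)])
    terms = []
    while queue:
        current, depth = queue.popleft()
        terms.extend(TERMS.get(current, []))
        if depth < max_depth:
            for child in NARROWER.get(current, []):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return terms


print(expand_query("TAX"))
```

Limiting the depth is one possible way to keep an expanded query from drifting too far from the original concept; the real system may use a different set of relations and rules.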

15
Sociopolitical Thesaurus vs. Legislative
Indexing Vocabulary (LIV)
16
General Structure of Thesaurus
17
Query Refinement
18
Navigation in Thesaurus
19
ALTP
  • Automatic Linguistic Text Processing
    - Conceptual Indexing
    - Automatic Coherent Summarisation
    - Automatic Text Categorisation
20
Term Extraction for Russian Official Documents
(RF Government Regulation N604 26.06.1995)
21
Thematic Lines of Thesaurus Terms (RF
Government Regulation N604 26.06.1995)
22
Network of Thematic Nodes (RF Government
Regulation N604 26.06.1995)
23
Network of Thematic Nodes (RF Government
Regulation N604 26.06.1995)
24
Structure of Thematic Representation
Main Thematic Nodes
Specific Thematic Nodes
25
Structural Thematic Summary (RF Government
Regulation N604 26.06.1995)
26
Bilingual Information Retrieval
27
English-Russian Sociopolitical Thesaurus
  • Hierarchical conceptual net of 65 thousand
    English terms
  • Manual work
  • Use of general and special English-Russian
    dictionaries
  • Study of conventional American and British
    dictionaries and information-retrieval
    thesauri
  • Cross-checking of translations. Addition of
    multiword variants. Internet checks.

28
English-Russian Sociopolitical Thesaurus: testing
and use in new applications
  • Automatic text categorization of economic
    papers and abstracts using JEL subject
    headings (700 categories) (supported by the Ford
    Foundation, USA)
  • Automatic text processing of statistical
    tables (in cooperation with Berkeley
    University, USA)
  • Automatic text processing of European
    documents (European Court of Human Rights,
    Council of Europe, European Union): problems
    of harmonization of Russian legislation

29
Thesaurus Terminology in Sociopolitical Domain
30
Adding languages to Sociopolitical Thesaurus
  • It is a challenge to develop a multilingual
    Sociopolitical thesaurus that describes terms of
    the sociopolitical domain from different languages
    in the same hierarchical net
  • A project to add the Tatar language to the
    bilingual thesaurus is under discussion. Tatars are
    the second-largest ethnic group in Russia

31
Approach to Organization of Bilingual Search in
UIS RUSSIA
  • Development of a bilingual ontology in the
    sociopolitical domain, based on the Russian
    Sociopolitical Thesaurus for automatic text
    processing
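
One way to picture bilingual search over a shared ontology is to index documents by language-independent concept IDs. The sketch below assumes that approach; the `CONCEPTS` table, the concept IDs, the sample documents, and all function names are hypothetical. Plain substring matching stands in for real linguistic processing, so inflected Russian word forms would not be matched here.

```python
# Hypothetical bilingual concept table: one concept ID, terms in both languages.
CONCEPTS = {
    "C1": {"en": ["human rights"], "ru": ["права человека"]},
    "C2": {"en": ["court", "tribunal"], "ru": ["суд"]},
}


def build_term_index(concepts):
    """Map every lowercased term, in any language, to the concept IDs it expresses."""
    index = {}
    for concept_id, terms_by_language in concepts.items():
        for terms in terms_by_language.values():
            for term in terms:
                index.setdefault(term.lower(), set()).add(concept_id)
    return index


def index_text(text, term_index):
    """Return the concept IDs whose terms occur in the text (naive substring match)."""
    lowered = text.lower()
    return {cid for term, cids in term_index.items() if term in lowered for cid in cids}


def search(query, docs, term_index):
    """Return IDs of documents containing every concept found in the query."""
    query_concepts = index_text(query, term_index)
    if not query_concepts:
        return []
    return [doc_id for doc_id, text in docs.items()
            if query_concepts <= index_text(text, term_index)]


docs = {
    "ru1": "Европейский суд и права человека",
    "en1": "The court ruled on human rights",
    "en2": "Tax policy report",
}
term_index = build_term_index(CONCEPTS)
print(search("human rights", docs, term_index))
```

Because the match is made on concept IDs rather than on words, an English query can retrieve the Russian document and vice versa.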

34
Use of Thesaurus in Information Retrieval
applications
  • Flexible knowledge-based categorization
    systems (9 systems)
    - NEW: automatic text categorization of Russian
      legislation (200,000 documents, 3,000 categories)
  • Knowledge-based text summarization system
    - SUMMAC conference
  • Thesaurus-based information retrieval
    - a specially constructed thesaurus can
      significantly improve the efficiency of
      information retrieval (3-point average
      precision)
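
For readers unfamiliar with the metric, here is a minimal sketch of 3-point average precision. It assumes the common convention of averaging interpolated precision at recall levels 0.2, 0.5 and 0.8; the slides do not say which levels were used, and the function names and the sample ranking are hypothetical.

```python
def interpolated_precision(ranked_relevance, total_relevant, recall_level):
    """Highest precision at any rank whose recall is at least recall_level."""
    best = 0.0
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        recall = hits / total_relevant
        if recall >= recall_level:
            best = max(best, hits / rank)
    return best


def three_point_average_precision(ranked_relevance, total_relevant,
                                  levels=(0.2, 0.5, 0.8)):
    """Mean interpolated precision over the given recall levels (assumed 0.2/0.5/0.8)."""
    values = [interpolated_precision(ranked_relevance, total_relevant, level)
              for level in levels]
    return sum(values) / len(values)


# Hypothetical ranking: True marks a relevant document; 3 relevant documents exist.
ranking = [True, False, True, True, False]
print(round(three_point_average_precision(ranking, total_relevant=3), 3))
```

For this ranking the interpolated precision is 1.0 at recall 0.2 and 0.75 at recall 0.5 and 0.8, so the score is about 0.833.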

35
Bilingual Search in UIS RUSSIA
36
www.cir.ru/is4/
37
Text Categorization
40
Text Categorization Using Thematic Representation
Systems of Subject Headings
  • RF Subject Headings System for Legal Acts (RF President
    Decree N511, 15.03.200; 1169 items, 4 levels)
  • RF Central Election Committee Legal Subject Headings
    (450 items, 4 levels)
  • 80 Top Terms of Legislative Indexing Vocabulary (LIV),
    Congressional Research Service of the US Congress
42
Subject Heading as Formulae of Support Concepts
43
Full Representation of Subject Heading (expansion
of support concepts)
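
The slide titles suggest that a subject heading is described by a formula over thesaurus concepts, which is then expanded through the thesaurus. The sketch below is one possible reading, not the system's actual representation: it assumes a heading is a disjunction of conjunctions of support concepts, and that expansion adds each concept's narrower concepts. All concept names, headings, and function names are hypothetical.

```python
# Hypothetical thesaurus fragment: concept -> narrower concepts.
NARROWER = {
    "ELECTION": ["PRESIDENTIAL ELECTION", "PARLIAMENTARY ELECTION"],
    "FINANCING": ["CAMPAIGN FUND"],
}

# Hypothetical subject headings. Each formula is a list of conjunctions;
# the heading matches if any one conjunction is fully satisfied.
HEADINGS = {
    "Election financing": [("ELECTION", "FINANCING")],
    "Electoral disputes": [("ELECTION", "COURT"), ("ELECTION", "COMPLAINT")],
}


def expand(concept):
    """Return the concept together with all of its narrower concepts."""
    result = {concept}
    stack = [concept]
    while stack:
        for child in NARROWER.get(stack.pop(), []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def categorize(doc_concepts, headings=HEADINGS):
    """Return headings whose formula is satisfied by the document's concepts."""
    doc_concepts = set(doc_concepts)
    matched = []
    for heading, formula in headings.items():
        # A support concept is satisfied if it or any narrower concept is present.
        if any(all(expand(c) & doc_concepts for c in conjunction)
               for conjunction in formula):
            matched.append(heading)
    return matched


print(categorize({"PRESIDENTIAL ELECTION", "CAMPAIGN FUND"}))
```

Here a document indexed with PRESIDENTIAL ELECTION and CAMPAIGN FUND falls under "Election financing" even though neither support concept appears in it literally, which is the effect the expansion step is meant to achieve.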
44
Examples
45
http://www.cir.ru/docs/ips/techno/gmtpod/index_e.jsp
46
Results of Text Categorization
Info
47
Known terms
Word/Phrase Sense Disambiguation T_M ok
M not
48
Related Terms for judgment
49
Thematic Summary
Link between two thematic lines:
  • DISCHARGE, ORIGINATE, DISMISSAL, WAGE, LABOR
  • HUNGARY, FORINT, HUNGARIANS, BUDAPEST, STATE, COUNTRY
50
Two thematic lines JUDICIAL TRIAL and HUNGARY
in text
51
Automatic Summary
52
Russian Text English Terms
53
Russian Text English Thematic Summary
54
Bilingual Text Categorization
Support of Subject Heading