1
CERN
European Organization for Nuclear Research
Automatic Keyword Assignment for High Energy Physics Literature
Arturo Montejo Ráez
ETT/SI Data Handling Group, CERN, Geneva (Switzerland)
Joint Research Center, Ispra (Italy), 4 March 2002
2
What we are going to see today...
  • Keyword assignment process
  • Why keywords?
  • How it is done for High Energy Physics papers
  • The HEPindexer project
      • Data
      • Algorithm
      • Experiments
      • Results
  • Future work

Automatic Keywording for HEP literature Ispra
(Italy) 4 March 2002
3
Keyword assignment process
Indexer
Authors
Keyworded papers
4
Keyword assignment process
5
Keyword assignment process
The document...
  • Full text paper
  • Stored in a database
  • Simplified representation needed

6
Keyword assignment process
The thesaurus...
  • Controlled vocabulary of concepts
  • Relationships between keywords
  • Categories and subcategories
  • Can be domain specific
  • Can be translated into multiple languages

7
Keyword assignment process
The thesaurus: a relational model for terms
cheese
  MT   6016 processed agricultural produce
  BT1  milk product
  NT1  blue-veined cheese
  NT1  cow's milk cheese
  NT1  fresh cheese
  NT1  goat's milk cheese
  NT1  hard cheese
  NT1  processed cheese
  NT1  semi-soft cheese
  NT1  sheep's milk cheese
  NT1  soft cheese
  RT   cheese factory (6031)
8
Keyword assignment process
The thesaurus: a subject tree
04 POLITICS
  0406 political framework
  0411 political party
  0416 electoral procedure and voting
  0421 parliament
  0426 parliamentary proceedings
  0431 politics and public safety
  0436 executive power and public service
08 INTERNATIONAL RELATIONS
  0806 international affairs
  0811 cooperation policy
  0816 international balance
  0821 defence
10 EUROPEAN COMMUNITIES
  1006 Community institutions and European civil service
  1011 Community law
  1016 European construction
  1021 Community finance
9
Keyword assignment process
The indexer...
  • An expert in the domain of the documents
  • An expert in the use of the thesaurus
  • A demanding task
  • Does not always propose the same keywords
  • Expensive!

10
Why keywords?
  • Allow documents to be indexed in a coherent way
  • Can be viewed as the "index" at the end of a
    book
  • Concepts that better represent the content
  • Human-made (added value)
  • Meaningful
  • Can establish relations between documents
  • Multilingual

11
Why keywords?
Access to documents
But... we already have fulltext indexing!
12
Why keywords?
  • Classification
  • To store (libraries)
  • To access (narrow searches)

Category 1
Category 2
Category 3
13
Why keywords?
Crosslingual access
Razor?
Navaja
Razor
Couteau
Lametta
14
Why keywords?
Multilingual comparison
Murder
Lametta
Razor
Frabbica
Lametta
Razor
15
Why keywords?
Advantages over fulltext searches
  • No ambiguity
  • Better relevance and precision

More advanced tools for searching and
classification are coming!
16
Why keywords?
The BIG problem...
- E X P E N S I V E -
17
Why keywords?
The BIG problem?
E X P E N S I V E ?
18
Why keywords?
The BIG problem?
E X P E N S I V E ?
19
The CERN
  • The world's largest particle physics centre
  • Explores what matter is made of, and what forces
    hold it together
  • Employs just under 3000 people
  • 6500 scientists come to CERN for their research

20
How it is done for High Energy Physics papers
DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany)
  • DESY thesaurus
  • Group of indexers (students, experts...)
  • Only High Energy Physics related papers

21
How it is done for High Energy Physics papers
The DESY thesaurus
A
a4(2040) ('postulated particle, a4(2040)', was delta(2040))
a6(2450) ('postulated particle, a6(2450)', was delta(2450))
abelian
aberration
absorption
-absorptive model (model, absorption)
accelerator
. . .
B
B B anti-B B BL number
B(5320) (excited B)
-B ('B2...', similar for B/s, etc.)
B2(5732) (postulated particle, B2(5732))
B-
-B-factory (B, particle source)
B-L number
. . .
22
How it is done for High Energy Physics papers
The DESY thesaurus
  • Few categories, rarely used
  • Only two types of keywords:
      • main keywords (1191)
      • secondary keywords (949)
  • No relationships between terms
  • Specific terminology

23
How it is done for High Energy Physics papers
The DESY thesaurus: specific terminology
  • Energy declarations: 1.5-2.7 GeV-cms
  • Resonances: Delta(1232)
  • Reaction equations: anti-p p --> K0 K- pi+
  • Combinations: angular distribution, (photon),
    mass spectrum (pi+ pi- pi0)
  • Two-particle initial state: 'anti-p p',
    'electron positron'

24
How it is done for High Energy Physics papers
The problem
Indexer
Physicists
More than 500 preprints per week!
25
The HEPindexer project
The solution
Physicists
Indexer
Keyworded papers
26
The HEPindexer project
  • Use of IR techniques
  • Objective evaluation
  • Real-time answers
  • Easily portable
  • Fully integrable into CDS
  • Possibility of growing
  • Fully automatic assistance tool

27
The HEPindexer project
Keyword Term
Keyworded papers (collection)
28
The HEPindexer project
Documents
DESY keywords
Keyword Term
29
Data
The HEPindexer project
Training collection: 2,441 documents
Test collection: 1,220 documents
  • 3,661 documents in total
  • 19,143 terms
  • 1,191 main keywords
30
Algorithm
The HEPindexer project
31
Algorithm
The HEPindexer project
Preprocessing
  • Strip punctuation
  • Convert to lower case
  • Remove stop words
  • Stemming
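
The four preprocessing steps listed on this slide can be sketched as a short pipeline. This is a minimal illustration, not the HEPindexer code: the stop list is a tiny invented sample, and `toy_stem` is a crude suffix stripper standing in for a real stemmer, since the slide does not say which stemming algorithm was used.

```python
import re

# Tiny illustrative stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Toy suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the four preprocessing steps and return a list of stems."""
    text = re.sub(r"[^\w\s]", " ", text)                  # 1. punctuation
    tokens = text.lower().split()                         # 2. lower case
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop words
    return [toy_stem(t) for t in tokens]                  # 4. stemming
```

For example, `preprocess("The decays of B mesons are measured.")` yields `['decay', 'b', 'meson', 'measur']`. Physics notation such as `anti-p p` or `Delta(1232)` would be broken apart by step 1 as written here, which is one reason a domain-specific tokenizer may be preferable.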

32
Algorithm
The HEPindexer project
Weight term - document
Weight keyword - document
Weight keyword - term
Similarity keyword - document
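The formulas behind these four quantities did not survive in the transcript, so the sketch below only shows one conventional way such weights could fit together. Everything in it is an assumption: TF-IDF for the term-document weight, a binary keyword-document weight, a keyword-term weight obtained by summing term weights over the training documents that carry the keyword, and cosine similarity between a keyword's term profile and a new document. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_document_weights(docs):
    """TF-IDF weight of each term in each document (assumed scheme)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def keyword_term_weights(doc_weights, doc_keywords):
    """Sum term weights over the training documents carrying each keyword.

    The keyword-document weight is taken as binary: 1 if the keyword was
    assigned to the document, 0 otherwise (an assumption).
    """
    profiles = defaultdict(lambda: defaultdict(float))
    for weights, keywords in zip(doc_weights, doc_keywords):
        for kw in keywords:
            for term, w in weights.items():
                profiles[kw][term] += w
    return profiles


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(new_doc_weights, profiles):
    """Order keywords by similarity to a new document, best first."""
    scores = {kw: cosine(new_doc_weights, p) for kw, p in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With this layout, proposing keywords for an unseen paper amounts to preprocessing it, weighting its terms, and keeping the top-ranked keywords above some cut-off; how the cut-off was chosen in HEPindexer is not stated in the slides.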
33
Experiments
The HEPindexer project
34
Experiments
The HEPindexer project
Keywords in the training collection
A: keywords proposed by DESY
B: keywords proposed by HEPindexer
A ∩ B: keywords proposed by both
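With A as the reference set and B as the system's proposals, the usual information-retrieval measures are precision = |A ∩ B| / |B| and recall = |A ∩ B| / |A|. The sketch below computes both for one document. The slides do not say how scores were averaged over the test collection, and the function name is an assumption.

```python
def precision_recall(reference, proposed):
    """Return (precision, recall) of `proposed` against `reference`."""
    a, b = set(reference), set(proposed)
    common = a & b                                   # A ∩ B
    precision = len(common) / len(b) if b else 0.0   # share of B that is right
    recall = len(common) / len(a) if a else 0.0      # share of A that is found
    return precision, recall
```

As an invented example, if the reference keywords for a paper are `{"quark", "gluon", "jet", "QCD"}` and the system proposes `{"quark", "jet", "neutrino"}`, two of the three proposals are correct (precision 2/3) and two of the four reference keywords are found (recall 1/2).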
35
Results
The HEPindexer project
52.7% precision, 58.5% recall
Response in 2 seconds
36
Results
The HEPindexer project
37
Results
The HEPindexer project
38
Software
The HEPindexer project
  • C++ / STL
  • UNIX
  • Command line interface
  • Digilib Web interface (PHP): http://cern.ch/digilib
  • Installation on the CERN Document Server: http://cds.cern.ch

39
Software
The HEPindexer project
40
Software
The HEPindexer project
41
Software
The HEPindexer project
42
Future Work
  • Automatic proposition of secondary keywords
  • Improve the algorithm (lemmatizer, multiwords, segmentation...)
  • Use of references to link documents based on common concepts
  • Specific algorithms for handling energies, particle decays,
    disintegrations, etc.
  • Agents
  • OAI
  • Apply Semantic Web approaches
