Title: Metadata generation and glossary creation in eLearning
1. Metadata generation and glossary creation in eLearning
- Lothar Lemnitzer
- Review meeting, Zürich, 25 January 2008
2. Outline
- Demonstration of the functionalities
- Where we stand
- Evaluation of tools
- Consequences for the development of the tools in the final phase
3. Demo
- We simulate a tutor who adds a learning object and then generates and edits additional data
4. Where we stand (1)
- Achievements reached in the first year of the project:
  - Annotated corpora of learning objects
  - Stand-alone prototype of the keyword extractor (KWE)
  - Stand-alone prototype of the glossary candidate detector (GCD)
5. Where we stand (2)
- Achievements reached in the second year of the project:
  - Quantitative evaluation of the corpora and tools
  - Validation of the tools in user-centered usage scenarios for all languages
  - Further development of the tools in response to the results of the evaluation
6. Evaluation - rationale
- Quantitative evaluation is needed to
  - inform the further development of the tools (formative)
  - find the optimal settings / parameters for each language (summative)
7. Evaluation (1)
- Evaluation is applied to
  - the corpora of learning objects
  - the keyword extractor
  - the glossary candidate detector
- In the following, I will focus on the tool evaluation
8. Evaluation (2)
- Evaluation of the tools comprises
  - measuring recall and precision against the manual annotation
  - measuring agreement on each task between different annotators
  - measuring acceptance of keywords / definitions (rated on a scale)
9. KWE evaluation - step 1
- A human annotator marked n keywords in document d
- The first n choices of the KWE for document d are extracted
- Measure the overlap between both sets
  - partial matches are also measured
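The overlap measurement above can be sketched as follows. This is a minimal illustration, not the project's actual scoring: in particular, treating a partial match as "one keyword contained in the other" and weighting it 0.5 are assumptions.

```python
def overlap_scores(human_kws, system_kws):
    """Compare a human keyword set against the first n system keywords.

    Exact matches count 1.0; partial matches (one keyword contained in
    the other, e.g. "learning" vs. "e-learning") count 0.5 -- both the
    partial-match criterion and the 0.5 weight are illustrative
    assumptions, not the project's exact definition.
    """
    human = [k.lower() for k in human_kws]
    system = [k.lower() for k in system_kws]
    score = 0.0
    for s in system:
        if s in human:
            score += 1.0                                  # exact match
        elif any(s in h or h in s for h in human):
            score += 0.5                                  # partial match
    precision = score / len(system)
    recall = score / len(human)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

p, r, f = overlap_scores(
    ["metadata", "keyword extraction", "glossary"],
    ["keyword extraction", "metadata", "corpus"],
)
```

With n equal for both sets, precision and recall coincide, and the F-measure reported on the next slide is their harmonic mean.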
10. Best method and F-measure per language

Language     Best method     F-measure
Bulgarian    TFIDF/ADRIDF    0.25
Czech        TFIDF/ADRIDF    0.18
Dutch        TFIDF           0.29
English      ADRIDF          0.33
German       TFIDF           0.16
Polish       ADRIDF          0.26
Portuguese   TFIDF           0.22
Romanian     TFIDF/ADRIDF    0.15
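The TFIDF baseline behind these figures can be sketched roughly as below. This is a toy version with a pre-tokenized corpus; the project's extractor and the ADRIDF variant differ in detail.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, n=5):
    """Rank the tokens of one document by TF*IDF against a background corpus.

    corpus is a list of token lists, one per document. A minimal sketch
    of TFIDF ranking, not the project's actual implementation.
    """
    tf = Counter(doc_tokens)
    num_docs = len(corpus)

    def idf(term):
        # Smoothed inverse document frequency.
        df = sum(1 for d in corpus if term in d)
        return math.log(num_docs / (1 + df))

    ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return ranked[:n]

corpus = [
    ["glossary", "glossary", "tool", "the", "the"],
    ["the", "tool", "metadata"],
    ["the", "keyword"],
]
top = tfidf_keywords(corpus[0], corpus, n=1)
```

Frequent-everywhere words like "the" get a low (here even negative) score, while terms frequent in one document but rare elsewhere rise to the top.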
11. KWE evaluation - step 2
- Measure Inter-Annotator Agreement (IAA)
- Participants read a text (Calimera Multimedia)
- Participants assign keywords to that text (ideally not more than 15)
- The KWE produces keywords for the text
12. KWE evaluation - step 2
- Agreement is measured between human annotators
- Agreement is measured between the KWE and human annotators
- We have tested two measures / approaches:
  - kappa according to Bruce / Wiebe
  - AC1, an alternative agreement weighting suggested by Debra Haley at the OU, based on Gwet
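As a simple stand-in for these measures, the classic two-rater Cohen's kappa is sketched below. The project's actual measures (kappa after Bruce/Wiebe, and Gwet's AC1) correct for skewed label distributions in different ways, so this is only an illustration of the chance-corrected-agreement idea.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Classic two-rater Cohen's kappa over parallel label sequences.

    For keyword agreement, the labels could be per-candidate-term
    decisions such as "keyword" / "not keyword" (an assumption about
    how the task is cast, not the project's exact setup).
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent raters with these marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two raters judging four candidate terms:
kappa = cohen_kappa(["kw", "kw", "no", "no"], ["kw", "no", "no", "no"])
```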
13. Inter-annotator agreement per language

Language     IAA (human annotators)    IAA of KWE with best settings
Bulgarian    0.63                      0.99
Czech        0.71                      0.78
Dutch        0.67                      0.72
English      0.62                      0.82
German       0.64                      0.63
Polish       0.63                      0.67
Portuguese   0.58                      0.67
Romanian     0.59                      0.61
14. KWE evaluation - step 3
- Humans judge the adequacy of keywords
- Participants read a text (Calimera Multimedia)
- Participants see 20 keywords generated by the KWE and rate them
- Scale 1-4 (1 = excellent, 4 = not acceptable)
- 5 = not sure
15. Average keyword ratings

Language     All 20 kw    First 5 kw    First 10 kw
Bulgarian    2.21         2.54          2.12
Czech        2.22         1.96          1.96
Dutch        1.93         1.68          1.64
English      2.15         2.52          2.22
German       2.06         1.96          1.96
Polish       1.95         2.06          2.1
Portuguese   2.34         2.08          1.94
Romanian     2.14         1.8           2.06
16. GCD evaluation - step 1
- A human annotator marked definitions in document d
- The GCD extracts defining contexts from the same document d
- Measure the overlap between both sets
  - overlap is measured on the sentence level; partial overlap counts
17. Is-definitions: recall and precision

Language     Recall    Precision
Bulgarian    0.64      0.18
Czech        0.48      0.29
Dutch        0.92      0.21
English      0.58      0.17
German       0.55      0.37
Polish       0.74      0.22
Portuguese   0.69      0.30
Romanian     1.0       0.53
18. GCD evaluation - step 2
- Measure Inter-Annotator Agreement
- Experiments run for Polish and Dutch
- A prevalence-adjusted version of kappa is used as the measure
- Polish: 0.42; Dutch: 0.44
- IAA is rather low for this task
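One common prevalence-adjusted variant, PABAK, reduces for two raters and two categories to a simple function of the observed agreement p_o. Whether this is exactly the variant used in the experiments is an assumption; the sketch only shows the shape of the adjustment.

```python
def pabak(p_o):
    """Prevalence- and bias-adjusted kappa, two raters, two categories.

    With prevalence and bias adjusted away, expected chance agreement
    is fixed at 0.5, so PABAK = (p_o - 0.5) / (1 - 0.5) = 2 * p_o - 1.
    (Assumed variant -- the slides only say "prevalence-adjusted kappa".)
    """
    return 2 * p_o - 1

# E.g. an observed agreement of 0.72 yields a PABAK of 0.44.
value = pabak(0.72)
```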
19. GCD evaluation - step 3
- Judging the quality of extracted definitions
- Participants read a text
- Participants get the definitions extracted by the GCD for that text and rate their quality
- Scale 1-4 (1 = excellent, 4 = not acceptable)
- 5 = not sure
20. Definition ratings

Language     # definitions    # testers    Average rating
Bulgarian    25               7            2.7
Czech        24               6            3.1
Dutch        14               6            2.8
English      10               4            3.3
German       5                5            2.1
Polish       11               5            2.7
Portuguese   36               6            2.2
Romanian     9                7            3.0
21. GCD evaluation - step 3
- Further findings:
  - relatively high variance (many ratings of 1 and 4)
  - disagreement between users about the quality of individual definitions
22. Individual user feedback - KWE
- The quality of the generated keywords remains an issue
- There is variance in the responses from the different language groups
- We suspect a correlation between the language of the users and their satisfaction
- The performance of the KWE relies on language-specific settings, which we have to investigate further
23. Individual user feedback - GCD
- Not all the suggested definitions are real definitions
- The terms are OK, but the definitions cited are often not what would be expected
- Some terms proposed in the glossary did not make any sense
- The ability to see the context where a definition has been found is useful
24. Consequences - KWE
- Use non-distributional information to rank keywords (layout, lexical chains)
- Present the first 10 keywords to the user; more keywords on demand
- For keyphrases, present the most frequent attested form
- Users can add their own keywords
25. Consequences - GCD
- Split definitions into types and tackle the most important types
- Use machine learning alongside local grammars
- Look into the part of the grammars which extracts the defined term
- Users can add their own definitions
26. Plans for the final phase
- KWE: work with lexical chains
- GCD: extend the ML experiments
- Finalize the documentation of the tools
27. Validation
- User scenarios with the NLP tools embedded:
  - A content provider adds keywords and a glossary for a new learning object
  - A student uses keywords and definitions extracted from a learning object to prepare a presentation of the content of that learning object
28. Validation
- Students use keywords and definitions extracted from a learning object to prepare a quiz / exam about the content of that learning object
29. Validation
- We want to get feedback about
  - the users' general attitude towards the tools
  - the users' satisfaction with the results obtained by the tools in the particular situation of use (scenario)
30. User feedback
- Participants appreciate the option to add their own data
- Participants found it easy to use the functions
31. Plans for the next phase
- Improve the precision of the extraction results
- KWE: implement a lexical chainer
- GCD: use machine learning in combination with local grammars, or substituting for these grammars
- Finalize the documentation of the tools
32. Corpus statistics - full corpus
- Measuring the sizes of the corpora (# of documents, # of tokens)
- Measuring the token / type ratio
- Measuring the type / lemma ratio
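These ratios can be computed as sketched below. The whitespace-free token list and the crude suffix-stripping lemmatizer are stand-ins; in the project, tokens and lemmas come from the linguistically annotated corpora.

```python
def corpus_ratios(tokens, lemma_of):
    """Token/type and type/lemma ratios for a tokenized corpus.

    tokens: flat list of word tokens for the whole corpus.
    lemma_of: token -> lemma mapping (hypothetical here; in the project
    the lemmas are taken from the corpus annotation).
    """
    types = set(tokens)
    lemmas = {lemma_of(t) for t in types}
    return len(tokens) / len(types), len(types) / len(lemmas)

# Toy corpus with a toy "strip final s" lemmatizer:
tokens = ["tools", "tool", "tools", "corpus"]
tok_per_type, type_per_lemma = corpus_ratios(tokens, lambda t: t.rstrip("s"))
```

A high token/type ratio means each word form recurs often (good for distributional statistics); a high type/lemma ratio means many inflected forms per lemma.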
33. Corpus sizes

Language     # documents    # tokens
Bulgarian    55             218900
Czech        1343           962103
Dutch        77             505779
English      125            1449658
German       36             265837
Polish       35             299071
Portuguese   29             244702
Romanian     69             484689
34. Token/type and type/lemma ratios

Language     Tokens per type    Types per lemma
Bulgarian    9.65               2.78
Czech        18.37              1.86
Dutch        14.18              1.15
English      34.93              2.8 (tbc)
German       8.76               1.38
Polish       7.46               1.78
Portuguese   12.27              1.42
Romanian     12.43              1.54
35. Corpus statistics - full corpus
- The Bulgarian, German and Polish corpora have a very low number of tokens per type (probably problems with sparseness)
- English has by far the highest ratio
- Czech, Dutch, Portuguese and Romanian are in between
- The type / lemma ratio reflects the richness of the inflectional paradigms
36. To do
- Please check / verify these numbers
- Report, for the M24 deliverable, on improvements / re-analyses of the corpora (I am aware of such activities for Bulgarian, German, and English)
37. Corpus statistics - annotated subcorpus
- Measuring the lengths of the annotated documents
- Measuring the distribution of manually marked keywords over documents
- Measuring the share of keyphrases
38. Annotated subcorpus sizes

Language     # annotated documents    Average length (# of tokens)
Bulgarian    55                       3980
Czech        465                      672
Dutch        72                       6912
English      36                       9707
German       34                       8201
Polish       25                       4432
Portuguese   29                       8438
Romanian     41                       3375
39. Keyword counts

Language     # keywords    Average # keywords per doc.
Bulgarian    3236          77
Czech        1640          3.5
Dutch        1706          24
English      1174          26
German       1344          39.5
Polish       1033          41
Portuguese   997           34
Romanian     2555          62
40. Share of keyphrases

Language     Keyphrases (%)
Bulgarian    43
Czech        27
Dutch        25
English      62
German       10
Polish       67
Portuguese   14
Romanian     30