Language Technologies 1

1
Language Technologies (1)
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
  • Diana Maynard
  • University of Sheffield, UK

2
Text mining and the Semantic Web
3
What is Text Mining?
  • Text mining is about knowledge discovery from
    large collections of unstructured text.
  • It's not the same as data mining, which is more
    about discovering patterns in structured data
    stored in databases.
  • Similar techniques are sometimes used; however,
    text mining has many additional constraints
    caused by the unstructured nature of the text and
    the use of natural language.
  • Information extraction (IE) is a major component
    of text mining.
  • IE is about extracting facts and structured
    information from unstructured text.

4
Challenge of the Semantic Web
  • The Semantic Web requires machine processable,
    repurposable data to complement hypertext
  • Once metadata is attached to documents, they
    become much more useful and more easily
    processable, e.g. for categorising, finding
    relevant information, and monitoring
  • Such metadata can be divided into two types of
    information: explicit and implicit.

5
Metadata extraction
  • Explicit metadata extraction involves information
    describing the document, such as that contained
    in the header information of HTML documents
    (titles, abstracts, authors, creation date, etc.)
  • Implicit metadata extraction involves semantic
    information deduced from the material itself,
    i.e. endogenous information such as names of
    entities and relations contained in the text.
    This essentially involves Information Extraction
    techniques, often with the help of an ontology.

6
Motivation
  • Implicit or semantic metadata extraction and
    annotation is the glue that ties ontologies into
    document spaces
  • Metadata is the link between knowledge and its
    management
  • Manual metadata production cost is too high
  • The state of the art in automatic annotation needs
    extending to target ontologies and scaling to
    industrial document stores and the web

7
Information Extraction (IE)
8
IE is not IR
IR pulls documents from large text collections
(usually the Web) in response to specific
keywords or queries. You analyse the documents.
IE pulls facts and structured information from
the content of large text collections. You
analyse the facts.
9
IE for Document Access
  • With traditional query engines, getting the facts
    can be hard and slow
  • Where has the Queen visited in the last year?
  • Which places on the East Coast of the US have
    had cases of West Nile Virus?
  • Which search terms would you use to get this kind
    of information?
  • How can you specify you want someone's home page?
  • IE returns information in a structured way
  • IR returns documents containing the relevant
    information somewhere (if you're lucky)

10
IE as an alternative to IR
  • IE returns knowledge at a much deeper level than
    traditional IR
  • Constructing a database through IE and linking it
    back to the documents can provide a valuable
    alternative search tool.
  • Even if results are not always accurate, they can
    be valuable if linked back to the original text

11
Some example applications
  • HaSIE
  • KIM
  • Threat Trackers

12
HaSIE
  • Application developed by University of Sheffield,
    which aims to find out how companies report about
    health and safety information
  • Answers questions such as:
  • How many members of staff died or had accidents
    in the last year?
  • Is there anyone responsible for health and
    safety?
  • What measures have been put in place to improve
    health and safety in the workplace?

13
HaSIE
  • Identification of such information is too
    time-consuming and arduous to be done manually
  • IR systems can't cope with this because they
    return whole documents, which could be hundreds
    of pages
  • System identifies relevant sections of each
    document, pulls out sentences about health and
    safety issues, and populates a database with
    relevant information

14
HaSIE
15
KIM
  • KIM is a software platform developed by Ontotext
    for semantic annotation of text.
  • KIM performs automatic ontology population and
    semantic annotation for Semantic Web and KM
    applications
  • Indexing and retrieval (an IE-enhanced search
    technology)
  • Query and exploration of formal knowledge

16
KIM
Ontotext's KIM query and results
17
Threat tracker
  • Application developed by Alias-I which finds and
    relates information in documents
  • Intended for use by Information Analysts who use
    unstructured news feeds and standing collections
    as sources
  • Used by DARPA for tracking possible information
    about terrorists etc.
  • Identification of entities, aliases, relations
    etc. enables you to build up chains of related
    people and things

18
Threat tracker
19
Named Entity Recognition: the cornerstone of IE
  • Identification of proper names in texts, and
    their classification into a set of predefined
    categories of interest
  • Persons
  • Organisations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions
  • Various other types as appropriate

20
Why is NE important?
  • NE provides a foundation from which to build more
    complex IE systems
  • Relations between NEs can provide tracking,
    ontological information and scenario building
  • Tracking (co-reference): Dr Head, John, he
  • Ontologies: Manchester, CT
  • Scenario: Dr Head became the new director of
    Shiny Rockets Corp

21
Two kinds of approaches
  • Knowledge Engineering
  • rule based
  • developed by experienced language engineers
  • make use of human intuition
  • require only small amount of training data
  • development can be very time consuming
  • some changes may be hard to accommodate
  • Learning Systems
  • use statistics or other machine learning
  • developers do not need LE expertise
  • require large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus

22
Typical NE pipeline
  • Pre-processing (tokenisation, sentence splitting,
    morphological analysis, POS tagging)
  • Entity finding (gazetteer lookup, NE grammars)
  • Coreference (alias finding, orthographic
    coreference etc.)
  • Export to database / XML

23
An example: GATE
  • GATE (General Architecture for Text
    Engineering) is a framework for language
    processing
  • GATE also includes
  • plugins for language processing, e.g. parsers,
    machine learning tools, stemmers, IR tools, IE
    components for various languages...
  • tools for visualising and manipulating ontologies
  • ontology-based information extraction tools
  • evaluation and benchmarking tools

24
GATE Users
  • American National Corpus project
  • Perseus Digital Library project, Tufts
    University, US
  • Longman Pearson publishing, UK
  • Merck KGaA, Germany
  • Canon Europe, UK
  • Knight Ridder, US
  • BBN (leading HLT research lab), US
  • SMEs: Melandra, SG-MediaStyle, ...
  • a large number of other UK, US and EU
    Universities
  • UK and EU projects inc. SEKT, PrestoSpace,
    KnowledgeWeb, MyGrid, CLEF, Dot.Kom, AMITIES,
    CubReporter,

25
Past Projects using GATE
  • MUMIS: conceptual indexing; automatic semantic
    indices for sports video
  • MUSE: multi-genre multilingual IE
  • HSL: IE in domain of health and safety
  • Old Bailey: IE on 17th century court reports
  • Multiflora: plant taxonomy text analysis for
    biodiversity research in e-science
  • EMILLE: creation of S. Asian language corpus
  • ACE / TIDES: IE competitions and collaborations
    in English, Chinese, Arabic, Hindi
  • h-TechSight: ontology-based IE and text mining

26
Current projects using GATE
  • ETCSL: language tools for Sumerian digital
    library
  • SEKT: Semantic Knowledge Technologies
  • PrestoSpace: Preservation of audiovisual data
  • KnowledgeWeb: Semantic Web network of excellence
  • SWAN: Large-scale semantic annotation
  • LIRICS: Linguistic Infrastructure for
    Interoperable Resources and Systems

27
Architectural principles of GATE
  • Non-prescriptive, theory neutral (strength and
    weakness)
  • Re-use, interoperation, not reimplementation
    (e.g. diverse XML support, integration of
    Protégé, Jena, Weka...)
  • (Almost) everything is a component, and component
    sets are user-extendable
  • (Almost) all operations are available both from
    API and GUI

28
GATE
29
Information Extraction for the Semantic Web
  • Traditional IE is based on a flat structure, e.g.
    recognising Person, Location, Organisation, Date,
    Time etc.
  • For the Semantic Web, we need information in a
    hierarchical structure
  • Idea is that we attach semantic metadata to the
    documents, pointing to concepts in an ontology
  • Information can be exported as an ontology
    annotated with instances, or as text annotated
    with links to the ontology

30
Richer NE Tagging
  • Attachment of instances in the text to concepts
    in the domain ontology
  • Disambiguation of instances, e.g. Cambridge, MA
    vs Cambridge, UK

31
Another example: Magpie
  • Developed by the Open University
  • Plugin for standard web browser
  • Automatically associates an ontology-based
    semantic layer to web resources, allowing
    relevant services to be linked
  • Provides means for a structured and informed
    exploration of the web resources
  • e.g. looking at a list of publications, we can
    find information about an author such as projects
    they work on, other people they work with, etc.

32
MAGPIE in action
33
MAGPIE in action
34
Evaluation
35
Evaluation metrics and tools
  • Evaluation metrics mathematically define how to
    measure the system's performance against a
    human-annotated gold standard
  • Scoring program implements the metric and
    provides performance measures
  • for each document and over the entire corpus
  • for each type of NE
  • may also evaluate changes over time
  • A gold standard reference set also needs to be
    provided; this may be time-consuming to produce
  • Visualisation tools show the results graphically
    and enable easy comparison

36
Methods of evaluation
  • Traditional IE is evaluated in terms of Precision
    and Recall
  • Precision - how accurate were the answers the
    system produced?
  • Precision = correct answers / answers produced
  • Recall - how good was the system at finding
    everything it should have found?
  • Recall = correct answers / total possible
    correct answers
  • There is usually a tradeoff between precision and
    recall, so a weighted harmonic mean of the two
    (F-measure) is generally also used.
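
As a worked illustration, the three measures can be computed from raw counts as below. This is a sketch: the function name and the `beta` weighting parameter are our own choices, and the F-measure uses the standard weighted harmonic mean form, with `beta=1` weighting precision and recall equally.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Return (precision, recall, F-measure) from raw answer counts.

    correct  -- answers produced by the system that match the gold standard
    produced -- all answers produced by the system
    possible -- all answers in the gold standard
    beta     -- values above 1 favour recall, below 1 favour precision
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Illustrative counts only: 8 correct out of 10 produced, 16 in the gold standard.
    p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
    print(f"P={p:.2f} R={r:.2f} F1={f:.3f}")
```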

37
GATE AnnotationDiff Tool
38
Metrics for Richer IE
  • Precision and Recall are not sufficient for
    ontology-based IE, because the distinction
    between right and wrong is less obvious
  • Recognising a Person as a Location is clearly
    wrong, but recognising a Research Assistant as a
    Lecturer is not so wrong
  • Similarity metrics need to be integrated so that
    items closer together in the hierarchy are given
    a higher score, if wrong
  • Also possible is a cost-based approach, where
    different weights can be given to each concept in
    the hierarchy, and to different types of error,
    and combined to form a single score

39
Learning Accuracy
  • LA [Hahn98] was originally defined to measure how
    well a concept had been added in the right level
    of the ontology
  • LA measures the degree to which the system
    correctly predicts the concept class which
    subsumes the target concept to be learned.
  • Used by [Cimiano et al. 2003] to measure how well
    the instance has been added in the right place in
    the ontology.

40
Learning Accuracy Metric
  • SP = the shortest length from root to the key
    concept
  • FP = shortest length from root to the predicted
    concept. If the predicted concept is correct,
    then FP = 0, i.e. FP is only considered in the
    case that the answer given by the system is
    wrong.
  • CP = shortest length from root to the MSCA (the
    lowest concept common to SP and FP paths)
  • DP = shortest length from MSCA to predicted
    concept
  • If predicted concept is correct, i.e. if FP = 0,
    then LA = CP / SP = 1
  • If predicted concept is incorrect,
    LA = CP / (FP + DP)
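
A small sketch of how LA could be computed over a toy taxonomy follows. Several things here are assumptions rather than part of the slides: the ontology is a single-rooted tree stored as a child-to-parent map, "length" is counted in edges, and the concept names are made up. The incorrect case uses LA = CP / (FP + DP).

```python
# Hypothetical toy taxonomy, stored as child -> parent (single root: "Entity").
PARENT = {
    "Person": "Entity",
    "Location": "Entity",
    "Employee": "Person",
    "Lecturer": "Employee",
    "ResearchAssistant": "Employee",
}


def path_from_root(concept, parent):
    """Return the list of concepts from the root down to `concept`."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def learning_accuracy(key, predicted, parent):
    """Learning Accuracy of `predicted` against `key` (lengths in edges)."""
    if predicted == key:
        return 1.0  # correct answer: LA = CP / SP = 1
    key_path = path_from_root(key, parent)
    pred_path = path_from_root(predicted, parent)
    # Count the shared prefix of the two paths; its last node is the MSCA.
    common = 0
    for a, b in zip(key_path, pred_path):
        if a != b:
            break
        common += 1
    cp = common - 1          # root to MSCA
    fp = len(pred_path) - 1  # root to predicted concept
    dp = fp - cp             # MSCA to predicted concept
    if fp + dp == 0:         # predicted concept is the root itself
        return 0.0
    return cp / (fp + dp)


if __name__ == "__main__":
    print(learning_accuracy("Lecturer", "ResearchAssistant", PARENT))  # 0.5
    print(learning_accuracy("Lecturer", "Location", PARENT))           # 0.0
```

With these assumptions, confusing two sibling job roles scores 0.5, while confusing a Lecturer with a Location scores 0 because the only shared ancestor is the root.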

41
Problems with LA
  • LA doesn't consider the height of the Key
    concept, which means that however far away the
    Key is from the MSCA, the score is the same
  • It also means that similarity is not
    bidirectional, which is intuitively wrong
  • We propose an alternative to LA, known as BDM
    (Balanced Distance Metric) which takes this into
    account

42
BDM
  • MSCA: most specific concept common to Key and
    Response
  • CP: distance from root to MSCA
  • DPR: distance from MSCA to Response concept
  • DPK: distance from MSCA to Key concept
  • Each one is normalised with respect to the average
    length of chain in which Key and Response occur
  • This makes the penalty in terms of node traversal
    relative to the semantic density of the concepts
    in question

43
BDM - normalisations
  • n1: average length of the set of chains
    containing the key or the response concept,
    computed from the root concept.
  • n2: average length of all the chains containing
    the key concept, computed from the root concept.
  • n3: average length of all the chains containing
    the response concept, computed from the root
    concept.

44
BDM: the metric
  • BDM is calculated for all correct and partially
    correct responses

  • CP: distance from root to MSCA
  • DPK: distance from MSCA to Key
  • DPR: distance from MSCA to Response
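
The formula itself did not survive in this transcript. The sketch below assumes the form BDM = (CP/n1) / (CP/n1 + DPK/n2 + DPR/n3), which uses exactly the quantities and normalisations defined above but should be checked against the original slides before being relied on. The function name and argument order are our own.

```python
def bdm(cp, dpk, dpr, n1, n2, n3):
    """Balanced Distance Metric under the assumed formula.

    cp  -- distance from root to MSCA
    dpk -- distance from MSCA to Key
    dpr -- distance from MSCA to Response
    n1, n2, n3 -- average chain lengths used to normalise cp, dpk and dpr
    """
    if cp == 0:
        return 0.0  # Key and Response share nothing below the root
    normalised_cp = cp / n1
    return normalised_cp / (normalised_cp + dpk / n2 + dpr / n3)


if __name__ == "__main__":
    # Illustrative values only, not taken from any real ontology.
    print(bdm(cp=2, dpk=0, dpr=0, n1=4, n2=4, n3=4))  # exact match
    print(bdm(cp=2, dpk=1, dpr=1, n1=4, n2=4, n3=4))  # sibling concepts
```

Under this assumed form an exact match scores 1, and when n2 and n3 are equal, swapping the Key and Response distances leaves the score unchanged.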
45
BDM observations
  • BDM considers the relative specificity of the
    taxonomic positions of the key and response
  • It does not distinguish between the
    directionality of this relative specificity,
    however.
  • For instance, the key can be a specific concept
    (e.g. 'car') and the response can be a general
    concept (e.g. 'relation'), or vice versa.
  • Either way, the score will be the same.

46
Augmented Precision and Recall
BDM is integrated with traditional Precision and
Recall in the following way
47
Creating a gold standard corpus
  • OntoNews corpus: 292 news articles from 3 news
    agencies (Guardian, Financial Times, Independent)
  • 3 topics: international politics, UK politics and
    business.
  • covers August to October 2001
  • Corpus annotated manually with respect to the
    KIMO ontology

48
KIMO: a reference ontology
  • KIMO is an earlier version of the Proton ontology,
    created by Ontotext within the scope of the KIM
    platform
  • http://proton.semanticweb.org
  • Contains around 250 classes and 100 relations
  • Domain-independent and modular (comprises top
    ontology and more specific lower ontology)

49
Annotating OntoNews
  • Annotation set covers range of levels and types
    of semantic annotation
  • Decomposable into subsets that constitute 3 types
    of ontologies
  • Named entities
  • Top ontology (20 high level concepts)
  • Common nouns
  • Coverage is significantly greater than previous
    initiatives, e.g. MUC, ACE

51
Tools for semantic annotation
  • Semi-automatic
  • MnM
  • S-CREAM/OntoMat
  • Automatic
  • SemTag
  • KIM
  • h-TechSight

52
MnM
  • Semi-automatic in that it requires initial
    training by user
  • Uses pre-defined set of concepts in ontology
  • User browses web and manually annotates his
    chosen pages
  • System learns annotation rules, tests them, and
    takes over annotation, populating ontologies with
    the instances found
  • Precision and recall are not perfect; however,
    retraining is possible at any stage

53
S-CREAM
  • Semi-automatic CREAtion of Metadata
  • Uses Ont-O-Mat + Amilcare
  • Trainable for different domains
  • Aligns conceptual markup (which defines
    relational metadata) provided by e.g. Ont-O-Mat
    with semantic markup provided by Amilcare

54
Annotated data in S-CREAM
55
Amilcare
  • Amilcare learns IE rules from pre-annotated data
    (e.g. using Ont-O-Mat)
  • Uses GATE (ANNIE) for pre-processing; applies
    rules learnt in training phase to new documents
  • Concepts need to be pre-defined, but system can
    be trained for new domain
  • Can be tuned towards precision or recall

56
Automatic methods
  • SemTag
  • KIM
  • h-TechSight

57
SemTag and KIM
  • SemTag and KIM both annotate webpages using
    instances from an ontology
  • Main problem is to disambiguate such instances
    which occur in multiple parts of the ontology
  • SemTag aims for accuracy of classification,
    whereas KIM aims more for recall (finding all
    instances)
  • KIM also uses IE to find new instances not
    present in ontology

58
SemTag
  • Automated semantic tagging of large corpora,
    using TAP ontology (contains 65K instances)
  • Largest scale semantic tagging effort to date
  • Uses concept of Semantic Label Bureau
  • Annotations are stored separately from web pages
    (standoff markup)
  • Uses corpus-wide statistics to improve quality of
    tagging, e.g. automated alias discovery
  • Tags can be extracted using a variety of
    mechanisms, e.g. search for all tags matching a
    particular object
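
Standoff markup can be illustrated with a short sketch: annotations live in a separate store and point back into the page by URL and character offsets, so the page itself is never modified. The class names, the example URL and the concept identifiers are hypothetical; this is not SemTag's actual storage format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    doc_url: str   # which page the annotation refers to
    start: int     # character offset where the mention begins
    end: int       # character offset just past the mention
    concept: str   # identifier of the ontology object (hypothetical IDs)


class LabelStore:
    """Keeps annotations apart from the documents they describe."""

    def __init__(self):
        self._annotations = []

    def add(self, annotation):
        self._annotations.append(annotation)

    def find_by_concept(self, concept):
        """Return all tags that point at a particular ontology object."""
        return [a for a in self._annotations if a.concept == concept]


if __name__ == "__main__":
    url = "http://example.org/page.html"  # placeholder URL
    text = "Diana Maynard works in Sheffield."
    store = LabelStore()
    for mention, concept in [("Diana Maynard", "person/DianaMaynard"),
                             ("Sheffield", "city/Sheffield")]:
        start = text.index(mention)
        store.add(Annotation(url, start, start + len(mention), concept))
    for a in store.find_by_concept("city/Sheffield"):
        print(a.doc_url, text[a.start:a.end])
```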

59
SemTag Architecture
60
KIM
  • Uses an ontology (KIMO) with 86K/200K instances
  • Lookup phase marks instances from the ontology
  • Disambiguation uses an Entity Ranking algorithm,
    i.e., priority ordering of entities with the same
    label based on corpus statistics
  • Lookup is combined with rule-based IE system
    (from GATE) to recognise new instances of
    concepts and relations
  • Special KB enrichment stage where some of these
    new instances are added to the KB

61
KIM (2)
62
h-TechSight KMP
  • Knowledge management platform for fully automatic
    metadata creation and ontology population, and
    semi-automatic ontology evolution, powered by
    GATE and ToolBox.
  • Data-driven analysis of ontologies enables trends
    of instances to be monitored
  • Uses GATE to support the instance-based evolution
    of ontologies in the Chemical Engineering domain.
  • Analysis of unrestricted text to extract
    instances of concepts from such ontologies

64
Ontology-based IE in h-TechSight
  • Ontology-Based IE for semantic tagging of job
    adverts, news and reports in chemical engineering
    domain
  • Semantic tagging used as input for ontological
    analysis
  • Terminological gazetteer lists are linked to
    classes in the ontology
  • Rules classify the mentions in the text wrt the
    domain ontology
  • Annotations output into a database or as an
    ontology

65
Limitations of h-TechSight
  • h-TechSight uses a rule-based IE system
  • Requires a human expert to write rules
  • Accurate on restricted domains with small
    ontologies
  • Adaptation to a new domain / ontology may require
    some effort

66
Summary of Semantic Annotation Tools
  • Tradeoff between semi-automatic and fully
    automatic systems, dependent on application,
    corpus size etc
  • Tradeoff between rule-based and ML techniques for
    IE
  • Tradeoff between dynamic vs static systems

67
Summary
  • Introduction to Human Language Technologies and
    how they can be used to enhance the development
    of the Semantic Web
  • Focused on text mining and information extraction
    techniques
  • Examples of different state-of-the-art applications
  • Examined development of traditional methods to
    encompass ontologies
  • New techniques for evaluation

68
Human Language Technologies Part 2
  • Part 2 of this tutorial will focus in detail on
    some new developments in adapting traditional HLT
    methods for the Semantic Web
  • Mixed Initiative Information Extraction: extends
    capabilities of traditional OBIE
  • RichNews: aims at automating annotation of
    multimedia news programs