Title: Literature Extraction: Entities and Relations
1 Literature Extraction: Entities and Relations
- ChengXiang (Cheng) Zhai
- Department of Computer Science
- Institute for Genomic Biology
- Statistics
- Graduate School of Library and Information Science
- University of Illinois at Urbana-Champaign
ABC Workshop, UIUC, Dec. 5-6, 2007
2 Outline
- General Background on Text Information Management
- Information Extraction: Entities and Relations
- Towards Automated Gene Annotation
3 Text Management Technologies
- Access: select information
- Mining: create knowledge
- Organization: add structure/annotations
4 Text Information Access
- Search: works well if you know exactly what you want (e.g., PubMed)
- Navigation: more useful for exploring the information space or when you can't formulate a good query
- Recommendation: push information to a user
5 Text Information Organization
- Summarization
  - Single document vs. multiple documents, unstructured vs. structured
  - Helps digest information
- Categorization
  - Classify text into predefined categories (e.g., different GO terms)
  - Adds structure to text; helps browsing and direct prediction
- Clustering
  - Group similar text segments or articles into the same cluster
  - Helps reveal underlying structures
6 Text Mining
- Information Extraction
  - Pulling out entities (e.g., genes and proteins) and relations (e.g., protein interactions)
  - Helps semantic analysis and inferences
- Topic Extraction
  - Tease out subtopics/themes in text (e.g., separate multiple subtopics in the same article)
  - Helps summarization and browsing
- Inferences
  - Attempts to create new knowledge (e.g., Gene A has function F, and Gene A and Gene B are similar => Gene B has function F)
- Pattern Discovery
  - Discover outliers, correlations/associations, clusters, trends (e.g., gene A and gene B tend to occur in similar contexts)
  - Helps create new knowledge
7 Key Technique: Information Retrieval
- Text/topic representation: vector-space, probabilistic models
- Term weighting: TF-IDF
- Text matching (similarity): vector/distribution similarity functions
- User feedback (learning about what a user is interested in): machine learning
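The vector-space representation, TF-IDF weighting, and similarity matching listed above can be sketched together in a few lines. This is a minimal illustration, not the method of any particular system: the whitespace tokenizer, the smoothed IDF formula, the choice to include the query when counting document frequencies, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
        vectors.append({
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    # The query is weighted together with the documents for simplicity.
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, reverse=True)

# Toy documents, invented for illustration.
docs = [
    "gene expression in the honey bee brain",
    "protein interaction networks in yeast",
    "foraging behavior and gene expression in bees",
]
ranking = rank("bee gene expression", docs)
```

The first document shares all three query terms and ranks first; the second shares none and scores zero. Note that "bees" does not match "bee" here, which is the kind of gap that stemming or a probabilistic model would address.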
8 Key Technique: Machine Learning
- Computation: Y = F(X)
  - Knowledge-based: specify a recipe (program)
  - Data-driven: provide examples of (X, Y) pairs
- Basic setup
  - Given training data (many (X, Y) pairs)
  - Assume some kind of functional relation between X and Y, often with parameters
  - Fit the function to the training data to set the parameters
  - Hope the learned function can compute Y for new X
- Generally requires many training examples (supervised learning)
- May also work without training examples (clustering)
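The basic setup above (assume a parameterized function, fit it to (X, Y) pairs, apply it to new X) can be illustrated with the simplest possible case. The sketch assumes a linear relation Y = a*X + b fitted by least squares; the function names and the four training pairs are invented for the example.

```python
def fit_line(pairs):
    """Least-squares fit of y = a * x + b to a list of (x, y) training pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Closed-form solution: slope = covariance(x, y) / variance(x).
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict(params, x):
    """Apply the learned function to a new input."""
    a, b = params
    return a * x + b

# Training data generated from y = 2x + 1 (toy values).
training = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
params = fit_line(training)   # sets the parameters a and b
y_new = predict(params, 10.0) # computes Y for an unseen X
```

Real text-mining models have many more parameters and noisier data, but the three steps (assume, fit, predict) are the same.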
9 Key Technique: Natural Language Processing
- Basic tasks
  - Part-of-speech tagging (recognizing nouns, verbs, ...)
  - Syntactic parsing (recognizing sentence structure)
  - Semantic analysis (trying to get the meaning)
- State-of-the-art methods rely on machine learning supplemented by some limited linguistic knowledge
- Generally a very difficult task, but easier for a specific domain such as biomedical literature
10 Massive Entity Recognition
- Class 1: Small Variation (Dictionary/Ontology)
  - Organism, Anatomy, Biological Process, Pathway, Protein Family
- Class 2: Medium Variation
  - Gene, cis-Regulatory Element
- Class 3: Large Variation
  - Phenotype, Behavior
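For the small-variation class, a dictionary lookup is the natural baseline. The sketch below tags mentions by greedy longest match against a dictionary; the five dictionary entries, the type labels, and the example sentence are hypothetical, and a real system would load its entries from an ontology or curated name list.

```python
import re

# Hypothetical toy dictionary: lowercased surface form -> entity type.
DICTIONARY = {
    "apis mellifera": "Organism",
    "drosophila melanogaster": "Organism",
    "mushroom body": "Anatomy",
    "brain": "Anatomy",
    "wnt signaling pathway": "Pathway",
}
# Longest dictionary entry, measured in tokens.
MAX_LEN = max(len(name.split()) for name in DICTIONARY)

def tag_entities(text):
    """Return (surface text, entity type) pairs using greedy longest match."""
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = DICTIONARY.get(candidate.lower())
            if label is not None:
                found.append((candidate, label))
                i += n
                break
        else:
            # No dictionary entry starts at this token.
            i += 1
    return found

mentions = tag_entities(
    "The Wnt signaling pathway is active in the mushroom body of Apis mellifera."
)
```

Exact matching of this kind breaks down for the medium- and large-variation classes (genes, phenotypes, behaviors), where names vary too much to enumerate.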
11 Massive Relation Extraction
- Expression Location
  - the expression of a gene in some location (tissues, body parts)
- Homology/Orthology
  - one gene is homologous to another gene
- Biological process
  - one gene has some role in a biological process
- Genetic/Physical/Regulatory Interaction
  - one gene interacts with another gene in a certain fashion (3 types of relations)
  - a simple case: Protein-Protein Interaction (PPI)
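The slides do not specify an extraction method for the PPI case. One common baseline, shown here purely as an assumption, reports a candidate interaction whenever two known protein names appear in the same sentence as an interaction trigger word. The protein list, trigger words, and example text are illustrative inputs, not data from the talk.

```python
import re
from itertools import combinations

# Hypothetical inputs: a protein name list and a few interaction trigger words.
PROTEINS = {"RAD51", "BRCA2", "TP53", "MDM2"}
TRIGGERS = {"interacts", "binds", "associates"}

def extract_ppi(text):
    """Return sorted protein pairs that share a sentence with a trigger word."""
    pairs = set()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        if not TRIGGERS & {t.lower() for t in tokens}:
            continue  # no interaction word, so no candidate relation
        mentioned = sorted({t for t in tokens if t in PROTEINS})
        pairs.update(combinations(mentioned, 2))
    return sorted(pairs)

text = (
    "BRCA2 binds RAD51 during DNA repair. "
    "TP53 and MDM2 were both measured. "
    "MDM2 interacts with TP53."
)
ppi = extract_ppi(text)
```

The second sentence mentions two proteins but no trigger word, so it contributes nothing. A baseline like this has no notion of negation or of which protein acts on which; handling those is where parsing and machine learning come in.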
12 Generating New Knowledge
- Entity Relation Graph Mining
- Logic-based Inferences
13 Example of Interactive Graph Mining
[Figure: entity-relation graph. Nodes: Behaviors B1, B2, B3, B4 and Genes A1, A2, A3, A4, A5. Edge labels: isa, Co-occur-fly, Co-occur-bee, Co-occur-mos, Orth-mos, orth, Reg.]
- 1. X = NeighborOf(B4, Behavior, {co-occur, isa}) -> {B1, B2, B3}
- 2. Y = NeighborOf(X, Gene, {co-occur, orth}) -> {A1, A1, A2, A3}
- 3. Y = Y ∪ {A5, A6} -> {A1, A1, A2, A3, A5, A6}
- 4. Z = NeighborOf(Y, Gene, {reg}) -> {A4, A4}
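A NeighborOf-style query can be sketched over a small typed graph. The class, its method names, and above all the edges below are hypothetical: the actual edges of the figure are not recoverable from the transcript, so this toy graph only mimics the shape of the four-step session.

```python
from collections import defaultdict

class EntityGraph:
    """Typed nodes joined by undirected, labeled edges."""

    def __init__(self):
        self.node_type = {}
        self.adjacent = defaultdict(set)  # node -> {(relation, neighbor)}

    def add_node(self, name, node_type):
        self.node_type[name] = node_type

    def add_edge(self, a, relation, b):
        self.adjacent[a].add((relation, b))
        self.adjacent[b].add((relation, a))

    def neighbor_of(self, start, node_type, relations):
        """Nodes of `node_type` one hop from `start` via any of `relations`."""
        start = {start} if isinstance(start, str) else set(start)
        relations = set(relations)
        return {
            other
            for node in start
            for relation, other in self.adjacent[node]
            if relation in relations and self.node_type.get(other) == node_type
        }

g = EntityGraph()
for b in ("B1", "B2", "B3", "B4"):
    g.add_node(b, "Behavior")
for a in ("A1", "A2", "A3", "A4", "A5", "A6"):
    g.add_node(a, "Gene")

# Hypothetical edges, NOT the edges of the original figure.
g.add_edge("B4", "isa", "B2")
g.add_edge("B4", "co-occur", "B1")
g.add_edge("B4", "co-occur", "B3")
g.add_edge("B1", "co-occur", "A1")
g.add_edge("B2", "co-occur", "A2")
g.add_edge("B3", "co-occur", "A3")
g.add_edge("A2", "orth", "A6")
g.add_edge("A1", "reg", "A4")
g.add_edge("A5", "reg", "A4")

x = g.neighbor_of("B4", "Behavior", {"co-occur", "isa"})  # behaviors near B4
y = g.neighbor_of(x, "Gene", {"co-occur", "orth"})        # genes linked to them
y |= {"A5", "A6"}                                         # user adds genes by hand
z = g.neighbor_of(y, "Gene", {"reg"})                     # regulatory neighbors
```

Because each step returns a plain set, the user can inspect, extend, or prune the result before issuing the next query, which is what makes the mining interactive.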
14 Logic-Based Discovery
- Encode all kinds of knowledge in the same knowledge representation language
- Perform logic inferences
- Example
  - Regulate(GeneA, GeneB, ContextC). [Literature mining]
  - SeqSimilar(GeneA, GeneA') [Sequence mining]
  - Regulate(X, Y, C) <= Regulate(Z, Y, C) & SeqSimilar(X, Z) [Human knowledge]
  - => Regulate(GeneA', GeneB, ContextC)
  - ADD InPathway(GeneB, P1)
  - InPathway(X, P) <= Regulate(X, Y, C) & InPathway(Y, P) [Human knowledge]
  - => InvolvedInPathway(GeneA, P1)
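The two rules in this example can be run by simple forward chaining: apply every rule to the known facts, add whatever is new, and repeat until nothing changes. The sketch below hard-codes the two rules as Python functions and encodes facts as tuples. The name GeneA_prime (a second gene that is sequence-similar to GeneA) and the argument order of the SeqSimilar fact are assumptions made so that the first rule has something to derive.

```python
def rule_similar_regulates(facts):
    """Regulate(X, Y, C) <= Regulate(Z, Y, C) & SeqSimilar(X, Z)."""
    derived = set()
    for pred, *args in facts:
        if pred != "Regulate":
            continue
        z, y, c = args
        for pred2, *args2 in facts:
            if pred2 == "SeqSimilar" and args2[1] == z:
                derived.add(("Regulate", args2[0], y, c))
    return derived

def rule_pathway(facts):
    """InPathway(X, P) <= Regulate(X, Y, C) & InPathway(Y, P)."""
    derived = set()
    for pred, *args in facts:
        if pred != "Regulate":
            continue
        x, y, _context = args
        for pred2, *args2 in facts:
            if pred2 == "InPathway" and args2[0] == y:
                derived.add(("InPathway", x, args2[1]))
    return derived

def forward_chain(facts, rules):
    """Apply all rules repeatedly until no new fact is derived."""
    facts = set(facts)
    while True:
        derived = set().union(*(rule(facts) for rule in rules)) - facts
        if not derived:
            return facts
        facts |= derived

facts = {
    ("Regulate", "GeneA", "GeneB", "ContextC"),  # from literature mining
    # Stored as SeqSimilar(X, Z) to match the rule; similarity is symmetric.
    ("SeqSimilar", "GeneA_prime", "GeneA"),      # from sequence mining
    ("InPathway", "GeneB", "P1"),                # the added fact
}
closure = forward_chain(facts, [rule_similar_regulates, rule_pathway])
```

The closure contains the three input facts plus three derived ones: that GeneA_prime regulates GeneB in ContextC, and that both GeneA and GeneA_prime are placed in pathway P1. A production system would use a logic engine with general unification instead of one hand-written function per rule.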
15 Towards Automated Annotation
[Figure: annotation workflow linking the Genome (gene names Name 1, Name 2, ..., Name k), the Literature, Relevant Text, Entities and Relations, and the Gene Ontology (Term 1, Term 2, ..., Term n) through seven numbered components.]
1. Literature Search
2. Gene Summarization
3. GO Term Summarization
4. Gene-Term Association
5. New Term Suggestion
6. Relation Retrieval
7. Inferences
16 Annotation Support Technologies
- Level 1: Literature Search
  - document filtering (locate relevant articles)
  - relevant passage retrieval (locate relevant passages)
- Level 2: Semi-automatic annotation
  - Gene summarization
  - GO term summarization (profiling)
  - Gene-Term association analysis
  - New GO term suggestion
- Level 3: Automatic annotation (Gene -> GO)
  - Relation retrieval (direct mentioning)
  - Inferences/Relation mining (inferred knowledge)
17 Thank You!