UCB BioText TREC 2003 Genomics Track

1
UCB BioText: TREC 2003 Genomics Track
  • Participants
  • Marti Hearst
  • Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz
  • University of California, Berkeley
  • Genomics tasks 1 and 2

2
Overview
  • UCB BioText group took part in Task 1 and Task 2
  • Task 1: information retrieval, information
    extraction (and text classification)
  • Task 2: text classification, information
    extraction
  • Commonalities between the two tasks
  • Named entity recognition in the text
  • Genes and synonyms
  • MeSH concepts
  • Text classification algorithms

3
MeSH Hierarchy
  • Unique identifier, e.g. Abdomen has D000005
  • UMLS semantic tags
  • e.g. Enzyme, Gene or Genome, Mammal, Tissue,
    Virus, etc.
  • Alphanumeric descriptor codes
  • Top-level categories A to K, e.g. A Anatomy,
    H Physical Sciences
  • Under A Anatomy: Body Regions A01, Musculoskeletal
    System A02, Digestive System A03, Respiratory
    System A04, Urogenital System A05, ...
  • Under A01 Body Regions: Abdomen A01.047, Back
    A01.176, Breast A01.236, Extremities A01.378,
    Head A01.456, Neck A01.598, ...
  • Under H Physical Sciences: Electronics, Astronomy,
    Nature, Time
  • Under Electronics: Amplifiers; Electronics,
    Medical; Transducers

4
Task 1
5
TREC Task 1 Overview
  • Search 525,938 MEDLINE records
  • Titles, abstracts, MeSH category terms, citation
    information
  • Topics
  • Taken from the GeneRIF portion of the LocusLink
    database
  • We are supplied with gene names
  • Definition of a GeneRIF
  • For gene X, find all MEDLINE references that
    focus on the basic biology of the gene or its
    protein products from the designated organism. 
    Basic biology includes isolation, structure,
    genetics and function of genes/proteins in normal
    and disease states.

6
TREC Task 1 Sample Query
  • 3 2120 Homo sapiens OFFICIAL_GENE_NAME ets
    variant gene 6 (TEL oncogene)
  • 3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
  • 3 2120 Homo sapiens ALIAS_SYMBOL TEL
  • 3 2120 Homo sapiens PREFERRED_PRODUCT ets variant
    gene 6
  • 3 2120 Homo sapiens PRODUCT ets variant gene 6
  • 3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene
  • The first column is the official topic number
    (1-50).
  • The second column contains the LocusLink ID for
    the gene.
  • The third column contains the name of organism.
  • The fourth column contains the gene name type.
  • The fifth column contains the gene name.
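
The five-column layout described above can be read with a few lines of Python. This is only a sketch: it assumes the columns are tab-separated (the transcript does not show the delimiter), and the names `QueryLine` and `parse_query_line` are made up for illustration.

```python
from typing import NamedTuple


class QueryLine(NamedTuple):
    topic: int         # official topic number (1-50)
    locuslink_id: int  # LocusLink ID for the gene
    organism: str      # name of the organism
    name_type: str     # gene name type, e.g. OFFICIAL_SYMBOL
    name: str          # the gene name itself


def parse_query_line(line: str) -> QueryLine:
    """Split one topic line into its five columns (tab delimiter is an assumption)."""
    topic, locus_id, organism, name_type, name = line.rstrip("\n").split("\t", 4)
    return QueryLine(int(topic), int(locus_id), organism, name_type, name)


parsed = parse_query_line("3\t2120\tHomo sapiens\tOFFICIAL_SYMBOL\tETV6")
print(parsed.name_type, parsed.name)
```

Grouping the parsed lines by `topic` would give all name variants supplied for one gene.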

7
General Architecture
[Figure: architecture diagram. Recoverable labels: Classifier, weight 0.01, "has GeneRIF"]
8
Main Challenges
  • Task 1
  • Given a gene and an organism, find documents
    likely to have a GeneRIF
  • Relevance judgments: GeneRIF references from
    LocusLink
  • Main challenges
  • Ranking
  • Recall
  • Find more gene synonym variations
  • Precision
  • Filter out abstracts with genes from incorrect
    organisms
  • Lower the rank of documents not likely to have a
    GeneRIF

9
Gene Synonym List Creation
10
How to Find Gene Name Synonyms?
  • Strategy
  • Compile a list of gene names from the text
  • Start with a list of gene names from LocusLink
    and MeSH
  • Use an n-gram-based approximate match algorithm
    to find alternative representations of these
    genes in Medline abstracts
  • Look for commonalities and regularities
  • Create a set of name transformation rules
  • Some are better than others

11
Gene Expansion: Sample Expansion Pairs
  • Matches whose Dice coefficient falls between 0.5
    and 1.0
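
A minimal sketch of n-gram-based approximate matching with a Dice coefficient. The slides do not say which n-gram size or normalization was used, so character bigrams over lowercased strings are an assumption here, as are the function names and the inclusive 0.5 to 1.0 bounds.

```python
def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of lowercase character n-grams of a string."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def candidate_pairs(known, found, low=0.5, high=1.0):
    """Pair known gene names with text strings whose Dice score lies in [low, high]."""
    return [(k, f, dice(k, f)) for k in known for f in found
            if low <= dice(k, f) <= high]


# Hypothetical inputs: one known name and two strings seen in abstracts.
print(candidate_pairs(["TEL1"], ["TEL-1", "p53"]))
```

Pairs collected this way would then be inspected by hand to derive name transformation rules.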

12
Gene Expansion: High Confidence Rules
  • Matches whose Dice coefficient falls between 0.5
    and 1.0
  • Rules determined by inspection

13
Organism Filtering
14
Organism Filtering: Strategy
  • Problem
  • The query describes the organism name using the
    LocusLink terminology, which differs from
    MEDLINE's
  • Strategy
  • Semi-automatically determine the translation
  • For a given LocusLink organism name, search for
    that term against the MEDLINE title, abstract,
    and MeSH terms
  • Display the most frequent MeSH terms that result
  • The translation appeared as one of the top 3
  • Could be a useful strategy for other translation
    problems
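
The semi-automatic translation step can be sketched as a frequency count. The record layout (dicts with `title`, `abstract`, and `mesh` keys), the substring match, and the toy records below are assumptions made for illustration; they are not the group's data or code.

```python
from collections import Counter


def top_mesh_terms(records, organism, k=3):
    """Count MeSH terms over records mentioning `organism` in title, abstract, or MeSH."""
    needle = organism.lower()
    counts = Counter()
    for rec in records:
        haystack = " ".join([rec["title"], rec["abstract"], *rec["mesh"]]).lower()
        if needle in haystack:
            counts.update(rec["mesh"])
    return counts.most_common(k)


# Toy records, invented for this example only.
records = [
    {"title": "Mus musculus Brca1", "abstract": "", "mesh": ["Mice", "Animals"]},
    {"title": "x", "abstract": "studies in Mus musculus", "mesh": ["Mice", "Genes"]},
    {"title": "Human study", "abstract": "", "mesh": ["Human"]},
]
print(top_mesh_terms(records, "Mus musculus"))
```

A person then picks the correct translation from the short list, which is why the approach is only semi-automatic.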

15
Organism Filtering: Results
  • Sample Top-Ranked MeSH Terms

16
GeneRIF Classification
17
GeneRIF Classification: Training
  • Used for our second run
  • Motivation
  • Only Medline documents that have been assigned
    GeneRIFs are considered relevant
  • Strategy to improve precision
  • Identify documents likely to have a GeneRIF
    assigned
  • Naïve Bayes classifier (WEKA ML tools)
  • Training
  • 50 gene names, not in TREC training/testing set
  • Train on 1000 top-ranked documents for each gene

18
GeneRIF Classification: Results
19
Document Ranking
20
Document Ranking
  • DB2 Net Search Extender
  • Score = weighted sum of
  • 1.0 × (H compared to phrases in titles)
  • 1.0 × (H compared to phrases in abstracts)
  • 0.015 × (L compared to phrases in titles)
  • 0.015 × (L compared to phrases in abstracts)
  • 1.4 × (query MeSH compared to document MeSH)
  • H = high-confidence gene rules
  • L = low-confidence gene rules
  • Weights determined experimentally
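
The weighted sum can be written out as below. The five weights come from the slide; the component names and the idea that each component arrives as a precomputed match score are assumptions, since the slides do not show how DB2 Net Search Extender produced the per-field scores.

```python
# Weights from the slide; the dictionary keys are hypothetical labels.
WEIGHTS = {
    "high_title": 1.0,      # H names vs. phrases in titles
    "high_abstract": 1.0,   # H names vs. phrases in abstracts
    "low_title": 0.015,     # L names vs. phrases in titles
    "low_abstract": 0.015,  # L names vs. phrases in abstracts
    "mesh": 1.4,            # query MeSH vs. document MeSH
}


def combined_score(component_scores: dict[str, float]) -> float:
    """Weighted sum of per-component match scores; missing components count as 0."""
    return sum(WEIGHTS[name] * component_scores.get(name, 0.0) for name in WEIGHTS)


print(combined_score({"high_title": 1.0, "mesh": 1.0}))
```

With weights this far apart, a low-confidence match only breaks ties between documents that are otherwise scored alike.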

21
Document Retrieval and Ranking
22
Task 1 TREC Evaluation
  • MAP on TREC training data
  • using GeneRIF classifier: 0.5101
  • without GeneRIF classifier: 0.5028
  • MAP on TREC testing data
  • using GeneRIF classifier: 0.3912
  • without GeneRIF classifier: 0.3753
  • Analysis
  • Using the classifier performs better on 27 out of
    50 queries ( on 12).
  • Tuning the parameters on the test set (tried
    afterwards) results in only minor improvement.
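
For reference, mean average precision (MAP) can be computed as in the sketch below. This is the standard textbook definition and is not the official TREC evaluation script; the function names and toy document IDs are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


print(mean_average_precision([(["a", "b"], {"b"}), (["a"], {"a"})]))
```

Relevant documents that are never retrieved still count in the denominator, so MAP rewards recall as well as ranking quality.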

23
Task 2
24
TREC Task 2
  • Problem Definition
  • Given GeneRIFs formatted as
  • 1    355    12107169    J Biol Chem 2002 Sep
    13;277(37):34343-8.    the death effector domain
    of FADD is involved in interaction with Fas.
  • 2    355    12177303    Nucleic Acids Res 2002
    Aug 15;30(16):3609-14.    In the case of
    Fas-mediated apoptosis, when we transiently
    introduced these hybrid-ribozyme libraries into
    Fas-expressing HeLa cells, we were able to
    isolate surviving clones that were resistant to
    or exhibited a delay in Fas-mediated apoptosis w
  • reproduce the GeneRIF from the MEDLINE record.  

25
Preliminary study
  • Find the GeneRIF text in the abstract
  • 33,662 MEDLINE abstracts with GeneRIFs
  • Best match of the GeneRIF text in the abstract
  • Modified Unigram Dice coefficient
  • Accepted if the score is above 80
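
A sketch of finding the best-matching sentence with a unigram Dice coefficient. The slides do not define the "modified" variant, so this uses the plain set-based Dice over lowercased word tokens, scaled to 0-100 to match the threshold above; tokenization and function names are assumptions.

```python
import re


def unigrams(text):
    """Lowercased alphanumeric word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_dice(a, b):
    """Plain unigram Dice on a 0-100 scale (not the authors' modified variant)."""
    x, y = unigrams(a), unigrams(b)
    if not x or not y:
        return 0.0
    return 100.0 * 2 * len(x & y) / (len(x) + len(y))


def best_match(generif, sentences):
    """Return the candidate sentence with the highest Dice overlap with the GeneRIF."""
    return max(sentences, key=lambda s: unigram_dice(generif, s))


candidates = ["FADD has a death effector domain", "Unrelated sentence about mice"]
print(best_match("the death effector domain of FADD", candidates))
```

Candidates here would be the title plus each abstract sentence, with the match accepted only above the threshold.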

26
Baseline
  • Baseline: pick the whole title verbatim
  • Motivation
  • the best match was a substring of the title in
    46.30%
  • the whole title was the best match in 65.10%
  • Baseline Modified Unigram Dice score: 53.39
  • Choose title vs. last sentence
  • Observation
  • the best match is the title OR the last sentence
    in 73.40%
  • If we choose a whole sentence (title vs. last
    sentence)
  • Upper bound (best choice each time): 66.33
  • Lower bound (worst choice each time): 22.62

27
Features
  • We experimented with the following features
  • Nominal features
  • words/stems
  • verbs (most frequent, e.g. bind, block, accept,
    etc.; nominalized)
  • genes
  • gene_freq (number of gene names mentioned)
  • MeSH_unique_ID (e.g. D005796)
  • MeSH_codes (level 1: G14, or level 2: G14.330)
  • MeSH_semantic_type (e.g. cell, human, biological
    function)
  • journal
  • publication_date (month and year, e.g. 10_2003 )
  • Boolean features
  • target_gene (is the target gene mentioned?)
  • is_title (is the current sentence the title?)
  • is_last_sentence (is this the last sentence?)

28
Best Features
  • Standard feature set
  • verbs (most frequent, e.g. bind, block, accept,
    etc.; nominalized)
  • genes_freq (number of gene names mentioned)
  • MeSH_code (cut at level 2, e.g. G14.330)
  • target_gene (is the target gene mentioned?)
  • is_title (is the current sentence the title?)
  • is_last_sentence (is this the last sentence?)
  • The last two were not used in the final tests.
  • Weighted using TF.IDF (except the Boolean
    features)
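
One common TF.IDF formulation is sketched below: raw term frequency times log(N / document frequency). The slides do not say which variant was used, so this formula, the function name, and the toy feature lists are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of feature lists. Returns one {feature: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each feature occurs at least once.
    df = Counter(f for doc in docs for f in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({f: tf[f] * math.log(n / df[f]) for f in tf})
    return weighted


# Toy documents mixing verb and MeSH-code features.
docs = [["bind", "bind", "G14.330"], ["block", "G14.330"]]
print(tf_idf(docs))
```

Boolean features such as target_gene would bypass this weighting and keep their 0/1 values.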

29
Title vs. Last Sentence
  • Text classification
  • Choose title (class A) vs. last sentence (class
    B)
  • Naïve Bayes classifier (WEKA ML tools)
  • The standard features
  • Training and testing
  • Each document represents one example
  • Features extracted from the title and the last
    sentence only
  • Features from the title and the last sentence are
    not distinguished.
  • Distinguishing them lowers the accuracy.
  • Training set: Modified Unigram Dice overlap with
    the GeneRIF
  • Stratified 10-fold cross-validation
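
Stratified k-fold splitting keeps the class proportions roughly equal in every fold. The slides relied on WEKA for this; the sketch below is an independent, simplified illustration that deals examples round-robin per class without shuffling, and its function name is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign each example index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin where the previous class stopped
    for label in sorted(by_class):
        for j, idx in enumerate(by_class[label]):
            folds[(offset + j) % k].append(idx)
        offset += len(by_class[label])
    return folds


# Toy labels: class A = title chosen, class B = last sentence chosen.
labels = ["A"] * 20 + ["B"] * 10
folds = stratified_folds(labels, k=10)
print([len(f) for f in folds])
```

Each fold serves once as the test set while the other nine are used for training.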

30
Task 2 Evaluation
  • Training
  • Document collections
  • 1000, 2000, 10000, 20000, 33662
  • finally, limited the set to the 5 target journals
  • Classification algorithm selection
  • tried decision tree, boosting, kNN, logistic
    regression etc.
  • Feature selection tuning, for a fixed feature set
  • tuned the best minimum frequency thresholds for
    verbs and MeSH_codes: 12 and 5, respectively
  • TREC run
  • Training: 5 journals, excluding the 139 abstracts
    from the TREC test set
  • Feature frequency thresholds as found during
    training: 12 and 5

31
Task 2 Results
32
Discussion
  • Test sets are small and much harder than training
    sets
  • Task 1
  • Organism filter was very helpful
  • Noisy GeneRIF assignment limits the help given by
    the classifier
  • Initial runs supplied by other research groups
    were very helpful
  • Task 2
  • Sentence truncation could improve the results
  • Need ranking, rather than classification
    algorithms
  • Better feature selection needed
  • sensitivity to frequency thresholds
  • MeSH ambiguity
  • verb nominalization

33
Thank you!