CANDID: A candidate gene identification tool

Provided by: janna2
1
CANDID: A candidate gene identification tool
  • Janna Hutz
  • jehutz_at_artsci.wustl.edu
  • March 19, 2007

2
Candidate genes
  • Positional
  • Linkage evidence
  • Deletion syndrome
  • Loss of heterozygosity
  • Disease-related amplification
  • Association
  • Biological
  • Pathways
  • Phenotypic characteristics

ACTA/GGGA
3
A case study: acd
4
A case study: acd
0 cM
acd
31 cM
Os
Es-1
82 cM
5
A case study: acd
3/145
0/145
0/145
6
Which gene is acd?
7
Prioritization tools
  • Endocrinologist/Geneticist
  • Ensembl
  • RT-PCR
  • Sequencing

BINGO!
two years later.
8
How can we improve this?
9
Improve our tools
  • Clinician
  • Has memorized information about many disorders;
    can name some relevant genes
  • Gets his/her information from:

PubMed
10
PubMed
  • How do we use PubMed to analyze our candidates?
  • Enter our phenotypic keywords into PubMed. Read
    the papers that come up in the results. Make a
    list of genes.
  • Do PubMed searches for all the candidates. Read
    the papers that come up in the results. Rate the
    candidates.

Better: Don't do it yourself
11
PubMed
  • Each publication has a PubMed ID
  • Each gene has a Gene ID
  • Wouldn't it be nice if we could link Gene IDs and
    PubMed IDs?
  • ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
  • gene2pubmed.gz
  • Columns: TaxonomyID, GeneID, PubMedID
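
The gene2pubmed link file can be turned into a lookup table with a few lines of Python. This is a minimal sketch, not CANDID's own code: it assumes the file is tab-separated with a header line starting with "#", and the function name load_gene2pubmed and the sample IDs are made up for illustration.

```python
import io
from collections import defaultdict

def load_gene2pubmed(handle, tax_id="9606"):
    """Map GeneID -> set of PubMed IDs for one taxonomy ID.

    Assumes three tab-separated columns per line:
    TaxonomyID, GeneID, PubMedID (header lines start with '#').
    """
    gene_to_pmids = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        tax, gene_id, pmid = fields[:3]
        if tax == tax_id:
            gene_to_pmids[gene_id].add(pmid)
    return dict(gene_to_pmids)

# Tiny in-memory sample with made-up IDs, standing in for the real file.
sample = "#tax_id\tGeneID\tPubMed_ID\n9606\t1\t111\n9606\t1\t222\n10090\t2\t333\n"
links = load_gene2pubmed(io.StringIO(sample))
print(links)
```

For the downloaded file, the same function should work on a handle opened with gzip.open("gene2pubmed.gz", "rt").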

12
Who makes that file? (1)
  • From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
  • Links between Gene and PubMed are the result of
    the following:
  • 1. Manual curation within NCBI. Part of the
    process of generating a REVIEWED RefSeq is an
    analysis of the current literature. Papers that
    are seminal in defining the gene, its sequence,
    and its function are added to the record at that
    time. Alert users point out gaps or errors in
    papers associated with a Gene record. These
    messages are reviewed and implemented as required.

13
Who makes that file? (2)
  • 2. Integration of information from other public
    databases. Gene integrates gene-citation from
    resources external to NCBI such as model
    organism-specific databases, Gene Ontology (GO),
    groups curating interactions, and sequence
    databases. The assumption in using these sources
    is that they report citations specific to a gene
    in a known species. Gene does not process
    citations from OMIM automatically, because many
    of the citations in OMIM refer to studies of genes in
    species other than human.

14
Example 1
pancreatic cancer
sequence candidates

15
Help Sally.
  • Use CANDID's literature criterion
  • http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html
  • User: workshop; Password: perl031907

17
Help Sally.
  • Look for genes that are involved with pancreatic
    cancer.
  • What are some keywords we can use?

19
A measure of relevancy
  • Find relevant publications
  • Is Gene X linked to these publications?
  • How many publications match?
  • What percent of Gene X's publications match?
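
The two quantities asked about above (how many publications match, and what percent of the gene's publications match) can be computed with set operations. This is only a sketch of those two counts; it is not CANDID's literature score formula, and the function name literature_match and the IDs are hypothetical.

```python
def literature_match(gene_pmids, relevant_pmids):
    """Return (number of matching publications, fraction of the gene's
    publications that match). Both inputs are collections of PubMed IDs."""
    gene_pmids = set(gene_pmids)
    matches = gene_pmids & set(relevant_pmids)
    # A gene with no linked publications gets a fraction of 0.0 (assumption).
    fraction = len(matches) / len(gene_pmids) if gene_pmids else 0.0
    return len(matches), fraction

# Made-up IDs: Gene X has four linked papers, two of which are relevant.
count, fraction = literature_match({"111", "222", "333", "444"},
                                   {"222", "444", "999"})
print(count, fraction)
```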

20
By the numbers
  • Literature scores run from 0 to 1.
  • The score is

21
Matching
  • Every publication has a 'Text Words' field that
    includes, when available:
  • Title
  • Abstract
  • Other abstract
  • MeSH terms
  • MeSH subheadings
  • Publication types
  • Substance names
  • Personal name as subject
  • MEDLINE secondary source
  • Other terms
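
One way to search the Text Words field by hand is PubMed's [tw] field tag. The sketch below only builds a query string and an E-utilities search URL; it makes no network call. Whether CANDID queries PubMed this way is an assumption, and the helper names text_word_query and esearch_url are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def text_word_query(keywords):
    """Join keywords into one PubMed query restricted to Text Words ([tw])."""
    return " OR ".join(f'"{kw}"[tw]' for kw in keywords)

def esearch_url(keywords, retmax=100):
    """Build (but do not send) a search URL for the given keywords."""
    params = {"db": "pubmed", "term": text_word_query(keywords), "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

query = text_word_query(["pancreatic cancer", "pancreatic neoplasms"])
url = esearch_url(["pancreatic cancer"])
print(query)
print(url)
```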

22
Summary
23
Results
24
Exporting to Excel
  • Output file is a comma-separated file
  • Download it, and change the .output extension to .csv.
  • If Excel doesn't open it automatically when you
    click on it, paste the data into a new sheet and
    use the Text Import Wizard to separate the
    columns.
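
If Excel is not available, the comma-separated output can also be read with Python's csv module. This is a sketch under assumptions: the column names and values in the sample are invented, since the real output layout is not shown here.

```python
import csv
import io

def read_candid_output(handle):
    """Read a comma-separated file into a list of rows (lists of strings)."""
    return list(csv.reader(handle))

# Hypothetical sample standing in for a downloaded .output file.
sample = io.StringIO("gene,score\nGENE_A,0.8\nGENE_B,0.1\n")
rows = read_candid_output(sample)
header, data = rows[0], rows[1:]
print(header)
print(data)
```

For a real file, pass a handle from open(path, newline="") instead of the in-memory sample.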

25
Drawbacks
  • What if a gene isn't associated with any
    publications?
  • It's not important
  • It's not yet characterized

26
What about those genes?
27
Analyzing the other genes
  • We don't have literature data.
  • We don't have expression data.
  • All we have is a sequence.

28
Fun with sequences
  • DNA
  • Cross-species conservation
  • RNA (cDNA)
  • Cross-species conservation
  • Protein sequence prediction
  • Protein conservation
  • Protein domain prediction

29
Protein domains
  • InterPro
  • Conserved Domain Database (NCBI)
  • Wouldn't it be nice if we could link Gene IDs and
    protein domains?

InterPro
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
30
Who makes those links?
  • From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
  • Links between Entrez Gene and Conserved Domain
    Database (CDD) are calculated from the domains
    annotated by the CDD group on Reference Sequence
    proteins.

31
How can we use this?
  • The CDD domains have descriptions.
  • These descriptions can be searched, just like
    when we searched PubMed!
  1. CANDID finds domains containing our keywords.
  2. If a gene has one of those domains, it gets a
    score of 1.
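
The two steps above can be sketched in Python. The score of 1 for a gene with a matching domain comes from the slide; returning 0 otherwise, the case-insensitive substring match, and the names matching_domains and domain_score are assumptions made for illustration, as are the sample domain IDs.

```python
def matching_domains(domain_descriptions, keywords):
    """Step 1: return IDs of domains whose description contains any keyword."""
    keywords = [kw.lower() for kw in keywords]
    return {
        dom for dom, desc in domain_descriptions.items()
        if any(kw in desc.lower() for kw in keywords)
    }

def domain_score(gene_domains, hits):
    """Step 2: score 1 if the gene has a matching domain, else 0 (assumed)."""
    return 1 if set(gene_domains) & hits else 0

# Made-up domain IDs and descriptions.
domains = {
    "dom1": "Protein kinase catalytic domain",
    "dom2": "Zinc finger, C2H2 type",
}
hits = matching_domains(domains, ["kinase"])
print(hits, domain_score(["dom1", "dom2"], hits), domain_score(["dom2"], hits))
```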

32
How far back does our gene go?
  • Is our gene in mammals?
  • Fish?
  • Bacteria?

33
More sequence fun
  • Many measures of conservation
  • Nucleotide similarity (percentage, pairwise)
  • Amino acid similarity (percentage, pairwise)
  • etc., etc.
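
As one concrete example of a pairwise percentage measure, the sketch below computes percent identity between two sequences that are already aligned to the same length. Identity is a simpler stand-in for similarity (it does not score conservative substitutions), and it is not claimed to be what HomoloGene or CANDID computes.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned positions with identical residues.

    Both sequences must be pre-aligned and of equal length; '-' marks a gap.
    Columns where both sequences have a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue  # gap in both sequences: not a real position
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

print(percent_identity("ACTAGGGA", "ACTGGGGA"))  # one mismatch in eight
print(percent_identity("AC-T", "ACGT"))          # a gap counts as a mismatch
```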

34
HomoloGene
  • Gets sequences
  • Uses amino acid AND nucleotide similarity
    measures
  • Plus lots more math, equals
  • A label that answers our question

35
Labels used in CANDID
  • Homo sapiens
  • Primates (chimp, gorilla)
  • Rodents (rat, mouse)
  • Eutherian mammals (dog, cow, cat)
  • Amniota (chicken)
  • Insects (mosquito, bee)
  • Bilateria (C. elegans)
  • Fungi
  • Eukaryotes

36
Example 2
Known and unknown genes
37
Array candidates
  • Let's increase the number of CANDID results we
    got in Example 1

38
Weighting system
  • Prioritize genes of known or unknown function
  • Modify weights for each category
  • Well-characterized genes: higher literature
    weight
  • Uncharacterized genes: higher domain and
    conservation weights
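
To see how changing the weights shifts the ranking, here is a sketch that combines per-criterion scores as a weighted average. The weighted average itself is an assumption (the slides do not say how CANDID combines categories), and the function name combined_score and all numbers are made up.

```python
def combined_score(scores, weights):
    """Weighted average of per-criterion scores (each assumed to be 0 to 1)."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(scores[name] * weights.get(name, 0.0) for name in scores)
    return weighted / total_weight

# An uncharacterized gene: no literature, but a matching domain.
scores = {"literature": 0.0, "domains": 1.0, "conservation": 0.5}

# Hypothetical weight sets for the two strategies on the slide.
favor_known = {"literature": 3, "domains": 1, "conservation": 1}
favor_unknown = {"literature": 0, "domains": 2, "conservation": 2}

print(combined_score(scores, favor_known))
print(combined_score(scores, favor_unknown))
```

With the literature-heavy weights this gene scores low; with the domain and conservation weights raised, the same gene moves up.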

39
Example 3
  • Make up your own example!
  • Use literature, domains, and/or conservation
    criteria.

40
Next week
  • Expression data
  • Linkage data
  • Association data
  • CANDID's efficiency
  • Anything else?
