UCSC Known Genes Version 3 Take 9

Slides: 22
Provided by: jimk88
Transcript and Presenter's Notes

1
UCSC Known Genes Version 3 Take 9
2
Known Gene History
  • Initially based on Genie predictions constrained
    by BLAT mRNA alignments.
  • David Kulp got busy at Affy.
  • Switched to RefSeq
  • Jim got paranoid Riken RNAs would take over
  • Fan built KG 1
  • Mark got annoyed at low quality predictions
  • Fan and Mark built KG 2
  • Jim got annoyed at missing genes
  • KG 3
  • The perfect set until KG 4.

3
Overall Pipeline
  • Get alignments etc. from database
  • Remove antibody fragments
  • Clean alignments, project to genome
  • Cluster into splicing graph
  • Add EST, Exoniphy, OrthoSplice info.
  • Walk unique transcripts out of graph.
  • Assign coding regions (CDS) to transcripts.
  • Classify into coding, antisense, noncoding.
  • Remove weak transcripts.
  • Assign accessions.
  • Build gene-centric database tables.

4
Genbank Alignment Issues
  • Using global instead of local near-best
    alignment, also higher stringency.
  • Including all Genbank RNA, not just mRNA
  • These changes not yet reflected in Genbank
    mRNA/RefSeq tracks.
  • Collect data such as selenocysteine substitutions
    and alternative start codons from Genbank. These
    data are in the .ra files but not the SQL
    database.

5
Removing Antibody Var Regions
  • Chromosomes 2,14,22 contain antibody regions.
  • Thousands of transcripts for these in Genbank.
  • Gaps are from genomic rearrangements, not
    splicing. Millions of possibilities.
  • Identify regions by:
      • Searching for words like "immunoglobulin
        variable" to make initial set of Ab fragments.
      • Treating anything that overlaps these as an Ab
        fragment too.
      • Clustering together putative Ab fragments.
      • Taking the 4 largest clusters as the 4 variable
        regions. (One is just a pseudogene of a real
        variable region.)
  • Remove all alignments in Ab clusters.
  • Replace with a single noncoding gene for each
    cluster near end of gene build.
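
The identification steps above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's code: the `Aln` record, the keyword list, the repeat-until-stable overlap pass, and ranking clusters by member count (rather than genomic span) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aln:
    """Hypothetical alignment record; field names are illustrative."""
    name: str
    chrom: str
    start: int
    end: int
    description: str


AB_KEYWORDS = ("immunoglobulin variable",)  # assumed keyword list


def overlaps(a, b):
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_ab_clusters(alns, n_regions=4):
    # Seed set: alignments whose description mentions an Ab keyword.
    ab = [a for a in alns
          if any(k in a.description.lower() for k in AB_KEYWORDS)]
    rest = [a for a in alns if a not in ab]
    # Anything overlapping an Ab fragment is treated as Ab too.
    # Repeating until nothing changes is an assumption of this sketch.
    changed = True
    while changed:
        changed = False
        still = []
        for a in rest:
            if any(overlaps(a, b) for b in ab):
                ab.append(a)
                changed = True
            else:
                still.append(a)
        rest = still
    # Cluster putative Ab fragments by merging overlapping intervals.
    clusters = []
    for a in sorted(ab, key=lambda x: (x.chrom, x.start)):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == a.chrom and a.start < last["end"]:
            last["end"] = max(last["end"], a.end)
            last["members"].append(a)
        else:
            clusters.append({"chrom": a.chrom, "start": a.start,
                             "end": a.end, "members": [a]})
    # "Largest" is taken here to mean most member alignments.
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_regions]
```

Every alignment inside the returned clusters would then be dropped and replaced by one noncoding gene per cluster later in the build.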

6
Chr22 Ab Region (lambda light chain)
7
Cleaning, projecting alignments
  • BLAT sometimes leaves messy gappy ends.
  • New heuristic:
      • For gaps of 6 bases or fewer on both mRNA and
        genome, just ignore the gap, filling in with
        genome if necessary.
      • Try to turn other gaps into introns, if they are
        not already, by wiggling one base on either side
        of the gap.
      • Break up alignments at remaining gaps that are
        not intronic. Intronic gaps are at least 16
        bases, and have gt/ag or gc/ag ends.
      • After breaking up, throw away any pieces less
        than 18 bases long.
      • For RefSeq mRNA only, join pieces back together
        after breaking up. Other mRNA can be joined by
        other transcripts (which may not suffer the same
        problems from polymorphism/error).
  • Consider applying similar heuristic in mRNA
    track.
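
A rough sketch of the gap heuristic, for illustration only. It assumes plus-strand alignments given as hypothetical `(q_start, t_start, size)` blocks, treats a gap as an intron only when the mRNA side has no gap, and leaves out the one-base wiggle step and the RefSeq rejoining step.

```python
def is_intron(genome, start, end, min_size=16):
    """True if genome[start:end] is long enough and has gt/ag or gc/ag ends."""
    if end - start < min_size:
        return False
    donor = genome[start:start + 2].lower()
    acceptor = genome[end - 2:end].lower()
    return donor in ("gt", "gc") and acceptor == "ag"


def clean_alignment(blocks, genome, max_small_gap=6, min_piece=18):
    """Project an alignment onto the genome and break it at bad gaps.

    blocks: sorted list of (q_start, t_start, size) tuples, plus strand
    (an assumed representation). Returns a list of pieces; each piece is
    a list of genomic (start, end) exons.
    """
    pieces, exons = [], []
    q0, t0, size = blocks[0]
    cur = [t0, t0 + size]
    prev_q_end = q0 + size
    for q, t, size in blocks[1:]:
        q_gap = q - prev_q_end
        t_gap = t - cur[1]
        if q_gap <= max_small_gap and t_gap <= max_small_gap:
            # Small gap on both sides: ignore it, genome fills it in.
            cur[1] = t + size
        elif q_gap == 0 and is_intron(genome, cur[1], t):
            # Intronic gap: close this exon, stay in the same piece.
            exons.append(tuple(cur))
            cur = [t, t + size]
        else:
            # Non-intronic gap: break the alignment here.
            exons.append(tuple(cur))
            pieces.append(exons)
            exons = []
            cur = [t, t + size]
        prev_q_end = q + size
    exons.append(tuple(cur))
    pieces.append(exons)
    # Throw away pieces shorter than min_piece exonic bases.
    return [p for p in pieces if sum(e - s for s, e in p) >= min_piece]
```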

8
Cleaning and projecting
9
Cluster into splicing graph
  • Make graph where vertices are begin/ends of
    exons, edges are exons and introns.
  • Multiple input transcripts can share vertices and
    edges.
  • Went over this in some detail a few weeks back
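
As a reminder of the data structure, here is a minimal sketch of such a graph. The representation (a set of coordinates for vertices, a dict keyed by `(start, end, kind)` for edges) is an assumption made for illustration, not the structure the pipeline uses.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build a toy splicing graph.

    transcripts: dict mapping RNA name -> sorted list of (start, end) exons.
    Returns (vertices, edges). Vertices are exon begin/end coordinates;
    edges map (start, end, kind) to the set of RNAs that use that edge,
    so transcripts sharing an exon or intron share the edge.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            if i + 1 < len(exons):
                next_start = exons[i + 1][0]
                edges[(end, next_start, "intron")].add(name)
    return vertices, dict(edges)
```

Two RNAs that differ only in the end of their last exon would share every vertex and edge except that final exon edge.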

10
Splicing graph and txWalk
11
Adding Evidence to Graph
  • Initial evidence for each edge comes from mRNAs.
  • Add EST evidence if an edge is supported by at
    least 2 ESTs. (A single EST is likely the same
    clone as a single RNA.) Just use spliced ESTs.
  • Make graph in mouse and map via chains. Reinforce
    orthologous human edges.
  • Reinforce exon edges that overlap Exoniphy
    predictions.
  • Evidence weights: RefSeq 100, each mRNA 2, EST
    pair 1, mouse ortho 1, Exoniphy 1.
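
The weights can be combined into a single score per edge. The sketch below is one possible reading: how the weights add up is not spelled out on the slide, so counting 100 per RefSeq and one point per pair of spliced ESTs are assumptions, as are the function and parameter names.

```python
# Weights as listed on the slide.
WEIGHTS = {"refSeq": 100, "mRNA": 2, "estPair": 1,
           "mouseOrtho": 1, "exoniphy": 1}


def edge_weight(n_refseq, n_mrna, n_spliced_ests, mouse_ortho, exoniphy):
    """Sum the evidence for one edge (assumed additive)."""
    weight = n_refseq * WEIGHTS["refSeq"]
    weight += n_mrna * WEIGHTS["mRNA"]
    # ESTs only count in pairs; a lone EST adds nothing.
    weight += (n_spliced_ests // 2) * WEIGHTS["estPair"]
    if mouse_ortho:
        weight += WEIGHTS["mouseOrtho"]
    if exoniphy:
        weight += WEIGHTS["exoniphy"]
    return weight
```

Under this reading a single mRNA needs one more line of support, such as an EST pair or an Exoniphy overlap, before its edge reaches a weight of 3.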

12
Walking graph
  • Weight of 3 on an edge is good enough.
  • Rank input RNA by whether refSeq, and number of
    good edges they use.
  • If any good edges, output a transcript consisting
    of the edges used by the first RNA.
  • Output transcript based on next RNA if the good
    edges it uses have not been output in same order
    before.
  • Continue until reach last RNA.
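
A simplified sketch of the walk, under stated assumptions: each RNA is a hypothetical dict with `name`, `is_refseq`, and an ordered tuple of edge IDs; "output in same order before" is read as "forms a contiguous run inside an already output transcript"; and each output transcript is reported as its good edges only.

```python
def is_subrun(small, big):
    """True if tuple `small` occurs as a contiguous run inside tuple `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def walk(rnas, edge_weight, good=3):
    """Walk transcripts out of the graph in rank order.

    rnas: list of dicts with keys "name", "is_refseq", "edges".
    edge_weight: dict mapping edge ID -> evidence weight.
    Returns a list of (rna_name, good_edges) for each transcript output.
    """
    def good_edges(rna):
        return tuple(e for e in rna["edges"] if edge_weight.get(e, 0) >= good)

    # RefSeq first, then by number of good edges (most first).
    ranked = sorted(rnas,
                    key=lambda r: (not r["is_refseq"], -len(good_edges(r))))
    seen, out = [], []
    for rna in ranked:
        g = good_edges(rna)
        if g and not any(is_subrun(g, s) for s in seen):
            seen.append(g)
            out.append((rna["name"], g))
    return out
```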

13
Evidence, Walk, AltSplice
14
Assigning Coding Regions
  • Align UniProt and RefSeq proteins to txWalk
    transcripts. Mark regions they hit as possible
    CDS.
  • Align Genbank/RefSeq RNAs to txWalk transcripts,
    map CDS from RNA records as possible CDS.
  • Use bestorf program for another possible CDS.
  • Assign an ad-hoc score to each possible CDS,
    choose highest scoring.
  • More comparative genomics could really help here
    someday

15
CDS Mapping, Filtering
16
(No Transcript)
17
(No Transcript)
18
Classifying and Weeding
  • The transcripts are classified into:
      • Coding: CDS survives trimming stage.
      • Near-coding: overlap coding by at least 20 bases
        on same strand.
      • Antisense: overlap coding by at least 20 bases on
        opposite strand.
      • Noncoding: other transcripts.
  • Near-coding transcripts that show signs of
    incomplete splicing (retained intron, bleeds > 100
    bases into intron) are removed.
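
The four classes can be sketched as a small decision function. This is illustrative only: transcripts are hypothetical dicts, overlap is measured exon-to-exon, and giving near-coding priority over antisense when both apply is an assumption. The incomplete-splicing filter is not shown.

```python
def exon_overlap(a_exons, b_exons):
    """Total number of bases shared between two exon lists."""
    return sum(max(0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in a_exons
               for b_start, b_end in b_exons)


def classify(tx, coding_txs, min_overlap=20):
    """Return "coding", "nearCoding", "antisense", or "noncoding".

    tx and each element of coding_txs are dicts with keys
    "chrom", "strand", "exons", and "has_cds" (assumed field names).
    """
    if tx["has_cds"]:
        return "coding"
    same_strand = opposite_strand = False
    for c in coding_txs:
        if c["chrom"] != tx["chrom"]:
            continue
        if exon_overlap(tx["exons"], c["exons"]) >= min_overlap:
            if c["strand"] == tx["strand"]:
                same_strand = True
            else:
                opposite_strand = True
    if same_strand:
        return "nearCoding"
    if opposite_strand:
        return "antisense"
    return "noncoding"
```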

19
Assigning accessions
  • Initial temporary identifiers of form
    <chrom>.<cluster>.<tx>.<accession>, e.g.
    chr22.210.5.AB209301
  • Make permanent identifiers of form TX12345678.
  • Find exact match in previous gene set, and reuse
    previous accession.
  • Find compatible match (all introns alike) in old
    gene set, reuse accession, bump version.
  • Make up new accession otherwise.
  • Record genes in old set not in new.
  • Version 7 -> version 9 mapping is actually a good
    test of this: 53025 exact, 4732 lost, 3736 new,
    464 compatible.
  • Move to UC1234567 format in v. 10?
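
The matching rules can be sketched as a two-pass assignment. This is a sketch under assumptions, not the actual procedure: the input dict layouts are hypothetical, each old accession is reused at most once, single-exon transcripts are never treated as "compatible" (they have no introns to compare), and versions are plain integers.

```python
def introns(exons):
    """Introns implied by a sorted tuple of (start, end) exons."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def assign_accessions(new_txs, old, next_id):
    """Assign permanent accessions to a new gene set.

    new_txs: dict temp_id -> (chrom, strand, exons)
    old:     dict accession -> (version, chrom, strand, exons)
    next_id: first unused number for brand-new accessions.
    Returns (result, lost); result maps temp_id -> (accession, version,
    status) and lost lists old accessions absent from the new set.
    """
    exact = {(c, s, tuple(e)): acc for acc, (v, c, s, e) in old.items()}
    compat = {}
    for acc, (v, c, s, e) in old.items():
        if len(e) > 1:
            compat.setdefault((c, s, introns(e)), acc)

    result, used = {}, set()
    # Pass 1: exact matches keep accession and version.
    for tid, (c, s, e) in new_txs.items():
        acc = exact.get((c, s, tuple(e)))
        if acc is not None and acc not in used:
            result[tid] = (acc, old[acc][0], "exact")
            used.add(acc)
    # Pass 2: compatible matches bump the version; the rest are new.
    for tid, (c, s, e) in new_txs.items():
        if tid in result:
            continue
        acc = compat.get((c, s, introns(e))) if len(e) > 1 else None
        if acc is not None and acc not in used:
            result[tid] = (acc, old[acc][0] + 1, "compatible")
        else:
            acc = f"TX{next_id:08d}"
            next_id += 1
            result[tid] = (acc, 1, "new")
        used.add(acc)
    lost = sorted(set(old) - used)
    return result, lost
```

Counting the `status` values in the result gives the same kind of exact/compatible/new/lost tally quoted for the version 7 to version 9 mapping.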

20
Building gene-centric tables
  • mmBlastTab, rnBlastTab etc. homolog tables.
    Blastp best plus syntenic weeding.
  • kgXref and knownToXxx tables to relate gene to
    other databases and tables.
  • kgAlias table to help search on gene names.
  • gnfAtlas2Distance to measure expression
    similarity between genes for Gene Sorter, plus 3
    other expression distance tables.
  • humanVidalP2P and humanWankerP2P protein network
    distance tables.
  • knownCanonical/knownIsoform tables to help people
    selectively view alt-splicing.
  • pbXXX tables for proteome browser.
  • In all about 10 hours of compute and indexing.

21
The Plan
  • Next week:
      • Test preliminary integration on hg18a.
      • Resolve issues with proteome browser.
      • Tinker on take 10, maybe take 11.
  • Week after:
      • Integration of final gene build into hg18a.
      • Move hg18.knownGenes to hg18.knownGenesOld.
      • Swap hg18a tables into hg18.
  • Coming months:
      • Continue to improve gene build.
      • Add new information from build into details
        pages.
      • Allow user filtering of which genes are shown.
      • Allow selection by names as well as IDs in
        table browser.
      • Present at Cold Spring Harbor. Write up paper.