UCSC Known Genes Version 3 Take 9 - PowerPoint PPT Presentation

Slides: 22

Provided by: jimk88

Learn more at: https://users.soe.ucsc.edu

1
UCSC Known Genes Version 3 Take 9
2
Known Gene History

3
Overall Pipeline

4
Genbank Alignment Issues

Using global instead of local near-best alignment, also higher stringency.
Including all Genbank RNA, not just mRNA.
These changes are not yet reflected in the Genbank mRNA/RefSeq tracks.
Collect data such as selenocysteine substitutions and alternative start codons from Genbank. These data are in the .ra files but not the SQL database.

5
Removing Antibody Var Regions

Chromosomes 2, 14, 22 contain antibody regions.
Thousands of transcripts for these in Genbank.
Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
Identify regions by:
  Searching for words like "immunoglobulin variable" to make initial set of Ab fragments.
  Treat anything that overlaps these as Ab fragment too.
  Cluster together putative Ab fragments.
  Take 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
Remove all alignments in Ab clusters.
Replace with a single noncoding gene for each cluster near end of gene build.
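
The identification steps above can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline's actual code: alignments are assumed to be (chrom, start, end, description) tuples, overlap with a seed fragment is checked in a single pass, and "largest" is taken to mean the cluster with the most member alignments. The function name find_ab_clusters is hypothetical.

```python
def find_ab_clusters(alignments, keyword="immunoglobulin variable", n_regions=4):
    """Return the n_regions largest clusters of putative antibody fragments.

    alignments: list of (chrom, start, end, description) tuples (assumed layout).
    """
    def overlaps(a, b):
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    # Initial set: alignments whose description mentions the keyword.
    seeds = [a for a in alignments if keyword in a[3].lower()]
    # Anything that overlaps a seed is treated as an Ab fragment too.
    frags = [a for a in alignments if any(overlaps(a, s) for s in seeds)]

    # Cluster overlapping fragments by sweeping along each chromosome.
    clusters = []
    for a in sorted(frags, key=lambda a: (a[0], a[1])):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == a[0] and a[1] < last["end"]:
            last["end"] = max(last["end"], a[2])
            last["members"].append(a)
        else:
            clusters.append({"chrom": a[0], "start": a[1], "end": a[2],
                             "members": [a]})

    # Assumption: cluster size is measured by number of member alignments.
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_regions]
```

Every alignment falling inside a returned cluster would then be removed and replaced by one noncoding gene per cluster.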

6
Chr22 Ab Region (lambda light chain)
7
Cleaning, projecting alignments

BLAT sometimes leaves messy gappy ends.
New heuristic:
  For gaps of 6 bases or less on both mRNA and genome, just ignore gap, filling in with genome if necessary.
  Try to turn other gaps into introns, if they are not already, by wiggling one base on either side of gap.
  Break up alignments at remaining gaps that are not intronic. Intronic gaps are at least 16 bases, and have gt/ag or gc/ag ends.
  After break up throw away any pieces less than 18 bases long.
  For refSeq mRNA only, join pieces back together after breaking up. Other mRNA can be joined by other transcripts (which may not suffer the same problems from polymorphism/error).
Consider applying similar heuristic in mRNA track.
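
A minimal Python sketch of the gap rules above, using the thresholds from the slide (6, 16, and 18 bases). The function names classify_gap and keep_piece are hypothetical, the one-base wiggle step is left out, and treating an intron as a gap on the genome side only is an assumption not stated in the slide.

```python
SMALL_GAP = 6     # gaps this size or smaller on both sides are ignored
MIN_INTRON = 16   # intronic gaps are at least this many bases
MIN_PIECE = 18    # shorter pieces are thrown away after break-up
INTRON_ENDS = {("gt", "ag"), ("gc", "ag")}


def classify_gap(mrna_gap, genome_gap, genome_gap_seq):
    """Return 'ignore', 'intron', or 'break' for one alignment gap.

    mrna_gap, genome_gap: gap size in bases on each side of the alignment.
    genome_gap_seq: genomic sequence inside the gap, on the transcript strand.
    """
    if mrna_gap <= SMALL_GAP and genome_gap <= SMALL_GAP:
        return "ignore"
    seq = genome_gap_seq.lower()
    ends = (seq[:2], seq[-2:])
    # Assumption: an intron is a genome-only gap (no unaligned mRNA bases).
    if mrna_gap == 0 and genome_gap >= MIN_INTRON and ends in INTRON_ENDS:
        return "intron"
    return "break"


def keep_piece(length):
    """After breaking at non-intronic gaps, keep only pieces of 18+ bases."""
    return length >= MIN_PIECE
```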

8
Cleaning and projecting
9
Cluster into splicing graph

10
Splicing graph and txWalk
11
Adding Evidence to Graph

Initial evidence for each edge comes from mRNAs.
Add EST evidence if edge is supported by at least 2 ESTs. (Single EST likely is same clone as single RNA.) Just use spliced ESTs.
Make graph in mouse and map via chains. Reinforce orthologous human edges.
Reinforce exon edges that overlap Exoniphy predictions.
Evidence weights: refSeq 100, each mRNA 2, EST pair 1, mouse ortho 1, Exoniphy 1.
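
The weights listed above can be combined per edge roughly as follows. This is a sketch, not the pipeline's code: the function name edge_weight is hypothetical, and it assumes the refSeq and EST-pair weights are each added once (not once per record), which the slide does not spell out.

```python
def edge_weight(has_refseq, n_mrna, n_spliced_est, mouse_ortho, exoniphy):
    """Sum the evidence for one splicing-graph edge.

    has_refseq:    True if any refSeq RNA uses the edge (assumed flat 100).
    n_mrna:        number of other mRNAs using the edge (2 each).
    n_spliced_est: number of spliced ESTs using the edge; they only count
                   when there are at least 2 (assumed flat 1).
    mouse_ortho:   True if an orthologous mouse edge maps here via chains.
    exoniphy:      True if the exon edge overlaps an Exoniphy prediction.
    """
    weight = 0
    if has_refseq:
        weight += 100
    weight += 2 * n_mrna
    if n_spliced_est >= 2:
        weight += 1
    if mouse_ortho:
        weight += 1
    if exoniphy:
        weight += 1
    return weight
```

Under these assumptions a single mRNA alone gives weight 2, and it needs one more line of evidence (an EST pair, mouse orthology, or Exoniphy) to reach 3.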

12
Walking graph

Weight of 3 on an edge is good enough.
Rank input RNA by whether refSeq, and number of good edges they use.
If any good edges, output a transcript consisting of the edges used by the first RNA.
Output transcript based on next RNA if the good edges it uses have not been output in same order before.
Continue until reach last RNA.
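
The walk above can be sketched in Python as follows. The name walk_graph and the input layout are hypothetical, and "not output in same order before" is interpreted here as "the exact ordered list of good edges has not been output yet", which is an assumption; the real txWalk may use a different test.

```python
GOOD_WEIGHT = 3  # an edge with at least this much evidence is "good"


def walk_graph(rnas, edge_weights):
    """Pick output transcripts from ranked input RNAs.

    rnas: list of (name, is_refseq, [edge ids in transcript order]).
    edge_weights: dict mapping edge id to its evidence weight.
    Returns a list of (name, tuple of good edges) for each output transcript.
    """
    def good_edges(edges):
        return tuple(e for e in edges if edge_weights.get(e, 0) >= GOOD_WEIGHT)

    # refSeq first, then by number of good edges used (most first).
    ranked = sorted(rnas, key=lambda r: (not r[1], -len(good_edges(r[2]))))

    seen = set()
    output = []
    for name, _is_refseq, edges in ranked:
        good = good_edges(edges)
        # Skip RNAs with no good edges or with an already-output edge path.
        if good and good not in seen:
            seen.add(good)
            output.append((name, good))
    return output
```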

13
Evidence, Walk, AltSplice
14
Assigning Coding Regions

Align UniProt and RefSeq proteins to txWalk transcripts. Mark regions they hit as possible CDS.
Align Genbank/RefSeq RNAs to txWalk transcripts, map CDS from RNA records as possible CDS.
Use bestorf program for another possible CDS.
Assign an ad-hoc score to each possible CDS, choose highest scoring.
More comparative genomics could really help here someday.

15
CDS Mapping, Filtering
16
(No Transcript)
17
(No Transcript)
18
Classifying and Weeding

The transcripts are classified into:
  Coding: CDS survives trimming stage.
  Near-coding: overlap coding by at least 20 bases on same strand.
  Antisense: overlap coding by at least 20 bases on opposite strand.
  Noncoding: other transcripts.
Near-coding transcripts that show signs of incomplete splicing (retained intron, bleeds > 100 bases into intron) are removed.
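
A small Python sketch of the four-way classification above. It is a simplification under stated assumptions: overlap is measured on whole transcript spans rather than exon by exon, near-coding takes precedence over antisense when both apply, and the dictionary layout and the name classify are hypothetical.

```python
MIN_OVERLAP = 20  # bases of overlap with a coding transcript


def span_overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def classify(tx, coding_txs):
    """Classify one transcript as coding, nearCoding, antisense or noncoding.

    tx and each item of coding_txs: dict with 'start', 'end', 'strand';
    tx also has 'has_cds' (True if its CDS survived trimming).
    """
    if tx["has_cds"]:
        return "coding"
    same = opposite = 0
    for c in coding_txs:
        n = span_overlap(tx["start"], tx["end"], c["start"], c["end"])
        if c["strand"] == tx["strand"]:
            same = max(same, n)
        else:
            opposite = max(opposite, n)
    if same >= MIN_OVERLAP:
        return "nearCoding"
    if opposite >= MIN_OVERLAP:
        return "antisense"
    return "noncoding"
```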

19
Assigning accessions

Initial temporary identifiers of form <chrom>.<cluster>.<tx>.<accession>, e.g. chr22.210.5.AB209301
Make permanent identifiers of form TX12345678.
Find exact match in previous gene set, and reuse previous accession.
Find compatible match (all introns alike) in old gene set, reuse accession, bump version.
Make up new accession otherwise.
Record genes in old set not in new.
Version 7 -> version 9 mapping actually a good test of this: 53025 exact, 4732 lost, 3736 new, 464 compatible.
Move to UC1234567 format in v. 10?
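
The exact / compatible / new / lost bookkeeping above might look like this in Python. It is a sketch under assumptions: transcripts are represented as tuples of (start, end) exons, a compatible match requires at least one intron, each old accession is reused at most once, and new accessions start at version 1. The names introns_of and assign_accessions are hypothetical.

```python
def introns_of(exons):
    """Introns implied by a sorted tuple of (start, end) exons."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def assign_accessions(new_txs, old_txs, next_number):
    """Map temporary ids to (accession, version, kind).

    new_txs: dict temp_id -> exons.
    old_txs: dict accession -> (version, exons).
    next_number: first unused number for new TX accessions.
    Returns (assigned, lost), where lost lists old accessions not reused.
    """
    by_exons, by_introns = {}, {}
    for acc, (_version, exons) in old_txs.items():
        by_exons.setdefault(exons, acc)
        if introns_of(exons):
            by_introns.setdefault(introns_of(exons), acc)

    assigned, used = {}, set()
    # Pass 1: exact matches keep accession and version.
    for temp_id, exons in new_txs.items():
        acc = by_exons.get(exons)
        if acc is not None and acc not in used:
            assigned[temp_id] = (acc, old_txs[acc][0], "exact")
            used.add(acc)
    # Pass 2: compatible matches bump the version; the rest get new ids.
    for temp_id, exons in new_txs.items():
        if temp_id in assigned:
            continue
        acc = by_introns.get(introns_of(exons))
        if acc is not None and acc not in used:
            assigned[temp_id] = (acc, old_txs[acc][0] + 1, "compatible")
        else:
            acc = "TX%08d" % next_number
            next_number += 1
            assigned[temp_id] = (acc, 1, "new")
        used.add(acc)

    lost = sorted(a for a in old_txs if a not in used)
    return assigned, lost
```

Counting the "exact", "compatible" and "new" kinds plus the length of the lost list gives the same four totals the slide reports for the version 7 to version 9 mapping.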

20
Building gene-centric tables

mmBlastTab, rnBlastTab etc. homolog tables. Blastp best plus syntenic weeding.
kgXref and knownToXxx tables to relate gene to other databases and tables.
kgAlias table to help search on gene names.
gnfAtlas2Distance to measure expression similarity between genes for Gene Sorter. 3 other expression distance tables.
humanVidalP2P and humanWankerP2P protein network distance tables.
knownCanonical/knownIsoform tables to help people selectively view alt-splicing.
pbXXX tables for proteome browser.
In all about 10 hours of compute and indexing.

21
The Plan