Title: Spaghetti Code, Soupy Logic


1
Spaghetti Code, Soupy Logic
Steaming fresh modules in sourceforge.net
Combinatorial assembly of transcription factors
in cell.
Jim Kent - University of California Santa Cruz
2
A Challenge Every Speaker Faces
  • Who is the audience?
  • Bioinformaticians
  • Biologists with bigger, better databases?
  • Geeks trading bits for bases?
  • Leading edge interdisciplinary super scientists?

7
Top 5 Reasons Biologists Go Into Bioinformatics
  • 5 - Microscopes and biochemistry are so 20th
    century.
  • 4 - Got started purifying proteins, but it turns
    out the cold room is really COLD.
  • 3 - After 23 years of school wanted to make MORE
    than $23,000/year as a postdoc.
  • 2 - Like to swear, @ttracted to $_ Perl !!
  • 1 - Getting carpal tunnel from pipetting

12
Top 5 Reasons Computer People Go Into
Bioinformatics
  • 5 - Bio courses actually have some females.
  • 4 - Human genome more stable than Windows XP
  • 3 - Having mastered binary trees, quad trees, and
    parse trees, ready for phylogenetic trees.
  • 2 - Missing heady froth of the internet bubble.
  • 1 - Must augment humanity to defeat evil
    artificially intelligent robots.

13
The Paradox of Genomics
How does a long, static, one dimensional string
of DNA turn into the remarkably complex, dynamic,
and three dimensional human body?
GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATG
TAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATA
GCCAGCCAGCCACCACAGGCATGAGT
14
The Analogy of the Code of Life
  • DNA is popularly considered the code of life.
  • Computer programs are complex systems that
    ultimately are built up of 0s and 1s, perhaps
    they are a model for a genome built of A, C, G and
    T?
  • BUT.
  • Human genome lacks documentation, has accumulated
    3 billion years of cruft, and does not believe in
    local variables.
  • Therefore we must look to less than
    straightforward software programs as guides.

15
Bioperl CORBA module
    sub new {
        my ($class, @args) = @_;
        my $self = $class->SUPER::new(@args);
        my ($idl, $ior, $orbname) =
            $self->_rearrange([qw(IDL IOR ORBNAME)], @args);
        $self->{'_ior'} = $ior || 'biocorba.ior';
        $self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
        $self->{'_orbname'} = $orbname || 'orbit-local-orb';
        $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
        my $orb = CORBA::ORB_init($orbname);
        my $root_poa = $orb->resolve_initial_references("RootPOA");
        $self->{'_orb'} = $orb;
        $self->{'_rootpoa'} = $root_poa;
        return $self;
    }
16
Obfuscated C
define c(n,s)case nscontinue char
x"((((((((((((((((((((((",w "\b\b\b\b\b\b\b\
b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"char
r92,124,47,l2,3,1 ,0charT" ","
","\\/"," ",""char d1,p40,o40,k0,a,y
,z,g -1,G,X,PT4,f0unsigned int s0void
u(int i)int nprintf( "\233uH\233Lc\233uHc\
233uHs\23322uH_at_\23323uH \n",x-w,rd,x
w ,rd,X,P,pk,o)if(abs(p-x21)gtw21)exit(0
)if(g!G)struct itimerval t 0,0,0,0g((gltG)
ltlt1)-1t.it_interval.tv_usect.it_value.tv_usec72
000/((ggtgt 3)1)setitimer(0,t,0)fprintf("\e10
u",g24)fputchar(7)s(9-w21 )((ggtgt3)1
)opm(x)m(w)(nrand())255--wwif(!(
PPn7936)) while(abs((Xrand()76)-x2)-w
lt6)XPT(nrand()31)lt3(dn)!d--xlt w
(x,d)d2xwgt79(--x,--d)signal(i
,u)void e()signal(14, SIG_IGN)printf("\e0q\ec
Score u\n",s)system("stty echo -cbreak")int
main (int C,charV)atexit(e)(Clt2V1!113)
(f(C(int)getenv("TERM"))( int)0x756E696CC
(int)0x6C696E75)srand(getpid())system("stty
-echo cbreak" )h(0)u(14)for()switch(getchar()
)case 113return 0case 91case
98c(44,k -1)case 32case 110c(46,k0)case
93case 109c(47,k1)c(49,h(0))c(50,h(1 ))c(51,
h(2))c(52,h(3))
17
Reverse Engineering Microsoft
mouse
blue screen of death
Windows XP
keyboard
network
elaborate proprietary process
18
Looks like code not enough, must study actual
cells DNA
19
How DNA is Used by the Cell
20
Promoter Tells Where to Begin
Different promoters activate different genes
in different parts of the body.
21
A Computer in Soup
Idealized promoter for a gene involved in making
hair. Proteins that bind to specific DNA
sequences in the promoter region together turn a
gene on or off. These proteins are themselves
regulated by their own promoters leading to a
gene regulatory network with many of the same
properties as a neural network.
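
The neural-network analogy can be made concrete with a toy model. The sketch below treats each promoter as a threshold unit that sums the influence of its bound factors; the gene names, weights, and thresholds are invented for illustration and do not describe any real promoter.

```python
# Toy model: each gene's promoter sums the weighted influence of bound
# transcription factors and switches the gene on above a threshold.
# Gene names, weights, and thresholds are invented for illustration.

NETWORK = {
    # gene: (threshold, {regulator: weight}); positive = activator,
    # negative = repressor
    "tfA": (0.5, {"signal": 1.0}),
    "tfB": (0.5, {"tfA": 1.0}),
    "hair_gene": (1.5, {"tfA": 1.0, "tfB": 1.0, "repressor": -2.0}),
}


def step(state):
    """Compute the next on/off state of every regulated gene."""
    new_state = dict(state)
    for gene, (threshold, inputs) in NETWORK.items():
        total = sum(w * state.get(reg, 0) for reg, w in inputs.items())
        new_state[gene] = 1 if total > threshold else 0
    return new_state


def run(state, steps):
    """Apply step() repeatedly and return the final state."""
    for _ in range(steps):
        state = step(state)
    return state
```

Starting from a state where only `signal` is on, the cascade needs three steps before `hair_gene` switches on, and it stays off whenever `repressor` is present.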
22
Genes can be transcription factors that
activate or repress other genes, leading to
regulatory networks such as this one from the
development of the central nervous system. (Image
from D'haeseleer & Somogyi 1999)
23
The Decisions of a Cell
  • When to reproduce?
  • When to migrate and where?
  • What to differentiate into?
  • When to secrete something?
  • When to make an electrical signal?

The more rapid decisions usually are via the cell
membrane and 2nd messengers. The longer acting
decisions are usually made in the nucleus.
24
Nucleus Used to Appear Simple
  • Cheek cells stained with basic dyes. Nuclei are
    readily visible.

25
Mammalian Nuclei Stained in Various Ways
Image from Tom Misteli lab
26
Artist's rendition of nucleus
Image from Nuclear Protein Database
27
Chromatin
28
Turning on a gene
  • Getting DNA into the right compartment of the
    nucleus (may involve very diffuse signals in DNA
    over very long distances)
  • Loosening up chromatin structure (this involves
    activators and repressors which can act over
    relatively long distances)
  • Attracting RNA Polymerase II to the transcription
    start site (these involve relatively close
    factors both upstream and downstream of
    transcription start).

29
Methods for Studying Transcription
  • Genetics in model organisms
  • Promoters hooked to reporter genes
  • Gel shifts and DNase footprinting
  • Phylogenetic footprinting
  • Motif searches in clusters of coregulated genes.

30
Drosophila Genetics
Antennapedia mutant
normal
31
Reporter Gene Constructs
promoter to study
easily seen gene
Drosophila embryo transfected with ftz promoter
hooked up to lacZ reporter gene, creating stripes
where ftz promoter is active.
32
Biochemical Footprinting Assays
Gel showing selective protection of DNA from
nuclease digestion where transcription factor is
bound.
Txn factor footprint
33
Comparative Genomics
Webb Miller
34
Comparative Genomics at BMP10
35
Conservation of Gene Features
  • Conservation pattern across 3165 mappings of
    human RefSeq mRNAs to the genome. A program
    sampled 200 evenly spaced bases across 500 bases
    upstream of transcription start, the 5' UTR, the
    first coding exon, introns, middle coding exons,
    introns, the 3' UTR and 500 bases after
    polyadenylation. There are peaks of conservation
    at the transition from one region to another.
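
Sampling a fixed number of evenly spaced bases lets regions of very different lengths be averaged into one profile. A minimal sketch of that sampling step follows; the function names and the use of bin midpoints are assumptions, not the program used for the slide.

```python
def sample_positions(start, end, n=200):
    """Return n evenly spaced 0-based positions in the half-open range [start, end)."""
    length = end - start
    if length <= 0:
        raise ValueError("region must be non-empty")
    # Take the midpoint of each of n equal-width bins so regions of any
    # length map onto the same n-point profile.
    return [start + int((i + 0.5) * length / n) for i in range(n)]


def conservation_profile(is_conserved, start, end, n=200):
    """Sample a per-base 0/1 conservation list at n evenly spaced points."""
    return [is_conserved[p] for p in sample_positions(start, end, n)]
```

Regions shorter than `n` bases simply yield repeated positions, so short UTRs and long introns still contribute the same number of points.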

36
Detail Near Translation Start
Note the relatively conserved base 3 before
translation start (constrained to be a G or an A
by the Kozak consensus sequence), and the first
three translated bases (ATG).
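
The two constraints named above are easy to check on a sequence. This is an illustrative sketch only; `kozak_check` and its return format are hypothetical names, and it tests just the -3 purine and the ATG rather than the full Kozak consensus.

```python
def kozak_check(mrna, atg_index):
    """Report whether the start codon is ATG and whether base -3 is a purine.

    mrna is a DNA-alphabet string; atg_index is the 0-based offset of the
    A in the putative start codon.
    """
    seq = mrna.upper()
    start_ok = seq[atg_index:atg_index + 3] == "ATG"
    # Position -3 is three bases upstream of the A of ATG.
    minus3 = seq[atg_index - 3] if atg_index >= 3 else None
    return {"start_codon_atg": start_ok, "minus3_purine": minus3 in ("A", "G")}
```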
37
Normalized eScores
38
Conservation Levels of Regulatory Regions in
Human/Mouse Alignments
39
Conservation in Multiple Alignments
  • As you add more species the phylogenetic footprint
    gets sharper.
  • Currently genome.ucsc.edu shows multiple
    alignments between 8 species using Webb Miller's
    multiz program on chained pairwise alignments.
  • The phylogenetic tree has to be considered when
    calculating conservation levels.

40
Simple human/rodent tree
human
mouse
rodent
rat
  • Mutations that occur in the rodent ancestor must
    be counted only once
  • Ideally should take into consideration varying
    mutation rates across species.
  • Conservation track at genome.ucsc.edu is based on
    Adam Siepel's PhyloHMM
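
To see why the tree matters, compare naive pairwise counting with a count made on the tree. The sketch below uses Fitch parsimony, which is a much simpler stand-in and not the PhyloHMM behind the conservation track; it ignores branch lengths and varying mutation rates.

```python
def fitch(tree, column):
    """Return (candidate_states, substitutions) for one alignment column.

    tree is a species name (leaf) or a (left, right) tuple; column maps
    species name -> base.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # Children disagree: one substitution is needed on a branch below
    # this node.
    return left_states | right_states, left_cost + right_cost + 1


# Human is the outgroup; mouse and rat share a rodent ancestor.
TREE = ("human", ("mouse", "rat"))


def naive_pairwise(column):
    """Count human/mouse and human/rat differences separately (double counting)."""
    return sum(column["human"] != column[s] for s in ("mouse", "rat"))
```

For a column where human has A and both rodents have G, the naive count is 2, while the tree-based count is 1 because the change happened once in the rodent ancestor.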

41
PhyloHMM on Drosophila
  • Drosophila proteasome alpha 7-1. In many genes
    like this one, the phylogenetic footprint suggests
    the promoter actually is downstream of the
    transcription start site.

42
Genome Evolution
  • Duplication, deletion, and rearrangement are as
    important to genome evolution as base-level
    mutations.
  • Much of this is driven by transposons
  • Transposon relics are 50% of genome
  • Reverse transcriptase activity from transposons
    encourages processed pseudogene formation as
    well.
  • Transposons seed out-of-place recombination
    leading to tandem and segmental duplications,
    non-processed pseudogenes.
  • Only 5% of human genome seems functional.
  • This messiness provides opportunities for the
    development of new genes, but makes understanding
    the genome a challenge.

43
Pseudogene Data from Robert Baertsch, UCSC Grad
Student
44
Mouse/Human Rearrangement Statistics
Number of rearrangements of given type per
megabase excluding known transposons.
45
Chaining Alignments
  • Chaining bridges the gulf between syntenic blocks
    and base-by-base alignments.
  • Local alignments tend to break at transposon
    insertions, inversions, duplications, etc.
  • Global alignments tend to force non-homologous
    bases to align.
  • Chaining is a rigorous way of joining together
    local alignments into larger structures.

46
Chains join together related local alignments
Protease Regulatory Subunit 3
47
Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human
alignment correlated with sizes of transposon
relics. Affine gap scores model red/blue plots as
straight lines.
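
An affine penalty grows linearly with gap length, which is why it corresponds to a straight line on a log-count plot. The sketch below contrasts it with a concave penalty that charges long gaps far less; the logarithmic form and every parameter value are hypothetical and are not the scores used for the mouse/human alignments.

```python
import math


def affine_gap_cost(length, gap_open=400, gap_extend=30):
    """Classic affine penalty: cost grows linearly with gap length."""
    return gap_open + gap_extend * length


def concave_gap_cost(length, gap_open=400, scale=300):
    """Hypothetical concave penalty: cost grows with the log of gap length.

    Assumes length >= 1.
    """
    return gap_open + scale * math.log(length)


def compare(lengths=(1, 10, 100, 1000, 10000)):
    """Return (length, affine cost, rounded concave cost) for each length."""
    return [(n, affine_gap_cost(n), round(concave_gap_cost(n))) for n in lengths]
```

At transposon-sized gaps of a few thousand bases the affine cost is roughly two orders of magnitude larger than the concave one, which is the harshness the slide title refers to.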
48
Before and After Chaining
49
Chaining Algorithm
  • Input - blocks of gapless alignments from blastz
  • Dynamic program based on the recurrence
    relationship $\mathrm{score}(B_i) = \max_{j<i}\,\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$
  • Uses Miller's KD-tree algorithm to minimize which
    parts of dynamic programming graph to traverse.
    Timing is O(N log N), where N is number of blocks
    (which is in hundreds of thousands)

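
The chaining recurrence can be sketched directly as a quadratic-time dynamic program. This version deliberately omits the KD-tree speedup, so it runs in O(N^2) rather than O(N log N); the `Block` layout and the `gap_cost` function are assumptions made for illustration.

```python
from collections import namedtuple

# A gapless alignment block: half-open ranges on query (q) and target (t).
Block = namedtuple("Block", "q_start q_end t_start t_end score")


def gap_cost(prev, cur):
    """Hypothetical gap penalty: the larger of the two gap lengths."""
    return max(cur.q_start - prev.q_end, cur.t_start - prev.t_end)


def best_chain(blocks):
    """Return (score, blocks) for the highest-scoring chain, in O(N^2) time."""
    blocks = sorted(blocks, key=lambda b: (b.q_start, b.t_start))
    if not blocks:
        return 0, []
    best = [b.score for b in blocks]   # score(Bi) with no predecessor
    back = [None] * len(blocks)        # index of the chosen predecessor
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # Bj may precede Bi only if it ends before Bi starts on both
            # sequences.
            if prev.q_end <= cur.q_start and prev.t_end <= cur.t_start:
                candidate = best[j] + cur.score - gap_cost(prev, cur)
                if candidate > best[i]:
                    best[i], back[i] = candidate, j
    i = max(range(len(blocks)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    return total, chain[::-1]
```

Sorting by query start guarantees that every legal predecessor of a block appears earlier in the list, which is what makes the single left-to-right pass valid.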
50
Netting Alignments
  • Commonly multiple mouse alignments can be found
    for a particular human region, particularly for
    coding regions.
  • Net finds the best mouse match for each human
    region.
  • Highest scoring chains are used first.
  • Lower scoring chains fill in gaps within chains
    inducing a natural hierarchy.
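
The greedy idea behind netting can be sketched on the human axis alone. This simplified version only assigns each human base to the highest-scoring chain that covers it and does not record the nesting levels of the real net; the input layout is an assumption.

```python
def net(chains):
    """Greedy sketch of netting on the human (reference) axis.

    chains is a list of (score, blocks), where blocks are (start, end)
    half-open human intervals. Returns (start, end, chain_index) pieces,
    sorted by start, with every base assigned to at most one chain.
    """
    placed = []  # non-overlapping (start, end, chain_index)
    order = sorted(range(len(chains)), key=lambda i: -chains[i][0])
    for idx in order:
        for start, end in chains[idx][1]:
            # Keep only the parts of this block not already claimed by a
            # higher-scoring chain.
            pieces = [(start, end)]
            for p_start, p_end, _ in placed:
                next_pieces = []
                for s, e in pieces:
                    if p_end <= s or e <= p_start:
                        next_pieces.append((s, e))
                        continue
                    if s < p_start:
                        next_pieces.append((s, p_start))
                    if p_end < e:
                        next_pieces.append((p_end, e))
                pieces = next_pieces
            placed.extend((s, e, idx) for s, e in pieces)
    return sorted(placed)
```

In the example used to check this sketch, a second chain spanning 90-210 is trimmed to 100-200, exactly the gap left between the two blocks of the top-scoring chain.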

51
Net Focuses on Ortholog
52
Net highlights rearrangements
A large gap in the top level of the net is filled
by an inversion containing two genes. Numerous
smaller gaps are filled in by local duplications
and processed pseudo-genes.
53
Useful in finding pseudogenes
Ensembl and Fgenesh automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein
must be interesting!
54
Other tools to cybernetically enhance your mind
at genome.ucsc.edu
55
UCSC Gene Sorter
  • Swiss army knife for dealing with gene sets.
  • Presents functional data on genes including
    microarray expression information.
  • Highlights relationships and connections between
    genes.
  • Powerful data mining tool.

56
UCSC Gene Sorter
Expression and other information on genes in a
big sorted, linked table
57
A Big Bioinformatics Web Site
  • genome.ucsc.edu gets > 100,000 hits by > 5,000
    scientists each day.
  • Involves 570,000 lines of C code, bits of awk,
    Perl, bash, tcsh, Java, R and Tcl.
  • 1200 CPUs and 12 terabytes of disk
  • 12 full-time staff, 18 part-time, grad student
    and post-doc.

58
Site Architecture
  • 8 web servers running Apache and MySQL
  • CGIs written in C access genome data and user
    interface settings in MySQL.
  • Genome database is bottleneck, and is replicated
    on each server.
  • Cluster of 1000 CPUs, and smaller clusters of
    faster CPUs create annotation files which are
    loaded into database.

59
Site Sociology
  • 1/3 of group telecommutes.
  • Thursdays are devoted to reading and testing each
    other's code and, if necessary, a one- or two-hour
    meeting.
  • We develop very incrementally, and do a new
    release once a week.
  • 1/4 of group is dedicated to quality assurance,
    I'm wanting to increase this to 1/3.
  • User support is shared by everyone.

60
Parasol and Kilo Cluster
  • UCSC cluster has 1000 CPUs running Linux
  • 1,000,000 BLASTZ jobs in 25 hours for mouse/human
    alignment
  • We wrote Parasol job scheduler to keep up.
  • Very fast and free.
  • Jobs are organized into batches.
  • Error checking at job and at batch level.
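
A quick back-of-the-envelope calculation shows why a fast scheduler matters at this scale. It assumes all 1000 CPUs stay busy for the full 25 hours, so the per-job figure is an upper bound on the average job length.

```python
def cluster_throughput(jobs, hours, cpus):
    """Return (jobs per second, average CPU-seconds per job)."""
    wall_seconds = hours * 3600
    jobs_per_second = jobs / wall_seconds
    # Assumes every CPU is busy for the whole run, so this is an upper
    # bound on the average job length.
    cpu_seconds_per_job = wall_seconds * cpus / jobs
    return jobs_per_second, cpu_seconds_per_job


rate, per_job = cluster_throughput(jobs=1_000_000, hours=25, cpus=1000)
print(f"{rate:.1f} jobs/s, about {per_job:.0f} CPU-seconds per job")
```

The scheduler has to dispatch roughly 11 jobs every second, each lasting about a minute and a half at most, for the whole run.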

61
Conclusions
  • Spaghetti code is not so helpful in
    understanding the genome.
  • Human genome suggests that trial and error
    development is likely to yield a robust version
    of Windows within 3 billion years.
  • Understanding the flow of control in the genome
    is a problem that fascinates biologists and
    computer scientists alike.

62
Further Acknowledgements
  • Individuals
  • Institutions

NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers
in the US and worldwide. Baylor, Sanger, Wash U,
Whitehead, Stanford, JGI/DOE, Vancouver GSC, UW
and the international sequencing centers. UCSC,
NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR,
Jackson Labs, Affymetrix, SwissProt.
Chuck Sugnet, Angie Hinrichs, Fan Hsu, Terry
Furey, Heather Trumbower, Kate Rosenbloom, Hiram
Clawson, Brian Raney, Rachel Harte, Bob Kuhn,
Mathieu Blanchette, Donna Karolchik, David
Haussler, John Sulston, Richard Gibbs, Eric
Lander, Francis Collins, Roderic Guigo, Michael
Brent, Olivier Jaillon, David Kulp, Victor
Solovyev, Ewan Birney, Greg Schuler, Deanna
Church, Scott Schwartz, Ross Hardison, and
everyone else!
63
THE END