Title: Spaghetti Code, Soupy Logic
1Spaghetti Code, Soupy Logic
Steaming fresh modules in sourceforge.net
Combinatorial assembly of transcription factors
in cell.
Jim Kent - University of California Santa Cruz
2A Challenge Every Speaker Faces
- Who is the audience?
- Bioinformaticians
- Biologists with bigger, better databases?
- Geeks trading bits for bases?
- Leading edge interdisciplinary super scientists?
7Top 5 Reasons Biologists Go Into Bioinformatics
- 5 - Microscopes and biochemistry are so 20th
century.
- 4 - Got started purifying proteins, but it turns
out the cold room is really COLD.
- 3 - After 23 years of school wanted to make MORE
than $23,000/year as a postdoc.
- 2 - Like to swear, @ttracted to $_ Perl !!
- 1 - Getting carpal tunnel from pipetting
12Top 5 Reasons Computer People go into
Bioinformatics
- 5 - Bio courses actually have some females.
- 4 - Human genome more stable than Windows XP
- 3 - Having mastered binary trees, quad trees, and
parse trees, ready for phylogenic trees.
- 2 - Missing heady froth of the internet bubble.
- 1 - Must augment humanity to defeat evil
artificially intelligent robots.
13The Paradox of Genomics
How does a long, static, one dimensional string
of DNA turn into the remarkably complex, dynamic,
and three dimensional human body?
GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATG
TAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATA
GCCAGCCAGCCACCACAGGCATGAGT
14The Analogy of the Code of Life
- DNA is popularly considered the code of life.
- Computer programs are complex systems that
ultimately are built up of 0s and 1s; perhaps
they are a model for a genome built of A, C, G and
T?
- BUT.
- Human genome lacks documentation, has accumulated
3 billion years of cruft, and does not believe in
local variables.
- Therefore we must look to less than
straightforward software programs as guides.
15Bioperl CORBA module
    sub new {
        my ($class, @args) = @_;
        my $self = $class->SUPER::new(@args);
        my ($idl, $ior, $orbname) =
            $self->_rearrange([qw(IDL IOR ORBNAME)], @args);
        $self->{'_ior'}     = $ior || 'biocorba.ior';
        $self->{'_idl'}     = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
        $self->{'_orbname'} = $orbname || 'orbit-local-orb';
        $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
        my $orb      = CORBA::ORB_init($orbname);
        my $root_poa = $orb->resolve_initial_references("RootPOA");
        $self->{'_orb'}     = $orb;
        $self->{'_rootpoa'} = $root_poa;
        return $self;
    }
16Obfuscated C
define c(n,s)case nscontinue char
x"((((((((((((((((((((((",w "\b\b\b\b\b\b\b\
b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"char
r92,124,47,l2,3,1 ,0charT" ","
","\\/"," ",""char d1,p40,o40,k0,a,y
,z,g -1,G,X,PT4,f0unsigned int s0void
u(int i)int nprintf( "\233uH\233Lc\233uHc\
233uHs\23322uH_at_\23323uH \n",x-w,rd,x
w ,rd,X,P,pk,o)if(abs(p-x21)gtw21)exit(0
)if(g!G)struct itimerval t 0,0,0,0g((gltG)
ltlt1)-1t.it_interval.tv_usect.it_value.tv_usec72
000/((ggtgt 3)1)setitimer(0,t,0)fprintf("\e10
u",g24)fputchar(7)s(9-w21 )((ggtgt3)1
)opm(x)m(w)(nrand())255--wwif(!(
PPn7936)) while(abs((Xrand()76)-x2)-w
lt6)XPT(nrand()31)lt3(dn)!d--xlt w
(x,d)d2xwgt79(--x,--d)signal(i
,u)void e()signal(14, SIG_IGN)printf("\e0q\ec
Score u\n",s)system("stty echo -cbreak")int
main (int C,charV)atexit(e)(Clt2V1!113)
(f(C(int)getenv("TERM"))( int)0x756E696CC
(int)0x6C696E75)srand(getpid())system("stty
-echo cbreak" )h(0)u(14)for()switch(getchar()
)case 113return 0case 91case
98c(44,k -1)case 32case 110c(46,k0)case
93case 109c(47,k1)c(49,h(0))c(50,h(1 ))c(51,
h(2))c(52,h(3))
17Reverse Engineering Microsoft
[Diagram labels: mouse, keyboard, network; elaborate
proprietary process; Windows XP; blue screen of death]
18Looks like code not enough, must study actual
cells DNA
19How DNA is Used by the Cell
20Promoter Tells Where to Begin
Different promoters activate different genes
in different parts of the body.
21A Computer in Soup
Idealized promoter for a gene involved in making
hair. Proteins that bind to specific DNA
sequences in the promoter region together turn a
gene on or off. These proteins are themselves
regulated by their own promoters leading to a
gene regulatory network with many of the same
properties as a neural network.
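The neural-network comparison can be made concrete with a toy model. The sketch below treats each gene as a threshold unit: it switches on when the summed weights of the transcription factors currently present exceed a threshold. The gene names, weights, and threshold are invented for illustration and do not describe any real promoter.

```python
# Hypothetical three-gene network. Gene names, weights, and the threshold
# are illustrative assumptions, not measured biology.
WEIGHTS = {
    "hair_gene":   {"activator_a": 1.0, "repressor_b": -2.0},
    "activator_a": {"activator_a": 1.0},                      # sustains itself
    "repressor_b": {"activator_a": 1.0, "repressor_b": -1.0},
}
THRESHOLD = 0.5


def step(state):
    """Return the next on/off state of every gene given the current state."""
    next_state = {}
    for gene, inputs in WEIGHTS.items():
        # Sum the weights of the factors that are currently switched on.
        total = sum(weight for factor, weight in inputs.items() if state[factor])
        next_state[gene] = total > THRESHOLD
    return next_state


def simulate(state, n_steps):
    """Return the list of states visited, starting with the initial state."""
    history = [state]
    for _ in range(n_steps):
        state = step(state)
        history.append(state)
    return history


if __name__ == "__main__":
    start = {"hair_gene": False, "activator_a": True, "repressor_b": False}
    for t, s in enumerate(simulate(start, 4)):
        print(t, s)
```

With this starting state the activator turns the hair gene and the repressor on, the repressor then shuts the hair gene (and itself) off, and the cycle repeats: a static wiring diagram producing dynamic behavior.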
22Genes can be transcription factors that
activate or repress other genes, leading to
regulatory networks such as this one from the
development of the central nervous system. (Image
from D'haeseleer & Somogyi, 1999)
23The Decisions of a Cell
- When to reproduce?
- When to migrate and where?
- What to differentiate into?
- When to secrete something?
- When to make an electrical signal?
The more rapid decisions are usually made via the
cell membrane and second messengers. The
longer-acting decisions are usually made in the
nucleus.
24Nucleus Used to Appear Simple
- Cheek cells stained with basic dyes. Nuclei are
readily visible.
25Mammalian Nuclei Stained in Various Ways
Image from Tom Misteli lab
26Artist's rendition of nucleus
Image from Nuclear Protein Database
27Chromatin
28Turning on a gene
- Getting DNA into the right compartment of the
nucleus (may involve very diffuse signals in DNA
over very long distances)
- Loosening up chromatin structure (this involves
activators and repressors which can act over
relatively long distances)
- Attracting RNA Polymerase II to the transcription
start site (this involves relatively close
factors both upstream and downstream of
transcription start).
29Methods for Studying Transcription
- Genetics in model organisms
- Promoters hooked to reporter genes
- Gel shifts and DNase footprinting.
- Phylogenic footprinting
- Motif searches in clusters of coregulated genes.
30Drosophila Genetics
antennapedia mutant
normal
31Reporter Gene Constructs
promoter to study
easily seen gene
Drosophila embryo transfected with ftz promoter
hooked up to lacZ reporter gene, creating stripes
where ftz promoter is active.
32Biochemical Footprinting Assays
Gel showing selective protection of DNA from
nuclease digestion where transcription factor is
bound.
Txn factor footprint
33Comparative Genomics
Webb Miller
34Comparative Genomics at BMP10
35Conservation of Gene Features
- Conservation pattern across 3165 mappings of
human RefSeq mRNAs to the genome. A program
sampled 200 evenly spaced bases across 500 bases
upstream of transcription, the 5' UTR, the first
coding exon, introns, middle coding exons,
introns, the 3' UTR, and 500 bases after
polyadenylation. There are peaks of conservation
at the transition from one region to another.
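One reading of the sampling step is that every region, whatever its length, is reduced to the same number of evenly spaced positions so that thousands of genes can be averaged column by column. The sketch below shows that idea only; the function names and the nearest-index sampling rule are assumptions, not the program used for the slide.

```python
def sample_evenly(values, n_samples=200):
    """Pick n_samples evenly spaced items from values (repeats if values is short)."""
    if not values:
        raise ValueError("cannot sample an empty region")
    length = len(values)
    return [values[(i * length) // n_samples] for i in range(n_samples)]


def average_profiles(regions, n_samples=200):
    """Average per-base conservation across regions of differing length."""
    sampled = [sample_evenly(region, n_samples) for region in regions]
    return [sum(column) / len(sampled) for column in zip(*sampled)]


if __name__ == "__main__":
    # Two toy "regions" with 1 = conserved base, 0 = not conserved.
    print(average_profiles([[1, 0], [0, 1]], n_samples=4))
```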
36Detail Near Translation Start
Note the relatively conserved base 3 before
translation start (constrained to be a G or an A
by the Kozak consensus sequence), and the first
three translated bases (ATG).
37Normalized eScores
38Conservation Levels of Regulatory Regions in
Human/Mouse Alignments
39Conservation in Multiple Alignments
- As you add more species the phylogenic footprint
gets sharper.
- Currently genome.ucsc.edu shows multiple
alignments between 8 species using Webb Miller's
multiz program on chained pairwise alignments.
- The phylogenic tree has to be considered when
calculating conservation levels.
40Simple human/rodent tree
[Tree diagram: human; rodent ancestor splitting
into mouse and rat]
- Mutations that occur in rodent ancestor must be
counted only once
- Ideally should take into consideration varying
mutation rates across species.
- Conservation track at genome.ucsc.edu is based on
Adam Siepel's PhyloHMM
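To see why the tree matters, compare a naive count of differences from human with a tree-aware count for one alignment column. The sketch below uses Fitch parsimony, a much simpler technique than the PhyloHMM behind the conservation track, and ignores varying mutation rates; it only illustrates the "count once" point.

```python
# Tree from the slide: mouse and rat share a rodent ancestor.
TREE = (("mouse", "rat"), "human")


def fitch(node, column):
    """Return (candidate ancestral bases, minimum substitutions) for a subtree.

    node is either a species name (leaf) or a (left, right) tuple.
    column maps species name to the base observed at this alignment position.
    """
    if isinstance(node, str):
        return {column[node]}, 0
    left_bases, left_cost = fitch(node[0], column)
    right_bases, right_cost = fitch(node[1], column)
    shared = left_bases & right_bases
    if shared:
        return shared, left_cost + right_cost
    # The two children disagree: one substitution is needed on this branch pair.
    return left_bases | right_bases, left_cost + right_cost + 1


def tree_substitutions(column):
    """Minimum number of substitutions explaining the column on TREE."""
    return fitch(TREE, column)[1]


def naive_differences(column):
    """Count species that differ from human, ignoring the tree."""
    return sum(column[s] != column["human"] for s in ("mouse", "rat"))


if __name__ == "__main__":
    column = {"human": "A", "mouse": "G", "rat": "G"}
    print(naive_differences(column), tree_substitutions(column))
```

For a column where mouse and rat share a G and human has an A, the naive count reports two differences while the tree-aware count reports one substitution.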
41PhyloHMM on Drosophila
- Drosophila proteasome alpha 7-1. In many genes
like this one, the phylogenic footprint suggests
the promoter is actually downstream of the
transcription start site.
42Genome Evolution
- Duplication, deletion, and rearrangement are as
important to genome evolution as base-level
mutations.
- Much of this is driven by transposons
- Transposon relics are 50% of genome
- Reverse transcriptase activity from transposons
encourages processed pseudogene formation as
well.
- Transposons seed out-of-place recombination
leading to tandem and segmental duplications,
non-processed pseudogenes.
- Only 5% of human genome seems functional.
- This messiness provides opportunities for the
development of new genes, but makes understanding
the genome a challenge.
43Pseudogene Data from Robert Baertsch, UCSC Grad
Student
44Mouse/Human Rearrangement Statistics
Number of rearrangements of given type per
megabase excluding known transposons.
45Chaining Alignments
- Chaining bridges the gulf between syntenic blocks
and base-by-base alignments.
- Local alignments tend to break at transposon
insertions, inversions, duplications, etc.
- Global alignments tend to force non-homologous
bases to align.
- Chaining is a rigorous way of joining together
local alignments into larger structures.
46Chains join together related local alignments
Protease Regulatory Subunit 3
47Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human
alignment correlated with sizes of transposon
relics. Affine gap scores model red/blue plots as
straight lines.
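A small numeric comparison shows how quickly an affine penalty grows for transposon-sized gaps. Both functions and all parameter values below are made up for illustration; the concave (logarithmic) form is one possible alternative shape, not the scoring scheme used for the mouse/human alignments.

```python
import math


def affine_gap_cost(length, gap_open=400, gap_extend=30):
    """Affine penalty: a fixed opening cost plus a constant cost per gap base."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


def concave_gap_cost(length, gap_open=400, scale=300):
    """Hypothetical concave penalty: each extra base costs less than the one before."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + scale * math.log(length)


if __name__ == "__main__":
    for length in (1, 10, 300, 6000):
        print(length, affine_gap_cost(length), round(concave_gap_cost(length)))
```

Under the affine model a 6000-base gap costs about 20 times more than a 300-base gap, so a single long insertion can outweigh the score of the alignments on either side of it.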
48Before and After Chaining
49Chaining Algorithm
- Input - blocks of gapless alignments from blastz
- Dynamic program based on the recurrence
relationship
score(Bi) = max_{j<i} (score(Bj) + match(Bi) - gap(Bi, Bj))
- Uses Miller's KD-tree algorithm to minimize which
parts of dynamic programming graph to traverse.
Timing is O(N log N), where N is number of blocks
(which is in hundreds of thousands)
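The recurrence can be sketched directly as a quadratic-time dynamic program. This is not the KD-tree version described on the slide: it checks every earlier block as a predecessor, so it is O(N^2) and only suitable for small inputs. The Block fields and the gap_cost function are assumptions chosen for the example.

```python
from collections import namedtuple

# A gapless alignment block: start in target, start in query, length, match score.
Block = namedtuple("Block", "t_start q_start size score")


def can_follow(prev, cur):
    """True if cur starts after prev ends in both sequences."""
    return (prev.t_start + prev.size <= cur.t_start
            and prev.q_start + prev.size <= cur.q_start)


def gap_cost(prev, cur):
    """Hypothetical gap penalty: the larger of the two gap lengths."""
    dt = cur.t_start - (prev.t_start + prev.size)
    dq = cur.q_start - (prev.q_start + prev.size)
    return max(dt, dq)


def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain of blocks."""
    if not blocks:
        return 0, []
    # Sorting by target start guarantees every valid predecessor has a smaller index.
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in blocks]      # a chain may start at any block
    back = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            if can_follow(prev, cur):
                candidate = best[j] + cur.score - gap_cost(prev, cur)
                if candidate > best[i]:
                    best[i], back[i] = candidate, j
    # Trace back from the best-scoring end block.
    i = max(range(len(blocks)), key=lambda k: best[k])
    score, chain = best[i], []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    chain.reverse()
    return score, chain


if __name__ == "__main__":
    demo = [Block(0, 0, 10, 100), Block(20, 22, 10, 100),
            Block(15, 500, 10, 50), Block(40, 40, 10, 100)]
    print(best_chain(demo))
```

In the demo the block at query position 500 is left out of the chain because the gap needed to reach it costs more than it contributes.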
50Netting Alignments
- Commonly multiple mouse alignments can be found
for a particular human region, particularly for
coding regions.
- Net finds best mouse match for each human
region.
- Highest scoring chains are used first.
- Lower scoring chains fill in gaps within chains
inducing a natural hierarchy.
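The greedy idea behind netting can be sketched as follows: take chains from highest to lowest score and let each one claim only the human bases not already claimed. This simplified version produces a flat assignment and does not track nesting levels; the input format and chain names are assumptions for the example.

```python
def subtract(interval, covered):
    """Return the parts of (start, end) not overlapped by any covered interval.

    covered must hold non-overlapping (start, end) intervals.
    """
    start, end = interval
    pieces = []
    for c_start, c_end in sorted(covered):
        if c_end <= start or c_start >= end:
            continue                      # no overlap with what is left
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def net(chains):
    """Assign each human base to the best-scoring chain that covers it.

    chains is a list of (score, name, blocks), where blocks is a list of
    non-overlapping (start, end) human intervals aligned by that chain.
    Returns a sorted list of (start, end, name).
    """
    covered = []
    assigned = []
    for score, name, blocks in sorted(chains, key=lambda c: -c[0]):
        claimed = []
        for block in blocks:
            claimed.extend(subtract(block, covered))
        covered.extend(claimed)
        assigned.extend((s, e, name) for s, e in claimed)
    return sorted(assigned)


if __name__ == "__main__":
    chains = [
        (1000, "ortholog", [(0, 40), (60, 100)]),
        (300, "pseudogene", [(30, 70)]),
        (100, "weak", [(0, 100)]),
    ]
    print(net(chains))
```

Here the high-scoring chain takes both of its blocks, the second chain fills only the gap between them, and the weakest chain gets nothing.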
51Net Focuses on Ortholog
52Net highlights rearrangements
A large gap in the top level of the net is filled
by an inversion containing two genes. Numerous
smaller gaps are filled in by local duplications
and processed pseudo-genes.
53Useful in finding pseudogenes
Ensembl and Fgenesh automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein
must be interesting!
54Other tools to cybernetically enhance your mind
at genome.ucsc.edu
55UCSC Gene Sorter
- Swiss army knife for dealing with gene sets.
- Presents functional data on genes including
microarray expression information.
- Highlights relationships and connections between
genes.
- Powerful data mining tool.
56UCSC Gene Sorter
Expression and other information on genes in a
big sorted, linked table
57A Big Bioinformatics Web Site
- genome.ucsc.edu gets > 100,000 hits by > 5,000
scientists each day.
- Involves 570,000 lines of C code, bits of awk,
perl, bash, tcsh, java, R and tcl.
- 1200 CPUs and 12 terabytes of disk
- 12 full-time staff, 18 part-time, grad student
and post-doc.
58Site Architecture
- 8 web servers running Apache and MySQL
- CGIs written in C access genome data and user
interface settings in MySQL.
- Genome database is bottleneck, and is replicated
on each server.
- Cluster of 1000 CPUs, and smaller clusters of
faster CPUs create annotation files which are
loaded into database.
59Site Sociology
- 1/3 of group telecommutes.
- Thursdays are devoted to reading and testing each
other's code and, if necessary, a one or two hour
meeting.
- We develop very incrementally, and do a new
release once a week.
- 1/4 of group is dedicated to quality assurance;
I'm wanting to increase this to 1/3.
- User support is shared by everyone.
60Parasol and Kilo Cluster
- UCSC cluster has 1000 CPUs running Linux
- 1,000,000 BLASTZ jobs in 25 hours for mouse/human
alignment
- We wrote Parasol job scheduler to keep up.
- Very fast and free.
- Jobs are organized into batches.
- Error checking at job and at batch level.
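The figures above imply a demanding scheduling rate. The arithmetic below assumes all 1000 CPUs were busy for the full 25 hours, which the slide does not state; the function name is made up for the example.

```python
def throughput(n_jobs, hours, n_cpus):
    """Return (jobs dispatched per second, average CPU-seconds per job)."""
    seconds = hours * 3600
    jobs_per_second = n_jobs / seconds
    cpu_seconds_per_job = seconds * n_cpus / n_jobs
    return jobs_per_second, cpu_seconds_per_job


if __name__ == "__main__":
    rate, per_job = throughput(n_jobs=1_000_000, hours=25, n_cpus=1000)
    print(f"{rate:.1f} jobs/second, {per_job:.0f} CPU-seconds per job")
```

Under that assumption the scheduler has to start roughly 11 jobs every second, each running for about a minute and a half.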
61Conclusions
- Spaghetti code is not so helpful in
understanding the genome.
- Human genome suggests that trial and error
development is likely to yield a robust version
of Windows within 3 billion years.
- Understanding the flow of control in the genome
is a problem that fascinates biologists and
computer scientists alike.
62Further Acknowledgements
NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers
in the US and worldwide. Baylor, Sanger, Wash U,
Whitehead, Stanford, JGI/ DOE, Vancouver GSC, UW
and the international sequencing centers. UCSC,
NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR,
Jackson Labs, Affymetrix, SwissProt.
Chuck Sugnet, Angie Hinrichs, Fan Hsu, Terry
Furey, Heather Trumbower, Kate Rosenbloom, Hiram
Clawson, Brian Raney, Rachel Harte, Bob Kuhn,
Mathieu Blanchette, Donna Karolchik, David
Haussler, John Sulston, Richard Gibbs, Eric
Lander, Francis Collins, Roderic Guigo, Michael
Brent, Olivier Jaillon, David Kulp, Victor
Solovyev, Ewan Birney, Greg Schuler, Deanna
Church, Scott Schwartz, Ross Hardison, and
everyone else!
63THE END