Title: Decoding ENCODE
1Decoding ENCODE
- Jim Kent
- University of California Santa Cruz
2ENCODE Timeline
- ENCyclopedia Of DNA Elements.
- Attempt to catalog as many functional elements in
the human genome as possible using current
technologies.
- Pilot project - finished 2007, covered 1% of
genome.
- Production project - ramping up now. Genome-wide.
Should have major amounts of data in 6 months.
3ENCODE Experiments
- Chromatin state
- DNA Hypersensitivity assays
- Chromatin Immunoprecipitation (ChIP)
- Histones in various methylation states
- Sequence-specific transcription factors
- DNA methylation
- Chromatin conformation capture (5C)
- Functional RNA discovery
- Nuclear and cytoplasmic, short and long
- RNA Immunoprecipitation
- Comparative Genomics
- Human curated gene annotation
4Role of UCSC
- Display data in context of what else is known on
the UCSC Genome Browser and in other tools.
- Facilitate analysis of the data with both
Web-based and command line tools.
5A Peek at the Pilot Project
6 ENCODE pilot data at genome.ucsc.edu
7Correlation at gene starts in enr221
8Transcription at enm221
9ENCODE Chromatin Immunoprecipitation
10Scientific Highlights of Pilot
- Transcription
- Lots of transcription outside of known genes.
- Outside of known genes transcribed areas not very
well conserved across species.
- Lots of rare splice variants, also poorly
conserved.
- DNA/Protein Interactions
- Good correlation between histone markers, gene
starts, and _active_ transcription.
- Lots of occupied transcription factor binding
sites not conserved, near promoters etc.
- Biological noise?
- Main controversy was whether to explain much of
the data as biological noise that was tolerated
but not necessary for function.
11From Pilot to Production Phase
12ENCODE Production Phase
- Moving from microarray based assays to assays
based on next-generation sequencing. (ChIP-chip
to ChIP-seq)
- Genome-wide rather than regional.
- Broader set of cell lines used more consistently
between labs.
- Broader set of antibodies.
- Some new technology development continues.
13ENCODE Cell Lines
- Tier 1 - used in ALL experiments
- GM12878 (lymphoblastoid cell line)
- K562 (chronic myeloid leukemia)
- Tier 2 - used in most experiments
- HepG2 (hepatocellular carcinoma)
- Hela-S3 (cervical carcinoma)
- HUVEC (umbilical vein endothelial cells)
- Keratinocyte (normal epidermal cells)
- Likely will do an embryonic stem cell too.
- Tier 3 - used in one or two experiments
- Many of these for assays such as DNAse
hypersensitivity, RNA measurements where we don't
have to do a separate experiment for each antibody.
14Simple Model of Eukaryotic Transcription
Regulation
- Initially chromatin opened to allow
transcription factors to access DNA.
- Multiple transcription factors bind to DNA in
combination.
- Most factors have such small DNA binding sites
that by themselves they are not specific, nor is
the binding even stable.
- The right combination of factors in open
chromatin leads to active transcription starting
at the initiation complex.
- With the ENCODE experiments we can directly test
most aspects of this model.
15Chromatin Experiments
- In general applied across a large number of cell
lines.
- DNAseI hypersensitivity
- Formaldehyde Assisted Isolation of Regulatory
Elements
- Methylation of CpG Islands
- ChIP-seq of relevant factors
- H3K4me1,2,3, H3K9me3, H4K20me3, H3K27me3, H3K36me3,
RPol-II, etc.
16Transcription Factor ChIP
- Many antibodies in modest number of cell lines.
- Limited by good antibodies, hope for 100 or more.
- Current good antibodies include
- E2F1, E2F4, E2F6, KAP1, L3MBTL2, STAT1, CtBP1,
CtBP2, SETDB1, ZNF180, ZNF239, ZNF263, ZNF266,
ZNF317, ZNF342
- Part of project pipeline for raising and testing
antibodies.
17RNA measurement
- RNA-seq of poly-A selected RNA to measure mRNA
levels in many cell lines.
- Sequencing of G-cap selected tags (CAGE)
- Sequencing 5' and 3' ends (paired end tags)
- Measurement of RNAs of several types in several
cell compartments of a few cell lines.
- Long/short, polyA/nonPolyA, associated with
proteins/not associated with proteins
- Nucleus, cytosol, polysomes, chromatin, nucleolus
18New Pilot Projects Starting to Sprout
19New Pilot Projects
- Immunoprecipitation of RNA binding proteins/RNA
sequencing.
- Mapping silencers and enhancers with transient
transfection assays
- Computational identification of active promoters
- Deep comparative sequencing in targeted regions
and conservation analysis.
- Chromatin Conformation Capture Carbon Copy (5C)
to capture long range regulatory elements and
their targets.
20ENCODE Timeline
- Grants funded for 4 years starting Sept 2007.
- First production data just now starting to roll
into UCSC, not quite ready for public display.
- Data should accumulate quickly over next few
years.
21Data Release Policy
- Once have reproducible data (where at least 2 of
3 replicates agree) should be released to public
within a month.
- Data is still considered pre-publication!
- Ok to publish a paper using data on a few genes.
- Please wait for consortium papers before papers
doing full genome analysis.
- Anyone can join ENCODE consortium analysis group
to help us write the papers.
- We just have 1 year after data release to write
papers, after that fair game to publish full
genome analysis.
- If in doubt please contact consortium via UCSC.
22Web Works for Mice and Men
23Mouse ES Cell Chromatin IP
- Brad Bernstein lab ChIP-seq based experiment on
methylated histones now on UCSC Genome Browser.
- Shows some of the user interfaces that will be
used for the ENCODE data
25List of mouse chromatin subtracks.
27Signal densities of entire mouse chromatin data
set.
28The unending quest for genes
29Gencode Project
- Project to define structure (exons and introns)
for all common splice variants of all genes.
- Human curators merge many lines of evidence
including
- Computational gene predictions
- RNA/DNA alignments
- Paired end tags
- Cross-species alignments
- Possibly chromatin state data
- PI is Tim Hubbard
- Much of the work done by Havana group
30Data Mining with Table Browser
31Table Browser
- Complete access to UCSC Database with results in
tab-delimited format
- Method for creating custom tracks by combining
and filtering existing tracks.
- Sample query - getting a table of Ensembl gene
coordinates and associated Superfamily
annotations.
37Results of selecting fields from related tables:
Ensembl Gene (ensGene) and Superfamily
Description (sfDescription).
38Table Browser Filters
- Getting list of Ensembl genes that have SH3
domains.
44Table Browser Intersection
- Getting list of Ensembl genes that don't
intersect UCSC Known Genes
48Custom Track Output
- Useful for visualizing results of queries in
genome browser
- The way to produce more complex queries.
- Here we look at how well genes that are Ensembl
but not UCSC are conserved across species.
54 681/3,329 (20%) of Ensembl genes not in UCSC Known
Genes are also not conserved. 1,728/33,666 (5%) of
Ensembl genes in general are not conserved.
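The percentages follow directly from the counts on the slide. A minimal sketch of the arithmetic, with helper names (percentOf, foldDifference) that are illustrative only and not part of any UCSC tool:

```c
#include <assert.h>

double percentOf(int part, int whole)
/* Return part/whole expressed as a percentage. */
{
assert(whole > 0);
return 100.0 * part / whole;
}

double foldDifference(int partA, int wholeA, int partB, int wholeB)
/* Return how many times larger the first percentage is than the second. */
{
return percentOf(partA, wholeA) / percentOf(partB, wholeB);
}
```

With the slide's counts, percentOf(681, 3329) is about 20.5 and percentOf(1728, 33666) is about 5.1, so the Ensembl-only genes are unconserved roughly four times as often as Ensembl genes overall.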
55UCSC Gene Sorter
- Swiss army knife for dealing with gene sets.
- Highlights relationships and connections between
genes.
- Powerful data mining tool.
56Cytochrome P450 - a gene family important in drug
metabolism. The family is related in many ways.
Sorted by protein homology
57Various sorting methods let you focus on
different types of relationships between genes.
58Sorting by gene distance is a quick way to browse
candidate genes in a region.
59Clicking on row or gene name selects that gene.
62Configuration page controls column order and
display options.
63Also you can upload your own columns here.
64Controlling expression display
65GNF Atlas 2 column in median of replicates
mode. Actual Column includes 79 tissues, slide
only fits first half.
66Sorting based on expression similarity to
selected gene.
68The filters page turns the Family Browser into a
powerful data mining tool.
70Candidate Pancreatic Islet Membrane Genes
GO-annotated membrane proteins that are expressed
at least 8X in pancreatic islet cells and no
more than 4X elsewhere outside of pancreas and
central nervous system. These might be good
candidates for targets of the autoimmune response
that can cause Type I diabetes.
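The expression part of this filter can be sketched in a few lines. This is only an illustration of the logic, not the Gene Sorter's code: the tissueExp record, its flags and the function name are hypothetical, and the GO membrane annotation test is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tissueExp
/* Hypothetical record of one gene's expression in one tissue. */
    {
    const char *tissue;   /* Tissue name. */
    double ratio;         /* Expression ratio, 8.0 means 8X. */
    bool isIslet;         /* True for pancreatic islet cells. */
    bool isExempt;        /* True for other pancreas and CNS tissues. */
    };

bool isIsletCandidate(const struct tissueExp *exp, size_t count,
                      double minIslet, double maxElsewhere)
/* Return true if the gene reaches minIslet in every islet sample and
 * stays at or below maxElsewhere in every tissue that is neither islet
 * nor exempt.  At least one islet sample must be present. */
{
bool isletSeen = false;
size_t i;
assert(exp != NULL || count == 0);
for (i = 0; i < count; ++i)
    {
    if (exp[i].isIslet)
        {
        if (exp[i].ratio < minIslet)
            return false;
        isletSeen = true;
        }
    else if (!exp[i].isExempt && exp[i].ratio > maxElsewhere)
        return false;
    }
return isletSeen;
}
```

Calling isIsletCandidate(exp, count, 8.0, 4.0) applies the thresholds quoted on the slide.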
71Direct Data Access
72FTP or HTTP Download
- Sequence
- Multiple genome alignments
- Wiggle track data.
- Database as tab-separated files
- Follow downloads link from http://genome.ucsc.edu
- Via ftp://hgdownload.cse.ucsc.edu
73Public MySQL Access
- Query mirror of our database directly
- Host: genome-mysql.cse.ucsc.edu
- User: genome
- No password needed
- Best to use table browser to find relevant tables
in many cases.
- Some tables are split by chromosomes
- chr1_est, chr2_est, etc.
- Some data (genome sequence, multiple alignments,
wiggles) are in files just referenced by SQL
tables.
- For some purposes easier to use via UCSC C
library code than via SQL.
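When querying a split table directly, the table name has to be assembled from the chromosome name. A minimal sketch, assuming the chrN_table pattern shown above; the function name splitTableName is hypothetical and not part of the UCSC library:

```c
#include <assert.h>
#include <stdio.h>

int splitTableName(char *buf, size_t bufSize, const char *chrom, const char *table)
/* Build the name of a per-chromosome table such as chr1_est in buf.
 * Returns the value reported by snprintf, which is >= bufSize when the
 * name did not fit and was truncated. */
{
assert(buf != NULL && chrom != NULL && table != NULL);
return snprintf(buf, bufSize, "%s_%s", chrom, table);
}
```

A caller would loop over chromosome names, build each table name this way, and run the same query against every one.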
74The Sordid Details of the UCSC Genome Informatics
Code Base
Download via http://genome.ucsc.edu/admin/cvs.html
Many modules require MySQL to be installed.
75Lagging Edge Software
- C language - compilers still available!
- CGI Scripts - portable if not pretty.
- SQL database - at least MySQL is free.
76Problems with C
- Missing booleans and strings.
- No real objects.
- Must free things
77Coping with Missing Data Types in C
- #define boolean int
- Fixing lack of real string type much harder
- lineFile/common modules and autoSql code
generator make parsing files relatively painless
- dyString module not a horrible string class
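To make the string problem concrete, here is a small growable string written from scratch. It is in the spirit of a dynamic string module but is not the dyString code; the growStr names are hypothetical, and allocation failure simply aborts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define boolean int

struct growStr
/* Hypothetical growable string. */
    {
    char *string;   /* Null-terminated contents. */
    size_t size;    /* Length of contents, not counting terminator. */
    size_t alloc;   /* Bytes allocated for string. */
    };

struct growStr *growStrNew(void)
/* Return a new empty string. */
{
struct growStr *gs = malloc(sizeof(*gs));
if (gs == NULL)
    abort();
gs->alloc = 16;
gs->size = 0;
gs->string = malloc(gs->alloc);
if (gs->string == NULL)
    abort();
gs->string[0] = '\0';
return gs;
}

void growStrAppend(struct growStr *gs, const char *text)
/* Append text, doubling the buffer until it fits. */
{
size_t len;
assert(gs != NULL && text != NULL);
len = strlen(text);
if (gs->size + len + 1 > gs->alloc)
    {
    size_t newAlloc = gs->alloc;
    char *newString;
    while (gs->size + len + 1 > newAlloc)
        newAlloc *= 2;
    newString = realloc(gs->string, newAlloc);
    if (newString == NULL)
        abort();
    gs->string = newString;
    gs->alloc = newAlloc;
    }
memcpy(gs->string + gs->size, text, len + 1);
gs->size += len;
}

void growStrFree(struct growStr **pGs)
/* Free string and set pointer to NULL. */
{
struct growStr *gs = *pGs;
if (gs != NULL)
    {
    free(gs->string);
    free(gs);
    *pGs = NULL;
    }
}
```

The caller never tracks buffer sizes, which removes most of the bookkeeping that makes raw char arrays error-prone.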
78Object Oriented Programming in C
- Build objects around structures.
- Make families of functions with names that start
with the structure name, and that take the
structure as the first argument.
- Implement polymorphism/virtual functions with
function pointers in structure.
- Inheritance is still difficult. Perhaps this is
not such a bad thing.
79
struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeqs. */
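One way these declarations could be implemented is sketched below. This is not the UCSC library source: it assumes that "white space and numbers" means every non-alphabetic character is dropped, that bases are stored lower-case, and that allocation failure aborts.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Keep the letters of string, lower-cased, and drop everything else. */
{
struct dnaSeq *seq;
char *s;
int size = 0;
assert(string != NULL);
seq = calloc(1, sizeof(*seq));
if (seq == NULL)
    abort();
seq->dna = malloc(strlen(string) + 1);
if (seq->dna == NULL)
    abort();
for (s = string; *s != '\0'; ++s)
    if (isalpha((unsigned char)*s))
        seq->dna[size++] = (char)tolower((unsigned char)*s);
seq->dna[size] = '\0';
seq->size = size;
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free list of dnaSeqs. */
{
struct dnaSeq *seq, *next;
for (seq = *pList; seq != NULL; seq = next)
    {
    next = seq->next;
    dnaSeqFree(&seq);
    }
*pList = NULL;
}
```

Passing a pointer to the pointer lets the free functions reset the caller's variable to NULL, which guards against use after free.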
80
struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;   /* Next in list. */
    char *name;               /* Object name. */
    int x,y,width,height;     /* Bounds of object. */
    void (*draw)(struct screenObj *obj);   /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                              /* Return true if x,y is in object */
    void *custom;             /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
                              /* Free custom data. */
    };

#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
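A concrete "subclass" shows how the function pointers give polymorphism. The box type below is hypothetical and exists only for illustration: its draw method just counts calls, and its custom data is a single int standing in for a color.

```c
#include <assert.h>
#include <stdlib.h>

#define boolean int

struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;   /* Next in list. */
    char *name;               /* Object name. */
    int x,y,width,height;     /* Bounds of object. */
    void (*draw)(struct screenObj *obj);
    boolean (*in)(struct screenObj *obj, int x, int y);
    void *custom;             /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
    };

#define screenObjDraw(obj) (obj->draw(obj))

int boxDrawCount = 0;   /* Number of times any box has been drawn. */

static void boxDraw(struct screenObj *obj)
/* Stand-in for real drawing code: just count the call. */
{
(void)obj;
++boxDrawCount;
}

static boolean boxIn(struct screenObj *obj, int x, int y)
/* Return true if x,y is inside the rectangle. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static void boxFreeCustom(struct screenObj *obj)
/* Free the custom color. */
{
free(obj->custom);
obj->custom = NULL;
}

struct screenObj *boxNew(int x, int y, int width, int height)
/* Make a rectangular screenObj with its methods filled in. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
int *color = malloc(sizeof(*color));
if (obj == NULL || color == NULL)
    abort();
*color = 0;
obj->x = x;
obj->y = y;
obj->width = width;
obj->height = height;
obj->draw = boxDraw;
obj->in = boxIn;
obj->custom = color;
obj->freeCustom = boxFreeCustom;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj->name);
    free(obj);
    *pObj = NULL;
    }
}
```

Code that walks a list of screenObj can call screenObjDraw on each element without knowing which constructor made it.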
81Relational Databases
- Relational databases consist of tables, indices,
and the Structured Query Language (SQL).
- Tables are much like tab-separated files

chrom  start     end       name   strand  score
chr22  14600000  14612345  ldlr   +       0.989
chr21  18283999  18298577  vldlr  -       0.998

- Fields are simple - no lists or substructures.
- Can join tables based on a shared field. This is
flexible, but only as fast as the index.
- Tables and joins are accessed a row at a time.
- The row is represented as an array of strings.
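The same "array of strings" view applies to a line of a tab-separated file. A minimal sketch of splitting one line into a row in place; the name splitTabs is hypothetical and this is not the library's own parser:

```c
#include <assert.h>
#include <stddef.h>

int splitTabs(char *line, char **row, int maxFields)
/* Split a tab-separated line in place.  Each tab is overwritten with
 * '\0' and row[i] is pointed at the start of field i.  Returns the
 * number of fields stored, at most maxFields.  Fields beyond maxFields
 * are dropped. */
{
int count = 0;
assert(line != NULL && row != NULL);
if (maxFields <= 0)
    return 0;
row[count++] = line;
for (; *line != '\0'; ++line)
    {
    if (*line == '\t')
        {
        *line = '\0';
        if (count >= maxFields)
            break;
        row[count++] = line + 1;
        }
    }
return count;
}
```

Because the split happens in place, the line must be writable memory (not a string literal), and the row pointers stay valid only as long as the line buffer does.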
82Converting A Row to Object
struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from
 * exoFish from database.  Dispose of this with
 * exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
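For readers without the library at hand, the loader can be made self-contained. In the sketch below AllocVar, cloneString and sqlUnsigned are simplified stand-ins written for this example, not the library's implementations, and the struct layout is inferred from the fields the loader fills in.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

struct exoFish
/* Layout inferred from the loader; treat as an assumption. */
    {
    struct exoFish *next;   /* Next in list. */
    char *chrom;
    unsigned chromStart;
    unsigned chromEnd;
    char *name;
    unsigned score;
    };

/* Stand-in: allocate zeroed memory for the thing pt points to. */
#define AllocVar(pt) ((pt) = calloc(1, sizeof(*(pt))))

static char *cloneString(const char *s)
/* Stand-in: return a heap copy of s. */
{
char *copy = malloc(strlen(s) + 1);
if (copy == NULL)
    abort();
strcpy(copy, s);
return copy;
}

static unsigned sqlUnsigned(const char *s)
/* Stand-in: parse a decimal unsigned number, abort on bad input. */
{
char *end;
unsigned long val;
if (*s < '0' || *s > '9')
    abort();
errno = 0;
val = strtoul(s, &end, 10);
if (*end != '\0' || errno != 0 || val > UINT_MAX)
    abort();
return (unsigned)val;
}

struct exoFish *exoFishLoad(char **row)
/* Convert a five-column row of strings into an exoFish. */
{
struct exoFish *ret;
assert(row != NULL);
AllocVar(ret);
if (ret == NULL)
    abort();
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}

void exoFishFree(struct exoFish **pEl)
/* Free an exoFish and set the pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

Every string field is cloned, so the object stays valid after the database row buffer is reused for the next row.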
83Motivation for AutoSql
- Row to object code is tedious at best.
- Also have save object, free object code to write.
- SQL create statement needs to match C structure.
- Lack of lists without doing a join can seriously
impact performance and complicate schema.
84AutoSql Data Declaration
table exoFish
"An evolutionarily conserved region (ecore) with Tetraodon"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "Ecore name in Genoscope database"
    uint score;        "Score from 0 to 1000"
    )

See autoSql.doc for more details.
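The save side of the same object is equally mechanical, which is why it is worth generating. Below is a hand-written sketch of the kind of code such a declaration corresponds to. It is not actual autoSql output: the struct layout and the name exoFishToLine are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

struct exoFish
/* An evolutionarily conserved region (ecore) with Tetraodon. */
    {
    struct exoFish *next;   /* Next in list. */
    char *chrom;            /* Human chromosome or FPC contig */
    unsigned chromStart;    /* Start position in chromosome */
    unsigned chromEnd;      /* End position in chromosome */
    char *name;             /* Ecore name in Genoscope database */
    unsigned score;         /* Score from 0 to 1000 */
    };

int exoFishToLine(const struct exoFish *el, char *buf, size_t bufSize)
/* Write el into buf as one tab-separated line, fields in declaration
 * order.  Returns the value reported by snprintf, which is >= bufSize
 * when the line did not fit. */
{
assert(el != NULL && buf != NULL);
return snprintf(buf, bufSize, "%s\t%u\t%u\t%s\t%u\n",
    el->chrom, el->chromStart, el->chromEnd, el->name, el->score);
}
```

The field order in the format string has to match the declaration exactly, which is the kind of coupling that is easy to break when the struct, the loader, the saver and the SQL are all maintained by hand.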
85Occasionally useful tools
86Unix Command Line
- BLAT - RNA/DNA and DNA/DNA alignment.
- featureBits - figure out number of bases covered
by a track or intersection of tracks, output
track intersections.
- htmlCheck - check html tables and other basic web
page stuff. Look at form variables.
- dbSnoop - summarize a MySQL database.
- autoSql - generate serialization C code for
relational databases/tab-separated files.
- autoXml - generate XML parsers
- xmlToSql/sqlToXml - convert between XML and
relational database representations
- parasol - manage jobs on computer cluster
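The idea behind counting bases covered by an intersection of tracks can be shown on a toy region. The sketch below is not featureBits itself: it assumes half-open, 0-based intervals, uses one byte per base rather than packed bits, and all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct range
/* Half-open interval [start,end) in 0-based coordinates. */
    {
    int start;
    int end;
    };

static void markCovered(unsigned char *covered, int regionSize,
                        const struct range *ranges, int count)
/* Set covered[pos] to 1 for every base inside any range, clipping
 * ranges to the region. */
{
int i, pos;
for (i = 0; i < count; ++i)
    {
    int start = ranges[i].start < 0 ? 0 : ranges[i].start;
    int end = ranges[i].end > regionSize ? regionSize : ranges[i].end;
    for (pos = start; pos < end; ++pos)
        covered[pos] = 1;
    }
}

int basesInBoth(int regionSize, const struct range *a, int aCount,
                const struct range *b, int bCount)
/* Return the number of bases covered by at least one range in a and
 * at least one range in b.  Overlapping ranges within a track are
 * counted once. */
{
unsigned char *inA, *inB;
int pos, total = 0;
assert(regionSize >= 0);
if (regionSize == 0)
    return 0;
inA = calloc((size_t)regionSize, 1);
inB = calloc((size_t)regionSize, 1);
if (inA == NULL || inB == NULL)
    abort();
markCovered(inA, regionSize, a, aCount);
markCovered(inB, regionSize, b, bCount);
for (pos = 0; pos < regionSize; ++pos)
    if (inA[pos] && inB[pos])
        ++total;
free(inA);
free(inB);
return total;
}
```

Marking a per-base map makes overlapping features within one track harmless, at the cost of memory proportional to the region size.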
87C Library Modules
- hdb - access UCSC genome database
- jksql - access SQL databases
- htmlPage - parse web pages, submit forms
- readers/writers for maf, psl, chain, net, bed,
2bit and other formats used at UCSC
- rangeTree and binRange - fast interval
intersection tools
- Hashes, lists, trees, etc.
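Fast interval intersection is often done by assigning each feature to a hierarchical bin, so that a query only has to look at a few bins. The sketch below follows the five-level scheme commonly described for the UCSC database (finest bins of 128k bases, each level 8 times larger). The constants are given on that understanding and should be checked against the binRange module before being relied on.

```c
#include <assert.h>

/* Offset of the first bin at each level, finest level first. */
static const int binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0};
#define BIN_LEVELS 5
#define BIN_FIRST_SHIFT 17   /* Finest bins cover 2^17 bases. */
#define BIN_NEXT_SHIFT 3     /* Each level up is 8 times larger. */

int binFromRange(int start, int end)
/* Return the smallest bin that fully contains the half-open range
 * [start,end), or -1 if the range does not fit in the top-level bin. */
{
int startBin = start >> BIN_FIRST_SHIFT;
int endBin = (end - 1) >> BIN_FIRST_SHIFT;
int i;
assert(start >= 0 && end > start);
for (i = 0; i < BIN_LEVELS; ++i)
    {
    if (startBin == endBin)
        return binOffsets[i] + startBin;
    startBin >>= BIN_NEXT_SHIFT;
    endBin >>= BIN_NEXT_SHIFT;
    }
return -1;
}

int rangesOverlap(int startA, int endA, int startB, int endB)
/* Return 1 if half-open ranges [startA,endA) and [startB,endB) share
 * at least one base. */
{
return startA < endB && startB < endA;
}
```

A small feature lands in a fine bin and a feature that straddles a bin boundary moves up a level, so an indexed bin column lets a database skip most rows before the exact rangesOverlap test is applied.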