Title: Formats and standards for sequencing data
1. Formats and standards for sequencing data
Matúš Kalaš, INF389, CBU, BCCS/UiB, Bergen, Nov 12, 2010
2. SHRiMP, Maq, BWA, Bowtie, RMAP, Eland, SOAP, SOAP2, MOSAIK, SOCS, PatMaN, ZOOM, PASS, PerM, RazerS, segemehl, MPSCAN, BFAST, Lastz, BLAT
454, Solexa/Illumina, SOLiD
Genome, Metagenome, Gene annotation, Gene expression, Binding sites, Variation
Celera, Newbler, Velvet, Euler, SOAPdenovo
GenBank, EMBL, DDBJ, Genome Catalogue, SNPdb
NCBI SRA, EMBL-EBI ENA, Your databases
3. 454 output formats
.sff, .fna, .qual
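The .fna file holds the bases in FASTA layout and the .qual file holds the matching quality scores as space-separated integers under the same headers. A minimal sketch of pairing the two into standard (Sanger) FASTQ follows. The function names `read_fasta_like` and `fna_qual_to_fastq` are hypothetical, and the sketch assumes both files list the same reads in the same order.

```python
def read_fasta_like(lines):
    """Yield (name, payload) pairs from FASTA-style lines (.fna or .qual)."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, " ".join(chunks)
            # Keep only the read identifier, drop the rest of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, " ".join(chunks)


def fna_qual_to_fastq(fna_lines, qual_lines):
    """Combine matching .fna and .qual records into Sanger FASTQ text.

    Assumption: both inputs contain the same reads in the same order.
    """
    records = []
    pairs = zip(read_fasta_like(fna_lines), read_fasta_like(qual_lines))
    for (name, seq), (qual_name, quals) in pairs:
        seq = seq.replace(" ", "")            # re-join wrapped sequence lines
        scores = [int(q) for q in quals.split()]
        if name != qual_name or len(seq) != len(scores):
            raise ValueError(f"mismatched records: {name} / {qual_name}")
        encoded = "".join(chr(q + 33) for q in scores)  # Phred + 33
        records.append(f"@{name}\n{seq}\n+\n{encoded}\n")
    return "".join(records)
```

The binary .sff file additionally carries the flowgram signal, which this sketch does not touch.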
4. Illumina output formats
.seq.txt, .prb.txt
Illumina FASTQ (ASCII code minus 64 is the Illumina score)
Qseq (ASCII code minus 64 is the Phred score)
Illumina single-line format, SCARF
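Since the Illumina encodings use an ASCII offset of 64 while standard FASTQ uses 33, quality strings have to be re-encoded before mixing data. A minimal sketch, assuming the input scores are already Phred scores (as in Qseq); older Solexa-style scores would additionally need a score transformation that is not shown here. The function name `phred64_to_phred33` is hypothetical.

```python
def phred64_to_phred33(quality):
    """Re-encode a quality string from offset 64 to offset 33.

    Assumption: the scores are Phred scores, so none is below zero.
    """
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            # A character below '@' cannot be a Phred+64 score; the input
            # may be Solexa-scaled or already in the offset-33 encoding.
            raise ValueError(f"unexpected quality character: {char!r}")
        converted.append(chr(score + 33))
    return "".join(converted)
```

For example, `phred64_to_phred33("hhhB")` returns `"III#"`: `h` is score 40 and `B` is score 2 under offset 64.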
5. SOLiD output format(s)
CSFASTA
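A CSFASTA read is a known leading base followed by a string of colors (0 to 3), where each color encodes the transition between two adjacent bases. As a sketch of how the encoding works, the colors can be translated to bases by XOR-ing 2-bit base codes (A=0, C=1, G=2, T=3). The function name `decode_colorspace` is hypothetical. This naive decoding is for illustration only: a single wrong color corrupts every base after it, which is why color-space reads are normally aligned in color space.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Translate a color-space read such as 'T0123' into bases.

    The leading base comes from the sequencing primer and is not part of
    the read, so it is used for decoding but left out of the result.
    """
    current = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            # '.' marks a missing color call; it cannot be decoded.
            raise ValueError(f"cannot decode color {color!r}")
        current ^= int(color)  # color = code(previous) XOR code(next)
        decoded.append(BASES[current])
    return "".join(decoded)
```

For example, `decode_colorspace("T0123")` returns `"TGAT"`.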
6. Real (standard) FASTQ
Sanger FASTQ (ASCII code minus 33 is the Phred score)
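To make the encoding concrete, here is a minimal sketch of reading Sanger FASTQ and recovering the Phred scores by subtracting 33 from each ASCII code. The names `parse_fastq` and `error_probability` are hypothetical, and the parser assumes plain four-line records without wrapped sequence or quality lines.

```python
def parse_fastq(text):
    """Yield (name, sequence, phred_scores) from four-line FASTQ records.

    Assumption: sequence and quality are not wrapped over several lines.
    """
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("truncated FASTQ: expected 4 lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record {header[1:]}")
        yield header[1:], seq, [ord(char) - 33 for char in qual]


def error_probability(phred):
    """Convert a Phred score Q to the error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)
```

A Phred score of 20 thus corresponds to an estimated base-call error probability of 1 in 100.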
7. Example of dealing with diverse read formats in Galaxy (http://usegalaxy.org)
8. If reads should be deposited in a public repository
SRA (Short Read Archive) at NCBI
ENA at EMBL-EBI
SRA format (XML)
SRF format
Or should they be deleted?
9. Common (standard) format for read alignments
SAM
BAM (binary SAM)
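SAM is tab-separated text with eleven mandatory fields per alignment, followed by optional tags. A minimal sketch of reading one alignment line and testing two of the flag bits is shown below. The `SamRecord` type and the helper names are hypothetical; the field names follow the current SAM specification. In practice an established library is preferable, and BAM needs one because it is compressed binary.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities, Phred + 33
    tags: Tuple[str, ...]  # optional fields, left unparsed


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


def is_unmapped(record):
    return bool(record.flag & 0x4)    # flag bit 0x4: read unmapped


def is_reverse_strand(record):
    return bool(record.flag & 0x10)   # flag bit 0x10: reverse-complemented
```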
10. Some common formats for results (genome/gene annotation)
BED format (genome-browser tracks)
GFF format (gene/genome features)
BioXSD (XML) (any annotation, under development)
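BED and GFF describe similar features but count coordinates differently: GFF intervals are 1-based and inclusive, BED intervals are 0-based and half-open. A minimal sketch of converting one GFF feature line to a six-column BED line follows. The function name `gff_to_bed` is hypothetical, and using the feature type as the BED name and a constant 0 as the BED score are simplifying assumptions of this sketch.

```python
def gff_to_bed(line):
    """Convert one 9-column GFF feature line to a 6-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF feature line has 9 tab-separated columns")
    seqid, _source, feature_type, start, end, _score, strand, _phase, _attrs = fields
    bed_start = int(start) - 1   # 1-based inclusive -> 0-based
    bed_end = int(end)           # inclusive end equals half-open end
    # Assumption: the GFF score is dropped; BED expects an integer score.
    return "\t".join(
        [seqid, str(bed_start), str(bed_end), feature_type, "0", strand]
    )
```

A feature spanning bases 101 to 200 in GFF becomes the interval 100 to 200 in BED, and its length (100) is simply end minus start.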
11. Deposit genome/metagenome in a public repository
INSDC databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata
MIGS/MIMS standard by GSC
GCDML format (XML) (under development), following the MIGS/MIMS standard
12. MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample
13. MIGS/MIMS checklist
14, 15. MIGS/MIMS metadata example
16. Sequencing experiment metadata
MINSEQE standard by FGED: Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (under development)
17. Take-home messages
- Use raw sequencing data when possible
- For base-call data, use standard FASTQ (Sanger, Phred)
- For read alignments, use SAM/BAM format
- Use common formats for your results (e.g. GFF or BED format)
- Hope for new, generic, extensible standard format(s)
- Submit MIGS/MIMS-compliant metadata of genome sequences
- Keep an eye on the MINSEQE standard, store your sequencing metadata