1 A platform for pattern discovery in sets of
biological sequences
C. Alland, J. Nicolas
2 Framework: bioinformatics platform of Genopole Ouest
http://www.sb-roscoff.fr/BioInfo-GPO/
Sequencing Genotyping
O. Collin H. Leroy
Bioinformatics
Functional exploration
Biochips
- Coordination
- Data Bases
- Bioinformatics Software
- High Performance Computing
- Teaching
PCIO: SunFire 6800, 56 UltraSparc III, 56 GB RAM
Proteomics
3 Welcome page of the bioinformatics platform service
http://idefix.univ-rennes1.fr:8080/Serveur-GPO/
4 Software page of the bioinformatics platform service
http://idefix.univ-rennes1.fr:8080/Serveur-GPO/services.php
5 Aims of the project
Set of biological sequences
Common characteristic or discriminant pattern
- Annotation of genomes: discovery of new genes/proteins
- Characterization of functional families
- Experimental comparison of methods: choice of complexities and representations of patterns, copy/implementation of several algorithms
- Practical tool: parameter tuning, filtering
6 Architecture of the platform
Visualization of results
Pattern Discovery Algorithms
Interface
Supervisor
Statistical Analysis of inter-motif regions
Refinement
Search of patterns
Practical Use
7 Welcome page of the pattern discovery service
Methods for inferring regular languages
Jonassen
Marsan
Pevzner
8 Brazma hierarchy for (generalized) regular patterns
- J: full regular languages (finite automata)
9 Example: discovery of candidates in the defensin family
Collaboration with GERM (C. Pineau, F. Bourgeon), a group directed by B. Jégou, staffed with 40 people and specialized in research on male reproduction in mammals.
- Defensins are a major family of antimicrobial peptides found in mammals: cationic peptides 28-42 amino acids long containing 3 intramolecular disulfide bonds.
- Starting point: a set of 30 sequences (all organisms included), 4 of them human.
- Aim: discovery of new candidates
10 Pratt: principle of the algorithm
- The algorithm starts from a pattern graph containing all the most specific allowed patterns covering at least k of the n sequences in the training set.
- A pattern search tree is explored starting from the most general pattern (the empty pattern), which is specialized by adding allowed components (belonging to the pattern graph and generalization operators) for as long as the resulting patterns obtain a better score. Several scores and search strategies are available.
- The most significant patterns are filtered, and a refinement phase may be applied to specialize flexible wildcards with ambiguous letters.
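The search described above can be illustrated with a much reduced sketch. It is not Pratt's implementation: there is no pattern graph, patterns are limited to fixed letters separated by fixed-length wildcards, and the score is simply the number of fixed letters. The names `support`, `discover`, `max_gap` and `max_letters` are hypothetical and chosen only for this illustration.

```python
import re


def support(pattern, sequences):
    """Count the sequences that contain at least one match of the pattern."""
    rx = re.compile(pattern)
    return sum(1 for s in sequences if rx.search(s))


def discover(sequences, k, alphabet, max_gap=2, max_letters=4):
    """Specialize the empty pattern depth-first, keeping patterns matched by >= k sequences.

    A pattern is a list of (gap, letter) components: `gap` wildcards followed
    by one fixed letter. Patterns that cannot be specialized further (or that
    reach max_letters) are reported, ranked by number of fixed letters.
    This ranking is an assumption standing in for Pratt's real scores.
    """
    found = []

    def to_regex(components):
        # "." stands for one wildcard position (written x in PROSITE notation).
        return "".join("." * gap + letter for gap, letter in components)

    def extend(components):
        extended = False
        if len(components) < max_letters:
            # A gap in front of the first letter carries no information.
            gaps = range(max_gap + 1) if components else [0]
            for gap in gaps:
                for letter in alphabet:
                    candidate = components + [(gap, letter)]
                    if support(to_regex(candidate), sequences) >= k:
                        extended = True
                        extend(candidate)
        if not extended and components:
            found.append(to_regex(components))

    extend([])
    return sorted(set(found), key=lambda p: (-len(p.replace(".", "")), p))


if __name__ == "__main__":
    # Toy data, not taken from the defensin set.
    toy = ["ACDCG", "TCACG", "GCTCG", "AAAAA"]
    print(discover(toy, k=3, alphabet="ACDGT", max_gap=2, max_letters=3))
```

On the toy data the top-ranked pattern is `C.CG`, which covers three of the four sequences. A real run would use the 20-letter amino acid alphabet and one of the scoring functions offered by the tool.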
11 Pratt: three levels of use
- Simple: most parameters are fixed or simplified
- Expert: all parameters available
- Meta: Pratt is applied to sequences of patterns.
12 Simple Pratt: parameters
13 Simple Pratt: results
14 Advanced Pratt: parameters
15 Advanced Pratt: results
16 Visualization of selected results
17 Meta Pratt
18 Search pattern in a databank
19 Results of the search in a databank
20 View of the search in a databank
21 Statistical Analysis of inter-motif regions
22 Results for refinement of patterns
23 Reverse Search in a Genome
24 Reverse Search in a Genome: principle
- From the patterns and knowledge of exon/intron splicing, a formal grammar may be inferred.
- Genomes are translated in the six frames and compiled in a suffix tree data structure.
- Syntactical analysis is done with the help of operations on suffix trees and results in potential new candidates.
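The six-frame translation step can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: the suffix tree is replaced by a plain regular-expression scan of each translated frame, the exon/intron grammar is not modeled, and the standard genetic code is assumed. The function names are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield (frame, protein) for frames +1..+3 and -1..-3."""
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            yield strand * (offset + 1), translate(seq[offset:])


def search_frames(dna, regex):
    """Return (frame, start, peptide) for every match in the six translations."""
    rx = re.compile(regex)
    return [(frame, m.start(), m.group())
            for frame, protein in six_frames(dna)
            for m in rx.finditer(protein)]


if __name__ == "__main__":
    # Toy DNA fragment, not taken from any genome used in the project.
    print(search_frames("ATGTGTAAATGC", "C.C"))
```

A linear scan like this costs one pass per pattern. The suffix tree mentioned on the slide indexes the translated genome once, so that many patterns can be searched against it.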
To: jnicolas_at_irisa.fr
Pattern: C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-AG-HKNRST-C-x(5,6)-C-C
Result columns: Organism, Chromosome, Phase, Position, LengthOcc, Length Ch, preOcc, Occ, postOcc
Result: No match (reported 6 times)
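A pattern written in this notation can be turned into a regular expression for searching protein sequences. The sketch below assumes PROSITE-style syntax and assumes that the two groups shown as AG and HKNRST are ambiguous positions, which PROSITE writes as [AG] and [HKNRST]; the brackets do not appear in the transcript. The function name `prosite_to_regex` is hypothetical.

```python
import re

# One pattern element: x, a single letter, or a bracketed set,
# optionally followed by a repeat count (n) or (m,n).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression.

    Supported elements: single letters, x (any residue), [ABC]
    (ambiguous letters), and repeat counts (n) or (m,n).
    """
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        parts.append("." if symbol == "x" else symbol)
        if low is not None:
            parts.append("{%s}" % low if high is None else "{%s,%s}" % (low, high))
    return "".join(parts)


if __name__ == "__main__":
    # Bracketed form of the slide's pattern (the brackets are an assumption).
    slide_pattern = ("C-x(2,4)-G-x(1,3)-C-x(3,4)-C-x(7)-"
                     "[AG]-[HKNRST]-C-x(5,6)-C-C")
    print(prosite_to_regex(slide_pattern))
```

The unbracketed elements `AG` and `HKNRST` are rejected by this converter, which is why the bracketed reading is used here.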
25 Conclusion / Perspectives
- 10 new potential defensins discovered
- Importance of a complete environment coupling highly expressive patterns with syntactical search in banks
- Current research: meta level using grammatical inference. Infer any regular language from a set of positive AND negative instances.
- Open questions: better filtering of patterns, introduction of probabilities, long distance interaction.
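The constraint behind inference from positive AND negative instances can be illustrated with a consistency check: a candidate language is acceptable only if it accepts every positive instance and rejects every negative one. The sketch below tests candidate regular expressions this way. It is not a grammatical inference algorithm, and the function names and toy data are hypothetical.

```python
import re


def is_consistent(regex, positives, negatives):
    """True if the regex accepts all positives and rejects all negatives."""
    rx = re.compile(regex)
    accepts_all = all(rx.fullmatch(s) for s in positives)
    rejects_all = not any(rx.fullmatch(s) for s in negatives)
    return accepts_all and rejects_all


def select_consistent(candidates, positives, negatives):
    """Keep only the candidate expressions consistent with both sample sets."""
    return [c for c in candidates if is_consistent(c, positives, negatives)]


if __name__ == "__main__":
    # Toy samples, invented for the illustration.
    positives = ["CAC", "CGC"]
    negatives = ["CAA", "AAC"]
    print(select_consistent(["C.C", "C..", "..C"], positives, negatives))
```

Negative instances are what rule out over-general candidates such as `C..` and `..C`, which positive instances alone could not exclude.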