Title: Conserved pathways within bacteria and yeast
1Conserved pathways within bacteria and yeast
- Something about the term project for
Algorithmic Techniques for Biology
2General
- Given: two protein graphs
- Given: weighted protein relations between the two graphs
- Combine the two graphs into one relation graph (the combination graph)
- Find a path of length k in the combination graph
with maximum (or minimum) weight.
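To make the last bullet concrete, here is a minimal brute-force sketch of the path search. It is not the project's algorithm: it simply tries every simple path, so it is only usable on very small graphs. It assumes that "length k" means k edges, that the graph is undirected, and that edges are given as a dict from vertex pairs to weights. The function name and the toy weights are made up for illustration.

```python
def best_path(edges, k):
    """Return (weight, path) for the maximum-weight simple path with k edges.

    edges: dict mapping (u, v) pairs to weights, treated as undirected.
    Returns None if the graph has no simple path with k edges.
    """
    # Build an adjacency map so neighbours can be looked up quickly.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    best = None

    def extend(path, weight):
        nonlocal best
        if len(path) == k + 1:          # k edges means k + 1 vertices
            if best is None or weight > best[0]:
                best = (weight, list(path))
            return
        for nxt, w in adj[path[-1]].items():
            if nxt not in path:         # keep the path simple
                path.append(nxt)
                extend(path, weight + w)
                path.pop()

    for start in adj:
        extend([start], 0)
    return best


# Toy combination graph with invented weights.
toy = {("a1", "c4"): 3, ("c4", "d2"): 2, ("d2", "e2"): 1, ("a1", "d2"): 5}
result = best_path(toy, 2)
```

For the minimum-weight variant, the same search can be run on negated weights.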
3Combining two graphs
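[Figure: graph G1 (vertices a, b, c, d, e) and graph G2 (vertices 1, 2, 3, 4, 5) linked by weighted protein relations; the resulting combination graph has vertices a1, c4, d2, e2 and edge weights w1, w2, w3.]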
4Combining graphs
- Each relation becomes a vertex in the combination graph.
- Two vertices are connected if:
[Figure: three example cases, each joining the combination-graph vertices c2 and e4, with edge weights W1, W2 and W3.]
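The construction can be sketched in code. The slide's exact connection rule and edge weights are not recoverable here, so this sketch uses two loudly labelled assumptions: two combination vertices (u, x) and (v, y) are connected only when u-v is an edge of G1 and x-y is an edge of G2, and the edge weight is the sum of the two relation weights. Both the rule and the weighting are placeholders, and the data is invented.

```python
from itertools import combinations


def combine(g1_edges, g2_edges, relations):
    """Build a combination graph from two graphs and the relations between them.

    g1_edges, g2_edges: iterables of (u, v) pairs, treated as undirected.
    relations: dict mapping (g1_node, g2_node) to a relation weight.
    Returns a dict mapping (vertex_p, vertex_q) to an edge weight, where each
    vertex is one (g1_node, g2_node) relation.
    """
    e1 = {frozenset(e) for e in g1_edges}
    e2 = {frozenset(e) for e in g2_edges}
    combined = {}
    for p, q in combinations(sorted(relations), 2):
        (u, x), (v, y) = p, q
        # ASSUMED rule: the endpoints must be adjacent in both original graphs.
        if frozenset((u, v)) in e1 and frozenset((x, y)) in e2:
            # ASSUMED weighting: sum of the two relation weights.
            combined[(p, q)] = relations[p] + relations[q]
    return combined


# Invented toy input.
g1 = [("a", "c"), ("c", "d"), ("d", "e")]
g2 = [(1, 4), (4, 2)]
rel = {("a", 1): 30, ("c", 4): 20, ("d", 2): 66, ("e", 2): 5}
graph = combine(g1, g2, rel)
```

Swapping in a different connection rule only changes the `if` condition and the weight expression.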
5What we need to do
- Find G1 (Hpylo20040704.tab)
- Find G2 (Scere20040704.tab)
- Find protein relations between G1 and G2
- Combine the two graphs
- Find the path in the combination graph
6Where to get data and software
- Get data from http://dip.doe-mbi.ucla.edu/
- Choose Files -> SPECIES to get Hpylo20040704.tab and Scere20040704.tab
- Choose Files -> FASTA to get fasta20040704.seq
- (registration is required to get these files)
- Get the BLAST software from http://www.ncbi.nlm.nih.gov/BLAST/
- Choose FAQs -> "Which BLAST program should I use?" -> FTP location ftp://ftp.ncbi.nih.gov/blast/executables/ to get a file such as blast-2.0.10-ia32-win32.exe
7The Format of Hpylo20040704.tab
- DIP4305E DIP3048N PIRB64526 GI2313123 DIP3047N SWPO24853 PIRA64520 GI2313078
- DIP4306E DIP3049N SWPO25122 PIRC64564 GI2313456 DIP3047N SWPO24853 PIRA64520 GI2313078
- DIP4307E DIP3050N PIRH64618 GI2313921 DIP3047N SWPO24853 PIRA64520 GI2313078
- DIP4308E DIP3051N PIRB64520 GI2313079 DIP3051N PIRB64520 GI2313079
- DIP4309E DIP3052N SWPP56036 PIRH64669 GI2314362 DIP3051N PIRB64520 GI2313079
Edge number: the identifier ending in E (e.g. DIP4305E)
Node number: the identifiers ending in N (e.g. DIP3048N and DIP3047N)
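A minimal sketch of reading this format into an edge list for G1 or G2 follows. It assumes that each row carries exactly one edge identifier and two node identifiers, and it tolerates an optional colon after "DIP" because the punctuation of the real file is an assumption here. The function name is hypothetical.

```python
import re

# Matches DIP identifiers such as "DIP4305E" or "DIP:4305E".
# The optional colon is an assumption about the real file's punctuation.
DIP_ID = re.compile(r"DIP:?(\d+)([EN])")


def parse_interactions(lines):
    """Return a list of (edge_number, node_a, node_b) tuples, one per row."""
    edges = []
    for line in lines:
        ids = DIP_ID.findall(line)
        edge = [num + kind for num, kind in ids if kind == "E"]
        nodes = [num + kind for num, kind in ids if kind == "N"]
        # Skip rows that do not have exactly one edge and two nodes.
        if len(edge) == 1 and len(nodes) == 2:
            edges.append((edge[0], nodes[0], nodes[1]))
    return edges


rows = [
    "DIP4305E DIP3048N PIRB64526 GI2313123 DIP3047N SWPO24853 PIRA64520 GI2313078",
    "DIP4308E DIP3051N PIRB64520 GI2313079 DIP3051N PIRB64520 GI2313079",
]
g1_edges = parse_interactions(rows)
```

Note that the second sample row is a self-interaction (3051N with 3051N), which the parser keeps as an ordinary edge.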
8The format of fasta20040704.seq
- >DIP1NswP19527pirA21762gi112046
- KARMSSLARAELEKRIDSLMDEIAFLKKVHEEEIAELQAQIQYAQISVE
- >DIP2NswpirA23003gi83621
- MKKQNLNSILLMYINYIINYFNNIHKNQLKKDWIMGYEYM
- >DIP3NswP06778pirA23282gi83448
- MAFLSYFATENQQMQTRRLPRTAEGSGGFGVLLMNEIMDMDEKKPV
- >DIP4NswP04925pirA23544gi91067
- MANLGYWLLALFVTMWTDVGLCKKRPKPGG
Node number: the DIP identifier ending in N in each header line (e.g. DIP1N)
Protein sequence: the line that follows each header line
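Reading this file into a lookup table from node number to protein sequence could look like the sketch below. It assumes standard FASTA headers that start with ">" followed by the DIP node identifier, and it allows a sequence to span several lines. The function name and the optional colon in the pattern are assumptions.

```python
import re


def parse_fasta(lines):
    """Return a dict mapping node numbers such as '1N' to protein sequences."""
    records = {}
    node = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: pick out the node number right after ">DIP".
            match = re.match(r">DIP:?(\d+N)", line)
            node = match.group(1) if match else None
            if node is not None:
                records[node] = ""
        elif node is not None:
            # Sequence line: append to the current record.
            records[node] += line
    return records


sample = [
    ">DIP1NswP19527pirA21762gi112046",
    "KARMSS",
    "LARAEL",
    ">DIP2NswpirA23003gi83621",
    "MKKQ",
]
sequences = parse_fasta(sample)
```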
9How to get protein relation-1
- Build the protein sequence file hp.seq from Hpylo20040704.tab and fasta20040704.seq: for each node number in the .tab file, copy the matching entry from the FASTA file.
- DIP4305E DIP3048N PIRB64526 GI2313123 DIP3047N SWPO24853 PIRA64520 GI2313078
- DIP4306E DIP3049N SWPO25122 PIRC64564 GI2313456 DIP3047N SWPO24853 PIRA64520 GI2313078
- >DIP3047NswO24853pirA64520gi2313078
- MATRTQARGAVVELLYAFESGNEEIKKIASSMLEEKKIKNNQLA
- >DIP3048NswpirB64526gi2313123
- MIQIYHADAFEIIKDFYQQNLKVDAIITDPPYNISVKNNFPT
- >DIP3049NswO25122pirC64564gi2313456
- MKTKAPMKNIRNFSIIAHIDHGKSTLADCLISECNAISNREMKSQVMDT
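One way to carry out this step is sketched here: collect every node number that occurs in the .tab rows, then keep only the FASTA entries whose header names one of those nodes. The sketch assumes headers start with ">DIP" followed by the node number; the function name is hypothetical, and reading and writing the actual files is left out.

```python
import re


def select_records(tab_lines, fasta_lines):
    """Return FASTA text holding only entries whose node number occurs in tab_lines."""
    # Every identifier ending in N in the .tab rows is a wanted node.
    wanted = set()
    for line in tab_lines:
        wanted.update(re.findall(r"DIP:?(\d+N)", line))

    out = []
    keep = False
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            match = re.match(r">DIP:?(\d+N)", line)
            keep = bool(match) and match.group(1) in wanted
        if keep and line:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


tab = ["DIP4305E DIP3048N PIRB64526 GI2313123 DIP3047N SWPO24853 PIRA64520 GI2313078"]
fasta = [">DIP1NswP19527", "KARM", ">DIP3047NswO24853", "MATR", ">DIP3048Nsw", "MIQI"]
hp_seq_text = select_records(tab, fasta)
```

The returned text could then be written to hp.seq with an ordinary file write.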
10How to get protein relation-2
- The format of the hp.seq file
- >DIP3047NswO24853pirA64520gi2313078
- MATRTQARGAVVELLYAFESGNEEIKKIASSMLEEKKIKNNQLAFAL
- >DIP3048NswpirB64526gi2313123
- MIQIYHADAFEIIKDFYQQNLKVDAIITDPPYNISVKNNFPTLKSAKRQGI
- >DIP3049NswO25122pirC64564gi2313456
- MKTKAPMKNIRNFSIIAHIDHGKSTLADCLISECNA
11How to get protein relation-3
- Get the protein sequence file database.seq from Scere20040704.tab and fasta20040704.seq in the same way as hp.seq.
- Download BLAST and extract it to one directory.
- Copy hp.seq and database.seq to the same directory as BLAST.
12How to get protein relation-4
- In command mode, go to the BLAST directory
- Enter formatdb -i database.seq -p T -o T to make the index files of database.seq
- Enter blastall -p blastp -d database.seq -i hp.seq -o relation.out to create the relation file relation.out
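If the two commands are to be run from a script rather than typed by hand, a sketch along these lines may help. The command names and flags are exactly those given on the slide; the function names, the default file names as parameters, and the assumption that formatdb and blastall can be found on the PATH are mine.

```python
import subprocess


def blast_commands(query="hp.seq", database="database.seq", output="relation.out"):
    """Return the two command lines from the slide as argument lists."""
    formatdb = ["formatdb", "-i", database, "-p", "T", "-o", "T"]
    blastall = ["blastall", "-p", "blastp", "-d", database, "-i", query, "-o", output]
    return formatdb, blastall


def run_blast(cwd="."):
    """Run both commands in the given directory; raises if either one fails.

    Assumes the formatdb and blastall executables are on the PATH.
    """
    for cmd in blast_commands():
        subprocess.run(cmd, cwd=cwd, check=True)


formatdb_cmd, blastall_cmd = blast_commands()
```

Calling `run_blast()` from the BLAST directory would index database.seq first and then write relation.out.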
13How to get protein relation-5
- The format of the relation.out file
- Query DIP3549NswP71408pirE64653gi2314219
- (632 letters)
- Database database.seq
- 4772 sequences 2,345,789 total letters
- Sequences producing significant alignments, with Score (bits) and E Value:
-   200_database.seq     404   e-113
-   276_database.seq     397   e-111
-   828_database.seq     184   4e-047
-   2039_database.seq    182   9e-047
-   1871_database.seq    177   5e-045
-   4195_database.seq    176   1e-044
The Query line gives the node number of graph G1 (here DIP3549N).
Each hit such as 200_database.seq names a sequence in database.seq; get the node number of G2 from that sequence in the database.seq file.
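Extracting the relations from this output could be sketched as follows. The patterns are assumptions based only on the excerpt shown here: a line starting with "Query" (optionally "Query=") names the G1 node, and each hit line starts with a sequence index followed by "_database.seq", a score and an E value. The function name is hypothetical.

```python
import re


def parse_relations(lines):
    """Return a dict: G1 node number -> list of (sequence_index, score_bits, e_value)."""
    relations = {}
    query = None
    for line in lines:
        m = re.match(r"\s*Query=?\s*DIP:?(\d+N)", line)
        if m:
            query = m.group(1)
            relations[query] = []
            continue
        m = re.match(r"\s*(\d+)_database\.seq\s+([\d.]+)\s+(\S+)", line)
        if m and query is not None:
            e_text = m.group(3)
            # An E value printed as "e-113" means 1e-113; restore the leading 1.
            if e_text.startswith("e"):
                e_text = "1" + e_text
            relations[query].append((int(m.group(1)), float(m.group(2)), float(e_text)))
    return relations


report = [
    "Query DIP3549NswP71408pirE64653gi2314219",
    "(632 letters)",
    "Database database.seq",
    "200_database.seq 404 e-113",
    "828_database.seq 184 4e-047",
]
hits = parse_relations(report)
```

The score or the E value of each hit is a natural candidate for the relation weight, although which one to use is a design choice the slides leave open.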
14How to get protein relation-6
- The database.seq file
- >DIP801NswP29539pirS46157gi626910
- MSKDFSDKKKHTIDRIDQHILRRSQHDNYSNGSSPWMKTNLPPPSPQAH
- >DIP802NswP39925pirS46611gi626985
- MMMWQRYARGAPRSLTSLSFGKASRISTVKPVLRSRMPVHQRLQTLS
- ...
- >DIP2883NswP33299pirS34354gi422126
- MPPKEDWEKYKAPLEDDDKKPDDDKIVPLTEGDIQVLKSYGAAPYAAK
- >DIP2884NswQ03656pirS55098gi1078546
- MGSSINYPGFVTKSAHLADTSTDASISCEEATSSQEAKKNFFQRDYNMMKK
200_database.seq is the 200th sequence, i.e. row 400 (node 802N in G2).
2039_database.seq is the 2039th sequence, i.e. row 4078 (node 2883N in G2).
[Figure: node 3549N in G1 is related to nodes 802N and 2883N in G2.]
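The row arithmetic can be written down as a small helper. It assumes, as the slide's "row 400" and "row 4078" suggest, that every record in database.seq occupies exactly two rows (header, then sequence) and that sequences are counted from 1, so the n-th sequence sits on row 2n and its header on row 2n - 1. The function name is hypothetical.

```python
import re


def node_for_hit(database_lines, index):
    """Return the node number (e.g. '802N') of the index-th sequence, counting from 1."""
    # Header is on row 2*index - 1 (1-based), i.e. list position 2*index - 2.
    header = database_lines[2 * index - 2]
    match = re.match(r">DIP:?(\d+N)", header)
    if match is None:
        raise ValueError(f"row {2 * index - 1} is not a DIP header: {header!r}")
    return match.group(1)


db = [">DIP801Nsw", "MSKD", ">DIP802Nsw", "MMMW"]
second = node_for_hit(db, 2)
```

If any sequence in the real file wraps over more than one row, this shortcut breaks and the file would have to be parsed record by record instead.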
15Summary
- Get the data: Hpylo20040704.tab, Scere20040704.tab, fasta20040704.seq.
- Get the sequence files hp.seq and database.seq.
- Get the protein relation file relation.out.
- Build the combination graph.
- Find the maximum-weight path of length k in the combination graph.