Title: UniProt Non-redundant Reference Cluster (UniRef) Databases
The UniProt Reference Clusters (UniRef) databases,
UniRef100, UniRef90 and UniRef50, are automatically
generated from the UniProt Knowledgebase and selected
UniParc records. The databases provide complete coverage
of sequence space while hiding redundant sequences from
view. This non-redundancy allows faster sequence
similarity searches when UniRef90 and UniRef50 are used.
Input sequences: UniProtKB sequences; UniProtKB isoform
sequences; selected UniParc sequences from the ENSEMBL,
RefSeq and PDB databases
String comparison: identifying sub-fragments and
identical sequences
UniRef100: Identical sequences and sub-fragments
with 11 or more residues are placed into a single
record.
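The string-comparison step above can be illustrated with a small sketch. It is not the UniProt production code: the function name `build_uniref100`, the longest-first ordering and the toy identifiers are assumptions made for illustration. Only the rule itself (identical sequences and sub-fragments of 11 or more residues share one record) comes from the text.

```python
def build_uniref100(seqs, min_len=11):
    """Group identical sequences and sub-fragments of at least min_len residues.

    seqs maps an identifier to an amino-acid sequence.
    Returns a dict: representative id -> sorted list of member ids.
    """
    # Assumption: process longest sequences first, so a fragment is always
    # compared against records that are at least as long as itself.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    reps = []      # (id, sequence) of the records created so far
    clusters = {}  # representative id -> member ids
    for sid, seq in ordered:
        home = None
        for rid, rseq in reps:
            identical = seq == rseq
            fragment = len(seq) >= min_len and seq in rseq
            if identical or fragment:
                home = rid
                break
        if home is None:
            reps.append((sid, seq))
            clusters[sid] = [sid]
        else:
            clusters[home].append(sid)
    return {rid: sorted(members) for rid, members in clusters.items()}


# Toy input with hypothetical identifiers.
EXAMPLE = {
    "A": "MSTAVLENPGLGRKLSDFGQ",  # full-length record
    "B": "MSTAVLENPGLGRKLSDFGQ",  # identical to A
    "C": "LENPGLGRKLSD",          # 12-residue sub-fragment of A
    "D": "LENPG",                 # sub-fragment, but shorter than 11 residues
}
print(build_uniref100(EXAMPLE))
```

In this toy input, B and C join the record of A, while D stays separate because it is shorter than the 11-residue cutoff.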
CD-HIT computation: clustering UniRef100
representative sequences at the 90% sequence identity level
UniRef90: 40% size reduction
UniRef90: Members of related UniRef100 clusters at
the 90% level form a UniRef90 cluster. The
representative is selected based on the quality
of the entry, name, organism and sequence
length. The title and identifier are derived from the
representative sequence.
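The clustering step can be pictured as a greedy, longest-first pass over the representative sequences. The sketch below is a simplified stand-in for CD-HIT, not its actual implementation: the `identity` function counts matching positions without any alignment, which is an assumption made to keep the example short, and all names and toy sequences are hypothetical.

```python
def identity(a, b):
    """Toy identity: matching positions divided by the shorter length.

    Assumption: sequences are compared position by position from the first
    residue, with no alignment or gaps. Real tools align the sequences first.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative it matches.

    seqs maps an identifier to a sequence; threshold is a fraction (0.9 = 90%).
    Returns a dict: representative id -> list of member ids.
    """
    # Longest first, so the longest sequence seeds each cluster.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters = []  # (representative id, representative sequence, members)
    for sid, seq in ordered:
        for _rep_id, rep_seq, members in clusters:
            if identity(seq, rep_seq) >= threshold:
                members.append(sid)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((sid, seq, [sid]))
    return {rep_id: members for rep_id, _, members in clusters}


# Hypothetical toy sequences.
TOY = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",  # 9 of 10 positions match s1
    "s3": "ACDEFWWWWW",  # 5 of 10 positions match s1
    "s4": "WWWWWWWWWW",
}
print(greedy_cluster(TOY, 0.9))
```

The same function could be run again on the resulting representatives with a lower threshold to mimic a second, coarser clustering level.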
CD-HIT computation: clustering UniRef90
representative sequences at the 50% sequence identity level
UniRef50: Members of related UniRef90 clusters at
the 50% level form a UniRef50 cluster. The
representative is selected based on the quality
of the entry, name, organism and sequence
length. The title and identifier are derived from the
representative sequence.
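The poster names four criteria for choosing a representative (entry quality, name, organism, sequence length) without giving the exact rules. The sketch below shows one way such a ranking could be expressed. Every concrete rule in it is an assumption: the `reviewed` flag as a proxy for quality, the list of uninformative name prefixes, the preferred-organism set, and the two non-human accessions are all hypothetical. The identifier pattern follows the `UniRef90_P00439` example shown in the data files.

```python
from dataclasses import dataclass


@dataclass
class Member:
    accession: str
    reviewed: bool  # assumed proxy for "quality of the entry"
    name: str
    organism: str
    length: int


# Hypothetical preferences, for illustration only.
PREFERRED_ORGANISMS = {"Homo sapiens"}
UNINFORMATIVE_PREFIXES = ("hypothetical", "uncharacterized")


def rank(m):
    """Sort key: higher tuples are better candidates."""
    informative = not m.name.lower().startswith(UNINFORMATIVE_PREFIXES)
    return (m.reviewed, informative, m.organism in PREFERRED_ORGANISMS, m.length)


def pick_representative(members):
    return max(members, key=rank)


def cluster_id(level, rep):
    """Build an identifier such as UniRef90_P00439 from the representative."""
    return f"UniRef{level}_{rep.accession}"


MEMBERS = [
    Member("P00439", True, "Phenylalanine-4-hydroxylase", "Homo sapiens", 452),
    Member("X00001", False, "Uncharacterized protein", "Mus musculus", 470),
    Member("X00002", True, "Phenylalanine-4-hydroxylase", "Mus musculus", 453),
]
rep = pick_representative(MEMBERS)
print(cluster_id(90, rep), "-", rep.name, "related cluster")
```

Using a tuple as the sort key applies the criteria in a fixed priority order, so sequence length only breaks ties among otherwise equal candidates.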
UniRef50: 65% size reduction
Generating data files for distribution
XML file
<?xml version="1.0" encoding="ISO-8859-1"?>
<UniRef90 xmlns="http://uniprot.org/uniref">
<entry id="UniRef90_P00439" updated="2006-05-16">
<name>Phenylalanine-4-hydroxylase related cluster</name>
<representativeMember>
<dbReference type="UniProtKB ID" id="PH4H_HUMAN">
<property type="UniProtKB accession" value="P00439"/>
<property type="UniProtKB accession" value="Q16717"/>
<property type="UniProtKB accession" value="Q8TC14"/>
<property type="UniRef100 ID" value="UniRef100_P00439"/>
<property type="protein name" value="Phenylalanine-4-hydroxylase"/>
<property type="source organism" value="Homo sapiens (Human)"/>
<property type="NCBI taxonomy" value="9606"/>
<property type="length" value="452"/>
</dbReference>
<sequence length="452" checksum="018F00EBBBDDCE2F">
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK
</sequence>
</representativeMember>
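A file in this XML layout could be read with Python's standard `xml.etree.ElementTree` module, as sketched below. The element and attribute names and the namespace come from the excerpt above. The embedded sample is an assumption: it is a shortened copy with a truncated sequence, fewer properties, and closing `</entry>` and `</UniRef90>` tags added so that it parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {"u": "http://uniprot.org/uniref"}

# Shortened, hypothetical sample: sequence truncated to 10 residues and
# closing tags added so the fragment is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<UniRef90 xmlns="http://uniprot.org/uniref">
<entry id="UniRef90_P00439" updated="2006-05-16">
<name>Phenylalanine-4-hydroxylase related cluster</name>
<representativeMember>
<dbReference type="UniProtKB ID" id="PH4H_HUMAN">
<property type="UniProtKB accession" value="P00439"/>
<property type="UniProtKB accession" value="Q16717"/>
<property type="NCBI taxonomy" value="9606"/>
</dbReference>
<sequence length="10">
MSTAVLENPG
</sequence>
</representativeMember>
</entry>
</UniRef90>
"""


def parse_entries(xml_bytes):
    """Return one dict per <entry>, describing its representative member."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for entry in root.findall("u:entry", NS):
        rep = entry.find("u:representativeMember", NS)
        dbref = rep.find("u:dbReference", NS)
        # A property type such as "UniProtKB accession" can repeat,
        # so values are collected into lists.
        props = {}
        for prop in dbref.findall("u:property", NS):
            props.setdefault(prop.get("type"), []).append(prop.get("value"))
        seq = rep.find("u:sequence", NS)
        entries.append({
            "id": entry.get("id"),
            "name": entry.findtext("u:name", namespaces=NS),
            "representative": dbref.get("id"),
            "properties": props,
            # Remove line breaks and spaces inside the sequence text.
            "sequence": "".join(seq.text.split()),
        })
    return entries


for e in parse_entries(SAMPLE):
    print(e["id"], e["representative"], e["properties"]["UniProtKB accession"])
```

For full release files, a streaming approach such as `ET.iterparse` would likely be preferable to loading the whole document into memory.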
FASTA file
UniRef Release
>UniRef90_P00439 Phenylalanine-4-hydroxylase related cluster
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK
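A FASTA record in this form can be read with a few lines of plain Python. The sketch below assumes that the header is the cluster identifier followed by a space and the cluster title, as in the record above. The function name `read_fasta` is hypothetical, and the sample holds only the first two sequence lines of that record.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (identifier, title, sequence)."""
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far, if any.
        if header is not None:
            ident, _, title = header.partition(" ")
            records.append((ident, title, "".join(chunks)))

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    flush()
    return records


# First two sequence lines of the record above, for illustration.
FASTA_SAMPLE = """\
>UniRef90_P00439 Phenylalanine-4-hydroxylase related cluster
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
"""

for ident, title, seq in read_fasta(FASTA_SAMPLE.splitlines()):
    print(ident, "|", title, "|", len(seq), "residues")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly instead of a list.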
UniRef Usages
- Speeding up similarity searches
- Reducing bias in homology searches by providing a more
  even sampling of sequence space
- Using the clusters for family classification
- Using the clusters to annotate EST and other sequence
  databases
- Using the clusters to check the consistency of UniProtKB
  annotations
Swiss Institute of Bioinformatics (SIB)
European Bioinformatics Institute (EMBL-EBI)
Protein Information Resource (PIR)
UniProt is mainly supported by the National
Institutes of Health (NIH) grant 2 U01
HG02712-04. Additional support for the EBI's
involvement in UniProt comes from the European
Commission contract FELICS (021902) and from the
NIH grant 5 P41 HG02273-06. UniProtKB/Swiss-Prot
activities at the SIB are supported by the Swiss
Federal Government through the Federal Office of
Education and Science. PIR activities are also
supported by the NIH grants for NIAID proteomic
resource (HHSN266200400061C) and grid enablement
(NCI-caBIG-ICR), and National Science Foundation
grants for protein ontology (ITR-0205470) and
BioTagger (IIS-0430743).
Contact help_at_uniprot.org