Title: Intersection of Semantic Web and Life Sciences
1Intersection of Semantic Web and Life Sciences
- Kei Cheung
- Yale Center for Medical Informatics
Genomics and Bioinformatics (MBB 452a), November 2, 2005
2Outline
- Introduction
- Overview of RDF and LSID
- Semantic web applications
- Connotea
- Piggy Bank
- YeastHub
3Two scientific/technological endeavors that have impacted the world greatly in the past 15 years
- Human Genome Project (HGP)
- International collaboration that began in 1990 and was completed in 2003
- Understand the blueprint of life (moon landing of the nineties)
- Sequence the entire human genome
- World Wide Web (WWW)
- It was born in 1989/1990 at CERN (developed by Tim Berners-Lee)
- Revolutionized information access and sharing over the Internet (Gutenberg's printing press)
- Web browsers (e.g., IE, Netscape, Firefox)
4Relationship between HGP and WWW
- HGP transformed life sciences into an information science, as large amounts of data have been generated, which need to be stored and analyzed
- GenBank, EMBL, and DDBJ have recently reached a milestone of 100 billion bases from > 165,000 organisms
- PubMed has > 300,000 articles from > 150 life sciences journals
- WWW has become the most popular medium for life scientists to distribute, access, share, and integrate different types of biological data over the Internet
- As of 2005, there are 719 publicly available databases listed in the NAR molecular biology database compilation
5Spider-Man: Spidey science gets a genetic makeover
http://www.genomenewsnetwork.org/articles/05_02/spiderman.php
6Spider-Man (Tim Berners-Lee) Weaving the Web
7Semantic Web
- "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001
- It provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries
- It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.
8Semantic Web for Life Sciences (TBL, Bio-IT World Conference, May 2005)
- also the people involved in the Semantic Web pushing it along are also excited about getting involved in the life sciences; it's one of those areas that affect humankind, finding drugs, curing AIDS and cancer, etc. There seems to be a huge energy, and lots of practical technical reasons why this area is crying out to be one of the flagship areas that the Semantic Web really takes off.
9Data → Information → Knowledge
Navarro JD, Niranjan V, Peri S, Jonnalagadda CK, Pandey A. (2003) From biological databases to platforms for biomedical discovery. Trends Biotechnol. (6):263-8.
10Problem with the Current WWW
11Problem with the Current Web
12Keyword Search: regulatory variation mammals
13Data Heterogeneity
- Lack of standard detailed description of resources
- Data are exposed in different ways
- Programmatic interfaces
- Web forms or pages
- FTP directory structures
- Data are presented in different ways
- Structured text (e.g., tab-delimited format and XML format)
- Free text
- Binary (e.g., images)
14Data Heterogeneity (cont'd)
- Nomenclature problem
- Gene/protein names (based on phenotype, sequence, function, organisms, etc.)
- Armadillo (fruit flies) vs. β-catenin (mice)
- PSM1 (human) = PSM2 (yeast); PSM1 (yeast) = PSM2 (human)
- Sonic Hedgehog
- ID proliferation
- Different ID schemes: 1OF1 (PDB ID) and P06478 (SwissProt ID) correspond to Herpes Thymidine Kinase
- Lexical variation: GO:1234, GO1234, GO-1234
- Synonyms vs. homonyms
- Dopamine receptor D2: DRD2, DRD-2, D2
- PSA: prostate specific antigen, puromycin-sensitive aminopeptidase, psoriatic arthritis, pig serum albumin
- "Biologists would rather share their toothbrush than a gene name." "Gene nomenclature is beyond redemption," said Michael Ashburner
15From Web to Semantic Web (cont'd)
- Human processing → Machine processing
- Use of Metadata
- Free text description → ontological description
- HTML → XML → RDF or its extensions
- Vision → implementation
16HTML Example
[Figure: HTML page "Readme" displaying rows of numeric values]
17XML Example
What is XML?
- eXtensible Markup Language
- It is self describing
- It is hierarchical
- It is human- and computer-readable
- It is a World Wide Web Consortium (W3C) standard
- It can be validated using a DTD or XML Schema
- There is a large base of supporting software
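As a small illustration of the self-describing, hierarchical properties listed above, the sketch below parses a made-up XML fragment with Python's standard library. The element and attribute names (genes, gene, name, organism, id) are hypothetical and do not come from any real Bio-XML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; element names are illustrative only.
XML_TEXT = """
<genes>
  <gene id="g1">
    <name>DRD2</name>
    <organism>human</organism>
  </gene>
  <gene id="g2">
    <name>PIM1</name>
    <organism>human</organism>
  </gene>
</genes>
"""

def gene_names(xml_text):
    """Return the text of every <name> nested inside a <gene> element."""
    root = ET.fromstring(xml_text)
    return [gene.findtext("name") for gene in root.findall("gene")]

print(gene_names(XML_TEXT))
```

Because the tags name their own content, the program needs no knowledge of column positions, only of the element names it asks for.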
18Proliferation of Bio-XML Formats
Reasoning (machine intelligence)
19XML Representation of Proteomics Data
AGML
HUP-ML
20RDF Representation
21Resource Description Framework (RDF)
- It is a standard data model (directed labeled graph) for representing information (metadata) about resources in the World Wide Web
- In general, it can be used to represent information about things that can be identified (using URIs) on the Web
- It is intended to provide a simple way to make statements (descriptions) about Web resources
22RDF Statement
- An RDF statement consists of
- Subject: resource identified by a URI
- Predicate: property (as defined in a namespace identified by a URI)
- Object: property value (literal) or a resource
For example, the dbSNP Website is a subject, creator is a predicate, NCBI is an object. A resource can be described by multiple statements.
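The subject/predicate/object structure can be sketched in a few lines of Python. This is only an illustration using plain tuples, not an RDF library; the Triple type and the describe helper are names invented for this sketch, and the URIs are the dbSNP and Dublin Core ones used in this talk's example.

```python
from collections import namedtuple

# One RDF statement: subject and predicate are URIs; the object is a URI or a literal.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP"

statements = [
    Triple(DBSNP, DC + "creator", "http://www.ncbi.nlm.nih.gov"),  # object is a resource
    Triple(DBSNP, DC + "language", "en"),                          # object is a literal
]

def describe(resource, triples):
    """Collect every (predicate, object) pair stated about one resource."""
    return [(t.predicate, t.object) for t in triples if t.subject == resource]

for predicate, obj in describe(DBSNP, statements):
    print(predicate, "->", obj)
```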
23Graphical and XML Representation
Graph:
http://www.ncbi.nlm.nih.gov/SNP -- http://purl.org/dc/elements/1.1/creator --> http://www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/SNP -- http://purl.org/dc/elements/1.1/language --> en
XML:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"></dc:creator>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
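To show how the XML form maps back to the graph form, here is a minimal sketch that reads a flat RDF/XML description into triples using only Python's standard library. It handles just the simple rdf:Description pattern of this slide, not the full RDF/XML syntax; the function name extract_triples is invented for the sketch.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"/>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
"""

def extract_triples(rdf_xml):
    """Read flat rdf:Description blocks into (subject, predicate, object) tuples."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree writes tags as "{namespace}local"; join them into one URI.
            namespace, local = prop.tag[1:].split("}")
            resource = prop.get("{%s}resource" % RDF_NS)
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, namespace + local, obj))
    return triples

for triple in extract_triples(RDF_XML):
    print(triple)
```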
24Life Sciences Identifiers (LSIDs)
- URL vs. URI vs. URN
- URL: http://www.gleaners.org/faq.html
- URI: http://www.gleaners.org/faq.html#Q04
- URN: www.gleaners.org/faq.html#Q04
- LSID is a form of URN
25Problems of URIs
- The web server referenced by the URL may be broken or become unavailable
- The syntax of the URL may change over time as the underlying data retrieval program evolves
- The data returned by a URL may change over time as the underlying database contents change.
26LSID Format and Examples
- URN:LSID:namespace:database:object_id:revision_id
- Examples
- URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072
- URN:LSID:chemacx.cambridgesoft.com:ACX:CAS9675821
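A minimal sketch of splitting an LSID into its parts, assuming the fields are colon-separated as in URN syntax and that the revision is optional. The field names follow the format line on this slide; the LSID type and parse_lsid function are invented for this illustration and are not part of any LSID toolkit.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    namespace: str
    database: str
    object_id: str
    revision_id: Optional[str]   # None when the LSID carries no revision

def parse_lsid(text):
    """Split a colon-delimited LSID into its fields; the revision is optional."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not an LSID: %r" % text)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)

print(parse_lsid("URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072"))
```

Note that nothing in the parsed name says where the record lives; turning the name into a location is left to a separate resolution step.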
27LSID (cont'd)
- Globalness: An LSID is a name with global scope that does not imply a location. It has the same meaning everywhere.
- Uniqueness: The same LSID will never be assigned to two different objects.
- Persistence: It is intended that the lifetime of an LSID be permanent.
- Scalability: LSIDs can be assigned to any data element that might conceivably be available on the network, for hundreds of years.
- Legacy Support: The LSID naming scheme must permit the support of existing legacy naming systems.
- Extensibility: Any scheme for LSIDs must permit future extensions to the scheme.
- Independence: It is solely the responsibility of a name issuing authority to determine conditions under which it will issue a name.
- Resolution: A URN will not impede resolution, i.e., translation to a URL...
28Semantic Web Applications
- Connotea (on-line management of web resources)
- Piggy Bank (semantic web browser)
- YeastHub (yeast genome data integration)
30Connotea: Online Reference Management Service (Nature Publishing Group), www.connotea.org
- To keep links to the articles/websites of interest to you
- To discover new articles and websites through sharing your links with other users
- It is web-accessible
31TBL's original vision of the Web
- Active vs. passive
- Collaborative vs. authoritative
- Decentralized vs. centralized
- Semantic vs. syntactic
32Connotea Online Reference Management Service
(Nature Publishing Group)
33ALFRED Population Sample
34Connotea (ALFRED Example)
35ALFRED Example
36Google Earth Example
37Data Integration Using RDF
- GenBank: human hemoglobin "derives from" atagccgtacctgcgagtctagaagct
- Gene Ontology: human hemoglobin "is a" oxygen transport protein
- Protein Data Bank: human hemoglobin "has 3D structure"
- Unified view: the three graphs merged on the shared human hemoglobin node
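The unified view on this slide amounts to a union of statements that share a node. The sketch below uses plain Python sets rather than an RDF store; the edge directions, the "structure-record" placeholder object, and the helper names are assumptions made for illustration only.

```python
# Each source contributes (subject, predicate, object) statements.
# Edge directions are an assumption made for this sketch.
genbank = {("human hemoglobin", "derives from", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {("human hemoglobin", "is a", "oxygen transport protein")}
# "structure-record" is a placeholder for the 3D structure shown on the slide.
protein_data_bank = {("human hemoglobin", "has 3D structure", "structure-record")}

def unified_view(*sources):
    """Merge RDF-style graphs by taking the union of their statements."""
    merged = set()
    for source in sources:
        merged |= source
    return merged

def properties_of(subject, graph):
    """Return the predicates stated about one subject, sorted for stable output."""
    return sorted(p for s, p, _ in graph if s == subject)

graph = unified_view(genbank, gene_ontology, protein_data_bank)
print(properties_of("human hemoglobin", graph))
```

Because all three sources use the same name for the shared node, no join logic is needed; in practice that shared name would be a URI or LSID rather than a free-text label.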
38Piggy Bank
- http://simile.mit.edu/piggy-bank
- It is an extension to the Firefox Web browser
- It turns the Firefox Web browser into a Semantic Web browser
- It supports tagging and links to Google Maps
39RDF is the Common Currency
40Piggy Bank (Data Integration Example)
[Figure: TRIPLES (expression data) converted through D2RQ into an RDF expression dataset; HubMed keyword search yielding RDF bibliographic information; both imported into the Piggy Bank plugin for browsing/querying]
41TRIPLES Expression Data in RDF
42Piggy Bank (PIM1 Gene)
43Semantic Bank
44YeastHub
45YeastHub Team
Kei Cheung
Mark Gerstein
Andrew Smith
Kevin Yip
Andy Masiar
Remko deKnikker
46RDF Technologies
- Description of data source using Rich Site Summary (RSS)
- Data Conversion into RDF
- Relational Database to RDF (D2RQ)
- Tabular-RDF-Conversion
- RDF Database (Sesame)
- RDF-based query languages
47Rich Site Summary (RSS)
[Figure: User (Application), YeastHub, and Resources, some described with RSS and some with no RSS]
48Resource Description (Use of Dublin Core Metadata)
49RDF Metadata Example (RSS 1.0)
50Data Conversion and Integration
51RDF Modeling of Tabular Data
52Tabular-RDF Data Conversion
53Example of Data Converted into RDF
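One common way to model tabular data in RDF is to treat each row as a resource, each column as a property, and each cell as a property value. The sketch below follows that idea; it is not necessarily the conversion YeastHub performs, and the table contents, column names, and example.org base URIs are all made up for illustration.

```python
import csv
import io

# Hypothetical tab-delimited data and base URIs, for illustration only.
TABLE = "orf\tessential\tlocalization\nORF0001\tyes\tnucleus\nORF0002\tno\tvacuole\n"
RESOURCE_BASE = "http://www.example.org/orf/"
PROPERTY_BASE = "http://www.example.org/terms/"

def table_to_triples(text, key_column):
    """Turn each row into a resource and each remaining cell into one statement."""
    triples = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        subject = RESOURCE_BASE + row[key_column]
        for column, value in row.items():
            if column != key_column:
                triples.append((subject, PROPERTY_BASE + column, value))
    return triples

for triple in table_to_triples(TABLE, "orf"):
    print(triple)
```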
54Motivating Example
- Genomic analysis of essentiality within protein networks.
- H Yu, D Greenbaum, H Xin Lu, X Zhu, M Gerstein (2004) Trends Genet 20: 227-31.
- Jeong, H., Mason, S., Barabási, A.-L., and Oltvai, Z. 2001. Lethality and centrality in protein networks. Nature 411: 41-42.
- Fraser, H., Hirsh, A., Steinmetz, L., Scharfe, C., and Feldman, M. 2002. Evolutionary rate in the protein interaction network. Science 296: 750-752.
- Important but hard
55Example Integrated Query
56Query Form
57RQL Syntax and Query Results
58Next step: Data Mining
- Whole yeast genome analysis (Y6K)
- Subcellular localization of the yeast proteome.
- A Kumar, S Agarwal, JA Heyman, S Matson, M Heidtman, S Piccirillo, L Umansky, A Drawid, R Jansen, Y Liu, KH Cheung, P Miller, M Gerstein, GS Roeder, M Snyder (2002) Genes Dev 16: 707-19.
- A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome.
- A Drawid, M Gerstein (2000) J Mol Biol 301: 1059-75.
- Doing systematic data mining to predict the remaining 3K localizations
- Important but hard
59Once the web has been sufficiently "populated"
with rich metadata, what can we expect? First,
searching on the web will become easier as search
engines have more information available, and thus
searching can be more focused. Doors will also be
opened for automated software agents to roam the
web, looking for information for us or
transacting business on our behalf. The web of
today, the vast unstructured mass of information,
may in the future be transformed into something
more manageable - and thus something far more
useful. (Ora Lassila)
60Automate humanely!
- "No amount of automation will replace human beings, but clumsy and belligerent automation will alienate them and suppress their creativity." (Tony Kazic)
61Thanks! Questions?
62Semantic Graph
Find the most current image of Kei Cheung who is affiliated with YCMI
[Figure: graph in which Kei Cheung is "affiliated with" YCMI and linked by "images" to Files; File 1 and File n are each a "member of" Files and each has a "date" (Oct 1, 1990 and Feb 1, 2005)]
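The query on this slide can be answered by walking the graph: check the affiliation, follow the images link to the file collection, and pick the member with the latest date. The sketch below does this over plain tuples; the pairing of dates with files and the helper names are assumptions made for illustration.

```python
from datetime import date

# Statements transcribed from the slide's graph; which date belongs to which
# file is an assumption made for this sketch.
triples = [
    ("Kei Cheung", "affiliated with", "YCMI"),
    ("Kei Cheung", "images", "Files"),
    ("File 1", "member of", "Files"),
    ("File n", "member of", "Files"),
    ("File 1", "date", date(1990, 10, 1)),
    ("File n", "date", date(2005, 2, 1)),
]

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is stated."""
    return [s for s, p, o in triples if p == predicate and o == obj]

def most_current_image(person, organization):
    """Follow person -> images -> member files and return the newest file."""
    if organization not in objects(person, "affiliated with"):
        return None
    files = [f for coll in objects(person, "images") for f in subjects("member of", coll)]
    return max(files, key=lambda f: objects(f, "date")[0], default=None)

print(most_current_image("Kei Cheung", "YCMI"))
```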
63Research/Technologies Related to Semantic Web
- Text mining
- Agent computing
- Web services
- Ontological research
64Knowledge representation
- A person (Joe) is an uncle iff
- Joe is male
- He has a parent (Jill) who has a second child (Sue) who is a parent
[Figure: graph with nodes Jill, Joe, Sue, and an unnamed node "?", connected by has_child and has_parent edges]
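The uncle definition above can be checked mechanically once the family facts are stored as statements. The following is a minimal sketch in plain Python, not an ontology reasoner; "Ann" is a hypothetical stand-in for the unnamed "?" node (assumed here to be Sue's child), and the helper names are invented for the illustration.

```python
# Facts as (subject, predicate, object); names follow the slide, except
# "Ann" (Sue's child), which is a hypothetical stand-in for the "?" node.
facts = {
    ("Joe", "is", "male"),
    ("Jill", "has_child", "Joe"),
    ("Jill", "has_child", "Sue"),
    ("Sue", "has_child", "Ann"),
}

def children(person):
    return {o for s, p, o in facts if s == person and p == "has_child"}

def parents(person):
    # has_parent is treated as the inverse of has_child.
    return {s for s, p, o in facts if p == "has_child" and o == person}

def is_uncle(person):
    """Male, with a parent whose other child is itself a parent."""
    if (person, "is", "male") not in facts:
        return False
    siblings = {c for parent in parents(person) for c in children(parent)} - {person}
    return any(children(sibling) for sibling in siblings)

print(is_uncle("Joe"))
```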
65Other things to mention?
- Taxonomy vs. ontology
- OWL overview and example(s)