Title: The Gene Wiki, from a BioRDF-naïve perspective
1 The Gene Wiki, from a BioRDF-naïve perspective
W3C / HCLSIG BioRDF Subgroup, November 17, 2008
2 Patterns of gene annotation
How do we efficiently annotate the function of
the 25,000 genes in the mammalian genome?
Goal: genome-wide functional genomics
$P(k) \sim k^{-a}$
44% of genes in Entrez Gene have zero linked
references. Over 75% have five or fewer linked
references.
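As a rough illustration of how the fractions quoted above could be computed, the sketch below summarizes a list of per-gene reference counts. The function name and the ten counts are made up for the example; they are not Entrez Gene data, and this is not necessarily how the slide's figures were derived.

```python
from collections import Counter


def summarize_reference_counts(counts):
    """Return (fraction with zero refs, fraction with <= 5 refs, histogram).

    `counts` holds one integer per gene: its number of linked references.
    """
    if not counts:
        raise ValueError("counts must be non-empty")
    n = len(counts)
    zero = sum(1 for c in counts if c == 0) / n
    five_or_fewer = sum(1 for c in counts if c <= 5) / n
    histogram = Counter(counts)  # maps k -> number of genes with k references
    return zero, five_or_fewer, histogram


# Made-up counts for ten hypothetical genes (assumption, not real data).
example_counts = [0, 0, 0, 0, 1, 2, 3, 5, 12, 250]
zero, few, hist = summarize_reference_counts(example_counts)
print(f"zero refs: {zero:.0%}, five or fewer: {few:.0%}")
```

The histogram gives the number of genes at each reference count k, which is the quantity a power-law form such as P(k) would describe.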
3 The Long Tail of Knowledge
- Traditional media revolves around the Short Head:
a small number of publishers putting out lots of
content
- Web 2.0 media revolves around community
generated content: a huge population of
individuals each generating a (relatively) small
amount of content
The Short Head: Newspapers, TV/Hollywood, Consumer
Reports, Olympics, Encyclopedia Britannica
The Long Tail: Blogs, YouTube, Amazon
reviews, American Idol, Wikipedia
Community intelligence
4 The Long Tail of encyclopedias
- Wiki: a website that allows the visitors
themselves to easily add, remove, and otherwise
edit and change available content, typically
without the need for registration.
- Wikipedia: the free encyclopedia that anyone can
edit.
An expert-led investigation carried out by Nature
revealed numerous errors in both
encyclopaedias, but among 42 entries tested, the
difference in accuracy was not particularly
great: the average science entry in Wikipedia
contained around four inaccuracies; Britannica,
about three.
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
5 Advantages of a Gene Wiki
1) Existing gene portals are great for structured
content, but a wiki is suited for summarizing
unstructured content
Entrez Gene
Wikipedia
Unstructured content allows for free-text,
images, diagrams, photos, etc.
6 Advantages of a Gene Wiki
2) Wiki articles enable two-way communication of
information, encouraging contributions and edits
from the community.
[Figure: snapshots dated Dec 18, 2002; Jan 3, 2004; Dec 11, 2004; May 6, 2006]
Wikipedia is rarely the last place you look, but
is often a good first place for an overview.
7 Gene stubs
- Active MCB community at WP had already developed
650 gene articles
- Can we accelerate this process through stub
creation?
- In total, created 7500 new articles and edited
650 previously existing articles.
8 Why Wikipedia?
- Critical mass of articles to which and from which
we could link gene pages
- Critical mass of editors who were experienced in
wiki-related issues (fighting vandalism,
copyediting, governance)
- Active group of molecular biologists at the MCB
WikiProject (http://en.wikipedia.org/wiki/WP:MCB)
- Alternatives considered
  - Home-built wiki
  - Citizendium (citizendium.org)
9 Gene wiki usage
Currently have 9000 gene pages or stubs at
Wikipedia
50% of all edits to gene pages are to
newly-created pages
Gene Wiki pages are highly ranked at Google,
ensuring a critical mass of users and editors
10 Positive feedback loop
[Figure: positive feedback loop between gene wiki page utility, number of readers, and number of editors; labeled values: 1, 100, 2, 200]
11 25k gene-specific review articles?
- Reelin: 33 editors, 221 edits since July 2002
- Heparin: 175 editors, 320 edits since June 2003
- AMPK: 44 editors, 84 edits since March 2004
- RNAi: 232 editors, 708 edits since October 2002
Hyperlinks to related concepts
12 Gene Wiki activity
- Steady (and growing?) edit rate over time
13 Gene Wiki article growth
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/gene-wiki-top-2500-20081114
14 Welcome to the semantic web
- "The main concern with plaintext-on-Wikipedia is
that it's not an effective way to truly exploit
the long tail, since you're going to end up with
this massive plaintext disaster that will require
human collating (redundant work - just get it
right the first time)."
  - public-semweb-lifesci mailing list
15 Primary emphases
- Providing useful content: scientists will not
find or contribute to a wiki unless it is already
useful
- Instant feedback: wikis allow changes to take
effect immediately, without approval or an
intermediary (e.g., corrections/additions to
NCBI/Ensembl?)
- Emphasis on contributors, not data miners:
emphasize getting data in, not getting it out,
since complex protocols encourage
nonparticipation (e.g., MIAME)
- Critical mass: what will differentiate the Gene
Wiki from the many other wiki efforts that are
stagnant?
16 Secondary emphases
- Reliability and accuracy: do open and uncurated
data models produce trustworthy content?
- Synergy with existing resources: how can the Gene
Wiki make the growth of traditional annotation
more efficient?
- Enabling semantic queries/structure: how can we
structure unstructured content for data mining?
(Semantic Mediawiki? NLP?)
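One possible reading of "structuring unstructured content" is to pull key/value fields out of a wiki template and emit them as subject-predicate-object triples. The sketch below is only an illustration under stated assumptions: the template name GeneInfobox, its field names, and the example.org namespace are all hypothetical, and this is not the Gene Wiki's or Semantic MediaWiki's actual mechanism.

```python
import re

# Hypothetical wikitext; the template and field names are illustrative only.
WIKITEXT = """{{GeneInfobox
| symbol = RELN
| name = reelin
| chromosome = 7
}}
Reelin is a protein that helps regulate neuronal migration.
"""

# Matches template lines of the form "| key = value".
FIELD_RE = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", re.MULTILINE)


def infobox_to_triples(wikitext, base="http://example.org/genewiki/"):
    """Turn "| key = value" template fields into N-Triples-style tuples.

    `base` is a placeholder namespace (assumption), not a real vocabulary.
    """
    fields = dict(FIELD_RE.findall(wikitext))
    symbol = fields.get("symbol")
    if symbol is None:
        return []  # nothing to anchor the triples on
    subject = f"<{base}gene/{symbol}>"
    return [
        (subject, f"<{base}prop/{key}>", f'"{value}"')
        for key, value in fields.items()
    ]


for s, p, o in infobox_to_triples(WIKITEXT):
    print(s, p, o, ".")
```

Only the templated fields become triples here; the free-text sentence is left untouched, which is where an NLP approach would have to take over.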
17 Idealized information flow
Long tail scientific contributions
Direct semantic annotation by scientists
Wikipedia
Semantic structure
NCBI
Ensembl
Authoritative annotation databases
18 Figure to scale?
Long tail scientific contributions
Semantic structure
19 Summary
- Goal: create a resource that complements existing
tools rather than competing with them.
- Primary emphasis will always be on maximizing
community participation.
- How do we structure the unstructured
contributions?
20 Acknowledgements
Serge Batalov, Jason Boyer, Jennifer Floyd, Yue Hu,
Jon Huss, Jeff Janes, Camilo Orozco, Steve Su,
Julia Turner, Chunlei Wu, David Delano,
James Goodale, Phil McClurg, Richard Trager
Faramarz Valafar (SDSU); Tim Vickers (Washington
Univ)
Michael Cooke, Pete Schultz
Funding: NIGMS, NIH; Novartis Research Foundation