Title: Information Extraction from Social Media
1Information Extraction from Social Media
- Tim Finin
- 10 October 2006
2Overview
- Motivation
- Blogs and feeds
- UMBC research
- Seedling opportunities
- Conclusion
3Motivation
- "Social media describes the online tools and
platforms that people use to share opinions,
insights, experiences, and perspectives with each
other." (Wikipedia, Sept. 2006)
- It's a dynamic and growing area that includes
blogs, wikis, forums, photo and video sharing
sites, etc.
4Motivation
- We started looking at blogs a year ago because
they were rich in metadata
  - Encoded in RDF and other formats
- We've found that blogs and other social media are
a rich source of problems and opportunities,
including
  - Information integration on the Web
  - Modeling trust
  - Extracting facts, opinions and sentiment
  - Event and trend detection
- If static pages form the Web's long-term memory,
then the Blogosphere is its stream of
consciousness
6Overview
- Motivation
- Blogs and feeds
- UMBC research
- Seedling opportunities
- Conclusion
7What are blogs?
- Term is derived from weblog
- Dated entries (posts)
- Reverse chronological order
- RSS feeds
- Comments, trackbacks
- Blogrolls, ads, links
- Profile of the blogger
- Categories and tags
- Types: personal diary, topical, agenda oriented,
PR, business oriented
- Blogging infrastructure: platforms, pings, feeds,
blog search engines
8State of the Blogosphere
- 52 million blogs
- Doubling in size every six months
- 40 new blog posts per second
- 57% of online US teens generate content, 40% read
blogs, 20% have them
- 53% of companies are blogging
- One third of blog posts are in English
- Sources:
  - State of the Blogosphere (Technorati), Fortune
  500 Business Blogging Wiki, Pew (11/05),
  Guideware (10/05), UMBC studies
9Weblogs Cumulative: 03/03 - 07/06
10June 2006 Posts by language
13Profile of a blogger
- 54% are under 30
- 54% are men
- 37% have a degree
- 38% are students
- 51% have been blogging for less than one year
Source: "Bloggers: A portrait of the internet's new
storytellers", Pew Internet, July 2006
14Feeds
- RSS: Really Simple Syndication, Rich Site Summary
or RDF Site Summary
  - 1997: David Winer introduced an XML syndication
  format for blogs
  - 1999: Netscape defined RSS using RDF
- Very important for blogs and other social media
  - An efficient way to distribute new items,
  changes, updates
  - Simplifies infrastructure, obviating crawling
  - Google blog search is really Google feed search
- Feeds for most recent blog posts, Wikipedia
changes, news articles, sensor information,
photos, data elements, etc.
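As a rough illustration of why feeds make crawling unnecessary, the sketch below reads the newest items straight out of an RSS 2.0 document with Python's standard library. The sample feed and the helper name `parse_items` are made up for this example; a real consumer would fetch the XML from a blog's feed URL.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a real blog feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Second post</title>
      <link>http://example.org/2</link>
      <pubDate>Tue, 10 Oct 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
      <pubDate>Mon, 09 Oct 2006 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return (title, link, pubDate) for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append(
            (item.findtext("title"), item.findtext("link"), item.findtext("pubDate"))
        )
    return items


ITEMS = parse_items(SAMPLE_FEED)
for title, link, published in ITEMS:
    print(published, title, link)
```

Each item already carries its own title, link and date, so a consumer only needs to poll one small document per blog rather than crawl its pages.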
15Overview
- Motivation
- Blogs and feeds
- UMBC research
- Seedling opportunities
- Conclusion
16Relevant UMBC Research
- Splog detection
- Feeds that matter
- BlogVox: Extracting opinions from blogs
- Modeling influence in blog communities
- SemNews: NLP for information extraction on the
Web
- SEMDIS: Modeling trust in social networks
17Knowing and influencing the market
- Your goal is to market Apple's iPod phone
- How can you track the buzz about it?
- What are the relevant communities and blogs?
- Which communities are fans, which are suspicious,
which are put off by the hype?
- Is your advertising having an effect? The
desired effect?
- Which bloggers are influential in this market? Of
these, which are already on board and which are
lost causes?
- To whom should you send details or evaluation
samples?
18Modeling influence in social media
- Key individuals in a social network are those
that are influential
- Influential nodes often rely on connectors and
information propagators for new topics
- Influence is topical
- Aggregated beliefs and opinions of the masses can
have an influence
- Influence is polar
- Influence is temporal
19What is Influence?
- Main Entry: influence
Pronunciation: 'in-"flü-n(t)s, esp Southern in-'
Function: noun
Etymology: Middle English, from Middle French, from
Medieval Latin influentia, from Latin influent-,
influens, present participle of influere to flow
in, from in- + fluere to flow -- more at FLUID
1 a: an ethereal fluid held to flow from the stars
and to affect the actions of humans b: an
emanation of occult power held to derive from stars
2: an emanation of spiritual or moral force
3 a: the act or power of producing an effect
without apparent exertion of force or direct
exercise of command b: corrupt interference with
authority for personal gain
4: the power or capacity of causing an effect in
indirect or intangible ways: SWAY
5: one that exerts influence
- under the influence: affected by alcohol: DRUNK
<was arrested for driving under the influence>
21Many Dimensions of Influence
- Overall, what are the most influential blogs?
- What are the influential blogs on topic X?
- What is the influence of a blog on a community C?
- Is this influence positive or negative?
- How do you model influence?
- Link based
- Topical
- Readership-based
- Temporal influence
22Modeling influence in social media
- Key individuals in a social network are those
that are influential
- Influential nodes often rely on connectors and
information propagators for new topics
- Influence is topical
- Aggregated opinions of the masses can have an
influence
- Influence is polar
- Influence is temporal
23Influence on the Blogosphere
24Influence Models for Blogs
[Figure: a Blog Graph over nodes 1-5 and the Influence Graph derived from it, whose edges carry weights such as 1/3, 2/5, 1/5 and 1/2.]
$W_{u,v} = C_{u,v} / d_v$
U links to V => U is influenced by V
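A minimal sketch of how the influence-graph weights could be computed from a list of blog links. The slide does not define its symbols, so this assumes that C_{u,v} counts the parallel links behind one influence edge and that d_v is the total count of influence edges entering v, so the weights into any node sum to 1, as in the figure. The function name and sample links are hypothetical.

```python
from collections import Counter, defaultdict


def influence_weights(blog_links):
    """Turn blog links into weighted influence edges.

    blog_links: iterable of (source, target) pairs, one pair per hyperlink.
    Returns {(u, v): weight}, where u influences v.
    """
    # Reverse every link: if `src` links to `dst`, then `dst` influences `src`.
    counts = Counter((dst, src) for src, dst in blog_links)  # assumed C[u, v]
    in_degree = defaultdict(int)  # assumed d[v]
    for (_, v), c in counts.items():
        in_degree[v] += c
    return {(u, v): c / in_degree[v] for (u, v), c in counts.items()}


# Hypothetical blog graph: "a" links to "b" twice and to "c" once; "b" links to "c".
LINKS = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c")]
WEIGHTS = influence_weights(LINKS)
for edge, weight in sorted(WEIGHTS.items()):
    print(edge, round(weight, 3))
```

With this normalization a blog that cites few sources gives each of them a large weight, while a blog that cites many spreads its weight thinly.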
25Basic Influence Models
- Linear Threshold Model
  - $\sum_{u} w_{u,v} \ge \theta_v$
  - the sum runs over the active neighbors u of v
- Cascade Model
  - $P_{u,v}$: probability with which a node can
  activate each of its neighbors, independent of
  history.
[Figure: Influence Graph with active and inactive nodes, a threshold $\theta_v$ at node v, and edge weights such as 1/3, 2/5, 1/5 and 1/2.]
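The two basic models can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions, not the code behind the slides: edges are stored as `(u, v)` keys meaning "u influences v", thresholds are supplied by the caller, and the function names and toy graph are hypothetical.

```python
import random


def linear_threshold(weights, thresholds, seeds):
    """Activate v once the summed weight of its active neighbors reaches thresholds[v]."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, theta in thresholds.items():
            if v in active:
                continue
            total = sum(w for (u, x), w in weights.items() if x == v and u in active)
            if total >= theta:
                active.add(v)
                changed = True
    return active


def independent_cascade(probs, seeds, rng):
    """Each newly active u gets one chance to activate each neighbor v with probs[(u, v)]."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for (src, v), p in probs.items():
                if src == u and v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Toy influence graph (made-up numbers).
EDGES = {("a", "b"): 0.6, ("b", "c"): 0.5, ("a", "c"): 0.3}
THRESHOLDS = {"a": 0.5, "b": 0.5, "c": 0.7}
LT_RESULT = linear_threshold(EDGES, THRESHOLDS, {"a"})
IC_RESULT = independent_cascade(EDGES, {"a"}, random.Random(0))
print(sorted(LT_RESULT), sorted(IC_RESULT))
```

In the threshold run, "c" only becomes active after "b" does, because neither neighbor alone reaches its threshold of 0.7. The cascade run is random, so its result depends on the seed passed to `random.Random`.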
26Greedy Node Selection Heuristic
- At each time step, select the next node to be
added to the target set such that it maximizes
  - the number of influential nodes
- Adding the new node causes an increase in the
activated node set
- Consistent with Technorati rank
[Figure: Influence Graph with weighted edges.]
Technorati Rank    Count
< 100              40
100 - 500          27
500 - 5000         20
Rest               13
Total              100
Distribution of Technorati ranks in the 100 most
frequently selected nodes using greedy heuristics
(averaged over 50 runs)
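The greedy heuristic can be sketched as follows. As a simplification, this version scores a candidate by the size of the activated set under a linear threshold model with fixed thresholds; the slide's "averaged over 50 runs" suggests the original used randomized runs, which this sketch does not reproduce. All names and numbers are hypothetical.

```python
def spread(weights, thresholds, seeds):
    """Linear-threshold activation: return the set of nodes active at the end."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, theta in thresholds.items():
            if v not in active:
                total = sum(w for (u, x), w in weights.items() if x == v and u in active)
                if total >= theta:
                    active.add(v)
                    changed = True
    return active


def greedy_select(weights, thresholds, k):
    """Pick up to k seeds, each time adding the node with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        base = len(spread(weights, thresholds, chosen))
        best, best_gain = None, 0
        for node in sorted(thresholds):
            if node in chosen:
                continue
            gain = len(spread(weights, thresholds, chosen + [node])) - base
            if gain > best_gain:
                best, best_gain = node, gain
        if best is None:  # no remaining node enlarges the activated set
            break
        chosen.append(best)
    return chosen


# Toy influence graph: a -> b -> c and d -> e, all with weight 1.0.
GRAPH = {("a", "b"): 1.0, ("b", "c"): 1.0, ("d", "e"): 1.0}
LIMITS = {node: 0.5 for node in "abcde"}
SEEDS = greedy_select(GRAPH, LIMITS, 2)
print(SEEDS)
```

On the toy graph the first pick is "a", which activates three nodes, and the second is "d", the only node that still adds two more.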
28Further Questions.
- Key individuals in a social network are those
that are influential.
  - Do communities cluster around these influential
  nodes?
  - Better measures of influence?
  - Does the greedy selection heuristic correlate
  with conversation threads?
- Influential nodes often rely on connectors and
information propagators for new topics.
  - What makes a node a connector?
  - Does a new meme become epidemic only after it is
  picked up by the influential nodes?
29Modeling influence in social media
- Key individuals in a social network are those
that are influential
- Influential nodes often rely on connectors and
information propagators for new topics
- Influence is topical
- Aggregated opinions of the masses can have an
influence
- Influence is polar
- Influence is temporal
30Influence is topical
- Gizmodo is very popular
  - It's influential for consumer electronics, e.g.,
  PDAs, mobile phones, gadgets
- DailyKOS is very popular
  - It's influential for politics, especially liberal
  politics
- What's a good ontology for blog topics?
- How can we categorize blogs w.r.t. a topic
ontology?
31Readership Based Influence
Feeds That Matter: http://ftm.umbc.edu/
- 83K publicly listed subscribers
- 2.8M feeds, 500K are unique
- 26K users (35%) use folders to organize
subscriptions
- Data collected in May 2006
32General Statistics
The number of subscribers per feed follows a
power law distribution.
The number of folders per user. Most users tend
to use a modest number of folders.
33General Statistics
This scatter plot shows the relation between the
number of folders and the number of subscribed
feeds. As subscriptions increase, users tend to
organize them into folders.
Feed readership vs. mean time to post. This graph
shows that popular feeds tend to post more often
on average.
34Tag Cloud Before Merge
35Tag Cloud After Merge
36Tag Merging
Folder names are used as topics. A lower-ranked
folder is merged into a higher-ranked folder if
the two overlap and have a high cosine similarity.
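The merge rule can be sketched with a plain cosine similarity over feed counts. This is a guess at the mechanics, not the Feeds That Matter implementation: each folder name is represented by how often each feed is filed under it, folders are ranked by total count, and the 0.8 cutoff and sample data are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_folders(folders, threshold=0.8):
    """Map each lower-ranked folder name to the higher-ranked name it merges into."""
    ranked = sorted(folders, key=lambda name: sum(folders[name].values()), reverse=True)
    merged_into = {}
    for i, low in enumerate(ranked):
        for high in ranked[:i]:
            if high in merged_into:
                continue  # only merge into folders that survived themselves
            # A similarity above the (positive) threshold implies shared feeds.
            if cosine(folders[low], folders[high]) >= threshold:
                merged_into[low] = high
                break
    return merged_into


# Hypothetical folder names with counts of the feeds filed under them.
FOLDERS = {
    "politics": Counter({"dailykos": 10, "instapundit": 8}),
    "political": Counter({"dailykos": 3, "instapundit": 2}),
    "gadgets": Counter({"gizmodo": 9, "engadget": 7}),
}
MERGES = merge_folders(FOLDERS)
print(MERGES)
```

The sketch only records which names merge; a fuller version would also add the merged folder's counts to its target before comparing further folders.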
37Finding Influential Feeds using Co-Citations
Feed recommendations
Leading blogs about politics. The seed set is the
top politics blogs from Bloglines, and the blog
graph is from the BlogPulse dataset.
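One simple reading of co-citation ranking is sketched below: a blog scores higher the more often it is cited alongside seed blogs by the same citing blog. The slide does not spell out the scoring, so treat the formula, the function name and the sample links as assumptions.

```python
from collections import Counter


def cocitation_scores(links, seeds):
    """Score non-seed blogs by how often they are cited together with seed blogs.

    links: iterable of (citing_blog, cited_blog) pairs.
    seeds: set of blogs already known to be about the topic.
    """
    cited_by = {}
    for citer, cited in links:
        cited_by.setdefault(citer, set()).add(cited)
    scores = Counter()
    for cited_set in cited_by.values():
        shared_seeds = len(cited_set & seeds)
        if shared_seeds:
            for blog in cited_set - seeds:
                scores[blog] += shared_seeds
    return scores


# Hypothetical link graph and seed set.
SAMPLE_LINKS = [
    ("x", "seed1"), ("x", "cand1"),
    ("y", "seed1"), ("y", "seed2"), ("y", "cand1"),
    ("z", "cand2"),
]
SCORES = cocitation_scores(SAMPLE_LINKS, {"seed1", "seed2"})
print(SCORES.most_common())
```

Here "cand1" is recommended because two different blogs cite it next to seed blogs, while "cand2" is never co-cited with a seed and gets no score.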
38Further Questions.
- Influence is topical
- At an aggregate level
- Topic detection and
- Identifying communities
- At an individual blogger's level
- The blogger's influence in the topic
- Currently readership-based
- Can be strength of the node in the community for
topic X
39Modeling influence in social media
- Key individuals in a social network are those
that are influential.
- Influential nodes often rely on connectors and
information propagators for new topics.
- Influence is topical.
- Aggregated facts and opinions of the masses can
have an influence (wisdom of the crowds)
- Influence is polar.
- Influence is temporal.
40Extracting facts and opinions
- 2006 TREC blog track: finding opinionated blog
posts about a given topic
- SemNews: extracting facts from Web documents
using the OntoSem NLP system
- Note: there are several startups and other
companies trying to commercialize opinion mining
43TREC Opinion Extraction
- Finding opinionated posts, either positive or
negative, about a query
- 2006 TREC Blog corpus
  - 80K blogs
  - 300K posts
  - 50 test queries
44BlogVox Opinion Extraction
[Diagram: BlogVox opinion extraction pipeline]
- Query Terms
- Lucene Search Results
- Result Scoring
  1. Query Word Proximity Scorer
  2. Query Word Count Scorer
  3. Title Word Scorer
  4. First Occurrence Scorer
  5. Context Words Scorer
  6. Lucene Relevance Score
- SVM Score Combiner
- Opinionated Ranked Results
- Supporting Lexicons: Positive Word List, Negative
Word List
- External Resources: Google Context Words, Amazon
Review Words
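To make the scorer names concrete, here is a small sketch of three per-post features in the spirit of the Query Word Count, First Occurrence and Title Word scorers. The feature definitions are illustrative guesses based only on the component names, and the SVM combiner is not reproduced.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score_features(query, title, body):
    """Return simple per-post features; the definitions are illustrative guesses."""
    q_terms = set(tokenize(query))
    body_tokens = tokenize(body)
    positions = [i for i, tok in enumerate(body_tokens) if tok in q_terms]
    title_hits = q_terms & set(tokenize(title))
    return {
        # How many body tokens are query words.
        "query_word_count": len(positions),
        # Relative position of the first query word (1.0 if it never occurs).
        "first_occurrence": positions[0] / len(body_tokens) if positions else 1.0,
        # Fraction of distinct query words that appear in the title.
        "title_word_match": len(title_hits) / len(q_terms) if q_terms else 0.0,
    }


FEATURES = score_features(
    "hybrid cars",
    "Why I love my hybrid",
    "I bought one of the new hybrid cars and I love it",
)
print(FEATURES)
```

A combiner such as the SVM on the slide would take a feature vector like this, together with lexicon-based features, and produce one ranking score per post.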
45Separating Blog Wheat from Blog Chaff
- Data cleaning for
- Splog removal
- Post content identification
46Spam in the Blogosphere
- Types: comment spam, ping spam, splogs
- Akismet: 87% of all comments are spam
- 75% of update pings are spam (ebiquity 2005)
- 56% of blogs are spam (ebiquity 2005)
- 20% of blogs indexed by popular blog search
engines are spam (Umbria 2006, ebiquity 2005)
- Spam blogs (splogs) are weblogs used to promote
affiliated websites or host ads
- Spings, or ping spam, are pings that are sent
from spam blogs
[1] Wikipedia
47Motivation: host ads
48Motivation: index affiliates, promote PageRank
49Some queries returned mostly splogs
hybrid cars
cholesterol
50Influence of Splogs
51Post Content Identification
- Baseline Heuristic
- SVM Method
52Effect of sidebar content
53Preliminary results
54Modeling influence in social media
- Key individuals in a social network are those
that are influential
- Influential nodes often rely on connectors and
information propagators for new topics
- Influence is topical
- Aggregated opinions of the masses can have an
influence
- Influence is polar
- Influence is temporal
55Link Polarity / Citation Signal
- Linking alone is not an indicator of influence
- Polarity can indicate the type of influence
- Not all links are made equal
  - Post
  - Comment
  - Trackback
  - Blogroll
  - Advertising
- Polarity is useful in other applications such as
trust and bias.
[Figure: link graph over blogs A, B, C and D with edges labeled by <topic, polarity> pairs: <books, -0.9>, <Movies, 0.9>, <food, 0.3>, <cars, 0.5>, <Movies, 0.8>, <Music, -0.6>]
56Modeling influence in social media
- Key individuals in a social network are those
that are influential
- Influential nodes often rely on connectors and
information propagators for new topics
- Influence is topical
- Aggregated opinions of the masses can have an
influence
- Influence is polar
- Influence is temporal
57Unwind the Influence in Time
- Who started the initial wave?
- Who jumped on the story at the same time?
- How far did the wave propagate?
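[Figure: a story spreading outward from a source S, with nodes labeled by the times t1 through t5 at which they picked it up.]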
58Visualizing Influence in Time
59SemNews News to OWL
- Semantically search and browse news
- Aggregators collect the RSS news descriptions
from various sources.
- The sentences are processed by OntoSem and are
converted into TMRs
- And then into RDF and OWL
- Provides intelligent agents with the latest news
in a machine-readable format
- http://semnews.umbc.edu/
61Agent understandable news
Provides RDF version of the news.
62Semanticizing RSS feeds
View structured representation of the RSS news
story.
Future versions would enable editing the facts
and provide provenance information
63News stories are ontologically linked
Find news stories by browsing through the OntoSem
ontology.
64Tracking Named Entities
Find stories on a specific named entity.
65Browsing Facts
Fact repository explorer for named entity
Mexico shows that it has a relation
nationality-of with CITIZEN-235
Fact repository explorer for instance CITIZEN-235
shows that the citizen is an agent of ESCAPE-EVENT
66Querying the semanticized RSS
RDQL Queries
Provides structured querying over text
represented in RDF.
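To show what structured querying over extracted facts buys, the sketch below matches triple patterns against the two facts from the "Browsing Facts" slide. It is a plain-Python stand-in rather than RDQL or the SemNews query engine, and the predicate spelling `agent-of` and the helper names are assumptions.

```python
# Facts from the "Browsing Facts" slide, written as (subject, predicate, object).
TRIPLES = [
    ("Mexico", "nationality-of", "CITIZEN-235"),
    ("CITIZEN-235", "agent-of", "ESCAPE-EVENT"),
]


def match(triples, pattern):
    """Yield variable bindings for one pattern; terms starting with '?' are variables."""
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                if binding.get(want, have) != have:
                    break  # same variable bound to two different values
                binding[want] = have
            elif want != have:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join several patterns on their shared variables."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for partial in results:
            bound = tuple(partial.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                next_results.append({**partial, **binding})
        results = next_results
    return results


# "Which events involve someone whose nationality is Mexico?"
ANSWERS = query(
    TRIPLES,
    [("Mexico", "nationality-of", "?who"), ("?who", "agent-of", "?event")],
)
print(ANSWERS)
```

A keyword search for "Mexico" and "escape" could not express the join through CITIZEN-235; the pattern query follows it explicitly.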
67Semantic Alerts
Alerts can be specified as ontological concepts/
keywords / RDQL queries. Subscribe to results of
structured queries
68Beyond keyword search
- Conceptually searching for content
  - Find all news stories that have something to do
  with a place and a terrorist activity.
- Context based querying
  - Find all events in which George Bush was the
  speaker.
- Reporting facts
  - Find all politicians who traveled to Asia.
- Knowledge sharing
  - Populating instances by mapping FOAF and DC to
  the OntoSem ontology.
69SEMDIS
On Homeland Security and the Semantic Web: a
Provenance and Trust Aware Inference Framework
Semantic Association Discovery and Evaluation
1. Motivation
- Semantic association between X and Bin Laden
- Provenance
  - Multiple sources contribute unique fragments of
  association
  - Multiple sources confirm a fragment in different
  belief states
- Rank: only some discovered associations are
interesting
- Trust: some information sources not sufficiently
trustworthy
2. Architecture
- Collaborative implementation
- University of Georgia
- Extracting knowledge from the Web
- Discovering complex semantic association
- Ranking semantic association by content
- UMBC
- Tracking provenance of semantic association
- Trusting semantic association by context
- Enabling best-first search using trust heuristics
70SEMDIS
On Homeland Security and the Semantic Web: a
Provenance and Trust Aware Inference Framework
Semantic Association Discovery and Evaluation
3. Trust
- Trustworthiness of an RDF graph
  - The hypothesis "Mr X is associated with Bin
  Laden" is proved by a four-triple semantic
  association (SA); how do we evaluate the SA's
  trustworthiness?
  - S1: eg:MrX eg:isPresidentOf eg:companyA
  - S2: eg:organizationB eg:invests eg:companyA
  - S3: eg:organizationB eg:isOwnedBy eg:MrY
  - S4: eg:MrY eg:relatesTo eg:BinLaden
- Trust relations between agents help propagate
belief states
  - case 1 (belief concatenation): exactly one source
  per triple
  - case 2 (belief aggregation): multiple sources for
  a triple
  - case 3 (social dependency): sources are dependent
  through a social network
- We assume all triples are semantically independent
4. Provenance
- Provenance of an RDF graph or sub-graph
  - Three sources of an RDF graph, G
    - where-provenance: the web documents that
    serialize G
    - whom-provenance: the person who
    created/published G
    - why-provenance: the RDF graphs which logically
    imply G
- RDF graph provenance service
  - Observations
    - provenance information is part of context
    information
    - provenance is not required for most inference
    tasks
    - provenance is useful for context based trust
    analysis
    - provenance can be used to group knowledge
  - Approach
    - provide a stand-alone service that queries the
    provenance of a given RDF graph or sub-graph
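The first two trust cases on this slide can be sketched numerically. The slide does not give the combination rules, so both are assumptions here: multiplying per-triple beliefs follows from treating the triples as independent, and noisy-OR is just one plausible way to aggregate several sources for the same triple. The belief values are invented.

```python
from functools import reduce


def aggregate(beliefs):
    """Combine several sources' beliefs in one triple (noisy-OR; an assumed rule)."""
    return 1.0 - reduce(lambda acc, b: acc * (1.0 - b), beliefs, 1.0)


def association_trust(beliefs_per_triple):
    """Multiply per-triple beliefs, treating the triples as independent."""
    trust = 1.0
    for beliefs in beliefs_per_triple:
        trust *= aggregate(beliefs)
    return trust


# Made-up beliefs for triples S1..S4; S3 is asserted by two sources.
SA_BELIEFS = [[0.9], [0.8], [0.5, 0.5], [1.0]]
SA_TRUST = association_trust(SA_BELIEFS)
print(round(SA_TRUST, 3))
```

Case 3 (socially dependent sources) is deliberately left out: once sources influence each other, their beliefs can no longer be combined as if they were independent.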
71Overview
- Motivation
- Blogs and feeds
- UMBC research
- Seedling opportunities
- Conclusion
72Some opportunities
- Modeling bias in information sources
  - Place information sources in a space defined by
  their opinions on 1000 issues
- Mining sentiments and opinions from online
communities
  - Who cares about what? Trend spotting
- Extracting facts and beliefs from social media
  - Was flight 94 shot down?
- Detecting events and new issues
  - e.g., London bombing (July 2005), E. coli and
  produce
73Overview
- Motivation
- Blogs and feeds
- UMBC research
- Seedling opportunities
- Conclusion
74Conclusions
- Social media increasingly important
  - Its greatest growth is outside the U.S.
- It's possible to extract lots of potentially
valuable information
  - Metadata, social networks, opinions and beliefs
- If static pages form the Web's long-term memory,
then the Blogosphere is its stream of
consciousness
  - Interests, fears, obsessions, questions, etc.
75http://ebiquity.umbc.edu/
76Ontological Semantics
OntoSem is a natural language processing system
that processes text and converts it into facts.
It is supported by a constructed world model
encoded in a rich ontology.
77Ontological Semantics
78Static Knowledge Sources
- Ontology
- 8000 concepts
- Avg 16 properties each
- Lexicons
- English 45000 entries
- Spanish 40000 entries
- Chinese 3000 entries
- Fact repository
- 20000 facts
- Onomasticon
- NNNNN names
79The OntoSem Ontology
ONTOLOGY ::= CONCEPT+
CONCEPT ::= ROOT | OBJECT-OR-EVENT | PROPERTY
SLOT ::= PROPERTY + FACET + FILLER
80Text Meaning Representation (TMR)
Word sense addressed disambiguated
A persistent fact stored in the FR
Semantic dependency established
81Text Meaning Representation (TMR)
REQUEST-ACTION-69
  AGENT             HUMAN-72
  THEME             ACCEPT-70
  BENEFICIARY       ORGANIZATION-71
  SOURCE-ROOT-WORD  ask
  TIME              (< (FIND-ANCHOR-TIME))
ACCEPT-70
  THEME             WAR-73
  THEME-OF          REQUEST-ACTION-69
  SOURCE-ROOT-WORD  authorize
ORGANIZATION-71
  HAS-NAME          United-Nations
  BENEFICIARY-OF    REQUEST-ACTION-69
  SOURCE-ROOT-WORD  UN
HUMAN-72
  HAS-NAME          Colin Powell
  AGENT-OF          REQUEST-ACTION-69
  SOURCE-ROOT-WORD  he (reference resolution has been carried out)
WAR-73
  THEME-OF          ACCEPT-70
  SOURCE-ROOT-WORD  war
He asked the UN to authorize the war.
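One way to see how a TMR becomes machine-readable facts is to hold the frames in a dictionary and flatten them into subject-property-value triples. This is only an illustrative sketch: it is not the OntoSem2OWL mapping, it keeps a subset of the slots above, and the `instance-of` property and function name are made up.

```python
# A subset of the TMR frames for "He asked the UN to authorize the war."
TMR = {
    "REQUEST-ACTION-69": {
        "AGENT": "HUMAN-72",
        "THEME": "ACCEPT-70",
        "BENEFICIARY": "ORGANIZATION-71",
    },
    "ACCEPT-70": {"THEME": "WAR-73"},
    "ORGANIZATION-71": {"HAS-NAME": "United-Nations"},
    "HUMAN-72": {"HAS-NAME": "Colin Powell"},
    "WAR-73": {},
}


def tmr_to_triples(tmr):
    """Flatten TMR frames into (subject, property, value) triples."""
    triples = []
    for instance, slots in tmr.items():
        # "HUMAN-72" is instance number 72 of the concept HUMAN.
        concept = instance.rsplit("-", 1)[0]
        triples.append((instance, "instance-of", concept))  # hypothetical property
        for prop, filler in slots.items():
            triples.append((instance, prop, filler))
    return triples


TMR_TRIPLES = tmr_to_triples(TMR)
for triple in TMR_TRIPLES:
    print(triple)
```

Each triple could then be serialized as an RDF statement once the instance and property names are given URIs in a chosen namespace.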
82Mapping OntoSem to web based KR
[Diagram: NL Text is processed by OntoSem, which draws on the Lexicon, Ontology and Fact Repository, to produce a TMR; OntoSem2OWL produces TMRs in OWL and the OWL Ontology.]