1
Information Extraction from Social Media
  • Tim Finin
  • 10 October 2006

2
Overview
  • Motivation
  • Blogs and feeds
  • UMBC research
  • Seedling opportunities
  • Conclusion

3
Motivation
  • Social media describes the online tools and
    platforms that people use to share opinions,
    insights, experiences, and perspectives with each
    other.
  • Wikipedia, Sept 06

It's a dynamic and growing area that includes
blogs, wikis, forums, photo and video sharing
sites, etc.
4
Motivation
  • We started looking at blogs a year ago because
    they were rich in metadata
  • Encoded in RDF and other formats
  • We've found that blogs and other social media are
    a rich source of problems and opportunities,
    including
  • Information integration on the Web
  • Modeling trust
  • Extracting facts, opinions and sentiment
  • Event and trend detection
  • If static pages form the Web's long term memory,
    then the Blogosphere is its stream of
    consciousness

6
Overview
  • Motivation
  • Blogs and feeds
  • UMBC research
  • Seedling opportunities
  • Conclusion

7
What are blogs?
  • Term is derived from weblog
  • Dated entries (posts)
  • Reverse chronological order
  • RSS feeds
  • Comments, trackbacks
  • Blogrolls, ads, links
  • Profile of the blogger
  • Categories and tags
  • Types: personal diary, topical, agenda oriented,
    PR, business oriented
  • Blogging infrastructure: platforms, pings, feeds,
    blog search engines

8
State of the Blogosphere
  • 52 million blogs
  • Doubling in size every six months
  • 40 new blog posts per second
  • 57% of online US teens generate content, 40% read
    blogs, 20% have them
  • 53% of companies are blogging
  • One third of blog posts are in English
  • Sources
  • State of the Blogosphere (Technorati), Fortune
    500 Business Blogging Wiki, Pew 11/05,
    Guideware 10/05, UMBC studies

9
Weblogs cumulative, 03/03 to 07/06
10
June 2006 Posts by language
13
Profile of a blogger
  • 54% are under 30
  • 54% are men
  • 37% have a degree
  • 38% are students
  • 51% have been blogging for less than one year

Source: Bloggers: A portrait of the nation's new
storytellers, Pew Internet, July 2006
14
Feeds
  • RSS: Really Simple Syndication, Rich Site Summary,
    or RDF Site Summary
  • 1997: David Winer introduced an XML syndication
    format for blogs
  • 1999: Netscape defined RSS using RDF
  • Very important for blogs and other social media
  • An efficient way to distribute new items,
    changes, updates
  • Simplifies infrastructure, obviating crawling
  • Google blogs search is really Google feed search
  • Feeds for most recent blog posts, Wikipedia
    changes, news articles, sensor information,
    photos, data elements, etc.
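
As a rough illustration of why feeds can stand in for crawling, the sketch below reads a small RSS 2.0 document with Python's standard library and lists its items. The feed text, its URLs, and the helper name `parse_rss_items` are made up for this example; a real aggregator would also need to handle Atom, RDF-based RSS 1.0, and malformed XML.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed (hypothetical blog and URLs).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.org/</link>
    <item>
      <title>Second post</title>
      <link>http://example.org/2</link>
      <pubDate>Tue, 10 Oct 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
      <pubDate>Mon, 09 Oct 2006 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return (title, link, pubDate) for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("pubDate", default=""),
        ))
    return items


items = parse_rss_items(SAMPLE_FEED)
for title, link, pub_date in items:
    print(pub_date, title, link)
```

Each item already carries its own link and timestamp, so a consumer only has to poll one small document per source to see what is new.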

15
Overview
  • Motivation
  • Blogs and feeds
  • UMBC research
  • Seedling opportunities
  • Conclusion

16
Relevant UMBC Research
  • Splog detection
  • Feeds that matter
  • BlogVox: extracting opinions from blogs
  • Modeling influence in blog communities
  • SemNews: NLP for information extraction on the
    Web
  • SEMDIS: modeling trust in social networks

17
Knowing and influencing the market
  • Your goal is to market Apple's iPod phone
  • How can you track the buzz about it?
  • What are the relevant communities and blogs?
  • Which communities are fans, which are suspicious,
    which are put off by the hype?
  • Is your advertising having an effect?
    The desired effect?
  • Which bloggers are influential in this market? Of
    these, which are already onboard and which are
    lost causes?
  • To whom should you send details or evaluation
    samples?

18
Modeling influence in social media
  • Key individuals in a social network are those
    that are influential
  • Influential nodes often rely on connectors and
    information propagators for new topics
  • Influence is topical
  • Aggregated beliefs and opinions of the masses can
    have an influence
  • Influence is polar
  • Influence is temporal

19
What is Influence?
  • Main Entry: influence
  • Pronunciation: 'in-"flü-n(t)s, esp Southern in-'
  • Function: noun
  • Etymology: Middle English, from Middle French,
    from Medieval Latin influentia, from Latin
    influent-, influens, present participle of
    influere, to flow in, from in- + fluere, to flow
    (more at FLUID)
  • 1 a: an ethereal fluid held to flow from the
    stars and to affect the actions of humans; b: an
    emanation of occult power held to derive from
    stars
  • 2: an emanation of spiritual or moral force
  • 3 a: the act or power of producing an effect
    without apparent exertion of force or direct
    exercise of command; b: corrupt interference with
    authority for personal gain
  • 4: the power or capacity of causing an effect in
    indirect or intangible ways: SWAY
  • 5: one that exerts influence
  • under the influence: affected by alcohol: DRUNK
    <was arrested for driving under the influence>

21
Many Dimensions of Influence
  • Overall, what are the most influential blogs?
  • What are the influential blogs on topic X?
  • What is the influence of a blog on a community C?
  • Is this influence positive or negative?
  • How do you model influence?
  • Link based
  • Topical
  • Readership-based
  • Temporal influence

22
Modeling influence in social media
  • Key individuals in a social network are those
    that are influential
  • Influential nodes often rely on connectors and
    information propagators for new topics
  • Influence is topical
  • Aggregated opinions of the masses can have an
    influence
  • Influence is polar
  • Influence is temporal

23
Influence on the Blogosphere
24
Influence Models for Blogs
[Figure: a blog graph and the influence graph derived
from it, with nodes U, V and 1 to 5, and edge weights
such as 1/3, 2/5, 1/5, and 1/2]
W_{u,v} = C_{u,v} / d_v
U links to V => U is influenced by V
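
One possible reading of the weight formula can be sketched in Python. The slide does not define C or d, so this sketch assumes that the count is the number of parallel links from the citing blog to the cited blog, and that the normalizer is the total number of links the citing blog makes (its in-degree in the influence graph). Under that assumption the weights flowing into each blog sum to 1, which matches fractions like 2/5, 2/5, 1/5 in the figure. The function name and the sample links are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def influence_weights(blog_links):
    """Map blog-graph links to influence-graph edge weights.

    blog_links: list of (citing, cited) pairs, repeated for parallel links.
    Returns {(influencer, influenced): weight}.
    """
    blog_links = list(blog_links)
    link_counts = Counter(blog_links)
    # Total links made by each citing blog (assumed normalizer).
    links_made = Counter(citing for citing, _ in blog_links)
    weights = {}
    for (citing, cited), count in link_counts.items():
        # Reverse the edge: the cited blog influences the citing blog.
        weights[(cited, citing)] = Fraction(count, links_made[citing])
    return weights


links = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "d"), ("a", "d"), ("e", "a")]
weights = influence_weights(links)
for (influencer, influenced), w in sorted(weights.items()):
    print(f"{influencer} -> {influenced}: {w}")
```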
25
Basic Influence Models
  • Linear Threshold Model
  • Node v becomes active when Σ W_{u,v} ≥ θ_v
  • The sum runs over the active neighbors u of v;
    θ_v is the threshold of v
  • Cascade Model
  • P_{u,v}: the probability with which a node can
    activate each of its neighbors, independent of
    history

[Figure: influence graph with active and inactive
nodes, weighted edges, and the threshold θ_v]
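
The two models can be sketched as small simulations. This is a minimal illustration rather than the code behind the slides: edge weights, thresholds, and probabilities are invented, thresholds are passed in explicitly instead of being drawn at random, and the function names are hypothetical.

```python
import random


def linear_threshold(weights, thresholds, seeds):
    """weights[(u, v)] is the influence of u on v; thresholds[v] is in [0, 1]."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        incoming = {}
        # Sum the weights arriving at each inactive node from active neighbors.
        for (u, v), w in weights.items():
            if u in active and v not in active:
                incoming[v] = incoming.get(v, 0.0) + w
        for v, total in incoming.items():
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def independent_cascade(probs, seeds, rng):
    """probs[(u, v)] is the chance that a newly active u activates v (one attempt)."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for (src, v), p in probs.items():
                if src == u and v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


weights = {("a", "b"): 0.5, ("c", "b"): 0.5, ("b", "d"): 1.0}
thresholds = {"b": 0.6, "d": 0.5}
lt_one_seed = linear_threshold(weights, thresholds, {"a"})
lt_two_seeds = linear_threshold(weights, thresholds, {"a", "c"})

probs = {("a", "b"): 1.0, ("b", "c"): 0.0}
cascade = independent_cascade(probs, ["a"], random.Random(0))
print(sorted(lt_one_seed), sorted(lt_two_seeds), sorted(cascade))
```

In the threshold example, one active neighbor contributes only 0.5 toward b's threshold of 0.6, so nothing spreads; two active neighbors push b over the threshold, and b then activates d.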
26
Greedy Node Selection Heuristic
  • At each time step, select the next node to be
    added to the target set such that it maximizes
    the number of influential nodes
  • Adding the new node causes an increase in the
    activated node set
  • Consistent with Technorati rank

[Figure: influence graph]

Technorati Rank   Count
< 100             40
100 - 500         27
500 - 5000        20
Rest              13
Total             100

Distribution of Technorati ranks in the 100 most
frequently selected nodes using greedy heuristics
(averaged over 50 runs)
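
A greedy selection loop of this kind might look like the sketch below. It assumes a linear threshold spread with fixed, invented thresholds so that the result is repeatable; the experiments described above were averaged over 50 runs, which suggests randomized thresholds, and that part is not reproduced here. All names and numbers are hypothetical.

```python
def spread(weights, thresholds, seeds):
    """Return the set of nodes active once a linear threshold process settles."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        incoming = {}
        for (u, v), w in weights.items():
            if u in active and v not in active:
                incoming[v] = incoming.get(v, 0.0) + w
        for v, total in incoming.items():
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def greedy_select(weights, thresholds, candidates, k):
    """Pick up to k seeds, each time adding the node with the largest gain in spread."""
    selected = []
    for _ in range(k):
        base = len(spread(weights, thresholds, selected))
        best, best_gain = None, 0
        for node in sorted(candidates):
            if node in selected:
                continue
            gain = len(spread(weights, thresholds, selected + [node])) - base
            if gain > best_gain:
                best, best_gain = node, gain
        if best is None:
            break  # no remaining node enlarges the activated set
        selected.append(best)
    return selected


weights = {("a", "b"): 1.0, ("a", "c"): 1.0, ("d", "e"): 1.0,
           ("b", "f"): 0.5, ("e", "f"): 0.5}
thresholds = {"b": 0.5, "c": 0.5, "e": 0.5, "f": 1.0}
chosen = greedy_select(weights, thresholds, ["a", "b", "c", "d", "e", "f"], 2)
print(chosen)
```

Node a is picked first because it activates b and c; d is picked second because, together with a, it also tips f over its threshold.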
28
Further Questions.
  • Key individuals in a social network are those
    that are influential.
  • Do communities cluster around these influential
    nodes?
  • Better measures of influence?
  • Does the greedy selection heuristic correlate with
    conversation threads?
  • Influential nodes often rely on connectors and
    information propagators for new topics.
  • What makes a node a connector?
  • Does a new meme become epidemic only after it is
    picked up by the influential nodes?

29
Modeling influence in social media
  • Key individuals in a social network are those
    that are influential
  • Influential nodes often rely on connectors and
    information propagators for new topics
  • Influence is topical
  • Aggregated opinions of the masses can have an
    influence
  • Influence is polar
  • Influence is temporal

30
Influence is topical
  • Gizmodo is very popular
  • It's influential for consumer electronics, e.g.,
    PDAs, mobile phones, gadgets
  • DailyKOS is very popular
  • It's influential for politics, especially liberal
    politics
  • What's a good ontology for blog topics?
  • How can we categorize blogs w.r.t. a topic
    ontology?

31
Readership Based Influence
Feeds That Matter: http://ftm.umbc.edu/
  • 83K publicly listed subscribers
  • 2.8M feeds, 500K are unique
  • 26K users (35%) use folders to organize
    subscriptions
  • Data collected in May 2006

32
General Statistics
The number of subscribers per feed follows a
power law distribution.
The number of folders per user. Most users tend
to use a modest number of folders.
33
General Statistics
This scatter plot shows the relation between the
number of folders and the number of subscribed
feeds. As subscriptions increase, users tend to
organize them into folders.
Feed readership vs. mean time to post. This graph
shows that popular feeds tend to post more
often on average.
34
Tag Cloud Before Merge
35
Tag Cloud After Merge
36
Tag Merging
Folder names are used as topics. A lower-ranked
folder is merged into a higher-ranked folder if
there is an overlap and a high cosine similarity.
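
The merging rule can be sketched as follows. Several details are assumptions made for illustration: each folder is represented as a vector of feed counts, folders are ranked by total usage, "overlap" is read as sharing at least one feed (which any positive cosine already implies), and the 0.5 threshold is arbitrary. The folder names and counts are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_folders(folders, threshold=0.5):
    """folders: {name: {feed: count}}. Returns {lower_ranked_name: higher_ranked_name}."""
    # Rank folders by total usage, most used first (ties broken by name).
    ranked = sorted(folders, key=lambda n: (-sum(folders[n].values()), n))
    merged_into = {}
    for i, low in enumerate(ranked):
        for high in ranked[:i]:
            if high in merged_into:
                continue  # only merge into folders that were kept
            if cosine(folders[low], folders[high]) >= threshold:
                merged_into[low] = high
                break
    return merged_into


folders = {
    "politics": {"feed1": 10, "feed2": 8},
    "political": {"feed1": 3, "feed2": 2},
    "gadgets": {"feed3": 9, "feed4": 7},
    "tech": {"feed3": 1, "feed5": 4},
}
merges = merge_folders(folders)
print(merges)
```

Here "political" folds into "politics" because the two folders hold nearly the same feeds in similar proportions, while "tech" shares only one feed with "gadgets" and stays separate.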
37
Finding Influential Feeds using Co-Citations
Feed recommendations
Leading blogs about politics. The seed set is the top
blogs in politics from Bloglines, and the blog graph
used is from the Blogpulse dataset.
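
One simple way to turn co-citation into feed recommendations is to count how often a blog is linked from the same source as a seed blog. The sketch below assumes that scoring rule; the slides do not spell out the exact measure, and the blog names and links are invented.

```python
from collections import Counter, defaultdict


def cocitation_scores(links, seeds):
    """links: (citing, cited) pairs. Score non-seed blogs by co-citation with seeds."""
    outlinks = defaultdict(set)
    for citing, cited in links:
        outlinks[citing].add(cited)
    seeds = set(seeds)
    scores = Counter()
    for cited_set in outlinks.values():
        seed_hits = len(cited_set & seeds)
        if seed_hits == 0:
            continue
        # Every non-seed blog cited alongside a seed earns credit.
        for blog in cited_set - seeds:
            scores[blog] += seed_hits
    return scores


links = [
    ("x", "seed_blog"), ("x", "blog_p"),
    ("y", "seed_blog"), ("y", "blog_p"), ("y", "blog_q"),
    ("z", "blog_q"), ("z", "blog_r"),
]
scores = cocitation_scores(links, {"seed_blog"})
print(scores.most_common())
```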
38
Further Questions.
  • Influence is topical
  • At an aggregate level
  • Topic detection and
  • Identifying communities
  • At an individual blogger's level
  • The blogger's influence in the topic
  • Currently readership-based
  • Can be strength of the node in the community for
    topic X

39
Modeling influence in social media
  • Key individuals in a social network are those
    that are influential.
  • Influential nodes often rely on connectors and
    information propagators for new topics.
  • Influence is topical.
  • Aggregated facts and opinions of the masses can
    have an influence (wisdom of the crowds)
  • Influence is polar.
  • Influence is temporal.

40
Extracting facts and opinions
  • 2006 TREC blog track: finding opinionated blog
    posts about a given topic
  • SemNews: extracting facts from Web documents
    using the OntoSem NLP system
  • Note: there are several startups and other
    companies trying to commercialize opinion mining

43
TREC Opinion Extraction
  • Finding opinionated posts, either positive or
    negative, about a query
  • 2006 TREC Blog corpus
  • 80K blogs
  • 300K posts
  • 50 test queries

44
BlogVox Opinion Extraction
Result Scoring
  • 1. Query Word Proximity Scorer
  • 2. Query Word Count Scorer
  • 3. Title Word Scorer
  • 4. First Occurrence Scorer
  • 5. Context Words Scorer
  • 6. Lucene Relevance Score
External Resources (Supporting Lexicons)
  • Positive Word List
  • Negative Word List
  • Google Context Words
  • Amazon Review Words
[Diagram labels: Query Terms, Lucene Search Results,
SVM Score Combiner, Opinionated Ranked Results]
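
To give a feel for per-post scorers of this kind, the sketch below computes a few simple features for one post. It is not the BlogVox implementation: the feature definitions, the function name, and the tiny word lists are assumptions, and the SVM that would combine the scores is left out.

```python
def score_features(query, title, body, positive_words, negative_words):
    """Compute a few illustrative scores for one post (hypothetical definitions)."""
    q_terms = query.lower().split()
    title_tokens = title.lower().split()
    body_tokens = body.lower().split()

    # Query word count: how often any query term appears in the body.
    count = sum(1 for t in body_tokens if t in q_terms)

    # First occurrence: relative position of the first query term (1.0 if absent).
    first = next((i for i, t in enumerate(body_tokens) if t in q_terms), None)
    first_pos = first / len(body_tokens) if first is not None else 1.0

    # Title word: fraction of query terms that appear in the title.
    title_hits = sum(1 for q in q_terms if q in title_tokens)
    title_score = title_hits / len(q_terms) if q_terms else 0.0

    # Lexicon hits: opinion-bearing words from the supporting word lists.
    opinion = sum(1 for t in body_tokens
                  if t in positive_words or t in negative_words)

    return {"query_word_count": count, "first_occurrence": first_pos,
            "title_word": title_score, "opinion_words": opinion}


features = score_features(
    query="hybrid cars",
    title="Why I love hybrid cars",
    body="I think hybrid cars are great but the price is terrible",
    positive_words={"great", "love"},
    negative_words={"terrible"},
)
print(features)
```

A combiner would take one such feature vector per retrieved post, together with the search engine's relevance score, and learn how to weigh them.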
45
Separating Blog Wheat from Blog Chaff
  • Data cleaning for
  • Splog removal
  • Post content identification

46
Spam in the Blogosphere
  • Types: comment spam, ping spam, splogs
  • Akismet: 87% of all comments are spam
  • 75% of update pings are spam (ebiquity 2005)
  • 56% of blogs are spam (ebiquity 2005)
  • 20% of blogs indexed by popular blog search
    engines are spam (Umbria 2006, ebiquity 2005)
  • Spam blogs (splogs) are weblogs used to promote
    affiliated websites or host ads
  • Spings, or ping spam, are pings that are sent
    from spam blogs

Source: Wikipedia
47
Motivation: host ads
48
Motivation: index affiliates, promote PageRank
49
Some queries returned mostly splogs
hybrid cars
cholesterol
50
Influence of Splogs
51
Post Content Identification
  • Baseline Heuristic
  • SVM Method

52
Effect of sidebar content
53
Preliminary results
54
Modeling influence in social media
  • Key individuals in a social network are those
    that are influential
  • Influential nodes often rely on connectors and
    information propagators for new topics
  • Influence is topical
  • Aggregated opinions of the masses can have an
    influence
  • Influence is polar
  • Influence is temporal

55
Link Polarity / Citation Signal
  • Linking alone is not an indicator of influence
  • Polarity can indicate the type of influence
  • Not all links are made equal
  • Post
  • Comment
  • Trackback
  • Blogroll
  • Advertising
  • Polarity useful in other applications like trust
    and bias.

[Figure: blogs A, B, C, and D joined by links labeled
with <topic, polarity> pairs: <books, -0.9>,
<Movies, 0.9>, <food, 0.3>, <cars, 0.5>,
<Movies, 0.8>, <Music, -0.6>]
56
Modeling influence in social media
  • Key individuals in a social network are those
    that are influential
  • Influential nodes often rely on connectors and
    information propagators for new topics
  • Influence is topical
  • Aggregated opinions of the masses can have an
    influence
  • Influence is polar
  • Influence is temporal

57
Unwind the Influence in Time
  • Who started the initial wave?
  • Who jumped on the story at the same time?
  • How far did the wave propagate?

[Figure: a story spreading outward from source S,
with links labeled by times t1 through t5]
58
Visualizing Influence in Time
59
SemNews: News to OWL
  • Semantically search and browse news
  • Aggregators collect the RSS news descriptions
    from various sources.
  • The sentences are processed by OntoSem and are
    converted into TMRs
  • And then into RDF and OWL
  • Provides intelligent agents with the latest news
    in a machine readable format
  • http://semnews.umbc.edu/

61
Agent understandable news
Provides RDF version of the news.
62
Semanticizing RSS feeds
View structured representation of the RSS news
story.
Future versions would enable editing the facts
and provide provenance information.
63
News stories are ontologically linked
Find news stories by browsing through the OntoSem
ontology.
64
Tracking Named Entities
Find stories on a specific named entity.
65
Browsing Facts
The fact repository explorer for the named entity
Mexico shows that it has a relation
nationality-of with CITIZEN-235.
The fact repository explorer for the instance
CITIZEN-235 shows that the citizen is an agent of
ESCAPE-EVENT.
66
Querying the semanticized RSS
RDQL Queries
Provides structured querying over text
represented in RDF.
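
RDQL itself needs an RDF toolkit, so as a stand-in the sketch below matches triple patterns with variables over an in-memory list, which is the core idea behind such queries. The first two triples paraphrase the fact repository example from the Browsing Facts slide; the predicate spellings, the third triple, and the helper names are assumptions made for illustration.

```python
def match(triples, pattern):
    """Yield variable bindings for one (s, p, o) pattern; '?name' marks a variable."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break  # same variable bound to two different values
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join several patterns on shared variables (a basic conjunctive query)."""
    results = [{}]
    for pattern in patterns:
        joined = []
        for partial in results:
            # Substitute variables that earlier patterns already bound.
            bound = tuple(partial.get(t, t) for t in pattern)
            for binding in match(triples, bound):
                joined.append({**partial, **binding})
        results = joined
    return results


triples = [
    ("Mexico", "nationality-of", "CITIZEN-235"),
    ("CITIZEN-235", "agent-of", "ESCAPE-EVENT"),
    ("CITIZEN-999", "agent-of", "MEETING-EVENT"),  # hypothetical extra fact
]
answers = query(triples, [
    ("Mexico", "nationality-of", "?who"),
    ("?who", "agent-of", "?event"),
])
print(answers)
```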
67
Semantic Alerts
Alerts can be specified as ontological concepts/
keywords / RDQL queries. Subscribe to results of
structured queries
68
Beyond keyword search
  • Conceptually searching for content
  • Find all news stories that have something to do
    with a place and a terrorist activity.
  • Context based querying
  • Find all events in which George Bush was the
    speaker.
  • Reporting facts
  • Find all politicians who traveled to Asia.
  • Knowledge sharing
  • Populating instances by mapping FOAF and DC to
    OntoSem ontology.

69
SEMDIS
On Homeland Security and the Semantic Web: a
Provenance and Trust Aware Inference Framework
Semantic Association Discovery and Evaluation
Architecture
Motivation
  • Semantic association between X and Bin Laden
  • Provenance
  • Multiple sources contribute unique fragments of
    association
  • Multiple sources confirm a fragment in different
    belief states
  • Rank: only some discovered associations are
    interesting
  • Trust: some information sources are not
    sufficiently trustworthy
  • Collaborative implementation
  • University of Georgia
  • Extracting knowledge from the Web
  • Discovering complex semantic association
  • Ranking semantic association by content
  • UMBC
  • Tracking provenance of semantic association
  • Trusting semantic association by context
  • Enabling best-first search using trust heuristics

70
SEMDIS
On Homeland Security and the Semantic Web: a
Provenance and Trust Aware Inference Framework
Semantic Association Discovery and Evaluation
Trust
Trustworthiness of an RDF graph. The hypothesis
"Mr X is associated with Bin Laden" is proved by
a four-triple semantic association (SA); how do we
evaluate the SA's trustworthiness?
  S1: eg:MrX eg:isPresidentOf eg:companyA
  S2: eg:organizationB eg:invests eg:companyA
  S3: eg:organizationB eg:isOwnedBy eg:MrY
  S4: eg:MrY eg:relatesTo eg:BinLaden
Trust relations between agents help propagate
belief states:
  case 1 (belief concatenation): exactly one source
  per triple
  case 2 (belief aggregation): multiple sources for
  a triple
  case 3 (social dependency): sources are dependent
  through a social network
We assume all triples are semantically independent.
Provenance
  • Provenance of an RDF graph or sub-graph
  • Three sources of an RDF graph G
  • where-provenance: the web documents that
    serialize G
  • whom-provenance: the person who
    created/published G
  • why-provenance: the RDF graphs which logically
    imply G
  • RDF graph provenance service
  • Observations
  • provenance information is part of context
    information
  • provenance is not required for most inference
    tasks
  • provenance is useful for context based trust
    analysis
  • provenance can be used to group knowledge
  • Approach
  • provide a stand alone service that queries
    provenance of a given RDF graph or sub-graph

71
Overview
  • Motivation
  • Blogs and feeds
  • UMBC research
  • Seedling opportunities
  • Conclusion

72
Some opportunities
  • Modeling bias in information sources
  • Place information sources in a space defined by
    their opinions on 1000 issues
  • Mining sentiments and opinions from online
    communities
  • Who cares about what? Trend spotting
  • Extracting facts and beliefs from social media
  • Was flight 94 shot down?
  • Detecting events and new issues
  • e.g. London bombing (July 2005), E. coli and
    produce

73
Overview
  • Motivation
  • Blogs and feeds
  • UMBC research
  • Seedling opportunities
  • Conclusion

74
Conclusions
  • Social media increasingly important
  • Its greatest growth is outside the U.S.
  • It's possible to extract lots of potentially
    valuable information
  • Metadata, social networks, opinions and beliefs
  • If static pages form the Web's long term memory,
    then the Blogosphere is its stream of
    consciousness
  • Interests, fears, obsessions, questions, etc.

75
http://ebiquity.umbc.edu/
76
Ontological Semantics
OntoSem is a natural language processing system
that processes text and converts it into
facts. It is supported by a constructed world model
encoded in a rich ontology.
77
Ontological Semantics
78
Static Knowledge Sources
  • Ontology
  • 8000 concepts
  • Avg 16 properties each
  • Lexicons
  • English 45000 entries
  • Spanish 40000 entries
  • Chinese 3000 entries
  • Fact repository
  • 20000 facts
  • Onomasticon
  • NNNNN names

79
The OntoSem Ontology
[Figure: structure of the OntoSem ontology. Labels:
ONTOLOGY, CONCEPT, ROOT, OBJECT-OR-EVENT, PROPERTY,
SLOT, FACET, FILLER]
80
Text Meaning Representation (TMR)
Word sense addressed disambiguated
A persistent fact stored in the FR
Semantic dependency established
81
Text Meaning Representation (TMR)
REQUEST-ACTION-69
    AGENT             HUMAN-72
    THEME             ACCEPT-70
    BENEFICIARY       ORGANIZATION-71
    SOURCE-ROOT-WORD  ask
    TIME              (< (FIND-ANCHOR-TIME))
ACCEPT-70
    THEME             WAR-73
    THEME-OF          REQUEST-ACTION-69
    SOURCE-ROOT-WORD  authorize
ORGANIZATION-71
    HAS-NAME          United-Nations
    BENEFICIARY-OF    REQUEST-ACTION-69
    SOURCE-ROOT-WORD  UN
HUMAN-72
    HAS-NAME          Colin Powell
    AGENT-OF          REQUEST-ACTION-69
    SOURCE-ROOT-WORD  he  (reference resolution has been carried out)
WAR-73
    THEME-OF          ACCEPT-70
    SOURCE-ROOT-WORD  war
Source text: He asked the UN to authorize the war.
82
Mapping OntoSem to web based KR
[Diagram components: NL Text, OntoSem, Lexicon,
Ontology, Fact Repository, TMR, OntoSem2OWL,
TMRs in OWL, OWL Ontology]