Geolocating Blogs From Their Textual Content

1
Geolocating Blogs From Their Textual Content
  • Clay Fink, Christine Piatko, James Mayfield,
    Danielle Chou (APL), Tim Finin, Justin Martineau
    (UMBC)

2
Motivation
  • Where are people blogging?
  • Why do we want to know?
  • Support for analyzing the geographic dimension of
    sentiment
  • To extract locations for geo-spatial reasoning
  • Identify sources of hyperlocal content
  • Useful for mashups

3
Geolocation Issues For Blogs
  • IP of blogger is not usually available for crawls
    and screen scraping
  • Few blogs are self-hosted
  • Most blogs are hosted by services such as
    Blogger, Wordpress, etc.
  • From our crawl of 800,000 blogs (weblogs.com,
    2008)
  • Only 3% with unique IPs
  • 82% were hosted on IPs with at least 100 other
    crawled blogs

4
Metadata
  • Metadata tags such as ICBM and geo.position are
    not used widely in blogs
  • Found 900 blogs with such tags in 800,000 crawled
    blogs

<meta name="ICBM" content="38.9906657, -77.0260880" />
<meta name="geo.position" content="38.9906657;-77.0260880" />
<meta name="geo.placename" content="Silver Spring, MD" />
<meta name="geo.region" content="US" />
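
A minimal sketch of how such tags might be pulled from a blog home page, using only Python's standard library. The class and function names are hypothetical, and the sketch assumes ICBM separates latitude and longitude with a comma while geo.position uses a semicolon; it is not the crawler used in this work.

```python
from html.parser import HTMLParser
import re

# Meta tag names that carry coordinates (compared case-insensitively).
GEO_META_NAMES = {"icbm", "geo.position"}


class GeoMetaParser(HTMLParser):
    """Collects (lat, lon) pairs from ICBM and geo.position meta tags."""

    def __init__(self):
        super().__init__()
        self.coords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").lower() not in GEO_META_NAMES:
            return
        # Assumed formats: "lat, lon" (ICBM) or "lat;lon" (geo.position).
        parts = re.split(r"[;,]", attr.get("content", ""))
        if len(parts) != 2:
            return
        try:
            self.coords.append((float(parts[0]), float(parts[1])))
        except ValueError:
            pass  # malformed coordinates are skipped


def extract_geo_coords(html):
    """Return every coordinate pair found in geolocating meta tags."""
    parser = GeoMetaParser()
    parser.feed(html)
    return parser.coords
```

Calling `extract_geo_coords` on a page with no such tags returns an empty list, which matches the common case reported above.
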
5
About Me And Profiles
  • No consistent way across blogging platforms for
    expressing the blogger's location

6
Blogging Behavior and Location
  • People can indicate their position from their
    linguistic behavior
  • 54% of bloggers worldwide write about personal
    experience
  • Can we leverage the presence of named location
    entities in text to determine a blogger's
    location?

7
Textual Clues In Blog Posts

8
Geolocating Blogs From Textual Clues
  • Smith et al. (2001) and Amitay et al. (2004)
    describe how to extract the geographic focus of
    digital documents in just such a way
  • Extract location entities
  • Extract toponyms for entities from gazetteer
  • Disambiguate location entities to a particular
    toponym
  • Determine focus from clustering of disambiguated
    places
  • Apply to blog posts
  • Can the geographic focus of a blog be derived
    the same way by accumulating disambiguated entity
    mentions across all posts?
  • We tested how well the geographic focus extracted
    from a blog matches a blogger's location based on
    a collection of ground truth blogs with known
    locations

9
Toponyms From Post Location Mentions
  • One US blog with known location
  • 294 posts processed
  • 211 unique entities
  • 3429 toponyms

10
Disambiguated Post Location Mentions
  • Disambiguated to 67 locations

11
Predicted/Actual Location of Blog
Predicted: Washington County, OR
Actual: Beaverton, OR
  • Predicted location is Washington County, Oregon
  • Subsumes true location of Beaverton, Oregon

12
Method
  • We collected posts from blogs with known locations
    as our ground truth
  • Our crawl looked for blogs with geolocating meta
    tags (ICBM, geo.position) in the home page
  • Also used a list of blogs with reported locations
    from feedmap.net
  • Used a named entity recognizer to extract
    location entities from posts
  • Used an online gazetteer to get toponyms for the
    location entities
  • Disambiguated locations
  • Determined geographic focus by looking for
    clustering of disambiguated locations in a
    particular locale

13
Method
  • NER
  • Used the Stanford Named Entity Recognizer
  • Used the Geonames online gazetteer
  • Disambiguation
  • Assume continent, country, or first-level
    administrative area for matching location names
  • Look for qualified location names
  • e.g., Vancouver, BC
  • Disambiguate remaining mentions to most populous
    toponym
  • Assume one sense per discourse across a given
    blog's posts
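
The disambiguation heuristics above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy gazetteer, its population figures, the `Toponym` fields, and the order in which the rules are tried are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toponym:
    name: str
    kind: str        # "continent", "country", "admin1", or "place"
    admin1: str      # first-level administrative code
    country: str
    population: int  # illustrative values only


# Hypothetical toy gazetteer; real candidates would come from a gazetteer lookup.
GAZETTEER = {
    "vancouver": [
        Toponym("Vancouver", "place", "BC", "CA", 600_000),
        Toponym("Vancouver", "place", "WA", "US", 160_000),
    ],
    "washington": [
        Toponym("Washington", "place", "DC", "US", 600_000),
        Toponym("Washington", "admin1", "WA", "US", 6_500_000),
    ],
}

HIGH_LEVEL = {"continent", "country", "admin1"}


def disambiguate(mention, senses):
    """Resolve one mention; `senses` caches choices for a single blog."""
    name, _, qualifier = (part.strip() for part in mention.partition(","))
    key = name.lower()
    candidates = GAZETTEER.get(key, [])
    if not candidates:
        return None
    # Qualified name such as "Vancouver, BC": match on region or country code.
    if qualifier:
        matches = [t for t in candidates
                   if qualifier.upper() in (t.admin1, t.country)]
        if matches:
            senses[key] = max(matches, key=lambda t: t.population)
            return senses[key]
    # One sense per discourse: reuse an earlier decision for this blog.
    if key in senses:
        return senses[key]
    # Prefer continent/country/admin1 readings, then the most populous one.
    high = [t for t in candidates if t.kind in HIGH_LEVEL]
    senses[key] = max(high or candidates, key=lambda t: t.population)
    return senses[key]
```

With one shared `senses` dictionary per blog, a single qualified mention ("Vancouver, WA") fixes the reading of every later bare "Vancouver" in that blog's posts.
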

14
Method
  • Determining Geographic Focus
  • Populate a tree of the disambiguated places using
    the geographic inclusion hierarchy from the
    gazetteer
  • Implemented tree using simple OWL ontology that
    describes a location hierarchy
  • Set initial scores for hierarchy nodes
  • Apply scoring function (Amitay et al, 2004) to
    accumulate scores up hierarchy
  • s_n = I_n + Σ_i s_i^2 · D^(L-1)
  • Nodes lower in hierarchy with more sub-nodes will
    be higher scoring
  • Success if highest scoring node subsumes known
    location, or is within 100 miles, or is within
    the same state or province
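
A rough sketch of accumulating scores up a place hierarchy. The exact scoring formula is garbled in this transcript, so the rule used here is an assumption: each mention adds the initial score to its own node and a share decayed by `decay ** k` to the ancestor `k` levels up. Function names and the example paths are hypothetical; the defaults of 0.5 and 0.8 are the initial score and decay constant reported in the results.

```python
from collections import defaultdict


def score_hierarchy(mentions, initial=0.5, decay=0.8):
    """Score every node touched by the mentions.

    Each mention is a path from the most general to the most specific
    place, e.g. ("US", "OR", "Washington County", "Beaverton").
    """
    scores = defaultdict(float)
    for path in mentions:
        # k = 0 is the mentioned place itself; k >= 1 are its ancestors.
        for k in range(len(path)):
            node = path[: len(path) - k]
            scores[node] += initial * decay ** k
    return dict(scores)


def geographic_focus(mentions, initial=0.5, decay=0.8):
    """Return the highest-scoring node (assumed accumulation rule)."""
    scores = score_hierarchy(mentions, initial, decay)
    return max(scores, key=scores.get)


mentions = [
    ("US", "OR", "Washington County", "Beaverton"),
    ("US", "OR", "Washington County", "Hillsboro"),
    ("US", "OR", "Washington County", "Tigard"),
    ("US", "WA", "King County", "Seattle"),
]
print(geographic_focus(mentions))
```

Under this assumed rule, a node that gathers several distinct sub-nodes close beneath it outscores both its individual children and its more distant ancestors, which is the behavior the slide describes.
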

15
Results
  • Processed posts from 844 English language, US
    blogs published between January 1, 2005, and
    April 24, 2009
  • 526 hits out of 829 acceptable results
  • 63% accuracy based on acceptance criteria
  • 97% correctly identified as in the US
  • Used an initial node score of 0.5 and a decay
    constant of 0.8
  • Success correlated with number of disambiguated
    locations near known location (correlation of 0.5)
  • 95% accuracy if number of nearby locations is
    greater than 10
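
The acceptance criteria behind these figures (subsumes the known location, within 100 miles, or same state or province) could be checked along these lines. This is a sketch only: the record layout, function names, and Earth-radius constant are assumptions, and "subsumes" is read literally as a path-prefix test, since the slides do not say how very coarse predictions were treated.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_hit(predicted, actual, max_miles=100.0):
    """predicted/actual: dicts with 'path' (general to specific), 'lat', 'lon'."""
    p, a = predicted["path"], actual["path"]
    subsumes = a[: len(p)] == p
    # Paths are assumed to be (country, state_or_province, ...).
    same_state = len(p) >= 2 and len(a) >= 2 and p[:2] == a[:2]
    near = haversine_miles(predicted["lat"], predicted["lon"],
                           actual["lat"], actual["lon"]) <= max_miles
    return subsumes or same_state or near


# Reported accuracy: 526 hits out of 829 acceptable results.
print(round(100 * 526 / 829))  # 63
```
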

16
Results: Geolocated US Blogs

17
Discussion
  • Issues
  • NER model we used was not trained on blog text
  • Disambiguation techniques were very simple
  • No confidence measure for assigned geographic
    focus
  • No comparison to human judgments
  • Worth Further Study
  • Integrate into sentiment analysis
  • Investigate use of locative language for finding
    references to proximate locations
  • Use to populate an ontology instance implementing
    more complex geospatial relationships
  • Apply to other social media, such as Twitter