Title: Geolocating Blogs From Their Textual Content
1. Geolocating Blogs From Their Textual Content
- Clay Fink, Christine Piatko, James Mayfield, Danielle Chou (APL)
- Tim Finin, Justin Martineau (UMBC)
2. Motivation
- Where are people blogging?
- Why do we want to know?
  - Support for analyzing the geographic dimension of sentiment
  - To extract locations for geo-spatial reasoning
  - Identify sources of hyperlocal content
  - Useful for mashups
3. Geolocation Issues For Blogs
- IP of blogger is not usually available for crawls and screen scraping
- Few blogs are self-hosted
- Most blogs are hosted by services such as Blogger, WordPress, etc.
- From our crawl of 800,000 blogs (weblogs.com, 2008)
  - Only 3% with unique IPs
  - 82% were hosted on IPs with at least 100 other crawled blogs
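The hosting figures above come down to counting how many crawled blogs share each IP address. A minimal sketch of that counting follows, assuming a hypothetical mapping from blog URL to resolved IP; the function name, the toy data, and the lowered threshold are illustrative and not taken from the crawl itself.

```python
from collections import Counter

def hosting_stats(blog_ips, crowd_threshold=100):
    """Return (share of blogs on a unique IP,
    share of blogs on an IP shared with >= crowd_threshold other blogs)."""
    if not blog_ips:
        return 0.0, 0.0
    per_ip = Counter(blog_ips.values())  # number of crawled blogs per IP
    total = len(blog_ips)
    unique = sum(1 for ip in blog_ips.values() if per_ip[ip] == 1)
    crowded = sum(1 for ip in blog_ips.values()
                  if per_ip[ip] - 1 >= crowd_threshold)
    return unique / total, crowded / total

# Hypothetical toy crawl: three blogs on one hosting IP, one self-hosted blog.
toy_crawl = {
    "blog-a.example": "192.0.2.10",
    "blog-b.example": "192.0.2.10",
    "blog-c.example": "192.0.2.10",
    "blog-d.example": "192.0.2.99",
}
# The threshold is lowered to 2 only so the toy data shows both cases.
unique_share, crowded_share = hosting_stats(toy_crawl, crowd_threshold=2)
print(unique_share, crowded_share)
```

When most blogs sit on a handful of hosting-service addresses, the IP locates the service rather than the blogger, which is the point of this slide.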
4. Metadata
- Metadata tags such as ICBM and geo.position are not used widely in blogs
- Found 900 blogs with such tags in 800,000 crawled blogs
<meta name="ICBM" content="38.9906657, -77.0260880" />
<meta name="geo.position" content="38.9906657;-77.0260880">
<meta name="geo.placename" content="Silver Spring, MD">
<meta name="geo.region" content="US">
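Tags like these can be read from a blog's home page with a standard HTML parser. The sketch below uses Python's built-in `html.parser`; the class and function names are hypothetical, and accepting either `;` or `,` as the coordinate separator is an assumption made for robustness, not something the slides specify.

```python
import re
from html.parser import HTMLParser

class GeoMetaParser(HTMLParser):
    """Collect geolocating <meta> tags from an HTML page."""
    GEO_NAMES = {"icbm", "geo.position", "geo.placename", "geo.region"}

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name in self.GEO_NAMES and content:
            self.tags[name] = content.strip()

def extract_coordinates(html):
    """Return (lat, lon) from geo.position or ICBM, or None if absent."""
    parser = GeoMetaParser()
    parser.feed(html)
    for key in ("geo.position", "icbm"):
        value = parser.tags.get(key)
        if not value:
            continue
        parts = re.split(r"[;,]", value)  # assumed separators
        if len(parts) == 2:
            try:
                return float(parts[0]), float(parts[1])
            except ValueError:
                continue
    return None

page = '''<html><head>
<meta name="ICBM" content="38.9906657, -77.0260880" />
<meta name="geo.placename" content="Silver Spring, MD">
</head><body></body></html>'''
print(extract_coordinates(page))
```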
5. About Me And Profiles
- No consistent way across blogging platforms for expressing the blogger's location
6. Blogging Behavior and Location
- People can indicate their position from their linguistic behavior
- 54% of bloggers worldwide write about personal experience
- Can we leverage the presence of named location entities in text to determine a blogger's location?
7. Textual Clues In Blog Posts
8. Geolocating Blogs From Textual Clues
- Smith et al. (2001) and Amitay et al. (2004) describe how to extract the geographic focus of digital documents in just such a way
  - Extract location entities
  - Extract toponyms for entities from gazetteer
  - Disambiguate location entities to a particular toponym
  - Determine focus from clustering of disambiguated places
- Apply to blog posts
  - Can the geographic focus of a blog be derived the same way by accumulating disambiguated entity mentions across all posts?
  - We tested how well the geographic focus extracted from a blog matches a blogger's location based on a collection of ground truth blogs with known locations
9. Toponyms From Post Location Mentions
- One US blog with known location
- 294 posts processed
- 211 unique entities
- 3429 toponyms
10. Disambiguated Post Location Mentions
- Disambiguated to 67 locations
11. Predicted/Actual Location of Blog
Predicted: Washington County, OR
Actual: Beaverton, OR
- Predicted location is Washington County, Oregon
- Subsumes true location of Beaverton, Oregon
12. Method
- We collected posts from blogs with known locations as our ground truth
  - Our crawl looked for blogs with geolocating meta tags (ICBM, geo.position) in the home page
  - Also used a list of blogs with reported locations from feedmap.net
- Used a named entity recognizer to extract location entities from posts
- Used an online gazetteer to get toponyms for the location entities
- Disambiguated locations
- Determined geographic focus by looking for clustering of disambiguated locations in a particular locale
13. Method
- NER
  - Used the Stanford Named Entity Recognizer
  - Used the GeoNames online gazetteer
- Disambiguation
  - Assume continent, country, or first-level administrative area for matching location names
  - Look for qualified location names (e.g., Vancouver, BC)
  - Disambiguate remaining mentions to most populous toponym
  - Assume one sense per discourse across a given blog's posts
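The disambiguation heuristics above can be sketched over a toy gazetteer. Everything in this block is an assumption made for illustration: the `Toponym` record, the in-memory `GAZETTEER` (the study queried GeoNames online), the rounded population figures, and the order in which the heuristics are applied (qualified names first, then broad administrative senses, then population).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Toponym:
    name: str
    kind: str        # "continent", "country", "admin1", or "city"
    admin1: str      # first-level administrative code
    population: int  # illustrative, rounded values only

# Hypothetical stand-in for gazetteer lookups.
GAZETTEER = [
    Toponym("Vancouver", "city", "BC", 600_000),
    Toponym("Vancouver", "city", "WA", 160_000),
    Toponym("Portland", "city", "OR", 580_000),
    Toponym("Portland", "city", "ME", 66_000),
    Toponym("Washington", "admin1", "WA", 6_700_000),
    Toponym("Washington", "city", "DC", 600_000),
]
BROAD_KINDS = {"continent", "country", "admin1"}

def candidates(name):
    return [t for t in GAZETTEER if t.name == name]

def disambiguate_blog(mentions):
    """Map each location name mentioned anywhere in one blog to one toponym.

    Keying the result by name enforces one sense per discourse: every
    mention of a name in the blog resolves to the same place.
    """
    resolved = {}
    # Pass 1: qualified names such as "Vancouver, BC" fix the sense.
    for mention in mentions:
        if "," in mention:
            name, qualifier = (p.strip() for p in mention.split(",", 1))
            for t in candidates(name):
                if t.admin1 == qualifier:
                    resolved.setdefault(name, t)
    # Pass 2: prefer broad administrative senses, else the most populous.
    for mention in mentions:
        name = mention.split(",", 1)[0].strip()
        if name in resolved:
            continue
        cands = candidates(name)
        if not cands:
            continue
        broad = [t for t in cands if t.kind in BROAD_KINDS]
        resolved[name] = max(broad or cands, key=lambda t: t.population)
    return resolved

senses = disambiguate_blog(["Vancouver", "Vancouver, WA", "Portland", "Washington"])
for name, toponym in senses.items():
    print(name, "->", toponym.kind, toponym.admin1)
```

In this toy run the single qualified mention "Vancouver, WA" also decides the unqualified "Vancouver", which would otherwise default to the more populous sense.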
14. Method
- Determining Geographic Focus
  - Populate a tree of the disambiguated places using the geographic inclusion hierarchy from the gazetteer
  - Implemented tree using simple OWL ontology that describes a location hierarchy
  - Set initial scores for hierarchy nodes
  - Apply scoring function (Amitay et al., 2004) to accumulate scores up hierarchy
  - $s_n = I_n + \sum_i s_i^2 D^{L-1}$
  - Nodes lower in hierarchy with more sub-nodes will be higher scoring
  - Success if highest scoring node subsumes known location, or is within 100 miles, or is within the same state or province
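To illustrate how scores accumulate up an inclusion hierarchy, here is a simplified sketch. It is not the scoring function of Amitay et al. (2004) and not the OWL-based implementation described above: it uses a plain parent map and an additive rule in which each mention adds an initial score to its own node and a decayed share to each ancestor. The hierarchy fragment and the mention counts are hypothetical; the initial score of 0.5 and decay of 0.8 are the values reported in the results.

```python
from collections import defaultdict

# Hypothetical fragment of a geographic inclusion hierarchy (child -> parent).
PARENT = {
    "Beaverton, OR": "Washington County, OR",
    "Hillsboro, OR": "Washington County, OR",
    "Washington County, OR": "Oregon",
    "Portland, OR": "Multnomah County, OR",
    "Multnomah County, OR": "Oregon",
    "Oregon": "United States",
    "United States": None,
}

def focus_scores(mention_counts, initial=0.5, decay=0.8):
    """Accumulate mention scores up the hierarchy with per-level decay."""
    scores = defaultdict(float)
    for place, count in mention_counts.items():
        node, weight = place, initial
        while node is not None:
            scores[node] += count * weight
            node = PARENT.get(node)  # None at the root or for unknown nodes
            weight *= decay
    return dict(scores)

def geographic_focus(mention_counts, initial=0.5, decay=0.8):
    """Return the highest-scoring node as the predicted focus."""
    scores = focus_scores(mention_counts, initial, decay)
    return max(scores, key=scores.get)

# Hypothetical disambiguated mention counts for one blog.
counts = {"Beaverton, OR": 3, "Hillsboro, OR": 2, "Portland, OR": 1}
print(geographic_focus(counts))
```

With these toy counts the county node collects contributions from two of its towns and outscores both the individual towns and the state, which mirrors the clustering behavior described above. Under this additive rule a weaker decay would let higher-level nodes win, so the decay constant matters.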
15. Results
- Processed posts from 844 English-language US blogs published between January 1, 2005, and April 24, 2009
- 526 hits out of 829 acceptable results
- 63% accuracy based on acceptance criteria
- 97% correctly identified as in the US
- Used an initial node score of 0.5 and a decay constant of 0.8
- Success correlated with number of disambiguated locations near known location (0.5)
- 95% accuracy if number of nearby locations is > 10
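The acceptance criteria and the accuracy arithmetic can be sketched as follows. The `Place` record, its field names, and the approximate coordinates are assumptions made for illustration; the 100-mile radius and the 526-of-829 figures come from the slides.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8

@dataclass(frozen=True)
class Place:
    name: str
    lat: float
    lon: float
    admin1: str            # state or province code, e.g. "US-OR"
    ancestors: tuple = ()  # names of enclosing places in the hierarchy

def haversine_miles(a, b):
    """Great-circle distance between two places, in miles."""
    lat1, lat2 = math.radians(a.lat), math.radians(b.lat)
    dlat = lat2 - lat1
    dlon = math.radians(b.lon - a.lon)
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def is_hit(predicted, actual, max_miles=100.0):
    """True if the prediction subsumes the known location, lies within
    max_miles of it, or falls in the same state or province."""
    subsumes = predicted.name in actual.ancestors
    nearby = haversine_miles(predicted, actual) <= max_miles
    same_admin1 = predicted.admin1 == actual.admin1
    return subsumes or nearby or same_admin1

# Approximate, illustrative coordinates.
predicted = Place("Washington County, OR", 45.56, -123.10, "US-OR")
actual = Place("Beaverton, OR", 45.487, -122.804, "US-OR",
               ancestors=("Washington County, OR", "Oregon", "United States"))
print(is_hit(predicted, actual))

hits, total = 526, 829
print(f"accuracy = {hits / total:.1%}")
```

The three criteria are alternatives, so a prediction at county or state level counts as a hit when it contains the blogger's known location, even though it is less specific.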
16. Results: Geolocated US Blogs
17. Discussion
- Issues
  - NER model we used was not trained on blog text
  - Disambiguation techniques were very simple
  - No confidence measure for assigned geographic focus
  - No comparison to human judgments
- Worth Further Study
  - Integrate into sentiment analysis
  - Investigate use of locative language for finding references to proximate locations
  - Use to populate an ontology instance implementing more complex geospatial relationships
  - Apply to other social media, such as Twitter