Steps Towards Mapping e-Research and Measuring Impact

1
Steps Towards Mapping e-Research and Measuring Impact
  • Alex Voss, Rob Procter, Peter Halfpenny, Meik
    Poschen, Marzieh Asgari-Targhi

AHM08 Workshop on Profiling e-Research: Mapping
Communities and Measuring Impacts
Edinburgh, 10th September 2008
2
Aims
  • To compile a comprehensive database of e-Social
    Science activities in the UK and elsewhere
  • To analyse the data in order to capture a
    snapshot of e-Social Science
  • To provide a monitoring tool that flags up new
    content
  • To provide an infrastructure for further research

3
Problem
  • What I would call e-Social Science is not always
    labeled e-Social Science
  • Simply googling for the term will provide only a
    partial view
  • Need to establish a network of relevant nodes
    with context information on the web and expand
    search from there

4
Approach
  • Using lists of conference and workshop attendees
  • Search for relevant URLs
  • Review resulting data
  • Harvest web pages connected to these
  • Extract key terms
  • Visualise results
  • Further steps

5
Seed List
  • Data about attendees of events (Intl. Conference
    and Agenda Setting)
  • 226 individuals
  • Removal of duplicates and erroneous entries
  • Import into SQL database

6
Search
  • Using Yahoo Search API, generating list of URLs
    matching name, surname and affiliation
  • Restricted to .ac.uk, .edu, .nhs.uk and .gov.uk
  • Results in 30k hits for 226 people
  • Extraction of hostnames from URLs
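
The domain restriction and hostname extraction described on this slide could look roughly like the sketch below. It leaves out the search call itself and works on a hand-written list of hits; the helper names hostname_of and is_allowed and the example.com URL are illustrative assumptions, not part of the original tooling.

```python
from urllib.parse import urlparse

# Domain suffixes as listed on the slide; the helper names are hypothetical.
ALLOWED_SUFFIXES = (".ac.uk", ".edu", ".nhs.uk", ".gov.uk")


def hostname_of(url):
    """Return the hostname of a URL (lower-cased by urlparse), or None."""
    return urlparse(url).hostname


def is_allowed(url):
    """True if the URL's hostname ends with one of the allowed suffixes."""
    host = hostname_of(url)
    return host is not None and host.endswith(ALLOWED_SUFFIXES)


# Illustrative hits; the last URL is an invented out-of-scope example.
hits = [
    "http://www.ncess.ac.uk/events/conference/2006/papers/",
    "http://ess.si.umich.edu/papers.htm",
    "http://www.example.com/people.html",
]

# Keep (hostname, url) pairs for the hits that pass the restriction.
kept = [(hostname_of(u), u) for u in hits if is_allowed(u)]
```

Storing the hostname next to each URL is what makes the later grouping by host a simple query.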

7
Removing False Positives
  • Clustering of hostnames by frequency showed some
    systematic false positives through long lists of
    names on some sites
  • e.g., lists of alumni, sports teams etc.
  • Manually removing these for the top 80 hostnames
    reduced the number of URLs by 10k, to 20k
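
One way to support this manual review is to rank hostnames by how many URLs they contribute and then drop every URL on a host that has been rejected by hand. The sketch below assumes this workflow; the function names and the alumni.example.ac.uk host are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def host_frequencies(urls):
    """Count how many URLs each hostname contributes."""
    return Counter(urlparse(u).hostname for u in urls)


def drop_hosts(urls, rejected_hosts):
    """Remove every URL whose hostname was rejected during manual review."""
    return [u for u in urls if urlparse(u).hostname not in rejected_hosts]


# Invented sample data: one host that is just a long list of names.
urls = [
    "http://alumni.example.ac.uk/list/a.html",
    "http://alumni.example.ac.uk/list/b.html",
    "http://alumni.example.ac.uk/list/c.html",
    "http://www.ncess.ac.uk/about_us/people/",
]

# The 80 most frequent hosts are the ones a person would look through.
top = host_frequencies(urls).most_common(80)

# Hosts judged to be systematic false positives (chosen by hand).
rejected = {"alumni.example.ac.uk"}
cleaned = drop_hosts(urls, rejected)
```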

8
Review
  • Clustering of hostnames by frequency (after
    cleaning)
  • select count(host) as size, host from url
    group by host order by size desc

    size  host
    ----  -------------------------------
     211  www.geog.leeds.ac.uk
     204  www.nottingham.ac.uk
     140  www.shef.ac.uk
     126  www.ncess.ac.uk
     109  www.manchester.ac.uk
      97  www.lancs.ac.uk
      95  www.psychology.nottingham.ac.uk
      93  redress.lancs.ac.uk
      92  www.cs.bris.ac.uk
      91  www.comlab.ox.ac.uk
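
The query on this slide can be tried out against a small in-memory database. The sketch below uses SQLite and assumes a url table with id, url and host columns; the slides do not show the actual schema or database engine, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the slides only show that a table `url` has a `host` column.
conn.execute("CREATE TABLE url (id INTEGER PRIMARY KEY, url TEXT, host TEXT)")

# Invented rows, just enough to give the grouping something to count.
rows = [
    ("http://www.geog.leeds.ac.uk/a", "www.geog.leeds.ac.uk"),
    ("http://www.geog.leeds.ac.uk/b", "www.geog.leeds.ac.uk"),
    ("http://www.ncess.ac.uk/c", "www.ncess.ac.uk"),
]
conn.executemany("INSERT INTO url (url, host) VALUES (?, ?)", rows)

# The query from the slide: hostnames ranked by number of URLs.
result = conn.execute(
    "SELECT count(host) AS size, host FROM url "
    "GROUP BY host ORDER BY size DESC"
).fetchall()

for size, host in result:
    print(f"{size:5d}  {host}")
```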

9
Review (II)
  • Clustering of URLs by number of persons mentioned
    (after cleaning)

    size  url
    ----  ------------------------------------------
      24  http://ess.si.umich.edu/papers.htm
      17  http://www.ncess.ac.uk/events/ASW/visualisation/
      17  http://www.ncess.ac.uk/events/conference/2006/papers/
      12  http://ess.si.umich.edu/committee.htm
      12  http://redress.lancs.ac.uk/resources/
      10  http://www.kato.mvc.mcc.ac.uk/rss-wiki/VizNET
      10  http://www.informatics.manchester.ac.uk/aboutus/staff/
       8  http://www.ncess.ac.uk/about_us/people/?centre
       7  http://www.geog.leeds.ac.uk/people/a.turner/personal/blog/

10
Checking Completeness
  • select id from url where url =
    'http://ess.si.umich.edu/committee.htm'
  • > 59765
  • select surname, name from delegate join
    delegate_url on id = delegate_id where url_id =
    59765
  • This returns a list of 12 people, but the actual
    list of the conference PC is much longer
  • The result misses people who are in the database,
    and the page also names people who are not in the
    database
  • Potential to expand list of people involved in
    e-Social Science
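
The completeness check can be sketched as a comparison between the people the database links to a page and a list of names compiled from that page by hand. The example below assumes SQLite and a minimal three-table schema matching the table and column names in the queries above; all person names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed minimal schema; only the names used in the slide's queries are known.
conn.executescript("""
CREATE TABLE delegate (id INTEGER PRIMARY KEY, name TEXT, surname TEXT);
CREATE TABLE url (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE delegate_url (delegate_id INTEGER, url_id INTEGER);
INSERT INTO delegate VALUES (1, 'Ada', 'Example'), (2, 'Ben', 'Sample');
INSERT INTO url VALUES (59765, 'http://ess.si.umich.edu/committee.htm');
INSERT INTO delegate_url VALUES (1, 59765);
""")

page = "http://ess.si.umich.edu/committee.htm"
(url_id,) = conn.execute("SELECT id FROM url WHERE url = ?", (page,)).fetchone()

# People the search step linked to this page.
matched = set(conn.execute(
    "SELECT surname, name FROM delegate "
    "JOIN delegate_url ON delegate.id = delegate_id WHERE url_id = ?",
    (url_id,),
))

# Everyone in the seed list, and a hand-compiled list of names on the page.
known = set(conn.execute("SELECT surname, name FROM delegate"))
listed_on_page = {("Example", "Ada"), ("Sample", "Ben"), ("Placeholder", "Cy")}

# In the database but not linked to the page: the search missed them.
unmatched_but_known = (listed_on_page & known) - matched
# Not in the database at all: candidates for extending the seed list.
not_in_database = listed_on_page - known
```

The two resulting sets correspond to the two kinds of gaps noted on the slide.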

11
Harvesting Content
  • Harvesting 20k web pages takes time
  • Using multithreaded code to mask latency
  • Using 40 harvesters still takes about 4h
  • All but 230 pages harvested
  • 1.3GB of data
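
A minimal sketch of latency masking with a pool of 40 worker threads is shown below. It is not the harvester used for this work; the function names, the 30 second timeout and the injectable fetcher parameter are assumptions made for illustration. Failed downloads are recorded rather than raised, which matches the idea of ending up with a small set of pages that could not be harvested.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def harvest(urls, fetcher=fetch, workers=40):
    """Fetch all URLs concurrently and return (pages, failures).

    pages maps each URL to its content; failures maps each URL that
    could not be fetched to the exception that was raised.
    """
    pages, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as exc:  # record the failure and carry on
                failures[url] = exc
    return pages, failures
```

Because the threads spend most of their time waiting on the network, many of them can run at once; passing a different fetcher makes it possible to try the function without any network access.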

12
Amending Seed Data
  • Extracting email addresses
  • Finding mailto links actually works quite well
  • Not much need to deal with obfuscation (such as
    alex.voss-at-ncess.ac.uk)
  • But doing this may improve results
  • How to deal with multiple valid emails?
  • Extracting affiliations
  • Again, surprising how effective this was but ho
  • Again, how to deal with multiple affiliations?
  • Affiliation does not map 1:1 to research area
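
Email extraction from mailto links, plus the "-at-" style of obfuscation mentioned above, might be sketched with two regular expressions as follows. Both patterns are simplifying assumptions: they will miss other obfuscation styles, and the "-at-" pattern can produce false matches on ordinary hyphenated text followed by a dotted word. The someone@example.ac.uk address is invented.

```python
import re
from urllib.parse import unquote

# Address part of href="mailto:..." links, up to any ?subject=... suffix.
MAILTO_RE = re.compile(r"""href\s*=\s*["']mailto:([^"'?]+)""", re.IGNORECASE)

# The "user-at-domain" obfuscation style mentioned on the slide.
OBFUSCATED_RE = re.compile(
    r"\b([\w.+-]+)-at-([\w-]+(?:\.[\w-]+)+)\b", re.IGNORECASE
)


def extract_emails(html):
    """Return a sorted list of lower-cased email addresses found in html."""
    found = {unquote(m).strip().lower() for m in MAILTO_RE.findall(html)}
    found |= {
        f"{user}@{domain}".lower()
        for user, domain in OBFUSCATED_RE.findall(html)
    }
    return sorted(found)


sample = (
    '<a href="mailto:Someone@example.ac.uk?subject=Hi">mail</a> '
    "or write to alex.voss-at-ncess.ac.uk"
)
emails = extract_emails(sample)
```

Returning all addresses found leaves open the question raised above of which one to keep when a person has several valid emails.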

13
Key Term Extraction
  • Using NaCTeM's TerMine (using website at the
    moment, web service soon)

    Rank  Term
    ----  ------------------------------------
       5  e-social science
      10  national centre
      11  rob procter
      12  social science
      13  marina jirotka
      14  international conference
      15  social sciences
      18  mark rouncefield
      19  computer science
      22  research centre
      27  science studies unit
      35  lancaster university
      40  computer supported cooperative work
      46  text mining
      48  paul luff

14
Key Term Extraction (II)
  • Next steps
  • Change code to use web services API
  • Repeat key term extraction for 226 individuals
  • Create unified key term list
  • Review and create stop-list
  • Factor this into tailored TerMine service
  • Named entity recognition to extend seed list

15
Social Map
Co-occurrence of names on web pages
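
A co-occurrence map of this kind can be built by counting, for every pair of names, the number of pages on which both appear. The sketch below assumes the harvesting step has already produced a mapping from page URL to the set of names found on that page; the function name, the page keys and the single-letter names are placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages):
    """Count, for each pair of names, the pages on which both occur.

    pages maps a page URL to the collection of names found on it.
    Pairs are stored in sorted order so (A, B) and (B, A) are one edge.
    """
    weights = Counter()
    for names in pages.values():
        for pair in combinations(sorted(set(names)), 2):
            weights[pair] += 1
    return weights


# Placeholder data: three names spread over two pages.
pages = {
    "page-1": {"A", "B", "C"},
    "page-2": {"A", "B"},
}
edges = cooccurrence(pages)
```

Each key in the result is an edge of the social map, and the count is one simple candidate for an edge weight.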
16
Further Next Steps
  • Add weights to social map: how strongly are
    people connected?
  • Drawing social network graphs for interactive
    analysis using information about link structure
  • Repeating Yahoo searches to flag up new data
    appearing
  • RSS feed on what's new in e-Social Science
  • Doing Yahoo searches on the top key terms emerging
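
Flagging new data from repeated searches comes down to comparing the URLs of the latest run with those already seen. A minimal sketch, assuming each run yields a plain list of URLs (the function name and sample URLs are illustrative):

```python
def flag_new(previous, current):
    """Return URLs in `current` that were not seen in `previous`.

    The order of `current` is preserved and duplicates within it are
    reported only once.
    """
    seen = set(previous)
    fresh = []
    for url in current:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Invented example runs.
last_run = ["http://www.ncess.ac.uk/a", "http://www.ncess.ac.uk/b"]
this_run = [
    "http://www.ncess.ac.uk/b",
    "http://www.ncess.ac.uk/c",
    "http://www.ncess.ac.uk/c",
]
new_items = flag_new(last_run, this_run)
```

The resulting list is the kind of input a "what's new" feed would need.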

17
Next Steps?
  • FOAF-type semantic data on e-Social Science
    projects
  • What incentives could we leverage to get people
    to provide the information we are interested in?
  • Combining with bibliometric work
  • New kinds of entities: publications, projects,
    organisations