4th International Conference on e-Social Science

1
Reconstruction of the entire UK population using
microsimulation
  • Andy Turner
  • http://www.geog.leeds.ac.uk/people/a.turner/

2
Overview
  • Introduction
  • What
  • Why
  • How
  • What next

3
Introduction
  • The title is a bit odd and vague
  • reconstruction using microsimulation
  • I can only guess what this is.
  • I don't think this presentation addresses that.
  • Hopefully it does address something relevant and
    of interest!

4
This presentation focuses on
  • Developing digital demographic data for the UK
  • A reconstruction of data for 2001 that has
    existed since around 2003.
  • The MoSeS Genetic Algorithm, which attempts to
    reconstruct individual level data for every
    individual in the UK in 2001
  • How you can reconstruct the MoSeS reconstruction

5
What is MoSeS?
  • Modelling and Simulation for e-Social Science
  • http://www.ncess.ac.uk/research/nodes/MoSeS/
  • e-Social Science being the application of
    e-Science concepts to social science problem
    domains
  • e-Science is enhanced science that uses the
    Internet, software tools and structured
    information for collaborative work
  • A first phase research node of NCeSS
  • Part of a UK collaborative partnership developing
    e-Social Science
  • The key part of its program of work is to
    develop an individually based demographic model
    of the UK for 2001 to 2031
  • MoSeS people

6
What am I on about and what do we want?
  • UK demographic data reconstruction for 2001.
  • The UK demographic data we want largely exists as
    2001 human population (census) data, but it is
    not available as 2001 census output

7
Why do we want it?
  • Reconstructed data is input into a dynamic model
    that operates at the individual and household
    level to simulate population change for MoSeS
    applications.
  • Belinda and/or Mark will be talking about the
    dynamic model work later on
  • It is theorised that in order to be realistic and
    of use in local service and transport planning,
    the demographic models have to operate at this
    individual and household level.

8
Enriching the base population
  • Efforts are being made to enrich the census
    reconstruction with additional data from other
    sources (e.g. British Household Panel Survey)
  • The results of this data integration are new
    constructions, data that has not previously
    existed.
  • The idea is to add non-census variables to the
    base census data reconstruction.
  • Chengchao Zuo is doing some of this work, but is
    not presenting it here.

9
Introducing Census Data
  • In reconstructing the census data it is necessary
    to
  • know some details of the available published
    data
  • consider the different ways of doing it.
  • So I'm going to describe the available census
    data and then introduce a couple of ways of
    reconstructing the individual census data for all
    individuals.

10
2001 UK Human Population Census: Scope and
general characteristics
  • Attempt to collect demographic data about all
    individuals in the UK at a specific time.
  • Data collected via a paper form and digitised.
  • Includes (in the region of) a hundred variables
    that detail each individual.

11
2001 UK Human Population Census: Key Units and
references
  • Data collected for households and communal
    establishments
  • For each household there is a household reference
    person (HRP) and there are some variables that
    inform of the relationships between each
    household individual and this HRP
  • Communal establishments include hospitals,
    hospices, prisons etc. and, in Scotland,
    residential schools.
  • The definitions of households and communal
    establishments, and the difference between them,
    are important.
  • Output Areas (OAs)
  • Smallest regions of aggregated data dissemination
  • Grouped into MSOA, Wards, Regions
  • New to 2001
  • A typical OA might contain 300 people and about a
    hundred households and may contain a communal
    establishment.

12
Households
13
Communal Establishments
14
2001 UK Human Population Census: Anonymisation
and the individual data
  • Digitised data was anonymised
  • A new version was produced that had names and
    addresses removed.
  • Data with names and addresses is more useful than
    the anonymised form, but due to various concerns
    the file that would link individual records with
    the name and address information is classified.
  • In MoSeS we have not been concerned with trying
    to assign the correct names to our individual
    data.
  • It is the anonymised data that we are trying to
    reconstruct.

15
2001 UK Human Population Census: The individual
data exists!
  • The individual data are not available due to
    concerns over abuse of the data.
  • This is a legitimate concern, but it could be
    harmless to allow some way to link other data on
    names and addresses with this individual census
    data.
  • This has been done for some epidemiological work
  • It is not routine to do this even in controlled
    facilities AFAIK
  • For similar reasons of concern the anonymised
    data is subjected to further obfuscation by
    Disclosure Control Measures (DCMs)

16
2001 UK Human Population Census: Variable
aggregation
  • For the different data products, variables (e.g.
    age) are aggregated into groups differently.
  • Consequently reconstruction is non-trivial.
  • NB. Although the full address is removed from the
    data, for some outputs it is necessary to know
    which Output Area or higher spatial unit an
    individual is from.

17
2001 UK Human Population Census: Available census
outputs
  • Sample of Anonymised Records (SARs) and Small
    Area Microdata (SAM)
  • Census Aggregate Statistics (CAS)
  • Special Transport Statistics (STS)
  • Special Migration Statistics (SMS)
  • Longitudinal Study (LS)
  • Commissioned Tables

18
HSAR
  • The 2001 Household SAR is available for England
    and Wales only.
  • 1% stratified sample of households
  • 225436 household records
  • 525715 individual records
  • Individual records are available only for
    households with 11 or fewer residents
  • There are 60 variables, some of which are
    aggregated.
  • Age is in 2-year bands

19
ISAR
  • The 2001 Individual SAR is for all of the UK.
  • 3% sample
  • 1843525 Records
  • Includes people from the Communal Establishment
    Population (CEP)
  • Very similar variables to HSAR, but some crucial
    differences (e.g. Age)

20
CAS
  • Census Aggregate Statistics
  • Available at Output Area Level (and larger
    aggregate spatial units) for all the UK
  • Various table types
  • Key Statistics
  • Univariate
  • Standard
  • Multivariate
  • Themed

21
2001 UK Human Population Census: DCMs again
  • Disclosure control measures (DCMs) on CAS add
    further, unknown levels of error to the
    data
  • The Small Cell Adjustment Measure (SCAM) ensures
    that no count in any aggregate table that is
    disseminated is 1 or 2.
  • This DCM is notorious for adding unwanted error
    (making the census very difficult to use)
  • Among other issues it raises, it has the
    undesirable effect that counts from different
    tables that represent the same thing will not
    necessarily match.
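
The mismatch between tables can be illustrated with a toy example. The talk does not describe how SCAM adjusts a cell, so the rule below (a count of 1 or 2 becomes 0 or 3 at random, independently in each table) is purely an assumption made for the sketch.

```python
import random

def small_cell_adjust(count, rng):
    """Assumed rule: a count of 1 or 2 is replaced by 0 or 3 at random."""
    if count in (1, 2):
        # Probability chosen so the expected value is unchanged (an assumption).
        return 3 if rng.random() < count / 3 else 0
    return count

def adjust_table(cells, rng):
    """Apply the assumed adjustment to every cell of one table."""
    return [small_cell_adjust(c, rng) for c in cells]

rng = random.Random(0)

# The same 12 people in one area, tabulated by two different variables.
by_age = [1, 2, 4, 5]
by_tenure = [2, 2, 2, 6]

adj_age = adjust_table(by_age, rng)
adj_tenure = adjust_table(by_tenure, rng)

# Both true totals are 12, but the adjusted totals need not agree.
print(sum(adj_age), sum(adj_tenure))
```

No adjusted cell is 1 or 2, yet the two tables can now imply different population totals for the same area.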

22
2 ways to reconstruct individual level data
  1. Take the CAS and create synthetic individuals
    that match the aggregate characteristics
  2. Select from the Individual and Household SAR
    populations such that the aggregate
    characteristics closely match those in the CAS

23
General limitations
  • It is not possible to be sure that the data for
    individuals assigned to any location exactly
    matches the characteristics of the individuals
    that were there at the time of the census.
  • In doing 1 it is possible to make a perfect match
    for every area, but in doing 2, it might not be
    possible for any area.

24
Option 1 (Synthetic Individuals)
  • Constraints can be added to try to make the data
    reasonable
  • (e.g. someone aged 85 and with limiting long term
    illness probably does not work).
  • This is either arbitrary or non-trivial.
  • There is no census data that can be used to
    inform if there exist individuals with the
    synthetically assigned characteristics
    (combination of age group, ethnicity,
    socio-economic group, educational attainment,
    health status etc...) except for the SAR, which
    is Option 2.
  • Scales well in that it is not much more work to
    produce outputs for regions containing much
    larger populations.
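
A minimal sketch of Option 1, assuming only two univariate tables for an area (the variable names and counts are invented). It matches both tables exactly, but the pairing of characteristics within each synthetic individual is arbitrary, which is the weakness noted above.

```python
import random
from collections import Counter

def synthesise(age_counts, health_counts, rng):
    """Create synthetic (age, health) individuals matching both marginal tables."""
    ages = [a for a, n in age_counts.items() for _ in range(n)]
    health = [h for h, n in health_counts.items() for _ in range(n)]
    if len(ages) != len(health):
        raise ValueError("the two tables must describe the same number of people")
    rng.shuffle(health)          # arbitrary pairing: no joint information is used
    return list(zip(ages, health))

# Toy aggregate counts for one area (invented figures).
age_counts = {"0-15": 3, "16-64": 5, "65+": 2}
health_counts = {"good": 7, "limiting illness": 3}

people = synthesise(age_counts, health_counts, random.Random(0))
print(Counter(age for age, _ in people))
print(Counter(health for _, health in people))
```

Adding constraints would mean restricting which pairings are allowed, which is where the arbitrary or non-trivial choices come in.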

25
Option 2: Selecting from the SARs
  • It is too much to consider every combination of
    individuals from the SARs for the average Output
    Area (and there are 223060 OAs). Indeed, the
    number of combinations increases for regions with
    larger populations and greater numbers of
    households.
  • Roughly NAreas × (NRecords in SAR)^(Population of area) combinations
  • Some heuristic or strategy is needed to help
    select a good solution.
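
To give a feel for the size of the search space, the worked arithmetic below counts the ways of choosing the households for one typical Output Area from the Household SAR, using the figures quoted in these slides. Choosing distinct households without regard to order is an assumption of the sketch; the talk does not say whether records may be reused.

```python
import math

HSAR_HOUSEHOLDS = 225436   # household records quoted for the 2001 Household SAR
HOUSEHOLDS_PER_OA = 100    # "about a hundred households" in a typical OA

def log10_combinations(n, k):
    """log10 of 'n choose k', via log-gamma so no huge integers are needed."""
    log_e = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_e / math.log(10)

digits = log10_combinations(HSAR_HOUSEHOLDS, HOUSEHOLDS_PER_OA)
print(f"about 10^{digits:.0f} candidate household sets for one typical OA")
```

The count for a single area already has several hundred digits, so exhaustive search is out of the question.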

26
Option 2: using a genetic algorithm to guide the
search.
  • Various ways to do this.
  • An algorithm:
  1. Select Household Population (HP) from Household
    SAR records and Communal Establishment Population
    (CEP) from the Individual SAR a number of times
  2. Measure performance
  3. Select a number of the best performing sets
  4. Breed these sets by swapping some HP and CEP
  5. Repeat Steps 2 to 4 until convergence
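
The steps above can be sketched as a toy genetic algorithm. This is not the MoSeS code: it uses a single population rather than separate HP and CEP, each sample record carries just one invented category, and the function names, parameter defaults and fixed iteration cap are all assumptions.

```python
import random
from collections import Counter

def fitness(solution, records, target):
    """Sum of absolute differences from the target counts (lower is better)."""
    counts = Counter(records[i] for i in solution)
    return sum(abs(counts[k] - target[k]) for k in set(counts) | set(target))

def breed(parent_a, parent_b, rng):
    """Copy parent_a, then swap in a random number of entries from parent_b."""
    child = list(parent_a)
    n_swap = rng.randint(1, max(1, len(child) // 3))
    for pos in rng.sample(range(len(child)), n_swap):
        child[pos] = parent_b[pos]
    return child

def run_ga(records, target, size, n_sets=30, n_keep=10, n_iter=200, seed=0):
    rng = random.Random(seed)
    score = lambda s: fitness(s, records, target)
    # Step 1: select candidate sets of sample records a number of times.
    population = [[rng.randrange(len(records)) for _ in range(size)]
                  for _ in range(n_sets)]
    for _ in range(n_iter):
        population.sort(key=score)              # Step 2: measure performance
        best = population[:n_keep]              # Step 3: keep the best sets
        if score(best[0]) == 0:                 # perfect match, stop early
            break
        children = [breed(rng.choice(best), rng.choice(best), rng)
                    for _ in range(n_sets - n_keep)]   # Step 4: breed
        population = best + children            # Step 5: repeat
    return min(population, key=score)

# Toy data: 500 sample records, each with one of 5 categories.
data_rng = random.Random(1)
records = [data_rng.randrange(5) for _ in range(500)]
target = Counter(data_rng.choices(range(5), k=40))   # toy aggregate counts
best = run_ga(records, target, size=40)
print(fitness(best, records, target))
```

Keeping the best sets unchanged between iterations means the best score never gets worse; convergence is approximated here by a perfect score or the iteration cap.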

27
Enhancements: Constraints
  • 2 types of constraint
  • Control constraints
  • These things must be met for a solution to be
    viable
  • From CAS003 constrain by age of HRP for HP
  • From CAS001 constrain by age for CEP
  • Optimisation constraints
  • Can be any number of variables from the 60 or so
    in the SARs that are also in CAS
  • Done in the performance measure
  • Some are household population based
  • Some total population based
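
The two kinds of constraint might be separated in code roughly as follows. This is a sketch under assumptions: the field names (`hrp_age_band`, `tenure`) and the toy targets are invented, and absolute error is just one possible performance measure.

```python
from collections import Counter

def meets_control(households, control_target):
    """Control constraint: counts by HRP age band must match exactly."""
    counts = Counter(h["hrp_age_band"] for h in households)
    keys = set(counts) | set(control_target)
    return all(counts.get(k, 0) == control_target.get(k, 0) for k in keys)

def performance(households, optimisation_targets):
    """Optimisation constraints: total absolute error (lower is better)."""
    error = 0
    for variable, target in optimisation_targets.items():
        counts = Counter(h[variable] for h in households)
        keys = set(counts) | set(target)
        error += sum(abs(counts.get(k, 0) - target.get(k, 0)) for k in keys)
    return error

# Toy candidate solution for one area (invented records).
households = [
    {"hrp_age_band": "25-44", "tenure": "owned"},
    {"hrp_age_band": "25-44", "tenure": "rented"},
    {"hrp_age_band": "65+", "tenure": "owned"},
]
control_target = {"25-44": 2, "65+": 1}
optimisation_targets = {"tenure": {"owned": 1, "rented": 2}}

print(meets_control(households, control_target))       # viable or not
print(performance(households, optimisation_targets))   # how good, if viable
```

A solution that fails the control check is discarded outright, whereas a poor performance score only ranks it lower.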

28
Swapping records in breeding
  • This becomes harder the more control constraints
    are applied
  • The aggregate constraint characteristics from the
    set being swapped must match those selected
  • Being able to swap multiple records is a big
    advantage
  • More breadth of search
  • Less chance of getting stuck in a local minimum
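
One way to keep control constraints satisfied while swapping is to replace a record only with another that has the same control characteristic. The sketch below assumes a single control variable and invented record IDs; it is not the MoSeS implementation.

```python
import random
from collections import Counter, defaultdict

def group_by_control(sample, control_key):
    """Index sample record IDs by their control characteristic."""
    groups = defaultdict(list)
    for record_id, record in sample.items():
        groups[record[control_key]].append(record_id)
    return groups

def swap_records(solution, sample, groups, control_key, n_swap, rng):
    """Swap n_swap records for others sharing the same control value."""
    child = list(solution)
    for pos in rng.sample(range(len(child)), n_swap):
        key = sample[child[pos]][control_key]
        child[pos] = rng.choice(groups[key])
    return child

# Toy sample: 20 records with an invented HRP age band each.
sample = {i: {"hrp_age_band": "25-44" if i % 2 else "65+"} for i in range(20)}
groups = group_by_control(sample, "hrp_age_band")

solution = [0, 1, 2, 3, 4, 5]
child = swap_records(solution, sample, groups, "hrp_age_band", 3, random.Random(0))

def control_counts(ids):
    return Counter(sample[i]["hrp_age_band"] for i in ids)

print(control_counts(solution) == control_counts(child))
```

With more control variables the groups become smaller, which is why swapping gets harder as control constraints are added.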

29
HSAR: Aggregate HP Control Characteristics
ISAR: Aggregate CEP Control Characteristics
30
Breeding parameters
  • Need to not swap too much HP or CEP
  • Else optimisation is slow
  • Swapping a random amount each time is good, and
    swapping up to about a third of the HP and CEP
    seems OK
  • Good to keep a diversity in the breeding
    population of solutions
  • Especially in the early iterations

31
Re-constraining
  • There is a limit to the number of control
    constraints that can be used
  • New optimisation constraints can be added and
    others removed by modifying the fitness function
  • e.g. For some applications it might be more
    important to get household composition right
    rather than socio-economic group
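
Re-constraining through the fitness function could look like the sketch below, where each optimisation variable has a weight. The weighting scheme, variable names and counts are assumptions for illustration only.

```python
def weighted_fitness(counts, targets, weights):
    """Weighted sum of absolute errors; only variables in weights are used."""
    total = 0.0
    for variable, weight in weights.items():
        observed, expected = counts[variable], targets[variable]
        keys = set(observed) | set(expected)
        total += weight * sum(abs(observed.get(k, 0) - expected.get(k, 0))
                              for k in keys)
    return total

# Toy counts for one candidate solution and the matching targets (invented).
counts = {
    "household_composition": {"single": 3, "couple": 2},
    "socio_economic_group": {"A": 1, "B": 4},
}
targets = {
    "household_composition": {"single": 2, "couple": 3},
    "socio_economic_group": {"A": 3, "B": 2},
}

equal = weighted_fitness(counts, targets,
                         {"household_composition": 1, "socio_economic_group": 1})
favour_households = weighted_fitness(counts, targets,
                                     {"household_composition": 3, "socio_economic_group": 1})
households_only = weighted_fitness(counts, targets, {"household_composition": 1})
print(equal, favour_households, households_only)
```

Raising a weight makes errors in that variable more costly, and leaving a variable out of the weights removes it as an optimisation constraint.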

32
Results
  • Sorry, no results to show here!
  • Results for Leeds have been produced, optimising
    on household composition, employment, health, age
    and gender.
  • The same type of result for the UK is nearly
    available
  • A week away
  • I have produced graphs that indicate how well the
    results perform
  • Maps of the residuals can also be produced and
    any spatial patterns may provide clues for
    improvement

33
Reconstructing the reconstructions
  • Each HSAR record and ISAR record and Output Area
    have unique IDs and these can be publicly
    disseminated.
  • Using a simple structure of two lists, one for
    the HP (either all records or just the HRP), the
    other for the CEP for each OA it is
    straightforward to recreate the result.
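
A minimal sketch of the two-list structure, assuming the user holds their own licensed copies of the SAR data keyed by record ID. The IDs, field names and Output Area code below are invented placeholders.

```python
def rebuild_area(hp_ids, cep_ids, hsar, isar):
    """Look up the listed record IDs to recreate one area's population."""
    households = [hsar[i] for i in hp_ids]
    communal = [isar[i] for i in cep_ids]
    return households, communal

# The disseminated result: for each OA, one list of HSAR IDs and one of ISAR IDs.
result = {
    "OA_EXAMPLE_1": {"hp": ["H17", "H42", "H17"], "cep": ["I5"]},
}

# Stand-ins for the user's own copies of the SAR records.
hsar = {"H17": {"residents": 2}, "H42": {"residents": 4}}
isar = {"I5": {"age": 83}}

area = result["OA_EXAMPLE_1"]
households, communal = rebuild_area(area["hp"], area["cep"], hsar, isar)
print(len(households), len(communal))
```

Only IDs are exchanged, so the SAR records themselves never need to be redistributed.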

34
Plans in the near future
  • Archive what we have done (results and code) and
    run for the UK again with some additional
    transport variables included in the optimisation.
  • Can be done by restarting from the previous best
    optimisation
  • Do some experiments with modifying the
    optimisation function during training.

35
Acknowledgements and Thanks
  • Thanks to MoSeS researchers, collaborators and
    funders.
  • Thanks to all involved in eResearch for improving
    our hardware, software and data resources so that
    we can all do our bit to better understand and
    plan our future.
  • Thank you for listening!

36
More Background on MoSeS follows in the next 6
slides
37
Initial Tasks
  • Develop methods to generate individual human
    population data for the UK from 2001 UK human
    population census data
  • Develop a Toy Model
  • Dynamic agent based microsimulation modelling
    toolkit and apply it to simulate change in the UK
  • Develop applications for
  • Health
  • Business
  • Transport

38
Challenges
  • Grid enabling the data and tools
  • Visualisation
  • Google Earth
  • Computer Games
  • Collaboration
  • Retaining a problem focus
  • Design and Development

39
Generic MoSeS Approach
  • MoSeS to date has approached Modelling and
    Simulation from a specific angle
  • Geographic
  • Demographic
  • Contemporary
  • About the UK
  • Targeted towards supporting a developing set of
    applications
  • It is not a requirement to make it clear what
    steps can be followed by other Social Scientists
    wanting to Model and Simulate something different
  • However, the generic work of MoSeS should be
    relevant and we are working towards this

40
MoSeS Vision
  • Suppose that computational power and data storage
    were not an issue, what would you build?
  • SimCity
  • http://en.wikipedia.org/wiki/SimCity
  • For real on a national scale

41
MoSeS Rationale
  • The idea is to provide planners, policy makers
    and the public with a tool to help them analyse
    the potential impacts and the likely effect of
    planning and policy changes.
  • Example Application
  • There may be a housing policy to do with joint
    ownership, taxation and planning restriction
    legislation that can be developed to alleviate
    problems to do with lack of affordable housing
    and workers without precipitating a crash in the
    housing market and economy as a whole
  • A balanced policy may be easier to develop by
    running a large number of simulations within a
    system like SimCity for real to understand the
    sensitivities involved

42
MoSeS First Steps
  • The development of a national demographic model
  • The development of 3 applications
  • Health care
  • Transport
  • Business
  • The development of a portal interface to support
    the development and resulting applications by
    providing access to the data, models and
    simulations and presenting information to users
    (application developers) in a secure way
