Title: 4th International Conference on e-Social Science:
1. Reconstruction of the entire UK population using microsimulation
- Andy Turner
- http://www.geog.leeds.ac.uk/people/a.turner/
2. Overview
- Introduction
- What
- Why
- How
- What next
3. Introduction
- The title is a bit odd and vague
- reconstruction using microsimulation
- I can only guess what this is.
- I don't think this presentation addresses that.
- Hopefully it does address something relevant and of interest!
4. This presentation focuses on
- Developing digital demographic data for the UK
- A reconstruction of 2001 data that has existed since around 2003.
- MoSeS Genetic Algorithm that attempts to reconstruct individual level data for every individual in the UK in 2001
- How you can reconstruct the MoSeS reconstruction
5. What is MoSeS?
- Modelling and Simulation for e-Social Science
- http://www.ncess.ac.uk/research/nodes/MoSeS/
- e-Social Science being the application of e-Science concepts to social science problem domains
- e-Science is enhanced science that uses the Internet, software tools and structured information for collaborative work
- A first phase research node of NCeSS
- Part of a UK collaborative partnership developing e-Social Science
- The key part of its program of work is to develop an individually based demographic model of the UK for 2001 to 2031
- MoSeS people
6. What am I on about and what do we want?
- UK demographic data reconstruction for 2001.
- The UK demographic data we want largely exists as
2001 human population (census) data, but it is
not available as 2001 census output
7. Why do we want it?
- Reconstructed data is input into a dynamic model that operates at the individual and household level to simulate population change for MoSeS applications.
- Belinda and/or Mark will be talking about the dynamic model work later on
- It is theorised that in order to be realistic and of use in local service and transport planning, the demographic models have to operate at this individual and household level.
8. Enriching the base population
- Efforts are being made to enrich the census reconstruction with additional data from other sources (e.g. British Household Panel Survey)
- The results of this data integration are new constructions, data that has not previously existed.
- The idea is to add non-census variables to the base census data reconstruction.
- Chengchao Zuo is doing some of this work, but is not presenting it here.
9. Introducing Census Data
- In reconstructing the census data it is necessary to
- know some details of the available published data
- consider the different ways of doing it.
- So I'm going to describe the available census data and then introduce a couple of ways of reconstructing the individual census data for all individuals.
10. 2001 UK Human Population Census: Scope and general characteristics
- Attempt to collect demographic data about all individuals in the UK at a specific time.
- Data collected via a paper form and digitised.
- Includes (in the region of) a hundred variables that detail each individual.
11. 2001 UK Human Population Census: Key units and references
- Data collected for households and communal establishments
- For each household there is a household reference person (HRP) and there are some variables that describe the relationships between each household individual and this HRP
- Communal establishments include hospitals, hospices, prisons etc. and, in Scotland, residential schools.
- The definition of, and difference between, households and communal establishments is important.
- Output Areas (OAs)
- Smallest regions of aggregated data dissemination
- Grouped into MSOA, Wards, Regions
- New to 2001
- A typical OA might contain 300 people and about a hundred households and may contain a communal establishment.
12. Households
13. Communal Establishments
14. 2001 UK Human Population Census: Anonymisation and the individual data
- Digitised data was anonymised
- A new version was produced that had names and addresses removed.
- Data with names and addresses is more useful than the anonymised form, but due to various concerns the file that would link individual records with the name and address information is classified.
- In MoSeS we have not been concerned with trying to assign the correct names to our individual data.
- It is the anonymised data that we are trying to reconstruct.
15. 2001 UK Human Population Census: The individual data exists!
- The individual data are not available due to concerns over abuse of the data.
- This is a legitimate concern, but it could be harmless to allow some way to link other data on names and addresses with this individual census data.
- This has been done for some epidemiological work
- It is not routine to do this even in controlled facilities AFAIK
- For similar reasons of concern the anonymised data is subjected to further obfuscation by Disclosure Control Measures (DCMs)
16. 2001 UK Human Population Census: Variable aggregation
- For the different data products, variables (e.g. age) are aggregated into groups differently.
- Consequently reconstruction is non-trivial.
- NB. Although the full address is removed from the data, for some outputs it is necessary to know which Output Area or higher spatial unit an individual is from.
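A minimal Python sketch of why differing aggregation makes reconstruction non-trivial. The band boundaries below are invented for illustration and are not the real census groupings; `band_label` and `overlapping_bands` are hypothetical helper names.

```python
from bisect import bisect_right

# Hypothetical band lower bounds for two data products.
# These are NOT the real census groupings.
PRODUCT_A_BANDS = [0, 16, 25, 45, 65]
PRODUCT_B_BANDS = [0, 20, 40, 60, 80]


def band_label(age, lower_bounds):
    """Return the label of the band that contains `age`."""
    i = bisect_right(lower_bounds, age) - 1
    lo = lower_bounds[i]
    if i + 1 < len(lower_bounds):
        return f"{lo}-{lower_bounds[i + 1] - 1}"
    return f"{lo}+"  # open-ended top band


def overlapping_bands(lo, hi, lower_bounds):
    """Labels of every band that contains at least one age in lo..hi."""
    labels = []
    for age in range(lo, hi + 1):
        label = band_label(age, lower_bounds)
        if label not in labels:
            labels.append(label)
    return labels


# A person reported as "16-24" in product A could fall in either of two
# bands in product B, so the two products cannot be matched one to one.
print(band_label(30, PRODUCT_A_BANDS))
print(overlapping_bands(16, 24, PRODUCT_B_BANDS))
```

When a band in one product straddles two bands in another, any comparison between the two has to be made at a coarser grouping that both share, or accept some ambiguity.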
17. 2001 UK Human Population Census: Available census outputs
- Samples of Anonymised Records (SARs) and Small Area Microdata (SAM)
- Census Aggregate Statistics (CAS)
- Special Transport Statistics (STS)
- Special Migration Statistics (SMS)
- Longitudinal Study (LS)
- Commissioned Tables
18. HSAR
- The 2001 Household SAR is available for England and Wales only.
- 1% stratified sample of households
- 225436 household records
- 525715 individual records
- Individual records are available only for households with 11 or fewer residents
- There are 60 variables, some of which are aggregated.
- Age is in 2 year bands
19. ISAR
- The 2001 Individual SAR is for all of the UK.
- 3% sample
- 1843525 records
- Includes people from the Communal Establishment Population (CEP)
- Very similar variables to HSAR, but some crucial differences (e.g. Age)
20. CAS
- Census Aggregate Statistics
- Available at Output Area level (and larger aggregate spatial units) for all the UK
- Various table types
- Key Statistics
- Univariate
- Standard
- Multivariate
- Themed
21. 2001 UK Human Population Census: DCMs again
- Disclosure control measures (DCMs) on CAS add unknown levels of error to the data
- The Small Cell Adjustment Measure (SCAM) ensures that no count in any disseminated aggregate table is 1 or 2.
- This DCM is notorious for adding unwanted error (making the census very difficult to use)
- Among other issues it raises, it has the undesirable effect that counts from different tables that represent the same thing will not necessarily match.
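A minimal Python sketch of why small cell adjustment makes tables disagree. It assumes that counts of 1 or 2 are moved to 0 or 3, with the chance of rounding up set to count / 3 so that the expected value is unchanged. That rule, the function names and the two toy tables are all illustrative assumptions, not the published procedure.

```python
import random


def small_cell_adjust(count, rng):
    """Replace a count of 1 or 2 with 0 or 3 (illustrative rule only).

    Assumption: the chance of rounding up to 3 is count / 3, which keeps
    the expected value unchanged. The real procedure may differ.
    """
    if count in (1, 2):
        return 3 if rng.random() < count / 3 else 0
    return count


def adjust_table(table, rng):
    """Apply the adjustment independently to every cell of a table."""
    return {cell: small_cell_adjust(n, rng) for cell, n in table.items()}


rng = random.Random(42)

# Two hypothetical tables for one area. Both describe the same 12 people.
by_age = {"0-15": 2, "16-64": 9, "65+": 1}
by_sex = {"female": 7, "male": 5}

adj_age = adjust_table(by_age, rng)
adj_sex = adjust_table(by_sex, rng)

# The sex table has no small cells, so its total is still 12. The age
# table has two small cells, so its total may now be 9, 12 or 15.
print(sum(adj_age.values()), sum(adj_sex.values()))
```

Because each table is adjusted independently, totals that should agree across tables can drift apart, which is the mismatch described above.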
22. 2 ways to reconstruct individual level data
- Option 1: Take the CAS and create synthetic individuals that match the aggregate characteristics
- Option 2: Select from the Individual and Household SAR populations such that the aggregate characteristics closely match those in the CAS
23. General limitations
- It is not possible to be sure that the data for individuals assigned to any location exactly matches the characteristics of the individuals that were there at the time of the census.
- With Option 1 it is possible to make a perfect match to the aggregate statistics for every area, but with Option 2 it might not be possible for any area.
24. Option 1 (Synthetic Individuals)
- Constraints can be added to try to make the data reasonable
- (e.g. someone aged 85 and with limiting long term illness probably does not work).
- This is either arbitrary or non-trivial.
- There is no census data that can be used to check whether individuals exist with the synthetically assigned characteristics (combination of age group, ethnicity, socio-economic group, educational attainment, health status etc.) except for the SAR, which is Option 2.
- Scales well in that it is not much more work to produce outputs for regions containing much larger populations.
25. Option 2: Selecting from the SARs
- It is too much to consider every combination of individuals from the SARs for the average Output Area (and there are 223060 OAs). Indeed, the number of combinations increases for regions with larger populations and greater numbers of households.
- NAreas × (NRecords in SAR) ^ (Population of area)
- Some heuristic or strategy is needed to help select a good solution.
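A rough Python sketch of the scale of the search space, using figures quoted in this presentation (225436 HSAR household records, about a hundred households in a typical OA). How the selections are counted, with or without replacement, is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

HSAR_HOUSEHOLD_RECORDS = 225436  # household records in the HSAR
HOUSEHOLDS_PER_OA = 100          # "about a hundred households" per typical OA


def log10_ordered_selections(n_records, n_slots):
    """log10 of n_records ** n_slots (with replacement, order kept)."""
    return n_slots * math.log10(n_records)


def log10_unordered_selections(n_records, n_slots):
    """log10 of C(n_records, n_slots) (without replacement, order ignored)."""
    log_comb = (math.lgamma(n_records + 1)
                - math.lgamma(n_slots + 1)
                - math.lgamma(n_records - n_slots + 1))
    return log_comb / math.log(10)


ordered = log10_ordered_selections(HSAR_HOUSEHOLD_RECORDS, HOUSEHOLDS_PER_OA)
unordered = log10_unordered_selections(HSAR_HOUSEHOLD_RECORDS, HOUSEHOLDS_PER_OA)
print(f"ordered selections per OA:   about 10^{ordered:.0f}")
print(f"unordered selections per OA: about 10^{unordered:.0f}")
```

Either way of counting gives a number with hundreds of digits for a single typical OA, so exhaustive enumeration is out of the question.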
26. Option 2: using a genetic algorithm to guide the search.
- Various ways to do this.
- An algorithm
1. Select Household Population (HP) from Household SAR records and Communal Establishment Population (CEP) from the Individual SAR a number of times
2. Measure performance
3. Select a number of the best performing sets
4. Breed these sets by swapping some HP and CEP
5. Repeat steps 2 to 4 until convergence
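The loop above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the MoSeS code: it handles a single population rather than separate HP and CEP, uses one made-up variable, reads "breeding" as swapping some records of a good set for fresh draws from the sample, and stops on a perfect match or after a fixed number of iterations. All names and parameter values are hypothetical.

```python
import random


def fitness(selection, records, target):
    """Total absolute error between selection counts and target counts."""
    counts = {}
    for record_id in selection:
        category = records[record_id]
        counts[category] = counts.get(category, 0) + 1
    categories = set(counts) | set(target)
    return sum(abs(counts.get(c, 0) - target.get(c, 0)) for c in categories)


def breed(parent, record_ids, rng):
    """Copy a parent and swap up to about a third of its records."""
    child = list(parent)
    n_swaps = rng.randint(1, max(1, len(child) // 3))
    for i in rng.sample(range(len(child)), n_swaps):
        child[i] = rng.choice(record_ids)
    return child


def run_ga(records, target, size, rng, pop_size=30, n_best=10, max_iter=200):
    record_ids = sorted(records)
    # Select a population from the sample a number of times.
    population = [[rng.choice(record_ids) for _ in range(size)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(max_iter):
        # Measure performance (lower error is better).
        population.sort(key=lambda s: fitness(s, records, target))
        history.append(fitness(population[0], records, target))
        if history[-1] == 0:
            break  # perfect match to the target counts
        # Select a number of the best performing sets.
        best_sets = population[:n_best]
        # Breed these sets by swapping some of their records.
        children = [breed(rng.choice(best_sets), record_ids, rng)
                    for _ in range(pop_size - n_best)]
        # Keep the best sets unchanged so the best error never gets worse.
        population = best_sets + children
    return population[0], history


rng = random.Random(1)
# Toy sample: record ID -> category of one made-up variable.
records = {i: rng.choice(["employed", "unemployed", "retired"])
           for i in range(500)}
# Toy aggregate counts for one area of 10 people.
target = {"employed": 6, "unemployed": 1, "retired": 3}

best, history = run_ga(records, target, size=10, rng=rng)
print(history[0], history[-1])
```

Keeping the best sets in the next generation is a design choice of this sketch; it guarantees the best error is non-increasing from one iteration to the next.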
27. Enhancements: Constraints
- 2 types of constraint
- Control constraints
- These things must be met for a solution to be viable
- From CAS003 constrain by age of HRP for HP
- From CAS001 constrain by age for CEP
- Optimisation constraints
- Can be any number of variables from the 60 or so in the SARs that are also in CAS
- Done in the performance measure
- Some are household population based
- Some total population based
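A minimal Python sketch of the difference between the two constraint types. Control constraints are treated as a strict viability check and optimisation constraints as an error score. The variable names, categories and counts are made up for illustration, and the error measure (sum of absolute differences) is an assumption rather than the measure actually used.

```python
from collections import Counter


def aggregate(records, variable):
    """Count the selected records by category of one variable."""
    return Counter(record[variable] for record in records)


def meets_control(records, variable, required):
    """Control constraint: aggregated counts must match exactly."""
    counts = aggregate(records, variable)
    categories = set(counts) | set(required)
    return all(counts.get(c, 0) == required.get(c, 0) for c in categories)


def performance(records, targets):
    """Optimisation constraints: total absolute error (lower is better)."""
    error = 0
    for variable, target in targets.items():
        counts = aggregate(records, variable)
        for category in set(counts) | set(target):
            error += abs(counts.get(category, 0) - target.get(category, 0))
    return error


# Hypothetical selected households, described by the age band of the HRP
# (used as a control) and tenure (used as an optimisation variable).
households = [
    {"hrp_age": "16-34", "tenure": "owned"},
    {"hrp_age": "35-64", "tenure": "rented"},
    {"hrp_age": "35-64", "tenure": "owned"},
]
control_counts = {"16-34": 1, "35-64": 2}                   # must match
optimisation_targets = {"tenure": {"owned": 1, "rented": 2}}  # should match

print(meets_control(households, "hrp_age", control_counts))
print(performance(households, optimisation_targets))
```

Adding or dropping an optimisation constraint in this sketch only means adding or removing an entry in `optimisation_targets`; the viability check is unaffected.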
28. Swapping records in breeding
- This becomes harder the more control constraints are applied
- The aggregate constraint characteristics from the set being swapped must match those selected
- Being able to swap multiple records is a big advantage
- More breadth of search
- Less chance of getting stuck in a local minimum
29. [Diagram: HSAR, Aggregate HP Control Characteristics; ISAR, Aggregate CEP Control Characteristics]
30. Breeding parameters
- Need to not swap too much HP or CEP
- Else optimisation is slow
- Swapping a random amount each time is good, and swapping up to about a third of the HP and CEP seems OK
- Good to keep a diversity in the breeding population of solutions
- Especially in the early iterations
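A minimal Python sketch of a swap that respects a control constraint and swaps a random amount, up to about a third of the records, each time. Records are hypothetical (ID, HRP age band) pairs, and `swap_records` is an illustrative name, not a function from the MoSeS code.

```python
import random
from collections import Counter, defaultdict


def swap_records(selection, pool, control_key, rng):
    """Swap a random number of records (at most a third of the selection)
    for pool records that share the same control characteristic."""
    by_control = defaultdict(list)
    for record in pool:
        by_control[control_key(record)].append(record)

    child = list(selection)
    n_swaps = rng.randint(1, max(1, len(child) // 3))
    for i in rng.sample(range(len(child)), n_swaps):
        candidates = by_control.get(control_key(child[i]))
        if candidates:  # only swap when a like-for-like record exists
            child[i] = rng.choice(candidates)
    return child


rng = random.Random(7)
bands = ["16-34", "35-64", "65+"]
# Hypothetical sample: 60 records, 20 in each control band.
pool = [(record_id, bands[record_id % 3]) for record_id in range(60)]
selection = rng.sample(pool, 9)

child = swap_records(selection, pool, control_key=lambda r: r[1], rng=rng)

# The control counts are unchanged, so the child is still viable.
print(Counter(r[1] for r in selection) == Counter(r[1] for r in child))
```

Swapping like for like keeps every child viable by construction, at the cost of a smaller pool of candidates for each swap as more control constraints are applied.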
31. Re-constraining
- There is a limit to the number of control constraints that can be used
- New optimisation constraints can be added and others removed by modifying the fitness function
- e.g. For some applications it might be more important to get household composition right rather than socio-economic group
32. Results
- Sorry, no results to show here!
- Results for Leeds have been produced, with optimisation constraints on household composition, employment, health, age and gender.
- The same type of result for the UK is nearly available
- A week away
- I have produced graphs that indicate how well the results perform
- Maps of the residuals can also be produced and any spatial patterns may provide clues for improvement
33. Reconstructing the reconstructions
- Each HSAR record, ISAR record and Output Area has a unique ID and these can be publicly disseminated.
- Using a simple structure of two lists for each OA, one for the HP (either all records or just the HRP), the other for the CEP, it is straightforward to recreate the result.
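A minimal Python sketch of the two-list structure and how it could be expanded back into individual records by someone who holds the SAR files. The OA codes, record IDs and record contents are invented for illustration, and it is assumed here that the HP list holds household IDs and that the same record may be used more than once.

```python
def rebuild(reconstruction, hsar_households, isar_individuals):
    """Expand per-OA ID lists into lists of individual records."""
    population = {}
    for oa_code, id_lists in reconstruction.items():
        people = []
        for household_id in id_lists["HP"]:
            # One HSAR household ID stands for all of its members.
            people.extend(hsar_households[household_id])
        for individual_id in id_lists["CEP"]:
            people.append(isar_individuals[individual_id])
        population[oa_code] = people
    return population


# Hypothetical local copies of the SAR data, keyed by record ID.
hsar_households = {
    101: [{"age": 34}, {"age": 36}],
    102: [{"age": 71}],
}
isar_individuals = {9001: {"age": 88}}

# The disseminated result: only IDs, two lists per Output Area.
reconstruction = {
    "OA_A": {"HP": [101, 102, 101], "CEP": [9001]},
    "OA_B": {"HP": [102], "CEP": []},
}

population = rebuild(reconstruction, hsar_households, isar_individuals)
print({oa: len(people) for oa, people in population.items()})
```

Only IDs are shared in this structure, so no record content needs to be disseminated with the result.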
34. Plans in the near future
- Archive what we have done (results and code) and run for the UK again with some additional transport variables included in the optimisation.
- Can be done by restarting from the previous best optimisation
- Do some experiments with modifying the optimisation function during training.
35. Acknowledgements and Thanks
- Thanks to MoSeS researchers, collaborators and funders.
- Thanks to all involved in eResearch for improving our hardware, software and data resources so that we can all do our bit to better understand and plan our future.
- Thank you for listening!
36. More background on MoSeS follows in the next 6 slides
37. Initial Tasks
- Develop methods to generate individual human population data for the UK from 2001 UK human population census data
- Develop a Toy Model
- Dynamic agent based microsimulation modelling toolkit and apply it to simulate change in the UK
- Develop applications for
- Health
- Business
- Transport
38. Challenges
- Grid enabling the data and tools
- Visualisation
- Google Earth
- Computer Games
- Collaboration
- Retaining a problem focus
- Design and Development
39. Generic MoSeS Approach
- MoSeS to date has approached Modelling and Simulation from a specific angle
- Geographic
- Demographic
- Contemporary
- About the UK
- Targeted towards supporting a developing set of applications
- It is not a requirement to make it clear what steps can be followed by other Social Scientists wanting to Model and Simulate something different
- However, the generic work of MoSeS should be relevant and we are working towards this
40. MoSeS Vision
- Suppose that computational power and data storage were not an issue, what would you build?
- SimCity
- http://en.wikipedia.org/wiki/SimCity
- For real on a national scale
41. MoSeS Rationale
- The idea is to provide planners, policy makers and the public with a tool to help them analyse the potential impacts and the likely effect of planning and policy changes.
- Example Application
- There may be a housing policy to do with joint ownership, taxation and planning restriction legislation that can be developed to alleviate problems to do with lack of affordable housing and workers without precipitating a crash in the housing market and economy as a whole
- A balanced policy may be easier to develop by running a large number of simulations within a system like SimCity for real to understand the sensitivities involved
42. MoSeS First Steps
- The development of a national demographic model
- The development of 3 applications
- Health care
- Transport
- Business
- The development of a portal interface to support
the development and resulting applications by
providing access to the data, models and
simulations and presenting information to users
(application developers) in a secure way