Title: Data Capture Overview
1 Data Capture Overview
- United Nations Statistics Division
2 Overview of Presentation
- Definition of data capture
- Methods of data capture
- Different Methods
- Advantages and disadvantages
- Issues to consider
3 What Is Data Capture?
- Data capture is the system used to convert the information obtained in the census to a format that can be interpreted by a computer.
- Source: United Nations Principles and Recommendations for Population and Housing Censuses, Rev. 2, p. 68.
4 Data Capture Methods
- Keyboard data entry
- Optical mark recognition/reading (OMR)
- Optical character recognition/intelligent character recognition (OCR/ICR)
- Personal digital assistant (PDA)
- Internet
- Advantages/disadvantages/costs/impacts at both data capture and later stages
- Combination of more than one of the above methods
5 Keyboard Data Entry
- Response codes from the census form are manually entered into computers
- A more sophisticated version involves computer-assisted key entry, where the operator selects a response from options displayed on the screen
- Use of the method is based on time and cost considerations, and on the feasibility of implementing more sophisticated technology
- The method is also used to process textual responses into classification categories
6 Advantages and Disadvantages of Keyboard Data Entry
- Advantages
  - Method requires simple software systems and low-end computing hardware
  - Less costly (depending on the costs of manpower)
  - A large number of PCs will be available for other uses after the census
- Disadvantages
  - Requires more staff
  - Task takes much longer to complete than with automated data entry
  - Potential for errors during data entry
  - Standardization of operations is difficult, as performance may depend on the individual operator
7 Data Capture Technologies
- Imaging and intelligent character recognition offer great potential and benefits for data capture
- Technology should be used to make data capture more effective and efficient, not for technology's sake
- Be aware of the long lead times and the technology infrastructure required for successful implementation of intelligent character recognition
8 Optical Mark Recognition/Reading (OMR)
- OMR is a form-scanning method whereby responses are read into a computer without a keyboard
- OMR technology reads responses to tick-box type questions on specially designed paper
- Only the presence or absence of a mark is detected by the machine
- The scanned responses are transformed into codes
- Handwritten responses must be manually entered or coded using computer-assisted methods
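The step from detected marks to codes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `read_tick_boxes`, the fill-ratio inputs, the 0.4 threshold, and the response codes are all hypothetical and do not come from any particular OMR product. It assumes a scanner has already measured, for each response box, the fraction of the box area that is dark.

```python
from typing import Optional, Sequence


def read_tick_boxes(fill_ratios: Sequence[float],
                    codes: Sequence[str],
                    threshold: float = 0.4) -> Optional[str]:
    """Return the code of the single marked box, or None if review is needed.

    fill_ratios: hypothetical dark-pixel fraction (0..1) measured per box.
    codes: the response code assigned to each box, in the same order.
    threshold: assumed cut-off above which a box counts as marked.
    """
    if len(fill_ratios) != len(codes):
        raise ValueError("one code is needed per response box")
    # Only presence or absence of a mark matters, so each box is a yes/no test.
    marked = [code for ratio, code in zip(fill_ratios, codes)
              if ratio >= threshold]
    # Blank or multiply marked questions cannot be coded automatically.
    return marked[0] if len(marked) == 1 else None


question_codes = ["1", "2"]  # hypothetical codes for a two-box question
print(read_tick_boxes([0.05, 0.82], question_codes))  # one clear mark
print(read_tick_boxes([0.71, 0.82], question_codes))  # two marks: needs review
```

Returning `None` for blank or doubly marked questions reflects that such responses would have to be resolved by an operator rather than by the machine.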
9 Advantages and Disadvantages of OMR
- Advantages
  - Improved data accuracy
  - Data capture is faster than keyboard data entry
  - Equipment is relatively inexpensive
  - Relatively simple to install and run
  - A well-established technology that has been used in many countries
- Disadvantages
  - Restrictions as to form design
  - Restrictions on type of paper and ink
  - Precision required in the printing process and cutting of sheets
  - Response boxes must be correctly marked with an appropriate pen or pencil
  - Won't capture textual responses
10 Optical Character Recognition (OCR)/Intelligent Character Recognition (ICR)
- OCR and ICR combine scanning and character recognition technology to scan the whole form and interpret the responses
- OCR technology recognizes machine-printed characters only
- ICR technology reads both machine-printed and handwritten responses in specific locations of the page and transforms the responses into codes
- For OCR, handwritten responses must be manually entered or coded using computer-assisted methods
11 Advantages of OCR/ICR
- Form design is not as stringent as for OMR
- Processing time can be reduced due to the automated nature of the process
- Allows for digital filing of questionnaires, resulting in efficient storage and retrieval of questionnaires for future use
- Some handwritten responses can be automatically coded, thereby improving data quality
12 Disadvantages of OCR/ICR
- Higher costs of equipment (sophisticated hardware/software required)
- High-calibre IT staff required to support the system
- Handwriting on census forms must be as close as possible to the model handwriting to avoid recognition errors
- Possibility of character substitution errors, which would affect data quality
- Tuning of the recognition engine to accurately recognize characters is critical, with a trade-off between quality and cost
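The quality/cost trade-off in tuning can be illustrated with a small Python sketch. It assumes, beyond what the slides state, that the recognition engine reports a confidence score for each field and that fields below a chosen threshold are sent to operators for manual keying. The sample values and the function name `evaluate_threshold` are invented for illustration.

```python
from typing import List, Tuple

# Each entry: (recognized text, engine confidence 0..1, true text).
# Made-up sample; the true text is known only because this is test data.
SAMPLE: List[Tuple[str, float, str]] = [
    ("34", 0.98, "34"),
    ("17", 0.91, "17"),
    ("51", 0.74, "57"),  # substitution error: 7 read as 1
    ("08", 0.66, "08"),
    ("29", 0.58, "20"),  # substitution error: 0 read as 9
]


def evaluate_threshold(fields: List[Tuple[str, float, str]],
                       threshold: float) -> Tuple[int, int]:
    """Return (fields sent to manual keying, wrong values accepted)."""
    # Fields below the threshold go to operators: this is the cost side.
    manual = sum(1 for _, conf, _ in fields if conf < threshold)
    # Wrong values at or above the threshold slip through: the quality side.
    accepted_errors = sum(1 for text, conf, truth in fields
                          if conf >= threshold and text != truth)
    return manual, accepted_errors


for threshold in (0.5, 0.7, 0.8):
    manual, errors = evaluate_threshold(SAMPLE, threshold)
    print(f"threshold={threshold}: manual={manual}, accepted errors={errors}")
```

On this sample, raising the threshold removes the accepted substitution errors but sends more fields to manual keying, which is the trade-off a tuning exercise would have to settle on real test forms.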
13 Personal Digital Assistant (PDA)
- Contents of the census form are stored on the PDA so that the questions appear sequentially on the screen
- Data are entered into a hand-held computer instead of onto a paper census form
- Data are then electronically transmitted to an NSO database for further processing
14 Advantages and Disadvantages of Use of the PDA
- Advantages
  - Instant data capture at the point of collection, reducing manual input errors
  - Immediate data validation, reducing re-verification at a later stage
  - Time-effective, with real-time logical validation rules reducing logical errors
  - Faster processing of census information, leading to timely availability of results
- Disadvantages
  - Setting up the process may take a long time, as it requires extensive testing
  - Requires that enumerators are able to use the device, which may require administering a test
  - Requires intensive training of enumerators on use of the device (training is more complicated)
  - Need to recharge the battery, which could run out during enumeration
  - Possibility of equipment failure
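The real-time logical validation listed among the advantages above could look roughly like the following Python sketch, run on the device as each person record is entered. The field names, the age limits, and the function `validate_person` are hypothetical examples, not rules taken from any census programme.

```python
from typing import Dict, List


def validate_person(record: Dict[str, object]) -> List[str]:
    """Return a list of validation messages; an empty list means no errors."""
    errors: List[str] = []
    age = record.get("age")
    # Range check: later rules depend on age, so stop here if it is unusable.
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age must be a whole number between 0 and 120")
        return errors
    # Consistency checks between answers (thresholds are assumed examples).
    if age < 12 and record.get("marital_status") not in (None, "never_married"):
        errors.append("marital status is inconsistent with age under 12")
    if age < 10 and record.get("occupation"):
        errors.append("occupation should be blank for age under 10")
    return errors


print(validate_person({"age": 7, "marital_status": "married"}))
print(validate_person({"age": 34, "marital_status": "married",
                       "occupation": "teacher"}))
```

Because the enumerator sees any message immediately, an inconsistent answer can be corrected with the respondent still present instead of being re-verified later.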
15 Internet-based Data Collection
- Use of the Internet for census data collection is growing
- However, the method is always complementary to other, more established methods
- As with PDAs, the online form is not a downloadable version of the paper form
- Use of this method requires a password in order to access and fill in the form
- Development of the Internet system for data collection is generally outsourced for lack of in-house expertise
16 Advantages/Disadvantages of Use of the Internet
- Advantages
  - Reduced resources necessary for form handling and data capture
  - Better opportunity to enumerate difficult-to-reach and difficult-to-enumerate geographic areas and population groups
  - Automatic filtering of irrelevant questions
  - Better quality data due to in-built interactive verification mechanism
  - Faster availability of census results through simplified data entry and editing
- Disadvantages
  - Requires that respondents have a computer with Internet access
  - Management of responses can be problematic, e.g., ensuring that households have responded once and only once
  - Requires a high-security system to ensure safe transfer of data
  - Need to build a parallel processing system, as not everyone will use the Internet
  - Requires a mechanism to check for omitted and duplicate submissions
  - Is costly and requires a lot of resources to set up and adequately test the system
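A check for omitted and duplicate submissions, as mentioned above, can be sketched in Python as a comparison between the list of households issued an access code and the list of forms received. The household identifiers and the function name `check_submissions` are hypothetical, and a real system would also have to reconcile Internet returns with paper returns.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def check_submissions(expected_ids: Iterable[str],
                      submitted_ids: Iterable[str]
                      ) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (omitted, duplicated, unknown) household identifiers."""
    counts = Counter(submitted_ids)   # number of forms received per household
    expected = set(expected_ids)
    omitted = expected - set(counts)  # issued a code but no form received
    duplicated = {hid for hid, n in counts.items() if n > 1}
    unknown = set(counts) - expected  # form received for an unlisted household
    return omitted, duplicated, unknown


omitted, duplicated, unknown = check_submissions(
    ["H001", "H002", "H003"],
    ["H001", "H003", "H003", "H999"],
)
print("omitted:", sorted(omitted))
print("duplicated:", sorted(duplicated))
print("unknown:", sorted(unknown))
```

Omitted households would then be followed up through the parallel collection method, while duplicated ones need a rule for deciding which submission to keep.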
17 Issues to Consider in Choosing a Method
- Method to use is dependent on national circumstances
- Choice of method should be part of the overall strategic objective of the census in terms of timeliness, accuracy and cost
- Choice of processing system and technology to use needs to be established early in the census cycle
- Enough time is required to test and implement the system
- When imaging technology is used for data capture, extensive testing is required well in advance of the census
- Possibility to outsource when the required expertise is not available in-house
18 Issues to Consider (cont.)
- Extensive testing of the system is also critical when data collection is either by PDA or via the Internet
- Design and paper quality of the census form should be linked to the method of data capture
- When imaging technology is to be used, adequate training of enumerators on how to properly fill in the forms is crucial