Title: Census Data Capture Challenge: Intelligent Document Capture Solution
1. Census Data Capture Challenge: Intelligent Document Capture Solution
Amir Angel, Director of Government Projects
- UNSD Workshop - Minsk Dec 2008
2. The evolution of data capture in census projects
Five steps
From OCR into IDR Solution
eFLOW
3. The evolution of data capture in census projects
- Manual data entry (Key from paper)
- Slow process
- High error rate in the data entry process
- Recruitment, training and management of personnel
- Key from Image
- Archive
- Approx. 20% faster than key from paper
4. The evolution of data capture in census projects
- OMR (Hardware readers for checkbox)
- Requires special scanners and specially printed forms
- Cannot handle handwritten/printed data
- Forms are not user-friendly
- OMR requires more answers > more space > increased paper expenditures > more handling and printing costs
- Not flexible, difficult to adjust to other applications once census is over
- No possibility to add business rules: imputation, validations, coding
5. The evolution of data capture in census projects
- Automated Data Capture
- Requires less human intervention and enables the census data capture to be completed much faster (less space, lower salaries, less hardware)
- Full flexibility in the type of data gathered (checkbox, OMR, handwritten, alpha and numeric, barcode)
- Ensures data integrity: enables the use of automatic AND manual online validations, exception handling, coding
- The most advanced and proven technology for censuses, recommended by the UN and used by all modern countries for census projects
- Creates a correlation between the image and the actual form
- Remote capabilities enable all forms to be scanned locally and then sent to a central site for processing
eFLOW
6. The evolution of data capture in census projects
- Intelligent data capture platform (IDR)
- by using OCR/ICR/OMR/PDA/Web/email
- Automated data capture
- Automatic classification for documents
- Understands and differentiates between various types of documents and languages, based on state-of-the-art machine learning algorithms
- Artificial intelligence algorithms provide enough information for the system to find the location of the fields on its own
eFLOW
7. Traditional Data Capture
Back-Office
Mail Room
Scanning
Data Entry
End Users
Document prep Sorting
Manual Key from image
8. Intelligent Document Capture
Back-Office
Mail Room
Scanning
Data Entry
End Users
Document prep No sorting
Reduce manual data entry by 40-70%
Increase accuracy and consistency
9.
- India 2001
- Turkey 1997
- Brazil 2000
- South Africa 2001
- Ireland 2002
- Italy 2002
- Cyprus 2002
- Turkey 2000
- Kenya 2000
- Slovak Republic 2001
- Hong Kong 2001
- Thailand 2008 (Community)
- Slovenia 2006
- Hong Kong 2006
- South Africa Survey 2007
- Ireland 2006
10. Manual
Automated Data Capture time saving
Saving of 25%
Saving of 50%
(Source: CSO, Central Statistics Office, Ireland)
11. The technology is there
- No need to reinvent the wheel
- Reducing risks by using off-the-shelf technologies
12. Data Types
OCR
ICR
OMR
13. Automatic Recognition
14. Improve Recognition: Voting mechanism
15. Voting: Single Engine vs. Virtual Engines
16. Figure Of Merit Example
- A system recognizes 90% of the characters contained in a batch, but misclassifies 4%
- 90 - (10 × 4) = 50
- The Figure Of Merit in this example is 50
- A system recognizes 80% of the characters contained in a batch, but misclassifies 1%
- 80 - (10 × 1) = 70
- The Figure Of Merit in this example is 70
- The second system is more efficient
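The arithmetic in this example can be checked with a short sketch. It assumes the Figure Of Merit is the recognition rate minus ten times the misclassification rate, both in percent, which is how the two examples on the slide are computed. The function name `figure_of_merit` and the `penalty` parameter are illustrative and not part of eFLOW.

```python
def figure_of_merit(recognized_pct, misclassified_pct, penalty=10):
    """Recognition rate minus a penalty for each percent misclassified.

    The penalty factor of 10 is taken from the slide's arithmetic;
    treating it as a parameter is an assumption made for illustration.
    """
    return recognized_pct - penalty * misclassified_pct


# The two systems from the example (values in percent).
first_system = figure_of_merit(recognized_pct=90, misclassified_pct=4)
second_system = figure_of_merit(recognized_pct=80, misclassified_pct=1)

print("First system: ", first_system)
print("Second system:", second_system)
```

Although the second system recognizes fewer characters, its lower misclassification rate gives it the higher score, which is why it is rated as more efficient.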
17. Benefits of Multiple ICRs
2 8 9 5 6 3 7 4 3 1 6 7 8 5
18. Unique Tiling station: Checking for false positives
- Identify false positives
- Alpha Numeric fields
- Highlight for verifications
- Quality control for ICR
19. Voting Methods Example
- Assume we have a virtual engine (V. engine) that includes 4 engines
- We want to identify the following number: 253478
- The results of each engine are displayed on the right
- The final results of the V. engine will be:
- Safe: 28
- Normal: 2578
- Majority: 253478
- Order: 255378
- Equalizer: ??????
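A minimal sketch of two voting ideas, under stated assumptions. The four engine outputs are hypothetical, because the figure with the actual per-engine results is not reproduced in this transcript. Reading "Safe" as unanimous agreement and "Majority" as agreement by more than half of the engines is an interpretation, not eFLOW's documented definition. Characters the vote cannot settle are shown as `?`.

```python
from collections import Counter

UNKNOWN = "?"  # placeholder for a character the vote could not settle


def vote_unanimous(readings):
    """Keep a character only if every engine returned the same one."""
    return "".join(
        chars[0] if len(set(chars)) == 1 else UNKNOWN
        for chars in zip(*readings)  # iterate position by position
    )


def vote_majority(readings):
    """Keep the character returned by more than half of the engines."""
    result = []
    for chars in zip(*readings):
        char, count = Counter(chars).most_common(1)[0]
        result.append(char if count > len(chars) / 2 else UNKNOWN)
    return "".join(result)


# Hypothetical outputs of four engines reading the number 253478.
readings = ["253478", "255478", "253378", "293478"]

print(vote_unanimous(readings))
print(vote_majority(readings))
```

With these made-up readings, the unanimous vote keeps only the positions where no engine disagrees, while the majority vote recovers the full number. The stricter rule sends more characters to manual completion but is less likely to accept a wrong one.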
20. Processing Example
3
3
8
3
21. Automatic Recognition Time Completion Time
Correction Time THROUGHPUT
22. Fuzzy/Approximate Search
Recognition
Image
Completion
23. Image
Recognition
Completion
24. Other Approaches
- Auto Coding
- Coding tasks and data validations performed on the data capture platform: a cost-effective solution
- Use artificial intelligence statistical software to understand sentences
- Q: What do you do for a living?
- A: I am guiding children
- Result: Teacher 2030
- Use Approximate Search tools for improving results via DB (Exorbyte)
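The approximate-search idea can be illustrated with a small sketch. It uses Python's standard `difflib` module as a stand-in and is not Exorbyte's product or API. Only the pairing of Teacher with code 2030 comes from the slide; the other entries in the code list, the function name `lookup_code`, and the `cutoff` value are hypothetical. The sketch corrects a misrecognized word against a reference list. It does not interpret free-text sentences such as "I am guiding children", which would need the sentence-understanding software mentioned above.

```python
import difflib

# Hypothetical code list; only "teacher" -> "2030" comes from the slide.
OCCUPATION_CODES = {
    "teacher": "2030",
    "farmer": "9001",  # placeholder code for illustration
    "nurse": "9002",   # placeholder code for illustration
}


def lookup_code(recognized, codes=OCCUPATION_CODES, cutoff=0.8):
    """Return the code of the closest entry in the list, or None.

    `recognized` is the text produced by the recognition engines, which
    may contain misread or missing characters.
    """
    matches = difflib.get_close_matches(
        recognized.strip().lower(), list(codes), n=1, cutoff=cutoff
    )
    return codes[matches[0]] if matches else None


print(lookup_code("Teachr"))  # misread word, still close to "teacher"
print(lookup_code("Pilot"))   # no close entry, left for manual coding
```

A stricter `cutoff` sends more answers to manual coding, and a looser one risks assigning a wrong code, so the threshold would need tuning against real census data.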
25. Process integrity, questionnaire integrity - a workflow according to the client's needs
MFlexibilityctivator
Scanning
OCR
Validation
Export
26. Flexibility
27. Flexibility
28. Thank You
Census Data Capture Platform