Title: European Conference on Quality in Official Statistics
1. European Conference on Quality in Official Statistics
- Different Quality Tests on the Automatic Coding Procedure for the Economic Activities Descriptions
- A. Ferrillo, S. Macchia, P. Vicari
- ferrillo_at_istat.it, macchia_at_istat.it, vicari_at_istat.it
- ISTAT, Italian Institute of Statistics
- Rome, 8-11 July 2008
2. The automated coding system used in Istat: ACTR (Automatic Coding by Text Recognition)
- Developed by Statistics Canada.
- It is a generalised system, independent of both the classification and the language.
- Before it can be used, it must be customised: the dictionary (reference file) has to be built, synonyms defined, and the parsing adapted to the language.
- Building the coding dictionary is the most demanding activity, since its quality strongly affects the performance of automatic coding.
3. ACTR
- Coding is preceded by a fairly sophisticated text standardisation phase, called parsing. It provides 14 different parsing functions (character mapping, deletion of trivial words, definition of synonyms, suffix removal, etc.) that remove grammatical or syntactical differences, so that any two different descriptions with the same semantic content become identical.
- The parsed response to be coded is then compared with the parsed descriptions in the dictionary. If this search finds a perfect match, that is a direct match (score 10), a unique code is assigned; otherwise the software runs an algorithm that looks for the most suitable partial matches (indirect matching).
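The parsing and direct matching steps can be illustrated with a minimal Python sketch. It is not ACTR's implementation: the word lists, the suffix rule and the function names (`parse`, `build_dictionary`, `direct_match`) are assumptions made for illustration, and only three of the parsing functions are imitated. The two dictionary entries reuse the pump example and group codes mentioned later in this presentation.

```python
import re

# Hypothetical, tiny parsing resources for illustration only;
# ACTR's real parsing files are far richer and language-specific.
TRIVIAL_WORDS = {"of", "and", "the", "for"}
SYNONYMS = {"fixing": "repair"}
SUFFIXES = ("ing", "s")  # crude suffix removal


def parse(description):
    """Standardise a description into a sorted tuple of distinct words."""
    words = re.findall(r"[a-z]+", description.lower())  # lower-case, letters only
    parsed = set()
    for word in words:
        if word in TRIVIAL_WORDS:            # deletion of trivial words
            continue
        word = SYNONYMS.get(word, word)      # synonym replacement
        for suffix in SUFFIXES:              # suffix removal
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        parsed.add(word)
    return tuple(sorted(parsed))


def build_dictionary(entries):
    """Index the reference file by the parsed form of each description."""
    return {parse(text): code for text, code in entries.items()}


def direct_match(response, dictionary):
    """Return the code of a perfect match on parsed text, or None."""
    return dictionary.get(parse(response))


dictionary = build_dictionary({
    "Repair of pumps": "33.1",
    "Installation of pumps": "33.2",
})
print(direct_match("Fixing the pump", dictionary))  # 33.1
print(direct_match("Bakery", dictionary))           # None
```

A response that returns None here is where the real system would go on to search for partial matches (indirect matching), which this sketch does not attempt.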
4. ACTR
- As a result, the software returns:
  - unique matches, when a unique code is assigned to a response phrase
  - multiple matches, when several possible codes are proposed
  - failed matches, when no matches are found
- Its performance is measured through two indicators:
  - Recall rate (coding rate): percentage of descriptions to which a code is automatically assigned
  - Precision rate: percentage of automatically assigned codes that are correct
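The two indicators can be computed as in the following minimal sketch. The function name `coding_indicators` and the sample data are assumptions for illustration; `None` stands for a description that received no unique code, and the reference codes play the role of the expert coders' judgement.

```python
def coding_indicators(assigned, reference):
    """Return (recall rate, precision rate) as percentages.

    assigned  -- code given automatically to each description, or None
    reference -- code considered correct for each description
    """
    if not assigned or len(assigned) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    coded = [(a, r) for a, r in zip(assigned, reference) if a is not None]
    recall = 100.0 * len(coded) / len(assigned)
    correct = sum(a == r for a, r in coded)
    # Precision is measured on the coded descriptions only.
    precision = 100.0 * correct / len(coded) if coded else 0.0
    return recall, precision


# Illustrative data, not taken from the tests described in this presentation.
assigned = ["33.1", None, "33.2", "33.1"]
reference = ["33.1", "33.2", "33.2", "33.2"]
recall, precision = coding_indicators(assigned, reference)
print(f"recall {recall:.1f}%, precision {precision:.1f}%")
```

Three of the four descriptions are coded and two of those three codes are correct, so the example gives a recall rate of 75% and a precision rate of about 66.7%.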
5. Automated coding applications developed in Istat
- The most important applications built in Istat, already used in different surveys, concern the following variables:
  - Occupation
  - Economic Activities
  - Education level
  - Country/Nationality
  - Municipalities
- The coding rate obtained for Economic Activities ranges from 50% for household surveys to 80% for business surveys.
6. ATECO 2007: the new economic activities classification
- ATECO 2007 is the national version of NACE Rev. 2, the European economic activities classification.
- The new NACE is deeply different from the previous one.
- NACE and its impact on official statistics:
  - the four-digit codes that split into two or more new codes are 45
  - the five-digit codes that split are 35
7. ATECO 2007 and ACTR
- Updating the ACTR application for ATECO 2007 was a complex process, made up of different steps and problems:
  - only part of the old classification at five-digit level (around 65%) translated directly into the new one; the rest had to be checked description by description,
  - since the classification was very different, some descriptions were completely re-examined; in some cases it was necessary to split old descriptions (for example "Repair and installation of pumps"), because one part now falls under one code (Repair, group 33.1) and the other part under a different code (Installation, group 33.2),
  - completely new activities were introduced,
  - some old descriptions had to be deleted because they were completely obsolete (281 texts).
8. ATECO and ACTR
9. ACTR: aims of the new application
- ACTR for surveys and Census
  - ACTR was already used for the 2001 Census and other surveys, so it is already tuned to the descriptions given by this type of respondent.
- ACTR for administrative sources
  - These descriptions differ from those of statistical surveys: they are quite often very long, and there are no specifications or rules on how to describe the company's activity.
  - These texts have been treated in a specific way in order a) to obtain descriptions shorter than the original ones, b) to delete redundancies and useless information.
- ACTR on web
  - A new tool that lets all users find their economic activity code.
10. Quality tests for ACTR 2007
- In order to measure the quality of the procedure when it is used to code non-homogeneous descriptions, different quality tests were planned. They differ both in the methodologies they use and in the samples they treat.
- Three tests will be described: for two of them, the correctness of the codes assigned by the automatic coding application is established through the analysis of expert coders, while in the third one the assigned codes are compared with codes deriving from some special surveys.
11. 1) Quality test on descriptions of the Industry Census
A stratified random sample was drawn from the 1,130,662 descriptions. The methodology adopted in drawing this sample optimises the analysis of results, so that very similar texts are examined only once (D'Orazio, Macchia 2002). Texts were stratified according to their frequency of occurrence; then, within each stratum, a simple random sample of texts was selected.
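The sampling design described above can be sketched as follows. This is only an illustration under stated assumptions: the class boundaries, the sampling fraction and the function name `stratified_sample` are hypothetical, and identical texts stand in for the "very similar texts" of the original methodology, whose details are in the cited reference.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(descriptions, class_bounds, sampling_fraction, seed=0):
    """Stratify distinct texts by frequency, then sample within each stratum.

    class_bounds      -- increasing upper bounds of the frequency classes;
                         the last one should be float("inf")
    sampling_fraction -- share of distinct texts drawn from every stratum
    """
    counts = Counter(descriptions)          # each distinct text counted once
    strata = defaultdict(list)
    for text, freq in counts.items():
        stratum = next(i for i, upper in enumerate(class_bounds) if freq <= upper)
        strata[stratum].append(text)

    rng = random.Random(seed)               # fixed seed for reproducibility
    sample = {}
    for stratum, texts in sorted(strata.items()):
        texts.sort()
        size = max(1, round(sampling_fraction * len(texts)))
        sample[stratum] = rng.sample(texts, size)   # simple random sample
    return sample


# Illustrative data: three frequency classes (1, 2-3, more than 3 occurrences).
descriptions = ["bar"] * 5 + ["bakery"] * 2 + ["pump repair", "plumber"]
sample = stratified_sample(descriptions, [1, 3, float("inf")], 0.5)
print(sample)
```

Because the sample is drawn from distinct texts, a description that occurs many times is reviewed by the coders only once.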
12. 1) Quality test on descriptions of the Industry Census
Results in terms of recall rate
The recall rate (78.47%) is fully satisfactory, also when analysed in detail. While the Unique matches are distributed among all the classes, there are no Failed matches in classes over 180 occurrences. In addition, 71.25% of Unique matches have a score equal to 10, which means that they correspond to direct matches, and more than 53% of them belong to classes of occurrences greater than 91, which means that the dictionary enrichment was consistent with the way respondents usually express themselves.
13. 1) Quality test on descriptions of the Industry Census
To establish the precision, the coded descriptions of this sample were submitted to expert coders.
Results in terms of precision rate
As can be seen, the precision rate is higher than 95% and, if analysed per score, 98.09% of direct matches are correct, which is surely a satisfactory result. In addition, it was verified that the percentages of correct and incorrect codes are uniformly distributed among all the classes of occurrences.
14. 2) Quality test on descriptions of the Chambers of Commerce
- In order to update the Business Register, Istat used different methodologies and sources.
- ACTR was involved in the analysis of the descriptions of the Chambers of Commerce. A recall rate of 61%, corresponding to 84,117 coded descriptions, was obtained.
- Sector Studies are an administrative source that covers more than 70% of the Business Register; the quality of this source is particularly high. Sector Studies assign a five-digit code through a specific methodology not based on text analysis.
15. 2) Quality test on descriptions of the Chambers of Commerce
- For this test:
  - The descriptions corresponding to codes at the maximum level of detail were extracted from the ACTR coded dataset.
  - Their codes were compared with those assigned through the Sector Studies (coinciding codes were assumed to be correct, as two different methodologies came to the same conclusion).
- The results showed that 67% of the extracted descriptions had equal codes, which can be considered a good indicator of quality.
- The quality analysis concerned the remaining descriptions, those with different codes; given their large number (17,746 descriptions), a sample was extracted.
- Due to the characteristics of these texts, it was not considered suitable to adopt the same sampling strategy used for the first quality test. Instead, frequency classes of descriptions with non-coinciding codes were defined, and a sample was then extracted proportionally within each class, with a margin of error of 0.014.
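Extracting a sample "proportionally within each class" can be sketched as a proportional allocation of the overall sample size. The class sizes, the sample size and the function name `proportional_allocation` below are hypothetical; the presentation reports neither the frequency classes used nor how the margin of error of 0.014 was turned into a sample size.

```python
def proportional_allocation(class_sizes, sample_size):
    """Split sample_size across classes in proportion to their sizes.

    Uses the largest-remainder rule so that the allocations are integers
    and add up exactly to sample_size.
    """
    total = sum(class_sizes.values())
    if total <= 0 or not 0 <= sample_size <= total:
        raise ValueError("sample_size must lie between 0 and the population size")
    exact = {name: sample_size * size / total for name, size in class_sizes.items()}
    allocation = {name: int(quota) for name, quota in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Hand the remaining units to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Illustrative frequency classes of descriptions with non-coinciding codes.
class_sizes = {"1 occurrence": 600, "2-5 occurrences": 300, "6+ occurrences": 100}
allocation = proportional_allocation(class_sizes, 50)
print(allocation)
```

With these made-up sizes the 50 sampled descriptions are split 30, 15 and 5 across the three classes, mirroring their 60%, 30% and 10% shares of the population.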
16. 2) Quality test on descriptions of the Chambers of Commerce
Comparison between codes assigned through ACTR and through Sector Studies: quality control sample
17. 2) Quality test on descriptions of the Chambers of Commerce
- To establish the precision, the coded descriptions of this sample were submitted to expert coders, who classified them as:
  - (A) correct codes according to ACTR
  - (C) correct codes according to Chambers of Commerce
  - (E) wrong codes according to both methodologies
  - (D) doubtful codes according to both methodologies
Precision rate
As can be seen, the precision rate is high, from 80% to 94% in all classes, apart from the class of codes coinciding only in the first digit. This is due to the fact that this class is widely populated with very generic descriptions from the Construction sector, which has been strongly revised in the new classification.
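The classes used in this comparison appear to be defined by how many leading digits of the two codes coincide. A minimal sketch of that grouping follows, assuming five-digit codes written with dots; the function name `coinciding_digits` and the code pairs are illustrative and are not taken from the quality control sample.

```python
from collections import Counter


def coinciding_digits(code_a, code_b):
    """Count the leading digits shared by two codes, ignoring the dots."""
    digits_a = [ch for ch in code_a if ch.isdigit()]
    digits_b = [ch for ch in code_b if ch.isdigit()]
    shared = 0
    for a, b in zip(digits_a, digits_b):
        if a != b:
            break
        shared += 1
    return shared


# Hypothetical pairs: (code assigned by ACTR, code from the other source).
pairs = [
    ("33.12.1", "33.20.0"),   # same division, different group
    ("43.21.0", "41.20.0"),   # only the first digit coincides
    ("62.01.0", "62.01.0"),   # identical codes
]
classes = Counter(coinciding_digits(a, b) for a, b in pairs)
print(classes)
```

Each pair falls into the class given by its number of shared leading digits (2, 1 and 5 here), and precision can then be reported separately for each class.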
18. 3) Quality test on special surveys
- When ATECO 2007 was almost finalised, it became evident that specific surveys had to be carried out in those sectors where information was not available or where the activities included were completely new. In particular, it was decided to send a questionnaire to the enterprises in the fields of:
  - Information and Communication (section J)
  - Architectural and engineering activities; technical testing and analysis (division 71)
  - Research and experimental development on natural sciences and engineering (group 72.1)
  - Specialised design activities (group 74.1)
  - Services to buildings and landscape activities (division 81)
  - Other professional, scientific and technical activities n.e.c.; Office administrative and support activities; Business support services activities n.e.c. (74.9, 82.1, 82.9).
19. 3) Quality test on special surveys
- The surveys were sent to around 45 thousand enterprises: all the enterprises with more than 10 employees and a sample of the smallest enterprises (1-9 employees).
- The questionnaires were very simple. At the beginning of every questionnaire, a description of the economic activity no longer than 200 bytes was required.
20. 3) Quality test on special surveys
- Around 30% of the enterprises responded.
- For the quality test on these activities, only the questionnaires to which an ATECO code could be attributed by analysing the answers to the survey were considered (around 52%).
- The coding rate was not so good (44.5%), but it was not considered a failure, as the survey concerned specific sectors for which it was already known that the dictionary had to be enriched. On the other hand, the precision rate was quite high (88.2%).
- The main purpose of the survey was to enrich the ACTR dictionary in order to improve its quality in terms of performance.
21. Conclusions and results
- The performance of the application is satisfactory in terms of both recall rate and precision rate.
- The ACTR on WEB tool is proving very successful (in the first weeks, an average of 9,000 queries).
- The dictionary continues to be enriched using both the descriptions submitted through ACTR on WEB and those written in the special survey questionnaires.