1
European Conference on Quality in Official
Statistics
  • Different Quality Tests on the Automatic Coding
    Procedure for the Economic Activities
    Descriptions
  • A. Ferrillo, S. Macchia, P. Vicari
  • ferrillo@istat.it, macchia@istat.it,
    vicari@istat.it
  • ISTAT, Italian Institute of Statistics
  • Rome, 8-11 July 2008

2
The automated coding system used in Istat: ACTR
(Automatic Coding by Text Recognition)
  • Developed by Statistics Canada.
  • It is a generalised system, independent of the
    classification and of the language
  • To use it, it must be customised: building the
    coding dictionary (reference file), defining
    synonyms and adapting it to the language.
  • The construction of the coding dictionary is the
    most demanding activity, since its quality
    strongly affects the performance of automatic
    coding.

3
ACTR
  • The coding activity is preceded by a quite
    sophisticated text standardisation phase, called
    parsing, which provides 14 different parsing
    functions (character mapping, deletion of
    trivial words, definition of synonyms, suffix
    removal, etc.) able to remove grammatical or
    syntactical differences, so that any two
    descriptions with the same semantic content
    become identical.
  • The parsed response to be coded is then compared
    with the parsed descriptions of the dictionary.
    If this search finds a perfect match, that is a
    direct match (score = 10), a unique code is
    assigned; otherwise the software runs an
    algorithm to look for the most suitable partial
    matches (indirect matching).
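
To make the idea of parsing concrete, here is a minimal Python sketch of the kind of standardisation described above. The word lists, synonym table and suffix list are invented placeholders, not ACTR's actual parsing tables, which are defined during customisation.

import re

# Hypothetical parsing tables; ACTR's real tables are built during customisation.
CHAR_MAP = str.maketrans("àèéìòù", "aeeiou")            # character mapping
TRIVIAL_WORDS = {"di", "e", "del", "the", "of", "and"}  # trivial-word deletion
SYNONYMS = {"vendita": "commercio", "auto": "autoveicoli"}  # synonym definition
SUFFIXES = ("zione", "mento", "ing")                    # suffix removal

def parse(description):
    """Standardise a free-text description so that semantically
    equivalent answers collapse to the same parsed string."""
    text = description.lower().translate(CHAR_MAP)
    text = re.sub(r"[^a-z0-9 ]", " ", text)             # drop punctuation
    words = []
    for w in text.split():
        if w in TRIVIAL_WORDS:
            continue
        w = SYNONYMS.get(w, w)
        for suffix in SUFFIXES:
            if w.endswith(suffix) and len(w) > len(suffix) + 3:
                w = w[:-len(suffix)]
                break
        words.append(w)
    return " ".join(sorted(words))

# Two differently worded answers with the same semantic content become
# identical after parsing, so a direct match can then be looked for.
print(parse("Vendita di auto"))          # -> "autoveicoli commercio"
print(parse("commercio autoveicoli"))    # -> "autoveicoli commercio"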

4
ACTR
  • As a result, the software returns
  • unique matches, when a unique code is assigned to
    a response phrase
  • multiple matches, when several possible codes are
    proposed
  • failed matches, when no match is found
  • Its performance is measured through two
    indicators
  • Recall rate (coding rate): percentage of codes
    automatically assigned
  • Precision rate: percentage of correct codes among
    those automatically assigned
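
As a rough illustration of how these two indicators could be computed from a batch of coding outcomes, here is a minimal Python sketch; the field names and the sample data are invented for the example and are not ACTR output.

def recall_rate(outcomes):
    """Share of responses that received a unique code automatically."""
    coded = sum(1 for o in outcomes if o["match"] == "unique")
    return coded / len(outcomes)

def precision_rate(outcomes):
    """Among automatically assigned codes, share judged correct
    (for instance by expert coders on a verification sample)."""
    coded = [o for o in outcomes if o["match"] == "unique"]
    correct = sum(1 for o in coded if o["correct"])
    return correct / len(coded)

# Invented outcomes: match type plus an expert verdict on the coded cases.
batch = [
    {"match": "unique", "correct": True},
    {"match": "unique", "correct": False},
    {"match": "multiple", "correct": None},
    {"match": "failed", "correct": None},
]
print(f"recall = {recall_rate(batch):.0%}, precision = {precision_rate(batch):.0%}")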

5
Automated coding applications developed in Istat
  • The most important applications built in Istat,
    already used in different surveys, refer to the
    following variables
  • Occupation
  • Economic Activities
  • Education level
  • Country/Nationality
  • Municipalities.
  • The coding rate obtained for Economic Activities
    varies from 50% for household surveys to 80% for
    business surveys.

6
ATECO 2007: the new economic activities
classification
  • ATECO 2007 is the national version of NACE Rev.
    2, the European economic activities
    classification
  • The new NACE is profoundly different from the
    previous one
  • NACE and its impact on official statistics:
  • the four-digit codes that split into two or more
    new codes are 45
  • the five-digit codes that split are 35

7
ATECO 2007 and ACTR
  • Updating the ACTR application for ATECO 2007 was
    a complex process, made of different steps and
    problems
  • only a part of the old classification at the five
    digit level (around 65%) could be directly
    translated into the new one; the remaining part
    had to be checked description by description,
  • since the classification was very different, some
    descriptions were completely re-examined; in some
    cases it was necessary to split old descriptions
    (for example Repair and installation of pumps),
    because one part now falls under one code
    (Repair, group 33.1) and the other part under a
    different code (Installation, group 33.2),
  • completely new activities were introduced,
  • some old descriptions had to be deleted because
    they were completely obsolete (281 texts).

8
ATECO and ACTR
9
ACTR: aims of the new application
  • ACTR for surveys and the Census
  • ACTR was already used for the 2001 Census and for
    other surveys, so it is already tuned to the
    descriptions given by this type of respondent
  • ACTR for administrative sources
  • These descriptions are different from those of
    statistical surveys because they are quite often
    very long and there are no specifications or
    rules on how to describe the company's activity
  • These texts have been treated in a specific way
    in order a) to obtain descriptions shorter than
    the original ones and b) to delete redundancies
    and useless information (a minimal sketch of this
    kind of cleaning follows the list)
  • ACTR on the web
  • A new tool allowing all users to find their
    economic activity code
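
A minimal sketch of this kind of cleaning, under the assumption that it can be approximated by dropping boilerplate phrases and truncating long texts; the phrase list and the length limit are invented placeholders, not the rules actually used at Istat.

import re

# Hypothetical boilerplate phrases found in administrative texts; the
# actual filtering rules used at Istat are not reproduced here.
REDUNDANT_PHRASES = [
    r"la societa ha per oggetto ",
    r"l'attivita prevalente e ",
    r"in via prevalente ",
]

def shorten(description, max_len=120):
    """Trim an administrative description: drop boilerplate phrases,
    collapse whitespace and keep only the leading part of long texts."""
    text = description.lower()
    for pattern in REDUNDANT_PHRASES:
        text = re.sub(pattern, "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_len]

print(shorten("La societa ha per oggetto la riparazione di pompe e compressori"))
# -> "la riparazione di pompe e compressori"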

10
Quality tests for ACTR 2007
  • In order to measure the quality of the procedure
    used to code non-homogeneous descriptions,
    different quality tests have been planned. They
    differ both in the methodologies they use and in
    the samples they treat.
  • Three tests will be described: for two of them,
    the correctness of the codes assigned by the
    automatic coding application is assessed by
    expert coders, while in the third one the
    assigned codes are compared with codes deriving
    from some special surveys.

11
1) Quality test on descriptions of the Industry
Census
A stratified random sample was extracted from the
1,130,662 descriptions. The methodology adopted in
drawing this sample optimises the analysis of the
results, so that very similar texts are examined
only once (D'Orazio, Macchia 2002). Texts were
stratified according to their frequency of
occurrence; then, within each stratum, a simple
random sample of texts was selected.
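
The following Python sketch illustrates the general idea of this sampling scheme, stratifying distinct texts by frequency of occurrence and drawing a simple random sample within each stratum; the frequency boundaries and sampling fraction are placeholders and do not reproduce the allocation of D'Orazio and Macchia (2002).

import random
from collections import Counter, defaultdict

def stratified_text_sample(descriptions, boundaries=(1, 10, 100), fraction=0.02):
    """Group identical texts, stratify the distinct texts by their
    frequency of occurrence, then draw a simple random sample within
    each stratum.  Boundaries and sampling fraction are placeholders."""
    freq = Counter(descriptions)
    strata = defaultdict(list)
    for text, n in freq.items():
        stratum = sum(n > b for b in boundaries)  # index of the frequency class
        strata[stratum].append(text)
    sample = []
    for texts in strata.values():
        k = max(1, round(len(texts) * fraction))
        sample.extend(random.sample(texts, min(k, len(texts))))
    return sample

# Each distinct text is examined only once, however often it occurs.
data = ["bar", "bar", "bar", "riparazione pompe", "installazione pompe"]
print(stratified_text_sample(data, fraction=0.5))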
12
1) Quality test on descriptions of the Industry
Census
Results in terms of recall rate
The recall rate (78.47%) is absolutely
satisfactory, also when analysed in detail. As a
matter of fact, while the Unique matches are
distributed among all the classes, there are no
Failed matches in classes with more than 180
occurrences. In addition, 71.25% of the Unique
matches have a score equal to 10, which means that
they correspond to direct matches, and more than
53% of them belong to classes of occurrences
greater than 91, which means that the dictionary
enrichment was made consistently with the way
respondents are used to expressing themselves.
13
1) Quality test on descriptions of the Industry
Census
To assess the precision, the coded descriptions of
this sample were submitted to expert coders.
Results in terms of precision rate
As can be seen, the precision rate is higher than
95% and, if analysed per score, 98.09% of the
direct matches are correct, which is surely a
satisfactory result. In addition, it has been
verified that the percentage of correct and
incorrect codes is uniformly distributed among all
the classes of occurrences.
14
2) Quality test on descriptions of the Chambers
of Commerce
  • In order to update the Business Register, Istat
    used different methodologies and sources
  • ACTR was involved in the analysis of the
    descriptions of the Chambers of Commerce. A
    recall rate of 61%, corresponding to 84,117 coded
    descriptions, was obtained.
  • Sector Studies are an administrative source that
    covers more than 70% of the Business Register,
    and the quality of this source is particularly
    high. Sector Studies assign a five digit code
    through a specific methodology not based on text
    analysis.

15
2) Quality test on descriptions of the Chambers
of Commerce
  • For this test
  • The descriptions corresponding to codes at the
    maximum level of detail were extracted from the
    ACTR coded dataset.
  • They were compared with those assigned through
    the Sector Studies (it was assumed that
    coinciding codes could be considered correct,
    since two different methodologies came to the
    same conclusion); a sketch of this comparison
    follows the list.
  • The results showed that 67% of the extracted
    descriptions had equal codes, which can be
    considered a good indicator of quality.
  • The quality analysis regarded the remaining
    descriptions, those with different codes, but,
    due to their huge quantity (17,746 descriptions),
    a sample was extracted.
  • Due to the characteristics of these texts, it was
    not considered suitable to adopt the same
    sampling strategy used for the first quality
    test. So, frequency classes of descriptions with
    non-coinciding codes were defined and then a
    sample was extracted proportionally within each
    class, with a margin of error of 0.014.
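
A minimal sketch of the comparison step, under the assumption that agreement can be summarised as the share of identical codes plus a breakdown of the disagreements by the number of coinciding leading digits; the codes in the example are invented.

from collections import Counter

def compare_codes(pairs):
    """pairs: iterable of (actr_code, sector_studies_code) strings.
    Returns the share of identical codes and a breakdown of the
    non-coinciding pairs by number of coinciding leading digits."""
    equal = 0
    disagreements = Counter()
    for actr, sector in pairs:
        if actr == sector:
            equal += 1
            continue
        matching = 0
        for a, b in zip(actr, sector):
            if a != b:
                break
            matching += 1
        disagreements[matching] += 1
    return equal / len(pairs), disagreements

# Invented five-digit codes, for illustration only.
pairs = [("33110", "33110"), ("33110", "33200"), ("43210", "41200")]
rate, classes = compare_codes(pairs)
print(f"coinciding codes: {rate:.0%}; disagreements by coinciding digits: {dict(classes)}")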

16
2) Quality test on descriptions of the Chambers
of Commerce
Comparison between codes assigned through ACTR and
through Sector Studies: quality control sample
17
2) Quality test on descriptions of the Chambers
of Commerce
  • To assess the precision, the coded descriptions
    of this sample were submitted to expert coders,
    who classified them as
  • (A) correct codes according to ACTR
  • (C) correct codes according to the Chambers of
    Commerce
  • (E) wrong codes according to both methodologies
  • (D) doubtful codes according to both
    methodologies

Precision rate
As can be seen, the precision rate is high, from
80% to 94% in all classes, apart from the class of
codes coinciding only on the first digit (this is
due to the fact that this class is largely
populated with very generic descriptions belonging
to the Construction sector, which has been
strongly revised in the new classification).
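
A small sketch of how the expert verdicts could be tabulated per class of coinciding digits; how the slide derives the precision rate from these counts is not spelled out here, so the example only computes the share of each verdict within a class, and the data are invented.

from collections import Counter, defaultdict

def verdict_shares(cases):
    """cases: iterable of (coinciding_digits, verdict) pairs, where the
    verdict is one of 'A', 'C', 'E', 'D' as defined above.  Returns,
    for each class of coinciding digits, the share of each verdict."""
    per_class = defaultdict(Counter)
    for digits, verdict in cases:
        per_class[digits][verdict] += 1
    shares = {}
    for digits, counts in per_class.items():
        total = sum(counts.values())
        shares[digits] = {v: n / total for v, n in counts.items()}
    return shares

# Invented verdicts, for illustration only.
cases = [(4, "A"), (4, "A"), (4, "C"), (1, "C"), (1, "E"), (1, "A")]
for digits, share in sorted(verdict_shares(cases).items()):
    print(digits, {v: f"{s:.0%}" for v, s in share.items()})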
18
3) Quality test on special surveys
  • When ATECO 2007 was almost finalised, it became
    evident that it was necessary to carry out
    specific surveys in those sectors where
    information was not available or the activities
    included were completely new. In particular, it
    was decided to send a questionnaire to the
    enterprises in the fields of
  • Information and communication (section J)
  • Architectural and engineering activities;
    technical testing and analysis (division 71)
  • Research and experimental development on natural
    sciences and engineering (group 72.1)
  • Specialised design activities (group 74.1)
  • Services to buildings and landscape activities
    (division 81)
  • Other professional, scientific and technical
    activities n.e.c.; office administrative and
    support activities; business support service
    activities n.e.c. (74.9, 82.1, 82.9).

19
3) Quality test on special surveys
  • The survey was sent to around 45 thousand
    enterprises: all the enterprises with more than
    10 employees and a sample of the smallest
    enterprises (1-9 employees).
  • The questionnaires were very simple. At the
    beginning of every questionnaire a description of
    the economic activity, not longer than 200 bytes,
    was required.

20
3) Quality test on special surveys
  • The respondents were around 30%.
  • In order to carry out a quality test on these
    activities, only the questionnaires for which it
    was possible to attribute an ATECO code by
    analysing the answers to the survey were
    considered (around 52%).
  • The coding rate was not so good (44.5%), but it
    was not considered a failure, as the survey
    regarded specific sectors for which it was
    already known that the dictionary had to be
    enriched. On the other hand, the precision rate
    was quite high (88.2%).
  • The main purpose of the survey was to enrich the
    ACTR dictionary in order to improve the quality
    in terms of performance.

21
Conclusions and results
  • The performance of the application is
    satisfactory both in terms of recall rate and of
    precision rate
  • The ACTR on web tool is proving very successful
    (in the first weeks, an average of 9,000 queries)
  • The dictionary continues to be enriched using
    both the descriptions entered in ACTR on web and
    those written in the special survey
    questionnaires