Title: Combining Central Crop Databases for EURISCO
1Combining Central Crop Databases for EURISCO
2Content
- background
- gathering the data
- conversion to EURISCO format
- analysis of consolidated data
- Documentation Quality Index (DQI)
- comparison CCDBs versus National Inventories (NI)
- conclusions
3Background
- part of the EPGRIS project
- combination of all available ECP/GR CCDBs
- create test-data-set for EURISCO
- create data-sets per country for use by focal persons
- comparison / completion National Inventories
4Background
5Gathering the data
- 45 CCDBs listed on ECP/GR site (July 2002)
- 20 downloaded from Internet site
- 15 received following Email correspondence
- 10 not received (under development or no reply)
- 35 data sets were combined
- included the largest crops (small grains)
6Conversion to EURISCO format
- data were converted into EURISCO format
- EURISCO format = MCPD + 5 additional fields
- NICODE (identification of national inventory)
- BREDDESCR (description of breeder)
- DONORDESCR (description of donor)
- DUPLDESCR (description of duplication site)
- ACCEURL (hot link to detailed accession
information)
7Conversion to EURISCO format
- steps conversion
- importing dbf, mdb or txt formatted files into Excel
- match columns to EURISCO descriptors
- check format of all columns
- VBA macros
- transform format when required
- as far as possible VBA macros
8Conversion to EURISCO format
- main conversion issues
- institution codes had to be FAO Institute Codes
- automatic transformation if ECP acronym was used, otherwise transfer to corresponding DESCR column
- in case of INSTCODE (mandatory) more effort to complete
- contributors section of the CCDB sites
- search the FAO-list manually
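The acronym rule above can be sketched in a few lines of Python. This is an illustration only, not the VBA macros used in the project: the lookup table, its placeholder entries and the function name `convert_institute` are assumptions, and the code field name (`DONORCODE`) is used only as an example of a code/DESCR pair.

```python
# Hypothetical lookup table: ECP/GR acronym -> FAO Institute Code.
# The entries are placeholders for illustration, not real codes.
ACRONYM_TO_FAO = {"ACRONYM_A": "XXX001", "ACRONYM_B": "YYY002"}


def convert_institute(value, code_field, descr_field):
    """Fill the code field if the acronym is known, else the DESCR field."""
    text = value.strip()
    key = text.upper()
    if key in ACRONYM_TO_FAO:
        # Known acronym: automatic transformation into the FAO code.
        return {code_field: ACRONYM_TO_FAO[key], descr_field: ""}
    # Unknown value: keep the original text in the description column.
    return {code_field: "", descr_field: text}


print(convert_institute("acronym_a", "DONORCODE", "DONORDESCR"))
print(convert_institute("Some Institute", "DONORCODE", "DONORDESCR"))
```

For a mandatory field such as INSTCODE the fallback branch would not be enough; those values still need the manual lookup described above.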
9Conversion to EURISCO format
- main conversion issues
- requirements taxonomic fields
- no standardization concerning classification system
- authority's name was moved to appropriate fields
10Conversion to EURISCO format
- main conversion issues
- country codes had to be on extended ISO 3166-1 list
- NICODE was copied from INSTCODE
- incorrect ORIGCTY was corrected via COLLSITE, where feasible
- correction of standard errors, examples
- ROM -> ROU (Romania)
- JAP -> JPN (Japan)
- GER -> DEU (Germany)
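A correction table for such standard errors could look like the following Python sketch. The three entries are the examples from this slide; the function name `fix_country_code` is an assumption, and a real table would need many more entries.

```python
# Standard errors listed on the slide: wrong code -> ISO 3166-1 alpha-3 code.
COUNTRY_FIXES = {"ROM": "ROU", "JAP": "JPN", "GER": "DEU"}


def fix_country_code(code):
    """Return the corrected country code, or the cleaned input if no fix is known."""
    cleaned = code.strip().upper()
    return COUNTRY_FIXES.get(cleaned, cleaned)


print(fix_country_code("rom"))
print(fix_country_code("NLD"))
```

Codes that are not in the table pass through unchanged, so they still have to be checked against the extended ISO 3166-1 list.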
11Conversion to EURISCO format
- main conversion issues
- dates were transformed into YYYYMMDD format
- very many formats appeared, even within one database
- examples from one database
- YY
- YYYY
- YYMM
- YYMMDD
- DDMMYY
- DD.MM.YY
- DD.MM.YYYY
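The formats listed above cannot be told apart by length alone (YYYY and YYMM are both four digits, YYMMDD and DDMMYY both six), so a converter needs to be told which format a column uses. The Python sketch below takes the format as an explicit argument. It rests on several assumptions that are not from the slides: the function name `normalize_date`, the default century "19" for two-digit years, and hyphens as filler for a missing month or day. It does not check that month and day values are in range.

```python
def normalize_date(raw, fmt, century="19"):
    """Convert a date string in the given format to YYYYMMDD.

    fmt uses Y, M and D as digit placeholders, e.g. "DD.MM.YY".
    Missing month or day is filled with hyphens (an assumption).
    """
    raw = raw.strip()
    if len(raw) != len(fmt):
        raise ValueError(f"{raw!r} does not match format {fmt!r}")
    parts = {"Y": "", "M": "", "D": ""}
    for ch, placeholder in zip(raw, fmt):
        if placeholder in parts:
            if not ch.isdigit():
                raise ValueError(f"non-digit {ch!r} in {raw!r}")
            parts[placeholder] += ch
        elif ch != placeholder:
            # Separators such as "." must appear exactly where fmt has them.
            raise ValueError(f"{raw!r} does not match format {fmt!r}")
    if len(parts["Y"]) == 4:
        year = parts["Y"]
    elif len(parts["Y"]) == 2:
        year = century + parts["Y"]  # assumed century for two-digit years
    else:
        raise ValueError(f"format {fmt!r} has no usable year")
    return year + (parts["M"] or "--") + (parts["D"] or "--")


print(normalize_date("12.05.1987", "DD.MM.YYYY"))
print(normalize_date("8705", "YYMM"))
print(normalize_date("87", "YY"))
```

Passing the format explicitly keeps the ambiguity visible: when a single column mixes formats, the rows have to be sorted by format before conversion.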
12Conversion to EURISCO format
- main conversion issues
- number- and name-fields were not changed
- for example, if ACCNAME contained "local" (5993 times) this was untouched
- low integrity of accession numbers
13Conversion to EURISCO format
- main conversion issues
- collection site was often compiled from other fields
- example: COLLSITE = PROVENCE + STATE
14Conversion to EURISCO format
- main conversion issues
- longitude and latitude appeared in many formats
- in case of doubt about minutes or seconds they
were replaced by hyphens
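The hyphen rule above can be sketched in Python as follows. The target layout (degrees, minutes, seconds, then a hemisphere letter, with two-digit degrees for latitude and three-digit degrees for longitude) is an assumption about the MCPD coordinate format, and the function name `format_coordinate` is hypothetical.

```python
def format_coordinate(degrees, minutes, seconds, hemisphere, is_longitude=False):
    """Build a DDMMSSH / DDDMMSSH string; unknown minutes or seconds become '--'."""
    width = 3 if is_longitude else 2

    def part(value):
        # None marks a doubtful or missing component.
        return "--" if value is None else f"{value:02d}"

    return f"{degrees:0{width}d}{part(minutes)}{part(seconds)}{hemisphere.upper()}"


print(format_coordinate(52, 5, None, "n"))
print(format_coordinate(5, None, None, "E", is_longitude=True))
```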
15Conversion to EURISCO format
- main conversion issues
- coded information (SAMPSTAT, COLLSRC, STORAGE) not always followed MCPD v2
- MCPD v1 was transformed
- other codes were transformed on basis of additional information
16Analysis of consolidated data
- Documentation Quality Index (DQI)
- tool for analyzing quality/completeness of data sets
- higher index -> more complete data
- each type (SAMPSTAT) of accession can reach same maximum
- based on the occurrence of values in fields, an index is calculated for each record
- quality of the value is not considered
17Analysis of consolidated data
- Documentation Quality Index (DQI)
- examples of calculation
- if SPECIES has a value: 25 points
- if SPAUTHOR has a value: 2 points, provided that SPECIES has a value
- if LONGITUDE has a value:
  - 7 points, provided that SAMPSTAT starts with 1 (wild) or 2 (weedy), and LATITUDE has a value
  - 5 points, provided that SAMPSTAT starts with 3 (landrace), and LATITUDE has a value
  - 2 points if SAMPSTAT has no value
  - otherwise 0 points
- valuation of fields is subjective!
- how important is knowing the SPAUTHOR relative to ALTITUDE?
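The example rules above can be written out as a small Python function. It covers only the three rules shown on this slide (SPECIES, SPAUTHOR, LONGITUDE), not the full index; the function names and the dictionary-based record layout are assumptions made for illustration.

```python
def has_value(record, field):
    """A field counts as filled if it is present and not blank."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def dqi_example_points(record):
    """Points for the three example rules only; the full DQI uses more fields."""
    points = 0
    if has_value(record, "SPECIES"):
        points += 25
        if has_value(record, "SPAUTHOR"):
            points += 2  # only counted when SPECIES is filled
    if has_value(record, "LONGITUDE"):
        sampstat = str(record.get("SAMPSTAT") or "").strip()
        has_latitude = has_value(record, "LATITUDE")
        if sampstat[:1] in ("1", "2") and has_latitude:
            points += 7  # wild or weedy material
        elif sampstat.startswith("3") and has_latitude:
            points += 5  # landrace
        elif sampstat == "":
            points += 2  # sample status unknown
    return points


wild = {"SPECIES": "vulgare", "SPAUTHOR": "L.", "SAMPSTAT": "1",
        "LONGITUDE": "005----E", "LATITUDE": "52----N"}
print(dqi_example_points(wild))
```

Only the presence of a value is tested, never its correctness, which matches the point that the quality of the value is not considered.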
18Analysis of consolidated data
- total number of accessions 507732
- number of GENUS/SPECIES combinations 1360
- 420 of these only have 1 accession
- 941 of these have fewer than 10 accessions
- largest CCDBs barley (129507 acc.) and wheat (108229 acc.)
- smallest CCDB Trisetum (87 acc.)
- CCDB with most genera Umbellifer (47 genera)
- oldest accession a white cabbage from 1726
- highest origin barley from Tibet collected at
4650 m
19Analysis of consolidated data
- distribution over sample status
- nearly half unknown
- weedy occurs only 445 times (0%)
20Analysis of consolidated data
- occurrence of values per descriptor
- 10 of 34 descriptors > 50%, 11 descriptors < 10%
21Analysis of consolidated data
- DQI in CCDBs
- theoretical maximum is 150
- possible for all types of material
- theoretical minimum is 0
- only mandatory fields have a value
- varied from 0 to 129, average 57.0
22Analysis of consolidated data
- DQI per holding country
- varied from 37 to 84 (countries with > 1000 acc.)
23Analysis of consolidated data
- DQI per crop
- varied from 28 to 73 (crops with > 1000 acc.)
24Comparison NI <-> CCDB
- comparison National Inventory <-> CCDB slice
- number of accessions
- average DQI for common accessions
- results
- expected size of EURISCO
- effect of conversion via CCDBs into EURISCO
format versus conversion into NI
25Comparison NI <-> CCDB
- number of accessions
- CCDB contains between 38% and 74% of accessions in current NIs
26Comparison NI <-> CCDB
- DQI of common accessions
- DQI decreases in case of HUN (14000 acc.) and NLD (13000 acc.)
- not in the case of DEU (53000 acc.)
27Conclusions
- many CCDBs have low data quality
- various formats in one column, irrelevant values, etc.
- conversion of data generally reduces quality
- more accessions in NIs than in CCDBs
- expected number of accessions in EURISCO is 1,000,000 ± 300,000
28Acknowledgements
- CCDB & NI curators
- Vanessa Fens & Theo van Hintum (CGN)
- Attila Simon (Institute for Agrobotany,
Tápiószele, Hungary)