Title: Converting Passport Data to the FAO/IPGRI MCPD standard
1Converting Passport Data to the FAO/IPGRI MCPD standard
- An approach using the Perl programming language
- T. Metz
- September 2003
2FAO/IPGRI MCPD: an emerging de facto standard
- EPGRIS input to the MCPD revision
- Wide adoption
- Standards are compromises
- Database standard vs. data exchange standard
- Tools and methodologies for implementing/conforming
- Cookbook approach vs. textbook approach
3Approaches to converting data
- GUI
- Script
4Choices for moving from A to B
- GUI
- Script
5Data flow: data conversion
- Genebank documentation system
- ASCII file with tab-delimited passport data, formatted as per genebank documentation system
- Perl conversion program
- ASCII file with tab-delimited passport data, formatted as per FAO/IPGRI MCPD standard
- National Inventory (NI)
- EURISCO
6The Perl interpreter software
C:\perl>dir
PERL     EXE        16,384  09-23-99   5:34p PERL.EXE
PERL     DLL       741,376  09-23-99   5:34p PERL.DLL
C:\perl>
7Running the conversion program
C:\passport2mcpd>dir
DATA1IN  DAT     3,080,192  06-22-03   6:06p data1in.dat
CONVERT1 PL          1,272  09-08-03  11:40p convert1.pl
C:\passport2mcpd>c:\perl\perl convert1.pl
C:\passport2mcpd>dir
DATA1OUT DAT     3,735,554  09-09-03  12:43a data1out.dat
DATA1IN  DAT     3,080,192  06-22-03   6:06p data1in.dat
CONVERT1 PL          1,272  09-08-03  11:40p convert1.pl
C:\passport2mcpd>
8Input data (data1in.dat)
- 1 Hordeum vulgare ABC001
- 2 ZEA Mays ABC002
- 3 Curcurbis melo ABC003
- 4 Daucus carota ABC004
- 1 Hordeum vulgare ABC001
- 2 ZEA Mays ABC002
- 3 Curcurbis melo ABC003
- 4 Daucus carota ABC004
- 1 Hordeum vulgare ABC001
- 2 ZEA Mays ABC002
9Elements of the conversion program: files for input and output
open(INPUT,  "<data1in.dat");
open(OUTPUT, ">data1out.dat");

close(INPUT);
close(OUTPUT);
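The same open/close pattern can be written with error checking. This is a minimal sketch, assuming a Perl 5 interpreter that supports three-argument `open` and lexical filehandles; the helper name `open_files` and the `or die` checks are illustrative additions, not part of the original convert1.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): open the input file
# for reading and the output file for writing, and stop with a message
# if either file cannot be opened.
sub open_files {
    my ($in_name, $out_name) = @_;
    open(my $in,  '<', $in_name)  or die "Cannot read $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";
    return ($in, $out);
}

# Usage, with the file names from the slide:
#   my ($in, $out) = open_files('data1in.dat', 'data1out.dat');
#   ... read from $in, print to $out ...
#   close($in);
#   close($out) or die "Cannot close output: $!";
```

Checking the result of `open` matters for unattended runs: without it, a missing input file would silently produce an empty output file.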
10Elements of the conversion program: reading input and writing output
while (defined($linein = <INPUT>)) {
    chomp($linein);
    ($row, $genus, $species, $accenumb) = split("\t", $linein);

    $lineout = join("\t", $mcpd_instcode, $mcpd_accenumb,
                    $mcpd_genus, $mcpd_species) . "\n";
    print OUTPUT $lineout;
}
11Elements of the conversion program: converting input to output
$mcpd_accenumb = $accenumb;
$mcpd_genus    = FirstCapital($genus);
$mcpd_species  = lc($species);
$mcpd_instcode = "NLD001";
12Elements of the conversion program: user-defined conversion function
# FirstCapital
#
# This function expects one character string as input.
# It converts the entire string first to lower case
# characters and then the first character to upper
# case.
#
sub FirstCapital {
    my($inword) = @_;
    return ucfirst(lc($inword));
}
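The pieces from the preceding slides (splitting a line, converting the fields, and FirstCapital) can be put together and tried on a few records without any files. The following is a sketch under stated assumptions: `convert_line` is a hypothetical helper that does not appear in the original program, and the two sample records are copied from data1in.dat as shown on the slides.

```perl
use strict;
use warnings;

# Lower-case the whole string, then capitalise its first character
# (same behaviour as the FirstCapital function on the slide).
sub FirstCapital {
    my ($inword) = @_;
    return ucfirst(lc($inword));
}

# Hypothetical helper: turn one tab-delimited input line
# (row, genus, species, accenumb) into one tab-delimited MCPD line
# (instcode, accenumb, genus, species).
sub convert_line {
    my ($linein, $instcode) = @_;
    chomp($linein);
    my ($row, $genus, $species, $accenumb) = split("\t", $linein);
    return join("\t", $instcode, $accenumb,
                FirstCapital($genus), lc($species)) . "\n";
}

# Two sample records taken from data1in.dat on the slides.
my @sample = ("1\tHordeum\tvulgare\tABC001\n", "2\tZEA\tMays\tABC002\n");
print convert_line($_, 'NLD001') for @sample;
```

Keeping the per-line conversion in its own function makes it easy to test a handful of records before running the program on a full export.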
13Elements of the conversion program: other conversion functions
- Splitting a field
- Merging two fields
- Removing blank spaces
- Recoding
- Unit conversion
- Converting geo-reference values (longitude, latitude)
- Converting date values
- Transliteration / character set conversion
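Several of the conversions listed above can be sketched as small Perl functions. All function names here are hypothetical and none come from the original program. The recoding table is invented for illustration, the source date layout is assumed to be DD/MM/YYYY, and the target layouts (DDMMSS plus N/S, DDDMMSS plus E/W, YYYYMMDD) are assumptions that should be checked against the MCPD descriptor list.

```perl
use strict;
use warnings;

# Removing blank spaces: strip leading and trailing whitespace.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Splitting a field: "Hordeum vulgare" -> ("Hordeum", "vulgare").
sub split_name {
    my ($name) = @_;
    return split(' ', trim($name), 2);
}

# Merging two (or more) fields into one, separated by a single space.
sub merge_fields {
    my (@parts) = @_;
    return join(' ', grep { length } map { trim($_) } @parts);
}

# Recoding: map local codes to target codes through a lookup table.
# The codes in this table are invented for illustration only.
my %recode = (W => 100, L => 300);
sub recode {
    my ($code) = @_;
    return exists $recode{$code} ? $recode{$code} : '';
}

# Unit conversion: feet to whole metres.
sub feet_to_metres {
    my ($feet) = @_;
    return sprintf("%.0f", $feet * 0.3048);
}

# Geo-reference: decimal degrees to degrees/minutes/seconds plus
# hemisphere letter. $axis is 'lat' (2-digit degrees) or 'lon' (3-digit).
sub dd_to_mcpd {
    my ($dd, $axis) = @_;
    my $hemi  = $axis eq 'lat' ? ($dd < 0 ? 'S' : 'N') : ($dd < 0 ? 'W' : 'E');
    my $width = $axis eq 'lat' ? 2 : 3;
    my $total = sprintf("%.0f", abs($dd) * 3600);   # whole seconds
    my $deg   = int($total / 3600);
    my $min   = int(($total % 3600) / 60);
    my $sec   = $total % 60;
    return sprintf("%0*d%02d%02d%s", $width, $deg, $min, $sec, $hemi);
}

# Date conversion: DD/MM/YYYY (assumed source layout) to YYYYMMDD.
# Returns an empty string when the input does not match.
sub date_to_mcpd {
    my ($date) = @_;
    return '' unless $date =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$};
    return sprintf("%04d%02d%02d", $3, $2, $1);
}
```

For example, `dd_to_mcpd(52.5, 'lat')` gives `523000N` and `date_to_mcpd('22/06/2003')` gives `20030622`.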
14Elements of the conversion program: advanced functionality
- Direct database query (SQL)
- Result uploading (FTP, HTTP)
- Scheduling (CRON)
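As an illustration of result uploading and scheduling, the sketch below uses the Net::FTP module that ships with Perl. The function name `upload_result` is hypothetical, and the host, user name, password and paths are placeholders to be supplied by the caller; nothing here is taken from the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: upload the converted file by FTP.
# Host, user and password are placeholders supplied by the caller.
sub upload_result {
    my ($host, $user, $password, $file) = @_;
    require Net::FTP;    # loaded only when an upload is requested
    my $ftp = Net::FTP->new($host, Timeout => 30)
        or die "Cannot connect to $host: $@";
    $ftp->login($user, $password) or die "Login failed: ", $ftp->message;
    $ftp->ascii;         # the output is a tab-delimited text file
    $ftp->put($file)     or die "Upload failed: ", $ftp->message;
    $ftp->quit;
    return 1;
}

# Example crontab entry (Unix) that runs the conversion every night
# at 02:00; the path is a placeholder:
#   0 2 * * * perl /path/to/convert1.pl
```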
15Input data (data1in.dat)
- 1 Hordeum vulgare ABC001
- 2 ZEA Mays ABC002
- 3 Curcurbis melo ABC003
- 4 Daucus carota ABC004
- 1 Hordeum vulgare ABC001
- 2 ZEA Mays ABC002
- 3 Curcurbis melo ABC003
- 4 Daucus carota ABC004
- 1 Hordeum vulgare ABC001
- 2 ZEA Mays ABC002
16Output data (data1out.dat)
- instcode accenumb genus species
- NLD001 ABC001 Hordeum vulgare
- NLD001 ABC002 Zea mays
- NLD001 ABC003 Curcurbis melo
- NLD001 ABC004 Daucus carota
- NLD001 ABC001 Hordeum vulgare
- NLD001 ABC002 Zea mays
- NLD001 ABC003 Curcurbis melo
- NLD001 ABC004 Daucus carota
- NLD001 ABC001 Hordeum vulgare
- NLD001 ABC002 Zea mays
17GUI vs. programming approach
- Programming requires a higher level of IT skills than GUI
- GUI is convenient for small datasets and one-off conversions
- Programming is suited to large datasets and repeated conversions
- Remote diagnosis and support are easier with programming
- Solutions to problems are more easily transferred with programming
18GUI vs. programming approach
- Programming approach has high reliability and repeatability
- Programming has a higher initial investment; GUI has a higher repeat cost
- Programming is more resilient to staff changes and skills erosion
- No data size limitation for the programming approach
19Why Perl?
- Free open source software
- Available on almost any combination of hardware and software
- Programs are portable
- Developed for text manipulation
- Perl is easy to start with, especially for programming beginners
- Huge user and developer community: long-term availability
20Conclusions
- NIs and EURISCO should be complete and up-to-date (not snapshots)
- Repeated tasks, e.g. data transformation, should be automated
- A programming approach can help to reduce manual transformation work
- Beware of the 80/20 rule
- Perl is a suitable solution for the programming approach
21Now what?
- Training manual (cookbook) under development
- Early adopters?
- Remote assistance and support