Title: Yvan Rose
1. The Phon and PhonBank initiatives: A scheme for sharing and archiving child language phonology data
- Yvan Rose
- Memorial University of Newfoundland
- Laboratoire Dynamique Du Langage, UMR 5596 CNRS
- Université Lumière Lyon 2
- Brian MacWhinney
- Carnegie Mellon University
2. Presentation roadmap
- Why study child language?
- Empirical challenges
- A promising solution
- Phon, a software program for the study of phonological development
- The PhonBank database project
- Potential
3. Why study language development?
- Special gift, universals
- How is language acquired so easily by children?
- Typology, variation and their origins
- Emergentism: processes that show up in the course of language development
- Sometimes, no correlates in language typology
- Language disorders
- Developmental speech problems
- Atypical outcomes (e.g. stammered speech)
- Acquired language disorders
- Socialization, literacy, maintenance, ...
4. What everyone needs
- Most current hypotheses must be tested against a large body of data.
- Lots of data in comparable format, from, e.g.:
- Different languages
- Different acquisition contexts
- Typical versus disordered speech
- Monolingual versus multilingual populations
- Different age periods
- Different methods of investigation
- Cross-sectional (e.g. variation studies)
- Longitudinal (e.g. developmental studies)
5. CHILDES: a good departure point
- Child Language Data Exchange System
- http://childes.psy.cmu.edu
- Founded in 1984 in Concord, MA
- Director: Brian MacWhinney (macw_at_mac.com)
- Programmers: Leonid Spektor, Franklin Chen
- Components of CHILDES
- DATABASE: over 190 (+ 80 TalkBank) corpora
- CHAT: system for speech notation and coding
- CLAN: software suite for analysis
- Impact
- 4500 members worldwide
- 2000 articles based on CHILDES data
6. However...
- Current CHILDES support for phonology is almost non-existent
- No decent database of phonological development
- Database of phonological data is far from satisfactory
- No automated method of data compilation
- Elaboration of such a database is extremely difficult
- No tools facilitating corpus creation
- No data exchange standard established
7. Proposed solutions
- CHILDES extension into the phonological realm
- Phon (Rose and colleagues): software for transcription, compilation and analysis of phonological data
- Specialized for research in acquisition
- Standardized for data sharing
- PhonBank (MacWhinney and Rose): publicly available database on phonological development
- Multilingual
- Several acquisition contexts and periods
8. The Phon project
- Prerequisite to the building of PhonBank
- Multi-disciplinary team at Memorial University of Newfoundland
- Close collaboration with Brian's team at Carnegie Mellon University
- Design and implementation criteria
- Reliability
- Simplicity / flexibility / adaptability
- Analytical neutrality
- Compatibility
- Availability
9. Phon: an overview
- Intuitive graphical user interface
- Dynamic interaction between software and user
- Flexible project customization functions
- Support for multiple alphabets (IPA, Roman, Arabic, Chinese, Japanese, Cyrillic, Algonquin)
- User-defined data fields
- Functionality for phonological data
- Descriptive features for transcribed segments
- Support for both segmental and prosodic info
- Easily expandable modular architecture
- FREE!!!!!
10. Phon: technical aspects
- Cross-platform compatible (Macintosh, Windows, Unix/Linux platforms)
- Programmed in Java
- 100% Unicode compliant
- Support for playback of audio and video recorded data in several formats
- Data storage in XML format
- Compatible with the CHILDES TalkBank schema
- Data can be accessed by other applications
- IPA transcription standards supported
- Open-source
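To make the "data storage in XML" and "data can be accessed by other applications" points concrete, here is a minimal sketch in Python. The element names (`session`, `record`, `orthography`, `ipaTarget`, `ipaActual`) and the sample transcriptions are hypothetical placeholders for illustration; they are not the actual Phon or TalkBank schema.

```python
import xml.etree.ElementTree as ET


def build_session(records):
    """Build a hypothetical <session> element from (orthography, target, actual) tuples."""
    session = ET.Element("session")
    for i, (orth, target, actual) in enumerate(records, start=1):
        rec = ET.SubElement(session, "record", id=str(i))
        ET.SubElement(rec, "orthography").text = orth
        ET.SubElement(rec, "ipaTarget").text = target
        ET.SubElement(rec, "ipaActual").text = actual
    return session


def to_xml_bytes(session):
    """Serialize to UTF-8 so that IPA characters survive the round trip."""
    return ET.tostring(session, encoding="utf-8")


def read_actual_forms(xml_bytes):
    """What another application might do: read the produced forms back out."""
    root = ET.fromstring(xml_bytes)
    return [rec.findtext("ipaActual") for rec in root.iter("record")]


xml_bytes = to_xml_bytes(build_session([("banana", "bənænə", "bænə")]))
print(read_actual_forms(xml_bytes))
```

Because the file is plain Unicode XML, any tool with an XML parser can read the transcriptions without going through the application that wrote them.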
11. Phon: general interface
- Project navigator
- Records for linguistic data and other annotations
- Multimedia centre
- Session metadata
- Navigation between records
12. User management
- Password-protected access
- Task management
13. Multimedia data segmentation
- Enables the delimitation of the recorded portions that are relevant for research
- Functionality to edit the segments identified
- Segment playback
- Segment export (AIFF/WAVE formats)
14. Phonetic transcription
J'aime les ours en peluche bruns ("I like brown teddy bears")
??m le zu?s ? p?l?? b?
? nus py? ba
- Support for:
- Multiple-blind transcriptions (using user ID and task management functionality)
- Phonetic dictionaries of target forms (e.g. CMU Pronouncing Dictionary)
15. Merging of multiple-blind transcriptions
- Selection of most accurate transcriptions
- Access and comparison of all transcriptions of target and actual forms
- Refinement of selected transcriptions (if needed)
16. Segmentation of transcribed utterances
- Multiple-word utterances often must be divided into smaller portions
- Access to precise domains of analysis
- Enables analysis at several levels (e.g. utterance, phrase, word, morpheme, ...)
- Example of division into lexical items
17. Syllabification
- Phonological research must consider prosodic factors, including:
- Number of segments in syllables
- Shapes of syllables (e.g. CV, CVC, CCV)
- Positions of segments within syllables
- Positions of syllables within the word
- Stressed versus unstressed status of syllables
- Manual coding is tedious and time-consuming
- We need a reliable, automatic system
18. Syllabification algorithm (Hedlund & O'Brien 2004)
- Automatic parsing of segmental strings into syllables
- Several parses possible based on parameters modifiable by the user (no theoretical bias imposed)
- Possibility to test different hypotheses for target and actual forms
- Labelling of syllables and their segments for:
- Word-level prosodic information
- Syllable constituency
- Manual modification of spurious results
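The idea of a parser whose output depends on user-modifiable parameters can be illustrated with a small sketch. This is not the Hedlund & O'Brien algorithm, whose details are not given here; it is a generic maximal-onset syllabifier in Python, where the vowel set and the set of legal onsets are hypothetical parameters that the user can change to obtain different parses.

```python
# Hypothetical, user-modifiable parameters (not taken from Phon).
VOWELS = {"a", "e", "i", "o", "u"}
LEGAL_ONSETS = {("b",), ("n",), ("p",), ("t",), ("k",), ("s",), ("r",),
                ("p", "r"), ("t", "r"), ("k", "r")}


def syllabify(segments, vowels=VOWELS, legal_onsets=LEGAL_ONSETS):
    """Parse a list of segments into onset/nucleus/coda constituents.

    Every vowel is treated as its own nucleus (no diphthongs). Between two
    nuclei, the longest legal onset goes to the following syllable and the
    remaining consonants become the coda of the preceding one.
    """
    nuclei = [i for i, seg in enumerate(segments) if seg in vowels]
    syllables = []
    onset_start = 0  # index where the current syllable's onset begins
    for n, nuc in enumerate(nuclei):
        onset = segments[onset_start:nuc]
        if n + 1 < len(nuclei):
            cluster = segments[nuc + 1:nuclei[n + 1]]
            split = len(cluster)  # fallback: every consonant is a coda
            for k in range(len(cluster) + 1):
                if k == len(cluster) or tuple(cluster[k:]) in legal_onsets:
                    split = k
                    break
            coda = cluster[:split]
            onset_start = nuc + 1 + split
        else:
            coda = segments[nuc + 1:]  # word-final consonants
        syllables.append({"onset": list(onset),
                          "nucleus": [segments[nuc]],
                          "coda": list(coda)})
    return syllables


def shapes(syllables):
    """Summarize each syllable as a CV shape such as 'CVC'."""
    return ["C" * len(s["onset"]) + "V" + "C" * len(s["coda"]) for s in syllables]


print(shapes(syllabify(list("banana"))))   # ['CV', 'CV', 'CV']
print(shapes(syllabify(list("pastra"))))   # ['CVC', 'CCV']
# Allowing "str" as an onset yields a different parse of the same string.
more_onsets = LEGAL_ONSETS | {("s", "t", "r")}
print(shapes(syllabify(list("pastra"), legal_onsets=more_onsets)))  # ['CV', 'CCCV']
```

Changing the onset inventory is enough to test competing hypotheses on the same transcription, which is the property the slide emphasizes.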
19. Syllabification and modification interface
- Syllable constituent labels and colour codes
- Labelling modifiable through contextual menus
20. Alignment of target and actual forms
- Several investigations require systematic comparisons
- Segment by segment (e.g. /bənænə/ b _ _ ænə)
- Syllable by syllable (e.g. /e?p??k?t/ e _ _ _ _ ko _ )
- Comparisons not always easy to obtain
[Figure: a wrong alignment and a valid alignment of the actual syllables e and ko against the target]
21. Alignment algorithm (Maddocks 2005)
- Segments and syllables aligned based on their featural similarity
- Dynamic programming: complex problems solved through resolution of their simpler sub-parts, e.g. e?p??k?t/eko: (e?/e) (e?p??/eko) (e?p??k?t/eko)
- Rewards and penalties
- Reward example: alignment of stressed syllables
- Penalty example: alignment with nothing (empty featural set)
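The dynamic-programming idea can be sketched as a standard global alignment in which matching two segments is scored by featural similarity and aligning a segment with nothing costs a penalty. This is an illustrative Python sketch, not the Maddocks (2005) implementation: the feature table, the similarity formula and the gap penalty value are all assumptions, and the stress-based reward is left out for brevity.

```python
# Hypothetical feature sets, for illustration only.
FEATURES = {
    "p": {"consonant", "labial", "stop"},
    "b": {"consonant", "labial", "stop", "voiced"},
    "t": {"consonant", "coronal", "stop"},
    "k": {"consonant", "dorsal", "stop"},
    "n": {"consonant", "coronal", "nasal", "voiced"},
    "a": {"vowel", "low"},
    "e": {"vowel", "mid", "front"},
    "o": {"vowel", "mid", "back"},
}
GAP = -1  # assumed penalty for aligning a segment with nothing


def similarity(a, b):
    """Shared features minus unshared ones (an assumed scoring rule)."""
    fa, fb = FEATURES[a], FEATURES[b]
    return len(fa & fb) - len(fa ^ fb)


def align(target, actual, gap=GAP):
    """Return (best score, aligned pairs); None marks a segment aligned with nothing."""
    n, m = len(target), len(actual)
    # score[i][j] is the best score for aligning target[:i] with actual[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(target[i - 1], actual[j - 1]),
                score[i - 1][j] + gap,   # target segment deleted
                score[i][j - 1] + gap,   # actual segment inserted
            )
    # Trace back from the full problem to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + similarity(target[i - 1], actual[j - 1])):
            pairs.append((target[i - 1], actual[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((target[i - 1], None))
            i -= 1
        else:
            pairs.append((None, actual[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


best, pairs = align("pato", "to")
print(best)   # 4
print(pairs)  # [('p', None), ('a', None), ('t', 't'), ('o', 'o')]
```

Each table cell reuses the solutions to shorter prefixes, which is the "simpler sub-parts" principle on the slide. Rewards such as a bonus for pairing stressed syllables would enter as extra terms in the match score.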
22. Effects of rewards and penalties
23. Algorithm optimization
- Problem: syllable alignment in different corpora requires different parameter settings, which are difficult to adjust manually
- Solution: genetic algorithm (GA)
- 1- generate alignments from a representative corpus
- 2- revise results manually
- 3- GA automatically optimizes parameters based on manually revised corpus
- 96% → 98% efficiency on English corpus
- 85% → 96% efficiency on Dutch corpus (initially generated with English settings)
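Step 3 above can be sketched as a small genetic algorithm over a vector of alignment parameters. Everything specific here is an assumption: the parameter names (a gap penalty and a stress reward), the population settings, and above all the fitness function, which is a toy stand-in. In the real setting, fitness would be the proportion of automatically generated alignments that match the manually revised corpus.

```python
import random


def evolve(fitness, n_params, pop_size=20, generations=30, mutation_sd=0.3, seed=0):
    """Maximize `fitness` over real-valued parameter vectors with a simple GA."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.uniform(-5, 5) for _ in range(n_params)]
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]   # selection: keep the fitter half
        children = [ranked[0][:]]           # elitism: best individual survives
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]    # crossover
            child = [g + rng.gauss(0, mutation_sd) for g in child]  # mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history


# Toy stand-in for "agreement with the manually revised corpus":
# hypothetical ideal values for (gap penalty, stress reward).
IDEAL = [-1.0, 2.0]


def toy_fitness(params):
    return -sum((p - q) ** 2 for p, q in zip(params, IDEAL))


best, history = evolve(toy_fitness, n_params=2)
print("fitness, first generation:", history[0])
print("fitness, final parameters:", toy_fitness(best))
```

Because the best individual is always copied unchanged into the next generation, the best fitness never decreases from one generation to the next.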
24. Modification of spurious alignments
- Alignment algorithm provides reliable results for preliminary analyses
- Remaining cases must be aligned manually
- Select syllable
- Add to alignment
- Complete alignment
25. Query language
- PhonBasic (Hedlund & O'Brien 2004)
- Characteristics
- Selectors and predicates: terms commonly used by linguists
- Syllable, stressed, voiced, labial, ...
- Boolean connectives
- Custom predicates
- Prevocalic: LApp, Onset
- Postvocalic: Coda, RApp, OEHS
- Sample of queries pre-installed
- Memorization of recent queries
- Saving and sharing of queries
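The combination of linguist-friendly predicates, Boolean connectives and custom predicates can be illustrated as follows. This sketch does not reproduce PhonBasic syntax, which is not shown here; it only models the same ideas in Python. The `Segment` class, the helper names and the feature values are hypothetical, while the constituent labels (LApp, Onset, Coda, RApp, OEHS) are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    symbol: str
    features: frozenset
    position: str          # syllable constituent label, e.g. "Onset"
    stressed: bool = False


# Basic predicates: each returns a function from Segment to bool.
def has(feature):
    return lambda seg: feature in seg.features


def in_position(*labels):
    return lambda seg: seg.position in labels


# Boolean connectives over predicates.
def AND(*preds):
    return lambda seg: all(p(seg) for p in preds)


def OR(*preds):
    return lambda seg: any(p(seg) for p in preds)


def NOT(pred):
    return lambda seg: not pred(seg)


# Custom predicates built from constituent labels.
prevocalic = in_position("LApp", "Onset")
postvocalic = in_position("Coda", "RApp", "OEHS")


def query(segments, pred):
    """Return the symbols of all segments that satisfy the predicate."""
    return [seg.symbol for seg in segments if pred(seg)]


word = [
    Segment("b", frozenset({"voiced", "labial"}), "Onset", stressed=True),
    Segment("æ", frozenset({"voiced", "vowel"}), "Nucleus", stressed=True),
    Segment("t", frozenset({"coronal"}), "Coda", stressed=True),
]

print(query(word, AND(has("voiced"), prevocalic)))   # ['b']
print(query(word, OR(has("labial"), postvocalic)))   # ['b', 't']
print(query(word, NOT(has("voiced"))))               # ['t']
```

Since a query is just a composed function, saving and sharing queries amounts to storing the expression that builds it.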
26. Query results
- Visualization from within the application
- Generation of textual reports
- Recording session: exemplification of a given process
- Time period: exemplification of an acquisition stage
- Entire database: establishment of a learning curve
- Exportation of results
- Text format (Unicode encoding)
- CSV format (e.g. Excel, Access, FileMaker Pro, ...)
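As an illustration of the CSV export, the sketch below writes query results as Unicode CSV text using Python's standard `csv` module. The column names and the sample row are hypothetical; the actual layout of Phon's exported reports is not specified here.

```python
import csv
import io


def export_results(rows, fieldnames):
    """Serialize a list of result dictionaries as CSV text with a header row."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"session": "s1", "target": "bənænə", "actual": "bænə"}]
text = export_results(rows, ["session", "target", "actual"])
print(text)

# Reading the text back recovers the same records, IPA characters included.
print(list(csv.DictReader(io.StringIO(text))) == rows)  # True
```

When saving such text to disk for spreadsheet programs, writing it with an explicit UTF-8 encoding keeps the IPA symbols intact.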
27. Future functionality
- Support for importation of existing corpora
- Addition of dictionaries of target forms supporting other languages/dialects
- Incorporation of basic statistical functions (using already-existing Java packages)
- Schema/graph generation
- Bar graphs (e.g. to illustrate the relative prominence of patterns)
- Line graphs (e.g. to illustrate learning curves)
28. Future functionality (continued)
- Interoperability with Praat and/or SFS
- Basic goal: compilation of acoustic parameters relative to phonological domains
- Alignment of transcriptions with waveforms/spectrograms (TextGrid-like function)
- Exportation of samples for speech analysis
- Importation of acoustic measurement data
- Web interface
- Data sharing at a distance
- Query of PhonBank without the need to download corpora
- Automatic detection of patterns
29. Timeline
- Late July / early August, 2005
- Release of the first complete version of Phon 1.0 (beta) for Macintosh, Windows, Linux, UNIX
- Partial compatibility with existing CHILDES corpora
- August - October, 2005
- Testing/debugging of the beta version
- Extension of CHILDES compatibility
- November, 2005
- Official release of Phon 1.0 at BUCLD
- Beginning of the PhonBank initiative
30. PhonBank project
- Project leaders
- Brian MacWhinney (CMU)
- Yvan Rose (MUN)
- Barbara Davis (Texas-Austin)
- Rodrigue Byrne (MUN)
- Research consortium
- 26 collaborators, 16 languages
- Monolingual, bilingual, clinical, babbling, second language, ...
- Awaiting results from grant application to NIH
31. Immediate potential
- Scientific exchanges between researchers working in related areas made easier
- Research based on:
- Much stronger empirical base
- Combination of various experimental methods
- Systematic comparisons of various corpora
- Within and across languages
- Within and across populations
- Within and across age groups
32. Long-term potential
- Better understanding of:
- Language acquisition process
- Developmental and acquired language disorders
- Contribution to development of more adequate theoretical models
- Establishment of more accurate baselines for early detection of language delays/disorders
- More rapid and efficient educational and therapeutic interventions
33. Thanks for your attention
34. Acknowledgements
- People at MUN
- David Graham, Dean of Arts
- Robert Lucas, Dean of Science
- Jim Black, Associate Dean of Arts
- Barbara Cox and her team, Office of Research
- Marguerite MacKenzie, Head of Linguistics, as well as all members of the department
- Wolfgang Banzhaf, Head of Computer Science
- For their feedback and encouragement: Éliane Lebel, Heather Goad, Paula Fikkert, Clara Levelt, Katherine Demuth, Mark Johnson, Carrie Dyck, Phil Branigan, Brian MacWhinney, Bryan Gick, Sophie Wauquier-Gravelines, Sharon Inkelas, Conxita Lleó, Sónia Frota, Maria João Freitas, Ronald Sprouse, Joe Pater, John Archibald, Susana Correia, Laetitia Almeida, Teresa da Costa, Barbara Davis, Christophe dos Santos, Sophie Kern, Christine Champdoizeau, Jennifer Parsons, Carla Dunphy, Lindsay Babcock, Allison Strong, Megan Maloney, Marina Vigário ... hoping that we didn't forget anyone
35. Acknowledgements
- The team behind Phon
- Rod Byrne, Todd Wareham, Gregory Hedlund, Philip O'Brien, Keith Maddocks
- The CHILDES computer guys
- Franklin Chen (Carnegie Mellon University)
- Leonid Spektor
- Financial support
- Arts Faculty, Memorial University (Y. Rose)
- VP Research, Memorial University (Y. Rose)
- Social Sciences and Humanities Research Council of Canada (J. Brittain, C. Dyck, Y. Rose and M. MacKenzie)
- Natural Sciences and Engineering Research Council of Canada (T. Wareham)
- National Science Foundation (B. MacWhinney)
- Canada Foundation for Innovation (Y. Rose)