The TalkBank Project

Tools for Analyzing Talk – Electronic Edition

Part 1:  The CHAT Transcription Format

Brian MacWhinney

Carnegie Mellon University

June 29, 2016

Citation for last printed version:

MacWhinney, B. (2000).  The CHILDES Project: Tools for Analyzing Talk. 3rd Edition.  Mahwah, NJ: Lawrence Erlbaum Associates.

1       Table of Contents

1      Table of Contents

2      Introduction to Electronic Edition

3      Introduction

3.1       Impressionistic Observation

3.2       Baby Biographies

3.3       Transcripts

3.4       Computers

3.5       Connectivity

3.6       Three Tools

3.7       Shaping CHAT

3.8       Building CLAN

3.9       Constructing the Database

3.10     Disseminating CHILDES

3.11     Funding

3.12     How to Use These Manuals

3.13     Changes

4      Principles

4.1       Computerization

4.2       Words of Caution

4.2.1        The Dominance of the Written Word

4.2.2        The Misuse of Standard Punctuation

4.2.3        Working With Video

4.3       Problems With Forced Decisions

4.4       Transcription and Coding

4.5       Three Goals

5      CHAT Outline

5.1       minCHAT – the Form of Files

5.2       minCHAT – Words and Utterances

5.3       Analyzing One Small File

5.4       Next Steps

5.5       File Naming

5.6       Metadata

5.7       The Documentation File

5.8       Checking Syntactic Accuracy

6      File Headers

6.1       Hidden Headers

6.2       Initial Headers

6.3       Participant-Specific Headers

6.4       Constant Headers

6.5       Changeable Headers

7      Words

7.1       The Main Line

7.2       Basic Words

7.3       Special Form Markers

7.4       Unidentifiable Material

7.5       Incomplete and Omitted Words

7.6       Standardized Spellings

7.6.1        Letters

7.6.2        Compounds and Linkages

7.6.3        Capitalization

7.6.4        Acronyms

7.6.5        Numbers and Titles

7.6.6        Kinship Forms

7.6.7        Shortenings

7.6.8        Assimilations

7.6.9        Communicators and Interjections

7.6.10      Spelling Variants

7.6.11      Colloquial Forms

7.6.12      Dialectal Variations

7.6.13      Baby Talk

7.6.14      Word separation in Japanese

7.6.15      Abbreviations in Dutch

8      Utterances

8.1       One Utterance or Many?

8.2       Satellite Markers

8.3       Discourse Repetition

8.4       C-Units, sentences, utterances, and run-ons

8.5       Retracing

8.6       Basic Utterance Terminators

8.7       Separators

8.8       Tone Direction

8.9       Prosody Within Words

8.10     Local Events

8.10.1      Simple Events

8.10.2      Complex Local Events

8.10.3      Pauses

8.10.4      Long Events

8.10.5      Interposed Remarks

8.11     Special Utterance Terminators

8.12     Utterance Linkers

9      Scoped Symbols

9.1       Audio and Video Time Marks

9.2       Paralinguistic Scoping and Events

9.3       Explanations and Alternatives

9.4       Retracing, Overlap, and Clauses

9.5       Error Marking

9.6       Initial and Final Codes

10    Dependent Tiers

10.1     Standard Dependent Tiers

10.2     Synchrony Relations

11    CHAT-CA Transcription

12    Disfluency Transcription

13    Arabic Transcription

14    Specific Applications

14.1     Code-Switching

14.2     Elicited Narratives and Picture Descriptions

14.3     Written Language

14.4     Nested Files for Gesture Analysis

14.5     Sign Language Transcription

14.6     Sign and Speech

15    Speech Act Codes

15.1     Interchange Types

15.2     Illocutionary Force Codes

16    Error Coding

16.1     Word level error codes summary

16.2     Word level coding – details

16.3     Utterance level error coding (post-codes)

17    Morphosyntactic Coding

17.1     One-to-one correspondence

17.2     Tag Groups and Word Groups

17.3     Words

17.4     Part of Speech Codes

17.5     Stems

17.6     Affixes

17.7     Clitics

17.8     Compounds

17.9     Punctuation Marks

17.10   Sample Morphological Tagging for English

References

2       Introduction to Electronic Edition

This electronic edition of the CHAT manual is being continually revised to keep pace with the growing interests of the language research communities served by the TalkBank and CHILDES systems. The first three editions were published in 1990, 1995, and 2000 by Lawrence Erlbaum Associates.  After 2000, we switched to the current electronic publication format.  However, we still hope that users of this system will cite the version of the manual published in 2000 when using data and programs in published work.

In its current version, this manual still tends to focus on the use of the programs for child language data in the context of the CHILDES system (childes.talkbank.org).  However, beginning in 2001 with support from NSF, we introduced the concept of TalkBank to include a wide variety of language databases. These now include:

1.     CHILDES (childes.talkbank.org) for child language acquisition,

2.     AphasiaBank (talkbank.org/Aphasiabank) for aphasia,

3.     PhonBank for the study of phonological development,

4.     TBIBank for language in traumatic brain injury,

5.     DementiaBank for language in dementia,

6.     FluencyBank for the study of childhood fluency development,

7.     HomeBank (homebank.talkbank.org) for daylong recordings in the home,

8.     CABank for Conversation Analysis,

9.     SLABank (sla.talkbank.org) for second language acquisition,

10. ClassBank for studies of language in the classroom,

11. BilingBank for the study of bilingualism and code-switching,

12. LangBank for the study and learning of classical languages,

13. SamtaleBank for Danish conversations,

14. and the SCOTUS corpus with 50 years of oral arguments linked to transcripts at the Supreme Court of the United States. 

We are continually adding corpora to each of these collections.  The current size of the text database is 800 MB, and there is an additional 3 TB of media. All of the data in TalkBank are freely open to downloading and analysis, with the exception of the data in AphasiaBank, which are open to clinical researchers. The CLAN program and the related morphosyntactic taggers are all free and open-sourced through GitHub.

Fortunately, all of these different language banks make use of the same transcription format (CHAT) and the same set of programs (CLAN).  This means that, although most of the examples in this manual rely on data from the CHILDES database, the principles extend easily to data in all of the TalkBank repositories.  TalkBank (http://talkbank.org) is the largest open repository of data on spoken language.  All of the data in TalkBank are transcribed in the CHAT format, which is compatible with the CLAN programs.

Using conversion programs available inside CLAN (see the CLAN manual for details), transcripts in CHAT format can be automatically converted into the formats required for Praat (praat.org), Phon (childes.talkbank.org/phon), ELAN (tla.mpi.nl/tools/elan), CoNLL, ANVIL (anvil-software.org), EXMARaLDA (exmaralda.org), LIPP (ihsys.com), SALT (saltsoftware.com), LENA (lenafoundation.org), Transcriber (trans.sourceforge.net), and ANNIS (corpus-tools.org/ANNIS). 

TalkBank databases and programs have been used widely in the research literature.  CHILDES, which is the oldest and most widely recognized of these databases, has been used in over 6000 published articles.  PhonBank has been used in 480 articles and AphasiaBank has been used in 212 publications.  In general, the longer a database has been available to researchers, the more the use of that database has become integrated into the basic research methodology and publication history of the field.

Metadata for the transcripts and media in these various TalkBank databases have been entered into the two major systems for accessing linguistic data: OLAC (see Simons, this volume) and CMDI/TLA (see Trippel, this volume).  Each transcript and media file has been assigned a PID (permanent ID) using the Handle System (www.handle.net), and each corpus has received an ISBN.  In addition, we are currently implementing DOI (digital object identifier) coding.

For ten of the languages in the database, we provide automatic morphosyntactic analysis using a series of programs built into CLAN.  These languages are Cantonese, Chinese, Dutch, English, French, German, Hebrew, Japanese, Italian, and Spanish.  The codes produced by these programs could eventually be harmonized with the GOLD ontology (Cavar et al., this volume).  In addition, we can compute a dependency grammar analysis for each of these 10 languages. Through these various methods of transcript format conversion, metadata publication, and grammatical analysis, TalkBank has already fulfilled many of the goals of the LLOD Project. As a result of these efforts, TalkBank has been recognized as a Center in the CLARIN network (clarin.eu) and has received the Data Seal of Approval (datasealofapproval.org).  TalkBank data have also been included in the SketchEngine corpus tool (sketchengine.co.uk).

3       Introduction

Language acquisition research thrives on data collected from spontaneous interactions in naturally occurring situations. You can turn on a tape recorder or videotape, and, before you know it, you will have accumulated a library of dozens or even hundreds of hours of naturalistic interactions. But simply collecting data is only the beginning of a much larger task, because the process of transcribing and analyzing naturalistic samples is extremely time-consuming and often unreliable. In this first volume, we will present a set of computational tools designed to increase the reliability of transcriptions, automate the process of data analysis, and facilitate the sharing of transcript data. These new computational tools have brought about revolutionary changes in the way that research is conducted in the child language field. In addition, they have equally revolutionary potential for the study of second-language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia. Although the tools are of wide applicability, this volume concentrates on their use in the child language field, in the hope that researchers from other areas can make the necessary analogies to their own topics.

Before turning to a detailed examination of the current system, it may be helpful to take a brief historical tour over some of the major highlights of earlier approaches to the collection of data on language acquisition. These earlier approaches can be grouped into five major historical periods.

3.1      Impressionistic Observation

The first attempt to understand the process of language development appears in a remarkable passage from The Confessions of St. Augustine (1952). In this passage, Augustine claims that he remembered how he had learned language:

This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named anything, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing, and no other, was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and, having broken in my mouth to these signs, I thereby gave utterance to my will. Thus I exchanged with those about me these current signs of our wills, and so launched deeper into the stormy intercourse of human life, yet depending on parental authority and the beck of elders.

Augustine's outline of early word learning drew attention to the role of gaze, pointing, intonation, and mutual understanding as fundamental cues to language learning.  Modern research in word learning (P. Bloom, 2000) has supported every point of Augustine's analysis, as well as his emphasis on the role of children's intentions.  In this sense, Augustine's somewhat fanciful recollection of his own language acquisition remained the high water mark for child language studies through the Middle Ages and even the Enlightenment. Unfortunately, the method on which these insights were grounded depends on our ability to actually recall the events of early childhood – a gift granted to very few of us.

3.2      Baby Biographies

Charles Darwin provided much of the inspiration for the development of the second major technique for the study of language acquisition. Using note cards and field books to track the distribution of hundreds of species and subspecies in places like the Galapagos and Indonesia, Darwin was able to collect an impressive body of naturalistic data in support of his views on natural selection and evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for naturalistic observation could be adapted to the study of human development. By taking detailed daily notes, Darwin showed how researchers could build diaries that could then be converted into biographies documenting virtually any aspect of human development. Following Darwin's lead, scholars such as Ament (1899), Preyer (1882), Gvozdev (1949), Szuman (1955), Stern & Stern (1907), Kenyeres (1926, 1938), and Leopold (1939, 1947, 1949a, 1949b) created monumental biographies detailing the language development of their own children.

Darwin's biographical technique also had its effects on the study of adult aphasia. Following in this tradition, studies of the language of particular patients and syndromes were presented by Low (1931), Pick (1913), Wernicke (1874), and many others.

3.3      Transcripts

The limits of the diary technique were always quite apparent. Even the most highly trained observer could not keep pace with the rapid flow of normal speech production. Anyone who has attempted to follow a child about with a pen and a notebook soon realizes how much detail is missed and how the note-taking process interferes with the ongoing interactions.

The introduction of the tape recorder in the late 1950s provided a way around these limitations and ushered in the third period of observational studies. The effect of the tape recorder on the field of language acquisition was very much like its effect on ethnomusicology, where researchers such as Alan Lomax (Parrish, 1996) were suddenly able to produce high quality field recordings using this new technology. This period was characterized by projects in which groups of investigators collected large data sets of tape recordings from several subjects across a period of 2 or 3 years. Much of the excitement in the 1960s regarding new directions in child language research was fueled directly by the great increase in raw data that was possible through use of tape recordings and typed transcripts.

This increase in the amount of raw data had an additional, seldom discussed, consequence. In the period of the baby biography, the final published accounts closely resembled the original database of note cards. In this sense, there was no major gap between the observational database and the published database. In the period of typed transcripts, a wider gap emerged. The size of the transcripts produced in the 1960s and 1970s made it impossible to publish the full corpora. Instead, researchers were forced to publish only high-level analyses based on data that were not available to others. This led to a situation in which the raw empirical database for the field was kept only in private stocks, unavailable for general public examination. Comments and tallies were written into the margins of ditto master copies, and new, even less legible copies were then made by thermal production of new ditto masters. Each investigator devised a project-specific system of transcription and project-specific codes. As we began to compare hand-written and typewritten transcripts, problems in transcription methodology, coding schemes, and cross-investigator reliability became more apparent.

Recognizing this problem, Roger Brown took the lead in attempting to share his transcripts from Adam, Eve, and Sarah (Brown, 1973) with other researchers. These transcripts were typed onto stencils and mimeographed in multiple copies. The extra copies were lent to and analyzed by a wide variety of researchers. In this model, researchers took their copy of the transcript home, developed their own coding scheme, applied it (usually by making pencil markings directly on the transcript), wrote a paper about the results and, if very polite, sent a copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the conclusions drawn from those data by Brown himself!

During this early period, the relations between the various coding schemes often remained shrouded in mystery. A fortunate consequence of the unstable nature of coding systems was that researchers were very careful not to throw away their original data, even after it had been coded. Brown himself commented on the impending transition to computers in this passage (Brown, 1973, p. 53):

It is sensible to ask and we were often asked, “Why not code the sentences for grammatically significant features and put them on a computer so that studies could readily be made by anyone?”  My answer always was that I was continually discovering new kinds of information that could be mined from a transcription of conversation and never felt that I knew what the full coding should be.  This was certainly the case and indeed it can be said that in the entire decade since 1962 investigators have continued to hit upon new ways of inferring grammatical and semantic knowledge or competence from free conversation. But, for myself, I must, in candor, add that there was also a factor of research style.  I have little patience with prolonged “tooling up” for research.  I always want to get started. A better scientist would probably have done more planning and used the computer.  He can do so today, in any case, with considerable confidence that he knows what to code.

With the experience of three more decades of computerized analysis behind us, we now know that the idea of reducing child language data to a set of codes and then throwing away the original data is simply wrong.  Instead, our goal must be to computerize the data in a way that allows us to continually enhance it with new codes and annotations.  It is fortunate that Brown preserved his transcript data in a form that allowed us to continue to work on it.  It is unfortunate, however, that the original audiotapes were not kept.

3.4      Computers

Just as these data analysis problems were coming to light, a major technological opportunity was emerging in the shape of the powerful, affordable microcomputer. Microcomputer word-processing systems and database programs allowed researchers to enter transcript data into computer files that could then be easily duplicated, edited, and analyzed by standard data-processing techniques. In 1981, when the Child Language Data Exchange System (CHILDES) Project was first conceived, researchers basically thought of computer systems as large notepads. Although researchers were aware of the ways in which databases could be searched and tabulated, the full analytic and comparative power of the computer systems themselves was not yet fully understood.

Rather than serving only as an “archive” or historical record, a focus on a shared database can lead to advances in methodology and theory. However, to achieve these additional advances, researchers first needed to move beyond the idea of a simple data repository. At first, the possibility of utilizing shared transcription formats, shared codes, and shared analysis programs shone only as a faint glimmer on the horizon, against the fog and gloom of handwritten tallies, fuzzy dittos, and idiosyncratic coding schemes. Slowly, against this backdrop, the idea of a computerized data exchange system began to emerge. It was against this conceptual background that CHILDES (the name uses a one-syllable pronunciation) was conceived. The origin of the system can be traced back to the summer of 1981 when Dan Slobin, Willem Levelt, Susan Ervin-Tripp, and Brian MacWhinney discussed the possibility of creating an archive for typed, handwritten, and computerized transcripts to be located at the Max-Planck-Institut für Psycholinguistik in Nijmegen. In 1983, the MacArthur Foundation funded meetings of developmental researchers in which Elizabeth Bates, Brian MacWhinney, Catherine Snow, and other child language researchers discussed the possibility of soliciting MacArthur funds to support a data exchange system. In January of 1984, the MacArthur Foundation awarded a two-year grant to Brian MacWhinney and Catherine Snow for the establishment of the Child Language Data Exchange System. These funds provided for the entry of data into the system and for the convening of a meeting of an advisory board. Twenty child language researchers met for three days in Concord, Massachusetts and agreed on a basic framework for the CHILDES system, which Catherine Snow and Brian MacWhinney would then proceed to implement.

3.5      Connectivity

Since 1984, when the CHILDES Project began in earnest, the world of computers has gone through a series of remarkable revolutions, each introducing new opportunities and challenges. The processing power of the home computer now dwarfs the power of the mainframe of the 1980s; new machines are now shipped with built-in audiovisual capabilities; and devices such as CD-ROMs and optical disks offer enormous storage capacity at reasonable prices. This new hardware has now opened up the possibility for multimedia access to digitized audio and video from links inside the written transcripts. In effect, a transcript is now the starting point for a new exploratory reality in which the whole interaction is accessible from the transcript. Although researchers have just now begun to make use of these new tools, the current shape of the CHILDES system reflects many of these new realities. In the pages that follow, you will learn about how we are using this new technology to provide rapid access to the database and to permit the linkage of transcripts to digitized audio and video records, even over the Internet. For further ideas regarding this type of work, you may wish to connect to http://talkbank.org where there are various extensions of the CHILDES project.

3.6      Three Tools

The reasons for developing a computerized exchange system for language data are immediately obvious to anyone who has produced or analyzed transcripts. With such a system, we can:

1.     automate the process of data analysis,

2.     obtain better data in a consistent, fully-documented transcription system, and

3.     provide more data for more children from more ages, speaking more languages.

The CHILDES system has addressed each of these goals by developing three separate, but integrated, tools. The first tool is the CHAT transcription and coding format. The second tool is the CLAN analysis program, and the third tool is the database. These three tools are like the legs of a three-legged stool. The transcripts in the database have all been put into the CHAT transcription system. The CLAN program is designed to make full use of the CHAT format to facilitate a wide variety of searches and analyses. Many research groups are now using the CHILDES programs to enter new data sets. Eventually, these new data sets will be available to other researchers as a part of the growing CHILDES database. In this way, CHAT, CLAN, and the database function as a coarticulated set of complementary tools.

 

There are manuals for each of the three CHILDES tools. The CHAT manual, which you are now reading, describes the conventions and principles of CHAT transcription. The CLAN manual describes the use of the CLAN computer programs that you can use to transcribe, annotate, and analyze language interactions. The third manual, which is actually a collection of over a dozen separate manuals retrievable from a single link on the web, describes the data files in the CHILDES database. Each of these database manuals describes the data sets in one major component of the database. In addition, there is a short manual that provides an overview for the entire database.

3.7      Shaping CHAT

We received a great deal of extremely helpful input during the years between 1984 and 1988 when the CHAT system was being formulated. Some of the most detailed comments came from George Allen, Elizabeth Bates, Nan Bernstein Ratner, Giuseppe Cappelli, Annick De Houwer, Jane Desimone, Jane Edwards, Julia Evans, Judi Fenson, Paul Fletcher, Steven Gillis, Kristen Keefe, Mary MacWhinney, Jon Miller, Barbara Pan, Lucia Pfanner, Kim Plunkett, Kelley Sacco, Catherine Snow, Jeff Sokolov, Leonid Spektor, Joseph Stemberger, Frank Wijnen, and Antonio Zampolli. Comments developed in Edwards (1992) were useful in shaping core aspects of CHAT. George Allen (1988) helped develop the UNIBET and PHONASCII systems. The workers in the LIPPS Group (LIPPS, 2000) have developed extensions of CHAT to cover code-switching phenomena. Adaptations of CHAT to deal with data on disfluencies are developed in Bernstein-Ratner, Rooney, and MacWhinney (1996). The exercises in Chapter 7 of Part II are based on materials originally developed by Barbara Pan for Chapter 2 of Sokolov & Snow (1994).

In the period between 2001 and 2004, we converted much of the CHILDES system to work with the new XML Internet data format.  This work was begun by Romeo Anghelache and completed by Franklin Chen. Support for this major reformatting and the related tightening of the CHAT format came from the NSF TalkBank Infrastructure project which involved a major collaboration with Steven Bird and Mark Liberman of the Linguistic Data Consortium. Ongoing work in TalkBank is documented on the web at http://talkbank.org. 

3.8      Building CLAN

The CLAN program is the brainchild of Leonid Spektor. Ideas for particular analysis commands came from several sources. Bill Tuthill's HUM package provided ideas about concordance analyses. The SALT system of Miller & Chapman (1983) provided guidelines regarding basic practices in transcription and analysis. Clifton Pye's PAL program provided ideas for the MODREP and PHONFREQ commands.

 

Darius Clynes ported CLAN to the Macintosh. Jeffrey Sokolov wrote the CHIP program. Mitzi Morris designed the MOR analyzer using specifications provided by Roland Hausser of Erlangen University. Norio Naka and Susanne Miyata developed a MOR rule system for Japanese, and Monica Sanz-Torrent helped develop the MOR system for Spanish. Julia Evans provided recommendations for the design of the audio and visual capabilities of the editor. Johannes Wagner, Mike Forrester, and Chris Ramsden helped show us how we could modify CLAN to permit transcription in the Conversation Analysis framework. Steven Gillis provided suggestions for aspects of MODREP. Christophe Parisse built the POST and POSTTRAIN programs (Parisse & Le Normand, 2000). Brian Richards contributed the VOCD program (Malvern, Richards, Chipere, & Durán, 2004). Julia Evans helped specify TIMEDUR and worked on the details of DSS. Catherine Snow designed CHAINS, KEYMAP, and STATFREQ. Nan Bernstein Ratner specified aspects of PHONFREQ and plans for additional programs for phonological analysis.

3.9      Constructing the Database

The primary reason for the success of the CHILDES database has been the generosity of over 100 researchers who have contributed their corpora. Each of these corpora represents hundreds, often thousands, of hours spent in careful collection, transcription, and checking of data. All researchers in child language should be proud of the way researchers have generously shared their valuable data with the whole research community. The growing size of the database for language impairments, adult aphasia, and second-language acquisition indicates that these related areas have also begun to understand the value of data sharing.

 

Many of the corpora contributed to the system were transcribed before the formulation of CHAT. In order to create a uniform database, we had to reformat these corpora into CHAT. Jane Desimone, Mary MacWhinney, Jane Morrison, Kim Roth, Kelley Sacco, and Gergely Sikuta worked many long hours on this task. Steven Gillis, Helmut Feldweg, Susan Powers, and Heike Behrens supervised a parallel effort with the German and Dutch data sets.

 

Because of the continually changing shape of the programs and the database, keeping this manual up to date has been an ongoing activity. In this process, I received help from Mike Blackwell, Julia Evans, Kris Loh, Mary MacWhinney, Lucy Hewson, Kelley Sacco, and Gergely Sikuta. Barbara Pan, Jeff Sokolov, and Pam Rollins also provided a reading of the final draft of the 1995 version of the manual.

3.10  Disseminating CHILDES

Since the beginning of the project, Catherine Snow has continually played a pivotal role in shaping policy, building the database, organizing workshops, and determining the shape of CHAT and CLAN. Catherine Snow collaborated with Jeffrey Sokolov, Pam Rollins, and Barbara Pan to construct a series of tutorial exercises and demonstration analyses that appeared in Sokolov & Snow (1994). Those exercises form the basis for similar tutorial sections in the current manual. Catherine Snow has contributed six major corpora to the database and has conducted CHILDES workshops in a dozen countries.

 

Several other colleagues have helped disseminate the CHILDES system through workshops, visits, and Internet facilities. Hidetosi Sirai established a CHILDES file server mirror at Chukyo University in Japan and Steven Gillis established a mirror at the University of Antwerp. Steven Gillis, Kim Plunkett, Johannes Wagner, and Sven Strömqvist helped propagate the CHILDES system at universities in Northern and Central Europe. Susanne Miyata has brought together a vital group of child language researchers using CHILDES to study the acquisition of Japanese and has supervised the translation of the current manual into Japanese. In Italy, Elena Pizzuto organized symposia for developing the CHILDES system and has supervised the translation of the manual into Italian. Magdalena Smoczynska in Krakow and Wolfgang Dressler in Vienna have helped new researchers who are learning to use CHILDES for languages spoken in Eastern Europe. Miquel Serra has supported a series of CHILDES workshops in Barcelona. Zhou Jing organized a workshop in Nanjing and Chien-ju Chang organized a workshop in Taipei.

3.11  Funding

From 1984 to 1988, the John D. and Catherine T. MacArthur Foundation supported the CHILDES Project. In 1988, the National Science Foundation provided an equipment grant that allowed us to put the database on the Internet and on CD-ROMs. From 1989 to 2010, the project was supported by an ongoing grant from the National Institutes of Health (NICHD). In 1998, the National Science Foundation Linguistics Program provided additional support to improve the programs for morphosyntactic analysis of the database. In 1999, NSF funded the TalkBank project, which seeks to improve the CHILDES tools and to use CHILDES as a model for other disciplines studying human communication. In 2002, NSF provided support for the development of the GRASP system for parsing of the corpora. In 2002, NIH provided additional support for the development of PhonBank for child language phonology and AphasiaBank for the study of communication in aphasia.

3.12  How to Use These Manuals

Each of the three parts of the CHILDES system is described in a separate manual.  The CHAT manual describes the conventions and principles of CHAT transcription. The CLAN manual describes the use of the editor and the analytic commands. The database manual is a set of over a dozen smaller documents, each describing a separate segment of the database.

 

To learn the CHILDES system, you should begin by downloading and installing the CLAN program. Next, you should download and start to read the current manual (CHAT Manual) and the CLAN manual. Before proceeding too far into the CHAT manual, you will want to walk through the tutorial section at the beginning of the CHAT manual. After finishing the tutorial, try working a bit with each of the CLAN commands to get a feel for the overall scope of the system. You can then learn more about CHAT by transcribing a small sample of your data in a short test file. Run the CHECK program at frequent intervals to verify the accuracy of your coding. Once you have finished transcribing a small segment of your data, try out the various analysis programs you plan to use, to make sure that they provide the types of results you need for your work.

 

If you are primarily interested in analyzing data already stored in the CHILDES archive, you do not need to learn the CHAT transcription format in much detail and you will only need to use the editor to open and read files. In that case, you may wish to focus your efforts on learning to use the CLAN programs. If you plan to transcribe new data, then you also need to work with the current manual to learn to use CHAT.

 

Teachers will also want to pay particular attention to the sections of the CLAN manual that present a tutorial introduction. Using some of the examples given there, you can construct additional materials to encourage students to explore the database to test out particular hypotheses. At the end of the CLAN manual, there is also a series of exercises that help students further consolidate their knowledge of CHAT and CLAN.

 

The CHILDES system was not intended to address all issues in the study of language learning, or to be used by all students of spontaneous interactions. The CHAT system is comprehensive, but it is not ideal for all purposes. The programs are powerful, but they cannot solve all analytic problems. It is not the goal of CHILDES to provide facilities for all research endeavors or to force all research into some uniform mold. On the contrary, the programs are designed to offer support for alternative analytic frameworks. For example, the editor now supports the various codes of Conversation Analysis (CA) format, as alternatives and supplements to CHAT format.

 

There are many researchers in the fields that study language learning who will never need to use CHILDES. Indeed, we estimate that the three CHILDES tools will never be used by at least half of the researchers in the field of child language. There are three common reasons why individual researchers may not find CHILDES useful:

1.     some researchers may have already committed themselves to use of another analytic system;

2.     some researchers may have collected so much data that they can work for many years without needing to collect more data and without comparing their own data with other researchers' data; and

3.     some researchers may not be interested in studying spontaneous speech data.

Of these three reasons for not needing to use the three CHILDES tools, the third is the most frequent. For example, researchers studying comprehension would only be interested in CHILDES data when they wish to compare findings arising from studies of comprehension with patterns occurring in spontaneous production.

3.13  Changes

The CHILDES tools have been extensively tested for ease of application, accuracy, and reliability. However, change is fundamental to any research enterprise. Researchers are constantly pursuing better ways of coding and analyzing data. It is important that the CHILDES tools keep pace with these changing requirements. For this reason, there will be revisions to CHAT, the programs, and the database as long as the CHILDES Project is active.

4       Principles

The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions. These interactions may involve children and parents, doctors and patients, or teachers and second-language learners. Despite the differences between these interactions, there are enough common features to allow for the creation of a single general transcription system. The system described here is designed for use with both normal and disordered populations. It can be used with learners of all types, including children, second-language learners, and adults recovering from aphasic disorders. The system provides options for basic discourse transcription as well as detailed phonological and morphological analysis. The system bears the acronym “CHAT,” which stands for Codes for the Human Analysis of Transcripts. CHAT is the standard transcription system for the CHILDES (Child Language Data Exchange System) Project. All of the transcripts in the CHILDES database are in CHAT format.

What makes CHAT particularly powerful is the fact that files transcribed in CHAT can also be analyzed by the CLAN programs that are described in the CLAN manual, which is an electronic companion piece to this manual. The CLAN programs can track a wide variety of structures, compute automatic indices, and analyze morphosyntax. Moreover, because all CHAT files can now also be translated to a highly structured form of XML (a language used for text documents on the web), they are now also compatible with a wide range of other powerful computer programs such as ELAN, Praat, EXMARaLDA, Phon, Transcriber, and so on.

The CHILDES system has had a major impact on the study of child language. At the time of the last monitoring in 2003, there were over 2000 published articles that had made use of the programs and database. In 2007, the size of the database had grown to over 44 million words, making it by far the largest database of conversational interactions available anywhere. The total number of researchers who have joined as CHILDES members across the length of the project is now over 4500. Of course, not all of these people are making active use of the tools at all times. However, it is safe to say that, at any given point in time, approximately 100 groups of researchers around the world are involved in new data collection and transcription using the CHAT system. Eventually the data collected in these various projects will all be contributed to the database.

4.1      Computerization

Public inspection of experimental data is a crucial prerequisite for serious scientific progress. Imagine how genetics would function if every experimenter had his or her own individual strain of peas or drosophila and refused to allow them to be tested by other experimenters. What would happen in geology, if every scientist kept his or her own set of rock specimens and refused to compare them with those of other researchers? In some fields the basic phenomena in question are so clearly open to public inspection that this is not a problem. The basic facts of planetary motion are open for all to see, as are the basic facts underlying Newtonian mechanics.

 

Unfortunately, in language studies, a free and open sharing and exchange of data has not always been the norm. In earlier decades, researchers jealously guarded their field notes from a particular language community or subject type, refusing to share them openly with the broader community. Various justifications were given for this practice. It was sometimes claimed that other researchers would not fully appreciate the nature of the data or that they might misrepresent crucial patterns. Sometimes, it was claimed that only someone who had actually participated in the community or the interaction could understand the nature of the language and the interactions. In some cases, these limitations were real and important. However, all such restrictions on the sharing of data inevitably impede the progress of the scientific study of language learning.

 

Within the field of language acquisition studies it is now understood that the advantages of sharing data outweigh the potential dangers. The question is no longer whether data should be shared, but rather how they can be shared in a reliable and responsible fashion. The computerization of transcripts opens up the possibility for many types of data sharing and analysis that otherwise would have been impossible. However, the full exploitation of this opportunity requires the development of a standardized system for data transcription and analysis.

4.2      Words of Caution

Before examining the CHAT system, we need to consider some dangers involved in computerized transcriptions. These dangers arise from the need to compress a complex set of verbal and nonverbal messages into the extremely narrow channel required for the computer. In most cases, these dangers also exist when one creates a typewritten or handwritten transcript. Let us look at some of the dangers surrounding the enterprise of transcription.

4.2.1     The Dominance of the Written Word

Perhaps the greatest danger facing the transcriber is the tendency to treat spoken language as if it were written language. The decision to write out stretches of vocal material using the forms of written language can trigger a variety of theoretical commitments. As Ochs (1979) showed so clearly, these decisions will inevitably turn transcription into a theoretical enterprise. The most difficult bias to overcome is the tendency to map every form spoken by a learner – be it a child, an aphasic, or a second-language learner – onto a set of standard lexical items in the adult language. Transcribers tend to assimilate nonstandard learner strings to standard forms of the adult language. For example, when a child says “put on my jamas,” the transcriber may instead enter “put on my pajamas,” reasoning unconsciously that “jamas” is simply a childish form of “pajamas.” This type of regularization of the child form to the adult lexical norm can lead to misunderstanding of the shape of the child's lexicon. For example, it could be the case that the child uses “jamas” and “pajamas” to refer to two very different things (Clark, 1987; MacWhinney, 1989).

There are two types of errors possible here. One involves mapping a learner's spoken form onto an adult form when, in fact, there was no real correspondence. This is the problem of overnormalization. The second type of error involves failing to map a learner's spoken form onto an adult form when, in fact, there is a correspondence. This is the problem of undernormalization. The goal of transcribers should be to avoid both the Scylla of overnormalization and the Charybdis of undernormalization. Steering a course between these two dangers is no easy matter. A transcription system can provide devices to aid in this process, but it cannot guarantee safe passage.

 

Transcribers also often tend to assimilate the shape of sounds spoken by the learner to the shapes that are dictated by morphosyntactic patterns. For example, Fletcher (1985) noted that both children and adults generally produce “have” as “uv” before main verbs. As a result, forms like “might have gone” assimilate to “mightuv gone.” Fletcher believed that younger children have not yet learned to associate the full auxiliary “have” with the contracted form. If we write the children's forms as “might have,” we then end up mischaracterizing the structure of their lexicon. To take another example, we can note that, in French, the various endings of the verb in the present tense are distinguished in spelling, whereas they are homophonous in speech. If a child says /mɑ̃ʒ/ “eat,” are we to transcribe it as first person singular mange, as second person singular manges, or as the imperative mange? If the child says /mɑ̃ʒe/, should we transcribe it as the infinitive manger, the participle mangé, or the second person formal mangez?

 

CHAT deals with these problems in three ways. First, it uses IPA as a uniform way of transcribing discourse phonetically. Second, the editor allows the user to link the digitized audio record of the interaction directly to the transcript. This is the system called “sonic CHAT.” With these sonic CHAT links, it is possible to double-click on a sentence and hear its sound immediately. Having the actual sound produced by the child directly available in the transcript takes some of the burden off the transcription system. However, whenever computerized analyses are based not on the original audio signal but on transcribed orthographic forms, one must continue to understand the limits of transcription conventions. Third, for those who wish to avoid the work involved in IPA transcription or sonic CHAT, there is a system for using nonstandard lexical forms, so that the form “might (h)ave” would be universally recognized as the spelling of “mightof”, the contracted form of “might have.” More extreme cases of phonological variation can be annotated as in this example: popo [: hippopotamus].
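To make the last convention concrete, here is a minimal Python sketch of how a program might read annotations of the form shown above. It is not part of CLAN. The regular expression is an assumption based only on the single example “popo [: hippopotamus]”, and the function names are hypothetical.

```python
import re

# Assumed pattern: a spoken form followed by "[: standard form]",
# modeled on the example "popo [: hippopotamus]" only.
REPLACEMENT = re.compile(r"(\S+)\s+\[:\s*([^\]]+?)\s*\]")

def find_replacements(text):
    """Return (spoken_form, standard_form) pairs found in the text."""
    return REPLACEMENT.findall(text)

def normalize(text):
    """Substitute each annotated spoken form with its standard form."""
    return REPLACEMENT.sub(lambda m: m.group(2), text)

example = "I see a popo [: hippopotamus] ."
print(find_replacements(example))  # [('popo', 'hippopotamus')]
print(normalize(example))          # I see a hippopotamus .
```

Keeping both forms in the transcript lets an analysis choose between what the child actually said and the standard form, which is one way of steering between overnormalization and undernormalization.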

4.2.2     The Misuse of Standard Punctuation

Transcribers have a tendency to write out spoken language with the punctuation conventions of written language. Written language is organized into clauses and sentences delimited by commas, periods, and other marks of punctuation. Spoken language, on the other hand, is organized into tone units clustered about a tonal nucleus and delineated by pauses and tonal contours (Crystal, 1969, 1979; Halliday, 1966, 1967, 1968). Work on the discourse basis of sentence production (Chafe, 1980; Jefferson, 1984) has demonstrated a close link between tone units and ideational units. Retracings, pauses, stress, and all forms of intonational contours are crucial markers of aspects of the utterance planning process. Moreover, these features also convey important sociolinguistic information. Without special markings or conventions, there is no way to directly indicate these important aspects of interactions.

4.2.3     Working With Video

Whatever form a transcript may take, it will never contain a fully accurate record of what went on in an interaction. A transcript of an interaction can never fully replace an audiotape, because an audio recording of the interaction will always be more accurate in terms of preserving the actual details of what transpired. By the same token, an audio recording can never preserve as much detail as a video recording with a high-quality audio track. Audio recordings record none of the nonverbal interactions that often form the backbone of a conversational interaction. Hence, they systematically exclude a source of information that is crucial for a full interpretation of the interaction. Although there are biases involved even in a video recording, it is still the most accurate record of an interaction that we have available. For those who are trying to use transcription to capture the full detailed character of an interaction, it is imperative that transcription be done from a video recording which should be repeatedly consulted during all phases of analysis.

 

When the CLAN editor is used to link transcripts to audio recordings, we refer to this as sonic CHAT. When the system is used to link transcripts to video recordings, we refer to this as video CHAT. The CLAN manual explains how to link digital audio and video to transcripts.

4.3      Problems With Forced Decisions

Transcription and coding systems often force the user to make difficult distinctions. For example, a system might make a distinction between grammatical ellipsis and ungrammatical omission. However, it may often be the case that the user cannot decide whether an omission is grammatical or not. In that case, it may be helpful to have some way of blurring the distinction. CHAT has certain symbols that can be used when a categorization cannot be made. It is important to remember that many of the CHAT symbols are entirely optional. Whenever you feel that you are being forced to make a distinction, check the manual to see whether the particular coding choice is actually required. If it is not required, then simply omit the code altogether.

4.4      Transcription and Coding

It is important to recognize the difference between transcription and coding. Transcription focuses on the production of a written record that can lead us to understand, albeit only vaguely, the flow of the original interaction. Transcription must be done directly off an audiotape or, preferably, a videotape. Coding, on the other hand, is the process of recognizing, analyzing, and taking note of phenomena in transcribed speech. Coding can often be done by referring only to a written transcript. For example, the coding of parts of speech can be done directly from a transcript without listening to the audiotape. For other types of coding, such as speech act coding, it is imperative that coding be done while watching the original videotape.

 

The CHAT system includes conventions for both transcription and coding. When first learning the system, it is best to focus on learning how to transcribe. The CHAT system offers the transcriber a large array of coding options. Although few transcribers will need to use all of the options, everyone needs to understand how basic transcription is done on the “main line.” Additional coding is done principally on the secondary or “dependent” tiers. As transcribers work more with their data, they will include further options from the secondary or “dependent” tiers. However, the beginning user should focus first on learning to correctly use the conventions for the main line. The manual includes several sample transcripts to help the beginner in learning the transcription system.
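The relation between the main line and its dependent tiers can be illustrated with a short Python sketch. It is an illustration only, not CLAN code. It assumes the line shapes described in the CHAT Outline chapter (main lines begin with * and a speaker code, dependent tiers begin with % and a tier code), and the sample utterances and the %pho content are invented.

```python
def group_tiers(lines):
    """Attach each dependent-tier line (%) to the preceding main line (*)."""
    utterances = []
    for line in lines:
        if line.startswith("*"):
            # Main line: "*CHI:<tab>what was said"
            speaker, _, text = line[1:].partition(":")
            utterances.append({"speaker": speaker, "text": text.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            # Dependent tier: "%pho:<tab>coding for the utterance above"
            tier, _, text = line[1:].partition(":")
            utterances[-1]["tiers"][tier] = text.strip()
    return utterances

# Invented sample data, for illustration only.
sample = [
    "*CHI:\tmore cookie .",
    "%pho:\tmo kuki",
    "*MOT:\tyou want another cookie ?",
]
for utterance in group_tiers(sample):
    print(utterance["speaker"], "|", utterance["text"], "|", utterance["tiers"])
```

The sketch reflects the division of labor described above: the transcription itself lives on the main line, while each dependent tier adds optional coding for the utterance it follows.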

4.5      Three Goals

Like other forms of communication, transcription systems are subjected to a variety of communicative pressures. The view of language structure developed by Slobin (1977) sees structure as emerging from the pressure of three conflicting charges or goals. On the one hand, language is designed to be clear. On the other hand, it is designed to be processible by the listener and quick and easy for the speaker. Unfortunately, ease of production often comes in conflict with clarity of marking. The competition between these three motives leads to a variety of imperfect solutions that satisfy each goal only partially. Such imperfect and unstable solutions characterize the grammar and phonology of human language (Bates & MacWhinney, 1982). Only rarely does a solution succeed in fully achieving all three goals.

 

Slobin's view of the pressures shaping human language can be extended to analyze the pressures shaping a transcription system. In many regards, a transcription system is much like any human language. It needs to be clear in its markings of categories, and still preserve readability and ease of transcription. However, unlike a human language, a transcription system needs to address two different audiences. One audience is the human audience of transcribers, analysts, and readers. The other audience is the digital computer and its programs. In order to successfully deal with these two audiences, a system for computerized transcription needs to achieve the following goals:

1.     Clarity: Every symbol used in the coding system should have some clear and definable real-world referent. The relation between the referent and the symbol should be consistent and reliable. Symbols that mark particular words should always be spelled in a consistent manner. Symbols that mark particular conversational patterns should refer to actual patterns consistently observable in the data. In practice, codes will always have to steer between the Scylla of overregularization and the Charybdis of underregularization discussed earlier. Distinctions must avoid being either too fine or too coarse. Another way of looking at clarity is through the notion of systematicity. Systematicity is a simple extension of clarity across transcripts or corpora. Codes, words, and symbols must be used in a consistent manner across transcripts. Ideally, each code should always have a unique meaning independent of the presence of other codes or the particular transcript in which it is located. If interactions are necessary, as in hierarchical coding systems, these interactions need to be systematically described.

2.     Readability: Just as human language needs to be easy to process, so transcripts need to be easy to read. This goal often runs directly counter to the first goal. In the CHILDES system, we have attempted to provide a variety of CHAT options that will allow a user to maximize the readability of a transcript. We have also provided CLAN tools that will allow a reader to suppress the less readable aspects of a transcript when the goal of readability is more important than the goal of clarity of marking.

3.     Ease of data entry: As distinctions proliferate within a transcription system, data entry becomes increasingly difficult and error-prone. There are two ways of dealing with this problem. One method attempts to simplify the coding scheme and its categories. The problem with this approach is that it sacrifices clarity. The second method attempts to help the transcriber by providing computational aids. The CLAN programs follow this path. They provide systems for the automatic checking of transcription accuracy, methods for the automatic analysis of morphology and syntax, and tools for the semiautomatic entry of codes. However, the basic process of transcription has not been automated and remains the major task during data entry.

5           CHAT Outline

CHAT provides both basic and advanced formats for transcription and coding. The basic level of CHAT is called minCHAT. New users should start by learning minCHAT. This system looks much like other intuitive transcription systems that are in general use in the fields of child language and discourse analysis. However, eventually users will find that there is something they want to be able to code that goes beyond minCHAT. At that point, they should move on to learning midCHAT.

5.1      minCHAT – the Form of Files

There are several minimum standards for the form of a minCHAT file. These standards must be followed for the CLAN commands to run successfully on CHAT files:

1.     Every line must end with a carriage return.

2.     The first line in the file must be an @Begin header line.

3.     The second line in the file must be an @Languages header line.  The languages entered here use a three-letter ISO 639-3 code, such as “eng” for English.

4.     The third line must be an @Participants header line listing three-letter codes for each participant, the participant's name, and the participant's role.

5.     After the @Participants header come a set of @ID headers providing further details for each speaker.  These will be inserted automatically for you when you run CHECK using escape-L.

6.     The last line in the file must be an @End header line.

7.     Lines beginning with * indicate what was actually said. These are called “main lines.” Each main line should code one and only one utterance. When a speaker produces several utterances in a row, code each with a new main line.

8.     After the asterisk on the main line comes a three-letter code in upper case letters for the participant who was the speaker of the utterance being coded. After the three-letter code comes a colon and then a tab.

9.     What was actually said is entered starting in the ninth column.

10.  Lines beginning with the % symbol can contain codes and commentary regarding what was said. They are called “dependent tier” lines.  The % symbol is followed by a three-letter code in lowercase letters for the dependent tier type, such as “pho” for phonology; a colon; and then a tab. The text of the dependent tier begins after the tab.

11.  Continuations of main lines and dependent tier lines begin with a tab which is inserted automatically by the CLAN editor.
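The form rules above can be approximated in a short script. The following Python sketch is only an illustration, not a substitute for the CHECK command in CLAN. The function name check_minchat_form is hypothetical, and the sketch assumes that speaker codes are exactly three uppercase letters, that dependent tier codes are exactly three lowercase letters, and that a line break of any kind counts as the required carriage return.

```python
import re

# Patterns follow the minCHAT form rules listed above; this is not CLAN's CHECK.
MAIN_LINE = re.compile(r"^\*[A-Z]{3}:\t")       # e.g. "*CHI:<tab>..."
DEPENDENT_LINE = re.compile(r"^%[a-z]{3}:\t")   # e.g. "%com:<tab>..."


def check_minchat_form(text):
    """Return a list of problems found in the form of a minCHAT file."""
    problems = []
    if not text.endswith("\n"):
        problems.append("last line does not end with a line break")
    lines = text.splitlines()
    # Required header order: @Begin, @Languages, @Participants.
    expected = ["@Begin", "@Languages:", "@Participants:"]
    for number, prefix in enumerate(expected, start=1):
        if len(lines) < number or not lines[number - 1].startswith(prefix):
            problems.append(f"line {number} should start with {prefix}")
    if not lines or lines[-1] != "@End":
        problems.append("last line should be @End")
    for number, line in enumerate(lines, start=1):
        if line.startswith("@") or line.startswith("\t"):
            continue  # header line or continuation line
        if line.startswith("*"):
            if not MAIN_LINE.match(line):
                problems.append(f"line {number}: malformed main line")
        elif line.startswith("%"):
            if not DEPENDENT_LINE.match(line):
                problems.append(f"line {number}: malformed dependent tier")
        else:
            problems.append(f"line {number}: unrecognized line start")
    return problems


sample = (
    "@Begin\n"
    "@Languages:\teng\n"
    "@Participants:\tCHI Ross Child, FAT Brian Father\n"
    "*CHI:\twhy isn't Mommy coming?\n"
    "%com:\tMother usually picks Ross up around 4 PM.\n"
    "*FAT:\tdon't worry.\n"
    "@End\n"
)
print(check_minchat_form(sample))  # []
```

The sketch does not check the @ID headers, because these are normally inserted by CHECK itself.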

5.2      minCHAT – Words and Utterances

In addition to these minimum requirements for the form of the file, there are certain minimum ways in which utterances and words should be written on the main line:

1.     Utterances must end with an utterance terminator. The basic utterance terminators are the period, the exclamation mark, and the question mark. These can be preceded by a space, but the space is not required.

2.     Commas can be used as needed to mark phrasal junctions, but they are not used by the programs and have no sharp prosodic definition.

3.     Use uppercase letters only for proper nouns and the word “I.” Do not use uppercase letters for the first words of sentences. This will facilitate the identification of proper nouns.

4.     To facilitate recognition of proper nouns and avoid misspellings, words should not contain capital letters except at their beginning. Words should not contain numbers, unless these mark tones.

5.     Unintelligible words with an unclear phonetic shape should be transcribed as xxx.

6.     If you wish to note the phonological form of an incomplete or unintelligible phonological string, write it out with an ampersand, as in &guga.

7.     Incomplete words can be written with the omitted material in parentheses, as in (be)cause and (a)bout.

 

Here is a sample that illustrates these principles. This file is syntactically correct and uses the minimum number of CHAT conventions while still maintaining compatibility with the CLAN commands.

 

@Begin
@Languages:	eng
@Participants:	CHI Ross Target_Child, FAT Brian Father
@ID:	eng|macwhinney|CHI|2;10.10||||Target_Child|||
@ID:	eng|macwhinney|FAT|35;2.||||Father|||
*CHI:	why isn't Mommy coming?
%com:	Mother usually picks Ross up around 4 PM.
*FAT:	don't worry.
*FAT:	she'll be here soon.
*CHI:	good.
@End
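Some of the word and utterance rules in this section can also be checked mechanically. The Python sketch below is a rough illustration under stated assumptions, not part of CLAN. The function name check_main_line is hypothetical. It tests only two of the rules, the presence of an utterance terminator and the absence of capital letters inside a word, and it skips xxx and forms that begin with an ampersand. It cannot decide whether a capitalized word really is a proper noun.

```python
import re

# The three basic utterance terminators named in this section.
TERMINATORS = (".", "!", "?")


def check_main_line(line):
    """Return problems with one main line, judged by two minCHAT word rules."""
    problems = []
    # Everything after the colon and tab is the utterance itself.
    _speaker, _separator, utterance = line.partition(":\t")
    utterance = utterance.strip()
    if not utterance.endswith(TERMINATORS):
        problems.append("missing utterance terminator")
    # Drop the terminator and commas before looking at individual words.
    words = utterance.rstrip(".!?").replace(",", " ").split()
    for word in words:
        if word == "xxx" or word.startswith("&"):
            continue  # unintelligible word or phonological fragment
        # Parentheses mark omitted material, as in (be)cause.
        bare = word.replace("(", "").replace(")", "")
        if re.search(r"[A-Z]", bare[1:]):
            problems.append(f"capital letter inside word: {word}")
    return problems


print(check_main_line("*CHI:\twhy isn't Mommy coming?"))  # []
print(check_main_line("*FAT:\tdon't worry"))  # ['missing utterance terminator']
```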

5.3      Analyzing One Small File

For researchers who are just beginning to use CHAT and CLAN, one suggestion can save hundreds of hours of wasted time: transcribe and analyze one small file completely and correctly before launching a major effort in transcription and analysis. The idea is that you should learn just enough about minCHAT and minCLAN to see your path through these four crucial steps:

1.     entry of a small set of your data into a CHAT file,

2.     successful running of the CHECK command inside the editor to guarantee accuracy in your CHAT file,

3.     development of a series of codes that will interface with the particular CLAN commands most appropriate for your analysis, and

4.     running of the relevant CLAN commands, so that you can be sure that the results you will get will properly test the hypotheses you wish to develop.

If you go through these steps first, you can guarantee in advance the successful outcome of your project. You can avoid ending up in a situation in which you have transcribed hundreds of hours of data in a way that does not match the input requirements for CLAN.

5.4      Next Steps

After having learned minCHAT, you are ready to learn the basics of CLAN. To do this, you will want to work through the first chapters of the CLAN manual, focusing in particular on the CLAN tutorial. These chapters will take you up to the level of minCLAN, which corresponds to the minCHAT level.

 

Once you have learned minCHAT and minCLAN, you are ready to move on to learning the rest of the system. You should next work through the chapters on words, utterances, and scoped symbols. Depending on the shape of your particular project, you may then need to study additional chapters in this manual.  For people working on large projects that last many months, it is a good idea to eventually read all of the current manual, although some sections that seem less relevant to the project can be skimmed.

5.5      File Naming

The CHILDES database consists of a collection of corpora, organized into larger folders by languages and language groups. For example, there is a top-level folder called Romance in which one finds subfolders for Spanish, French, and other Romance languages. Within the Spanish folder, there are then dozens of further folders, each of which contains a single corpus. Within a corpus, files may be further grouped by individual children or groups of children. For longitudinal corpora, we recommend that file names use the age of the child followed by a letter if there are several recordings from a given day. For example, the transcript from the fourth taping session when the child was 2;3.22 would be called 20322d.cha. It is better to use ages for file names, rather than dates or other material.
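The naming recommendation can be expressed as a small helper. This Python sketch is an illustration only. The function name age_file_name is hypothetical, and the two-digit zero padding of months and days is inferred from the single example 20322d.cha rather than stated as a rule.

```python
import string


def age_file_name(years, months, days, session=None):
    """Build a file name such as 20322d.cha from the child's age.

    session is the 1-based number of the recording made on that day,
    or None when there was only one recording.
    """
    if not (0 <= months <= 11 and 0 <= days <= 31):
        raise ValueError("months must be 0-11 and days 0-31")
    suffix = ""
    if session is not None:
        if not 1 <= session <= 26:
            raise ValueError("session must be between 1 and 26")
        # Session 1 maps to 'a', session 2 to 'b', and so on.
        suffix = string.ascii_lowercase[session - 1]
    # ASSUMPTION: months and days are zero-padded to two digits.
    return f"{years}{months:02d}{days:02d}{suffix}.cha"


print(age_file_name(2, 3, 22, session=4))  # 20322d.cha
```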

5.6      Metadata

Increasingly, researchers rely on Internet systems to locate and retrieve language data and resources. There are currently several systems designed to facilitate this process, and we have adapted the indexing and registration of materials in the CHILDES and TalkBank systems to provide information that can be incorporated into these systems. The two systems designed specifically to deal with linguistic data are OLAC (Open Language Archives Community at www.language-archives.org) and VLO (Virtual Language Observatory at vlo.clarin.eu). These systems allow researchers to search for whole corpora or single files, using terms such as Cantonese, video, gesture, or aphasia. In order to publish or register TalkBank data within these systems, we create a 0metadata.cdc file at the top level of each corpus in TalkBank. Some of the fields in this metadata file are designed for indexing in OLAC and some are designed for the CMDI system used by VLO and the related facility called The Language Archive (tla.mpi.nl). Because of the highly specific nature of the terms and the software used for regular harvesting and publication of these data, we do not require users to create the 0metadata.cdc files. The following table explains what keywords are expected within each field of these files. The first fields listed are for OLAC and the later ones are for CMDI. For CMDI, the values unknown and unspecified are also available for most of the fields.

 

Field                           Example                     Values
CMDI_PID:                       11312/c-00041631-1          Set by Handle Server system
Title:                          Bilingual AarsenBos Corpus  open
Creator:                        Aarssen, Jeroen             open
Creator:                        Bos, Petra                  open
Subject:                        child language development
Subject.olac:linguistic-field:  language_acquisition
Subject.olac:language:          nld                         ISO-639
Subject.olac:language:          tur                         ISO-639
Subject.olac:language:          ara                         ISO-639
Subject.childes:participant:    age="4 - 10"                open
Description:                                                open
Publisher:                      TalkBank                    open
Contributor:                    Aarssen, Jeroen             open
Date:                           2004-03-30                  YYYY-MM-DD
Type:                           Text                        text, video,
Type.olac:linguistic-type:      primary_text                lexicon, primary_text, language_description
Type.olac:discourse-type:       dialogue                    dialogue, drama, formulaic, ludic, oratory, narrative, procedural, report, singing, unintelligible speech
Format:
Identifier:                     1-59642-132-0               ISBN
Language:                                                   ISO-639
Relation:                                                   open
Coverage:                                                   open
Rights:                                                     open
IMDI_Genre:                     discourse
IMDI_Interactivity:             interactive                 interactive, non-interactive, semi-interactive
IMDI_PlanningType:              spontaneous                 spontaneous, semi-spontaneous, planned
IMDI_Involvement:               non-elicited                elicited, non-elicited, no-observer
IMDI_SocialContext:             family                      family, private, public, controlled environment, talkshow, shopping, face_to_face, lecture, legal, religious, sports, tutorial, classroom, medical work, meeting, clinic, telechat, phonecall, computer, constructed
IMDI_EventStructure:            conversation                monologue, dialogue, conversation, not a natural format
IMDI_Task:                      unspecified                 open
IMDI_Modalities:                speech                      open
IMDI_Subject:                   unspecified                 open
IMDI_EthnicGroup:               unspecified                 open
IMDI_RecordingConditions:       unspecified                 open
IMDI_AccessAvailability:        open access                 open
IMDI_Continent:                 Europe                      Dublin Core
IMDI_Country:                   Netherlands                 Dublin Core
IMDI_ProjectDescription:                                    open
IMDI_MediaFileDescription:                                  open
IMDI_WrittenResourceSubType:                                open
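For fields with a closed set of values, the table can be turned into a simple lookup. The Python sketch below is an illustration under stated assumptions and is not the program used by TalkBank. It assumes that each line of a 0metadata.cdc file holds a field name ending in a colon, then a tab, then the value, which is a guess modeled on CHAT header lines. The names parse_metadata and check_metadata are hypothetical, and only four of the closed fields are included.

```python
# Closed vocabularies copied from the table above; fields not listed here
# are treated as open.
CLOSED_VALUES = {
    "IMDI_Interactivity": {"interactive", "non-interactive", "semi-interactive"},
    "IMDI_PlanningType": {"spontaneous", "semi-spontaneous", "planned"},
    "IMDI_Involvement": {"elicited", "non-elicited", "no-observer"},
    "IMDI_EventStructure": {"monologue", "dialogue", "conversation",
                            "not a natural format"},
}
# The text states that these two values are available for most CMDI fields.
CMDI_FALLBACKS = {"unknown", "unspecified"}


def parse_metadata(text):
    """Split lines into (field, value) pairs.

    ASSUMPTION: field and value are separated by a colon plus a tab.
    A list is returned because fields such as Creator can repeat.
    """
    pairs = []
    for line in text.splitlines():
        field, separator, value = line.partition(":\t")
        if separator:
            pairs.append((field, value.strip()))
    return pairs


def check_metadata(pairs):
    """Report values that fall outside a closed vocabulary."""
    problems = []
    for field, value in pairs:
        allowed = CLOSED_VALUES.get(field)
        if allowed is None or value in CMDI_FALLBACKS:
            continue  # open field, or an accepted fallback value
        if value not in allowed:
            problems.append(f"{field}: unexpected value '{value}'")
    return problems


example = (
    "Title:\tBilingual AarsenBos Corpus\n"
    "IMDI_Interactivity:\tinteractive\n"
    "IMDI_PlanningType:\timprovised\n"
)
print(check_metadata(parse_metadata(example)))
```

In this example only the IMDI_PlanningType line is reported, because improvised is not among the values listed for that field.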

 

For the CMDI/VLO/CLARIN system, there must be a cmdi.xml file for each transcript. To create these several thousand files, we use a CLAN program that takes the information from the 0metadata.cdc files and from the header lines in each transcript. The information in the @ID field is particularly important in this process. The program also relies on the fact that we use an isomorphic file system for indexing media files. Fortunately, users do not need to concern themselves with these additional technical details.
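For readers who nevertheless want to inspect the @ID information that feeds this process, the header can be split into named fields. The Python sketch below is an illustration only, and the function name parse_id_header is hypothetical. The field order used here (language, corpus, code, age, sex, group, SES, role, education, custom) is the one given in the description of the @ID header elsewhere in the CHAT manual.

```python
# Field order of the @ID header, as described elsewhere in the CHAT manual.
ID_FIELDS = ("language", "corpus", "code", "age", "sex",
             "group", "ses", "role", "education", "custom")


def parse_id_header(line):
    """Split one @ID header line into a dictionary of named fields."""
    prefix = "@ID:"
    if not line.startswith(prefix):
        raise ValueError("not an @ID header")
    values = line[len(prefix):].strip().split("|")
    # Each field ends with '|', so splitting leaves one empty trailing item.
    if len(values) != len(ID_FIELDS) + 1 or values[-1] != "":
        raise ValueError("expected ten fields, each ending with '|'")
    return dict(zip(ID_FIELDS, values))


fields = parse_id_header("@ID:\teng|macwhinney|CHI|2;10.10||||Target_Child|||")
print(fields["code"], fields["age"], fields["role"])  # CHI 2;10.10 Target_Child
```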

5.7      The Documentation File

CHAT files typically record a conversational sample collected from a particular set of speakers on a particular day. Sometimes researchers study a small set of children repeatedly over a long period of time. Corpora created using this method are referred to as longitudinal studies. For such studies, it is best to group CHAT files into one collection for each child. This can be done just by creating file names that begin with the three-letter code for the child, as in lea001.cha or eve15.cha. Each collection of files from the children involved in a given study constitutes a corpus. A corpus can also be composed of a group of files from different groups of speakers when the focus is on a cross-sectional sampling of larger numbers of language learners from various age groups. In either case, each corpus should have a documentation file. This “readme” file should contain a basic set of facts that are indispensable for the proper interpretation of the data by other researchers. The minimum set of facts that should be in each readme file is the following.

1.     Acknowledgments. There should be a statement that asks the user to cite some particular reference when using the corpus. For example, researchers using the Adam, Eve, and Sarah corpora from Roger Brown and his colleagues are asked to cite Brown (1973). In addition, all users can cite this current manual as the source for the CHILDES system in general.

2.     Restrictions. If the data are being contributed to the CHILDES system, contributors can set particular restrictions on the use of their data. For example, researchers may ask that they be sent copies of articles that make use of their data. Many researchers have chosen to set no limitations at all on the use of their data.