The CHILDES Project

 

Tools for Analyzing Talk – Electronic Edition

 

 

 

 

Part 1:  The CHAT Transcription Format

 

 

Brian MacWhinney

Carnegie Mellon University

 

April 2, 2014

 

 

 

 

 

 

Citation for last printed version:

 

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

 

1       Table of Contents

 

1    Table of Contents
2    Introduction
2.1    Impressionistic Observation
2.2    Baby Biographies
2.3    Transcripts
2.4    Computers
2.5    Connectivity
2.6    Three Tools
2.7    Shaping CHAT
2.8    Building CLAN
2.9    Constructing the Database
2.10    Disseminating CHILDES
2.11    Funding
2.12    How to Use These Manuals
2.13    Changes
3    Principles
3.1    Computerization
3.2    Words of Caution
3.2.1    The Dominance of the Written Word
3.2.2    The Misuse of Standard Punctuation
3.2.3    Working With Video
3.3    Problems With Forced Decisions
3.4    Transcription and Coding
3.5    Three Goals
4    CHAT Outline
4.1    minCHAT – the Form of Files
4.2    minCHAT – Words and Utterances
4.3    Analyzing One Small File
4.4    midCHAT
4.5    The Documentation File
4.6    Checking Syntactic Accuracy
5    File Headers
5.1    Hidden Headers
5.2    Initial Headers
5.3    Participant-Specific Headers
5.4    Constant Headers
5.5    Changeable Headers
6    Words
6.1    The Main Line
6.2    Basic Words
6.3    Special Form Markers
6.4    Unidentifiable Material
6.5    Incomplete and Omitted Words
6.6    Standardized Spellings
6.6.1    Letters
6.6.2    Compounds and Linkages
6.6.3    Capitalization
6.6.4    Acronyms
6.6.5    Numbers and Titles
6.6.6    Kinship Forms
6.6.7    Shortenings
6.6.8    Assimilations
6.6.9    Exclamations
6.6.10    Communicators
6.6.11    Spelling Variants
6.6.12    Colloquial Forms
6.6.13    Dialectal Variations
6.6.14    Baby Talk
6.6.15    Word separation in Japanese
6.6.16    Punctuation in French and Italian
6.6.17    Abbreviations in Dutch
7    Utterances
7.1    One Utterance or Many?
7.2    Discourse Repetition
7.3    Basic Utterance Terminators
7.4    Satellite Markers
7.5    Separators
7.6    Tone Direction
7.7    Prosody Within Words
7.8    Local Events
7.8.1    Simple Events
7.8.2    Complex Local Events
7.8.3    Pauses
7.8.4    Long Events
7.8.5    Interposed Back Channel
7.9    Special Utterance Terminators
7.10    Utterance Linkers
8    Scoped Symbols
8.1    Audio and Video Time Marks
8.2    Paralinguistic Scoping and Events
8.3    Explanations and Alternatives
8.4    Retracing, Overlap, and Clauses
8.5    Error Marking
8.6    Initial and Final Codes
9    Dependent Tiers
9.1    Standard Dependent Tiers
9.2    Synchrony Relations
10    CHAT-CA Transcription
11    Disfluency Transcription
12    Arabic Transcription
13    Specific Applications
13.1    Code-Switching
13.2    Elicited Narratives and Picture Descriptions
13.3    Written Language
14    Speech Act Codes
14.1    Interchange Types
14.2    Illocutionary Force Codes
15    Error Coding
15.1    Word level error codes summary
15.2    Word level coding – details
15.3    Utterance level error coding (post-codes)
16    Morphosyntactic Coding
16.1    One-to-one correspondence
16.2    Tag Groups and Word Groups
16.3    Words
16.4    Part of Speech Codes
16.5    Stems
16.6    Affixes
16.7    Clitics
16.8    Compounds
16.9    Punctuation Marks
16.10    Sample Morphological Tagging for English
References

2       Introduction

Language acquisition research thrives on data collected from spontaneous interactions in naturally occurring situations. You can turn on a tape recorder or videotape, and, before you know it, you will have accumulated a library of dozens or even hundreds of hours of naturalistic interactions. But simply collecting data is only the beginning of a much larger task, because the process of transcribing and analyzing naturalistic samples is extremely time-consuming and often unreliable. In this first volume, we will present a set of computational tools designed to increase the reliability of transcriptions, automate the process of data analysis, and facilitate the sharing of transcript data. These new computational tools have brought about revolutionary changes in the way that research is conducted in the child language field. In addition, they have equally revolutionary potential for the study of second-language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia. Although the tools are of wide applicability, this volume concentrates on their use in the child language field, in the hope that researchers from other areas can make the necessary analogies to their own topics.

 

Before turning to a detailed examination of the current system, it may be helpful to take a brief historical tour over some of the major highlights of earlier approaches to the collection of data on language acquisition. These earlier approaches can be grouped into five major historical periods.

2.1      Impressionistic Observation

The first attempt to understand the process of language development appears in a remarkable passage from The Confessions of St. Augustine (1952). In this passage, Augustine claims that he remembered how he had learned language:

This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named anything, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing, and no other, was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and, having broken in my mouth to these signs, I thereby gave utterance to my will. Thus I exchanged with those about me these current signs of our wills, and so launched deeper into the stormy intercourse of human life, yet depending on parental authority and the beck of elders.

Augustine's outline of early word learning drew attention to the role of gaze, pointing, intonation, and mutual understanding as fundamental cues to language learning.  Modern research in word learning (P. Bloom, 2000) has supported every point of Augustine's analysis, as well as his emphasis on the role of children's intentions.  In this sense, Augustine's somewhat fanciful recollection of his own language acquisition remained the high water mark for child language studies through the Middle Ages and even the Enlightenment. Unfortunately, the method on which these insights were grounded depends on our ability to actually recall the events of early childhood – a gift granted to very few of us.

2.2      Baby Biographies

Charles Darwin provided much of the inspiration for the development of the second major technique for the study of language acquisition. Using note cards and field books to track the distribution of hundreds of species and subspecies in places like the Galapagos and Indonesia, Darwin was able to collect an impressive body of naturalistic data in support of his views on natural selection and evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for naturalistic observation could be adapted to the study of human development. By taking detailed daily notes, Darwin showed how researchers could build diaries that could then be converted into biographies documenting virtually any aspect of human development. Following Darwin's lead, scholars such as Ament (1899), Preyer (1882), Gvozdev (1949), Szuman (1955), Stern & Stern (1907), Kenyeres (1926, 1938), and Leopold (1939, 1947, 1949a, 1949b) created monumental biographies detailing the language development of their own children.

 

Darwin's biographical technique also had its effects on the study of adult aphasia. Following in this tradition, studies of the language of particular patients and syndromes were presented by Low (1931), Pick (1913), Wernicke (1874), and many others.

2.3      Transcripts

The limits of the diary technique were always quite apparent. Even the most highly trained observer could not keep pace with the rapid flow of normal speech production. Anyone who has attempted to follow a child about with a pen and a notebook soon realizes how much detail is missed and how the note-taking process interferes with the ongoing interactions.

 

The introduction of the tape recorder in the late 1950s provided a way around these limitations and ushered in the third period of observational studies. The effect of the tape recorder on the field of language acquisition was very much like its effect on ethnomusicology, where researchers such as Alan Lomax (Parrish, 1996) were suddenly able to produce high quality field recordings using this new technology. This period was characterized by projects in which groups of investigators collected large data sets of tape recordings from several subjects across a period of 2 or 3 years. Much of the excitement in the 1960s regarding new directions in child language research was fueled directly by the great increase in raw data that was possible through use of tape recordings and typed transcripts.

 

This increase in the amount of raw data had an additional, seldom discussed, consequence. In the period of the baby biography, the final published accounts closely resembled the original database of note cards. In this sense, there was no major gap between the observational database and the published database. In the period of typed transcripts, a wider gap emerged. The size of the transcripts produced in the 60s and 70s made it impossible to publish the full corpora. Instead, researchers were forced to publish only high-level analyses based on data that were not available to others. This led to a situation in which the raw empirical database for the field was kept only in private stocks, unavailable for general public examination. Comments and tallies were written into the margins of ditto master copies, and new, even less legible, copies were then made by thermal production of new ditto masters. Each investigator devised a project-specific system of transcription and project-specific codes. As we began to compare hand-written and typewritten transcripts, problems in transcription methodology, coding schemes, and cross-investigator reliability became more apparent.

 

Recognizing this problem, Roger Brown took the lead in attempting to share his transcripts from Adam, Eve, and Sarah (Brown, 1973) with other researchers. These transcripts were typed onto stencils and mimeographed in multiple copies. The extra copies were lent to and analyzed by a wide variety of researchers. In this model, researchers took their copy of the transcript home, developed their own coding scheme, applied it (usually by making pencil markings directly on the transcript), wrote a paper about the results and, if very polite, sent a copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the conclusions drawn from those data by Brown himself!

 

During this early period, the relations between the various coding schemes often remained shrouded in mystery. A fortunate consequence of the unstable nature of coding systems was that researchers were very careful not to throw away their original data, even after it had been coded. Brown himself commented on the impending transition to computers in this passage (Brown, 1973, p. 53):

It is sensible to ask and we were often asked, “Why not code the sentences for grammatically significant features and put them on a computer so that studies could readily be made by anyone?”  My answer always was that I was continually discovering new kinds of information that could be mined from a transcription of conversation and never felt that I knew what the full coding should be.  This was certainly the case and indeed it can be said that in the entire decade since 1962 investigators have continued to hit upon new ways of inferring grammatical and semantic knowledge or competence from free conversation. But, for myself, I must, in candor, add that there was also a factor of research style.  I have little patience with prolonged “tooling up” for research.  I always want to get started. A better scientist would probably have done more planning and used the computer.  He can do so today, in any case, with considerable confidence that he knows what to code.

With the experience of three more decades of computerized analysis behind us, we now know that the idea of reducing child language data to a set of codes and then throwing away the original data is simply wrong.  Instead, our goal must be to computerize the data in a way that allows us to continually enhance it with new codes and annotations.  It is fortunate that Brown preserved his transcript data in a form that allowed us to continue to work on it.  It is unfortunate, however, that the original audiotapes were not kept.

2.4      Computers

Just as these data analysis problems were coming to light, a major technological opportunity was emerging in the shape of the powerful, affordable microcomputer. Microcomputer word-processing systems and database programs allowed researchers to enter transcript data into computer files that could then be easily duplicated, edited, and analyzed by standard data-processing techniques. In 1981, when the Child Language Data Exchange System (CHILDES) Project was first conceived, researchers basically thought of computer systems as large notepads. Although researchers were aware of the ways in which databases could be searched and tabulated, the full analytic and comparative power of the computer systems themselves was not yet fully understood.

 

Rather than serving only as an “archive” or historical record, a focus on a shared database can lead to advances in methodology and theory. However, to achieve these additional advances, researchers first needed to move beyond the idea of a simple data repository. At first, the possibility of utilizing shared transcription formats, shared codes, and shared analysis programs shone only as a faint glimmer on the horizon, against the fog and gloom of handwritten tallies, fuzzy dittos, and idiosyncratic coding schemes. Slowly, against this backdrop, the idea of a computerized data exchange system began to emerge. It was against this conceptual background that CHILDES (the name uses a one-syllable pronunciation) was conceived. The origin of the system can be traced back to the summer of 1981 when Dan Slobin, Willem Levelt, Susan Ervin-Tripp, and Brian MacWhinney discussed the possibility of creating an archive for typed, handwritten, and computerized transcripts to be located at the Max-Planck-Institut für Psycholinguistik in Nijmegen. In 1983, the MacArthur Foundation funded meetings of developmental researchers in which Elizabeth Bates, Brian MacWhinney, Catherine Snow, and other child language researchers discussed the possibility of soliciting MacArthur funds to support a data exchange system. In January of 1984, the MacArthur Foundation awarded a two-year grant to Brian MacWhinney and Catherine Snow for the establishment of the Child Language Data Exchange System. These funds provided for the entry of data into the system and for the convening of a meeting of an advisory board. Twenty child language researchers met for three days in Concord, Massachusetts and agreed on a basic framework for the CHILDES system, which Catherine Snow and Brian MacWhinney would then proceed to implement.

2.5      Connectivity

Since 1984, when the CHILDES Project began in earnest, the world of computers has gone through a series of remarkable revolutions, each introducing new opportunities and challenges. The processing power of the home computer now dwarfs the power of the mainframe of the 1980s; new machines are now shipped with built-in audiovisual capabilities; and devices such as CD-ROMs and optical disks offer enormous storage capacity at reasonable prices. This new hardware has now opened up the possibility for multimedia access to digitized audio and video from links inside the written transcripts. In effect, a transcript is now the starting point for a new exploratory reality in which the whole interaction is accessible from the transcript. Although researchers have just now begun to make use of these new tools, the current shape of the CHILDES system reflects many of these new realities. In the pages that follow, you will learn about how we are using this new technology to provide rapid access to the database and to permit the linkage of transcripts to digitized audio and video records, even over the Internet. For further ideas regarding this type of work, you may wish to connect to http://talkbank.org where there are various extensions of the CHILDES project.

2.6      Three Tools

The reasons for developing a computerized exchange system for language data are immediately obvious to anyone who has produced or analyzed transcripts. With such a system, we can:

1.     automate the process of data analysis,

2.     obtain better data in a consistent, fully-documented transcription system, and

3.     provide more data for more children from more ages, speaking more languages.

The CHILDES system has addressed each of these goals by developing three separate, but integrated, tools. The first tool is the CHAT transcription and coding format. The second tool is the CLAN analysis program, and the third tool is the database. These three tools are like the legs of a three-legged stool. The transcripts in the database have all been put into the CHAT transcription system. The CLAN program is designed to make full use of the CHAT format to facilitate a wide variety of searches and analyses. Many research groups are now using the CHILDES programs to enter new data sets. Eventually, these new data sets will be available to other researchers as a part of the growing CHILDES database. In this way, CHAT, CLAN, and the database function as a coarticulated set of complementary tools.

 

There are manuals for each of the three CHILDES tools. The CHAT manual, which you are now reading, describes the conventions and principles of CHAT transcription. The CLAN manual describes the use of the CLAN computer programs that you can use to transcribe, annotate, and analyze language interactions. The third manual, which is actually a collection of over a dozen separate manuals retrievable from a single link on the web, describes the data files in the CHILDES database. Each of these database manuals describes the data sets in one major component of the database. In addition, there is a short manual that provides an overview for the entire database.

2.7      Shaping CHAT

We received a great deal of extremely helpful input during the years between 1984 and 1988 when the CHAT system was being formulated. Some of the most detailed comments came from George Allen, Elizabeth Bates, Nan Bernstein Ratner, Giuseppe Cappelli, Annick De Houwer, Jane Desimone, Jane Edwards, Julia Evans, Judi Fenson, Paul Fletcher, Steven Gillis, Kristen Keefe, Mary MacWhinney, Jon Miller, Barbara Pan, Lucia Pfanner, Kim Plunkett, Kelley Sacco, Catherine Snow, Jeff Sokolov, Leonid Spektor, Joseph Stemberger, Frank Wijnen, and Antonio Zampolli. Comments developed in Edwards (1992) were useful in shaping core aspects of CHAT. George Allen (1988) helped develop the UNIBET and PHONASCII systems. The workers in the LIPPS Group (LIPPS, 2000) have developed extensions of CHAT to cover code-switching phenomena. Adaptations of CHAT to deal with data on disfluencies are developed in Bernstein-Ratner, Rooney, and MacWhinney (1996). The exercises in Chapter 7 of Part II are based on materials originally developed by Barbara Pan for Chapter 2 of Sokolov & Snow (1994).

In the period between 2001 and 2004, we converted much of the CHILDES system to work with the new XML Internet data format.  This work was begun by Romeo Anghelache and completed by Franklin Chen. Support for this major reformatting and the related tightening of the CHAT format came from the NSF TalkBank Infrastructure project which involved a major collaboration with Steven Bird and Mark Liberman of the Linguistic Data Consortium. Ongoing work in TalkBank is documented on the web at http://talkbank.org. 

2.8      Building CLAN

The CLAN program is the brainchild of Leonid Spektor. Ideas for particular analysis commands came from several sources. Bill Tuthill's HUM package provided ideas about concordance analyses. The SALT system of Miller & Chapman (1983) provided guidelines regarding basic practices in transcription and analysis. Clifton Pye's PAL program provided ideas for the MODREP and PHONFREQ commands.

 

Darius Clynes ported CLAN to the Macintosh. Jeffrey Sokolov wrote the CHIP program. Mitzi Morris designed the MOR analyzer using specifications provided by Roland Hauser of Erlangen University. Norio Naka and Susanne Miyata developed a MOR rule system for Japanese; and Monica Sanz-Torrent helped develop the MOR system for Spanish. Julia Evans provided recommendations for the design of the audio and visual capabilities of the editor. Johannes Wagner, Mike Forrester, and Chris Ramsden helped show us how we could modify CLAN to permit transcription in the Conversation Analysis framework. Steven Gillis provided suggestions for aspects of MODREP. Christophe Parisse built the POST and POSTTRAIN programs (Parisse & Le Normand, 2000). Brian Richards contributed the VOCD program (Malvern, Richards, Chipere, & Durán, 2004). Julia Evans helped specify TIMEDUR and worked on the details of DSS. Catherine Snow designed CHAINS, KEYMAP, and STATFREQ. Nan Bernstein Ratner specified aspects of PHONFREQ and plans for additional programs for phonological analysis.

2.9      Constructing the Database

The primary reason for the success of the CHILDES database has been the generosity of over 100 researchers who have contributed their corpora. Each of these corpora represents hundreds, often thousands, of hours spent in careful collection, transcription, and checking of data. All researchers in child language should be proud of the way researchers have generously shared their valuable data with the whole research community. The growing size of the database for language impairments, adult aphasia, and second-language acquisition indicates that these related areas have also begun to understand the value of data sharing.

 

Many of the corpora contributed to the system were transcribed before the formulation of CHAT. In order to create a uniform database, we had to reformat these corpora into CHAT. Jane Desimone, Mary MacWhinney, Jane Morrison, Kim Roth, Kelley Sacco, and Gergely Sikuta worked many long hours on this task. Steven Gillis, Helmut Feldweg, Susan Powers, and Heike Behrens supervised a parallel effort with the German and Dutch data sets.

 

Because of the continually changing shape of the programs and the database, keeping this manual up to date has been an ongoing activity. In this process, I received help from Mike Blackwell, Julia Evans, Kris Loh, Mary MacWhinney, Lucy Hewson, Kelley Sacco, and Gergely Sikuta. Barbara Pan, Jeff Sokolov, and Pam Rollins also provided a reading of the final draft of the 1995 version of the manual.

2.10  Disseminating CHILDES

Since the beginning of the project, Catherine Snow has continually played a pivotal role in shaping policy, building the database, organizing workshops, and determining the shape of CHAT and CLAN. Catherine Snow collaborated with Jeffrey Sokolov, Pam Rollins, and Barbara Pan to construct a series of tutorial exercises and demonstration analyses that appeared in Sokolov & Snow (1994). Those exercises form the basis for similar tutorial sections in the current manual. Catherine Snow has contributed six major corpora to the database and has conducted CHILDES workshops in a dozen countries.

 

Several other colleagues have helped disseminate the CHILDES system through workshops, visits, and Internet facilities. Hidetosi Sirai established a CHILDES file server mirror at Chukyo University in Japan and Steven Gillis established a mirror at the University of Antwerp. Steven Gillis, Kim Plunkett, Johannes Wagner, and Sven Strömqvist helped propagate the CHILDES system at universities in Northern and Central Europe. Susanne Miyata has brought together a vital group of child language researchers using CHILDES to study the acquisition of Japanese and has supervised the translation of the current manual into Japanese. In Italy, Elena Pizzuto organized symposia for developing the CHILDES system and has supervised the translation of the manual into Italian. Magdalena Smoczynska in Krakow and Wolfgang Dressler in Vienna have helped new researchers who are learning to use CHILDES for languages spoken in Eastern Europe. Miquel Serra has supported a series of CHILDES workshops in Barcelona. Zhou Jing organized a workshop in Nanjing and Chien-ju Chang organized a workshop in Taipei.

2.11  Funding

From 1984 to 1988, the John D. and Catherine T. MacArthur Foundation supported the CHILDES Project. In 1988, the National Science Foundation provided an equipment grant that allowed us to put the database on the Internet and on CD-ROMs. From 1989 to 2010, the project was supported by an ongoing grant from the National Institutes of Health (NICHD). In 1998, the National Science Foundation Linguistics Program provided additional support to improve the programs for morphosyntactic analysis of the database. In 1999, NSF funded the TalkBank project, which seeks to improve the CHILDES tools and to use CHILDES as a model for other disciplines studying human communication. In 2002, NSF provided support for the development of the GRASP system for parsing of the corpora. In 2002, NIH provided additional support for the development of PhonBank for child language phonology and AphasiaBank for the study of communication in aphasia.

2.12  How to Use These Manuals

Each of the three parts of the CHILDES system is described in a separate manual.  The CHAT manual describes the conventions and principles of CHAT transcription. The CLAN manual describes the use of the editor and the analytic commands. The database manual is a set of over a dozen smaller documents, each describing a separate segment of the database.

 

To learn the CHILDES system, you should begin by downloading and installing the CLAN program. Next, you should download and start to read the current manual (CHAT Manual) and the CLAN manual. Before proceeding too far into the CHAT manual, you will want to walk through the tutorial section at its beginning. After finishing the tutorial, try working a bit with each of the CLAN commands to get a feel for the overall scope of the system. You can then learn more about CHAT by transcribing a small sample of your data in a short test file. Run the CHECK program at frequent intervals to verify the accuracy of your coding. Once you have finished transcribing a small segment of your data, try out the various analysis programs you plan to use, to make sure that they provide the types of results you need for your work.

 

If you are primarily interested in analyzing data already stored in the CHILDES archive, you do not need to learn the CHAT transcription format in much detail and you will only need to use the editor to open and read files. In that case, you may wish to focus your efforts on learning to use the CLAN programs. If you plan to transcribe new data, then you also need to work with the current manual to learn to use CHAT.

 

Teachers will also want to pay particular attention to the sections of the CLAN manual that present a tutorial introduction. Using some of the examples given there, you can construct additional materials to encourage students to explore the database to test out particular hypotheses. At the end of the CLAN manual, there is also a series of exercises that help students further consolidate their knowledge of CHAT and CLAN.

 

The CHILDES system was not intended to address all issues in the study of language learning, or to be used by all students of spontaneous interactions. The CHAT system is comprehensive, but it is not ideal for all purposes. The programs are powerful, but they cannot solve all analytic problems. It is not the goal of CHILDES to provide facilities for all research endeavors or to force all research into some uniform mold. On the contrary, the programs are designed to offer support for alternative analytic frameworks. For example, the editor now supports the various codes of Conversation Analysis (CA) format, as alternatives and supplements to CHAT format.

 

There are many researchers in the fields that study language learning who will never need to use CHILDES. Indeed, we estimate that the three CHILDES tools will never be used by at least half of the researchers in the field of child language. There are three common reasons why individual researchers may not find CHILDES useful:

1.     some researchers may have already committed themselves to use of another analytic system;

2.     some researchers may have collected so much data that they can work for many years without needing to collect more data and without comparing their own data with other researchers' data; and

3.     some researchers may not be interested in studying spontaneous speech data.

Of these three reasons for not needing to use the three CHILDES tools, the third is the most frequent. For example, researchers studying comprehension would only be interested in CHILDES data when they wish to compare findings arising from studies of comprehension with patterns occurring in spontaneous production.

2.13  Changes

The CHILDES tools have been extensively tested for ease of application, accuracy, and reliability. However, change is fundamental to any research enterprise. Researchers are constantly pursuing better ways of coding and analyzing data. It is important that the CHILDES tools keep pace with these changing requirements. For this reason, there will be revisions to CHAT, the programs, and the database as long as the CHILDES Project is active.

3       Principles

The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions. These interactions may involve children and parents, doctors and patients, or teachers and second-language learners. Despite the differences between these interactions, there are enough common features to allow for the creation of a single general transcription system. The system described here is designed for use with both normal and disordered populations. It can be used with learners of all types, including children, second-language learners, and adults recovering from aphasic disorders. The system provides options for basic discourse transcription as well as detailed phonological and morphological analysis. The system bears the acronym “CHAT,” which stands for Codes for the Human Analysis of Transcripts. CHAT is the standard transcription system for the CHILDES (Child Language Data Exchange System) Project. All of the transcripts in the CHILDES database are in CHAT format.

What makes CHAT particularly powerful is the fact that files transcribed in CHAT can also be analyzed by the CLAN programs that are described in the CLAN manual, which is an electronic companion piece to this manual. The CLAN programs can track a wide variety of structures, compute automatic indices, and analyze morphosyntax. Moreover, because all CHAT files can now also be translated to a highly structured form of XML (a language used for text documents on the web), they are now also compatible with a wide range of other powerful computer programs such as ELAN, Praat, EXMARaLDA, Phon, Transcriber, and so on.

The CHILDES system has had a major impact on the study of child language. At the time of the last monitoring in 2003, there were over 2000 published articles that had made use of the programs and database. By 2007, the database had grown to over 44 million words, making it by far the largest database of conversational interactions available anywhere. The total number of researchers who have joined as CHILDES members across the length of the project is now over 4500. Of course, not all of these people are making active use of the tools at all times. However, it is safe to say that, at any given point in time, approximately 100 groups of researchers around the world are involved in new data collection and transcription using the CHAT system. Eventually the data collected in these various projects will all be contributed to the database.

3.1      Computerization

Public inspection of experimental data is a crucial prerequisite for serious scientific progress. Imagine how genetics would function if every experimenter had his or her own individual strain of peas or drosophila and refused to allow them to be tested by other experimenters. What would happen in geology, if every scientist kept his or her own set of rock specimens and refused to compare them with those of other researchers? In some fields the basic phenomena in question are so clearly open to public inspection that this is not a problem. The basic facts of planetary motion are open for all to see, as are the basic facts underlying Newtonian mechanics.

 

Unfortunately, in language studies, a free and open sharing and exchange of data has not always been the norm. In earlier decades, researchers jealously guarded their field notes from a particular language community or subject type, refusing to share them openly with the broader community. Various justifications were given for this practice. It was sometimes claimed that other researchers would not fully appreciate the nature of the data or that they might misrepresent crucial patterns. Sometimes, it was claimed that only someone who had actually participated in the community or the interaction could understand the nature of the language and the interactions. In some cases, these limitations were real and important. However, all such restrictions on the sharing of data inevitably impede the progress of the scientific study of language learning.

 

Within the field of language acquisition studies it is now understood that the advantages of sharing data outweigh the potential dangers. The question is no longer whether data should be shared, but rather how they can be shared in a reliable and responsible fashion. The computerization of transcripts opens up the possibility for many types of data sharing and analysis that otherwise would have been impossible. However, the full exploitation of this opportunity requires the development of a standardized system for data transcription and analysis.

3.2      Words of Caution

Before examining the CHAT system, we need to consider some dangers involved in computerized transcriptions. These dangers arise from the need to compress a complex set of verbal and nonverbal messages into the extremely narrow channel required for the computer. In most cases, these dangers also exist when one creates a typewritten or handwritten transcript. Let us look at some of the dangers surrounding the enterprise of transcription.

3.2.1     The Dominance of the Written Word

Perhaps the greatest danger facing the transcriber is the tendency to treat spoken language as if it were written language. The decision to write out stretches of vocal material using the forms of written language can trigger a variety of theoretical commitments. As Ochs (1979) showed so clearly, these decisions will inevitably turn transcription into a theoretical enterprise. The most difficult bias to overcome is the tendency to map every form spoken by a learner – be it a child, an aphasic, or a second-language learner – onto a set of standard lexical items in the adult language. Transcribers tend to assimilate nonstandard learner strings to standard forms of the adult language. For example, when a child says “put on my jamas,” the transcriber may instead enter “put on my pajamas,” reasoning unconsciously that “jamas” is simply a childish form of “pajamas.” This type of regularization of the child form to the adult lexical norm can lead to misunderstanding of the shape of the child's lexicon. For example, it could be the case that the child uses “jamas” and “pajamas” to refer to two very different things (Clark, 1987; MacWhinney, 1989).

There are two types of errors possible here. One involves mapping a learner's spoken form onto an adult form when, in fact, there was no real correspondence. This is the problem of overnormalization. The second type of error involves failing to map a learner's spoken form onto an adult form when, in fact, there is a correspondence. This is the problem of undernormalization. The goal of transcribers should be to avoid both the Scylla of overnormalization and the Charybdis of undernormalization. Steering a course between these two dangers is no easy matter. A transcription system can provide devices to aid in this process, but it cannot guarantee safe passage.

 

Transcribers also often tend to assimilate the shape of sounds spoken by the learner to the shapes that are dictated by morphosyntactic patterns. For example, Fletcher (1985) noted that both children and adults generally produce “have” as “uv” before main verbs. As a result, forms like “might have gone” assimilate to “mightuv gone.” Fletcher believed that younger children have not yet learned to associate the full auxiliary “have” with the contracted form. If we write the children's forms as “might have,” we then end up mischaracterizing the structure of their lexicon. To take another example, we can note that, in French, the various endings of the verb in the present tense are distinguished in spelling, whereas they are homophonous in speech. If a child says /mãʒ/ “eat,” are we to transcribe it as first person singular mange, as second person singular manges, or as the imperative mange? If the child says /mãʒe/, should we transcribe it as the infinitive manger, the participle mangé, or the second person formal mangez?

 

CHAT deals with these problems in three ways. First, it uses IPA as a uniform way of transcribing discourse phonetically. Second, the editor allows the user to link the digitized audio record of the interaction directly to the transcript. This is the system called “sonic CHAT.” With these sonic CHAT links, it is possible to double-click on a sentence and hear its sound immediately. Having the actual sound produced by the child directly available in the transcript takes some of the burden off of the transcription system. However, whenever computerized analyses are based not on the original audio signal but on transcribed orthographic forms, one must continue to understand the limits of transcription conventions. Third, for those who wish to avoid the work involved in IPA transcription or sonic CHAT, there is a system for using nonstandard lexical forms, so that a form such as “might (h)ave” would be universally recognized as the spelling of “mightof,” the contracted form of “might have.” More extreme cases of phonological variation can be annotated as in this example: popo [: hippopotamus].
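As a minimal illustration of these two conventions on a single main line (the speaker code and utterance here are invented for this sketch, and a tab follows the speaker code in an actual file):

*CHI:	I might (h)ave seen a popo [: hippopotamus].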

3.2.2     The Misuse of Standard Punctuation

Transcribers have a tendency to write out spoken language with the punctuation conventions of written language. Written language is organized into clauses and sentences delimited by commas, periods, and other marks of punctuation. Spoken language, on the other hand, is organized into tone units clustered about a tonal nucleus and delineated by pauses and tonal contours (Crystal, 1969, 1979; Halliday, 1966, 1967, 1968). Work on the discourse basis of sentence production (Chafe, 1980; Jefferson, 1984) has demonstrated a close link between tone units and ideational units. Retracings, pauses, stress, and all forms of intonational contours are crucial markers of aspects of the utterance planning process. Moreover, these features also convey important sociolinguistic information. Without special markings or conventions, there is no way to directly indicate these important aspects of interactions.

3.2.3     Working With Video

Whatever form a transcript may take, it will never contain a fully accurate record of what went on in an interaction. A transcript of an interaction can never fully replace an audiotape, because an audio recording of the interaction will always be more accurate in terms of preserving the actual details of what transpired. By the same token, an audio recording can never preserve as much detail as a video recording with a high-quality audio track. Audio recordings record none of the nonverbal interactions that often form the backbone of a conversational interaction. Hence, they systematically exclude a source of information that is crucial for a full interpretation of the interaction. Although there are biases involved even in a video recording, it is still the most accurate record of an interaction that we have available. For those who are trying to use transcription to capture the full detailed character of an interaction, it is imperative that transcription be done from a video recording which should be repeatedly consulted during all phases of analysis.

 

When the CLAN editor is used to link transcripts to audio recordings, we refer to this as sonic CHAT. When the system is used to link transcripts to video recordings, we refer to this as video CHAT. The CLAN manual explains how to link digital audio and video to transcripts.

3.3      Problems With Forced Decisions

Transcription and coding systems often force the user to make difficult distinctions. For example, a system might make a distinction between grammatical ellipsis and ungrammatical omission. However, it may often be the case that the user cannot decide whether an omission is grammatical or not. In that case, it may be helpful to have some way of blurring the distinction. CHAT has certain symbols that can be used when a categorization cannot be made. It is important to remember that many of the CHAT symbols are entirely optional. Whenever you feel that you are being forced to make a distinction, check the manual to see whether the particular coding choice is actually required. If it is not required, then simply omit the code altogether.

3.4      Transcription and Coding

It is important to recognize the difference between transcription and coding. Transcription focuses on the production of a written record that can lead us to understand, albeit only vaguely, the flow of the original interaction. Transcription must be done directly off an audiotape or, preferably, a videotape. Coding, on the other hand, is the process of recognizing, analyzing, and taking note of phenomena in transcribed speech. Coding can often be done by referring only to a written transcript. For example, the coding of parts of speech can be done directly from a transcript without listening to the audiotape. For other types of coding, such as speech act coding, it is imperative that coding be done while watching the original videotape.

 

The CHAT system includes conventions for both transcription and coding. When first learning the system, it is best to focus on learning how to transcribe. The CHAT system offers the transcriber a large array of coding options. Although few transcribers will need to use all of the options, everyone needs to understand how basic transcription is done on the “main line.” Additional coding is done principally on the secondary or “dependent” tiers. As transcribers work more with their data, they will include further options from these dependent tiers. However, the beginning user should focus first on learning to correctly use the conventions for the main line. The manual includes several sample transcripts to help the beginner in learning the transcription system.
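As a schematic sketch of this layout (the utterance and comment are invented; %com is one of the standard dependent tiers described later in this manual), a main line with one dependent tier looks like this:

*CHI:	put on my jamas.
%com:	child is holding up her pajamas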

3.5      Three Goals

Like other forms of communication, transcription systems are subjected to a variety of communicative pressures. The view of language structure developed by Slobin (1977) sees structure as emerging from the pressure of three conflicting charges or goals. On the one hand, language is designed to be clear. On the other hand, it is designed to be processible by the listener and quick and easy for the speaker. Unfortunately, ease of production often comes in conflict with clarity of marking. The competition between these three motives leads to a variety of imperfect solutions that satisfy each goal only partially. Such imperfect and unstable solutions characterize the grammar and phonology of human language (Bates & MacWhinney, 1982). Only rarely does a solution succeed in fully achieving all three goals.

 

Slobin's view of the pressures shaping human language can be extended to analyze the pressures shaping a transcription system. In many regards, a transcription system is much like any human language. It needs to be clear in its markings of categories, and still preserve readability and ease of transcription. However, unlike a human language, a transcription system needs to address two different audiences. One audience is the human audience of transcribers, analysts, and readers. The other audience is the digital computer and its programs. In order to successfully deal with these two audiences, a system for computerized transcription needs to achieve the following goals:

1.     Clarity: Every symbol used in the coding system should have some clear and definable real-world referent. The relation between the referent and the symbol should be consistent and reliable. Symbols that mark particular words should al­ways be spelled in a consistent manner. Symbols that mark particular conversa­tional patterns should refer to actual patterns consistently observable in the data. In practice, codes will always have to steer between the Scylla of overregular­ization and the Charybdis of underregularization discussed earlier. Distinctions must avoid being either too fine or too coarse. Another way of looking at clarity is through the notion of systematicity. Systematicity is a simple extension of clarity across transcripts or corpora. Codes, words, and symbols must be used in a consistent manner across transcripts. Ideally, each code should always have a unique meaning independent of the presence of other codes or the particular tran­script in which it is located. If interactions are necessary, as in hierarchical cod­ing systems, these interactions need to be systematically described.

2.     Readability: Just as human language needs to be easy to process, so transcripts need to be easy to read. This goal often runs directly counter to the first goal. In the CHILDES system, we have attempted to provide a variety of chat options that will allow a user to maximize the readability of a transcript. We have also provided clan tools that will allow a reader to suppress the less readable aspects of a transcript when the goal of readability is more important than the goal of clarity of marking.

3.     Ease of data entry: As distinctions proliferate within a transcription system, data entry becomes increasingly difficult and error-prone. There are two ways of dealing with this problem. One method attempts to simplify the coding scheme and its categories. The problem with this approach is that it sacrifices clarity. The second method attempts to help the transcriber by providing computational aids. The CLAN programs follow this path. They provide systems for the automatic checking of transcription accuracy, methods for the automatic analysis of mor­phology and syntax, and tools for the semiautomatic entry of codes. However, the basic process of transcription has not been automated and remains the major task during data entry.

4           CHAT Outline

chat provides both basic and advanced formats for transcription and coding. The ba­sic level of chat is called minchat. New users should start by learning minchat. This system looks much like other intuitive transcription systems that are in general use in the fields of child language and discourse analysis. However, eventually users will find that there is something they want to be able to code that goes beyond minchat. At that point, they should move on to learning midCHAT.

4.1      minCHAT – the Form of Files

There are several minimum standards for the form of a minchat file. These standards must be followed for the CLAN commands to run successfully on chat files:

1.     Every line must end with a carriage return.

2.     The first line in the file must be an @Begin header line.

3.     The second line in the file must be an @Languages header line.  The languages entered here use a three-letter ISO 639-3 code, such as “eng” for English.

4.     The third line must be an @Participants header line listing three-letter codes for each participant, the participant's name, and the participant's role.

5.     After the @Participants header come a set of @ID headers providing further details for each speaker.  These will be inserted automatically for you when you run CHECK using escape-L.

6.     The last line in the file must be an @End header line.

7.     Lines beginning with * indicate what was actually said. These are called “main lines.” Each main line should code one and only one utterance. When a speaker produces several utterances in a row, code each with a new main line.

8.     After the asterisk on the main line comes a three-letter code in upper case letters for the participant who was the speaker of the utterance being coded. After the three-letter code comes a colon and then a tab.

9.     What was actually said is entered starting in the ninth column.

10.  Lines beginning with the % symbol can contain codes and commentary regarding what was said. They are called “dependent tier” lines.  The % symbol is followed by a three-letter code in lowercase letters for the dependent tier type, such as “pho” for phonology; a colon; and then a tab. The text of the dependent tier begins after the tab.

11.  Continuations of main lines and dependent tier lines begin with a tab which is inserted automatically by the CLAN editor.

4.2      minCHAT – Words and Utterances

In addition to these minimum requirements for the form of the file, there are certain minimum ways in which utterances and words should be written on the main line:

1.     Utterances should end with an utterance terminator. The basic utterance termi­nators are the period, the exclamation mark, and the question mark.

2.     Commas can be used as needed to mark phrasal junctions, but they are not used by the programs and have no sharp prosodic definition.

3.     Use upper case letters only for proper nouns and the word “I.” Do not use upper­case letters for the first words of sentences. This will facilitate the identification of proper nouns.

4.     Words should not contain capital letters except at their beginning. Words should not contain numbers, unless these mark tones.

5.     Unintelligible words with an unclear phonetic shape should be transcribed as xxx.

6.     If you wish to note the phonological form of an incomplete or unintelligible pho­nological string, write it out with an ampersand, as in &guga.

7.     Incomplete words can be written with the omitted material in parentheses, as in (be)cause and (a)bout.

Here is a sample that illustrates these principles. This file is syntactically correct and uses the minimum number of chat conventions while still maintaining compatibility with the CLAN commands.

 

@Begin

@Languages:     eng

@Participants: CHI Ross Target_Child, FAT Brian Father

@ID:      eng|macwhinney|CHI|2;10.10||||Target_Child|||

@ID:      eng|macwhinney|FAT|35;2.||||Father|||

*CHI:     why isn't Mommy coming?

%com:     Mother usually picks Ross up around 4 PM.

*FAT:     don't worry.

*FAT:     she'll be here soon.

*CHI:     good.

@End

4.3      Analyzing One Small File

For researchers who are just beginning to use chat and CLAN, there is one suggestion that can save hundreds of hours of wasted time. The suggestion is to transcribe and analyze a single small file completely and perfectly before launching a major effort in transcription and analysis. The idea is that you should learn just enough about minCHAT and minCLAN to see your path through these four crucial steps:

1.     entry of a small set of your data into a CHAT file,

2.     successful running of the CHECK command inside the editor to guarantee accu­racy in your CHAT file,

3.     development of a series of codes that will interface with the particular CLAN commands most appropriate for your analysis, and

4.     running of the relevant CLAN commands, so that you can be sure that the results you will get will properly test the hypotheses you wish to develop.

If you go through these steps first, you can guarantee in advance the successful outcome of your project. You can avoid ending up in a situation in which you have transcribed hun­dreds of hours of data in a way that does not match correctly with the input require­ments for CLAN.
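
To make this concrete, here is roughly what steps 2 and 4 might look like when typed into the CLAN Commands window. The file name test01.cha and the particular commands are only illustrative; the right commands depend on your own hypotheses:

check test01.cha
freq +t*CHI test01.cha
mlu +t*CHI test01.cha

Here CHECK verifies the form of the file, and the +t*CHI switch restricts FREQ and MLU to the main tier of the child.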

4.4      midCHAT

After having learned minchat, you are ready to learn the basics of CLAN. To do this, you will want to work through the first chapters of the CLAN manual focusing in particular on the CLAN tutorial. These chapters will take you up to the level of minCLAN, which corresponds to the minchat level.

 

Once you have learned minCHAT and minCLAN, you are ready to move on to the next levels, which are midCHAT and midCLAN.  Learning midCHAT involves mastering the transcription of words and conversational features. In particular, the midCHAT learner should work through the chapters on words, utterances, and scoped symbols. Depending on the shape of the particular project, the transcriber may then need to study additional chapters in this manual.  For people working on large projects that last many months, it is a good idea to eventually read all of the current manual, although some sections that seem less relevant to the project can be skimmed.

4.5      The Documentation File

chat files typically record a conversational sample collected from a particular set of speakers on a particular day. Sometimes researchers study a small set of children repeatedly over a long period of time. Corpora created using this method are referred to as longitudinal studies. For such studies, it is best to break up chat files into one collection for each child. This can be done simply by creating file names that begin with the three-letter code for the child, as in lea001.cha or eve15.cha. Each collection of files from the children involved in a given study constitutes a corpus. A corpus can also be composed of a group of files from different groups of speakers when the focus is on a cross-sectional sampling of larger numbers of language learners from various age groups. In either case, each corpus should have a documentation file. This “readme” file should contain a basic set of facts that are indispensable for the proper interpretation of the data by other researchers. The minimum set of facts that should be in each readme file is the following.

1.     Acknowledgments. There should be a statement that asks the user to cite some particular reference when using the corpus. For example, researchers using the Adam, Eve, and Sarah corpora from Roger Brown and his colleagues are asked to cite Brown (1973). In addition, all users can cite this current manual as the source for the CHILDES system in general.

2.     Restrictions. If the data are being contributed to the CHILDES system, contrib­utors can set particular restrictions on the use of their data. For example, re­searchers may ask that they be sent copies of articles that make use of their data. Many researchers have chosen to set no limitations at all on the use of their data.

3.     Warnings. This documentation file should also warn other researchers about limitations on the use of the data. For example, if an investigator paid no atten­tion to correct transcription of speech errors, this should be noted.

4.     Pseudonyms. The readme file should also include information on whether infor­mants gave informed consent for the use of their data and whether pseudonyms have been used to preserve informant anonymity. In general, real names should be replaced by pseudonyms. Anonymization is not necessary when the subject of the transcriptions is the researcher's own child, as long as the child grants permission for the use of the data.

5.     History. There should be detailed information on the history of the project. How was funding obtained? What were the goals of the project? How was data col­lected? What was the sampling procedure? How was transcription done? What was ignored in transcription? Were transcribers trained? Was reliability checked? Was coding done? What codes were used? Was the material comput­erized? How?

6.     Codes. If there are project-specific codes, these should be described.

7.     Biographical data. Where possible, extensive demographic, dialectological, and psychometric data should be provided for each informant. There should be information on topics such as age, gender, siblings, schooling, social class, oc­cupation, previous residences, religion, interests, friends, and so forth. Informa­tion on where the parents grew up and the various residences of the family is particularly important in attempting to understand sociolinguistic issues regard­ing language change, regionalism, and dialect. Without detailed information about specific dialect features, it is difficult to know whether these particular markers are being used throughout the language or just in certain regions.

8.     Situational descriptions. The readme file should include descriptions of the contexts of the recordings, such as the layout of the child's home and bedroom or the nature of the activities being recorded. Additional specific situational in­formation should be included in the @Situation and @Comment fields in each  file.

The various readme files for the corpora that are now in the CHILDES database were all contributed in this form. To maintain consistency and promote an overview of the database, these files were then edited and reformatted and combined into the da­tabase files that can now be downloaded from the server.

4.6      Checking Syntactic Accuracy

Each CLAN command runs a very superficial check to see if a file conforms to minchat. This check looks only to see that each line begins with either @, *, %, a tab, or a space. This is the minimum that the CLAN commands must have to function. However, many CLAN functions depend for their correct operation on adherence to further minchat standards. In order to make sure that a file matches these minimum requirements for correct analysis through CLAN, researchers should run each file through the CHECK program. The CHECK command can be run directly inside the editor, so that you can verify the accuracy of your transcription as you are producing it. CHECK will detect errors such as failure to start lines with the correct symbols, use of incorrect speaker codes, or missing @Begin and @End symbols. CHECK can also be used to find errors in chat coding beyond those discussed in this chapter. Using CHECK is like brushing your teeth. It may be hard at first to remember to use the command, but the more you use it the easier it becomes and the better the final results.

5       File Headers

The three major components of a chat transcript are the file headers, the main tier, and the dependent tiers. In this chapter we discuss creating the first major component – the file headers. A computerized transcript in chat format begins with a series of “head­er” lines, which tells us about things such as the date of the recording, the names of the par­ticipants, the ages of the participants, the setting of the interaction, and so forth.

 

A header is a line of text that gives information about the participants and the setting. All headers begin with the “@” sign. Some headers require nothing more than the @ sign and the header name. These are “bare” headers such as @Begin or @New Episode. How­ever, most headers require that there be some additional material. This additional material is called an “entry.” Headers that take entries must have a colon, which is then followed by one or two tabs and the required entry. By default, tabs are usually understood to be placed at eight-character intervals. The material up to the colon is called the “header name.” In the example following, “@Media” and “@Date” are both header names.

 

@Media:   abe88 movie

@Date: 25-JAN-1983

 

The text that follows the header name is called the “header entry.” Here, “abe88 movie” and “25-JAN-1983” are the header entries. The header name and the head­er entry together are called the “header line.” The header line should never have a punctu­ation mark at the end. In chat, only utterances actually spoken by the subjects receive final punctuation.

 

This chapter presents a set of headers that researchers have considered important. Except for the @Begin, @Languages, @Participants, @ID, and @End headers, none of the headers are required and you should feel free to use only those headers that you feel are needed for the accurate documentation of your corpus.

5.1      Hidden Headers

chat uses five types of headers: hidden, initial, participant-specific, constant, and changeable. In the editor, CHAT files appear to begin with the @Begin header.  However, there are actually four hidden headers that appear before this header.  These are the @Font, @UTF8, @PID, and @ColorWords headers, which appear in that order.

@Font:

This header is used to set the default font for the file.  This line appears at the beginning of the file and its presence is hidden in the CLAN editor. When this header is missing, CLAN tries to determine which font is most appropriate for use with the current file by examining information in the @Languages and @Options headers.  If CLAN’s choice is not appropriate for the file, then the user will have to change the font.  After this is done, the font information will be stored in this header line.  Files that are retrieved from the database often do not have this header included, thereby allowing CLAN and the user to decide which font is most appropriate for viewing the current file.

 

@UTF8

 

This hidden header follows after the @Font header.  All files in the database use this header to mark the fact that they are encoded in UTF8.  If the file was produced outside of CLAN and this header is missing, CLAN will complain and ask the user to verify whether the file should be read in UTF8.  Often this means that the user should run the CP2UTF program to convert the file to UTF8.

 

@PID

 

This hidden header follows the @UTF8 header. It declares the persistent identifier (PID) of the transcript in the Handle System (www.handle.net), which allows for persistent identification of the location of digital objects.  These values can be entered into any system that resolves PIDs to locate the required resource, such as the server at http://128.2.71.222:8000.

 

@ColorWords

 

This hidden header stores the color values that users create when using the Color Keywords dialog.

 

5.2      Initial Headers

CHAT has seven initial headers.  The first six of these – @Begin, @Languages, @Participants, @Options, @ID, and @Media – appear in this order as the first lines of the file.  The last one @End appears at the end of the file as the last line.

@Begin

This header is always the first visible header placed at the beginning of the file. It is needed to guarantee that no ma­terial has been lost at the beginning of the file. This is a “bare” header that takes no entry and uses no colon.

@Languages:

This is the second visible header; it tells the programs which languages are being used in the dialogues. Here is an example of this line for a bilingual transcript using Swedish and Portuguese.

 

@Languages:     swe, por

 

The language codes come from the international ISO 639-3 standard. For the languages currently in the database, these three-letter codes and extended codes are used:

 

Table 1: ISO Language Codes

 

Language      Code      Language      Code      Language      Code
Afrikaans     afr       German        deu       Polish        pol
Arabic        ara       Greek         ell
Basque        eus       Hebrew        heb       Portuguese    por
Cantonese     zho-yue   Hungarian     hun       Punjabi       pan
Catalan       cat       Icelandic     isl       Romanian      ron
Chinese       zho       Indonesian    ind       Russian       rus
                        Irish         gle       Spanish       spa
Croatian      hrv       Italian       ita       Swahili       swa
Czech         ces       Japanese      jpn       Swedish       swe
Danish        dan       Javanese      jav       Tagalog       tag
Dutch         nld       Kannada       kan       Taiwanese     zho-min
English       eng       Kikuyu        kik       Tamil         tam
Estonian      est       Korean        kor       Thai          tha
Farsi         fas       Lithuanian    lit       Turkish       tur
Finnish       fin       Norwegian     nor       Vietnamese    vie
French        fra                               Welsh         cym
Galician      glg                               Yiddish       yid

 

We continually update this list, and CLAN relies on a file in the lib/fixes directory called ISO-639.cut that lists the current languages. In multilingual corpora, several codes can be combined on the @Languages line.  The first code given is for the language used most frequently in the transcript. Individual utterances in the second or third most frequent language can be marked with precodes, as in this example:

 

*CHI:     [- eng] this is my juguete@s.

 

In this example, Spanish is the most frequent language, but the particular sentence is marked as English.  The @Languages header lists spa for Spanish, and then eng for English.  Within this English sentence, the use of a Spanish word is then marked as @s.  When the @s is used in the main body of the transcript without the [- eng], then it indicates a shift to English, rather than to Spanish.

 

The @s code may also be used to explicitly mark the use of a particular language, even if it is not included in the @Languages header.  For example, the code schlep@s:yid can be used to mark the inclusion of the Yiddish word “schlep” in any text.  The @s code can also be further elaborated to mark code-blended words.  The form well@s:eng&cym indicates that the word “well” could be either an English or a Welsh word.  The combination of a stem from one language with an inflection from another can be marked using the plus sign as in swallowni@s:eng+hun for an English stem with a Hungarian infinitival marking.  All of these codes can be followed by a code with the $ to explicitly mark the parts of speech.  Thus, the form recordar@s$inf indicates that this Spanish word is an infinitive. The marking of part of speech with the $ sign can also be used without the @s.

 

Tone languages like Cantonese, Mandarin, and Thai are allowed to have word forms that include tones and numbers for polysemes.

@Participants:

This is the third visible header.  Like the @Begin and @Languages headers, it is obligatory.  It lists all of the actors within the file. The format for this header is XXX Name Role, XXX Name Role, XXX Name Role. XXX stands for the three-letter speaker ID. Here is an example of a completed @Participants header line:

 

@Participants: SAR Sue_Day Target_Child, CAR Carol Mother

 

Participants are identified by three elements: their speaker ID, their name and their role:

 

1.     Speaker ID. The speaker ID is usually composed of three letters. The code may be based either on the participant's name, as in *ROS or *BIL, or on her role, as in *CHI or *MOT. In this type of identifying system, several different children could be indicated as *CH1, *CH2, *CH3, and so on. Speaker IDs must be unique because they will be used to identify speakers both in the main body of the transcript and in other headers. In many transcripts, three letters are enough to distinguish all speakers. However, even with three letters, some ambiguities can arise. For example, suppose that the child being studied is named Mark (MAR) and his mother is named Mary (MAR). They would both have the same speaker ID and you would not be able to tell who was talking. So you must change one speaker ID. You would probably want to change it to something that would be easy to read and understand as you go through the file. A good choice is to use that speaker's role. In this example, Mary's speaker ID would be changed to MOT (Mother). You could change Mark's speaker ID to CHI, but that would be misleading if there are other children in the transcript. So a better solution would be to use MAR and MOT as shown in the following example:

 

@Participants: MAR Mark Target_Child, MOT Mary Mother

 

2.     Name. The speaker's name can be omitted. If CLAN finds only a three-letter ID and a role, it will assume that the name has been omitted. In order to preserve anonymity, it is often useful to include a pseudonym for the name, because the pseudonym will also be used in the body of the transcript. For clan to correctly parse the participants line, multiple-word name definitions such as “Sue Day” need to be joined in the form “Sue_Day.”

 

3.     Role. After the ID and name, you type in the role of the speaker. There is a fixed set of roles specified in the depfile.cut file used by CHECK, and we recommend using these fixed roles whenever possible. Please consult that file for the full list. You will also see this same list of possible roles in the “role” segment of the “ID Headers” dialog box. If one of these standard roles does not work, it would be best to use one of the generic age roles, like Adult, Child, or Teenager. Then, the exact nature of the role can be put in the place of the name, as in these examples:

 

@Participants:  TBO Toll_Booth_Operator Adult, AIR Airport_Attendant Adult, SI1 First_Sibling Sibling, SI2 Second_Sibling Sibling, OFF MOT_to_INV OffScript, NON Computer_Talk Non_Human

 

@Options:

 

This header is not obligatory, but it is frequently needed.  When it occurs, it must follow the @Participants line.  This header allows the checking programs (CHECK and the XML validator) to suspend certain checking rules for certain file types.

1.     CA.  Use of this option suspends the usual requirement for utterance terminators.

2.     Heritage.  Use of this option tells CHECK and the validator not to look at the content of the main lines at all.  This radical blockage of the function of CHECK is only recommended for people working with CA files done in the traditional Jeffersonian format. When this option is used, text may be placed into italics, as in traditional CA.

3.     Sign.  Use of this option permits the use of all capitals in words for Sign Language notation.

4.     IPA.  Use of this option permits the use of IPA notation on the main line.

5.     Line.  Use of this option tells the web browser to expect time marking bullets on each line.  By default, the browser expects a bullet at the end of each tier.

6.     Multi.  Use of this option tells the checkers to expect multiple bullets on a single line.  This can be used for data that come from programs like Praat that mark time for each word.

7.     Caps.  This option turns off CLAN’s restriction against having capital letters inside words.

8.     Bullets.  This option turns off the requirement that each time-marking bullet should begin after the previous one.
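
For example, a transcript done in CA style without conventional utterance terminators would declare this immediately after the @Participants line; the choice of option shown here is only illustrative:

@Options:  CA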

@ID:

This header is used to control programs such as STATFREQ, output to Excel, and new programs based on XML.  The form of this line is:

 

@ID: language|corpus|code|age|sex|group|SES|role|education|custom|

 

There must be one @ID field for each participant.  Often you will not care to encode all of this information.  In that case, you can leave some of these fields empty.  Here is a typical @ID header.

 

@ID:      eng|macwhinney|CHI|2;10.10||||Target_Child|||

 

To facilitate typing of these headers, you can run the CHECK program on a new CHAT file.  If CHECK does not see @ID headers, it will use the @Participants line to insert a set of @ID headers to which you can then add further information.  Alternatively, you can use the INSERT program to create these fields automatically from the information in the @Participants line.  For even more complete control over creation of these @ID headers, you can use the dialog system that comes up when you have an open CHAT file and select “ID Headers” under the Tiers Menu pulldown.  Here is a sample version of this dialog box:

 

[Screenshot: the ID Headers dialog box]

 

Here are some further characterizations of the possible fields for the @ID header.

 

Language:        as in Table 1 above

Corpus:           a one-word label for the corpus in lowercase

Code:               the three-letter code for the speaker in capitals

Age:                 the age of the speaker (see below)

Sex:                  either “male” or “female” in lowercase

Group:                        any single word label

SES:                 any single word label

Role:                the role as given in the @Participants line

Education:       educational level of the speaker

Custom:          any additional information needed for a given project

 

It is important to use the correct format for the Target_Child’s age.  This field uses the form years;months.days as in 2;11.17 for 2 years, 11 months, and 17 days. If you want to represent a range of several days for a given transcript, you can use this format:  2;11.17 – 2;11.28.  Note that the dash is surrounded by spaces. If you do not know the child's age in days, you can simply use years and months, as in 6;4. with a period after the months. If you do not know the months, you can use the form 6; with the semicolon after the years. If you only know the child’s birthdate and the date of the transcript, you can use the DATES program to compute the child’s age.
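
For example, the following pair of @ID lines (the corpus name “sample” is purely illustrative) shows an age known to the day and an age known only to the month:

@ID:      eng|sample|CHI|2;11.17||||Target_Child|||
@ID:      eng|sample|CHI|6;4.||||Target_Child|||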

 

@Media:

 

This header is used to tell CLAN how to locate and play back media that are linked to transcripts.  The first field in this header specifies the name of the media file.  Extensions should be omitted.  If the media file is abe88.wav, then just enter “abe88”.  Then declare the format as “sound” or “video”.  It is also possible to add the terms “missing” or “unlinked” after the media type.  So the line has this shape:

 

@Media:   abe88, sound, missing

@End 

Like the @Begin header, this header uses no colon and takes no entry. It is placed at the end of the file as the very last line. Adding this header provides a safeguard against the danger of undetected file truncation during copying.
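
Putting the initial headers together, a minimal file skeleton is ordered as follows. The corpus name, participants, ages, utterance, and media file name shown here are purely illustrative:

@Begin
@Languages:     eng
@Participants:  CHI Sue Target_Child, MOT Mary Mother
@ID:      eng|sample|CHI|3;2.10||||Target_Child|||
@ID:      eng|sample|MOT|||||Mother|||
@Media:   sample01, sound
*CHI:     more juice please.
@End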

 

5.3      Participant-Specific Headers

The third set of headers provides information specific to each participant. Most of the participant-specific information is in the @ID tier.  That information can be entered by using the ID headers option in CLAN’s Tiers menu. The exceptions are for these tiers:

 

@Birth of #:

@Birthplace of #:

@L1 of #:

5.4      Constant Headers

Currently, the constant headers follow the participant-specific headers.  However, once the participant-specific headers have been merged into the @ID fields, the constant headers will follow the @Media field.  These headers, which are all optional, describe various general facts about the file.

@Exceptions:

This allows for special word forms in certain corpora.

@Interaction Type:                         

The possible entries here include: constructed computer phonecall telechat meeting work medical classroom tutorial private family sports religious legal face_to_face

@Location:

This header should include the city, state or province, and country in which the interac­tion took place. Here is an example of a completed header line:

 

@Location: Boston, MA, USA

@Number:                                        

The possible entries here include: two three four five more audience

 

@Recording Quality:

Possible entries here are: poor, fair, good, and excellent.

@Room Layout:

This header outlines room configuration and positioning of furniture. This is especially useful for experimental settings. The entry should be a description of the room and its con­tents. Here is an example of the completed header line:

 

@Room Layout:   Kitchen; Table in center of room with window on west wall, door to outside on north wall

@Tape Location:

This header indicates the specific tape ID, side and footage. This is very important for identifying the tape from which the transcription was made. The entry for this header should include the tape ID, side and footage. Here is an example of this header:

 

@Tape Location: tape74, side a, 104

@Time Duration:

It is often necessary to indicate the time at which the audiotaping began and the amount of time that passed during the course of the taping, as in the following header:

 

@Time Duration: 12:30-13:30

 

This header provides the absolute time during which the taping occurred. For most projects what is important is not the absolute time, but the time of individual events relative to each other. This sort of relative timing is provided by coding on the %tim dependent tier in conjunction with the @Time Start header described next.

@Time Start:

If you are tracking elapsed time on the %tim tier, the @Time Start header can be used to indicate the absolute time at which the timing marks begin. If a new @Time Start header is placed in the middle of the transcript, this “restarts” the clock.

 

@Time Start: 12:30

@Transcriber:

This line identifies the people who transcribed and coded the file. Having this indicated is often helpful later, when questions arise. It also provides a way of acknowledging the people who have taken the time to make the data available for further study.

@Transcription:                              

The possible entries here are:  eye_dialect partial full detailed coarse checked

@Warning:

This header is used to warn the user about certain defects or peculiarities in the collec­tion and transcription of the data in the file. Some typical warnings are as follows:

  1. These data are not useful for the analysis of overlaps, because overlapping was not accurately transcribed.
  2. These data contain no information regarding the context. Therefore they will be inappropriate for many types of analysis.
  3. Retracings and hesitation phenomena have not been accurately transcribed in these data.
  4. These data have been transcribed, but the transcription has not yet been double-checked.
  5. This file has not yet passed successfully through CHECK.

5.5      Changeable Headers

Changeable headers can occur either at the beginning of the file along with the constant headers or else in the body of the file. Changeable headers contain information that can change within the file. For example, if the file contains material that was recorded on only one day, the @Date header would occur only once at the beginning of the file. However, if the file contains some material from a later day, the @Date header would be used again lat­er in the file to indicate the next date. These changeable headers appear, then, at the point within the file where the information changes. The list that follows is alphabetical.

@Activities:

This header describes the activities involved in the situation. The entry is a list of com­ponent activities in the situation. Suppose the @Situation header reads, “Getting ready to go out.” The @Activities header would then list what was involved in this, such as putting on coats, gathering school books, and saying good-bye.

@Bck:

Diary material that was not originally transcribed in the chat format often has explan­atory or background material placed before a child's utterance. When converting this ma­terial to the chat format, it is sometimes impossible to decide whether this background material occurs before, during, or after the utterance. In order to avoid having to make these decisions after the fact, one can simply enter it in an @Bck header.

 

@Bck:     Rachel was fussing and pointing toward the cabinet where the cookies are stored.

*RAC:     cookie [/] cookie.

@Bg and @Bg:

These headers are used to mark the beginning of a “gem” for analysis by GEM. If there is a colon, you must follow the colon with a tab and then one or more code words.

@Blank

This header is created by the TEXTIN program.  It is used to represent the fact that some written text includes a blank line or new paragraph. It should not be used for transcripts of spoken language.

@Comment:

This header can be used as an all-purpose comment line. Any type of comment can be entered on an @Comment line. When the comment refers to a particular utterance, use the %com line. When the comment refers to more general material, use the @Comment header. If the comment is intended to apply to the file as a whole, place the @Comment header along with the constant headers before the first utterance. Instead of trying to make up a new coding tier name such as “@Gestational Age” for a special purpose type of informa­tion, it is best to use the @Comment field, as in this example:

 

@Comment: Gestational age of MAR is 7 months

@Comment: Birthweight of MAR is 6 lbs. 4 oz

 

Another example of a special @Comment field is used in the diary notes of the MacWhinney corpus, where they have this shape:

 

@Comment: Diary-Brian – Ross said “I don’t need to throw my blocks out the window anymore.”

@Date:

This header indicates the date of the interaction. The entry for this header is given in the form day-month-year. The date is abbreviated in the same way as in the @Birth header entry. Here is an example of a completed @Date header line:

 

@Date: 01-JUL-1965

 

Because we have some corpora going back over a century, it is important to include the full value for the year.  Also, because the days of the month should always have two digits, it is necessary to add a leading “0” for days such as “01”.

@Eg and @Eg:

These headers are used to mark the end of a “gem” for analysis by the GEM command. If there is a colon, you must follow the colon with a tab and then one or more code words. Each @Eg must have a matching @Bg.  If the @Eg: form is used, then the text following it must exactly match the text in the corresponding @Bg: header.  You can nest one set of @Bg-@Eg markers inside another, but double embedding is not allowed.  You can also begin a new pair before finishing the current one, but again this cannot be done for three beginnings.
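
For example, a stretch of book reading could be marked off as a gem in this way; the code word “bookreading” is only illustrative:

@Bg:      bookreading
*MOT:     shall we read this one?
*CHI:     yes.
@Eg:      bookreading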

@G:

This header is used in conjunction with the GEM program, which is described in the CLAN manual.  It marks the beginning of  “gems” when no nesting or overlapping of gems occurs.  Each gem is defined as material that begins with an @g marker and ends with the next @g marker.  We refer to these markers as “lazy” gem markers, because they are easier to use than the @bg and @eg markers.  To use  this feature, you need to also use the +n switch in GEM. You may nest at most one @Bg-@Eg pair inside a series of @G headers.
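
For example, with lazy gem markers each @G: line simply starts a new gem that runs until the next one; the labels used here are only illustrative:

@G:       greeting
*MOT:     hi sweetie.
@G:       toys
*CHI:     where's my truck?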

@New Episode 

This header simply marks the fact that there has been a break in the recording and that a new episode has started. It is a “bare” header that is used without a colon, because it takes no entry. There is no need to mark the end of the episode because the @New Episode head­er indicates both the end of one episode and the beginning of another.

@New Language:   

This header is used to indicate the shift from the initially most frequent language listed in the @Languages header to a new most frequent language.  This header should only be used when there is a marked break in a transcript from the use of one language to a fairly uniform use of another language.

@Page:   

This header is used to indicate the page from which some text is taken. It should not be used for spoken texts.

@Situation:

This changeable header describes the general setting of the interaction. It applies to all the material that follows it until a new @Situation header appears. The entry for this header is a standard description of the situation. Try to use standard situations such as: “breakfast,” “outing,” “bath,” “working,” “visiting playmates,” “school,” or “getting ready to go out.” Here is an example of the completed header line:

 

@Situation:  Tim and Bill are playing with toys in the hallway.

 

There should be enough situational information given to allow the user to reconstruct the situation as much as possible. Who is present? What is the layout of the room or other space? What is the social role of those present? Who is usually the caregiver? What activity is in progress? Is the activity routinized and, if so, what is the nature of the routine? Is the routine occurring in its standard time, place, and personnel configuration? What objects are present that affect or assist the interaction? It will also be important to include relevant ethnographic information that would make the interaction interpretable to the user of the database. For example, if the text is parent-child interaction before an observer, what is the culture's evaluation of behaviors such as silence, talking a lot, displaying formulaic skills, defending against challenges, and so forth?

6       Words

Words are the basic building blocks for all sentential and discourse structures. By study­ing the development of word use, we can learn an enormous amount about the growth of syntax, discourse, morphology, and conceptual structure. However, in order to realize the full potential of computational analysis of word usage, we need to follow certain basic rules. In particular, we need to make sure that we spell words in a consistent manner. If we sometimes use the form doughnut and sometimes use the form donut, we are being in­consistent in our representation of this particular word. If such inconsistencies are repeated throughout the lexicon, computerized analysis will become inaccurate and misleading. One of the major goals of chat analysis is to maximize systematicity and minimize inconsis­tency. In the Introduction, we discussed some of the problems involved in mapping the speech of language learners onto standard adult forms. This chapter spells out some rules and heuristics designed to achieve the goal of consistency for word-level transcription.

One solution to this problem would be to avoid the use of words altogether by transcrib­ing everything in phonetic or phonemic notation. But this solution would make the tran­script difficult to read and analyze. A great deal of work in language learning is based on searches for words and combinations of words. If we want to conduct these lexical analy­ses, we have to try to match up the child's production to actual words. Work in the analysis of syntactic development also requires that the text be analyzed in terms of lexical items. Without a clear representation of lexical items and the ways that they diverge from the adult standard, it would be impossible to conduct lexical and syntactic analyses computationally. Even for those researchers who do not plan to conduct lexical analyses, it is extremely dif­ficult to understand the flow of a transcript if no attempt is made to relate the learner's sounds to items in the adult language.

At the same time, attempts to force adult lexical forms onto learner forms can seriously misrepresent the data. The solution to this problem is to devise ways to indicate the various types of divergences between learner forms and adult standard forms. Note that we use the term “divergences” rather than “error.” Although both learners (MacWhinney & Osser, 1977) and adults (Stemberger, 1985) clearly do make errors, most of the divergences be­tween learner forms and adult forms are due to structural aspects of the learner's system.

This chapter discusses the various tools that chat provides to mark some of these di­vergences of child forms from adult standards. The basic types of codes for divergences that we discuss are:

1.     special learner-form markers,

2.     codes for unidentifiable material,

3.     codes for incomplete words,

4.     ways of treating formulaic use of words, and

5.     conventions for standardized spellings.

For languages such as English, Spanish, and Japanese, we now have complete MOR grammars. The lexicons used by these grammars constitute the definitive current CHAT standard for words.  Please take a look at the relevant lexical files, since they illustrate in great detail the overall principles we are describing in this chapter.

6.1      The Main Line

The word forms we will be discussing here are the principal components of the “main line.”  This line gives the basic transcription of what the speaker said. The structure of main lines in CHAT is fairly simple.  Each main tier line begins with an asterisk. After the asterisk, there is a three-letter speaker ID, a colon and a tab. The transcription of what was said be­gins in the ninth column, after the tab, because the tab stop in the editor is set for the eighth column. The remainder of the main tier line is composed primarily of a series of words. Words are defined as a series of ASCII characters separated by spaces. In this chapter, we discuss the principles governing the transcription of words. In CLAN, all characters that are not punctuation markers are potentially parts of words. The default punctuation set in­cludes the space and these characters:

 

, . ; ? ! [ ] < >

 

None of these characters or the space can be used within words. Other non-letter char­acters such as the plus sign (+) or the at sign (@) can be used within words to express spe­cial meanings. This punctuation set applies to the main lines and all coding lines with the exception of the %pho and %mod lines which use the system described in the chapter on Dependent Tiers. Be­cause those systems make use of punctuation markers for special characters, only the space can be used as a delimiter on the %pho and %mod lines. As the CLAN manual explains, this default punctuation set can be changed for particular analyses.

6.2      Basic Words

Main lines are composed of words and other markers.  Words are pronounceable forms, surrounded by spaces. Most words are entered just as they are found in the dictionary. The first word of a sentence is not capitalized, unless it is a proper noun.

6.3      Special Form Markers

Special form markers can be placed at the end of a word. To do this, the symbol “@” is used in conjunction with one or two additional letters. Here is an example of the use of the @ symbol:

 

*SAR:     I got a bingbing@c.

 

Here the child has invented the form bingbing to refer to a toy. The word bingbing is not in the dictionary and must be treated as a special form. To further clarify the use of these @c forms, the transcriber should create a file called “0lexicon.cdc” that provides glosses for such forms.
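
The shape of the 0lexicon.cdc file is not rigidly fixed by chat; one simple possibility, shown here purely as an illustration, is a plain list that pairs each special form with a gloss:

bingbing@c     a toy
gumma@c        sticky
bunko@f        broken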

 

The @c form illustrated in this example is only one of many possible special form markers that can be devised. The following table lists some of these markers that we have found useful. However, this categorization system is meant only to be suggestive, not ex­haustive. Researchers may wish to add further distinctions or ignore some of the categories listed. The particular choice of markers and the decision to code a word with a marker form is one that is made by the transcriber, not by chat. The basic idea is that CLAN will treat words marked with the special learner-form markers as words and not as fragments. In ad­dition, the MOR program will not attempt to analyze special forms for part of speech.

 

Table 2: Special Form Markers

 

Letters   Categories                  Example                 Meaning              POS
@a        addition                    xxx@a                   unintelligible       w
@b        babbling                    abame@b                 -                    bab
@c        child-invented form         gumma@c                 sticky               chi
@d        dialect form                younz@d                 you                  dia
@f        family-specific form        bunko@f                 broken               fam
@g        general special form        gongga@g                -                    -
@i        interjection, interaction   uhhuh@i                 -                    int
@k        multiple letters            ka@k                    Japanese “ka”        n:let
@l        letter                      b@l                     letter b             n:let
@n        neologism                   breaked@n               broke                neo
@o        onomatopoeia                woofwoof@o              dog barking          on
@p        phonol. consistent form     aga@p                   -                    phon
@pm       protomorpheme               wi@pm                   will?                pm
@q        metalinguistic use          no if@q-s or but@q-s    when citing words    meta
@s:*      second-language form        istenem@s:hu            Hungarian word       L2
@s$n      second-language noun        perro@s$n               Spanish noun         n|
@si       singing                     lalala@si               singing              sing
@sl       signed language             apple@sl                apple                sign
@sas      sign & speech               apple@sas               apple and sign       sas
@t        test word                   wug@t                   small creature       test
@u        Unibet transcription        binga@u                 -                    uni
@wp       word play                   goobarumba@wp           -                    wp
@x        Excluded words              stuff@x                 excluded             unk
@z:xxx    User-defined code           word@z:rtfd             any user code

 

 

We can define these special markers in the following ways:

1.     Addition can be used to mark an unintelligible string as a word for inclusion on the %mor line.  MOR then recognizes xxx@a as w|xxx.  It also recognizes xxx@a$n as, for example n|xxx.  Adding this feature will still not allow inclusion of sentences with unintelligible words for MLU and DSS, because the rules for those indices prohibit this.

2.     Babbling can be used to mark both low-level early babbling and high-level sound play in older children. These forms have no obvious meaning and are used just to have fun with sound.

3.     Child-invented forms are words created by the child, sometimes from other words, without obvious derivational morphology. Sometimes they appear to be sound variants of other words. Sometimes their origin is obscure. However, the child appears to be convinced that they have meaning, and adults sometimes come to use these forms themselves.

4.     Dialect form is often an interesting general property of a transcript.  However, the coding of phonological dialect variations on the word level should be mini­mized, because it often makes transcripts more difficult to read and analyze. In­stead, general patterns of phonological variation can be noted in the readme file.

5.     Family-specific forms are much like child-invented forms that have been taken over by the whole family. Sometimes the source of these forms is a child, but the source can also be an older member of the family. Sometimes the forms come from variations of words in another language. An example might be the use of undertoad to refer to some mysterious being in the surf, although the word was simply undertow initially.

6.     General special form marking with @g can be used when all of the above fail. However, its use should generally be avoided. Marking with the @ without a following letter is not accepted by CHECK.

7.     Interjections can be indicated in standard ways, making the use of the @i nota­tion usually not necessary. Instead of transcribing “ahem@i,” one can simply transcribe ahem following the conventions listed later.

8.     Letters can either be transcribed with the @l marker or simply as single-charac­ter words.  Strings of letters are marked as @k.

9.     Neologisms are meant to refer to morphological coinages.  If the novel form is monomor­phemic, then it should be characterized as a child-invented form (@c), family-specific form (@f), or a test word (@t).  Note that this usage is only really sanctioned for CHILDES corpora.  For AphasiaBank corpora, neologisms are considered to be forms that have no real word source, as is typical in jargon aphasia.

10.  Nonvoiced forms are produced typically by hearing-impaired children or their parents who are mouthing words without making their sounds.

11.  Onomatopoeias include animal sounds and attempts to imitate natural sounds.

12.  Phonological consistent forms (PCFs) are early forms that are phonologically consis­tent, but whose meaning is unclear to the transcriber. Usually these forms have some relation to small function words.

13.  Protomorphemes are forms that will eventually become morphemes, including function words and affixes.

14.  Metalinguistic reference can be used to either cite or “quote” single standard words or special child forms.

15.  Second-language forms derive from some language not usually used in the home. These are marked with @s: followed by a code for the second language, as in @s:zh for Mandarin words inside an English sentence.  You can also mark the part of speech of a second-language word by using the form @s$ as in perro@s$n to indicate that the Spanish word perro (dog) is a noun.

16.  Sign language use can be indicated with the @sl marker.

17.  Sign and speech use involves making a sign or informal sign in parallel with saying the word.

18.  Singing can be marked with @si. Sometimes the phrase that is being sung involves nonwords, as in lalaleloo@si.  In other cases, it involves words that can be joined by underscores. However, if a larger passage is sung, it is best to transcribe it as speech and just mark it as being sung through a comment line.

19.  Test words are nonce forms generated by the investigators to test the productiv­ity of the child's grammar.

20.  Unibet transcription can be given on the main line by using the @u marker. However, if many such forms are being noted, it may be better to construct a @pho line. With the advent of IPA Unicode, we now prefer to avoid the use of Unibet, relying instead directly on IPA.

21.  Word play in older children produces forms that may sound much like the forms of babbling, but which arise from a slightly different process.  It is best to use the @b for forms produced by children younger than 2;0 and @wp for older chil­dren.

22.  Unknown forms can be marked with @x.  However, usually unknown forms are transcribed using the xxx, yyy, and www markers.

23.  User-defined special forms can be marked with @z followed by up to five letters of a user-defined code, as in word@z:rtfd.  This format should be used carefully, because it will be difficult for the MOR program to evaluate words with these codes unless additional detailed information is added to the sf.cut file.

 

Later in this chapter we present a set of standard spellings of English words that make the use of @d, @f, and @i largely unnecessary. However, in languages where such a list is not available, it may be necessary to use forms with @d or @i. The @b, @u, and @wp markers allow the transcriber to represent words and babbling phonologically on the main line and have CLAN treat them as full lexical items. This should only be done when the analysis requires that the phonological string be treated as a word and it is unclear which standard morpheme corresponds to the word. If a phonological string should not be treated as a full word, it should be marked by a beginning &, and the @b, @u, or @wp endings should not be used. Also, if the transcript includes a complete %pho line for each word and the data are intended for phonological analysis, it is better to use yyy (see the next section) on the main line and then give the phonological form on the %pho line.  If you wish to omit coding of an item on the %pho line, you can insert the horizontal ellipsis character … (Unicode 2026).  This is a single character, not three periods, and it is not the ellipsis character used by MS-Word.
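
For example, in a transcript that carries a full %pho line, a word whose phonology the transcriber chooses not to code can be filled in with the ellipsis character; the forms shown here are only illustrative:

*CHI:     more cookie yyy.
%pho:     … kʊki dæ

Here the ellipsis marks the word “more” as uncoded, while the final form gives the phonology of the yyy item.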

 

Family-specific forms are special words used only by the family. These are often de­rived from child forms that are adopted by all family members. They also include certain “caregiverese” forms that are not easily recognized by the majority of adult speakers but which may be common to some areas or some families. Family-specific forms can be used by either adults or children.

 

The @n marker is intended for morphological neologisms and over-regularizations, whereas the @c marker is intended to mark nonce creation of stems. Of course, this distinc­tion is somewhat arbitrary and incomplete. Whenever a child-invented form is clearly on­omatopoeic, use the @o coding instead of the @c coding. A fuller characterization of neologisms can be provided by the error coding system presented in a separate chapter.

 

If transcribers find it difficult to distinguish between child-invented forms, onomatopoeia, and familial forms, they can use the general special form marker @g. In this way, they can at least indicate the fact that the preceding word is not a standard item in the adult lexicon.

6.4      Unidentifiable Material

Sometimes it is difficult to map a sound or group of sounds onto either a conventional word or a non-conventional word. This can occur when the audio signal is so weak or gar­bled that you cannot even identify the sounds being used. At other times, you can recognize the sounds that the speaker is using, but cannot map the sounds onto words. Sometimes you may choose not to transcribe a passage, because it is irrelevant to the interaction. Some­times the person makes a noise or performs an action instead of speaking, and sometimes a person breaks off before completing a recognizable word. All of these problems can be dealt with by using certain special symbols for those items that cannot be easily related to words. These symbols are typed in lower case and are preceded and followed by spaces. When standing alone on a text tier, they should be followed by a period, unless it is clear that the utterance was a question or a command.

Unintelligible Speech                                  xxx

Use the symbol xxx when you cannot hear or understand what the speaker is saying. If you believe you can distinguish the number of unintelligible words, you may use several xxx strings in a row. Here is an example of the use of the xxx symbol:

 

*SAR:     xxx.

*MOT:     what?

*SAR:     I want xxx.

 

Sarah's first utterance is fully unintelligible. Her second utterance includes some unin­telligible material along with some intelligible material.

 

The MLU and MLT commands will ignore the xxx symbol when computing mean length of utterance and other statistics. If you want to indicate the number of unintelligible words, use as many occurrences of xxx as you wish.

Phonological Coding                        yyy 

Use the symbol yyy when you plan to code all material phonologically on a %pho line. If you are not consistently creating a %pho line in which each word is transcribed in IPA in the order of the main line, you should use the @u or & notations instead. Here is an example of the use of yyy:

 

*SAR:     yyy yyy a ball.

%pho:     ta gə ə bal

 

The first two words cannot be matched to particular words, but their phonological form is given on the %pho line.

Untranscribed Material                   www 

This symbol must be used in conjunction with an %exp tier, which is discussed in the chapter on dependent tiers. This symbol is used on the main line to indicate material that a transcriber does not know how to transcribe or does not want to transcribe. For example, it could be that the material is in a language that the transcriber does not know. This symbol can also be used when a speaker says something that has no relevance to the interactions taking place and the experimenter would rather ignore it. For example, www could indicate a long conversation between adults that would be superfluous to transcribe. Here is an example of the use of this symbol:

 

*MOT:     www.

%exp:     talks to neighbor on the telephone

Actions Without Speech                  0 

This symbol is used when the speaker performs some action that is not accompanied by speech. Notice that the symbol is the numeral zero “0,” not the capital letter “O.” Here is an example of the correct usage of this symbol:

 

*FAT:     where's your doll?

*DAV:     0 [=! runs over to her closet].

 

If the action involves a vocalization, such as crying, and the transcriber wishes to code its phonetics, it would be better to insert yyy on the main tier. Do not use the zero if there is any speech on the tier. The zero can also be used to provide a place to attach a dependent tier.
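
For instance, a nonverbal turn could be given its own main line so that an action tier can be attached to it (the content here is hypothetical):

 

*CHI:     0.

%act:     points at the cookie jar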

Phonological Fragment                   & 

The & symbol can be used at the beginning of a string to indicate that the following material is just a phonological fragment or piece of a word and that CLAN should not treat it as a word. It is important not to include any of the three utterance terminators (the exclamation mark, the question mark, or the period) within such a fragment, because CLAN would treat the fragment as the end of the utterance. This form of notation is useful when the speaker stutters or breaks off before completing a recognizable word (false starts). The utterance “t- t- c- can't you go” is transcribed as follows:

 

*MAR:     &t &t &k can't you go?

 

The ampersand can also be used for nonce and nonsense forms:

 

*DAN:     &glnk &glnk.

%com:     weird noises

 

Material following the ampersand symbol will be ignored by certain CLAN commands, such as MLU, which computes the mean length of the utterance in a transcript. If you want to have the material treated as a word, use the @u form of notation instead (see the previous section).

 

Unless you specifically attempt to search for strings with the ampersand, the CLAN commands will not see them at all. If you want a command such as FREQ to count all of the instances of phonological fragments, you would have to add a switch such as +s"&*".
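
For instance, a command along these lines (assuming a file named sample.cha) would count all of the phonological fragments in that file:

 

    freq +s"&*" sample.cha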

6.5      Incomplete and Omitted Words

Words may also be incomplete or even fully omitted. We can judge a word to be incomplete when enough of it is produced for us to be sure what was intended. Judging a word to be omitted is often much more difficult.

Noncompletion of a Word                text(text)text

When a word is incomplete, but the intended meaning seems clear, insert the missing material within parentheses. Do not use this notation for fully omitted words, only for words with partial omissions. This notation can also be used to derive a consistent spelling for commonly shortened words, such as (un)til and (be)cause. CLAN will treat items that are coded in this way as full words. For programs such as FREQ, the parentheses will essentially be ignored and (be)cause will be treated as if it were because. The CLAN programs also provide ways of either including or excluding the material in the parentheses, depending on the goals of the analysis.

 

*RAL:     I been sit(ting) all day.

 

The inclusion or exclusion of material enclosed in parentheses is well supported by CLAN and this same notation can also be used for other purposes when necessary. For example, studies of fluency may find it convenient to code the number of times that a word is repeated directly on that word, as in this example with three repetitions of the word dog.

 

*JEF:     that's a dog [x 3].

 

By default, the programs will remove the [x 3] form and the sentence will be treated as a three word utterance.  This behavior can be modified by using the +r switch.

Omitted Word                                   0word 

The coding of word omissions is an extremely difficult and unreliable process. Many researchers will prefer not to even open up this particular can of worms. On the other hand, researchers in language disorders and aphasia often find that the coding of word omissions is crucial to particular theoretical issues. In such cases, it is important that the coding of omitted words be done in as clear a manner as possible.

 

To code an omission, the zero symbol is placed before a word on the text tier. If what is important is not the actual word omitted, but its part of speech, then a code for the part of speech can follow the zero. In either case, the identity of the omitted word is always a guess, and the best guess is placed on the main line. The coded item is treated as a word for scoping conventions, but it is not included in the MLU count. Here is an example of its use:

 

*EVE:     I want 0to go.

 

It is very difficult to know when a word has been omitted. However, the following criteria can be used to help make this decision for English data:

1.     0art: Unless there is a missing plural, a common noun without an article is coded as 0art.

2.     0v: Sentences with no verbs can be coded as having missing verbs. Of course, often the omission of a verb can be viewed as a grammatical use of ellipsis.

3.     0aux:    In standard English, sentences like “he running” clearly have a missing auxiliary.

4.     0subj: In English, every finite verb requires a subject.

5.     0pobj: Every preposition requires an object. However, often a preposition may be functioning as an adverb. The coder must look at the verb to decide whether a word is functioning as a preposition as in “John put on 0pobj” or an adverb as in “Mary jumped up.”

In English, there are seldom solid grounds for assigning codes like 0adj, 0adv, 0obj, 0prep, or 0dat.
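
To illustrate criteria 1 and 3 above, a hypothetical utterance such as “he running to park” could be coded as:

 

*CHI:     he 0aux running to 0art park.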

6.6      Standardized Spellings

There are a number of common words in the English language that cannot be found in the dictionary or whose lexical status is vague. For example, how should letters be spelled? What about numbers and titles? What is the best spelling: doggy or doggie, yeah or yah, and pst or pss? If we can increase the consistency with which such forms are transcribed, we can improve the quality of automatic lexical analyses. CLAN commands such as FREQ and COMBO provide output based on searches for particular word strings. If a word is spelled in an indeterminate number of variant ways, researchers who attempt to analyze the occurrence of that word will inevitably end up with inaccurate results. For example, if a researcher wants to trace the use of the pronoun you, it might be necessary to search not only for you, ya, and yah, but also for all the assimilations of the pronoun with verbs, such as didya/dicha/didcha or couldya/couldcha/coucha. Without a standard set of rules for the transcription of such forms, accurate lexical searches could become impossible. On the other hand, there is no reason to avoid using these forms if a set of standards can be established for their use. Other programs rely on the use of dictionaries of words. If the spellings of words are indeterminate, the analyses produced will be equally indeterminate. For that reason, it is helpful to specify a set of standard spellings for marginal words. This section lists some of these words with their standard orthographic form.

 

The forms in these lists all have some conventional lexical status in standard American English. In this regard, they differ from the various nonstandard forms indicated by the special form markers @b, @c, @f, @l, @n, @o, @p, and @s. Because there is no clear limit to the number of possible babbling forms, onomatopoeic forms, or neologistic forms, there is no way to provide a list of such forms. In contrast, the words given in this section are fairly well known to most speakers of the language, and many can be found in unabridged dictionaries. The list given here is only a beginning; over time, we intend to continue to add new forms.

 

Some of the forms use parentheses to indicate optional material. For example, the exclamation yeek can also be said as eek. When a speaker uses the full form, the transcriber types in yeek, and when the speaker uses the reduced form the transcriber types (y)eek. When CLAN analyzes the transcripts, the parentheses can be ignored and both yeek and eek will be retrieved as instances of the same word. Parentheses can also be used to indicate missing fragments of suffixes. The majority of the words listed can be found in the form given in Webster's Third New International Dictionary. Those forms that cannot be found in Webster's Third are indicated with an asterisk. The asterisk should not be used in actual transcription.

6.6.1     Letters

To transcribe letters, use the @l symbol after the letter. For example, the letter “b” would be b@l. Here is an example of the spelling of a letter sequence.

 

*MOT:     could you please spell your name?

*MAR:     it's m@l a@l r@l k@l.

 

The dictionary says that “abc” is a standard word, so that is accepted without the @l marking.  In Japanese, many letters refer to whole syllables or “kana,” such as ro or ka.  To represent these, as well as strings of letters in English, use the @k symbol, as in ka@k or jklmn@k. Using this form, the above example could better be coded as:

 

*MOT:     could you please spell your name?

*MAR:     it's mark@k.

6.6.2     Compounds and Linkages

Languages use a variety of methods for combining words into larger lexical items. One method involves inflectional processes, such as cliticization and affixation, that will be discussed later.  Here we consider compounds and linkages. 

 

Earlier, it was necessary to write compounds in the form of bird+house and baby+sitter, but now the plus is no longer necessary.  You can just write birdhouse and babysitter and the correct form will be inserted into the %mor line by the MOR program.
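
As a rough sketch of the result (the exact part-of-speech codes depend on the version of the MOR grammar, so this rendering is only approximate), a compound analyzed by MOR might appear as:

 

*CHI:     birdhouse.

%mor:     n|+n|bird+n|house .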

 

The other level of concatenation involves the use of an underscore to indicate that a phrasal combination is not really a compound, but what we call a “linkage”.  Common examples here include titles of books such as Green_Eggs_and_Ham, appellations such as Little_Bo_Peep or Santa_Claus, lines from songs such as The_Farmer_in_the_Dell, and places such as Hong_Kong_University. For these forms, the underscore is used to emphasize the fact that, although the form is collocational, it does not obey standard rules of compound formation. Because these forms all begin with a capital letter, the morphological analyzer will recognize them as proper nouns. The underscore is used for two other purposes.  First, it can be used for irregular combinations, such as how_about and how_come.  Second, it can be used on the %mor line to represent a multiword English gloss for a single stem, as in “lose_flowers” for defleurir.

 

Because the dash is used on the %mor line to indicate suffixation, it is important to avoid confusion between the standard use of the dash in compounds such as “blue-green” and the use of the dash in CHAT. To do this, use the compound marker to replace the dash or hyphen, as in blue+green instead of blue-green.

6.6.3     Capitalization

In general, capitals are only allowed at the beginnings of words.  However, they can also occur later in a word in these cases:

  1. The @Options tier includes "sign" or "CA".
  2. The word ends with @u.
  3. The capital follows the + symbol.
  4. The capital follows the _ (underscore) symbol.
  5. The word is listed on the @Exceptions tier (see the example below).
  6. The capital is preceded by a prefix, such as "Mac", that is specified in the depfile with a [UPREFS Mac] code.
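
To illustrate case 5, a word containing an internal capital (the example word here is hypothetical) could be licensed by listing it on the @Exceptions header tier and then used on the main line:

 

@Exceptions:     iPad

*CHI:     I want the iPad.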

6.6.4     Acronyms

Acronyms should be transcribed by using the component letters as parts of a “linked” form. In these forms, the @l marking is not used, since it would make the acronym unreadable. Thus, USA is written as U_S_A. In this case, the first letter is capitalized in order to mark it as a proper noun. Other examples include M_I_T, C_M_U, M_T_V, E_T, I_U, C_three_P_O, R_two_D_two, and K_Mart. The recommended way of transcribing the common name for television is just tv. This form is not capitalized, since it is not a proper noun. Similarly, we can write cd, vcr, and dvd.  The underscore is also the best mark for combinations that are not true compounds, such as m_and_m-s for the M&M candy.

 

Acronyms that are not actually spelled out when produced in conversation should be written as words. Thus UNESCO would be written as Unesco. The capitalization of the first letter is used to indicate the fact that it is a proper noun. There must be no periods inside acronyms and titles, because these can be confused with utterance delimiters.

6.6.5     Numbers and Titles

Numbers should be written out in words. For example, the number 256 could be written as “two hundred and fifty six,” “two hundred fifty six,” “two five six,” or “two fifty six,” depending on how it was pronounced. It is best to use the form “fifty six” rather than “fifty-six,” because the hyphen is used in CHAT to indicate morphemicization. Other strings with numbers are monetary amounts, percentages, times, fractions, logarithms, and so on. All should be written out in words, as in “eight thousand two hundred and twenty dollars” for $8220, “twenty nine point five percent” for 29.5%, “seven fifteen” for 7:15, “ten o'clock ay m” for 10:00 AM, and “four and three fifths.”

 

Titles such as Dr. or Mr. should be written out in their full capitalized form as Doctor or Mister, as in “Doctor Spock” and “Mister Rogers.” For “Mrs.” use the form “Missus.”

6.6.6     Kinship Forms

The following table lists some of the most important kinship address forms in standard American English. The forms with asterisks cannot be found in Webster's Third New International Dictionary.

 

Table 2: Kinship Forms

 

Child           Formal          Child           Formal
Da(da)          Father          Mommy           Mother
Daddy           Father          Nan             Grandmother
Gram(s)         Grandmother     Nana            Grandmother
Grammy          Grandmother     *Nonny          Grandmother
Gramp(s)        Grandfather     Pa              Father
*Grampy         Grandfather     Pap             Father
Grandma         Grandmother     Papa            Father
Grandpa         Grandfather     Pappy           Father
Ma              Mother          Pop             Father
Mama            Mother          Poppa           Father
Momma           Mother          *Poppy          Father
Mom             Mother

6.6.7     Shortenings

One of the biggest problems that the transcriber faces is the tendency of speakers to drop sounds out of words. For example, a speaker may leave the initial “a” off of “about,” saying instead “ 'bout.” In CHAT, this shortened form appears as (a)bout. CLAN can easily ignore the parentheses and treat the word as “about.” Alternatively, there is a CLAN option to allow the commands to treat the word as a spelling variant. Many common words have standard shortened forms. Some of the most frequent are given in the table that follows. The basic notational principle illustrated in that table can be extended to other words as needed. All of these words can be found in Webster's Third New International Dictionary.

 

More extreme types of shortenings include: “(what)s (th)at” which becomes “sat,” “y(ou) are” which becomes “yar,” and “d(o) you” which becomes “dyou.” Representing these forms as shortenings rather than as nonstandard words facilitates standardization and the automatic analysis of transcripts.

 

Two sets of contractions that cause particular problems for morphological analysis in English are final apostrophe s and apostrophe d, as in John’s and you’d.  If you transcribe these as John (ha)s and you (woul)d, then the MOR program will work much more efficiently.
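
For instance (the utterances here are hypothetical):

 

*MOT:     John (ha)s already eaten.

*MOT:     you (woul)d better hurry up.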

 

Table 3: Shortenings

 

Examples of Shortenings

(a)bout         don('t)         (h)is           (re)frigerator
an(d)           (e)nough        (h)isself       (re)member
(a)n(d)         (e)spress(o)    -in(g)          sec(ond)
(a)fraid        (e)spresso      nothin(g)       s(up)pose
(a)gain         (es)presso      (i)n            (th)e
(a)nother       (ex)cept        (in)stead       (th)em
(a)round        (ex)cuse        Jag(uar)        (th)emselves
ave(nue)        (ex)cused       lib(r)ary       (th)ere
(a)way          (e)xcuse        Mass(achusetts) (th)ese
(be)cause       (e)xcused       micro(phone)    (th)ey
(be)fore        (h)e            (pa)jamas       (to)gether
(be)hind        (h)er           (o)k            (to)mato
b(e)long        (h)ere          o(v)er          (to)morrow
b(e)longs       (h)erself       (po)tato        (to)night
Cad(illac)      (h)im           prob(ab)ly      (un)til
doc(tor)        (h)imself       (re)corder      wan(t)

The marking of shortened forms such as (a)bout in this way greatly facilitates the later analysis of the transcript, while still preserving readability and phonological accuracy. Learning to make effective use of this form of transcription is an important part of mastering use of CHAT. Underuse of this feature is a common error made by beginning users of CHAT.

6.6.8     Assimilations

Words such as “gonna” for “going to” and “whynt cha” for “why don't you” involve complex sound changes, often with assimilations between auxiliaries and the infinitive or a pronoun. For forms of this type, CHAT allows the transcriber to place the assimilated form on the main line followed by a fuller form in square brackets, as in the form:

 

    gonna [: going to]
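
In a full utterance, this replacement notation might appear as follows (a hypothetical example):

 

*CHI:     I'm gonna [: going to] win.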

 

CLAN allows the user to analyze either the material preceding the brackets or the material following the brackets, as described in the section of the chapter on options that discusses the +r switch. An extremely incomplete list of assimilated forms is given below. None of these forms can be found in Webster's Third New International Dictionary.

 

Table 4: Assimilations

 

Nonstandard     Standard        Nonstandard     Standard
coulda(ve)      could have      mighta          might have
dunno           don't know      need(t)a        need to
dyou            do you          oughta          ought to
gimme           give me         posta           supposed to
gonna           going to        shoulda(ve)     should have
gotta           got to          sorta           sort of
hadta           had to          sorta           sort of
hasta           has to