The CHILDES Project
Tools for Analyzing Talk – Electronic Edition
Part 1: The CHAT Transcription Format
Brian MacWhinney
Carnegie Mellon University
August 6, 2012
Citation for last printed version:
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition.
Mahwah, NJ: Lawrence Erlbaum Associates
1 Table of Contents
2 Introduction
2.1 Impressionistic Observation
2.2 Baby Biographies
2.3 Transcripts
2.4 Computers
2.5 Connectivity
2.6 Three Tools
2.7 Shaping CHAT
2.8 Building CLAN
2.9 Constructing the Database
2.10 Disseminating CHILDES
2.11 Funding
2.12 How to Use These Manuals
2.13 Changes
3 Principles
3.1 Computerization
3.2 Words of Caution
3.2.1 The Dominance of the Written Word
3.2.2 The Misuse of Standard Punctuation
3.2.3 Working With Video
3.3 Problems With Forced Decisions
3.4 Transcription and Coding
3.5 Three Goals
4 CHAT Outline
4.1 minCHAT – the Form of Files
4.2 minCHAT – Words and Utterances
4.3 Analyzing One Small File
4.4 midCHAT
4.5 The Documentation File
4.6 Checking Syntactic Accuracy
5 File Headers
5.1 Hidden Headers
5.2 Initial Headers
5.3 Participant-Specific Headers
5.4 Constant Headers
5.5 Changeable Headers
6 Words
6.1 The Main Line
6.2 Basic Words
6.3 Special Form Markers
6.4 Unidentifiable Material
6.5 Incomplete and Omitted Words
6.6 Standardized Spellings
6.6.1 Letters
6.6.2 Compounds and Linkages
6.6.3 Capitalization
6.6.4 Acronyms
6.6.5 Numbers and Titles
6.6.6 Kinship Forms
6.6.7 Shortenings
6.6.8 Assimilations
6.6.9 Exclamations
6.6.10 Communicators
6.6.11 Spelling Variants
6.6.12 Colloquial Forms
6.6.13 Dialectal Variations
6.6.14 Baby Talk
6.6.15 Word separation in Japanese
6.6.16 Punctuation in French and Italian
6.6.17 Abbreviations in Dutch
7 Utterances
7.1 One Utterance or Many?
7.2 Discourse Repetition
7.3 Basic Utterance Terminators
7.4 Satellite Markers
7.5 Separators
7.6 Tone Direction
7.7 Prosody Within Words
7.8 Local Events
7.8.1 Simple Events
7.8.2 Complex Local Events
7.8.3 Pauses
7.8.4 Long Events
7.8.5 Interposed Back Channel
7.9 Special Utterance Terminators
7.10 Utterance Linkers
8 Scoped Symbols
8.1 Audio and Video Time Marks
8.2 Paralinguistic Scoping and Events
8.3 Explanations and Alternatives
8.4 Retracing, Overlap, and Clauses
8.5 Error Marking
8.6 Initial and Final Codes
9 Dependent Tiers
9.1 Standard Dependent Tiers
9.2 Synchrony Relations
10 CHAT-CA Transcription
11 Arabic Transcription
12 Specific Applications
12.1 Code-Switching
12.2 Elicited Narratives and Picture Descriptions
12.3 Written Language
12.4 Children With Disfluencies
13 Speech Act Codes
13.1 Interchange Types
13.2 Illocutionary Force Codes
14 Error Coding
14.1 Word level error codes summary
14.2 Word level coding – details
14.3 Utterance level error coding (post-codes)
15 Morphosyntactic Coding
15.1 One-to-one correspondence
15.2 Tag Groups and Word Groups
15.3 Words
15.4 Part of Speech Codes
15.5 Stems
15.6 Affixes
15.7 Clitics
15.8 Compounds
15.9 Sample Morphological Tagging for English
References
2 Introduction
Language acquisition
research thrives on data collected from spontaneous interactions in naturally
occurring situations. You can turn on a tape recorder or video camera, and, before
you know it, you will have accumulated a library of dozens or even hundreds of
hours of naturalistic interactions. But simply collecting data is only the
beginning of a much larger task, because the process of transcribing and
analyzing naturalistic samples is extremely time-consuming and often
unreliable. In this first volume, we will present a set of computational tools
designed to increase the reliability of transcriptions, automate the process of
data analysis, and facilitate the sharing of transcript data. These new
computational tools have brought about revolutionary changes in the way that
research is conducted in the child language field. In addition, they have
equally revolutionary potential for the study of second-language learning,
adult conversational interactions, sociological content analyses, and language
recovery in aphasia. Although the tools are of wide applicability, this volume
concentrates on their use in the child language field, in the hope that
researchers from other areas can make the necessary analogies to their own
topics.
Before turning to a
detailed examination of the current system, it may be helpful to take a brief
historical tour over some of the major highlights of earlier approaches to the
collection of data on language acquisition. These earlier approaches can be
grouped into five major historical periods.
2.1 Impressionistic Observation
The first attempt to
understand the process of language development appears in a remarkable passage
from The Confessions of St. Augustine
(1952). In this passage, Augustine claims that
he remembered how he had learned language:
This I
remember; and have since observed how I learned to speak. It was not that my
elders taught me words (as, soon after, other learning) in any set method; but
I, longing by cries and broken accents and various motions of my limbs to
express my thoughts, that so I might have my will, and yet unable to express
all I willed or to whom I willed, did myself, by the understanding which Thou,
my God, gavest me, practise the sounds in my memory. When they named anything,
and as they spoke turned towards it, I saw and remembered that they called what
they would point out by the name they uttered. And that they meant this thing,
and no other, was plain from the motion of their body, the natural language, as
it were, of all nations, expressed by the countenance, glances of the eye,
gestures of the limbs, and tones of the voice, indicating the affections of the
mind as it pursues, possesses, rejects, or shuns. And thus by constantly
hearing words, as they occurred in various sentences, I collected gradually for
what they stood; and, having broken in my mouth to these signs, I thereby gave
utterance to my will. Thus I exchanged with those about me these current signs
of our wills, and so launched deeper into the stormy intercourse of human life,
yet depending on parental authority and the beck of elders.
Augustine's
outline of early word learning drew attention to the role of gaze, pointing,
intonation, and mutual understanding as fundamental cues to language
learning. Modern research in word
learning (P. Bloom, 2000) has supported every point of Augustine's
analysis, as well as his emphasis on the role of children's intentions. In this sense, Augustine's somewhat fanciful
recollection of his own language acquisition remained the high water mark for
child language studies through the Middle Ages and even the Enlightenment. Unfortunately,
the method on which these insights were grounded depends on our ability to
actually recall the events of early childhood – a gift granted to very
few of us.
2.2 Baby Biographies
Charles Darwin
provided much of the inspiration for the development of the second major
technique for the study of language acquisition. Using note cards and field
books to track the distribution of hundreds of species and subspecies in places
like the Galapagos and Indonesia, Darwin was able to collect an impressive
body of naturalistic data in support of his views on natural selection and
evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for
naturalistic observation could be adapted to the study of human development.
By taking detailed daily notes, Darwin showed how researchers could build
diaries that could then be converted into biographies documenting virtually any
aspect of human development. Following Darwin's lead, scholars such as Ament (1899), Preyer (1882), Gvozdev (1949), Szuman (1955), Stern & Stern (1907), Kenyeres (1926, 1938), and Leopold (1939, 1947, 1949a, 1949b) created monumental biographies detailing
the language development of their own children.
Darwin's biographical
technique also had its effects on the study of adult aphasia. Following in
this tradition, studies of the language of particular patients and syndromes
were presented by Low (1931), Pick (1913), Wernicke (1874), and many others.
The limits of the
diary technique were always quite apparent. Even the most highly trained
observer could not keep pace with the rapid flow of normal speech production.
Anyone who has attempted to follow a child about with a pen and a notebook
soon realizes how much detail is missed and how the note-taking process
interferes with the ongoing interactions.
2.3 Transcripts
The introduction of
the tape recorder in the late 1950s provided a way around these limitations
and ushered in the third period of observational studies. The effect of the tape
recorder on the field of language acquisition was very much like its effect on
ethnomusicology, where researchers such as Alan Lomax (Parrish, 1996) were suddenly able to produce high
quality field recordings using this new technology. This period was
characterized by projects in which groups of investigators collected large data
sets of tape recordings from several subjects across a period of 2 or 3 years.
Much of the excitement in the 1960s regarding new directions in child language
research was fueled directly by the great increase in raw data that was
possible through use of tape recordings and typed transcripts.
This increase in the
amount of raw data had an additional, seldom discussed, consequence. In the
period of the baby biography, the final published accounts closely resembled
the original database of note cards. In this sense, there was no major gap
between the observational database and the published database. In the period
of typed transcripts, a wider gap emerged. The size of the transcripts produced
in the 1960s and 1970s made it impossible to publish the full corpora. Instead,
researchers were forced to publish only high-level analyses based on data that
were not available to others. This led to a situation in which the raw
empirical database for the field was kept only in private stocks, unavailable
for general public examination. Comments and tallies were written into the
margins of ditto master copies, and new, even less legible copies were then
made by thermal production of new ditto masters. Each investigator devised a
project-specific system of transcription and project-specific codes. As we
began to compare hand-written and typewritten transcripts, problems in
transcription methodology, coding schemes, and cross-investigator reliability
became more apparent.
Recognizing this
problem, Roger Brown took the lead in attempting to share his transcripts from
Adam, Eve, and Sarah (Brown, 1973) with other researchers. These
transcripts were typed onto stencils and mimeographed in multiple copies. The
extra copies were lent to and analyzed by a wide variety of researchers. In
this model, researchers took their copy of the transcript home, developed their
own coding scheme, applied it (usually by making pencil markings directly on
the transcript), wrote a paper about the results and, if very polite, sent a
copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the
conclusions drawn from those data by Brown himself!
During this early
period, the relations between the various coding schemes often remained
shrouded in mystery. A fortunate consequence of the unstable nature of coding
systems was that researchers were very careful not to throw away their original
data, even after it had been coded. Brown himself commented on the impending
transition to computers in this passage (Brown, 1973, p. 53):
It is
sensible to ask and we were often asked, “Why not code the sentences for
grammatically significant features and put them on a computer so that studies
could readily be made by anyone?” My
answer always was that I was continually discovering new kinds of information
that could be mined from a transcription of conversation and never felt that I
knew what the full coding should be.
This was certainly the case and indeed it can be said that in the entire
decade since 1962 investigators have continued to hit upon new ways of
inferring grammatical and semantic knowledge or competence from free
conversation. But, for myself, I must, in candor, add that there was also a
factor of research style. I have little
patience with prolonged “tooling up” for research. I always want to get started. A better
scientist would probably have done more planning and used the computer. He can do so today, in any case, with
considerable confidence that he knows what to code.
With
the experience of three more decades of computerized analysis behind us, we now
know that the idea of reducing child language data to a set of codes and then
throwing away the original data is simply wrong. Instead, our goal must be to computerize the
data in a way that allows us to continually enhance it with new codes and
annotations. It is fortunate that Brown
preserved his transcript data in a form that allowed us to continue to work on
it. It is unfortunate, however, that the
original audiotapes were not kept.
2.4 Computers
Just as these data
analysis problems were coming to light, a major technological opportunity was
emerging in the shape of the powerful, affordable microcomputer. Microcomputer
word-processing systems and database programs allowed researchers to enter
transcript data into computer files that could then be easily duplicated,
edited, and analyzed by standard data-processing techniques. In 1981, when the
Child Language Data Exchange System (CHILDES) Project was first conceived, researchers
basically thought of computer systems as large notepads. Although researchers
were aware of the ways in which databases could be searched and tabulated, the
full analytic and comparative power of the computer systems themselves was not
yet fully understood.
Rather than serving
only as an “archive” or historical record, a focus on a shared database can
lead to advances in methodology and theory. However, to achieve these
additional advances, researchers first needed to move beyond the idea of a
simple data repository. At first, the possibility of utilizing shared
transcription formats, shared codes, and shared analysis programs shone only
as a faint glimmer on the horizon, against the fog and gloom of handwritten
tallies, fuzzy dittos, and idiosyncratic coding schemes. Slowly, against this
backdrop, the idea of a computerized data exchange system began to emerge. It
was against this conceptual background that CHILDES (the name uses a
one-syllable pronunciation) was conceived. The origin of the system can be
traced back to the summer of 1981 when Dan Slobin, Willem Levelt, Susan
Ervin-Tripp, and Brian MacWhinney discussed the possibility of creating an
archive for typed, handwritten, and computerized transcripts to be located at
the Max-Planck-Institut für Psycholinguistik in Nijmegen. In 1983, the
MacArthur Foundation funded meetings of developmental researchers in which
Elizabeth Bates, Brian MacWhinney, Catherine Snow, and other child language
researchers discussed the possibility of soliciting MacArthur funds to support
a data exchange system. In January of 1984, the MacArthur Foundation awarded a
two-year grant to Brian MacWhinney and Catherine Snow for the establishment of
the Child Language Data Exchange System. These funds provided for the entry of
data into the system and for the convening of a meeting of an advisory board.
Twenty child language researchers met for three days in Concord, Massachusetts
and agreed on a basic framework for the CHILDES system, which Catherine Snow
and Brian MacWhinney would then proceed to implement.
2.5 Connectivity
Since 1984, when the
CHILDES Project began in earnest, the world of computers has gone through a
series of remarkable revolutions, each introducing new opportunities and
challenges. The processing power of the home computer now dwarfs the power of
the mainframe of the 1980s; new machines are now shipped with built-in
audiovisual capabilities; and devices such as CD-ROMs and optical disks offer
enormous storage capacity at reasonable prices. This new hardware has now
opened up the possibility for multimedia access to digitized audio and video
from links inside the written transcripts. In effect, a transcript is now the
starting point for a new exploratory reality in which the whole interaction is
accessible from the transcript. Although researchers have just now begun to
make use of these new tools, the current shape of the CHILDES system reflects
many of these new realities. In the pages that follow, you will learn about
how we are using this new technology to provide rapid access to the database
and to permit the linkage of transcripts to digitized audio and video records,
even over the Internet. For further
ideas regarding this type of work, you may wish to connect to http://talkbank.org
where there are various extensions of the CHILDES project.
2.6 Three Tools
The reasons for
developing a computerized exchange system for language data are immediately
obvious to anyone who has produced or analyzed transcripts. With such a system,
we can:
1. automate the process of data analysis,
2. obtain better data in a consistent,
fully-documented transcription system, and
3. provide more data for more children from
more ages, speaking more languages.
The CHILDES system has addressed each of these goals by developing three separate,
but integrated, tools. The first tool is the CHAT
transcription and coding format. The second tool is the CLAN analysis program,
and the third tool is the database. These three tools are like the legs of a
three-legged stool. The transcripts in the database have all been put into the CHAT transcription system. The CLAN program
is designed to make full use of the CHAT
format to facilitate a wide variety of searches and analyses. Many research
groups are now using the CHILDES programs to enter new data sets. Eventually,
these new data sets will be available to other researchers as a part of the
growing CHILDES database. In this way, CHAT,
CLAN, and the database function as a coarticulated set of complementary tools.
There are manuals for
each of the three CHILDES tools. The
CHAT manual, which you are now reading, describes the conventions and
principles of CHAT transcription. The CLAN manual describes the use of the CLAN
computer programs that you can use to transcribe, annotate, and analyze
language interactions. The third manual, which is actually a collection of over
a dozen separate manuals retrievable from a single link on the web, describes
the data files in the CHILDES database.
Each of these database manuals describes the data sets in one major
component of the database. In addition,
there is a short manual that provides an overview for the entire database.
2.7 Shaping CHAT
We received a great
deal of extremely helpful input during the years between 1984 and 1988 when the
CHAT system was being formulated. Some of the most detailed comments came from
George Allen, Elizabeth Bates, Nan Bernstein Ratner, Giuseppe Cappelli, Annick
De Houwer, Jane Desimone, Jane Edwards, Julia Evans, Judi Fenson, Paul
Fletcher, Steven Gillis, Kristen Keefe, Mary MacWhinney, Jon Miller, Barbara
Pan, Lucia Pfanner, Kim Plunkett, Kelley Sacco, Catherine Snow, Jeff Sokolov,
Leonid Spektor, Joseph Stemberger, Frank Wijnen, and Antonio Zampolli. Comments
developed in Edwards (1992) were useful in shaping core aspects of
CHAT. George Allen (1988) helped develop the UNIBET and PHONASCII
systems. The workers in the LIPPS Group (LIPPS, 2000) have developed extensions of CHAT to
cover code-switching phenomena. Adaptations of CHAT to deal with data on
disfluencies are developed in Bernstein-Ratner, Rooney, and
MacWhinney (1996). The exercises in Chapter 7 of Part II are based on
materials originally developed by Barbara Pan for Chapter 2 of Sokolov &
Snow (1994).
In the period between
2001 and 2004, we converted much of the CHILDES system to work with the new XML
Internet data format. This work was
begun by Romeo Anghelache and completed by Franklin Chen. Support for this major
reformatting and the related tightening of the CHAT format came from the NSF
TalkBank Infrastructure project which involved a major collaboration with
Steven Bird and Mark Liberman of the Linguistic Data Consortium. Ongoing work
in TalkBank is documented on the web at http://talkbank.org.
2.8 Building CLAN
The CLAN program is
the brainchild of Leonid Spektor. Ideas for particular analysis commands came
from several sources. Bill Tuthill's HUM package provided ideas about
concordance analyses. The SALT system of Miller & Chapman (1983) provided guidelines regarding basic
practices in transcription and analysis. Clifton Pye's PAL program provided
ideas for the MODREP and PHONFREQ commands.
Darius Clynes ported
CLAN to the Macintosh. Jeffrey Sokolov wrote the CHIP program. Mitzi Morris
designed the MOR analyzer using specifications provided by Roland Hauser of
Erlangen University. Norio Naka and Susanne Miyata developed a MOR rule
system for Japanese; and Monica Sanz-Torrent helped develop the MOR system for
Spanish. Julia Evans provided recommendations for the design of the audio and
visual capabilities of the editor.
Johannes Wagner, Mike Forrester, and Chris Ramsden helped show us how we could
modify CLAN to permit transcription in the Conversation Analysis framework.
Steven Gillis provided suggestions for aspects of MODREP. Christophe Parisse built the POST and
POSTTRAIN programs (Parisse & Le Normand, 2000). Brian Richards contributed the VOCD
program (Malvern, Richards, Chipere, & Durán,
2004). Julia Evans helped
specify TIMEDUR and worked on the details of DSS. Catherine Snow designed
CHAINS, KEYMAP, and STATFREQ. Nan
Bernstein Ratner specified aspects of PHONFREQ and plans for additional
programs for phonological analysis.
The primary reason
for the success of the CHILDES database has been the generosity of over 100
researchers who have contributed their corpora. Each of these corpora
represents hundreds, often thousands, of hours spent in careful collection,
transcription, and checking of data. All researchers in child language should
be proud of the way researchers have generously shared their valuable data with
the whole research community. The growing size of the database for language
impairments, adult aphasia, and second-language acquisition indicates that
these related areas have also begun to understand the value of data sharing.
Many of the corpora
contributed to the system were transcribed before the formulation of CHAT. In
order to create a uniform database, we had to reformat these corpora into CHAT.
Jane Desimone, Mary MacWhinney, Jane Morrison, Kim Roth, Kelley Sacco, and
Gergely Sikuta worked many long hours on this task. Steven Gillis, Helmut
Feldweg, Susan Powers, and Heike Behrens supervised a parallel effort with the
German and Dutch data sets.
Because of the
continually changing shape of the programs and the database, keeping this
manual up to date has been an ongoing activity. In this process, I received
help from Mike Blackwell, Julia Evans, Kris Loh, Mary MacWhinney, Lucy Hewson,
Kelley Sacco, and Gergely Sikuta. Barbara Pan, Jeff Sokolov, and Pam Rollins
also provided a reading of the final draft of the 1995 version of the manual.
Since the beginning
of the project, Catherine Snow has continually played a pivotal role in shaping
policy, building the database, organizing workshops, and determining the shape
of CHAT and CLAN. Catherine Snow
collaborated with Jeffrey Sokolov, Pam Rollins, and Barbara Pan to construct a
series of tutorial exercises and demonstration analyses that appeared in
Sokolov & Snow (1994). Those exercises form the basis for
similar tutorial sections in the current manual. Catherine Snow has
contributed six major corpora to the database and has conducted CHILDES
workshops in a dozen countries.
Several other
colleagues have helped disseminate the CHILDES system through workshops,
visits, and Internet facilities. Hidetosi Sirai established a CHILDES file
server mirror at Chukyo University in Japan and Steven Gillis established a
mirror at the University of Antwerp. Steven Gillis, Kim Plunkett, Johannes
Wagner, and Sven Strömqvist helped propagate the CHILDES system at universities
in Northern and Central Europe. Susanne Miyata has brought together a vital
group of child language researchers using CHILDES to study the acquisition of
Japanese and has supervised the translation of the current manual into
Japanese. In Italy, Elena Pizzuto organized symposia for developing the CHILDES
system and has supervised the translation of the manual into Italian. Magdalena
Smoczynska in Krakow and Wolfgang Dressler in Vienna have helped new
researchers who are learning to use CHILDES for languages spoken in Eastern
Europe. Miquel Serra has supported a series of CHILDES workshops in Barcelona.
Zhou Jing organized a workshop in Nanjing and Chien-ju Chang organized a
workshop in Taipei.
From 1984 to 1988,
the John D. and Catherine T. MacArthur Foundation supported the CHILDES
Project. In 1988, the National Science Foundation provided an equipment grant
that allowed us to put the database on the Internet and on CD-ROMs. From 1989
to 2010, the project has been supported by an ongoing grant from the National
Institutes of Health (NICHD). In 1998, the National Science Foundation
Linguistics Program provided additional support to improve the programs for
morphosyntactic analysis of the database. In 1999, NSF funded the TalkBank
project which seeks to improve the CHILDES tools and to use CHILDES as a model
for other disciplines studying human communication. In 2002, NSF provided
support for the development of the GRASP system for parsing of the
corpora. In 2002, NIH provided
additional support for the development of PhonBank for child language phonology
and AphasiaBank for the study of communication in aphasia.
Each of the three
parts of the CHILDES system is described in a separate manual. The CHAT manual describes the conventions and
principles of CHAT transcription. The CLAN manual describes the use of the
editor and the analytic commands. The database manual is a set of over a dozen
smaller documents, each describing a separate segment of the database.
To learn the CHILDES
system, you should begin by downloading and installing the CLAN program. Next, you should download and start to read the
current manual (CHAT Manual) and the CLAN manual. Before proceeding too far into the CHAT
manual, you will want to walk through the tutorial section at the beginning of
the CLAN manual. After finishing the tutorial, try working a bit with each of
the CLAN commands to get a feel for the overall scope of the system. You can
then learn more about CHAT by transcribing a small sample of your data in a
short test file. Run the CHECK program at frequent intervals to verify the
accuracy of your coding. Once you have finished transcribing a small segment
of your data, try out the various analysis programs you plan to use, to make
sure that they provide the types of results you need for your work.
If you are primarily
interested in analyzing data already stored in the CHILDES archive, you do not
need to learn the CHAT transcription format in much detail and you will only
need to use the editor to open and read files. In that case, you may wish to
focus your efforts on learning to use the CLAN programs. If you plan to
transcribe new data, then you also need to work with the current manual to
learn to use CHAT.
Teachers will also
want to pay particular attention to the sections of the CLAN manual that
present a tutorial introduction. Using some of the examples given there, you
can construct additional materials to encourage students to explore the
database to test out particular hypotheses.
At the end of the CLAN manual, there is also a series of exercises that
help students further consolidate their knowledge of CHAT and CLAN.
The CHILDES system
was not intended to address all issues in the study of language learning, or to
be used by all students of spontaneous interactions. The CHAT system is comprehensive, but it is
not ideal for all purposes. The programs are powerful, but they cannot solve
all analytic problems. It is not the goal of CHILDES to provide facilities for
all research endeavors or to force all research into some uniform mold. On the
contrary, the programs are designed to offer support for alternative analytic
frameworks. For example, the editor now supports the various codes of
Conversation Analysis (CA) format, as alternatives and supplements to CHAT
format.
There are many researchers in the fields that
study language learning who will never need to use CHILDES. Indeed, we estimate
that the three CHILDES tools will never be used by at least half of the
researchers in the field of child language. There are three common reasons why
individual researchers may not find CHILDES useful:
1. some researchers may have already
committed themselves to use of another analytic system;
2. some researchers may have collected so
much data that they can work for many years without needing to collect more
data and without comparing their own data with other researchers' data; and
3. some researchers may not be interested in
studying spontaneous speech data.
Of
these three reasons for not needing to use the three CHILDES tools, the third
is the most frequent. For example, researchers studying comprehension would
only be interested in CHILDES data when they wish to compare findings arising
from studies of comprehension with patterns occurring in spontaneous
production.
The CHILDES tools
have been extensively tested for ease of application, accuracy, and reliability.
However, change is fundamental to any research enterprise. Researchers are constantly
pursuing better ways of coding and analyzing data. It is important that the
CHILDES tools keep pace with these changing requirements. For this reason,
there will be revisions to CHAT,
the programs, and the database as long as the CHILDES Project is active.
The CHAT system provides a standardized
format for producing computerized transcripts of face-to-face conversational
interactions. These interactions may involve children and parents, doctors and
patients, or teachers and second-language learners. Despite the differences
between these interactions, there are enough common features to allow for the
creation of a single general transcription system. The system described here
is designed for use with both normal and disordered populations. It can be used
with learners of all types, including children, second-language learners, and
adults recovering from aphasic disorders. The system provides options for
basic discourse transcription as well as detailed phonological and
morphological analysis. The system bears the acronym “CHAT,” which stands for Codes for the Human Analysis of
Transcripts. CHAT is the standard
transcription system for the CHILDES (Child Language Data Exchange System)
Project. All of the transcripts in the CHILDES database are in CHAT format.
What makes CHAT
particularly powerful is the fact that
files transcribed in CHAT can also be analyzed by the CLAN programs that are
described in the CLAN manual, which is an electronic companion piece to this
manual. The CLAN programs can track a wide variety of structures, compute
automatic indices, and analyze morphosyntax.
Moreover, because all CHAT files can now also be translated to a highly
structured form of XML (a language used for text documents on the web), they
are now also compatible with a wide range of other powerful computer programs
such as ELAN, Praat, EXMARaLDA, Phon, Transcriber, and so on.
The CHILDES system
has had a major impact on the study of child language. At the time of the last
monitoring in 2003, there were over 2000 published articles that had made use
of the programs and database. In 2007,
the size of the database had grown to over 44 million words, making it by far the
largest database of conversational interactions available anywhere. The total number of researchers who have
joined as CHILDES members across the length of the project is now over 4500. Of
course, not all of these people are making active use of the tools at all
times. However, it is safe to say that, at any given point in time,
approximately 100 groups of researchers around the world are involved in new
data collection and transcription using the CHAT
system. Eventually the data collected in these various projects will all
be contributed to the database.
Public inspection of
experimental data is a crucial prerequisite for serious scientific progress.
Imagine how genetics would function if every experimenter had his or her own
individual strain of peas or drosophila and refused to allow them to be tested
by other experimenters. What would happen in geology, if every scientist kept
his or her own set of rock specimens and refused to compare them with those of
other researchers? In some fields the basic phenomena in question are so
clearly open to public inspection that this is not a problem. The basic facts
of planetary motion are open for all to see, as are the basic facts underlying
Newtonian mechanics.
Unfortunately, in
language studies, a free and open sharing and exchange of data has not always
been the norm. In earlier decades, researchers jealously guarded their field
notes from a particular language community or subject type, refusing to share
them openly with the broader community. Various justifications were given for
this practice. It was sometimes claimed that other researchers would not fully
appreciate the nature of the data or that they might misrepresent crucial
patterns. Sometimes, it was claimed that only someone who had actually
participated in the community or the interaction could understand the nature
of the language and the interactions. In some cases, these limitations were
real and important. However, all such restrictions on the sharing of data
inevitably impede the progress of the scientific study of language learning.
Within the field of
language acquisition studies it is now understood that the advantages of
sharing data outweigh the potential dangers. The question is no longer whether
data should be shared, but rather how they can be shared in a reliable and
responsible fashion. The computerization of transcripts opens up the
possibility for many types of data sharing and analysis that otherwise would
have been impossible. However, the full exploitation of this opportunity
requires the development of a standardized system for data transcription and
analysis.
Before examining the CHAT system, we need to consider some
dangers involved in computerized transcriptions. These dangers arise from the
need to compress a complex set of verbal and nonverbal messages into the
extremely narrow channel required for the computer. In most cases, these
dangers also exist when one creates a typewritten or handwritten transcript.
Let us look at some of the dangers surrounding the enterprise of transcription.
Perhaps the greatest
danger facing the transcriber is the tendency to treat spoken language as if
it were written language. The decision to write out stretches of vocal material
using the forms of written language can trigger a variety of theoretical
commitments. As Ochs (1979) showed so clearly, these decisions will
inevitably turn transcription into a theoretical enterprise. The most
difficult bias to overcome is the tendency to map every form spoken by a
learner – be it a child, an aphasic, or a second-language learner –
onto a set of standard lexical items in the adult language. Transcribers tend
to assimilate nonstandard learner strings to standard forms of the adult
language. For example, when a child says “put on my jamas,” the transcriber may
instead enter “put on my pajamas,” reasoning unconsciously that “jamas” is
simply a childish form of “pajamas.” This type of regularization of the child
form to the adult lexical norm can lead to misunderstanding of the shape of the
child's lexicon. For example, it could be the case that the child uses “jamas”
and “pajamas” to refer to two very different things (Clark, 1987; MacWhinney, 1989).
There are two types
of errors possible here. One involves mapping a learner's spoken form onto an
adult form when, in fact, there was no real correspondence. This is the problem
of overnormalization. The second type of error involves failing to map a
learner's spoken form onto an adult form when, in fact, there is a
correspondence. This is the problem of undernormalization. The goal of
transcribers should be to avoid both the Scylla of overnormalization and the
Charybdis of undernormalization. Steering a course between these two dangers is
no easy matter. A transcription system can provide devices to aid in this process,
but it cannot guarantee safe passage.
Transcribers also
often tend to assimilate the shape of sounds spoken by the learner to the
shapes that are dictated by morphosyntactic patterns. For example, Fletcher (1985) noted that both children and adults
generally produce “have” as “uv” before main verbs. As a result, forms like
“might have gone” assimilate to “mightuv gone.” Fletcher believed that younger
children have not yet learned to associate the full auxiliary “have” with the
contracted form. If we write the children's forms as “might have,” we then end
up mischaracterizing the structure of their lexicon. To take another example,
we can note that, in French, the various endings of the verb in the present
tense are distinguished in spelling, whereas they are homophonous in speech. If
a child says /mɑ̃ʒ/ “eat,” are we to transcribe it as first
person singular mange, as second
person singular manges, or as the
imperative mange? If the child says /mɑ̃ʒe/, should we transcribe it as the
infinitive manger, the participle mangé, or the second person formal mangez?
CHAT deals with these
problems in three ways. First, it uses
IPA as a uniform way of transcribing discourse phonetically. Second, the editor allows the user to link
the digitized audio record of the interaction directly to the transcript. This is the system called “sonic CHAT.” With
these sonic CHAT links, it is possible to double-click on a sentence and hear
its sound immediately. Having the actual
sound produced by the child directly available in the transcript takes some of
the burden off of the transcription system. However, whenever computerized
analyses are based not on the original audio signal but on transcribed
orthographic forms, one must continue to understand the limits of transcription
conventions. Third, for those who wish to avoid the work involved in IPA
transcription or sonic CHAT, there is a system for using nonstandard lexical
forms, so that the form “might (h)ave” would be universally recognized as the
spelling of “mightuv,” the contracted form of “might have.” More extreme cases
of phonological variation can be annotated as in this example: popo [: hippopotamus].
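The two notations just mentioned can be illustrated with a minimal Python sketch. It is not part of CLAN; the function names spoken_form and target_form are hypothetical, and the patterns cover only the simple cases shown in the text: parenthesized omitted material, and a bracketed [: ...] annotation following a single word.

```python
import re

# Hypothetical helpers for illustration only; they are not CLAN commands.
OMITTED = re.compile(r"\((\w+)\)")                 # (h)ave, (be)cause
REPLACEMENT = re.compile(r"(\S+) \[: ([^\]]+)\]")  # popo [: hippopotamus]

def spoken_form(text):
    """Keep what the speaker produced: drop omitted material and target forms."""
    text = REPLACEMENT.sub(r"\1", text)
    return OMITTED.sub("", text)

def target_form(text):
    """Recover the standard form: restore omitted material, apply replacements."""
    text = REPLACEMENT.sub(r"\2", text)
    return OMITTED.sub(r"\1", text)
```

Under these assumptions, spoken_form("might (h)ave gone") gives "might ave gone" and target_form gives "might have gone"; for "popo [: hippopotamus]" the two functions give "popo" and "hippopotamus".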
Transcribers have a
tendency to write out spoken language with the punctuation conventions of
written language. Written language is organized into clauses and sentences delimited
by commas, periods, and other marks of punctuation. Spoken language, on the
other hand, is organized into tone units clustered about a tonal nucleus and
delineated by pauses and tonal contours (Crystal, 1969, 1979; Halliday, 1966,
1967, 1968). Work on the discourse basis of sentence production (Chafe, 1980; Jefferson, 1984) has demonstrated a close link between
tone units and ideational units. Retracings, pauses, stress, and all forms of
intonational contours are crucial markers of aspects of the utterance planning
process. Moreover, these features also convey important sociolinguistic information.
Without special markings or conventions, there is no way to directly indicate
these important aspects of interactions.
Whatever form a
transcript may take, it will never contain a fully accurate record of what went
on in an interaction. A transcript of an interaction can never fully replace an
audiotape, because an audio recording of the interaction will always be more
accurate in terms of preserving the actual details of what transpired. By the
same token, an audio recording can never preserve as much detail as a video
recording with a high-quality audio track. Audio recordings record none of the
nonverbal interactions that often form the backbone of a conversational
interaction. Hence, they systematically exclude a source of information that is
crucial for a full interpretation of the interaction. Although there are biases
involved even in a video recording, it is still the most accurate record of an
interaction that we have available. For those who are trying to use
transcription to capture the full detailed character of an interaction, it is
imperative that transcription be done from a video recording which should be
repeatedly consulted during all phases of analysis.
When the CLAN editor
is used to link transcripts to audio recordings, we refer to this as sonic
CHAT. When the system is used to link transcripts to video recordings, we refer
to this as video CHAT. The CLAN manual explains how to link digital audio and
video to transcripts.
Transcription and
coding systems often force the user to make difficult distinctions. For
example, a system might make a distinction between grammatical ellipsis and
ungrammatical omission. However, it may often be the case that the user cannot
decide whether an omission is grammatical or not. In that case, it may be
helpful to have some way of blurring the distinction. CHAT has certain symbols that can be used when a
categorization cannot be made. It is important to remember that many of the CHAT symbols are entirely optional. Whenever
you feel that you are being forced to make a distinction, check the manual to
see whether the particular coding choice is actually required. If it is not
required, then simply omit the code altogether.
It is important to recognize
the difference between transcription
and coding. Transcription focuses on
the production of a written record that can lead us to understand, albeit only
vaguely, the flow of the original interaction. Transcription must be done
directly off an audiotape or, preferably, a videotape. Coding, on the other
hand, is the process of recognizing, analyzing, and taking note of phenomena in
transcribed speech. Coding can often be done by referring only to a written
transcript. For example, the coding of parts of speech can be done directly
from a transcript without listening to the audiotape. For other types of
coding, such as speech act coding, it is imperative that coding be done while
watching the original videotape.
The CHAT system includes conventions for
both transcription and coding. When first learning the system, it is best to
focus on learning how to transcribe. The CHAT
system offers the transcriber a large array of coding options. Although few
transcribers will need to use all of the options, everyone needs to understand
how basic transcription is done on the “main line.” Additional coding is done
principally on the secondary or “dependent” tiers. As transcribers work more
with their data, they will include further options from the secondary or “dependent”
tiers. However, the beginning user should focus first on learning to correctly
use the conventions for the main line. The manual includes several sample transcripts
to help the beginner in learning the transcription system.
Like other forms of
communication, transcription systems are subjected to a variety of
communicative pressures. The view of language structure developed by Slobin (1977) sees structure as emerging from the
pressure of three conflicting charges or goals. On the one hand, language is
designed to be clear. On the other
hand, it is designed to be processible
by the listener and quick and easy
for the speaker. Unfortunately, ease of production often comes in conflict with
clarity of marking. The competition between these three motives leads to a
variety of imperfect solutions that satisfy each goal only partially. Such
imperfect and unstable solutions characterize the grammar and phonology of
human language (Bates & MacWhinney, 1982). Only rarely does a solution succeed in
fully achieving all three goals.
Slobin's view of the
pressures shaping human language can be extended to analyze the pressures
shaping a transcription system. In many regards, a transcription system is much
like any human language. It needs to be clear in its markings of categories,
and still preserve readability and ease of transcription. However, unlike a
human language, a transcription system needs to address two different
audiences. One audience is the human audience of transcribers, analysts, and
readers. The other audience is the digital computer and its programs. In order
to successfully deal with these two audiences, a system for computerized
transcription needs to achieve the following goals:
1. Clarity: Every symbol used in the coding system
should have some clear and definable real-world referent. The relation between
the referent and the symbol should be consistent and reliable. Symbols that
mark particular words should always be spelled in a consistent manner. Symbols
that mark particular conversational patterns should refer to actual patterns
consistently observable in the data. In practice, codes will always have to
steer between the Scylla of overnormalization and the Charybdis of
undernormalization discussed earlier. Distinctions must avoid being either too
fine or too coarse. Another way of looking at clarity is through the notion of
systematicity. Systematicity is a simple extension of clarity across
transcripts or corpora. Codes, words, and symbols must be used in a consistent
manner across transcripts. Ideally, each code should always have a unique
meaning independent of the presence of other codes or the particular transcript
in which it is located. If interactions are necessary, as in hierarchical coding
systems, these interactions need to be systematically described.
2. Readability: Just as human language needs to be easy
to process, so transcripts need to be easy to read. This goal often runs
directly counter to the first goal. In the CHILDES system, we have attempted to
provide a variety of CHAT options
that will allow a user to maximize the readability of a transcript. We have
also provided CLAN tools that will allow a reader to suppress the less readable
aspects of a transcript when the goal of readability is more important than the
goal of clarity of marking.
3. Ease
of data entry: As
distinctions proliferate within a transcription system, data entry becomes
increasingly difficult and error-prone. There are two ways of dealing with this
problem. One method attempts to simplify the coding scheme and its categories.
The problem with this approach is that it sacrifices clarity. The second method
attempts to help the transcriber by providing computational aids. The CLAN
programs follow this path. They provide systems for the automatic checking of
transcription accuracy, methods for the automatic analysis of morphology and
syntax, and tools for the semiautomatic entry of codes. However, the basic
process of transcription has not been automated and remains the major task
during data entry.
CHAT provides both basic and advanced formats
for transcription and coding. The basic level of CHAT is called minCHAT.
New users should start by learning minCHAT.
This system looks much like other intuitive transcription systems that are in
general use in the fields of child language and discourse analysis. However,
eventually users will find that there is something they want to be able to code
that goes beyond minCHAT. At that
point, they should move on to learning midCHAT.
There are several
minimum standards for the form of a minCHAT
file. These standards must be followed for the CLAN commands to run
successfully on CHAT files:
1. Every line must end with a carriage
return.
2. The first line in the file must be an
@Begin header line.
3. The second line in the file must be an
@Languages header line. The languages
entered here use a three-letter ISO 639-3 code, such as “eng” for English.
4. The third line must be an @Participants
header line listing three-letter codes for each participant, the participant's
name, and the participant's role.
5. After the @Participants header come a set
of @ID headers providing further details for each speaker. These will be inserted automatically for you
when you run CHECK using escape-L.
6. The last line in the file must be an @End
header line.
7. Lines beginning with * indicate what was
actually said. These are called “main lines.” Each main line should code one
and only one utterance. When a speaker produces several utterances in a row,
code each with a new main line.
8. After the asterisk on the main line comes
a three-letter code in upper case letters for the participant who was the
speaker of the utterance being coded. After the three-letter code comes a colon
and then a tab.
9. What was actually said is entered
starting in the ninth column.
10. Lines beginning with the % symbol can
contain codes and commentary regarding what was said. They are called
“dependent tier” lines. The % symbol is
followed by a three-letter code in lowercase letters for the dependent tier
type, such as “pho” for phonology; a colon; and then a tab. The text of the
dependent tier begins after the tab.
11. Continuations of main lines and dependent
tier lines begin with a tab which is inserted automatically by the CLAN editor.
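The form requirements above can be sketched as a small Python check. This is only an illustration and not a substitute for the CHECK program: the function name check_form is hypothetical, the line ending is assumed to be a newline character, and the @ID headers are not examined.

```python
import re

MAIN_LINE = re.compile(r"^\*[A-Z]{3}:\t")   # rule 8: *CHI: followed by a tab
DEPENDENT = re.compile(r"^%[a-z]{3}:\t")    # rule 10: %pho: followed by a tab

def check_form(text):
    """Return a list of rule violations; an empty list means the checks passed."""
    problems = []
    if not text.endswith("\n"):
        problems.append("last line has no line ending")
    lines = text.splitlines()
    # Rules 2 to 4: the first three lines are fixed headers.
    expected = ["@Begin", "@Languages:", "@Participants:"]
    for number, prefix in enumerate(expected, start=1):
        if len(lines) < number or not lines[number - 1].startswith(prefix):
            problems.append(f"line {number} should start with {prefix}")
    # Rule 6: the file closes with @End.
    if not lines or lines[-1] != "@End":
        problems.append("last line should be @End")
    for number, line in enumerate(lines, start=1):
        if line.startswith("*"):
            if not MAIN_LINE.match(line):
                problems.append(f"line {number}: malformed main line")
        elif line.startswith("%"):
            if not DEPENDENT.match(line):
                problems.append(f"line {number}: malformed dependent tier")
        elif not line.startswith(("@", "\t")):
            # Rule 11: anything else must be a header or a tab continuation.
            problems.append(f"line {number}: unexpected line start")
    return problems

sample = (
    "@Begin\n"
    "@Languages:\teng\n"
    "@Participants:\tCHI Ross Child, FAT Brian Father\n"
    "*CHI:\twhy isn't Mommy coming?\n"
    "%com:\tMother usually picks Ross up around 4 PM.\n"
    "*FAT:\tdon't worry.\n"
    "@End\n"
)
```

With this sample, check_form(sample) returns an empty list. Writing the speaker code in lowercase without a tab on line 6 produces the single message "line 6: malformed main line".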
In addition to these
minimum requirements for the form of the file, there are certain minimum ways
in which utterances and words should be written on the main line:
1. Utterances should end with an utterance
terminator. The basic utterance terminators are the period, the exclamation mark,
and the question mark.
2. Commas can be used as needed to mark
phrasal junctions, but they are not used by the programs and have no sharp
prosodic definition. Similarly,
3. Use upper case letters only for proper
nouns and the word “I.” Do not use uppercase letters for the first words of
sentences. This will facilitate the identification of proper nouns.
4. Words should not contain capital letters
except at their beginning. Words should not contain numbers, unless these mark
tones.
5. Unintelligible words with an unclear
phonetic shape should be transcribed as xxx.
6. If you wish to note the phonological form
of an incomplete or unintelligible phonological string, write it out with an
ampersand, as in &guga.
7. Incomplete words can be written with the
omitted material in parentheses, as in (be)cause
and (a)bout.
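As a rough illustration, some of these main-line conventions can be tested mechanically. The Python sketch below is an assumption-laden simplification rather than the logic of CHECK: the function name check_utterance is hypothetical, only the three basic terminators are recognized, and any digit is reported even though digits are permitted when they mark tones.

```python
TERMINATORS = (".", "!", "?")

def check_utterance(utterance):
    """Return a list of problems found in the text of one main line."""
    problems = []
    if not utterance.endswith(TERMINATORS):
        problems.append("missing utterance terminator")
    body = utterance.rstrip(".!?")
    for word in body.replace(",", " ").split():
        if word.startswith("&"):
            continue  # phonological fragment such as &guga
        # Ignore the parentheses that mark omitted material, as in (be)cause.
        letters = word.replace("(", "").replace(")", "")
        if any(ch.isupper() for ch in letters[1:]):
            problems.append(f"capital letter inside word: {word}")
        if any(ch.isdigit() for ch in letters):
            problems.append(f"digit in word (allowed only for tones): {word}")
    return problems
```

For example, check_utterance("why isn't Mommy coming?") returns an empty list, whereas check_utterance("he said OK") reports both a missing terminator and a capital letter inside the word OK.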
Here
is a sample that illustrates these principles. This file is syntactically
correct and uses the minimum number of CHAT
conventions while still maintaining compatibility with the CLAN commands.
@Begin
@Languages: eng
@Participants: CHI Ross Child, FAT Brian Father
@ID: eng|macwhinney|CHI|2;10.10||||Target_Child|||
@ID: eng|macwhinney|FAT|35;2.||||Father|||
*CHI: why isn't Mommy coming?
%com: Mother usually picks Ross up around 4 PM.
*FAT: don't worry.
*FAT: she'll be here soon.
*CHI: good.
@End
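To show how the speaker codes tie the file together, the following Python sketch reads the @Participants header, counts the main lines for each speaker, and lists any speaker code that was never declared. The function names are hypothetical, and the sketch assumes that every @Participants entry has exactly three fields (code, name, role) and that tabs follow the colons.

```python
from collections import Counter

def parse_participants(header_line):
    """Map each speaker code to (name, role) from an @Participants header."""
    _, _, body = header_line.partition(":")
    participants = {}
    for entry in body.split(","):
        code, name, role = entry.split()  # assumes exactly three fields
        participants[code] = (name, role)
    return participants

def utterances_per_speaker(lines):
    """Count main lines (those starting with *) for each three-letter code."""
    counts = Counter()
    for line in lines:
        if line.startswith("*"):
            counts[line[1:4]] += 1
    return counts

sample = [
    "@Begin",
    "@Languages:\teng",
    "@Participants:\tCHI Ross Child, FAT Brian Father",
    "*CHI:\twhy isn't Mommy coming?",
    "%com:\tMother usually picks Ross up around 4 PM.",
    "*FAT:\tdon't worry.",
    "*FAT:\tshe'll be here soon.",
    "*CHI:\tgood.",
    "@End",
]

participants = parse_participants(sample[2])
counts = utterances_per_speaker(sample)
undeclared = sorted(set(counts) - set(participants))
```

Here counts records two utterances each for CHI and FAT, and undeclared is empty. A main line carrying a code that is missing from the @Participants header would appear in undeclared.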
For researchers who
are just now beginning to use CHAT
and CLAN, there is one single suggestion that can potentially save literally
hundreds of hours of wasted time. The suggestion is to transcribe and analyze
one single small file completely and perfectly before launching a major effort
in transcription and analysis. The idea is that you should learn just enough
about minCHAT and minCLAN to see
your path through these four crucial steps:
1. entry of a small set of your data into a
CHAT file,
2. successful running of the CHECK command
inside the editor to guarantee accuracy in your CHAT file,
3. development of a series of codes that will
interface with the particular CLAN commands most appropriate for your analysis,
and
4. running of the relevant CLAN commands, so
that you can be sure that the results you will get will properly test the
hypotheses you wish to develop.
If
you go through these steps first, you can guarantee in advance the successful
outcome of your project. You can avoid ending up in a situation in which you
have transcribed hundreds of hours of data in a way that does not match
correctly with the input requirements for CLAN.
After having learned
minCHAT, you are ready to learn
the basics of CLAN. To do this, you will want to work through the first
chapters of the CLAN manual focusing in particular on the CLAN tutorial. These
chapters will take you up to the level of minCLAN, which corresponds to the minCHAT level.
Once you have learned
minCHAT and minCLAN, you are ready to move on to the next levels, which are
midCHAT and midCLAN. Learning midCHAT involves mastering the
transcription of words and conversational features. In particular, the midCHAT
learner should work through the chapters on words, utterances, and scoped
symbols. Depending on the shape of the particular project, the transcriber may
then need to study additional chapters in this manual. For people working on large projects that
last many months, it is a good idea to eventually read all of the current
manual, although some sections that seem less relevant to the project can be
skimmed.
CHAT files typically record a conversational
sample collected from a particular set of speakers on a particular day.
Sometimes researchers study a small set of children repeatedly over a long
period of time. Corpora created using this method are referred to as
longitudinal studies. For such studies, it is best to break up CHAT files into one collection for each
child. This can be done just by creating file names that begin with the
three-letter code for the child, as in lea001.cha or eve15.cha. Each collection of
files from the children involved in a given study constitutes a corpus. A
corpus can also be composed of a group of files from different groups of
speakers when the focus is on a cross-sectional sampling of larger numbers of
language learners from various age groups. In either case, each corpus should
have a documentation file. This “readme” file should contain a basic set of
facts that are indispensable for the proper interpretation of the data by other
researchers. The minimum set of facts that should be in each readme file is
the following.
1. Acknowledgments. There should be a statement that asks
the user to cite some particular reference when using the corpus. For example,
researchers using the Adam, Eve, and Sarah corpora from Roger Brown and his colleagues
are asked to cite Brown (1973). In addition, all users can cite this
current manual as the source for the CHILDES system in general.
2. Restrictions.
If the data are being
contributed to the CHILDES system, contributors can set particular
restrictions on the use of their data. For example, researchers may ask that
they be sent copies of articles that make use of their data. Many researchers
have chosen to set no limitations at all on the use of their data.
3. Warnings. This documentation file should also warn
other researchers about limitations on the use of the data. For example, if an
investigator paid no attention to correct transcription of speech errors, this
should be noted.
4. Pseudonyms. The readme file should also include
information on whether informants gave informed consent for the use of their
data and whether pseudonyms have been used to preserve informant anonymity. In
general, real names should be replaced by pseudonyms. Anonymization is not
necessary when the subject of the transcriptions is the researcher's own child,
as long as the child grants permission for the use of the data.
5. History. There should be detailed information on
the history of the project. How was funding obtained? What were the goals of
the project? How was data collected? What was the sampling procedure? How was
transcription done? What was ignored in transcription? Were transcribers
trained? Was reliability checked? Was coding done? What codes were used? Was
the material computerized? How?
6. Codes. If there are project-specific codes,
these should be described.
7. Biographical
data. Where possible,
extensive demographic, dialectological, and psychometric data should be
provided for each informant. There should be information on topics such as age,
gender, siblings, schooling, social class, occupation, previous residences,
religion, interests, friends, and so forth. Information on where the parents
grew up and the various residences of the family is particularly important in
attempting to understand sociolinguistic issues regarding language change,
regionalism, and dialect. Without detailed information about specific dialect
features, it is difficult to know whether these particular markers are being
used throughout the language or just in certain regions.
8. Situational
descriptions. The readme
file should include descriptions of the contexts of the recordings, such as the
layout of the child's home and bedroom or the nature of the activities being
recorded. Additional specific situational information should be included in
the @Situation and @Comment fields in each
file.
The
various readme files for the corpora that are now in the CHILDES database were
all contributed in this form. To maintain consistency and promote an overview
of the database, these files were then edited and reformatted and combined into
the database files that can now be downloaded from the server.
Each CLAN command
runs a very superficial check to see if a file conforms to minCHAT. This check looks only to see that
each line begins with either @, *, %, a tab, or a space. This is the
minimum that the CLAN commands must have to function. However, the correct
functioning of many CLAN commands depends on adherence to further
standards for minCHAT. In order
to make sure that a file matches these minimum requirements for correct
analysis through CLAN, researchers should run each file through the CHECK
program. The CHECK command can be run directly inside the editor, so that you
can verify the accuracy of your transcription as you are producing it. CHECK
will detect errors such as failure to start lines with the correct symbols,
use of incorrect speaker codes, or missing @Begin and @End symbols. CHECK can
also be used to find errors in CHAT
coding beyond those discussed in this chapter. Using CHECK is like brushing
your teeth. It may be hard at first to remember to use the command, but the
more you use it the easier it becomes and the better the final results.
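The superficial check described above can be imitated outside of CLAN. The following Python sketch is only an illustration of the rule that every line must begin with @, *, %, a tab, or a space. The function name `superficial_check` is hypothetical, and this is not the code that CLAN or CHECK actually runs.

```python
# Characters that may begin a line, as listed in the text above.
ALLOWED_LINE_STARTS = ("@", "*", "%", "\t", " ")

def superficial_check(lines):
    """Return the 1-based numbers of lines that do not begin with an allowed character.

    Empty lines are reported too, because they begin with none of the characters.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(ALLOWED_LINE_STARTS):
            bad.append(number)
    return bad

sample = [
    "@Begin",
    "*CHI:\tcookie [/] cookie.",
    "\tmore cookie.",
    "oops, no marker here",
    "@End",
]
print(superficial_check(sample))  # [4]
```

A file that passes this test can still fail CHECK, which applies many further rules.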
The three major
components of a CHAT transcript
are the file headers, the main tier, and the dependent tiers. In this chapter
we discuss creating the first major component: the file headers. A
computerized transcript in CHAT
format begins with a series of “header” lines, which tell us about things
such as the date of the recording, the names of the participants, the ages of
the participants, the setting of the interaction, and so forth.
A header is a line of
text that gives information about the participants and the setting. All headers
begin with the “@” sign. Some headers require nothing more than the @ sign and
the header name. These are “bare” headers such as @Begin or @New Episode. However,
most headers require that there be some additional material. This additional
material is called an “entry.” Headers that take entries must have a colon,
which is then followed by one or two tabs and the required entry. By default,
tabs are usually understood to be placed at eight-character intervals. The
material up to the colon is called the “header name.” In the example following,
“@Media” and “@Date” are both header names.
@Media: abe88 movie
@Date: 25-JAN-1983
The text that follows
the header name is called the “header entry.” Here, “abe88 movie” and
“25-JAN-1983” are the header entries. The header name and the header entry together
are called the “header line.” The header line should never have a punctuation
mark at the end. In CHAT, only
utterances actually spoken by the subjects receive final punctuation.
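The terminology above (header name, header entry, bare header) can be made concrete with a small Python sketch. It assumes that the name and entry are separated by the first colon and that the entry is preceded by tabs. The function name `parse_header` is hypothetical, and tolerating spaces after the colon is a leniency of this sketch, not a rule of CHAT.

```python
def parse_header(line):
    """Split a header line into (header_name, header_entry).

    Bare headers such as @Begin have no colon and yield an entry of None.
    """
    line = line.rstrip("\r\n")
    if not line.startswith("@"):
        raise ValueError("a header must begin with '@'")
    # Split on the first colon only, so entries such as 12:30-13:30 stay intact.
    name, colon, rest = line.partition(":")
    if not colon:
        return name, None
    return name, rest.lstrip("\t ")

print(parse_header("@Date:\t25-JAN-1983"))  # ('@Date', '25-JAN-1983')
print(parse_header("@Begin"))               # ('@Begin', None)
```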
This chapter presents
a set of headers that researchers have considered important. Except for the
@Begin, @Languages, @Participants, @ID, and @End headers, none of the headers
are required and you should feel free to use only those headers that you feel
are needed for the accurate documentation of your corpus.
CHAT uses five types of headers: hidden,
initial, participant-specific, constant, and changeable. In the editor, CHAT
files appear to begin with the @Begin header.
However, there are actually three hidden headers that appear before this
header. These are the @Font header, the
@UTF8 header, and the @ColorWords header, which appear
in that order.
@Font:
This header is used
to set the default font for the file.
This line appears at the beginning of the file and its presence is
hidden in the CLAN editor. When this header is missing, CLAN tries to determine
which font is most appropriate for use with the current file by examining
information in the @Languages and @Options headers. If CLAN’s choice is not appropriate for the
file, then the user will have to change the font. After this is done, the font information will
be stored in this header line. Files that
are retrieved from the database often do not have this header included, thereby
allowing CLAN and the user to decide which font is most appropriate for viewing
the current file.
@UTF8
This hidden header
follows after the @Font header. All
files in the database use this header to mark the fact that they are encoded in
UTF8. If the file was produced outside
of CLAN and this header is missing, CLAN will complain and ask the user to
verify whether the file should be read in UTF8.
Often this means that the user should run the CP2UTF program to convert
the file to UTF8.
CHAT has seven initial headers. The first six of these (@Begin, @Languages, @Participants, @Options, @ID, and @Media) appear in this order as the first lines of the file. The last one, @End, appears at the end of the file as the last line.
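The ordering just described can be sketched as a simple test. The Python below is an illustration only: the names `header_name` and `initial_headers_ok` are hypothetical, the function checks only the relative order of whichever initial headers are present, and it does not reproduce what CHECK does.

```python
INITIAL_ORDER = ["@Begin", "@Languages", "@Participants", "@Options", "@ID", "@Media"]
HIDDEN_HEADERS = {"@Font", "@UTF8", "@ColorWords"}

def header_name(line):
    """Return the text before the first colon (the whole line for bare headers)."""
    return line.partition(":")[0].strip()

def initial_headers_ok(lines):
    """True if @Begin is the first visible line, @End is the last line,
    and the initial headers that are present occur in the documented order."""
    visible = [ln for ln in lines
               if ln.strip() and header_name(ln) not in HIDDEN_HEADERS]
    if not visible:
        return False
    if header_name(visible[0]) != "@Begin" or header_name(visible[-1]) != "@End":
        return False
    ranks = [INITIAL_ORDER.index(header_name(ln))
             for ln in visible if header_name(ln) in INITIAL_ORDER]
    # Repeated @ID lines share one rank, so they do not disturb the ordering.
    return ranks == sorted(ranks)
```

Optional headers such as @Options and @Media may be absent without causing a failure here.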
@Begin
This header is always
the first visible header placed at the beginning of the file. It is needed to
guarantee that no material has been lost at the beginning of the file. This is
a “bare” header that takes no entry and uses no colon.
@Languages:
This is the second
visible header; it tells the programs which languages are being used in the
dialogues. Here is an example of this line for a bilingual transcript using
Swedish and Portuguese.
@Languages: swe, por
The
language codes come from the international ISO 639-3 standard. For the
languages currently in the database, these three-letter codes and extended
codes are used:
| Language  | Code    | Language   | Code | Language   | Code    |
| Afrikaans | afr     | German     | deu  | Polish     | pol     |
| Arabic    | ara     | Greek      | ell  |            |         |
| Basque    | eus     | Hebrew     | heb  | Portuguese | por     |
| Cantonese | zho-yue | Hungarian  | hun  | Punjabi    | pan     |
| Catalan   | cat     | Icelandic  | isl  | Romanian   | ron     |
| Chinese   | zho     | Indonesian | ind  | Russian    | rus     |
|           |         | Irish      | gle  | Spanish    | spa     |
| Croatian  | hrv     | Italian    | ita  | Swahili    | swa     |
| Czech     | ces     | Japanese   | jpn  | Swedish    | swe     |
| Danish    | dan     | Javanese   | jav  | Tagalog    | tgl     |
| Dutch     | nld     | Kannada    | kan  | Taiwanese  | zho-min |
| English   | eng     | Kikuyu     | kik  | Tamil      | tam     |
| Estonian  | est     | Korean     | kor  | Thai       | tha     |
| Farsi     | fas     | Lithuanian | lit  | Turkish    | tur     |
| Finnish   | fin     | Norwegian  | nor  | Vietnamese | vie     |
| French    | fra     |            |      | Welsh      | cym     |
| Galician  | glg     |            |      | Yiddish    | yid     |
We continually update this list, and CLAN
relies on a file in the lib/fixes directory called ISO-639.cut that lists the
current languages. In multilingual corpora, several codes can be combined on
the @Languages line. The first code
given is for the language used most frequently in the transcript. Individual
utterances in the second or third most frequent language can be marked with
precodes as in this example:
*CHI: [- eng] this is my juguete@s.
In this example, Spanish is the most
frequent language, but the particular sentence is marked as English. The @Languages header lists spa for Spanish,
and then eng for English. Within this English
sentence, the use of a Spanish word is then marked as @s. When the @s is used in the main body of the
transcript without the [- eng], then it indicates a shift to English, rather
than to Spanish.
The
@s code may also be used to explicitly mark the use of a particular language,
even if it is not included in the @Languages header. For example, the code schlep@s:yid can be
used to mark the inclusion of the Yiddish word “schlep” in any text. The @s code can also be
further elaborated to mark code-blended words.
The form well@s:eng&cym
indicates that the word “well” could be either an English or a Welsh word. The
combination of a stem from one language with an inflection from another can be
marked using the plus sign as in swallowni@s:eng+hun for an English stem with a
Hungarian infinitival marking. All
of these codes can be followed by a code with the $ to explicitly mark the
parts of speech. Thus, the form
recordar@s$v:inf indicates that this Spanish word is an infinitive. The marking
of part of speech with the $ sign can also be used without the @s.
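The @s forms described above can be recognized with a regular expression. The following Python sketch is an illustration under stated assumptions: the function name `parse_language_marker` and the labels "switch", "explicit", "either", and "stem+inflection" are hypothetical, and forms that carry a $ part-of-speech code are deliberately not handled.

```python
import re

# word@s, word@s:lang, word@s:lang&lang, or word@s:lang+lang
S_MARKER = re.compile(
    r"^(?P<word>[^@\s]+)@s(?::(?P<langs>[a-z-]+(?:[&+][a-z-]+)*))?$"
)

def parse_language_marker(token):
    """Describe a token that ends in an @s language marker, or return None."""
    match = S_MARKER.match(token)
    if match is None:
        return None
    langs = match.group("langs")
    if langs is None:
        kind, codes = "switch", []            # bare @s: the other language
    elif "&" in langs:
        kind, codes = "either", langs.split("&")
    elif "+" in langs:
        kind, codes = "stem+inflection", langs.split("+")
    else:
        kind, codes = "explicit", [langs]
    return {"word": match.group("word"), "kind": kind, "languages": codes}

print(parse_language_marker("well@s:eng&cym"))
```

For a bare @s, the language has to be resolved from the @Languages header and any precode on the utterance, which this sketch does not attempt.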
Tone languages like Cantonese, Mandarin,
and Thai are allowed to have word forms that include tones and numbers for
polysemes.
@Participants:
This is the third
visible header. Like the @Begin and
@Languages headers, it is obligatory. It lists all of the actors within the file.
The format for this header is XXX Name Role, XXX Name Role, XXX Name Role. XXX
stands for the three-letter speaker ID. Here is an example of a completed @Participants
header line:
@Participants: SAR Sue_Day Target_Child, CAR Carol Mother
Participants
are identified by three elements: their speaker ID, their name and their role:
1. Speaker ID. The
speaker ID is usually composed of three letters. The code may be based
either on the participant's name, as in *ROS or *BIL, or on her role, as in
*CHI or *MOT. In this type of identifying system, several different children
could be indicated as *CH1, *CH2, *CH3, and so on. Speaker IDs must be unique
because they will be used to identify speakers both in the main body of the
transcript and in other headers. In many transcripts, three letters are enough
to distinguish all speakers. However, even with three letters, some ambiguities
can arise. For example, suppose that the child being studied is named Mark
(MAR) and his mother is named Mary (MAR). They would both have the same speaker
ID and you would not be able to tell who was talking. So you must change one
speaker ID. You would probably want to change it to something that would be
easy to read and understand as you go through the file. A good choice is to use
that speaker's role. In this example, Mary's speaker ID would be changed to MOT
(Mother). You could change Mark's speaker ID to CHI, but that would be
misleading if there are other children in the transcript. So a better solution
would be to use MAR and MOT as shown in the following example:
@Participants: MAR Mark Target_Child, MOT Mary Mother
2. Name. The speaker's name can be omitted. If CLAN finds only a
three-letter ID and a role, it will assume that the name has been omitted. In
order to preserve anonymity, it is often useful to include a pseudonym for the
name, because the pseudonym will also be used in the body of the transcript.
For CLAN to correctly parse the participants line, multiple-word name
definitions such as “Sue Day” need to be joined in the form “Sue_Day.”
3. Role. After the ID and name, you type in the role of the speaker.
There is a fixed set of roles specified in the depfile.cut file used by CHECK,
and we recommend trying to use these fixed roles whenever possible. Please
consult that file for the full list. You will also see this same list of
possible roles in the “role” segment of the “ID Headers” dialog box. If one of
these standard roles does not work, it
would be best to use one of the generic age roles, like Adult, Child, or
Teenager. Then, the exact nature of the role can be put in the place of the
name, as in these examples:
@Participants: TBO Toll_Booth_Operator Adult, AIR Airport_Attendant Adult, SI1 First_Sibling Sibling, SI2 Second_Sibling Sibling, OFF MOT_to_INV OffScript, NON Computer_Talk Non_Human
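The three elements of each participant can be pulled apart mechanically. This Python sketch assumes that participants are separated by commas and that elements within a participant are separated by whitespace. The function name `parse_participants` is hypothetical, and the sketch does not check roles against depfile.cut.

```python
def parse_participants(entry):
    """Parse a @Participants entry into (speaker_id, name, role) tuples.

    The name is None when only an ID and a role are given.
    """
    participants = []
    for chunk in entry.split(","):
        fields = chunk.split()
        if len(fields) == 3:
            speaker_id, name, role = fields
        elif len(fields) == 2:
            speaker_id, role = fields
            name = None
        else:
            raise ValueError("expected 'ID Name Role' or 'ID Role': %r" % chunk)
        participants.append((speaker_id, name, role))
    ids = [p[0] for p in participants]
    if len(set(ids)) != len(ids):
        raise ValueError("speaker IDs must be unique")
    return participants

print(parse_participants("SAR Sue_Day Target_Child, CAR Carol Mother"))
# [('SAR', 'Sue_Day', 'Target_Child'), ('CAR', 'Carol', 'Mother')]
```

The uniqueness test catches the Mark and Mary case discussed above, where both speakers would otherwise be coded as MAR.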
@Options:
This header is not obligatory, but it is frequently needed. When it occurs, it
must follow the @Participants line. This header allows the checking programs
(CHECK and the XML validator) to suspend certain checking rules for certain
file types.
1. CA. Use of this option suspends the usual requirement for utterance
terminators.
2. Heritage. Use of this option tells CHECK and the validator not to look at
the content of the main lines at all. This radical blockage of the function of
CHECK is only recommended for people working with CA files done in the
traditional Jeffersonian format. When this option is used, text may be placed
into italics, as in traditional CA.
3. Sign. Use of this option permits the use of all capitals in words for Sign
Language notation.
4. IPA. Use of this option permits the use of IPA notation on the main line.
5. Line. Use of this option tells the web browser to expect time marking
bullets on each line. By default, the browser expects a bullet at the end of
each tier.
6. Multi. Use of this option tells the checkers to expect multiple bullets on
a single line. This can be used for data that come from programs like Praat
that mark time for each word.
7. Caps. This option turns off CLAN’s restriction against having capital
letters inside words.
@ID:
This
header is used to control programs such as STATFREQ, output to Excel, and new
programs based on XML. The form of this
line is:
@ID: language|corpus|code|age|sex|group|SES|role|education|custom|
There must be one @ID header for each participant. Often you will not care to encode all of this
information. In that case, you can leave
some of these fields empty. Here is a
typical @ID header.
@ID: eng|macwhinney|CHI|2;10.10||||Target_Child|||
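Because the @ID entry is a fixed sequence of ten fields that each end with a vertical bar, it splits cleanly into named fields. The Python sketch below takes its field names from the template line above. The function name `parse_id` is hypothetical, and the sketch performs no validation of individual field values.

```python
ID_FIELDS = ["language", "corpus", "code", "age", "sex",
             "group", "SES", "role", "education", "custom"]

def parse_id(entry):
    """Split an @ID entry into a dict keyed by field name (empty fields stay '')."""
    parts = entry.split("|")
    # Every field ends with '|', so splitting leaves one empty string at the end.
    if len(parts) != len(ID_FIELDS) + 1 or parts[-1] != "":
        raise ValueError("expected %d fields, each followed by '|'" % len(ID_FIELDS))
    return dict(zip(ID_FIELDS, parts[:-1]))

record = parse_id("eng|macwhinney|CHI|2;10.10||||Target_Child|||")
print(record["code"], record["age"], record["role"])  # CHI 2;10.10 Target_Child
```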
To facilitate typing of these headers,
you can run the CHECK program on a new CHAT file. If CHECK does not see @ID headers, it will
use the @Participants line to insert a set of @ID headers to which you can then
add further information. Alternatively,
you can use the INSERT program to create these fields automatically from the
information in the @Participants line.
For even more complete control over creation of these @ID headers, you
can use the dialog system that comes up when you have an open CHAT file and
select “ID Headers” under the Tiers Menu pulldown. Here is a sample version of this dialog box:
[Figure: the “ID Headers” dialog box]
Here are some further characterizations of the possible fields for the @ID header.
Corpus: a one-word label for the corpus in lowercase
Code: the three-letter code for the speaker in capitals
Age: the age of the speaker (see below)
Sex: either “male” or “female” in lowercase
Role: the role as given in the @Participants line
Education: educational level of the speaker
Custom: any additional information needed for a given project
It is important to use the correct format for the Target_Child’s age. This
field uses the form years;months.days, as in 2;11.17 for 2 years, 11 months,
and 17 days. If you want to represent a range of several days for a given
transcript, you can use this format: 2;11.17 – 2;11.28. Note that the dash is
surrounded by spaces. If you do not know the child's age in days, you can
simply use years and months, as in 6;4. with a period after the months. If you
do not know the months, you can use the form 6; with the semicolon after the
years. If you only know the child’s birthdate and the date of the transcript,
you can use the DATES program to compute the child’s age.
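The three single-age forms described above (2;11.17, 6;4., and 6;) can be captured in one regular expression. This Python sketch is an illustration: the function name `parse_age` is hypothetical, and ranges of ages are left out.

```python
import re

# years;  or  years;months.  or  years;months.days
AGE = re.compile(r"^(\d+);(?:(\d{1,2})\.(\d{1,2})?)?$")

def parse_age(text):
    """Return (years, months, days); unknown parts are None."""
    match = AGE.match(text.strip())
    if match is None:
        raise ValueError("not a CHAT age: %r" % text)
    return tuple(int(g) if g is not None else None for g in match.groups())

print(parse_age("2;11.17"))  # (2, 11, 17)
print(parse_age("6;4."))     # (6, 4, None)
print(parse_age("6;"))       # (6, None, None)
```

The form 6;4 without the final period is rejected, in line with the rule that the months must be followed by a period.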
@Media:
This header is used to tell CLAN how to locate and play back media that are
linked to transcripts. The first field in this header specifies the name of
the media file. Extensions should be omitted. If the media file is abe88.wav,
then just enter “abe88”. Then declare the format as “sound” or “video”. It is
also possible to add the terms “missing” or “unlinked” after the media type.
So the line gives the media file name, then the media type, and then
optionally “missing” or “unlinked”.
@End
Like the @Begin
header, this header uses no colon and takes no entry. It is placed at the end of the file as the very last line. Adding
this header provides a safeguard against the danger of undetected file
truncation during copying.
The third set of
headers provides information specific to each participant. Most of the
participant-specific information is in the @ID tier. That information can be entered by using the
ID headers option in CLAN’s Tiers menu. The exceptions are for these tiers:
Currently, the constant headers follow the participant-specific headers. However, once the participant-specific headers have been merged into the @ID fields, the constant headers will follow the @Media field. These headers, which are all optional, describe various general facts about the file.
This
allows for special word forms in certain corpora.
The
possible entries here include: constructed computer phonecall telechat meeting
work medical classroom tutorial private family sports religious legal
face_to_face
@Location:
This
header should include the city, state or province, and country in which the
interaction took place. Here is an example of a completed header line:
@Location: Boston, MA, USA
The possible entries here include: two
three four five more audience
Possible entries here are: poor, fair,
good, and excellent.
@Room Layout:
This
header outlines room configuration and positioning of furniture. This is
especially useful for experimental settings. The entry should be a description
of the room and its contents. Here is an example of the completed header line:
@Room Layout: Kitchen; Table in center of room with window on west wall, door to outside on north wall
@Tape Location:
This header indicates the specific tape
ID, side and footage. This is very important for identifying the tape from
which the transcription was made. The entry for this header should include the
tape ID, side and footage. Here is an example of this header:
@Tape Location: tape74, side a, 104
@Time Duration:
It
is often necessary to indicate the time at which the audiotaping began and the
amount of time that passed during the course of the taping, as in the following
header:
@Time Duration: 12:30-13:30
This header provides the absolute time
during which the taping occurred. For most projects what is important is not
the absolute time, but the time of individual events relative to each other.
This sort of relative timing is provided by coding on the %tim dependent tier
in conjunction with the @Time Start header described next.
@Time Start:
If
you are tracking elapsed time on the %tim tier, the @Time Start header can be
used to indicate the absolute time at which the timing marks begin. If a new
@Time Start header is placed in the middle of the transcript, this “restarts”
the clock.
@Time Start: 12:30
This
line identifies the people who transcribed and coded the file. Having this
indicated is often helpful later, when questions arise. It also provides a way
of acknowledging the people who have taken the time to make the data available
for further study.
The
possible entries here are: eye_dialect
partial full detailed coarse checked
This header is used
to warn the user about certain defects or peculiarities in the collection and
transcription of the data in the file. Some typical warnings are as follows:
Changeable headers
can occur either at the beginning of the file along with the constant headers
or else in the body of the file. Changeable headers contain information that
can change within the file. For example, if the file contains material that was
recorded on only one day, the @Date header would occur only once at the
beginning of the file. However, if the file contains some material from a later
day, the @Date header would be used again later in the file to indicate the
next date. These changeable headers appear, then, at the point within the file
where the information changes. The list that follows is alphabetical.
@Activities:
This header describes
the activities involved in the situation. The entry is a list of component
activities in the situation. Suppose the @Situation header reads, “Getting
ready to go out.” The @Activities header would then list what was involved in
this, such as putting on coats, gathering school books, and saying good-bye.
@Bck:
Diary material that
was not originally transcribed in the CHAT
format often has explanatory or background material placed before a child's
utterance. When converting this material to the CHAT format, it is sometimes impossible to decide whether
this background material occurs before, during, or after the utterance. In
order to avoid having to make these decisions after the fact, one can simply
enter it in an @Bck header.
@Bck: Rachel was fussing and pointing toward the cabinet where the cookies are stored.
*RAC: cookie [/] cookie.
@Bg and @Bg:
These
headers are used to mark the beginning of a “gem” for analysis by GEM. If there
is a colon, you must follow the colon with a tab and then one or more code
words.
This
header is created by the TEXTIN program.
It is used to represent the fact that some written text includes a blank
line or new paragraph. It should not be used for transcripts of spoken
language.
@Comment:
This header can be
used as an all-purpose comment line. Any type of comment can be entered on an
@Comment line. When the comment refers to a particular utterance, use the %com
line. When the comment refers to more general material, use the @Comment
header. If the comment is intended to apply to the file as a whole, place the
@Comment header along with the constant headers before the first utterance.
Instead of trying to make up a new coding tier name such as “@Gestational Age”
for a special purpose type of information, it is best to use the @Comment
field, as in this example:
@Comment: Gestational age of MAR is 7 months
@Comment: Birthweight of MAR is 6 lbs. 4 oz
Another example
of a special @Comment field is used in the diary notes of the MacWhinney
corpus, where they have this shape:
@Comment: Diary-Brian – Ross said “I don’t need to throw my blocks out the window anymore.”
@Date:
This header indicates
the date of the interaction. The entry for this header is given in the form
day-month-year. The date is abbreviated in the same way as in the @Birth header
entry. Here is an example of a completed @Date header line:
@Date: 01-JUL-1965
Because
we have some corpora going back over a century, it is important to include the
full value for the year. Also, because
the days of the month should always have two digits, it is necessary to add a
leading “0” for days such as “01”.
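The day-month-year format can be validated with a short Python sketch. The examples in this manual show only JAN and JUL, so the remaining month abbreviations in the list below are an assumption (uppercase three-letter English forms). The function name `parse_chat_date` is hypothetical.

```python
import re
from datetime import date

# Assumed set of month abbreviations; only JAN and JUL appear in the examples.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Two-digit day, three-letter month, four-digit year.
DATE = re.compile(r"^(\d{2})-([A-Z]{3})-(\d{4})$")

def parse_chat_date(entry):
    """Convert an entry such as 01-JUL-1965 into a datetime.date."""
    match = DATE.match(entry.strip())
    if match is None:
        raise ValueError("expected the form DD-MON-YYYY: %r" % entry)
    day, month, year = match.groups()
    if month not in MONTHS:
        raise ValueError("unknown month abbreviation: %r" % month)
    # date() itself rejects impossible days such as 31-FEB.
    return date(int(year), MONTHS.index(month) + 1, int(day))

print(parse_chat_date("01-JUL-1965"))  # 1965-07-01
```

Entries with a one-digit day or a two-digit year are rejected, matching the two requirements stated above.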
@Eg and @Eg:
These headers are
used to mark the end of a “gem” for analysis by the GEM command. If there is a
colon, you must follow the colon with a tab and then one or more code words.
Each @Eg must have a matching @Bg. If
the @Eg: form is used, then the text following it must exactly match the text
in the corresponding @Bg: header. You can nest
one set of @Bg-@Eg markers inside another, but double embedding is not
allowed. You can also begin a new pair
before finishing the current one, but again this cannot be done for three
beginnings.
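The pairing rules for gems can be sketched as a small checker. The Python below reads the rules above as "at most two gems may be open at the same time, and every @Eg must close an open @Bg with the same text". That reading, the function name `check_gems`, and the message wording are assumptions of this sketch, which is not the logic used by GEM or CHECK.

```python
def check_gems(lines, max_open=2):
    """Return a list of (line_number, message) problems in the gem markers.

    Unclosed gems found at the end of the file are reported with line_number None.
    """
    open_labels = []
    problems = []
    for number, line in enumerate(lines, start=1):
        name, colon, rest = line.partition(":")
        name = name.strip()
        label = rest.strip() if colon else None   # None for the bare form
        if name == "@Bg":
            open_labels.append(label)
            if len(open_labels) > max_open:
                problems.append((number, "more than %d gems open at once" % max_open))
        elif name == "@Eg":
            if label in open_labels:
                open_labels.remove(label)
            else:
                problems.append((number, "@Eg without a matching @Bg"))
    for label in open_labels:
        problems.append((None, "unclosed @Bg: %r" % (label,)))
    return problems

overlapping = ["@Bg:\tstory", "*CHI:\tonce.", "@Bg:\tplay", "@Eg:\tstory", "@Eg:\tplay"]
print(check_gems(overlapping))  # []
```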
@G:
This header is used
in conjunction with the GEM program, which is described in the CLAN
manual. It marks the beginning of “gems” when no nesting or overlapping of gems
occurs. Each gem is defined as material
that begins with an @G marker and ends with the next @G marker. We refer to these markers as “lazy” gem
markers, because they are easier to use than the @Bg and @Eg markers. To use
this feature, you need to also use the +n switch in GEM. You may nest at
most one @Bg-@Eg pair inside a series of @G headers.
@New Episode
This header simply
marks the fact that there has been a break in the recording and that a new
episode has started. It is a “bare” header that is used without a colon,
because it takes no entry. There is no need to mark the end of the episode
because the @New Episode header indicates both the end of one episode and the
beginning of another.
This header is used
to indicate the shift from the initially most frequent language listed in the
@Languages header to a new most frequent language. This header should only be used when there is
a marked break in a transcript from the use of one language to a fairly uniform
use of another language.
This
header is used to indicate the page from which some text is taken. It should
not be used for spoken texts.
@Situation:
This changeable
header describes the general setting of the interaction. It applies to all the
material that follows it until a new @Situation header appears. The entry for
this header is a standard description of the situation. Try to use standard
situations such as: “breakfast,” “outing,” “bath,” “working,” “visiting
playmates,” “school,” or “getting ready to go out.” Here is an example of the
completed header line:
@Situation: Tim and Bill are playing with toys in the hallway.
There should be
enough situational information given to allow the user to reconstruct the
situation as much as possible. Who is present? What is the layout of the room
or other space? What is the social role of those present? Who is usually the
caregiver? What activity is in progress? Is the activity routinized and, if so,
what is the nature of the routine? Is the routine occurring in its standard
time, place, and personnel configuration? What objects are present that affect
or assist the interaction? It will also be important to include relevant ethnographic
information that would make the interaction interpretable to the user of the database.
For example, if the text is parent-child interaction before an observer, what
is the culture's evaluation of behaviors such as silence, talking a lot,
displaying formulaic skills, defending against challenges, and
so forth?
Words are the basic
building blocks for all sentential and discourse structures. By studying the
development of word use, we can learn an enormous amount about the growth of
syntax, discourse, morphology, and conceptual structure. However, in order to
realize the full potential of computational analysis of word usage, we need to
follow certain basic rules. In particular, we need to make sure that we spell
words in a consistent manner. If we sometimes use the form doughnut and sometimes use the form donut, we are being inconsistent in our representation of this
particular word. If such inconsistencies are repeated throughout the lexicon,
computerized analysis will become inaccurate and misleading. One of the major
goals of CHAT analysis is to
maximize systematicity and minimize inconsistency. In the Introduction, we
discussed some of the problems involved in mapping the speech of language
learners onto standard adult forms. This chapter spells out some rules and
heuristics designed to achieve the goal of consistency for word-level
transcription.
One solution to this
problem would be to avoid the use of words altogether by transcribing
everything in phonetic or phonemic notation. But this solution would make the
transcript difficult to read and analyze. A great deal of work in language
learning is based on searches for words and combinations of words. If we want
to conduct these lexical analyses, we have to try to match up the child's
production to actual words. Work in the analysis of syntactic development also
requires that the text be analyzed in terms of lexical items. Without a clear
representation of lexical items and the ways that they diverge from the adult
standard, it would be impossible to conduct lexical and syntactic analyses
computationally. Even for those researchers who do not plan to conduct lexical
analyses, it is extremely difficult to understand the flow of a transcript if
no attempt is made to relate the learner's sounds to items in the adult
language.
At the same time,
attempts to force adult lexical forms onto learner forms can seriously
misrepresent the data. The solution to this problem is to devise ways to
indicate the various types of divergences between learner forms and adult
standard forms. Note that we use the term “divergences” rather than “error.”
Although both learners (MacWhinney & Osser, 1977) and adults (Stemberger, 1985) clearly do make errors, most of the divergences
between learner forms and adult forms are due to structural aspects of the
learner's system.
This chapter
discusses the various tools that CHAT
provides to mark some of these divergences of child forms from adult
standards. The basic types of codes for divergences that we discuss are:
1. special learner-form markers,
2. codes for unidentifiable material,
For
languages such as English, Spanish, and Japanese, we now have complete MOR
grammars. The lexicons used by these grammars constitute the definitive current
CHAT standard for words. Please take a
look at the relevant lexical files, since they illustrate in great detail the
overall principles we are describing in this chapter.
The word forms we
will be discussing here are the principal components of the “main line.” This line gives the basic transcription of
what the speaker said. The structure of main lines in CHAT is fairly
simple. Each main tier line begins with
an asterisk. After the asterisk, there is a three-letter speaker ID, a colon
and a tab. The transcription of what was said begins in the ninth column,
after the tab, because the tab stop in the editor is set for the eighth column.
The remainder of the main tier line is composed primarily of a series of words.
Words are defined as a series of ASCII characters separated by spaces. In this
chapter, we discuss the principles governing the transcription of words. In
CLAN, all characters that are not punctuation markers are potentially parts of
words. The default punctuation set includes the space and these characters:
, . ; ? ! [ ] < >
None
of these characters or the space can be used within words. Other non-letter
characters such as the plus sign (+) or the at sign (@) can be used within
words to express special meanings. This punctuation set applies to the main
lines and all coding lines with the exception of the %pho and %mod lines which
use the system described in the chapter on Dependent Tiers. Because those
systems make use of punctuation markers for special characters, only the space
can be used as a delimiter on the %pho and %mod lines. As the CLAN manual
explains, this default punctuation set can be changed for particular analyses.
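The main-line layout and default punctuation set described above can be sketched in a few lines of Python. This is only an illustration, not CLAN's own parser: the names parse_main_line, MAIN_LINE, and DEFAULT_PUNCTUATION are hypothetical, the speaker ID pattern (three upper-case letters or digits) is an assumption, and bracketed codes such as [=! ...] are not handled.

```python
import re

# Default punctuation set quoted in the text: , . ; ? ! [ ] < > (plus the space).
DEFAULT_PUNCTUATION = ",.;?![]<>"

# Assumed shape of a main line: asterisk, three-character speaker ID,
# colon, then a tab (spaces are also tolerated here).
MAIN_LINE = re.compile(r"^\*([A-Z0-9]{3}):[\t ]+(.*)$")


def parse_main_line(line):
    """Return (speaker, words) for a main line, or None for any other line."""
    match = MAIN_LINE.match(line)
    if match is None:
        return None
    speaker, text = match.groups()
    # Turn every punctuation character into a space, then split on whitespace.
    table = str.maketrans(DEFAULT_PUNCTUATION, " " * len(DEFAULT_PUNCTUATION))
    return speaker, text.translate(table).split()
```

Characters such as @, +, and & stay inside words because they are not in the punctuation set, so bingbing@c and &t survive as single items.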
Main lines are composed of words and
other markers. Words are pronounceable
forms, surrounded by spaces. Most words are entered just as they are found in
the dictionary. The first word of a sentence is not capitalized, unless it is a
proper noun.
Special form markers
can be placed at the end of a word. To do this, the symbol “@” is used in
conjunction with one or two additional letters. Here is an example of the use
of the @ symbol:
*SAR: I got a bingbing@c.
Here
the child has invented the form bingbing
to refer to a toy. The word bingbing
is not in the dictionary and must be treated as a special form. To further
clarify the use of these @c forms, the transcriber should create a file called
“0lexicon.cdc” that provides glosses for such forms.
The @c form
illustrated in this example is only one of many possible special form markers
that can be devised. The following table lists some of these markers that we
have found useful. However, this categorization system is meant only to be
suggestive, not exhaustive. Researchers may wish to add further distinctions
or ignore some of the categories listed. The particular choice of markers and
the decision to code a word with a marker form is one that is made by the
transcriber, not by CHAT. The
basic idea is that CLAN will treat words marked with the special learner-form
markers as words and not as fragments. In addition, the MOR program will not
attempt to analyze special forms for part of speech.
Table 2:
Special Form Markers
| Letters | Categories | Example | Meaning | POS |
| @a | addition | xxx@a | unintelligible | w |
| @b | babbling | abame@b | - | bab |
| @c | child-invented form | gumma@c | sticky | chi |
| @d | dialect form | younz@d | you | dia |
| @f | family-specific form | bunko@f | broken | fam |
| @g | general special form | gongga@g | - | - |
| @i | interjection, interaction | uhhuh@i | - | int |
| @k | multiple letters | ka@k | Japanese “ka” | n:let |
| @l | letter | b@l | letter b | n:let |
| @n | neologism | breaked@n | broke | neo |
| @o | onomatopoeia | woofwoof@o | dog barking | on |
| @p | phonol. consistent form | aga@p | - | phon |
| @pm | protomorpheme | wi@pm | will? | pm |
| @q | metalinguistic use | no if@q-s or but@q-s | when citing words | meta |
| @s:* | second-language form | istenem@s:hu | Hungarian word | L2 |
| @si | singing | lalala@si | singing | sing |
| @sl | signed language | apple@sl | apple | sign |
| @sas | sign & speech | apple@sas | apple and sign | sas |
| @t | test word | wug@t | small creature | test |
| @u | Unibet transcription | binga@u | - | uni |
| @wp | word play | goobarumba@wp | - | wp |
| @x | Excluded words | | excluded | unk |
| @z | User-defined code | word@z:rtfd | any user code | |
We can define these
special markers in the following ways:
1. Addition can
be used to mark an unintelligible string as a word for inclusion on the %mor
line. MOR then recognizes xxx@a as
w|xxx. It also recognizes xxx@a$n as,
for example, n|xxx.
2. Babbling
can be used to mark both
low-level early babbling and high-level sound play in older children. These
forms have no obvious meaning and are used just to have fun with sound.
3.
Child-invented forms are words created by the child sometimes
from other words without obvious derivational morphology. Sometimes they appear
to be sound variants of other words. Sometimes their origin is obscure.
However, the child appears to be convinced that they have meaning and adults
sometimes come to use these forms themselves.
4.
Dialect form is often an interesting general property
of a transcript. However, the coding of
phonological dialect variations on the word level should be minimized, because
it often makes transcripts more difficult to read and analyze. Instead,
general patterns of phonological variation can be noted in the readme file.
5.
Family-specific forms are
much like child-invented forms that have been taken over by the whole
family. Sometimes the source of these forms is a child, but it can also be an
older member of the family. Sometimes the forms come from variations of words
in another language. An example might be the use of undertoad to refer to some mysterious being in the surf, although
the word was simply undertow
initially.
6.
General special form marking with @g can be used when all of
the above fail. However, its use should generally be avoided. Marking with the
@ without a following letter is not accepted by CHECK.
7.
Interjections can be indicated in standard ways, making
the use of the @i notation usually not necessary. Instead of transcribing
“ahem@i,” one can simply transcribe ahem
following the conventions listed later.
8. Letters can either be transcribed with the @l
marker or simply as single-character words.
Strings of letters are marked as @k.
9.
Neologisms are meant to refer to morphological
coinages. If the novel form is monomorphemic,
then it should be characterized as a child-invented form (@c), family-specific
form (@f), or a test word (@t). Note
that this usage is only really sanctioned for CHILDES corpora. For AphasiaBank corpora, neologisms are
considered to be forms that have no real word source, as is typical in jargon
aphasia.
10. Nonvoiced
forms are produced
typically by hearing-impaired children or their parents who are mouthing words
without making their sounds.
11. Onomatopoeias
include animal sounds
and attempts to imitate natural sounds.
12. Phonologically
consistent forms (PCFs) are early forms that are phonologically consistent, but
whose meaning is unclear to the transcriber. Usually these forms have some
relation to small function words.
13. Protomorphemes are forms that will eventually become
morphemes, including function words and affixes.
14. Metalinguistic
reference can be used to
either cite or “quote” single standard words or special child forms.
15. Second-language
forms derive from some
language not usually used in the home. These are marked with @s followed by a
colon and a code for the second language, as in @s:zh for Mandarin words inside
an English sentence.
16. Sign
language use can be
indicated by the @sl.
17. Sign
and speech use involves
making a sign or informal sign in parallel with saying the word.
18. Singing
can be marked with @si.
Sometimes the phrase that is being sung involves nonwords, as in
lalaleloo@si. In other cases, it
involves words that can be joined by underscores. However, if a larger passage
is sung, it is best to transcribe it as speech and just mark it as being sung
through a comment line.
19. Test
words are nonce forms
generated by the investigators to test the
productivity of the child's grammar.
20. Unibet
transcription can be
given on the main line by using the @u marker. However, if many such forms are
being noted, it may be better to construct a %pho line. With the advent of IPA
Unicode, we now prefer to avoid the use of Unibet, relying instead directly on
IPA.
21. Word
play in older children
produces forms that may sound much like the forms of babbling, but which arise
from a slightly different process. It is
best to use the @b for forms produced by children younger than 2;0 and @wp for
older children.
22. Unknown
forms can be marked with
@x. However, usually unknown forms are
transcribed using the xx, xxx, yy, yyy, and www markers.
23. User-defined
special forms can be
marked with @z followed by up to five letters of a user-defined code, such as
in word@z:rftd. This format should be
used carefully, because it will be difficult for the MOR program to evaluate
words with these codes unless additional detailed information is added to the
sf.cut file.
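To make the marker notation concrete, here is a minimal Python sketch that separates a word from its special form marker and looks up the category name from Table 2. It is not how MOR or CHECK process these forms; the function names and the dictionary are hypothetical, only a subset of markers is listed, and suffixes that follow a marker (as in if@q-s) or part-of-speech additions (as in xxx@a$n) are not handled.

```python
# A subset of the marker letters from Table 2, mapped to their category names.
SPECIAL_FORM_CATEGORIES = {
    "b": "babbling",
    "c": "child-invented form",
    "d": "dialect form",
    "f": "family-specific form",
    "k": "multiple letters",
    "l": "letter",
    "n": "neologism",
    "o": "onomatopoeia",
    "s": "second-language form",
    "si": "singing",
    "t": "test word",
    "wp": "word play",
}


def split_special_form(word):
    """Split 'bingbing@c' into (stem, marker, detail); detail follows a colon."""
    stem, at_sign, suffix = word.partition("@")
    if not at_sign:
        return word, None, None  # ordinary word, no marker
    marker, _, detail = suffix.partition(":")
    return stem, marker, detail or None


def category_of(word):
    """Return the Table 2 category name for a marked word, or None."""
    _, marker, _ = split_special_form(word)
    return SPECIAL_FORM_CATEGORIES.get(marker)
```

For istenem@s:hu the sketch returns the stem istenem, the marker s, and the language code hu as the detail.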
Later in this chapter
we present a set of standard spellings of English words that make use of @d,
@fp, and @i largely unnecessary. However, in languages where such a list is not
available, it may be necessary to use forms with @d or @i. The @b, @u, and @wp
markers allow the transcriber to represent words and babbling words
phonologically on the main line and have CLAN treat them as full lexical items.
This should only be done when the analysis requires that the phonological string
be treated as a word and it is unclear which standard morpheme corresponds to
the word. If a phonological string should not be treated as a full word, it
should be marked by a beginning &, and the @b, @u, or @wp endings should not
be used. Also, if the transcript includes a complete %pho line for each word
and the data are intended for phonological analysis, it is better to use yy
(see the next section) on the main line and then give the phonological form on
the %pho line. If you wish to omit
coding of an item on the %pho line, you can insert the horizontal ellipsis
character … (Unicode 2026). This is a
single character, not three periods, and it is not the ellipsis character used
by MS-Word.
Family-specific forms
are special words used only by the family. These are often derived from child
forms that are adopted by all family members. They also include certain
“caregiverese” forms that are not easily recognized by the majority of adult
speakers but which may be common to some areas or some families. Family-specific
forms can be used by either adults or children.
The @n marker is
intended for morphological neologisms and over-regularizations, whereas the @c
marker is intended to mark nonce creation of stems. Of course, this distinction
is somewhat arbitrary and incomplete. Whenever a child-invented form is clearly
onomatopoeic, use the @o coding instead of the @c coding. A fuller
characterization of neologisms can be provided by the error coding system
presented in a separate chapter.
If transcribers find
it difficult to distinguish between child-invented forms, onomatopoeia, and
familial forms, they can use the general @g marker instead; CHECK does not accept the @ symbol without a following letter. In this
way, they can at least indicate the fact that the preceding word is not a
standard item in the adult lexicon.
Sometimes it is
difficult to map a sound or group of sounds onto either a conventional word or
a non-conventional word. This can occur when the audio signal is so weak or garbled
that you cannot even identify the sounds being used. At other times, you can
recognize the sounds that the speaker is using, but cannot map the sounds onto
words. Sometimes you may choose not to transcribe a passage, because it is
irrelevant to the interaction. Sometimes the person makes a noise or performs
an action instead of speaking, and sometimes a person breaks off before
completing a recognizable word. All of these problems can be dealt with by
using certain special symbols for those items that cannot be easily related to
words. These symbols are typed in lower case and are preceded and followed by
spaces. When standing alone on a text tier, they should be followed by a
period, unless it is clear that the utterance was a question or a command.
Use the symbol xxx
when you cannot hear or understand what the speaker is saying. If you believe
you can distinguish the number of unintelligible words, you may use several xxx
strings in a row. Here is an example of the use of the xxx symbol:
*SAR: xxx.
*MOT: what?
*SAR: I want xx.
Sarah's first
utterance is fully unintelligible. Her second utterance includes some unintelligible
material along with some intelligible material.
The MLU and MLT
commands will ignore the xxx symbol when computing mean length of utterance and
other statistics. If you want to have several words included, use as many
occurrences of xx as you wish.
Use the symbol yyy
when you plan to code all material phonologically on a %pho line. If you are
not consistently creating a %pho line in which each word is transcribed in IPA
in the order of the main line, you should use the @u or & notations
instead. Here is an example of the use of yyy:
*SAR: yyy yyy a ball.
%pho: ta gə ə bal
The
first two words cannot be matched to particular words, but their phonological
form is given on the %pho line.
The www symbol must be
used in conjunction with an %exp tier which is discussed in the chapter on
dependent tiers. This symbol is used on the main line to indicate material that
a transcriber does not know how to transcribe or does not want to transcribe.
For example, it could be that the material is in a language that the
transcriber does not know. This symbol can also be used when a speaker says
something that has no relevance to the interactions taking place and the experimenter
would rather ignore it. For example, www could indicate a long conversation
between adults that would be superfluous to transcribe. Here is an example of
the use of this symbol:
*MOT: www.
%exp: talks to neighbor on the telephone
The 0 symbol is used
when the speaker performs some action that is not accompanied by speech. Notice
that the symbol is the numeral zero “0,” not the capital letter “O.” Here is an
example of the correct usage of this symbol:
*FAT: where's your doll?
*DAV: 0 [=! runs over to her closet].
If
the transcriber wishes to code the phonetics of a nonverbal vocalization such as crying, it would be better
to insert yyy on the main tier. Do not use the zero, if there is any speech on
the tier. The zero can also be used to provide a place to attach a dependent
tier.
The & symbol can
be used at the beginning of a string to indicate that the following material
is just a phonological fragment or piece of a word and that CLAN should not
treat it as a word. It is important not to include any of the three utterance
terminators – the exclamation
mark, the question mark, or the period – because CLAN will treat these as
utterance terminators. This form of notation is useful when the speaker
stutters or breaks off before completing a recognizable word (false starts).
The utterance “t- t- c- can't you go” is transcribed as follows:
*MAR: &t &t &k can't you go?
The
ampersand can also be used for nonce and nonsense forms:
*DAN: &glnk &glnk.
%com: weird noises
Material
following the ampersand symbol will be ignored by certain CLAN commands, such
as MLU, which computes the mean length of the utterance in a transcript. If you
want to have the material treated as a word, use the @u form of notation
instead (see the previous section).
Unless you
specifically attempt to search for strings with the ampersand, the CLAN
commands will not see them at all. If you want a command such as FREQ to count
all of the instances of phonological fragments, you would have to add a switch
such as +s"&*".
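The effect of these conventions on a word count can be illustrated with a short Python sketch. It only mirrors the two exclusions stated in the text (the xxx symbol and material beginning with an ampersand) and is not the actual MLU or FREQ algorithm; the function names are hypothetical.

```python
def countable_words(words):
    """Drop xxx and &-prefixed fragments, which the text says MLU ignores."""
    return [w for w in words if w != "xxx" and not w.startswith("&")]


def fragments(words):
    """Collect the &-prefixed fragments, roughly what a +s"&*" search targets."""
    return [w for w in words if w.startswith("&")]
```

Applied to the words of "&t &t &k can't you go", the first function keeps three items and the second returns the three false starts.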
Words may also be
incomplete or even fully omitted. We can judge a word to be incomplete when
enough of it is produced for us to be sure what was intended. Judging a word to
be omitted is often much more difficult.
Noncompletion
of a Word text(text)text
When a word is
incomplete, but the intended meaning seems clear, insert the missing material
within parentheses. Do not use this notation for fully omitted words, only for
words with partial omissions. This notation can also be used to derive a
consistent spelling for commonly shortened words, such as (un)til and (be)cause.
CLAN will treat items that are coded in this way as full words. For programs
such as FREQ, the parentheses will essentially be ignored and (be)cause will be treated as if it were because. The CLAN programs also provide
ways of either including or excluding the material in the parentheses,
depending on the goals of the analysis.
*RAL: I been sit(ting) all day.
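The two ways of treating parenthesized material can be sketched as follows. This is a hypothetical illustration in Python, not CLAN code: full_form keeps the material in parentheses and spoken_form drops it.

```python
import re


def full_form(word):
    """Keep the parenthesized material: 'sit(ting)' becomes 'sitting'."""
    return word.replace("(", "").replace(")", "")


def spoken_form(word):
    """Drop the parenthesized material: 'sit(ting)' becomes 'sit'."""
    return re.sub(r"\([^()]*\)", "", word)
```

A lexical search would normally work on the full form, whereas a study of what was actually pronounced would use the spoken form.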
The
inclusion or exclusion of material enclosed in parentheses is well supported by
CLAN and this same notation can also be used for other purposes when necessary.
For example, studies of fluency may find it convenient to code the number of
times that a word is repeated directly on that word, as in this example with
three repetitions of the word dog.
*JEF: that's a dog [x 3].
By
default, the programs will remove the [x 3] form and the sentence will be
treated as a three word utterance. This
behavior can be modified by using the +r switch.
The coding of word omissions
is an extremely difficult and unreliable process. Many researchers will prefer
not to even open up this particular can of worms. On the other hand,
researchers in language disorders and aphasia often find that the coding of
word omissions is crucial to particular theoretical issues. In such cases, it
is important that the coding of omitted words be done in as clear a manner as
possible.
To code an omission,
the zero symbol is placed before a word on the text tier. If what is important
is not the actual word omitted, but its part of speech, then a code for the
part of speech can follow the zero. Similarly, the identity of the omitted word
is always a guess. The best guess is placed on the main line. This item would
be counted for scoping conventions, but it would not be included in the MLU
count. Here is an example of its use:
*EVE: I want 0to go.
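A program that reads such lines has to separate the words that were produced from the omissions coded with a leading zero. The following Python sketch shows one way to do this; the function name is hypothetical, and a bare 0 (the action symbol) is deliberately left in place because it is not an omission code.

```python
def split_omissions(words):
    """Separate 0-prefixed omission codes from the words actually produced."""
    produced, omitted = [], []
    for word in words:
        if len(word) > 1 and word.startswith("0"):
            omitted.append(word[1:])  # strip the zero, keep the word or POS code
        else:
            produced.append(word)
    return produced, omitted
```

For the example above, the produced words are I, want, and go, and the single omission is to.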
It is very difficult
to know when a word has been omitted. However, the following criteria can be
used to help make this decision for English data:
1. 0art: Unless there is a missing plural, a
common noun without an article is coded as 0art.
2. 0v: Sentences with no verbs can be coded
as having missing verbs. Of course, often the omission of a verb can be viewed
as a grammatical use of ellipsis.
3. 0aux: In
standard English, sentences like “he running” clearly have a missing auxiliary.
4. 0subj: In English, every finite verb
requires a subject.
5. 0pobj: Every preposition requires an
object. However, often a preposition may be functioning as an adverb. The coder
must look at the verb to decide whether a word is functioning as a preposition
as in “John put on 0pobj” or an adverb as in “Mary jumped up.”
In
English, there seldom are solid grounds for assigning codes like 0adj, 0adv,
0obj, 0prep, or 0dat.
There are a number of
common words in the English language that cannot be found in the dictionary or
whose lexical status is vague. For example, how should letters be spelled? What
about numbers and titles? What is the best spelling: doggy or doggie, yeah or yah, and pst or pss? If we can increase the consistency with
which such forms are transcribed, we can improve the quality of automatic
lexical analyses. CLAN commands such as FREQ and COMBO provide output based on
searches for particular word strings. If a word is spelled in an indeterminate
number of variant ways, researchers who attempt to analyze the occurrence of
that word will inevitably end up with inaccurate results. For example, if a
researcher wants to trace the use of the pronoun you, it might be necessary to search not only for you, ya,
and yah, but also for all the
assimilations of the pronouns with verbs such as didya/dicha/didcha or couldya/couldcha/coucha.
Without a standard set of rules for the transcription of such forms, accurate
lexical searches could become impossible. On the other hand, there is no reason
to avoid using these forms if a set of standards can be established for their
use. Other programs rely on the use of dictionaries of words. If the spellings
of words are indeterminate, the analyses produced will be equally
indeterminate. For that reason, it is helpful to specify a set of standard
spellings for marginal words. This section lists some of these words with their
standard orthographic form.
The forms in these lists
all have some conventional lexical status in standard American English. In this
regard, they differ from the various nonstandard forms indicated by the special
form markers @b, @c, @f, @l, @n, @o, @p, and @s. Because there is no clear
limit to the number of possible babbling forms, onomatopoeic forms, or
neologistic forms, there is no way to provide a list of such forms. In
contrast, the words given in this section are fairly well known to most
speakers of the language, and many can be found in unabridged dictionaries. The
list given here is only a beginning; over time, we intend to continue to add
new forms.
Some of the forms use
parentheses to indicate optional material. For example, the exclamation yeek can also be said as eek. When a speaker uses the full form,
the transcriber types in yeek, and
when the speaker uses the reduced form the transcriber types (y)eek. When CLAN analyzes the
transcripts, the parentheses can be ignored and both yeek and eek will be
retrieved as instances of the same word. Parentheses can also be used to
indicate missing fragments of suffixes. The majority of the words listed can be
found in the form given in Webster's
Third New International Dictionary. Those forms that cannot be found in Webster's Third are indicated with an
asterisk. The asterisk should not be used in actual transcription.
To
transcribe letters, use the @l symbol after the letter. For example, the letter
“b” would be b@l. Here is an example of the spelling of a letter sequence.
*MOT: could you please spell your name?
*MAR: it's m@l a@l r@l k@l.
The
dictionary says that “abc” is a standard word, so that is accepted without the
@l marking. In Japanese, many letters
refer to whole syllables or “kana” such as ro or ka. To represent this as well as strings of letters in English, use the @k symbol, as in
ka@k or jklmn@k. Using this form, the above example could better be coded as:
*MOT: could you please spell your name?
*MAR: it's mark@k.
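The conversion from a run of single letters to one @k item is mechanical, as this Python sketch suggests. The function name is hypothetical and the rule that only runs of two or more letters are merged is our assumption for illustration.

```python
def merge_spelled_letters(words):
    """Collapse runs of two or more @l letters into a single @k item."""
    merged, run = [], []

    def flush():
        # Emit the pending run: several letters become one @k form,
        # a single letter keeps its @l marker.
        if len(run) > 1:
            merged.append("".join(run) + "@k")
        elif run:
            merged.append(run[0] + "@l")
        run.clear()

    for word in words:
        if word.endswith("@l"):
            run.append(word[:-2])
        else:
            flush()
            merged.append(word)
    flush()
    return merged
```

Given the words it's m@l a@l r@l k@l, the sketch returns it's mark@k.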
Languages
use a variety of methods for combining words into larger lexical items. One
method involves inflectional processes, such as cliticization and affixation,
that will be discussed later. Here we
consider compounds and linkages.
Earlier, it was
necessary to write compounds in the form of bird+house
and baby+sitter, but now the plus is
no longer necessary. You can just write birdhouse and babysitter and the correct form will be inserted into the %mor line
by the MOR program.
The other level of concatenation involves the use of an underscore to indicate the fact that a phrasal combination is not really a compound, but what we call a “linkage”. Common examples here include titles of books such as Green_Eggs_and_Ham, appellations such as Little_Bo_Peep or Santa_Claus, lines from songs such as The_Farmer_in_the_Dell, and places such as Hong_Kong_University. For these forms, the underscore is used to emphasize the fact that, although the form is collocational, it does not obey standard rules of compound formation. Because these forms all begin with a capital letter, the morphological analyzer will recognize them as proper nouns. The underscore is used for two other purposes. First, it can be used for irregular combinations, such as how_about and how_come. Second, it can be used on the %mor line to represent a multiword English gloss for a single stem, as in “lose_flowers” for defleurir.
Because the dash is
used on the %mor line to indicate suffixation, it is important to avoid confusion
between the standard use of the dash in compounds such as “blue-green” and the
use of the dash in CHAT. To do
this, use the compound marker to replace the dash or hyphen, as in blue+green instead of blue-green.
In general, capitals are only allowed at the beginnings of words. However, they can also occur later in a word in these cases:
Acronyms should be
transcribed by using the component letters as a part of a “linked” form. In
compounds, the @l marking is not used, since it would make the acronym
unreadable. Thus, USA is written as U_S_A.
In this case, the first letter is capitalized in order to mark it as a proper
noun. Other examples include M_I_T, C_M_U, M_T_V, E_T, I_U, C_three_P_O,
R_two_D_two, and K_Mart. The recommended way of transcribing the common name
for television is just tv. This form
is not capitalized, since it is not a proper noun. Similarly, we can write cd, vcr, tv, and dvd. The underscore is the
best mark for combinations that are not true compounds such as m_and_m-s for the M&M candy.
Acronyms that are not
actually spelled out when produced in conversation should be written as words.
Thus UNESCO would be written as Unesco. The capitalization of the first
letter is used to indicate the fact that it is a proper noun. There must be no
periods inside acronyms and titles, because these can be confused with
utterance delimiters.
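The two acronym conventions can be sketched in Python as follows. The function names and the digit-name list are hypothetical; the transcriber still has to decide whether the acronym was spelled out or pronounced as a word, and forms such as tv or K_Mart are exceptions that the sketch does not cover.

```python
# Spoken names for digits inside acronyms, as in C_three_P_O.
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spelled_acronym(text):
    """Link the component letters with underscores: 'USA' becomes 'U_S_A'."""
    parts = [DIGIT_NAMES[int(ch)] if ch in "0123456789" else ch.upper()
             for ch in text]
    return "_".join(parts)


def pronounced_acronym(text):
    """An acronym said as a word is written as one: 'UNESCO' becomes 'Unesco'."""
    return text.capitalize()
```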
Numbers should be
written out in words. For example, the number 256 could be written as “two
hundred and fifty six,” “two hundred fifty six,” “two five six,” or “two fifty
six,” depending on how it was pronounced. It is best to use the form “fifty
six” rather than “fifty-six,” because the hyphen is used in CHAT to indicate morphemicization. If
you want to emphasize the fact that a number is a single lexical item, you can
treat it as a compound using the form two+hundred+and+fifty+six. However, if
you do this, it will be more difficult to search for uses of a particular
digit. Other strings with numbers are monetary amounts, percentages, times,
fractions, logarithms, and so on. All should be written out in words, as in
“eight thousand two hundred and twenty dollars” for $8220, “twenty nine point
five percent” for 29.5%, “seven fifteen” for 7:15, “ten o'clock ay m” for
10:00 AM, and “four and three fifths.”
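Because the written form depends on how the number was pronounced, no program can choose it for the transcriber. The following Python sketch merely generates three of the candidate forms listed above for numbers below one thousand, without hyphens; all names in it are hypothetical.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def under_hundred(n):
    """Words for 0..99, separated by spaces rather than hyphens."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def number_words(n, use_and=False):
    """Words for 0..999, e.g. 256 as 'two hundred fifty six'."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only covers 0 to 999")
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return under_hundred(rest)
    words = ONES[hundreds] + " hundred"
    if rest:
        words += (" and " if use_and else " ") + under_hundred(rest)
    return words


def digit_words(n):
    """Digit-by-digit reading, e.g. 256 as 'two five six'."""
    return " ".join(ONES[int(d)] for d in str(n))
```

The transcriber would pick whichever output matches the recording, or write the form by hand when none of them does.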
Titles
such as Dr. or Mr. should be written out in their full capitalized form as Doctor or Mister, as in “Doctor Spock” and “Mister Rogers.” For “Mrs.” use
the form “Missus.”
The following table
lists some of the most important kinship address forms in standard American
English. The forms with asterisks cannot be found in Webster's Third New International Dictionary.
Table 2:
Kinship Forms
| Child | Formal | Child | Formal |
| Da(da) | Father | Mommy | Mother |
| Daddy | Father | Nan | Grandmother |
| Gram(s) | Grandmother | Nana | Grandmother |
| Grammy | Grandmother | *Nonny | Grandmother |
| Gramp(s) | Grandfather | Pa | Father |
| *Grampy | Grandfather | Pap | Father |
| Grandma | Grandmother | Papa | Father |
| Grandpa | Grandfather | Pappy | Father |
| Ma | Mother | Pop | Father |
| Mama | Mother | Poppa | Father |
| Momma | Mother | *Poppy | Father |
| Mom | Mother | | |
One of the biggest
problems that the transcriber faces is the tendency of speakers to drop sounds
out of words. For example, a speaker may leave the initial “a” off of “about,”
saying instead “ 'bout.” In CHAT,
this shortened form appears as (a)bout. CLAN can easily ignore the parentheses
and treat the word as “about.” Alternatively, there is a CLAN option to allow
the commands to treat the word as a spelling variant. Many common words have
standard shortened forms. Some of the most frequent are given in the table that
follows. The basic notational principle illustrated in that table can be
extended to other words as needed. All of these words can be found in Webster's Third New International Dictionary.
More extreme types of
shortenings include: “(what)s (th)at” which becomes “sat,” “y(ou) are” which
becomes “yar,” and “d(o) you” which becomes “dyou.” Representing these forms as
shortenings rather than as nonstandard words facilitates standardization and
the automatic analysis of transcripts.
Two sets of
contractions that cause particular problems for morphological analysis in
English are final apostrophe s and apostrophe d, as in John’s and you’d. If you transcribe these as John (ha)s and you
(woul)d, then the MOR program will work much more efficiently.
Table 3:
Shortenings
|
Examples of Shortenings |
|||
|
(a)bout |
don('t) |
(h)is |
(re)frigerator |
|
an(d) |
(e)nough |
(h)isself |
(re)member |
|
(a)n(d) |
(e)spress(o) |
-in(g) |
sec(ond) |
|
(a)fraid |
(e)spresso |
nothin(g) |
s(up)pose |
|
(a)gain |
(es)presso |
(i)n |
(th)e |
|
(a)nother |
(ex)cept |
(in)stead |
(th)em |
|
(a)round |
(ex)cuse |
Jag(uar) |
(th)emselves |
|
ave(nue) |
(ex)cused |
lib(r)ary |
(th)ere |
|
(a)way |
(e)xcuse |
Mass(achusetts) |
(th)ese |
|
(be)cause |
(e)xcused |
micro(phone) |
|