CHILDES Derived Corpora and Counts

Researchers have constructed several derived corpora and frequency counts based on segments of the CHILDES database.

Derived Corpora

Johnson Sesotho Corpus: In order to train an automatic segmentation program, Mark Johnson at Brown has created a corpus derived from the child-directed speech (CDS) of the CHILDES Sesotho corpus. The available materials include the Python script that can be run on the Sesotho corpus, along with its output: the sentences of CDS.

Brent_Ratner Corpus: In order to train an automatic segmentation program, Michael Brent at Washington University has created a corpus derived from the CDS of the CHILDES Bernstein-Ratner corpus. The current version of this derived corpus was contributed by Sharon Goldwater.

Pearl_Sprouse Corpus: This corpus, contributed by Lisa Pearl and Jon Sprouse, provides Penn Treebank-style parses for selected corpora from the American English segment of the CHILDES database.

UCI_Brent_Syl Corpus: In order to train an automatic segmentation program, Lisa Pearl and Lawrence Phillips at UC Irvine have created a corpus derived from the CDS of the CHILDES Brent corpus. The corpus comes with the scripts and dictionary used to produce it.

Blanchard Transliterator: A Perl script to split diphthongs and r-sounds into two symbols.
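
The splitting step described above can be illustrated with a minimal Python sketch. It is not the Perl script itself, and the symbol table is hypothetical: the single-character codes and their two-symbol expansions below are placeholders chosen for illustration, not the transcription alphabet the actual script uses.

```python
# Hypothetical mapping, for illustration only. The real script's symbol
# inventory is not documented here, so these entries are assumptions.
SPLIT_MAP = {
    "9": "ai",  # placeholder diphthong code -> two vowel symbols
    "Q": "au",  # placeholder diphthong code -> two vowel symbols
    "3": "er",  # placeholder r-sound code -> vowel symbol + r
}


def split_symbols(transcription, mapping=None):
    """Replace each single symbol found in the mapping with its two-symbol form.

    Symbols that are not in the mapping (including spaces) pass through unchanged.
    """
    if mapping is None:
        mapping = SPLIT_MAP
    return "".join(mapping.get(symbol, symbol) for symbol in transcription)
```

Because the replacement works one character at a time, it assumes a transcription in which every phoneme is written as exactly one character.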

Hungarian-Italian IDS: Judit Gervain's phonological transcription of the infant-directed speech (IDS) in the Hungarian and Italian segments of CHILDES.

Polish IDS: Luc Borota's phonological transcription of the infant-directed speech (IDS) in the Polish segment of CHILDES.

Frequency Counts

Ping Li of Penn State has contributed frequency counts of child-directed speech, which can be viewed directly or downloaded as a zip file, along with the documentation.
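
For readers who want to see what a count of this kind involves, the following Python sketch tallies word frequencies over the non-child speaker lines of a CHAT transcript. It is not the procedure used for the contributed counts. It assumes that main tier lines begin with `*` and a speaker code, that the target child is coded `CHI`, and that every other speaker counts as child-directed speech; the function name `cds_word_counts` is hypothetical.

```python
import re
from collections import Counter

# Main tier lines in CHAT look like "*MOT:<tab>utterance ."
MAIN_TIER = re.compile(r"\*([A-Z0-9]+):\s*(.*)")


def cds_word_counts(chat_lines, child_code="CHI"):
    """Count word tokens on main tier lines not spoken by the child.

    Simplified on purpose: continuation lines and CHAT annotation codes
    are not handled, and words are taken to be runs of letters and
    apostrophes.
    """
    counts = Counter()
    for line in chat_lines:
        match = MAIN_TIER.match(line)
        # Skip headers (@), dependent tiers (%), and the child's own lines.
        if not match or match.group(1) == child_code:
            continue
        counts.update(re.findall(r"[a-z']+", match.group(2).lower()))
    return counts
```

Calling `cds_word_counts(open(path, encoding="utf-8"))` on a transcript file returns a `Counter`, whose `most_common()` method lists words from most to least frequent.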