The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. It was created by Mark Davies of Brigham Young University, and it is used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created.

The corpus contains more than 425 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2011 and the corpus is also updated once or twice a year (the most recent texts are from March 2011). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2011 article in Literary and Linguistic Computing).

The interface allows you to search for exact words or phrases, wildcards, lemmas, part of speech, or any combinations of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near faint, all adjectives near woman, or all verbs near feelings), which often gives you good insight into the meaning and use of a word.
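To make the idea of a collocate window concrete, here is a minimal sketch in Python. It is a toy illustration over an in-memory token list, not the corpus's own implementation; the function name `collocates` and the sample sentence are assumptions made for this example, and no part-of-speech filtering is done.

```python
from collections import Counter

def collocates(tokens, node, left=4, right=4):
    """Count words found within `left` words before and `right` words after each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        # Words to the left of the node, then words to the right (node itself excluded).
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

# Hypothetical sample text, invented for illustration only.
text = "she felt faint and heard a faint click then a loud click".split()

# Span of two words to the left and none to the right of "click".
near_click = collocates(text, "click", left=2, right=0)
print(near_click.most_common())
```

In a real query the window would also be filtered by part of speech (for example, only adjectives near a noun), which this sketch leaves out.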

The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:

By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorial, or scientific journals

Over time: compare different years from 1990 to the present time

You can also easily carry out semantically-based queries of the corpus. For example, you can contrast and compare the collocates of two related words (little/small, democrats/republicans, men/women), to determine the difference in meaning or use between these words. You can find the frequency and distribution of synonyms for nearly 60,000 words and also compare their frequency in different genres, and also use these word lists as part of other queries. Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.

Please feel free to take a five-minute guided tour, which will show the major features of the corpus. A simple click for each query will automatically fill in the form for you, search through the more than 425 million words of text, and then display the results.

The Corpus of Contemporary American English was created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University. Some scanning of original texts (mainly novels) was done by students at BYU. Mark Davies was responsible for all other aspects of the corpus construction -- collecting and editing the electronic texts, designing and implementing the corpus architecture, and designing and programming the web interface.

Thanks go to the Department of Linguistics and English Language and the College of Humanities at Brigham Young University for giving me a sabbatical in Fall 2008 (during which the final stages of the corpus were completed), as well as for the funds to purchase a new server in Fall 2008. Thanks also to Microsoft for providing a free version of SQL Server 2005 (64-bit Enterprise version), which serves as the backbone of the corpus architecture. Finally, thanks to Paul Rayson and others in the CLAWS development team for the version of CLAWS that was used to tag the corpus.

The basic architecture and interface of this corpus are similar to those of other large corpora that I have placed online, including my interface to the 100 million word British National Corpus (British, 1980s-1993), the 400 million word Corpus of Historical American English (COHA), the 100 million word TIME Corpus of Contemporary American English (1923-2006), the NEH-funded 100 million word Corpus del Español (1200s-1900s), and the 45 million word NEH-funded Corpus do Português (1300s-1900s). If you have an idea for other corpora and *have funding* as well, please feel free to contact me.

Finally, I should mention that I use the Corpus of Contemporary American English extensively in classes that I teach at BYU (as well as other corpora, such as BNC and TIME), and other faculty in the department use these corpora extensively in their teaching as well. I have developed many activities and projects that are based on data from the corpora. If you would like me to teach a workshop at your university related to the corpus and what it can tell us about American English, please contact me.

Technical information

The functionality of the corpus is due in large part to the unique architecture upon which it is based. The architecture relies on MS SQL Server relational databases, with n-gram databases that contain contextual information for each of the 425 million words in the corpus, as well as other databases containing information on word forms, part of speech, lemmas, synonyms, customized wordlists, etc. All of this was developed by me "in house"; none of the architecture is based on other "off-the-shelf" corpus-searching tools. You may wish to consult some of my recent publications, which explain this architecture in more detail.
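As a rough illustration of how n-gram tables in a relational database can answer a corpus query, the sketch below uses Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table names (`lexicon`, `bigrams`), the columns, the tags, and the counts are invented, and the actual schema on SQL Server is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: a lexicon of word forms and a table of bigram counts by genre.
conn.execute("CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT)")
conn.execute("CREATE TABLE bigrams (w1 INTEGER, w2 INTEGER, genre TEXT, freq INTEGER)")

# Toy data, invented for this example.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "faint", "faint", "jj"),
    (2, "smile", "smile", "nn1"),
    (3, "hope", "hope", "nn1"),
    (4, "smiled", "smile", "vvd"),
])
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?, ?)", [
    (1, 2, "FIC", 12),
    (1, 3, "NEWS", 7),
    (1, 3, "FIC", 6),
    (1, 4, "FIC", 1),
])

# "All nouns directly after faint", summed across genres.
rows = conn.execute("""
    SELECT l2.word, SUM(b.freq) AS total
    FROM bigrams AS b
    JOIN lexicon AS l1 ON l1.word_id = b.w1
    JOIN lexicon AS l2 ON l2.word_id = b.w2
    WHERE l1.word = 'faint' AND l2.pos LIKE 'nn%'
    GROUP BY l2.word
    ORDER BY total DESC, l2.word
""").fetchall()
print(rows)
```

The point of the design is that word, lemma, and part-of-speech constraints become joins and filters on indexed tables, so frequency by genre falls out of a simple GROUP BY.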

The Corpus of Contemporary American English (COCA) offers a balance of availability, size, genres, and currency (how recent it is) that is not found in other corpora, including the American National Corpus (ANC), the British National Corpus (BNC), the Bank of English (BOE), or the Oxford English Corpus (OEC).

In addition, the architecture used for COCA (and the other corpora from corpus.byu.edu) is among the most powerful architectures available for online corpora -- at least as scalable and feature-rich as Corpus Workbench (including its incarnation in Sketch Engine and BNCweb), as well as other architectures like VISL and PIE (more information...).

The chart below provides a summary of the features of the different corpora, and detailed discussions can be found via the links at the top of each column.

Feature | COCA | BNC | ANC | BOE | OEC
Availability | Free / web | Free / web | Free | $1,150 / year | (Limited; see note 1)
Size (millions of words) | 425 | 100 | 22 | 455 | 1,898
Time span | 1990-2011 | 1970s-1993 | 2000-2005? (see note 2) | 1970s-2005 | 2000-2006
Number of words of text being added each year | 20 million | 0 | 0 | 0 | 0
Can be used as a monitor corpus to see ongoing changes in English | Yes | No | No | (Limited) | (Limited)
Wide range of genres: spoken, fiction, popular magazine, newspaper, academic | Yes | Yes | No | (Yes) | (Yes)
Size of spoken (millions of words) | 85 | 10 | 4 | 62 | 82
Spoken = conversational, unscripted? | (Mostly: see notes) | Yes | Yes | (Some) | (Some)
Dialect | American | British | American | Br / Am + | Am / Br +
Interface | BYU | BYU, Sketch Engine, BNCweb, VISL, PIE | Open ANC (DVD) | Sketch Engine | Sketch Engine

Notes:

1. The OEC is generally available only to researchers at Oxford University Press, although access is occasionally given to selected researchers outside of the OUP.

2. In 2005 the corpus had 22 million words, and the current site still mentions 22 million words, but some people have claimed that new texts have been added since 2005. We'd appreciate details from anyone who knows about the status on this.

Feature | Corpus of Contemporary American English | American National Corpus (see note 2)
Size | 425 million words (see note 1) | 22 million words
Dates | 1990 - 2011 | 1990 - ?? (see note 3)
Date distribution | 20 million words each year | 0.5-3 million
Updated | Yes, 1-2 times/year | No (??) (see note 4)
Availability / price | Free access (but only via web interface) | Free (via Open ANC), or DVD ($75, from the LDC). Full text access.
Spoken | 85 million words (4m each year, 1990-2011). Transcripts of unscripted conversation on 150+ different TV and radio programs (ABC, CBS, NBC, Fox, PBS, NPR, etc) | 4 million words. Call-Home, Charlotte, MICASE, Switchboard
Fiction | 81 million words (4m each year, 1990-2011). Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, movie and TV scripts | 0.5 million words. From the publishers Hargraves and Eggan
Magazines (popular) | 86 million words (4m each year, 1990-2011). 100 magazines; balanced between news, health, home and gardening, women, financial, religion, sports, etc | 5 million words. 2 magazines: Slate (politics) and Verbatim (linguistics)
Newspapers | 81 million words (4m each year, 1990-2011). 10 newspapers, including USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. | 4 million words. 1 newspaper: New York Times
Academic (journals) | 81 million words (4m each year, 1990-2011). 100 journals. Balanced coverage of the entire range of the Library of Congress classification system (K = education, T = technology, etc.) | 4 million words. 2 journals: BioMed and PLOS (Public Library of Science)
Other text types | --- | 3 million words: Blog (Buffy the Vampire Slayer); 1m: Travel guide (Berlitz); 1m: Government ("web data" [??]); <1m: Miscellaneous (911 report, letters, other non-fiction)

Notes

1 The Corpus of Contemporary American English contained about 365 million words when it was released in early 2008 (20 million words each year, 1990-2007). As of mid-2011, it has more than 425 million words. It will continue to grow by 20 million words each year.

2 Refers to the Second Release (2005) of the American National Corpus. There has not been a Third Release since that time.

3 This is probably a function of whether/when the ANC is completed

4 The ANC was projected to have 100 million words upon completion in c2005. No plans have been announced to expand the corpus beyond that size, if/when the corpus is completed.

The British National Corpus (BNC) and the Corpus of Contemporary American English (COCA) complement each other nicely, since they are the only large, well-balanced corpora of English that are freely-available online. Here we will briefly compare the two corpora in terms of corpus size, genre coverage, and how up-to-date they are.

Corpus size

The Corpus of Contemporary American English (425 million words) is more than four times as large as the British National Corpus (100 million words). As a result, it often provides data for lower-frequency constructions that are not available from the BNC. In terms of concrete examples, let us focus here on just two types of phenomena -- collocates and syntax.

Collocates / semantics. The following table shows the number of different collocates that occur at least 3-5 times with the given node words. Notice that with a word like nibble, the word itself only occurs 4-5 times as often in COCA as the BNC (1194 to 244; to be expected from a corpus four times the size). But in terms of collocates, there are 14 times as many in COCA that occur 5 times or more as there are in the BNC. For low frequency words like these, there is often a real difference between a 100 million word corpus and a 425 million word corpus.

Word (PoS) | COCA freq | BNC freq | Collocate PoS | Span | COCA collocates | BNC collocates
click (noun) | 3145 | 445 | adj | 2L / 0R | loud, audible, double, sharp | double, sharp, loud
nibble (verb) | 1194 | 244 | noun | 0L / 3R | edges, grass, ear, lip | ear, bait
serenely (adv) | 308 | 83 | verb | 4L / 4R | smile, float, gaze, glide | said, smiled
crumbled (adj) | 446 | 27 | noun | 0L / 3R | cheese, bacon, bread, cornbread | ---

Syntax. Consider the following three examples.

[like] for [p*] to [v*] (I’d really like for you to stay)

There are 5 tokens in the BNC, but 330 tokens in COCA. With the BNC there aren't enough examples to see if this is a feature of informal or formal English, but the data from COCA show that it is clearly a feature of spoken English. The data also shows that it is increasing slowly over time, when compared as a ratio to the construction [ like -- him to V ].

Is it excel in V-ing, or excel at V-ing? (she excels in/at playing the piano)

Granted, this is a very narrow issue, but it is precisely the thing that translators and non-native speakers are interested in. With the BNC there are 5 tokens with at and 6 with in -- probably not enough to say which is more common. In COCA, however, there are 122 with at and 42 with in. This is enough to begin to see which genres prefer one or the other, as well as which subordinate clause verbs occur with each. Such granularity is not possible with the BNC.

[have] been being [vvn] (she had been being watched)

There are 2 tokens in the BNC (1 spoken, 1 fiction), and this is not enough data to see any possible genre variation. In COCA, on the other hand, there are 13 tokens (10 spoken, 2 fiction, 1 news). This is enough to show that this is a feature of spoken English, and the data also shows that it is increasing since 1990. (By the way, most native speakers of both dialects will cringe at sentences like this, but they are in the corpora.)

In summary, while 100 million words is often adequate for studying syntax, for some very low-frequency phenomena, there is a real difference between 100 million words (BNC) and 425 million words (COCA).
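The bracketed query patterns used above, such as [like] for [p*] to [v*], combine lemma matches, exact words, and part-of-speech wildcards. The following Python sketch shows one way such a pattern could be matched against tagged text. It is only an illustration: the tuple layout, the helper names, and the sample sentence are assumptions, and the tags are CLAWS-style tags chosen for the example rather than output from the tagger.

```python
def element_matches(element, token):
    """Check one pattern element against one (word, lemma, pos) token."""
    kind, value = element
    word, lemma, pos = token
    if kind == "lemma":      # [like] matches any form of the lemma
        return lemma == value
    if kind == "pos":        # [p*] matches any tag starting with "p"
        return pos.startswith(value)
    return word.lower() == value  # a plain word must match exactly

def find_matches(pattern, tagged):
    """Return every stretch of text where the whole pattern matches in sequence."""
    hits = []
    for i in range(len(tagged) - len(pattern) + 1):
        window = tagged[i:i + len(pattern)]
        if all(element_matches(e, t) for e, t in zip(pattern, window)):
            hits.append(" ".join(word for word, _, _ in window))
    return hits

# Hypothetical tagged sentence: (word, lemma, CLAWS-style tag).
sentence = [
    ("I", "i", "ppis1"), ("'d", "would", "vm"), ("really", "really", "rr"),
    ("like", "like", "vvi"), ("for", "for", "if"), ("you", "you", "ppy"),
    ("to", "to", "to"), ("stay", "stay", "vvi"),
]

# [like] for [p*] to [v*]
pattern = [("lemma", "like"), ("word", "for"), ("pos", "p"), ("word", "to"), ("pos", "v")]
print(find_matches(pattern, sentence))
```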

How up-to-date are the corpora?

COCA has 20 million words in each year since the early 1990s (for a total of more than 425 million words), and the most recent texts are from March 2011. The most recent texts in the BNC, on the other hand, are from the early 1990s -- more than fifteen years ago. This has important implications in terms of how the two corpora represent contemporary English.

Lexical. Perhaps the easiest comparison deals with words that have recently come into English, or which are used a lot more now than 15-20 years ago. The following lists show a few words (just a tiny sample of all such words) that are found less than half as often in the BNC as in COCA (per million words), and the words in italics are found less than 10% as often (often, there are no tokens in the BNC). Obviously, some are American words and wouldn't be in a corpus of British English. Many others, however, are words that are simply much more common in COCA, because it alone contains texts from the last 15 years.

Noun: website (COCA/BNC), blog (COCA/BNC), globalization/globalisation (COCA/BNC), SUV, RPG, Taliban, e-mail, anthrax, recount, adolescent, prep, tsunami, affiliation, Sunni, insurgent, insurgency, terrorism, coping, terrorist, cleric, yoga, homeland, genome, steroid, detainee, militant

Adjective: same-sex (COCA/BNC), Islamist (COCA/BNC), upscale (COCA/BNC), terrorist, faith-based, web-based, nonstick, dot-com, performance-enhancing, high-stakes, 21st-century, old-school, pandemic, iconic, insurgent, online, broadband, gated, wireless, clueless

Adverb: wirelessly (COCA/BNC), healthfully (COCA/BNC), multiculturally (COCA/BNC), preemptively, inferiorly, counterintuitively, online, forensically, intraoperatively, postoperatively, famously

Verb: mentor (COCA/BNC), morph (COCA/BNC), download (COCA/BNC), e-mail, makeover, prep, upload, workout, freak, transition, vaccinate, encrypt, reconnect, click, host, splurge, preheat, co-write, outsource, snack, partner

Although we have focused just on new "words" here, the same thing holds for other areas of language -- morphology (word formation), syntax (grammar), and semantics (word meaning, such as green = "environmentally friendly"), or discourse analysis (what we are saying about immigrants, or women, or the environment). Any changes that have occurred since the early 1990s will not show up in BNC, but should be modeled quite nicely with COCA.

Genre balance

The BNC is 10% spoken / 90% written, while in COCA the corpus is nearly evenly divided (20% in each genre) between spoken, fiction, popular magazines, newspaper, and academic.

Genre | COCA (millions of words) | BNC (millions of words)
Spoken | 85 | 10
Fiction | 81 | 17
Popular magazines | 86 | 16
Newspaper | 81 | 11
Academic | 81 | 16
Other | --- | 30

The BNC has a much wider range of spoken sub-genres, while COCA is composed of unscripted conversation on TV and radio shows (see notes on the naturalness of these conversations). Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, because there is a diachronic aspect to COCA (coverage over time), in COCA the distribution of 20% in each of the five genres stays constant from year to year.

Summary

COCA and the BNC complement each other nicely, and they are the only large, well-balanced corpora of English that are publicly available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall.

Unless one is inherently interested in only British or American English, there is really no reason to not take advantage of both corpora. This is especially true when -- as with the interface at corpus.byu.edu -- both corpora can be used side-by-side, with the same interface. For most types of studies, academic publications and presentations that rely on just the BNC for data from Modern English will look increasingly outdated and insular as time goes on.

The Bank of English (BoE, now online as Word Banks Online) is a wonderful corpus overall, and it has been used as the basis for many insightful analyses of Modern English. However, one particular area in which it is perhaps less useful than COCA is in its use as a "monitor corpus" -- a corpus that can be used to look at ongoing changes in the language. As far as we are aware, the creators of the Bank of English (BoE) themselves have never claimed that it can be used as a monitor corpus, but this claim has been made on behalf of the corpus by many other researchers.

As we will see, however, there are several weaknesses with the BoE that seriously limit its usefulness as a monitor corpus. It is perhaps for this reason that there are in fact very few studies that actually use the BoE as the basis for research into current, ongoing changes in the language. The Corpus of Contemporary American English (COCA), on the other hand, was designed from the ground up as a monitor corpus, and it does provide rich, useful data that is not available from the BoE.

Corpus size and growth

Both corpora are quite large. COCA is about 402 million words, while the BoE is about 455 million words. Both are much larger than the widely-used 100 million word British National Corpus (BNC; see comparison of COCA and the BNC). One important difference between COCA and the BoE, however, is that COCA continues to be updated, as a true monitor corpus should be. Another 20 million words are added to COCA each year (the last update was March 2011), while work on the BoE has (apparently) stopped -- the last texts in the BoE are from 2005.

Genres

It is a bit difficult to know exactly what is in the Bank of English, since there is only one page with a sketchy outline online. With the Corpus of Contemporary American English, on the other hand, we have details by year, genre, sub-genre, and even down to the level of each of the 160,000 individual texts.

It looks, however, like the following is the composition of the BoE for the American and British sub-corpora, along with the equivalent sizes from COCA:

Genre | COCA (millions of words) | BoE: UK (millions of words) | BoE: US (millions of words)
Spoken | 81.8 | 41.4 | 20.1
Fiction | 78.8 | 24.1 | 33.1
Popular magazines | 83.3 | 16.3 | 15.3
Newspaper | 79.5 | 125.6 | 77.8
Academic | 79.0 | --- | ---
Other (Non-fiction books) | --- | 51.6 | 43.1
TOTAL | 402.3 | 259.4 | 189.4

As one can see, COCA is evenly balanced between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The BoE, on the other hand, is heavily weighted towards newspapers (about 50%) because they are easy to acquire from online sources. There are apparently no academic journals in the BoE (or at least they are not labeled as such). Finally, much of the spoken in the BoE is taken from transcripts that are read, whereas in COCA they come from spontaneous speech on TV and radio programs.

Where is the informal speech?

In several searches of informal constructions in the BoE that we have done, it appears that the BoE has far too little data, which suggests that the limited Spoken texts in the BoE do not represent actual spoken English very well. (This is probably because their texts come just from transcripts from the Voice of America, and there is little or no spontaneous speech). To give just one example, the following are the number of tokens of the "quotative like" (and she's like, "I don't know").

BoE figures are for just the American texts.

Years | BoE tokens | BoE size | BoE per million | COCA tokens | COCA size | COCA per million
1990-94 | 5 | 20,883,000 | 0.24 | 128 | 103,300,000 | 1.2
1995-99 | 1 | 19,187,000 | 0.05 | 336 | 102,900,000 | 3.3
2000-04 | 173 | 123,055,000 | 1.41 | 453 | 102,600,000 | 4.4

As can be seen, there is a huge disparity between COCA and the BoE. In terms of normalized frequencies (per million words), this informal construction is 3.1 times as common in COCA as in the BoE in 2000-04, 5.0 times as common in 1990-94, and 66.0 times as common in 1995-99. We could repeat this with many other phenomena (and in forthcoming publications we do so). The bottom line is that COCA -- even though it has at times been (incorrectly) criticized for not having enough "informal" spoken texts -- has much more of this than the Bank of English.
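The normalization behind these figures is simple: tokens divided by corpus size, times one million. The Python sketch below recomputes the per-million rates from the token counts and sizes given in the table above; the function and variable names are assumptions made for this example. Ratios computed from unrounded rates can differ slightly from ratios computed from the rounded rates in the table.

```python
def per_million(tokens, corpus_size):
    """Normalized frequency: occurrences per million words of text."""
    return tokens / corpus_size * 1_000_000

# (tokens, corpus size in words) for the "quotative like" counts given above.
boe = {"1990-94": (5, 20_883_000), "1995-99": (1, 19_187_000), "2000-04": (173, 123_055_000)}
coca = {"1990-94": (128, 103_300_000), "1995-99": (336, 102_900_000), "2000-04": (453, 102_600_000)}

for period in boe:
    boe_rate = per_million(*boe[period])
    coca_rate = per_million(*coca[period])
    print(f"{period}: BoE {boe_rate:.2f}, COCA {coca_rate:.1f}, COCA/BoE {coca_rate / boe_rate:.1f}")
```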

Genre balance over time

In order to use frequency statistics to look at changes over time -- as we would want to do with a monitor corpus -- each historical period needs to have the same genre composition. To take a worst-case example, suppose that a corpus had only newspapers from the 1990s and then only fiction from the 2000s. For any change that we see from the 1990s to the 2000s, we would not know if the change had actually occurred in the language as a whole, or if it is just an "artifact" of the changing genre composition from one period to the next.
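The artifact can be shown with a small calculation. In the Python sketch below, all numbers are hypothetical and chosen only for illustration: a form is assumed to occur at a fixed rate in fiction and at a fixed, lower rate everywhere else, and only the share of fiction in the corpus changes between two periods.

```python
def overall_rate(fiction_share, fiction_rate, other_rate):
    """Per-million rate in the whole corpus, as a weighted mix of the genre rates."""
    return fiction_share * fiction_rate + (1 - fiction_share) * other_rate

# Hypothetical per-million rates, held constant across both periods.
FICTION_RATE = 3500.0
OTHER_RATE = 1000.0

early = overall_rate(0.30, FICTION_RATE, OTHER_RATE)  # fiction is 30% of the corpus
late = overall_rate(0.15, FICTION_RATE, OTHER_RATE)   # fiction is 15% of the corpus

change = (late - early) / early
print(f"early {early:.0f}, late {late:.0f}, change {change:.1%}")
```

Nothing about usage within either genre changes in this sketch, yet the overall rate falls by about a fifth, purely because the genre mix shifted.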

What we find is that COCA is balanced across genres -- almost perfectly -- from year to year. In each and every year from 1990-2011, the corpus has been divided between spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%), and academic journals (20%). Even at the level of sub-genre (e.g. Newspaper-Sports, or Academic-Medicine), the corpus composition changes very little from year to year.

In the BoE, however, the genre composition varies widely from one year (or set of years) to another. For example, the following figures show the percentage of fiction in the US sub-corpus in different time periods:

Time period | Fiction | Total | % fiction
1960-1979 | 1,030,000 | 1,414,000 | 72.8%
1980-1989 | 3,087,000 | 8,792,000 | 35.1%
1990-1994 | 6,049,000 | 20,833,000 | 29.0%
1995-1999 | 3,100,000 | 19,187,000 | 16.2%
2000-2004 | 18,800,000 | 123,055,000 | 15.3%

Notice how the percentage of fiction decreases by about 44% in relative terms (from 29.0% to 16.2%) from the early 1990s to the late 1990s. Let us briefly look at how this distorts the corpus data for these periods.

Form | ALL 90-94 (per mil) | ALL 95-99 (per mil) | FIC 90-94 (per mil) | FIC 95-99 (per mil)
mutter (all forms) | 18.1 | 14.0 | 53.9 | 51.3
she said | 189.5 | 145.0 | 540.8 | 578.4
had + VBN (e.g. had seen) | 2699.5 | 1622.2 | 3569.2 | 3360.7

All three of these forms (mutter, she said, and had + VBN) are characteristic of fiction. Notice that in just the US fiction part of the BoE (the FIC columns), the frequency per million words stays about the same from 1990-94 to 1995-99, as we would expect. But in the entire US part of the BoE (all genres; the ALL columns), the normalized frequency (per million words) decreases much more from 1990-94 to 1995-99. For example, had + VBN decreases by about 40%. Why is this? Well, notice in the table above that the percentage of the US corpus in the BoE that is fiction decreased by about 44% (from 29.0% to 16.2%) during the same period. In other words, the decrease in the corpus is probably just a function of the change in genre balance, rather than any change in "real world" language. (It would, after all, be quite strange if people really did all of a sudden say had eaten, had noticed, etc. only 60% as much in the late 1990s as the early 1990s!)

In COCA, on the other hand, the relative frequency of these three forms in the overall corpus stays quite flat from 1990-94 until 2005-09, because the percentage of texts in the corpus that are from fiction (20% each year) stays the same.

COCA: frequency per million words, by period

Form | 1990-1994 | 1995-1999 | 2000-2004 | 2005-2009
mutter | 14.9 | 13.4 | 14.8 | 15.9
she said | 197.9 | 210.7 | 190.4 | 204.5
had [VVN] | 1,173.1 | 1,066.2 | 1,059.0 | 1,095.4
Size (millions of words) | 103.3 | 102.9 | 102.6 | 93.6

Strange BoE data

Even beyond this serious problem with genre balance, it appears that there might be an even more fundamental problem with the Bank of English. To see what this is, consider the following table:

BoE | 1990-94 | 1995-99 | 2000-04 | 90-94 > 95-99 | 95-99 > 00-04
was VVN | (32370) | (20558) | (179367) | 0.69 | 1.36
to be | (29467) | (24726) | (141880) | 0.91 | 0.89
is | (134551) | (157808) | (810686) | 1.28 | 0.80
and | (467783) | (432037) | (2286364) | 1.01 | 0.83

This table shows the frequency of four common items -- two words (is and and), a phrase (to be), and a grammatical construction (was VVN: was seen, was considered) -- in the Bank of English in three periods: 1990-94, 1995-99, and 2000-04. (The raw frequency data is in parentheses, while the normalized value, per million words, is in bold.) The two columns at the right (90-94 > 95-99 and 95-99 > 00-04) show the ratio of the normalized figures between 1990-94 and 1995-99, and between 1995-99 and 2000-04. For example, in the BoE the frequency of the passive “decreased” 31% (a ratio of 0.69) between 1990-94 and 1995-99, and then “increased” 36% (a ratio of 1.36) between 1995-99 and 2000-04.

One might wonder why the passive would increase or decrease 30-35% between two adjacent periods, or why a very common word like is or and would vary by 20-30% from one period to the next. And notice that it is not just a problem with corpus sizes and bad calculations -- with one word the frequency might increase dramatically between two periods in the BoE, while with another word it might decrease dramatically during the same period. With frequency statistics this strange for common, predictable words, it is difficult to have confidence that the BoE will provide accurate data for other words, phrases, and grammatical constructions that we might be researching.
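The period-to-period ratios can be recomputed from the raw frequencies. The Python sketch below does this under one stated assumption: that the raw counts refer to the US texts of the BoE, whose period sizes (20,883,000, 19,187,000, and 123,055,000 words) are given in the "quotative like" table earlier. The names `SIZES`, `RAW`, and `period_ratios` are made up for this example.

```python
# Assumed corpus sizes (words) for the three periods, taken from the earlier BoE table.
SIZES = (20_883_000, 19_187_000, 123_055_000)

# Raw frequencies for 1990-94, 1995-99, and 2000-04.
RAW = {
    "was VVN": (32370, 20558, 179367),
    "to be": (29467, 24726, 141880),
    "is": (134551, 157808, 810686),
    "and": (467783, 432037, 2286364),
}

def period_ratios(raw_counts, sizes=SIZES):
    """Normalize each count per million words, then compare adjacent periods."""
    rates = [count / size * 1_000_000 for count, size in zip(raw_counts, sizes)]
    return rates[1] / rates[0], rates[2] / rates[1]

for form, counts in RAW.items():
    first, second = period_ratios(counts)
    print(f"{form}: {first:.2f} then {second:.2f}")
```

A ratio of 1.00 means no change in normalized frequency; values such as 0.69 or 1.36 for a common grammatical construction are the swings discussed above.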

COCA | 90-94 > 95-99 | 95-99 > 00-04
was VVN | 0.95 | 1.00
to be | 0.97 | 0.98
is | 0.99 | 0.98
and | 1.00 |

The table above shows the normalized frequencies (per million words) for four common words, phrases, and grammatical constructions in COCA from the early 1990s to 2004 (data is also available for 2005-2011 but we have omitted it here, to enable easier comparison with the BoE data). Notice that the frequency of these words is essentially flat over time (as we would expect it to be), and we do not have the strange anomalies that are found in the Bank of English.

Summary

For those who can afford the $1,150 per year, the Bank of English provides very good data for British and American English. Contrary to what has commonly been said about the BoE, however, it is probably not an overly-reliable "monitor corpus", because its genre balance varies so much from year to year. One can never know whether the changes that one sees are a function of the changing genre balance or whether they represent actual changes in the "real world".

The freely-available Corpus of Contemporary American English (COCA), on the other hand, was explicitly and carefully designed as a monitor corpus. This is especially apparent in the corpus design, where the corpus maintains the same genre balance from year to year. As far as we are aware, COCA is the only large corpus that is designed this way, and which can thus be used to accurately measure recent shifts in English.

The Oxford English Corpus (OEC) is the largest structured corpus of any language. Unfortunately, the corpus is generally only available to researchers who are working on projects for Oxford University Press. Nevertheless, it may still be useful to make a brief comparison of COCA and the OEC.

Corpus size and historical coverage

The Oxford English Corpus is about 1.9 billion words, or about 4-5 times the size of COCA. This means that for many very low-level phenomena it provides much more data than COCA, in the same way that COCA provides much more data than a corpus that is 4-5 times smaller than itself, such as the 100 million word British National Corpus.

In terms of historical coverage, however, COCA is more developed than the OEC. COCA has texts from 1990-2010 (20 million words each year), while the OEC has texts from just about one-third that number of years -- 2000-2006. No texts have been added to the OEC since 2006, whereas 20 million words of text continue to be added to COCA each year, bringing it right up to the present time.

The following is a summary of the number of words in each corpus in each year from 1990 to 2011:

Year | COCA | OEC
1990 | 20,459,999 | ---
1991 | 20,565,501 | ---
1992 | 20,636,288 | ---
1993 | 20,834,905 | ---
1994 | 20,896,229 | ---
1995 | 20,687,538 | ---
1996 | 20,344,253 | ---
1997 | 20,384,808 | ---
1998 | 20,641,624 | ---
1999 | 20,907,933 | ---
2000 | 20,703,668 | 133,340,604
2001 | 20,030,139 | 182,815,084
2002 | 20,377,984 | 280,691,405
2003 | 20,755,087 | 370,800,134
2004 | 20,702,033 | 518,717,955
2005 | 20,702,447 | 377,953,350
2006 | 20,795,229 | 25,099,165
2007 | 20,417,804 | ---
2008 | 20,376,279 | ---
2009 | 19,801,393 | ---
2010 | 9,564,454 | ---
TOTAL | 420,585,595 | 1,889,417,697

Genres and informal texts

The OEC has texts from a wide range of genres and text types, as well as dialects. COCA is divided evenly (20% each year, and therefore overall as well) between spoken, fiction, popular magazines, newspapers, and academic journals (see details by year, genre, sub-genre, and even down to the level of each of the 160,000 individual texts).

Some have incorrectly criticized COCA for not having enough "informal" texts, because they have not really understood what is in the spoken texts in COCA. In comparison with the Oxford English Corpus, however, COCA does a very good job of including informal language. There are many phenomena for which COCA has 4-5 times as much material (per million words) as does the OEC. For example, the following shows the number of tokens for the "quotative like" construction (and I'm like, you're crazy) in each year in the US portion of the OEC. The overall average is 0.76 tokens per million words. In COCA, on the other hand, it is 3.94, or more than five times what is in the OEC. This is just one of many examples that could be given.

COCA figures are given per five-year period, shown on the row of the first year of that period.

Years | OEC tokens | OEC size | OEC per million | COCA tokens | COCA size | COCA per million
1990-94 | --- | --- | --- | 130 | 103,300,000 | 1.3
1995-99 | --- | --- | --- | 347 | 102,900,000 | 3.4
2000 | 45 | 66,455,562 | 0.68 | 462 (2000-04) | 102,600,000 | 4.5
2001 | 40 | 89,913,492 | 0.44 | | |
2002 | 111 | 142,621,850 | 0.78 | | |
2003 | 121 | 191,239,937 | 0.63 | | |
2004 | 202 | 240,840,436 | 0.84 | | |
2005 | 177 | 180,930,648 | 0.98 | 645 (2005-09) | 93,600,000 | 6.9
2006 | 12 | 15,442,798 | 0.78 | | |
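The overall averages quoted above (0.76 for the OEC and 3.94 for COCA) are pooled rates: total tokens divided by total words, not a mean of the yearly rates. The Python sketch below recomputes them from the token counts and sizes in the table; the function and variable names are assumptions made for this example.

```python
def pooled_per_million(rows):
    """Total tokens over total words, expressed per million words."""
    total_tokens = sum(tokens for tokens, _ in rows)
    total_words = sum(size for _, size in rows)
    return total_tokens / total_words * 1_000_000

# (tokens, size in words) per year for the US portion of the OEC, 2000-2006.
oec = [
    (45, 66_455_562), (40, 89_913_492), (111, 142_621_850), (121, 191_239_937),
    (202, 240_840_436), (177, 180_930_648), (12, 15_442_798),
]

# (tokens, size in words) per five-year period for COCA, 1990-2009.
coca = [(130, 103_300_000), (347, 102_900_000), (462, 102_600_000), (645, 93_600_000)]

oec_rate = pooled_per_million(oec)
coca_rate = pooled_per_million(coca)
print(f"OEC {oec_rate:.2f}, COCA {coca_rate:.2f}, COCA/OEC {coca_rate / oec_rate:.1f}")
```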

Genre balance over time

What we find is that COCA is balanced across genres -- almost perfectly -- from year to year. In each and every year from 1990-2009, the corpus has been divided between spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%), and academic journals (20%). Even at the level of sub-genre (e.g. Newspaper-Sports, or Academic-Medicine), the corpus composition changes very little from year to year.

In the OEC, however, the genre composition varies widely from one year (or set of years) to another. For example, the following figures show the percentage of fiction in the US sub-corpus in different time periods:

Year Total Fiction % fiction

2000 66,455,562 6,479,988 9.8

2001 89,913,492 14,326,315 15.9

2002 142,621,850 36,938,545 25.9

2003 191,239,937 61,788,465 32.3

2004 240,840,436 53,462,736 22.2

2005 180,930,648 57,083,698 31.6

2006 15,442,798 12,740,916 82.5

Notice how the percentage of fiction varies widely from year to year (10% to 82%), and how even in two adjacent years it varies widely, such as 32% in 2003 and 22% in 2004. Let us briefly look at how this distorts the corpus data for these periods.

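mutter (all forms)

Section Year Tokens Per million

Entire corpus 2001 1669 18.6

Entire corpus 2004 8552 44.7

Entire corpus 2006 1652 107.0

Fiction 2001 1557 110.1

Fiction 2004 5927 110.9

Fiction 2006 1647 129.3

had + VBN (e.g. had seen)

Section Year Tokens Per million

Entire corpus 2001 81811 909.9

Entire corpus 2004 245966 1021.3

Entire corpus 2006 32178 2083.7

Fiction 2001 36135 2522.3

Fiction 2004 135952 2542.9

Fiction 2006 30535 2396.6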

These two forms (mutter and had + VBN) are characteristic of fiction. Notice that in just the US fiction part of the OEC (the Fiction figures), the frequency per million words stays about the same between 2001, 2004, and 2006, as we would expect. But in the entire US part of the OEC (all genres; the Entire corpus figures), the normalized frequency (per million words) varies widely. For example, had + VBN increases by more than 100% between 2001 and 2006. Why is this? Well, notice in the earlier table on the percentage of fiction that the share of the US corpus in the OEC that is fiction increases markedly over time. In other words, the increase in frequency of these phenomena in the corpus is probably just a function of the change in genre balance, rather than any change in "real world" language. (It would, after all, be quite strange if people really did all of a sudden say had eaten, had noticed, etc. 200% as much in 2006 as in 2004!)
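The genre-balance effect can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not part of the original analysis: it takes the 2001 OEC figures for had + VBN, derives one rate for fiction and one for everything else, holds both rates fixed, and changes only the share of fiction to the 2006 level. The function names are hypothetical, and the assumption that the rates inside each genre stay constant is a simplification.

```python
def rate_per_million(tokens: int, words: int) -> float:
    """Occurrences per one million words."""
    return tokens / words * 1_000_000


def blended_rate(fiction_share: float, fiction_rate: float, other_rate: float) -> float:
    """Overall rate when `fiction_share` of the corpus is fiction."""
    return fiction_share * fiction_rate + (1 - fiction_share) * other_rate


# 2001 figures for had + VBN in the US part of the OEC (from the tables above).
total_words_2001, fiction_words_2001 = 89_913_492, 14_326_315
total_tokens_2001, fiction_tokens_2001 = 81_811, 36_135

fiction_rate = rate_per_million(fiction_tokens_2001, fiction_words_2001)
other_rate = rate_per_million(
    total_tokens_2001 - fiction_tokens_2001,
    total_words_2001 - fiction_words_2001,
)

# Share of fiction in each year (fiction words / total words).
share_2001 = fiction_words_2001 / total_words_2001
share_2006 = 12_740_916 / 15_442_798

overall_2001 = blended_rate(share_2001, fiction_rate, other_rate)
projected_2006 = blended_rate(share_2006, fiction_rate, other_rate)

print(f"fiction share 2001: {share_2001:.1%}, 2006: {share_2006:.1%}")
print(f"overall rate with the 2001 mix: {overall_2001:.1f} per million")
print(f"overall rate with the 2006 mix: {projected_2006:.1f} per million")
```

With nothing changed except the genre mix, the projected 2006 rate comes out at roughly 2,190 per million, close to the 2,083.7 actually reported for the entire corpus in 2006. That is consistent with the explanation that the apparent increase reflects the corpus composition rather than the language.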

In COCA, on the other hand, the relative frequency of these forms in the overall corpus stays quite flat from 1990-94 until 2005-09, because the percentage of texts in the corpus that are from fiction (20% each year) stays the same.

mutter

PERIOD 1990-1994 1995-1999 2000-2004 2005-2009

PER MIL 14.9 13.4 14.8 15.9

SIZE (MW) 103.3 102.9 102.6 93.6

had [VVN]

PERIOD 1990-1994 1995-1999 2000-2004 2005-2009

PER MIL 1,173.1 1,066.2 1,059.0 1,095.4

SIZE (MW) 103.3 102.9 102.6 93.6

Summary

The Oxford English Corpus is a great corpus in terms of its size and even the wide range of genres and text types. COCA is not as large, but it does cover more years. Perhaps most importantly, COCA has the same genre balance from year to year, which allows it to be used as a monitor corpus in ways that the OEC could not be.

The bottom line, however, is that the OEC is not really available to the general public, so very few people can actually use it. COCA, however, is freely available to all interested researchers, teachers, and language learners.

The Web is much larger than the Corpus of Contemporary American English, and Google is a great search engine. So why not just use Google to see what's happening in contemporary American English? Well, as good as it is for most searches, there are things that neither Google (nor any other search engine) can do (or which they do only very poorly), but which are possible with our corpus. These include the following:

Looking at differences between different styles or types of English. Is a particular grammatical construction or a given phrase used more in informal (e.g. spoken) or formal (e.g. academic) English? Google is pretty good at knowing what domain something comes from (e.g. cbs.com or neh.org), but it can't really relate that (well) to "genre", or "styles of speech".

Measuring changes over time. Is a word or phrase used more or less now than in the early 1990s? Which verbs are really on the increase during the last 2-3 years? There is no way to check this with Google or other search engines.

Grammar-based searches. Is end up VERB-ing (e.g. ended up paying too much) on the increase or decrease? Is the passive (be VERB-ed: e.g. was seen) used more in spoken or academic? Google doesn't allow you to search by part of speech or lemma (e.g. all of the forms of a word). You'd have to search for each string individually (e.g. all forms of end + up + every conceivable verb).

Semantically-based searches. How are fair, or strike, or sign used in the language? In order to find out, you need to look at collocates (nearby words), since (as corpus linguists are fond of saying) "the words that a word 'hangs out with' can tell you a lot about its meaning". But Google doesn't do collocates.

And more semantically-based searches. Since Google can't do collocates, it obviously can't use them to compare word meanings in different genres (e.g. chair in fiction and academic), or to see how they're changing over time (e.g. green = "environmentally friendly").

And even more complex semantically-based searches. Google only really knows how to search for specific words and strings. It doesn't let you search by words that are related in meaning, such as all of the synonyms of a given word, or all of the 100+ words in a list you've created (related to fashion, or food, or clothing, or whatever) as part of a query. Our corpus can do both of these.

Finding the word when you don't know what the word is. What are the nouns that are found mainly in engineering articles, collocates of hard that are used more in fiction, or synonyms of strong that are found mainly in spoken? Google allows you to find the occurrence of a given form that you already know, but it can't produce a list of words for you that match criteria like these.

Searching for strings of words. Sure, on Google you can search for a phrase like "might be taken for a". Go ahead and try it. How many hits does it say there are? Our search today shows 92,400. Start paging through the hits, though, and they run out at about 740. In other words, Google's "guess" is more than 100 times what it should be. This is because Google usually doesn't "know" the frequency of anything more than single words -- it's usually just guessing.

So if you want to find web pages dealing with a certain topic, then Google is fine. But using Google as a full-blown linguistic search engine has real drawbacks. None of the preceding types of searches -- which are some of the most interesting ones that you can carry out to see what's going on with the language -- are possible with Google (or any other search engine). But they are all possible -- quickly and easily -- with the Corpus of Contemporary American English.
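To make the idea of collocates (nearby words) concrete, here is a minimal Python sketch that counts the words occurring within a fixed window of a node word. It is only an illustration: the sample text is invented, the tokenizer is a crude regular expression, and the function name and window size are assumptions. The corpus interface itself works on part-of-speech tagged text and is not implemented this way as far as this page states.

```python
import re
from collections import Counter


def collocates(text: str, node: str, window: int = 4) -> Counter:
    """Count words found within `window` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        counts.update(left + right)
    return counts


# Invented sample text, for illustration only.
sample = (
    "She gave a faint smile. A faint smell of smoke hung in the air. "
    "He felt faint, and the faint light faded."
)

top = collocates(sample, "faint", window=2)
print(top.most_common(5))
```

On a real corpus one would normally filter by part of speech (for example, only nouns near faint) and compare the counts against each word's overall frequency, since very common words such as "a" and "the" dominate raw counts.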

The Corpus of Contemporary American English (COCA) is probably the best corpus of English (online or anywhere else) for looking at a wide range of ongoing changes in the language (see specific examples below). In order to look at ongoing changes, a corpus would ideally have the following characteristics:

Large (probably 100 million words or more)

Recent texts (ideally, it would be updated to within a year of the present time)

Balance between several genres (e.g. not just newspapers)

Roughly the same genre balance from year to year

An architecture that shows frequency over time and which allows one to compare frequencies between different periods

As we have discussed in a recent article in the journal Literary and Linguistic Computing, other corpora have some of these features, but none of them has all five. For example:

The British National Corpus is large and has many genres, but it is now 16 years out of date and will likely never be updated.

The Brown family of corpora (Brown, Frown, LOB, FLOB) are neither large nor recent (the last texts are from 1991).

The American National Corpus has none of the five characteristics listed above.

The Bank of English and the Oxford English Corpus are both large and fairly recent (up through 2005 / 2006), but their genre balance varies a great deal from year to year. As a result, there is no way to know if the changes they show are indicative of actual changes in the "real world", or whether they just reflect changes in the corpus itself. To give a simple example, a higher frequency of "fiction" words (pale, smile, sparkle, etc) in the early 2000s than in the late 1990s might simply reflect an increase in the total number of words in fiction texts during that time, but this would give no evidence that these words or phrases had actually increased in real world usage. (In addition, another serious problem with these two corpora is that neither is freely-available to the public.)

The Web (via Google) and text archives are not genre-balanced, and (most importantly) there is no way to measure change over time. In order to do so, one would have to know the frequency of an item in a given year and then know the overall size of all texts in that year (to get normalized frequency statistics). There are also real problems in terms of searches involving phrases, and not just individual words.

[Note that the corpora mentioned above might be great for other things, just not as a monitor corpus to look at ongoing changes.]

The Corpus of Contemporary American English was designed from the ground up as a "monitor corpus" (a corpus that allows us to look at changes over time), and it is the only corpus (online or elsewhere) that has all five of the characteristics listed above.

Let's briefly look at a few examples of data from the corpus relating to ongoing changes in English. Just click on any of the following links to run the queries. Note that in the comparisons below, the 2000s are on the left and the 1990s are on the right. Also, note that not every entry in the tables is meaningful, but they are a good starting-point.

Lexical change (words and phrases): What is the frequency of jonesing, morph, old-school, gift (as a verb), freak out, perfect storm, (think) outside the box, on the hook for, or [be] likely a|the over time? (Note that you can click on [See all sections] in the Chart Display to see the frequency by individual years as well). What verbs or nouns or adjectives or phrasal verbs with up are used a lot more in 2005-09 than in 1990-94? (Wait 5-6 seconds for the noun and adjective queries to run.)

Morphological change (word formation): Are words with the suffix -gate (indicating "scandal") and the suffix -friendly more frequent in the 1990s or the 2000s? What is the frequency of words ending in -ism (e.g. communism, terrorism) in each time period since the early 1990s, and which -ism words are more common in the 2000s than the 1990s (and vice versa)?

Syntactic change (grammar): Are the following increasing or decreasing (and when): end up V-ing, get passive (got hired), "quotative like" (he's like, I'm not going), so not ADJ (I'm so not interested in her)?

Semantic change (word meaning): Changes over time with collocates (nearby words) can often indicate changes in meaning or the usage of a given word. See if this is true for the following words: green, web, engine.

Discourse analysis ("what are we saying about X?") Compare the collocates for the given words in the 1990s and the 2000s: crisis, terror, gay, religion. Or look at the collocates of nuclear and crisis in each time period since the early 1990s. How does this data give insight into changes in American culture and society during this time?

The corpus is composed of more than 425 million words (details) in more than 175,000 texts (actually 176,389), including 20 million words each year from 1990-2011. For each year (and therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

Spoken: (90 million words [90,065,764]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc). [See notes on the naturalness and authenticity of the language from these transcripts.]

Fiction: (85 million words [84,965,507]) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.

Popular Magazines: (90 million words [90,292,046]) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.

Newspapers: (87 million words [86,670,481]) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.

Academic Journals: (86 million words [85,791,918]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.

YEAR SPOKEN FICTION MAGAZINE NEWSPAPER ACADEMIC TOTAL

1990 4,332,983 4,176,786 4,061,059 4,072,572 3,943,968 20,587,368

1991 4,275,641 4,152,690 4,170,022 4,075,636 4,011,142 20,685,131

1992 4,493,738 3,862,984 4,359,784 4,060,218 3,988,593 20,765,317

1993 4,449,330 3,936,880 4,318,256 4,117,294 4,109,914 20,931,674

1994 4,416,223 4,128,691 4,360,184 4,116,061 4,008,481 21,029,640

1995 4,506,463 3,925,121 4,355,396 4,086,909 3,978,437 20,852,326

1996 4,060,792 3,938,742 4,348,339 4,062,397 4,070,075 20,480,345

1997 3,874,976 3,750,256 4,330,117 4,114,733 4,378,426 20,448,508

1998 4,424,874 3,754,334 4,353,187 4,096,829 4,070,949 20,700,173

1999 4,417,997 4,130,984 4,353,229 4,079,926 3,983,704 20,965,840

2000 4,414,772 3,925,331 4,353,049 4,034,817 4,053,691 20,781,660

2001 3,987,514 3,869,790 4,262,503 4,066,589 3,924,911 20,111,307

2002 4,329,856 3,745,852 4,279,955 4,085,554 4,014,495 20,455,712

2003 4,404,978 4,094,865 4,295,543 4,022,457 4,007,927 20,825,770

2004 4,330,018 4,076,462 4,300,735 4,084,584 3,974,453 20,766,252

2005 4,396,030 4,075,210 4,328,642 4,089,168 3,890,318 20,779,368

2006 4,304,513 4,081,287 4,279,043 4,085,757 4,028,620 20,779,220

2007 3,882,586 4,028,998 4,185,161 3,975,474 4,267,452 20,339,671

2008 3,635,622 4,155,298 4,205,477 4,031,769 4,015,545 20,043,711

2009 3,969,587 4,143,814 3,855,815 3,971,607 4,144,064 20,084,887

2010 4,095,393 3,929,160 3,806,011 4,258,633 3,816,420 19,905,617

2011 1,061,878 1,081,972 1,130,539 1,081,497 1,110,333 5,466,219*

TOTAL 90,065,764 84,965,507 90,292,046 86,670,481 85,791,918 437,785,716

* The latest update for 2011 was in April 2011, and includes texts from Jan-Mar 2011.
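The claim that each year is split roughly evenly across the five genres can be checked directly from the table. The following Python sketch does this for three of the years, using word counts copied from the rows above. The names are illustrative assumptions, and only a sample of years is included to keep the example short.

```python
GENRES = ("spoken", "fiction", "magazine", "newspaper", "academic")

# Word counts per genre, copied from the 1990, 2000, and 2010 rows of the table.
words_by_year = {
    1990: (4_332_983, 4_176_786, 4_061_059, 4_072_572, 3_943_968),
    2000: (4_414_772, 3_925_331, 4_353_049, 4_034_817, 4_053_691),
    2010: (4_095_393, 3_929_160, 3_806_011, 4_258_633, 3_816_420),
}


def genre_shares(counts: tuple) -> dict:
    """Return each genre's share of the year's total word count."""
    total = sum(counts)
    return {genre: count / total for genre, count in zip(GENRES, counts)}


for year, counts in words_by_year.items():
    shares = genre_shares(counts)
    summary = ", ".join(f"{genre} {share:.1%}" for genre, share in shares.items())
    print(f"{year} (total {sum(counts):,}): {summary}")
```

For these three years every genre falls within two percentage points of the 20% target, and the row totals computed here match the TOTAL column.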

We wanted to have a fifth of the corpus (80+ million words) be from spoken American English. It would have been impossible, however, to create a corpus that size by tape recording lectures, conversations, etc. The only option was to use transcripts of conversations, which were already in electronic form. Therefore, we obtained transcripts of unscripted conversation on TV and radio programs like All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer (syndicated), etc.

There are several questions, of course, regarding the use of transcripts like these. Perhaps the three most important ones are:

1) Do they faithfully represent the actual conversations?

2) Is the conversation really unscripted?

3) How well does it represent "non-media" varieties of Spoken American English?

Regarding the first question, we feel confident that the transcripts do represent very well the actual spoken conversation. Two examples should suffice. First, look at this video from the Larry King show on CNN (it comes on after the ad) and then compare it to the transcript. Another example is from the "Talk of the Nation" show on NPR. Compare the audio recording (click on "Listen Now") with the transcript (as contained in our corpus). Our sense is that the transcripts do an excellent job transcribing the conversation, including interruptions, false starts, and so on.

The second question is whether the conversation is really unscripted. In the Larry King interview above, there are a handful of "formulaic / scripted" sentences like "Welcome to the program", "We'll now go to a commercial break", etc. But probably 97-98% or more of the conversation is unscripted. In the NPR transcript, there is a bit more scripted material -- a paragraph or two at the beginning of the show, and some announcements for upcoming commercial breaks. But about 95% or so is still unscripted. The question is whether you would rather have an 80+ million word spoken corpus with about 5% scripted material (but still leaving more than 75 million words of unscripted material), or a "completely pure" corpus that is so small (1-2 million words) that it is unusable for many types of research. We opted for the former.

In terms of the third issue (naturalness), there is one aspect of these texts that does make them somewhat unlike completely natural conversation. That is of course the fact that the people knew that they were on a national TV or radio program, and they therefore probably altered their speech accordingly -- such as relatively little profanity and perhaps avoiding highly stigmatized words and phrases like "ain't got none". In terms of overall word choice and "natural conversation" (false starts, interruptions, and so on), though, it does seem to represent "off the air" conversation quite nicely. But no spoken corpus (even those created by linguists with tape recorders in the early 1990s) will be 100% authentic for real conversation -- as long as people know that they're being recorded.

Finally, it is possible to do some quick searches that show the overwhelming "spoken" nature of these texts. The following phrases are ones that we would expect to occur much more frequently in spoken (American) English than in other genres. Click on them to see how common they are in spoken and the other genres: "and I'm like"; "so not ADJ"; "I guess that"; ". Well,"; ". Sure."; "Do you think"; ", you know,".

One other note:

In the transcripts there is text indicating who the speaker is, or codes referring to "voice-overs" or other notations made by the transcriber. For example:

SUMMIT It should be a C-note. Mr. CARY ANDERSON That's it. Mr. ANDERSON Oh, very good. See, you didn't have to get nervous, Mr. Cronick. You were really very good at it. SUMMIT All right. Mr. ANDERSON You -- you were coming fast and furious here. It was great. I could sleep. (Laughing) I appreciate it.

We were able to "separate out" most of these (the speaker labels and transcriber notes, which are shown in gray in the original display), although there is a bit more of this in the transcripts from 2008. Where they have been separated out, they are not included in the overall word count, nor can they be searched for (this is on purpose, since they're not "spoken"), but they do appear in the Keyword in Context displays.


Although the majority of the texts are copyrighted, we are using them in this corpus under what we claim is "Fair Use". The following are the four criteria used to determine whether materials fall under the provisions of the Fair Use Law:

Criteria

What favors Fair Use status

The Corpus of Contemporary American English

The amount and substantiality of the portion taken

Small portions of the original text, rather than full-text access

Under no circumstances whatsoever do end users have access to entire texts (e.g. newspaper, magazine, or journal articles, or short stories). All access is via the web interface, and the vast majority of what users see are simply frequency charts showing the frequency of words or phrases in different parts of the corpus. Access to small portions of the original text is more of an "afterthought", rather than the central feature of the interface.

Access to actual portions of the original text is limited to very short "Keyword in Context" displays, where users see just a handful of words to the left and the right of the word(s) searched for. In addition, all access is logged, and users can only perform a limited number of searches per day. As a result, it would be difficult for end users to re-create even one paragraph from the original text, and it would be virtually impossible to re-create an entire page of text, much less the entire article.

This "snippet defense" (which relies on limited access to the original text via small snippets from the web interface) is the same one used by Google Books for its use of millions of copyrighted materials. In addition, we have consulted two lawyers who specialize in Internet copyright law (names available upon request). They have both stated that because of the limited access we give to end users, as well as our status with regard to the other three factors shown here, we are clearly in accord with the provisions of the Fair Use statute.

The purpose and character of the use

Academic, non-commercial

Our use of the texts is strictly for academic research, and is purely non-commercial.

The nature of the copyrighted work

Non-creative works

There are some creative works (e.g. short stories and small sections of novels) in the corpus, but more than 80% of the corpus is composed of transcripts of TV shows, and articles from newspapers, magazines, and academic journals.

The effect of the use upon the potential market

Little or no effect on the copyright holder

Because of the very limited access via our web interface (see the first item above), it is extremely unlikely that anyone would use this corpus as a "substitute" for other access to the original texts. Other sources make these texts available as "complete articles", which are meant to be read in their entirety. That is completely impossible with our interface.

Access to the texts via our interface, as compared to access via other sources, serves two completely different audiences. Our interface is designed for linguists and language learners who want to see the frequency of words, phrases, synonyms, etc., and it is completely inadequate for anyone who wishes to read the entire text of an article. As a result, there is very little or no "competition" between our service and that provided by others, and therefore virtually no market impact.

In addition to the copyright issues, there are also licensing issues, in terms of the sources from which we obtained some of the texts in the corpus. We were very careful, however, to retrieve the materials over a very long period of time (four years -- 2005-2008), so as to not violate licensing agreements on how much material could be retrieved in a particular timeframe.
