
Key terms of Corpus Linguistics


 

• type
• token
• hapax
• lemma
• word-form
• tag
• parse
• annotate

TOKEN, TYPE, HAPAX

• For example, if there are 27 words in a passage, that is, 27 sequences of letters separated by spaces or punctuation, the passage contains 27 tokens.

• Counting each repeated item only once, so that only different words are counted, gives the number of types (23 in this passage).

• The words that occur only once are called hapax legomena, or hapaxes.
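
The three counts above can be illustrated with a short sketch. It assumes a deliberately simple tokenizer (lower-cased runs of letters and apostrophes); the function names and the sample sentence are invented for illustration and are not part of the lecture material.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_counts(text):
    """Count tokens and types, and list the hapaxes (types occurring once)."""
    tokens = tokenize(text)
    freq = Counter(tokens)  # maps each type to its number of tokens
    hapaxes = sorted(word for word, n in freq.items() if n == 1)
    return {"tokens": len(tokens), "types": len(freq), "hapaxes": hapaxes}

# Invented sample sentence, used only to exercise the functions.
sample = "The cat sat on the mat and the dog sat by the door."
counts = corpus_counts(sample)
print(counts)  # 13 tokens, 9 types, 7 hapaxes
```

A real corpus tool would need a more careful definition of "word" (hyphens, numbers, clitics such as n't), which is why token counts can differ between programs.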

LEMMA & WORD-FORM

• Unit and units are two word-forms belonging to the same lemma.

• Eat, eating, eats, ate are word-forms of the lemma EAT.
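
The relation between word-forms and lemmas can be sketched as a lookup. The table `LEMMA_OF` below is a hypothetical, hand-made dictionary covering only the two examples above; real lemmatizers rely on large lexicons or morphological rules, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical hand-made lookup table: word-form -> lemma.
LEMMA_OF = {
    "eat": "EAT", "eats": "EAT", "eating": "EAT", "ate": "EAT",
    "unit": "UNIT", "units": "UNIT",
}

def group_by_lemma(tokens):
    """Group the word-forms found in `tokens` under their lemma."""
    groups = defaultdict(set)
    for token in tokens:
        form = token.lower()
        # Forms missing from the table are treated as their own lemma.
        groups[LEMMA_OF.get(form, form.upper())].add(form)
    return {lemma: sorted(forms) for lemma, forms in groups.items()}

# Invented sample sentence.
tokens = "She ate two units and eats one unit while eating".split()
groups = group_by_lemma(tokens)
print(groups["EAT"], groups["UNIT"])
```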

 

 

TAG, TAGGING

• The term tagging normally refers to the addition of a code to each word in a corpus, indicating its part of speech.

• Tags are useful components of word searches, e.g. WORK (noun) vs WORK (verb).

Rule-based taggers rely strictly on linguistic categorization and the hierarchy of language levels.
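
A tag-aware search of the WORK (noun) vs WORK (verb) kind can be sketched as follows. The sentence is invented and tagged by hand, and the tag labels (`NOUN`, `VERB`, and so on) are illustrative assumptions rather than the tagset of any particular corpus.

```python
# Invented sentence, tagged by hand as (word, tag) pairs.
tagged = [
    ("They", "PRON"), ("work", "VERB"), ("hard", "ADV"), ("because", "CONJ"),
    ("the", "DET"), ("work", "NOUN"), ("is", "VERB"), ("interesting", "ADJ"),
]

def find_tagged(tagged_tokens, word, tag):
    """Return the positions where `word` carries the part-of-speech `tag`."""
    return [
        i for i, (w, t) in enumerate(tagged_tokens)
        if w.lower() == word.lower() and t == tag
    ]

noun_hits = find_tagged(tagged, "work", "NOUN")
verb_hits = find_tagged(tagged, "work", "VERB")
print(noun_hits, verb_hits)
```

Without the tags, a plain search for work would return both positions and leave the separation to the reader.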

 

PARSERS, PARSING

 

 

ANNOTATION

• A superordinate term for tagging and parsing is annotation.

• Annotation also covers other kinds of added information, e.g. annotation of a spoken corpus for intonation, or annotation for anaphora, which identifies cohesion.

 

Almost ideal annotation is found in the ICE-Ireland corpus of English. The annotation includes:

• detailed documentation of the coding of the text files, which sample both written and spoken language: 1,079,775 words in total (spoken: 652,966 words; written: 426,809 words);

• a detailed description of spelling, transcription and diacritics;

• classification of texts in line with the speakers' biodata, namely:

  • zone (place of residence): N = Northern Ireland, S = Republic of Ireland, M = mixed, X = non-corpus speaker;

  • date: all texts fall into one of three categories according to the date of first publication or recording: A = 1990-1994, B = 1995-2001, C = 2002-2005. These periods show that the texts reflect current English in Ireland;

  • age: all speakers fall into one of six groups: 0 = 0-18, 1 = 19-25, 2 = 26-33, 3 = 34-41, 4 = 42-49, 5 = 50+. Naturally, the data for the two extreme groups are less reliable than for the others, because each spans very different speakers: younger teenagers and 16- to 18-year-olds differ considerably in language performance, as do speakers in their fifties and speakers in their seventies;

  • sex, education, occupation, religion, knowledge of foreign languages.

In general, each spoken text comes with information on 15 criteria for its speakers. The Speaker Biodata cover 945 speakers in all and are presented in a comprehensive table.
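
The zone, date and age codes listed above can be written down as small lookup structures. This is only a sketch of the coding scheme as described in these notes: the names `ZONES`, `DATE_BANDS`, `age_group` and `date_band` are hypothetical, and nothing here reproduces the actual file format of the corpus.

```python
# Code tables transcribed from the description above.
ZONES = {
    "N": "Northern Ireland",
    "S": "Republic of Ireland",
    "M": "mixed",
    "X": "non-corpus speaker",
}
DATE_BANDS = {"A": (1990, 1994), "B": (1995, 2001), "C": (2002, 2005)}
AGE_LOWER_BOUNDS = [0, 19, 26, 34, 42, 50]  # lower bound of groups 0..5

def age_group(age):
    """Return the age-group code (0-5) for a speaker's age in years."""
    if age < 0:
        raise ValueError("age must be non-negative")
    group = 0
    for code, lower in enumerate(AGE_LOWER_BOUNDS):
        if age >= lower:
            group = code
    return group

def date_band(year):
    """Return the date code (A, B or C), or None outside 1990-2005."""
    for code, (start, end) in DATE_BANDS.items():
        if start <= year <= end:
            return code
    return None

print(age_group(17), age_group(72), date_band(1996), ZONES["N"])
```

The sketch also makes the weakness of the extreme groups visible: ages 50 and 72 receive the same code.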

 

Precise and scrupulous punctuation in ICE-Ireland gives information about prosodic characteristics of spoken texts, for example:

 

<S1A-016$A> <#> She 's <.> ex </.> <#> Yes <#> Yes very <,> very very <#> She 's <,> yes <,> very scary <#> Lovely lady <,> <{> <[> lovely lady </[>

 

Uh <,> and I think that <,> I I I think that that the point is that that things have to be

 

<S1B-024$C> <#> Well I I I I think yes merit is <.> i </.> is the issue <#> We want

 

<S1B-025$Q> <#> To me I think animals have more brains than humans <,> <{> <[> some of them </[>

<S1B-025$A> <#> <[> If they have </[> </{> if they have more brains and and they have a a personality uh Vivienne why why do you we as a society allow uhm experimentation on them

 

<S1B-026$A> <#> <[> People </[> </{> have talked about a a dependency culture in some quarters <#> Do you think that exists

<S1B-026$B> <#> I <.> do </.> I think people make rational decisions <#> I 'd be

 

<S1A-043$A> <#> So aye they were pushing Mary up the hill and Mary was really

 

<S1A-043$A> <#> Wee small <{1> <[1> fellow </[1> <#> Aye <&> laughter </&>

 

 

<S1B-070 AG Cross exam>

<S1B-070$A> <#> Uhm <,> no I I I was anxious to to discover how uh that uh conclusion was reached that this was the first case and with respect I don't think you 've answered it <#> Maybe it 's a more appropriate one to put it to the official concerned <#> Was it simply by by his uh uh memory <#> I mean <,> <{> <[> can I </[> can I just say <,> on your views on Mr Fitzsimons and I think this is relevant <,> uh uh regarding the running of your office <,> uh an office which you say you were you were in and I accept what you say <,> from eight a-m onwards every day but uh
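
Lines of this kind can be taken apart mechanically. The sketch below rests on assumptions that these notes do not state: that the header has the shape `<text-id$speaker>`, that `<#>` opens a new text unit, and that `<,>` marks a short pause (this is how the ICE markup conventions are usually described). The function name and the returned dictionary layout are hypothetical.

```python
import re

# Assumed header shape: <S1A-016$A> = text S1A-016, speaker A.
HEADER = re.compile(r"^<(?P<text>[A-Z0-9]+-\d+)\$(?P<speaker>[A-Z])>\s*")

def parse_line(line):
    """Split one transcript line into text ID, speaker, units and pause count."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("no speaker header found")
    body = line[match.end():]
    # Assumption: <#> opens a new text unit; empty pieces are dropped.
    units = [unit.strip() for unit in body.split("<#>") if unit.strip()]
    return {
        "text": match.group("text"),
        "speaker": match.group("speaker"),
        "units": units,
        "short_pauses": body.count("<,>"),  # assumption: <,> = short pause
    }

line = ("<S1A-016$A> <#> She 's <.> ex </.> <#> Yes <#> Yes very <,> very very "
        "<#> She 's <,> yes <,> very scary <#> Lovely lady <,> <{> <[> lovely lady </[>")
parsed = parse_line(line)
print(parsed["text"], parsed["speaker"], len(parsed["units"]), parsed["short_pauses"])
```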

 

CONCORDANCE. CONCORDANCING PROGRAM

 

Many linguists access and use corpora through a concordancing program. This is a ‘word-based’ method of investigating corpora. A concordancer is a program that searches a corpus for a selected word or phrase. Producing concordance lines is probably the most basic way of processing corpus information, and most corpus users rely heavily on concordances and their interpretation.

 

• Concordance lines bring together many instances of use of a word or phrase, allowing the user to observe regularities.
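
What a concordancer does can be shown with a minimal key-word-in-context (KWIC) sketch. It searches a plain string rather than a real corpus; the function name, the `width` parameter and the sample text are assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return one KWIC line per occurrence of `keyword` in `text`."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the keyword lines up in one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text.
text = "I have not finished yet. Has she arrived yet? As yet, nobody has replied."
kwic = concordance(text, "yet")
for kwic_line in kwic:
    print(kwic_line)
```

Aligning the keyword in a single column is what lets the eye pick out recurring patterns to its left and right.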

 

 

Prof. Aston claims that methods of conversation analysis are being used in Corpus Linguistics. Barth uses the corpus as a convenient way of storing texts, with the additional advantage that selected features can be tagged. Mostly, however, concordance lines are used.

 

NB!

• Concordance lines present information; they do not interpret it.

• Interpretation requires the insight and intuition of the observer.

 

 

NB!

Grammars of English, for example the Cambridge Grammar of English, include chapters on data gathering and information searching.

 

Information on concordance.

‘Concordances help researchers see how words are actually used in context. Words or phrases which researchers are interested in are displayed in a vertical arrangement on the computer screen along with their surrounding co-text: we see what came just before the word and what came just after. For example, these sample lines from a concordance for the adverb yet in the spoken corpus show us that a negative environment is very common, but not in questions (negative items and question marks in bold), and that as yet is a recurrent pattern. The A–Z entry for yet in this book, and much of our grammatical description, is based on this type of observation.

The concordance also gives us a code on the right of the screen (in green here) which tells us what type of conversation each line occurs in, and leads us to the corpus database where we can verify who the speakers are, what age, gender, and social profile they have, how many people were involved in the conversation, where it took place, etc. We are therefore able to say something is in common.’

 

(Cambridge Grammar of English, 2006)

 

FREQUENCY

Another important tool for searching and analysing a corpus is frequency.

• The words in a corpus can be arranged in order of their frequency in that corpus.
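
Arranging words by frequency can be sketched with Python's `collections.Counter`. The tokenizer is the same simplifying assumption as before (lower-cased runs of letters and apostrophes), and the function name and sample text are invented.

```python
import re
from collections import Counter

def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

# Invented sample text.
sample_text = "the cat and the dog and the bird"
top_two = frequency_list(sample_text, top=2)
print(top_two)  # [('the', 3), ('and', 2)]
```

On a real corpus the top of such a list is dominated by grammatical words.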

Word Frequency Comparison Across Corpora

 

1. THE  2. OF  3. TO  4. AND  5. A  6. IN  7. THAT  8. S  9. IS  10. IT
11. FOR  12. I  13. WAS  14. ON  15. HE  16. WITH  17. AS  18. YOU  19. BE  20. AT
21. BY  22. BUT  23. HAVE  24. ARE  25. HIS  26. FROM  27. THEY  28. THIS  29. NOT  30. HAD
31. HAS  32. AN  33. WE  34. N’T  35. OR  36. SAID  37. ONE  38. THERE  39. WILL  40. THEIR
41. WHICH  42. SHE  43. WERE  44. ALL  45. BEEN  46. WHO  47. HER  48. WOULD  49. UP  50. IF

 

Information on frequency in Cambridge Grammar of English, 2006

 

‘The corpus was analysed in a variety of ways in the preparation of this book. One way was to compile frequency lists. A frequency list simply ranks words, phrases and grammatical phenomena (e.g. how many words end in -ness or -ity, or how many verb phrases consist of have + a verb ending in -en) in a list. In this way, we are able to see not only which items are most and least frequent, but also how they are distributed across speech and writing and across different registers (e.g. newspapers, academic lectures, conversations at home). For example, the list of the twenty most frequent word-forms in the CIC for spoken and written texts (based on five-million-word samples of each) are different.

 

In the spoken list, I and you rise to the top, indicating the high interactivity of face-to-face conversation. Know is at number 14, indicative of the high frequency of the discourse marker you know (106b), and mm and er reflect the frequency with which listeners vocalise their acknowledgement of what the speaker is saying, or whereby speakers fill silences while planning their speech in real time or while hesitating. It’s and yeah reflect the informality of much of the talk in the CANCODE spoken corpus.’

(Cambridge Grammar of English, 2006)

 

The twenty most frequent word-forms in spoken and written texts

 

Rank   Spoken   Written
 1     the      the
 2     I        to
 3     and      and
 4     you      of
 5     it       a
 6     to       in
 7     a        was
 8     yeah     it
 9     that     I
10     of       he
11     in       that
12     was      she
13     it’s     for
14     know     on
15     is       her
16     mm       you
17     er       is
18     but      with
19     so       his
20     they     had

 

(Cambridge Grammar of English, 2006)

Deciding what to include

In deciding on priorities with regard to the description of items and patterns, both quantitative and qualitative approaches are important. On the quantitative side, the corpus evidence can often show striking differences in distribution of items between speaking and writing. For example, the forms no one and nobody are, on the face of it, synonymous, yet their distribution across five million words each of spoken and written data is very different, with nobody greatly preferred in the spoken corpus, as shown below.

 

 

The interpretation of such statistics then depends on a more qualitative interpretation of the data, observing how nobody tends to correlate with the more informal end of the spectrum. A similar pattern of usage, in this case more clearly related to formality, can be seen for who and whom, where whom is shown to be relatively rare in conversation, only occurring in more formal contexts.
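
A comparison of this kind rests on relative frequency, so that corpora of different sizes can be set side by side. The sketch below normalises raw counts to occurrences per million tokens. The two miniature "corpora" are invented sentences and carry no empirical weight; the function name is hypothetical.

```python
import re

def per_million(text, phrase):
    """Occurrences of `phrase` (one or more words) per million tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    target = phrase.lower().split()
    n = len(target)
    hits = sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
    )
    return hits / len(tokens) * 1_000_000

# Invented toy data standing in for a spoken and a written sample.
spoken = "nobody told me and nobody asked"
written = "no one was informed and no one objected"

for phrase in ("nobody", "no one"):
    print(phrase, per_million(spoken, phrase), per_million(written, phrase))
```

Normalising matters whenever the samples differ in size. With two five-million-word samples, as in the passage above, raw counts are already directly comparable.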


 

‘CENTRAL’, ‘TYPICAL’, & ‘PROTOTYPICAL’

Studies of frequency have prompted further observations about what counts as ‘central’, ‘typical’ and ‘prototypical’. Two categories need to be kept apart here: centrality and typicality.

• ‘Typical’ can be used to describe the most frequent meanings, collocates or phraseology of an individual word or phrase.

• The concept of ‘centrality’ can be applied to categories of things rather than to individual words.

• Although speakers of a language may have intuitions about typicality, these intuitions do not necessarily accord with the evidence of frequency. Barlow (1996) and Shortall (1999) use the term prototypical for a usage that is commonly felt to be typical but is not necessarily frequent.

 

The following conclusions exemplify these observations within CL:

 

• The verbs used most often with reflexives are SEE, IMAGINE, VISUALISE, CONSIDER and ASK, rather than verbs of physical action as in he hit himself. Each meaning is associated with a particular pattern.

• Unemployment is used much more often than unemployed, since in Great Britain the discourse focuses mainly on abstract demographic trends rather than on categories of people.

 

With the help of a concordancer and the frequency criterion, linguists can spot new trends and emerging collocations, which can indicate the growth of new concepts and changes in the meaning of words. For instance:

• collocations like single parent families and unmarried mothers signal important changes in social structures;

• young speakers noticeably favour okay, hi, hey, wow and adjectives such as weird, massive, horrible, sick, funny;

• Swedish speakers of English use a few informal connectors such as but with great frequency, but use others, such as however, though and yet, less frequently than native speakers do.
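
Spotting collocations by frequency can be sketched by counting the words that occur within a small window around a node word. The window size, function name and sample text are assumptions. The sketch also ignores sentence boundaries, which a real tool would normally respect.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each `node` token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Left context, then right context; the node itself is skipped.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

# Invented sample text.
sample_text = ("Single parent families need support. "
               "Many single parent households exist. A parent called.")
counts = collocates(sample_text, "parent")
print(counts["single"], counts["families"])
```

Raw co-occurrence counts favour frequent grammatical words, so corpus tools usually add a statistical association measure on top of counts like these.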

 


Date added: 2015-10-23

