The beta-version of the corpus contains machine-readable, prosodically labelled speech data from five urban varieties:
o - Belfast English
o - British Punjabi English spoken in Bradford
o - Cambridge English
o - Leeds English
o - Newcastle English
o The data was collected in urban secondary schools, and the speakers are 16 years old.
o The authors recorded six male and six female speakers from each variety.
o The sentence directory contains sentences of five different syntactic structures.
o There is also the Cinderella directory, which includes a passage of a famous fairy tale retold by the speakers of different dialects.
o By means of modern software one can analyse the sound material and see the difference in pronunciation of sentences in the dialects.
Phonetics in Britain
RP
Estuary English
Dialects & Accents in Britain
London School of Phonetics
o Daniel Jones and A. C. Gimson established the principles of phonetic research in Great Britain
General Phonetics
o Voice quality across languages: are different languages spoken with typical voice qualities?
English Phonetics
o Redefining RP – what are the characteristics of contemporary standard English pronunciation in England?
Received Pronunciation
RP is a minority accent
o RP is changing rapidly
o Mainstream RP
o Adoptive RP
o Near RP
o Conservative RP – older generation
o General RP – most commonly used
o Advanced RP – mainly young people
o U-RP – upper-crust RP
RP VS BBC
BBC English or Oxford English is a kind of educated speech which is more upper-middle than upper:
It lacks haw-haw tones and vowel-swallowing
It is more intelligible
o Connected speech phenomena – how is the pronunciation of English words modified in authentic running speech?
English Phonetics
o Stressing & accentuation – irregularity and idiomaticity
o Intonation – the new high-rise nucleus (‘upspeak’), also known as the high-rise terminal (HRT)
English Phonetics
o L2 & interlanguage – what are the phonetic characteristics of learners of English? How are they to be explained?
o Pronunciation preferences – how do people’s preferences change over time?
Dialect levelling in Britain
1990-2000
o BrE in the 20th century is characterized by dialect levelling & standardization
o 1st stage: affected the traditional rural dialects, once spoken by the majority of the population but, by the beginning of the 20th century, by only 50%:
o There are fewer differences between ways of speaking in different parts of the country
o New dialects emerged that differed from Standard English in pronunciation and grammar. Families abandoned rural dialects in favour of a more urban type of speech, that of the local city
o More urban ways of speaking were labelled modern dialects or mainstream dialects by Peter Trudgill (1998)
o They are more like standard English in phonology, grammar & vocabulary
o 2nd stage: affected the urbanised varieties of English themselves. These dialects are subject to further levelling. It is impossible to say where a person comes from. The differences are subtle, purely phonetic ones
o Factors that had an impact on dialect levelling:
o Migration
o Mass media
o Modern dialect studies moved from the country to the city → urban dialectology (city talk, urban observations)
o The mechanism of standardization lies in a network of social contacts.
o People accommodate to the speech of those with whom they communicate at work, usually people of higher social status (upward convergence). Downward convergence is rare
Social & regional variation of English – triangle model by Daniel Jones
[Figure: triangle diagram. Vertical axis: social variation, with the standard language at the apex. Horizontal axis: regional variation.]
o Regional variation decreases & minimizes at the top of the triangle where we have Standard English
o The triangle model is accurate as it involves continua: a social one, from high-status to low-status accents, and a geographical one, from one end of the country to the other
o Regional accents have become more acceptable nowadays
Social differences in pronunciation in Britain
o Lower ranks drop consonants (‘’alf past ten’, ‘’ankercheef’)
o Upper classes drop vowels (‘hpstn’, ‘hnkrchf’)
o Upper classes at least articulate consonants correctly
o Lower classes pronounce ‘th’ as ‘f’ or ‘v’ (teeth → teef, that → vat, Worthing → Worving), and final ‘g’ may become ‘k’ (something → somefink, nothing → nuffink)
o Mispronunciation of words is taken as a signal of low class, indicating a less educated speaker
Estuary English (EE)
Ø Estuary English - grassroots strike back???
Ø First described by David Rosewarne in 1984 & much criticized
Ø EE is a variety of modified regional speech, a mixture of non-regional & local south-eastern English pronunciation and intonation.
Ø The heartland of EE lies by the banks of the Thames & its estuary, but it is influential in the south-east of England
Ø Now it is heard in the House of Commons and is sometimes used by members of the Lords, in the City, in business circles and in the bohemian world
Ø RP speakers make up 3% of the population, though in the 1970s it seemed that RP speakers were everywhere
Ø EE is becoming the most popular pronunciation model
Ø Within a continuum with RP and London speech at both ends EE is in the middle:
Ø London speech → → →EE→ → →RP
Ø Very often EE is described as cockneyfied, esturian English, localized English
Ø EE may well influence English pronunciation in the future.
Ø Grammar & vocabulary are as in Standard English, the difference is in pronunciation
Ø Vocalization of preconsonantal and final /l/, with associated vowel neutralization: /miwk bottoo/ – milk bottle
Ø H-dropping (‘’and on ’eart’)
Ø TH-fronting (‘I fink’)
Ø Realization of /r/
Ø Yod coalescence
Ø Change of the s in st- (station, estuary, Christian etc.) and str- (street, instruction, industrial etc.) into sh, as in she.
Ø Tone & pitch features – prominence given to prepositions and auxiliary verbs which are not normally stressed in RP
Ø Rise-fall intonation
Ø More frequent use of tag-questions
§ A corpus is a collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings collected for linguistic study. More recently, the word has been reserved for collections of texts stored and accessed electronically. A corpus is stored in such a way that it can be studied non-linearly, and both quantitatively and qualitatively.
§ Corpora are large collections of computer-readable texts which may be created and used for different purposes.
§ Corpus Linguistics (CL) is the study of language through the processing, use and analysis of written or spoken machine-readable corpora. CL is a relatively new term for a methodology based on computer science and on examples of “real life” language use.
Corpus Linguistics has recently been gaining popularity as a source of objective data for many branches of linguistics. Linguists increasingly report that they use corpora as the material for their research. Sometimes they use corpora in addition to other data; sometimes their research is based on a corpus alone. Besides individual research and projects carried out by small teams of linguists under the supervision of well-known scholars, there are results of large-scale, long-term research based essentially on corpora, for example modern dictionaries of English. The authors and editors of these fundamental works readily acknowledge, and are justly proud of, the fact that their research was based on authentic language data available via corpora.
For instance, a dictionary published in 1995 states: “The Cambridge International Dictionary of English (CIDE) is the result of several years’ detailed language research and analysis. It is aimed at learners and users of English from intermediate level upwards. Built around the Cambridge Language Survey corpus of 100 million words, both written and spoken, CIDE gives today unrivalled access to the English they need.” The Longman Exams Dictionary states that it relies on the Longman Corpus Network. The Longman Dictionary of English and Culture mentions, in addition, the British National Corpus. The Oxford Dictionary of Collocations states that the British National Corpus is a collaborative project involving the Oxford University Press, Longman, Chambers, the Universities of Oxford and Lancaster and the British Library.
A grammar published in 2009 states: “Grammar for CAE and Proficiency is informed by the Cambridge International Corpus and The Cambridge Learner Corpus to ensure that grammar is presented in genuine contexts and the book covers the errors students really make.” “The Cambridge International Corpus (CIC) is a collection of over 1 billion words of real spoken and written English. The texts are stored in a database that can be searched to see how English is used. The CIC also includes the Cambridge Learner Corpus, a unique collection of over 95,000 exam papers from ESOL. It shows real mistakes students make and highlights the parts of English which cause problems for students.”
New reference grammars of English, such as the Cambridge Grammar of English (CGE, 2006), go further and give detailed information about the data on which their recommendations are based.
“CGE is a grammar book that is informed by the corpus. The word ‘informed’ is used advisedly because we are conscious that it is no simple matter to import real data into a reference book in the belief that authentic language is always the right language for the purposes of learning the language. In places, this means that corpus examples which contain cultural references of the kind that are so common in everyday language use are either not selected or, while ensuring that the key grammatical patterns are preserved, are slightly modified so that they do not cause undue difficulties of interpretation. It is our strong view that language corpora, such as the Cambridge International Corpus, can afford considerable benefits for language teaching but the pedagogic process should be informed by the corpus, not driven or controlled by it.”
“Many of the examples in this book are taken from a multi-million-word corpus of spoken and written English called the Cambridge International Corpus (CIC). The corpus is international in that it draws on different national varieties of English (e.g. Irish, American). This corpus has been put together over many years and is composed of real texts taken from everyday written and spoken English. At the time of writing, the corpus contained over 700 million words of English. The CIC corpus contains a wide variety of different texts with examples drawn from contexts as varied as: newspapers, popular journalism, advertising, letters, literary texts, debates and discussions, service encounters, university tutorials, formal speeches, friends talking in restaurants, families talking at home.
One important feature of CIC is the special corpus of spoken English – the CANCODE corpus. CANCODE stands for Cambridge and Nottingham Corpus of Discourse in English, a unique collection of five million words of naturally-occurring, mainly British (with some Irish), spoken English, recorded in everyday situations. The CANCODE corpus has been collected throughout the past ten years in a project involving Cambridge University Press and the School of English Studies at the University of Nottingham, UK. In CGE dialogues and spoken examples are laid out as they actually occur in the transcripts of the CANCODE recordings, with occasional very minor editing of items which might otherwise distract from the grammar point being illustrated.
The CANCODE corpus is a finely-grained corpus. The CANCODE research team have not simply amassed examples of people speaking; they have tried to obtain examples from a range of sociolinguistic contexts and genres of talk. There is considerable advantage in being able to demonstrate statistical evidence over many millions of words and broad general contexts.
“A carefully constructed and balanced corpus can help to differentiate between different choices relative to how much knowledge speakers assume, what kind of relationship they have or want to have, whether they are at a dinner party, in a classroom, doing a physical task, in a service transaction in a shop, or telling a story (for example, our corpus tells us that ellipsis is not common in narratives, where the aim is often to create rather than to assume a shared world). By balancing these spoken genres against written ones, our corpus can also show that particular forms of ellipsis are widespread in certain types of journalism, in magazine articles, public signs and notices, personal notes and letters and in certain kinds of literary text. In descriptions of use, the most typical and frequent uses of such forms are described in relation to their different functions and in relation to the particular contexts in which they are most frequently deployed.”
Learner corpus
“We also had access during the writing of this book to a large learner corpus cultures, coded for error and inappropriate use. This, along with our own language-teaching experience and that of our reference panel, has enabled us to give warnings of common areas of potential error where appropriate. These error warnings are signaled by the symbol *.”
History
The first steps towards present-day CL were made in the 1960s, when the Brown University Corpus of American English (Brown Corpus) came into being. It comprised 500 written text samples of about 2,000 words each (roughly one million words in total), drawn from works published in the USA in 1961. Initially the corpus was the result of manual data capture. The current progress of Corpus Linguistics became possible thanks to advances in information technology, computer science and specialized software. The texts in any corpus are machine-readable, which is a precondition for automatic processing and automatic transmission.
Methods
Gradually Corpus Linguistics, like any discipline, devised methods for researching language data, which are known collectively as the corpus method. The methodology of CL rests on two facts:
1) Corpus Linguistics is computerized research
2) the texts within any corpus are computer-readable.
The popularity of Corpus Linguistics is accounted for by the general benefits of the corpus method.
General benefits of the Corpus method:
§ large corpora of computerized authentic rather than invented language
§ computers can process enormous amounts of data
§ retrieval of the data is objective, not intuitive, which implies that the search can be replicated
§ specific corpora selected from particular types of texts make it possible to compare the use and frequency of certain features in different text types, provided the corpora are large enough
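The last benefit, comparing the frequency of a feature across text types, can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the sources cited here: the two "corpora" are invented one-sentence samples, the tokenizer is deliberately crude, and the names `tokenize` and `per_million` are hypothetical. Because corpora differ in size, raw counts are normalised to occurrences per million tokens before they are compared.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(text, feature):
    """Return the frequency of `feature` normalised to hits per million tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[feature] / len(tokens) * 1_000_000


# Two invented mini-samples standing in for corpora of different text types.
spoken = "well I think it is fine well you know I think so"
written = "the committee considers that the proposal is acceptable"

spoken_rate = per_million(spoken, "well")
written_rate = per_million(written, "well")
```

With samples this small the figures mean nothing statistically, which is why the comparison is only reliable when the corpora are large enough.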
Corpora can be classified as general, specialized, parallel, learner and historical (diachronic) corpora.
the Trans-European Language Resource Infrastructure research project) contains corpora of languages from newly independent European countries, such as Estonian, Latvian, Lithuanian, Slovenian, Ukrainian and Uzbek, as well as English, French & Spanish
Parallel corpora can be used for data-driven learning (DDL). One of the latest innovations is reciprocal learning with parallel corpora in which two or more languages are involved, especially when learners share a common first language. International students learn by themselves from authentic examples and discover previously unnoticed patterns.
(texts from 700 to 1700; comprises 1.5 million words)
NB!
n However, the proportion of text types remains constant, so that each year is directly comparable with every other.
§ Corpora of English available nowadays
n online corpus: any corpus of primary linguistic data which can be accessed online through a dedicated user interface. The data for such a corpus can consist of material from the Internet itself, for example blogs, personal pages, online media such as newspapers, journals, periodicals, etc., or it can consist of material from elsewhere which is then placed online for user interrogation via an appropriate interface. An example of the former is the Corpus of Global Web-Based English and of the latter, the British National Corpus. On corpora and the Internet, see Hundt, Biewer & Nesselhauf (eds 2007 [1.1.5]).
Corpora of other languages
The BNC is one of the most popular corpora among linguists. Initial reports about the BNC appeared in 1991-1992. The source is the Bank of English. The leading partner is Oxford University Press. Text selection, data capture and transcription of spoken texts were carried out by Longman; storage, encoding and distribution by Oxford University Press.
§ fiction texts since 1960,
§ informative texts since 1975
§ imaginative texts
§ religious texts
§ newspapers
NB!
The ICE-Ireland Corpus, for instance, is known for its balanced approach to the material, namely a proportionate representation of written and spoken language, as well as genre variation:
Spoken (300): Dialogue (180), comprising Private (100) and Public (80); Monologue (120)
Written (200): Non-printed (50); Printed (150), comprising Informational Writing (100), Instructional Writing (20), Persuasive Writing (10) and Creative Writing (20).
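The breakdown above can be checked for internal consistency with a short Python sketch. It assumes that the bracketed figures are counts of texts (the notes do not state the unit); the nested dictionary and the helper `total` are illustrative names, not part of any corpus software.

```python
# Counts per category, copied from the breakdown above.
# Assumption: the bracketed figures are numbers of texts.
ice_ireland = {
    "Spoken": {
        "Dialogue": {"Private": 100, "Public": 80},
        "Monologue": 120,
    },
    "Written": {
        "Non-printed": 50,
        "Printed": {
            "Informational Writing": 100,
            "Instructional Writing": 20,
            "Persuasive Writing": 10,
            "Creative Writing": 20,
        },
    },
}


def total(node):
    """Sum the counts over a nested category tree."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


# Share of spoken material in the whole corpus.
spoken_share = total(ice_ireland["Spoken"]) / total(ice_ireland)
```

The subcategories add up to the stated totals (180 + 120 = 300 spoken, 50 + 150 = 200 written), so spoken material makes up three fifths of the corpus.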
All the texts are 5-7 pages long. Unlike those in the BNC, all the texts exemplify cooperative rather than conflictual communication.
The scope of communication situations is diverse and impressive, spanning different discourse types: Riding, Dinner, Nursing Hospital, Chat, Lovely Bread, Lunch, Summer plans, Catching up, Drama, Province Town, Clothes, Pizza, Christmas, Football, Restaurant, Chess Club, Office Space, Australia, Photos, Boyfriends, Student grants, Holistic Medicine, Modern man, Glasses, Shoes, America trip, Kissogram, General elections, Designer clothes, Haircut, Househunting, Sociolinguistics, Byzantium, Clinicians, Workforce, Chernobyl, Belfast politics, Attorney General, Multi-Party Talks, Budget, Traffic Accident, Adoption, Resignation, Medical Evidence, Doctor and Patient, Flatfinders, Ulster Football, Football Match, Queen’s visit, Clinton’s arrival, Clinton’s departure, Horse Racing, Biochemistry Talk, Students Placement, Programming, Education in Irish, Employment, Training, Ceasefire, Radio News, Columbus, Geographical data, Unionist politics, English Literature etc.
Corpus Purpose:
Within Linguistics BNC is used
NB!
Compared with the BNC, ICE-Ireland is at a disadvantage, as
Ø it does not have an Internet address
Ø it remains unchangeable, which is a precondition of its purchase and use, contrary to the requirements of current CL [see Susan Hunston 2005]
Ø it lacks a concordance program or any other search method
Ø it does not contain audio information and thus deprives potential users of the chance to familiarize themselves with samples of spoken language. The SPICE-Ireland Corpus (Kirk et al., 2005, 2007), which contains prosodic, pragmatic and discourse characteristics of spoken language in Ireland, offers partial compensation.
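To show what a concordance program of the kind mentioned in the list above does, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration under simple assumptions, not the search tool of any particular corpus: the function name `kwic`, the `width` parameter and the sample sentence are all invented for the example.

```python
import re


def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the node word with square brackets.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


sample = "The corpus is stored so that the corpus can be searched. A corpus is planned."
hits = kwic(sample, "corpus", width=2)
# hits[0] is "The [corpus] is stored"
```

Each line shows one occurrence of the search word with its immediate context, so recurring patterns around the word can be read off a list instead of from the running text.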
Corpora are used by teams of linguists in different ways and for different purposes. In some countries linguists use corpora merely as a source of computerized texts, which makes searching for data, copying and pasting much easier. They are completely satisfied with this and do not concern themselves with the potential of corpora. No analysis of the data is made.
A corpus is planned, though chance may play a part in the collection of texts.
It is true that Corpus Linguistics might study a corpus as it is; for this reason it is compared to a microscope, which was used initially to examine butterflies and only later was adopted in biology. In other words, Corpus Linguistics cannot be absolutely objective.
Meanwhile, present-day versions of some corpora offer unique opportunities for linguistic analysis that were unimaginable some years ago. These new opportunities require a good knowledge of diverse and complex linguistic analysis from researchers. Obviously, Corpus Linguistics might give impetus to statistical and quantitative analysis, which has long been neglected in linguistics.
Corpora within Applied Linguistics
The use of linguistic corpora in Applied Linguistics has expanded rapidly over the past 20 years. The factors which contributed to the radical change of attitude are as follows
Corpora can be used in
n stylometrics – a discipline devoted to literary style.
n Forensic Linguistics.
Corpora are used to assist the studies of a language produced in different situations.