Corpus Linguistics: What It Is and How It Can Be Applied to Teaching

What is Corpus Linguistics?

Corpora, Concordancing, and Usage

To conduct a corpus-based study of language, one needs access to a corpus and a concordancing program. A corpus is a databank of natural texts, compiled from writing and/or transcriptions of recorded speech. A concordancer is a software program which analyzes corpora and lists the results. The main focus of corpus linguistics is to discover patterns of authentic language use through analysis of actual usage. The aim of a corpus-based analysis is not to generate theories of what is possible in the language, as does Chomsky's phrase structure grammar, which can generate an infinite number of sentences but does not account for the probable choices that speakers actually make. Corpus linguistics' only concern is the usage patterns of the empirical data and what they reveal to us about language behavior.
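To make the idea of a concordancer concrete, here is a minimal sketch in Python of a keyword-in-context (KWIC) listing, the typical output of such a program. It is an illustration only, not a description of any particular concordancing package: the function name concordance, the width parameter, and the three-sentence sample text are all assumptions invented for this example.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines for every match of keyword."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Hypothetical three-sentence "corpus", invented for illustration only.
sample = (
    "Do you have any questions? "
    "I don't have any money. "
    "Any student can join the club."
)

for line in concordance(sample, "any", width=20):
    print(line)
```

Each output line shows one occurrence of the search word with a fixed amount of context on either side, which is what allows a reader to scan down the column and notice recurring patterns. A real concordancer would add sorting, part-of-speech filtering, and support for much larger texts.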

Register Variation

One frequently overlooked aspect of language use which is difficult to keep track of without corpus analysis is register. Registers are varieties of language used in different situations. Language can be divided into many registers, ranging from the general to the highly specific, depending upon the degree of specificity that is sought. General registers include fiction, academic prose, newspapers, and casual conversation, whereas specific registers are sub-registers within, for example, academic prose, such as scientific texts, literary criticism, and linguistics studies, each with its own field-specific characteristics. Corpus analysis reveals that language often behaves differently from register to register, each of which has some unique patterns and rules.

The Advantages of Doing Corpus-Based Analyses

Corpus linguistics provides a more objective view of language than introspection, intuition, and anecdotes do. John Sinclair (1998) pointed out that this is because speakers do not have access to the subliminal patterns which run through a language. A corpus-based analysis can investigate almost any language pattern--lexical, structural, lexico-grammatical, discourse, phonological, morphological--often with very specific agendas such as discovering male versus female usage of tag questions, children's acquisition of irregular past participles, or counterfactual statement error patterns of Japanese students. With the proper analytical tools, an investigator can discover not only the patterns of language use, but also the extent to which they are used and the contextual factors that influence variability. For example, one could examine the past perfect to see how often it is used in speaking versus writing or in newspapers versus fiction. Or one might want to investigate the use of synonyms like begin and start or big/large/great to determine their contextual preferences and frequency distribution.
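The synonym comparison just described can be sketched in a few lines of Python. Because registers in a real corpus differ in size, raw counts are usually converted to a rate, here occurrences per 1,000 words, before they are compared. Everything in this sketch is an assumption made for illustration: the function name per_thousand, the register labels, and the two tiny invented texts, which are far too small to support any real conclusion. The sketch also counts exact word forms only, so began or started would have to be added by hand or handled by lemmatization.

```python
import re
from collections import Counter

def per_thousand(text, words):
    """Return the frequency of each word per 1,000 tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: 1000 * counts[w] / total for w in words}

# Hypothetical miniature "registers", invented for illustration only.
registers = {
    "conversation": "When do we start? Let's start now. I can start tomorrow.",
    "academic": (
        "We begin with a survey. The authors begin by defining terms. "
        "Later sections start from these definitions."
    ),
}

rates = {name: per_thousand(text, ["begin", "start"])
         for name, text in registers.items()}

for name, result in rates.items():
    for word, rate in result.items():
        print(f"{name:<14}{word:<8}{rate:8.1f} per 1,000 words")
```

With genuine corpora of conversation and academic prose in place of the invented texts, the same comparison would show whether one synonym is preferred in a given register and by how much.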


Applying Corpus Linguistics to Teaching

According to Barlow (2002), three realms in which corpus linguistics can be applied to teaching are syllabus design, materials development, and classroom activities.

Syllabus Design

The syllabus organizes the teacher's decisions regarding the focus of a class with respect to the students' needs. Frequency and register information can be quite helpful in making course planning choices. By analyzing a corpus which is relevant to the purpose of a particular class, the teacher can determine which language items are linked to the target register.

Materials Development

The development of materials often relies on a developer's intuitive sense of what students need to learn. With the help of a corpus, a materials developer could create exercises based on real examples which provide students with an opportunity to discover features of language use.  In this scenario, the materials developer could conduct the analysis or simply use a published corpus study as a reference guide.

Classroom Activities

Classroom activities can consist of hands-on, student-conducted language analyses in which the students use a concordancing program and a deliberately chosen corpus to make their own discoveries about language use. The teacher can guide a predetermined investigation which will lead to predictable results or can have the students investigate on their own, leading to less predictable findings. This exemplifies data-driven learning, which encourages learner autonomy by training students to draw their own conclusions about language use.

Teacher/Student Roles and Benefits

The teacher would act as a research facilitator rather than the more traditional imparter of knowledge. The benefit of such student-centered discovery learning is that the students are given access to the facts of authentic language use, which comes from real contexts rather than being constructed for pedagogical purposes, and are challenged to construct generalizations and note patterns of language behavior. Even if this kind of study does not have immediately quantifiable results, studying concordances can make students more aware of language use. Richard Schmidt (1990), a proponent of consciousness-raising, argues that "what language learners become conscious of -- what they pay attention to, what they notice...influences and in some ways determines the outcome of learning." According to Willis (1998), students may be able to determine:

  • the potential different meanings and uses of common words
  • useful phrases and typical collocations they might use themselves
  • the structure and nature of both written and spoken discourse
  • that certain language features are more typical of some kinds of text than others

Barlow (1992) suggests that a corpus and concordancer can be used to:

  • compare language use--student/native speaker, standard English/scientific English, written/spoken
  • analyze the language in books, readers, and course books
  • generate exercises and student activities
  • analyze usage--when is it appropriate to use obtain rather than get?
  • examine word order
  • compare similar words--ask vs. request

Problematic Issues Involved

Several challenges are involved in implementing the use of a corpus for the purpose of teaching. The first is that of corpus selection. For some teaching purposes, any large corpus will serve. Some corpora are available on-line for free (see appendix 2) or on disk. But the teacher needs to make sure that the corpus is useful for the particular teaching context and is representative of the target register. Another option is to construct a corpus, especially when the target register is highly specific. This can be done by using a textbook, a course reader, or a set of articles which the students have to read or which are representative of what they have to read. A corpus does not need to be large in order to be effective. The primary consideration is that of relevance to the students--it ought to be selected with the learning objectives of the class in mind, matching the purpose for learning with the corpus.
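For teachers who want to try constructing a small corpus of their own, the following Python sketch shows one possible way to do it, assuming the course readings have already been saved as plain-text files in a single folder. The folder name course_readings and the function names load_corpus and frequency_list are hypothetical, chosen only for this example. The sketch simply reads the files and lists the most frequent word forms, which is a rough first look at what a class-specific corpus contains.

```python
import re
from collections import Counter
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in folder into a {filename: text} dictionary."""
    folder = Path(folder)
    if not folder.is_dir():
        return {}  # No such folder: return an empty corpus.
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(folder.glob("*.txt"))}

def frequency_list(corpus, top=10):
    """Return the most frequent word forms across all texts in the corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top)

if __name__ == "__main__":
    # "course_readings" is a hypothetical folder name; point it at your own files.
    corpus = load_corpus("course_readings")
    for word, count in frequency_list(corpus):
        print(f"{word}\t{count}")
```

Even a corpus this simple can be searched with a concordancer or compared against a general frequency list to see which items are characteristic of the students' target register.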

Related to the issue of corpus selection is that of corpus bias, which can cause frustration for the teacher and student. This is because the data can be misleading; if one uses a very large general corpus, it may obscure the register variation which reveals important contextual information about language use. The pitfall is that a corpus may tell us more about itself than about language use. Another obstacle to confront is the comprehensibility issue: if concordancing is used in a class, it can be quite difficult for the students (or even the teacher) to understand the data that it provides. Lastly, there is the issue of learning style differences: for some students, discovery learning is simply not the optimal approach. All of these points reinforce the caveat that careful consideration is required before a new technology is introduced in the classroom, especially one which has not been thoroughly explored and streamlined.

Exploiting a Corpus for a Classroom Activity

Although corpora may sound reasonable in theory, applying them to the classroom is challenging because the information they provide appears to be so chaotic. For this reason, it is the teacher's responsibility to harness a corpus by filtering the data for the students. Although I support having students conduct their own analyses, at present I see corpora's greatest potential as a source for materials development. Susan Conrad (2000) suggests that materials writers take register-specific corpus studies into account. Biber, Conrad and Reppen (1998) emphasize the need for materials writers to acknowledge, in their materials design, the frequencies of words and structures which corpus studies reveal. (See Appendix 1 for an example.)

Taking a Closer Look at "Any"

As an English teacher, I have always taught "any" in the following way:

Interrogatives: Are there any Turkish students in your class?

Negatives: No, there aren't any Turkish students in my class.

Affirmatives: *Yes, there are any Turkish students in my class.

A corpus study by Mindt (1998) concluded that 50% of the usage of "any" takes place in affirmative statements, 40% in negative statements, and only 10% in interrogatives. My own concordance analysis bore out his claim, so I constructed the following exercise to represent the percentage distribution of the three structural uses of "any," using ten representative examples. The purpose of this exercise is to get the students to discover the three usage patterns and their relative frequency. These concordance lines can also be exploited for other purposes, such as defining the functions of "any" and the common language chunks in which it occurs. It is assumed that an exercise like this would be part of a lesson context in which the students were studying quantifiers or something related.