Wiktionary:BNC spoken freq

Introduction

These "baseword" lists are four of the fourteen made using word-family frequency figures from the 10-million-token spoken section of the British National Corpus (BNC). They were compiled by Paul Nation. The making of the BNC-based lists, their validation and their use are described in Nation, I.S.P. (2006), "How large a vocabulary is needed for reading and listening?", Canadian Modern Language Review.

The words were divided into fourteen 1000-word-family lists using range and frequency data obtained by running the word families through the 10-million-token spoken section of the BNC. Previously the lists had been sequenced using figures from the whole BNC, but because of the overwhelming amount of formal written material, this resulted in lists that did not satisfactorily represent informal spoken uses of English. Within a given 1000-word list, words are in alphabetical order, not frequency order.
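The banding described above can be sketched as follows. This is only an illustration, not the procedure actually used to build the lists: the function name `split_into_bands`, the toy figures, and the choice to rank by frequency with range as a tie-breaker are all assumptions, since the page does not say exactly how range and frequency were combined.

```python
from typing import Dict, List, Tuple


def split_into_bands(
    stats: Dict[str, Tuple[int, int]], band_size: int = 1000
) -> List[List[str]]:
    """Split word families into bands of `band_size` families each.

    `stats` maps a family headword to (frequency, range).
    Assumed ranking: higher frequency first, then higher range,
    then alphabetical order as a final tie-breaker.
    """
    ranked = sorted(stats, key=lambda w: (-stats[w][0], -stats[w][1], w))
    bands = []
    for start in range(0, len(ranked), band_size):
        band = ranked[start:start + band_size]
        # Within a band, words are listed alphabetically, not by frequency.
        bands.append(sorted(band))
    return bands


# Toy data (invented figures), with a band size of 2 instead of 1000.
toy_stats = {
    "the": (500, 10),
    "be": (450, 10),
    "cat": (20, 6),
    "apple": (15, 5),
    "zebra": (2, 1),
}
bands = split_into_bands(toy_stats, band_size=2)
print(bands)
```

Because each band is re-sorted alphabetically, a word's position inside a band says nothing about its frequency; only the band number does.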

Names of countries, the people of those countries and the major languages of those countries were included in the lists according to their range, frequency and dispersion figures. Most other proper nouns, such as people’s names, the names of cities, and the names of mountain ranges, were not included in the lists. There is, however, a separate list of proper nouns, called basewrd15.txt, that can be run with the program. It contains the words marked as proper nouns in the BNC list, plus a large number of other words that appeared in the corpora and texts run through the program while the lists were being developed.

A few very common abbreviations, such as ROM and UNESCO, were included in the lists. Abbreviations of words already in the files, for example hon for honourable and revd for reverend, were included as family members of the headword. Compound words were included as headwords even when they were transparent, for example airbase, alehouse and breastfeed. The criteria used to make word families were based on Bauer and Nation’s (1993) level 6, which includes all the affixes from levels 2 to 6 (see Table 2).
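As a rough illustration of how family members count towards their headword, the sketch below maps each member form to its headword and adds up family frequencies over a token list. The data structure, the function names `build_index` and `family_frequencies`, and the sample sentence are hypothetical; only the pairs hon/honourable and revd/reverend and the headword airbase come from the text above, and the plural airbases is added here as an assumed regularly inflected member.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical miniature family file: headword -> family members.
families: Dict[str, List[str]] = {
    "honourable": ["honourable", "hon"],
    "reverend": ["reverend", "revd"],
    "airbase": ["airbase", "airbases"],  # plural is an assumed member
}


def build_index(fams: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every member form (lowercased) to its family headword."""
    index = {}
    for headword, members in fams.items():
        for member in members:
            index[member.lower()] = headword
    return index


def family_frequencies(tokens: Iterable[str], index: Dict[str, str]) -> Counter:
    """Count tokens per family; tokens outside every family are skipped."""
    counts: Counter = Counter()
    for token in tokens:
        headword = index.get(token.lower())  # capitalization is ignored
        if headword is not None:
            counts[headword] += 1
    return counts


index = build_index(families)
sample = "The Hon member met the Revd Smith at the airbase near two airbases"
counts = family_frequencies(sample.split(), index)
print(dict(counts))
```

A real run would also need tokenization that handles punctuation, which this sketch leaves out.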

Table 2: Word family levels

Level 1

  • A different form is a different word. Capitalization is ignored.

Level 2

  • Regularly inflected words are part of the same family. The inflectional categories are: plural; third person singular present tense; past tense; past participle; -ing; comparative; superlative; possessive.

Level 3

  • -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, un-, all with restricted uses.

Level 4

  • -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, -ous, in-, all with restricted uses.

Level 5

  • -age (leakage), -al (arrival), -ally (idiotically), -an (American), -ance (clearance), -ant (consultant), -ary (revolutionary), -atory (confirmatory), -dom (kingdom; officialdom), -eer (black marketeer), -en (wooden), -en (widen), -ence (emergence), -ent (absorbent), -ery (bakery; trickery), -ese (Japanese; officialese), -esque (picturesque), -ette (usherette; roomette), -hood (childhood), -i (Israeli), -ian (phonetician; Johnsonian), -ite (Paisleyite; also chemical meaning), -let (coverlet), -ling (duckling), -ly (leisurely), -most (topmost), -ory (contradictory), -ship (studentship), -ward (homeward), -ways (crossways), -wise (endwise; discussion-wise), anti- (anti-inflation), ante- (anteroom), arch- (archbishop), bi- (biplane), circum- (circumnavigate), counter- (counter-attack), en- (encage; enslave), ex- (ex-president), fore- (forename), hyper- (hyperactive), inter- (inter-African, interweave), mid- (mid-week), mis- (misfit), neo- (neo-colonialism), post- (post-date), pro- (pro-British), semi- (semi-automatic), sub- (subclassify; subterranean), un- (untie; unburden).

Level 6

  • -able, -ee, -ic, -ify, -ion, -ist, -ition, -ive, -th, -y, pre-, re-.


BNC lists

Main Lists


1–1000, 1001–2000, 2001–3000, 3001–4000


Other lists


BNC1

Compressed, by headwords: A–E F–M N–R S–Z

Compressed, by letter: A–E F–M N–R S–Z

Only headwords: Long and Compressed

BNC2

Compressed, by headwords: A–E F–M N–R S–Z

Compressed, by letter: A–E F–M N–R S–Z

Only headwords: Long and Compressed

BNC3

Compressed, by headwords: A–E F–M N–R S–Z

Compressed, by letter: A–E F–M N–R S–Z

Only headwords: Long and Compressed

BNC4

Compressed, by headwords: A–E F–M N–R S–Z

Compressed, by letter: A–E F–M N–R S–Z

Only headwords: Long and Compressed