This data set contains a table of the relative frequencies (per 1000 words) of 65 linguistic features (Biber 1988, 1995) for each text document in the British National Corpus (Aston & Burnard 1998).

Biber (1988) introduced these features for the purpose of a multidimensional register analysis. Variables in the data set are numbered according to Biber's list (see e.g. Biber 1995, 95f).

Feature frequencies were automatically extracted from the British National Corpus using query patterns based on part-of-speech tags (Gasthaus 2007). Features 60 and 65 had to be omitted because the automatic methods cannot identify them with sufficient accuracy. For further information on the extraction methodology, see Gasthaus (2007, 20-21). The original data set and the Python scripts used for feature extraction were released by Gasthaus (2007); the version included here contains some bug fixes.
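To illustrate the "per 1000 words" normalisation, the following minimal Python sketch counts a feature in a POS-tagged text and converts the count to a relative frequency. It is not the extraction code of Gasthaus (2007): the function name `relative_frequency`, the toy input, the use of the single tag VVD as a stand-in pattern for past-tense verbs, and the rule that tags starting with "PU" mark punctuation are all assumptions made for illustration only.

```python
def relative_frequency(tagged_tokens, feature_tags, per=1000):
    """Return the number of feature occurrences per `per` word tokens.

    tagged_tokens: iterable of (word, pos_tag) pairs
    feature_tags:  set of POS tags that count as the feature (illustrative)
    """
    # Assumption for this sketch: tags starting with "PU" mark punctuation,
    # which is excluded from the word count.
    words = [(w, t) for w, t in tagged_tokens if not t.startswith("PU")]
    if not words:
        return 0.0
    hits = sum(1 for _, tag in words if tag in feature_tags)
    return per * hits / len(words)


# Toy input, not taken from the BNC.
toy_text = [
    ("She", "PNP"), ("walked", "VVD"), ("home", "AV0"), (".", "PUN"),
    ("He", "PNP"), ("stayed", "VVD"), (".", "PUN"),
]

# 2 matching tokens out of 5 word tokens, scaled to a rate per 1000 words.
past_tense_rate = relative_frequency(toy_text, {"VVD"})
```

Real query patterns are considerably more involved than a single tag lookup; the sketch only shows how a raw count becomes a rate per 1000 words.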




A numeric matrix with 4048 rows and 65 columns, specifying the relative frequencies (per 1000 words) of 65 linguistic features. Documents are listed in the same order as the metadata in BNCmeta and rows are labelled with text IDs, so it is straightforward to combine the two data sets.
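
Because the rows are labelled with text IDs, the feature table can be joined to the metadata by ID rather than by position. The sketch below shows such a join in plain Python. It assumes both tables have been exported to lists of dictionaries; the helper `combine_by_id`, the key names `id` and `genre`, and all IDs and values are hypothetical placeholders, not actual BNC data.

```python
def combine_by_id(feature_rows, meta_rows, id_key="id"):
    """Join feature rows with metadata rows that share the same text ID."""
    meta_by_id = {row[id_key]: row for row in meta_rows}
    combined = []
    for row in feature_rows:
        text_id = row[id_key]
        if text_id not in meta_by_id:
            raise KeyError(f"no metadata for text {text_id!r}")
        # Metadata fields first, then the feature values of the same text.
        combined.append({**meta_by_id[text_id], **row})
    return combined


# Placeholder rows for illustration only.
features = [
    {"id": "T01", "f_01_past_tense": 12.5},
    {"id": "T02", "f_01_past_tense": 40.0},
]
meta = [
    {"id": "T01", "genre": "placeholder_a"},
    {"id": "T02", "genre": "placeholder_b"},
]

combined = combine_by_id(features, meta)
```

Joining by ID rather than by row position keeps the result correct even if one of the tables has been filtered or reordered beforehand.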

A. Tense and aspect markers
f_01_past_tense Past tense
f_02_perfect_aspect Perfect aspect
f_03_present_tense Present tense
B. Place and time adverbials
f_04_place_adverbials Place adverbials (e.g., above, beside, outdoors)
f_05_time_adverbials Time adverbials (e.g., early, instantly, soon)
C. Pronouns and pro-verbs
f_06_first_person_pronouns First-person pronouns
f_07_second_person_pronouns Second-person pronouns
f_08_third_person_pronouns Third-person personal pronouns (excluding it)
f_09_pronoun_it Pronoun it
f_10_demonstrative_pronoun Demonstrative pronouns (that, this, these, those as pronouns)
f_11_indefinite_pronoun Indefinite pronouns (e.g., anybody, nothing, someone)
f_12_proverb_do Pro-verb do
D. Questions
f_13_wh_question Direct wh-questions
E. Nominal forms
f_14_nominalization Nominalizations (ending in -tion, -ment, -ness, -ity)
f_15_gerunds Gerunds (participial forms functioning as nouns)
f_16_other_nouns Total other nouns
F. Passives
f_17_agentless_passives Agentless passives
f_18_by_passives by-passives
G. Stative forms
f_19_be_main_verb be as main verb
f_20_existential_there Existential there
H. Subordination features
f_21_that_verb_comp that verb complements (e.g., I said that he went.)
f_22_that_adj_comp that adjective complements (e.g., I'm glad that you like it.)
f_23_wh_clause wh-clauses (e.g., I believed what he told me.)
f_24_infinitives Infinitives
f_25_present_participle Present participial adverbial clauses (e.g., Stuffing his mouth with cookies, Joe ran out the door.)
f_26_past_participle Past participial adverbial clauses (e.g., Built in a single week, the house would stand for fifty years.)
f_27_past_participle_whiz Past participial postnominal (reduced relative) clauses (e.g., the solution produced by this process)
f_28_present_participle_whiz Present participial postnominal (reduced relative) clauses (e.g., the event causing this decline)
f_29_that_subj that relative clauses on subject position (e.g., the dog that bit me)
f_30_that_obj that relative clauses on object position (e.g., the dog that I saw)
f_31_wh_subj wh relatives on subject position (e.g., the man who likes popcorn)
f_32_wh_obj wh relatives on object position (e.g., the man who Sally likes)
f_33_pied_piping Pied-piping relative clauses (e.g., the manner in which he was told)
f_34_sentence_relatives Sentence relatives (e.g., Bob likes fried mangoes, which is the most disgusting thing I've ever heard of.)
f_35_because Causative adverbial subordinator (because)
f_36_though Concessive adverbial subordinators (although, though)
f_37_if Conditional adverbial subordinators (if, unless)
f_38_other_adv_sub Other adverbial subordinators (e.g., since, while, whereas)
I. Prepositional phrases, adjectives and adverbs
f_39_prepositions Total prepositional phrases
f_40_adj_attr Attributive adjectives (e.g., the big horse)
f_41_adj_pred Predicative adjectives (e.g., The horse is big.)
f_42_adverbs Total adverbs
J. Lexical specificity
f_43_type_token Type-token ratio (including punctuation)
f_44_mean_word_length Average word length (across tokens, excluding punctuation)
K. Lexical classes
f_45_conjuncts Conjuncts (e.g., consequently, furthermore, however)
f_46_downtoners Downtoners (e.g., barely, nearly, slightly)
f_47_hedges Hedges (e.g., at about, something like, almost)
f_48_amplifiers Amplifiers (e.g., absolutely, extremely, perfectly)
f_49_emphatics Emphatics (e.g., a lot, for sure, really)
f_50_discourse_particles Discourse particles (e.g., sentence-initial well, now, anyway)
f_51_demonstratives Demonstratives
L. Modals
f_52_modal_possibility Possibility modals (can, may, might, could)
f_53_modal_necessity Necessity modals (ought, should, must)
f_54_modal_predictive Predictive modals (will, would, shall)
M. Specialized verb classes
f_55_verb_public Public verbs (e.g., assert, declare, mention)
f_56_verb_private Private verbs (e.g., assume, believe, doubt, know)
f_57_verb_suasive Suasive verbs (e.g., command, insist, propose)
f_58_verb_seem seem and appear
N. Reduced forms and dispreferred structures
f_59_contractions Contractions
n/a Subordinator that deletion (e.g., I think [that] he went.)
f_61_stranded_preposition Stranded prepositions (e.g., the candidate that I was thinking of)
f_62_split_infinitve Split infinitives (e.g., He wants to convincingly prove that ...)
f_63_split_auxiliary Split auxiliaries (e.g., They were apparently shown to ...)
O. Co-ordination
f_64_phrasal_coordination Phrasal co-ordination (N and N; Adj and Adj; V and V; Adv and Adv)
n/a Independent clause co-ordination (clause-initial and)
P. Negation
f_66_neg_synthetic Synthetic negation (e.g., No answer is good enough for Jones.)
f_67_neg_analytic Analytic negation (e.g., That's not likely.)
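
The variable names in the list above follow the pattern f_NN_label, where NN is the feature number in Biber's list. The following Python sketch recovers that number from a column name and confirms that 67 numbered features minus the two omitted ones (60 and 65) give 65 columns. The helper `feature_number` and the regular expression are assumptions for illustration, and only a handful of the names listed above are used as sample input.

```python
import re

# Two-digit feature number followed by a descriptive label.
NAME_PATTERN = re.compile(r"^f_(\d{2})_(\w+)$")


def feature_number(column_name):
    """Extract the feature number from a name such as 'f_01_past_tense'."""
    match = NAME_PATTERN.match(column_name)
    if match is None:
        raise ValueError(f"unexpected column name: {column_name!r}")
    return int(match.group(1))


# A few of the variable names from the list above.
sample_columns = [
    "f_01_past_tense",
    "f_59_contractions",
    "f_61_stranded_preposition",
    "f_64_phrasal_coordination",
    "f_66_neg_synthetic",
    "f_67_neg_analytic",
]
numbers = [feature_number(name) for name in sample_columns]

# Features 60 and 65 are omitted from the data set.
expected_numbers = [n for n in range(1, 68) if n not in (60, 65)]
```

Selecting columns by number in this way avoids retyping the full variable names, including the spelling of `f_62_split_infinitve` as it appears in the list.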


Stefan Evert; feature extractor by Jan Gasthaus (2007).


Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage.

Biber, Douglas (1988). Variation across Speech and Writing. Cambridge University Press, Cambridge.

Biber, Douglas (1995). Dimensions of Register Variation: A cross-linguistic comparison. Cambridge University Press, Cambridge.

Gasthaus, Jan (2007). Prototype-Based Relevance Learning for Genre Classification. B.Sc. thesis, Institute of Cognitive Science, University of Osnabrück. Data sets and software are available online.

