
Corpus

Also known as: Language Corpus, Text Corpus

A corpus (plural: corpora) is a large, structured collection of texts used to train, tune, or evaluate language-processing systems. Representative examples include the British National Corpus (BNC, roughly 100 million words of British English), the Penn Treebank, and, more recently, Common Crawl and domain-specific corpora. In accessibility research, corpora underpin the language models used for predictive text, automatic captioning, speech synthesis, AAC word prediction, and text simplification. Corpus choice therefore has direct consequences for accessibility: a model trained only on standard written English will handle dialectal, non-native, or AAC-mediated text less well, which can disadvantage users whose language patterns are under-represented in the training data.
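One rough way to see this coverage gap is to measure how many words in a user's text never appear in the training corpus (the out-of-vocabulary rate). The sketch below is illustrative only: the tokenizer is deliberately crude, the function names (`tokenize`, `build_vocabulary`, `oov_rate`) are hypothetical, and the sample sentences are invented toy data, not drawn from any real corpus.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Crude tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(corpus: list[str], min_count: int = 1) -> set[str]:
    # Collect every token that occurs at least min_count times in the corpus.
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    return {tok for tok, n in counts.items() if n >= min_count}


def oov_rate(text: str, vocabulary: set[str]) -> float:
    # Fraction of tokens in text that the corpus vocabulary has never seen.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in vocabulary)
    return unseen / len(tokens)


# Invented toy data: a tiny "standard written English" training corpus.
training_corpus = [
    "I would like a cup of tea please",
    "Could you open the window please",
]
standard_sample = "I would like the window open please"
aac_sample = "want tea now pls"  # telegraphic, abbreviated phrasing

vocab = build_vocabulary(training_corpus)
print(oov_rate(standard_sample, vocab))  # 0.0
print(oov_rate(aac_sample, vocab))       # 0.75
```

A high out-of-vocabulary rate for one group of users is only a coarse signal, but it suggests where a corpus may need supplementing before a model built on it is deployed in an assistive tool.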

Category: Natural Language Processing · Data · Research Methods

Related: Natural Language Processing · Machine Learning
