Notes on "Natural Language Toolkit" - Chapter 2: Accessing Text Corpora and Lexical Resources

1. Accessing Text Corpora

1.1 Gutenberg Corpus

What is a corpus?

A large, structured collection of texts (plural: corpora).

How to list the texts (files) in a corpus?

corpus.fileids()

How to access a default corpus in NLTK?

from nltk.corpus import gutenberg
gutenberg.words('austen-emma.txt')

How to obtain the content of a file without any linguistic processing (not split up into tokens)?

gutenberg.raw(fileid)

How to divide the text up into its sentences?

gutenberg.sents('shakespeare-macbeth.txt')
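The four accessors above (fileids, raw, words, sents) can be combined into per-file statistics. This is a minimal sketch: corpus_stats and demo are hypothetical helper names, and demo assumes the 'gutenberg' data and a Punkt sentence tokenizer model have already been fetched with nltk.download.

```python
def corpus_stats(reader):
    """Return {fileid: (chars per word, words per sentence)} for an NLTK-style reader."""
    stats = {}
    for fileid in reader.fileids():
        num_chars = len(reader.raw(fileid))    # raw text, whitespace included
        num_words = len(reader.words(fileid))  # word and punctuation tokens
        num_sents = len(reader.sents(fileid))  # sentences, each a list of tokens
        stats[fileid] = (num_chars / num_words, num_words / num_sents)
    return stats


def demo():
    # Imported here so corpus_stats stays usable without the corpus data.
    # Assumption: the 'gutenberg' corpus and Punkt model are already downloaded.
    from nltk.corpus import gutenberg
    for fileid, (word_len, sent_len) in corpus_stats(gutenberg).items():
        print(f"{fileid}: {word_len:.1f} chars/word, {sent_len:.1f} words/sentence")
```

The chars-per-word figure counts whitespace in the raw text, so it overstates the true average word length.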

1.2 Web and Chat Text

How to access default web texts in NLTK?

from nltk.corpus import webtext

How to access default chat conversations in NLTK?

from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

1.3 Brown Corpus

What is stylistics?

The study of systematic differences between genres. Counts of modal verbs, for example, can distinguish genres: in the Brown Corpus, the most frequent modal in the ‘news’ genre is ‘will’, while the most frequent modal in the ‘romance’ genre is ‘could’.

What is the Brown Corpus?

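The first million-word electronic corpus of English, created at Brown University in 1961. Its texts are categorized by genre, which makes it a convenient resource for studying systematic differences between genres: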

from nltk.corpus import brown
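The modal-verb comparison from the stylistics note can be sketched with a plain Counter. The helper names and the MODALS tuple are illustrative choices, not part of NLTK, and demo assumes the 'brown' data has been downloaded.

```python
from collections import Counter

# Illustrative selection of modal verbs to compare.
MODALS = ("can", "could", "may", "might", "must", "will")


def modal_counts(reader, genre, modals=MODALS):
    """Count each modal (case-insensitively) in one genre of a categorized corpus."""
    counts = Counter(word.lower() for word in reader.words(categories=genre))
    return {modal: counts[modal] for modal in modals}


def most_frequent_modal(reader, genre, modals=MODALS):
    """Return the modal with the highest count in the given genre."""
    counts = modal_counts(reader, genre, modals)
    return max(counts, key=counts.get)


def demo():
    # Assumption: the 'brown' corpus has already been fetched with nltk.download.
    from nltk.corpus import brown
    for genre in ("news", "romance"):
        print(genre, most_frequent_modal(brown, genre), modal_counts(brown, genre))
```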

1.4 Reuters Corpus

What is the Reuters Corpus?

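A corpus of 10,788 news documents (about 1.3 million words), classified into 90 topics and split into a training set and a test set. It is meant for training and testing algorithms that automatically detect the topic of a document. Unlike the Brown Corpus, its categories overlap, because a news story often covers more than one topic.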

from nltk.corpus import reuters
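The category overlap can be inspected from both directions: categories per document and documents per category. A minimal sketch, with hypothetical helper names; it relies only on the reader's fileids and categories methods and assumes the 'reuters' data is installed when run against the real corpus.

```python
def multi_topic_documents(reader):
    """Return {fileid: [categories]} for every document with more than one category."""
    overlaps = {}
    for fileid in reader.fileids():
        cats = reader.categories(fileid)  # categories assigned to this one document
        if len(cats) > 1:
            overlaps[fileid] = list(cats)
    return overlaps


def shared_documents(reader, cat_a, cat_b):
    """Return the sorted fileids that are filed under both categories."""
    return sorted(set(reader.fileids(cat_a)) & set(reader.fileids(cat_b)))


def demo():
    # Assumption: the 'reuters' corpus has already been fetched with nltk.download.
    from nltk.corpus import reuters
    print(len(multi_topic_documents(reuters)), "documents have several topics")
    print(shared_documents(reuters, "barley", "corn")[:5])
```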

1.5 Inaugural Address Corpus

What is the Inaugural Address Corpus?

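A corpus of US presidential inaugural addresses, one text per address. Each fileid starts with the year of the address, so the corpus shows how language use changes over time.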

from nltk.corpus import inaugural
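Because fileids look like '1789-Washington.txt', the first four characters give the year, and word usage can be tracked per address. A minimal sketch: usage_by_year is a hypothetical helper, the target prefixes are arbitrary examples, and demo assumes the 'inaugural' data is installed.

```python
def usage_by_year(reader, targets):
    """Return {year: {target: count}} of words starting with each target prefix."""
    usage = {}
    for fileid in reader.fileids():
        year = int(fileid[:4])  # fileids begin with the four-digit year
        counts = dict.fromkeys(targets, 0)
        for word in reader.words(fileid):
            lowered = word.lower()
            for target in targets:
                if lowered.startswith(target):
                    counts[target] += 1
        usage[year] = counts
    return usage


def demo():
    # Assumption: the 'inaugural' corpus has already been fetched with nltk.download.
    from nltk.corpus import inaugural
    for year, counts in sorted(usage_by_year(inaugural, ["america", "citizen"]).items()):
        print(year, counts)
```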

1.6 Annotated Text Corpora

How to get a list of all NLTK corpora?

Visit http://nltk.org/data

1.7 Corpora in Other Languages

How to access a corpus in other languages?

The udhr corpus contains the Universal Declaration of Human Rights in over 300 languages:

from nltk.corpus import udhr
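One way to compare languages in this corpus is by word-length distribution. A minimal sketch: word_length_distribution is a hypothetical helper, and demo assumes the 'udhr' data is installed and that the two fileids shown are present in it.

```python
from collections import Counter


def word_length_distribution(reader, fileid):
    """Return {word length: share of tokens} for one file, ordered by length."""
    lengths = Counter(len(word) for word in reader.words(fileid))
    total = sum(lengths.values())
    return {n: lengths[n] / total for n in sorted(lengths)}


def demo():
    # Assumption: the 'udhr' corpus has already been fetched with nltk.download.
    from nltk.corpus import udhr
    for fileid in ("English-Latin1", "German_Deutsch-Latin1"):
        print(fileid, word_length_distribution(udhr, fileid))
```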

1.8 Text Corpus Structure

How to access the categories of a corpus?

corpus.categories()

How to list the words contained in the corpus?

corpus.words()
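Not every corpus reader is categorized, so categories() is only available on some of them (Brown and Reuters have it, Gutenberg does not). A minimal sketch of a summary that copes with both cases; describe is a hypothetical helper name.

```python
def describe(reader):
    """Summarize a corpus reader: file count, categories (if any), token count."""
    # Plain readers have no categories() method, so fall back to an empty list.
    categories = reader.categories() if hasattr(reader, "categories") else []
    return {
        "files": len(reader.fileids()),
        "categories": list(categories),
        "words": len(reader.words()),
    }


def demo():
    # Assumption: the 'brown' and 'gutenberg' corpora are already downloaded.
    from nltk.corpus import brown, gutenberg
    print(describe(brown))
    print(describe(gutenberg))
```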

1.9 Loading your own Corpus

How to load your own corpus?

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')

For a local copy of the Penn Treebank, use BracketParseCorpusReader:

from nltk.corpus import BracketParseCorpusReader
corpus_root = r'C:\corpora\penntreebank\parsed\mrg\wsj'
file_pattern = r'.*/wsj_.*\.mrg'
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
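The PlaintextCorpusReader pattern can be tried on a throwaway directory without touching real data. A minimal sketch: the helper names, the file name note.txt, and the '.*\.txt' pattern are illustrative. It needs the nltk package but no corpus download, because words() uses a regular-expression tokenizer by default.

```python
import os
import tempfile


def load_plaintext_corpus(root, pattern=r".*\.txt"):
    """Wrap every file under root whose name matches pattern in a corpus reader."""
    # Imported here so the module loads even where nltk is not installed.
    from nltk.corpus import PlaintextCorpusReader
    return PlaintextCorpusReader(root, pattern)


def demo():
    """Build a one-file corpus in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "note.txt")
        with open(path, "w", encoding="utf8") as handle:
            handle.write("Hello, world! Hello again.")
        corpus = load_plaintext_corpus(root)
        # Materialize the tokens before the temporary directory is removed.
        return corpus.fileids(), list(corpus.words("note.txt"))
```

Calling sents() on such a reader additionally requires a sentence tokenizer model to be installed.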