Notes on "Natural Language Toolkit" - Chapter 2: Accessing Text Corpora and Lexical Resources
1. Accessing Text Corpora
1.1 Gutenberg Corpus
What is a corpus?
A large structured collection of texts.
How to get the text list of a corpus?
corpus.fileids()
How to access a default corpus in NLTK?
from nltk.corpus import gutenberg
gutenberg.words('austen-emma.txt')
How to obtain the content of a file without any linguistic processing (not split up into tokens)?
gutenberg.raw(fileid)
How to divide the text up into its sentences?
gutenberg.sents('shakespeare-macbeth.txt')
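The calls above (fileids, raw, words, sents) can be combined into a per-file summary. The following is a minimal sketch, not the book's code: file_stats and gutenberg_report are hypothetical helper names. file_stats only assumes a reader that exposes raw, words, and sents, and files that are not empty. Running gutenberg_report additionally assumes the Gutenberg data (and the sentence tokenizer data that sents relies on) has been fetched with nltk.download.

```python
def file_stats(reader, fileid):
    """Return (characters per token, tokens per sentence, tokens per distinct word).

    Characters are counted on the raw text, so whitespace is included.
    """
    num_chars = len(reader.raw(fileid))
    words = reader.words(fileid)
    num_words = len(words)
    num_sents = len(reader.sents(fileid))
    num_vocab = len(set(w.lower() for w in words))
    return (num_chars / num_words, num_words / num_sents, num_words / num_vocab)


def gutenberg_report():
    """Print the three statistics for every file in the Gutenberg corpus."""
    # Imported here so that file_stats stays usable without the corpus data.
    from nltk.corpus import gutenberg

    for fileid in gutenberg.fileids():
        per_token, per_sent, per_type = file_stats(gutenberg, fileid)
        print(f"{fileid}: {per_token:.1f} {per_sent:.1f} {per_type:.1f}")
```

Calling gutenberg_report() prints one line per file; the first number is a rough average word length, the second an average sentence length.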
1.2 Web and Chat Text
How to access default web texts in NLTK?
from nltk.corpus import webtext
How to access default chat conversations in NLTK?
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
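Each post returned by nps_chat.posts is a list of tokens. As a small sketch of how to look at a few of them, the hypothetical helper preview_posts below joins the tokens of the first posts back into strings. chat_preview assumes the nps_chat data has been downloaded.

```python
from itertools import islice


def preview_posts(posts, limit=3):
    """Join the tokens of the first `limit` posts into display strings."""
    return [" ".join(tokens) for tokens in islice(posts, limit)]


def chat_preview():
    """Preview the first posts of one chat session (needs the nps_chat data)."""
    from nltk.corpus import nps_chat

    chatroom = nps_chat.posts('10-19-20s_706posts.xml')
    return preview_posts(chatroom)
```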
1.3 Brown Corpus
What is stylistics?
The study of systematic differences between genres. Word counts might distinguish genres: the most frequent modal in the ‘news’ genre is ‘will’, while the most frequent modal in the ‘romance’ genre is ‘could’.
What is the Brown Corpus?
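The first million-word electronic corpus of English, created in 1961 at Brown University, with texts categorized by genre (news, editorial, romance, and so on). A convenient resource for studying systematic differences between genres: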
from nltk.corpus import brown
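The modal-verb comparison between genres can be sketched as follows. This is an illustration under assumptions: modal_counts and brown_modal_counts are hypothetical helpers, the list of modals is one possible choice, and collections.Counter is used here as a standard-library stand-in for a frequency distribution. brown_modal_counts assumes the Brown data has been downloaded.

```python
from collections import Counter

# One possible set of modal verbs to compare (an assumption of this sketch).
MODALS = ("can", "could", "may", "might", "must", "will")


def modal_counts(words, modals=MODALS):
    """Count case-insensitive occurrences of each modal in a word sequence."""
    counts = Counter(w.lower() for w in words)
    return {m: counts[m] for m in modals}


def brown_modal_counts(genre):
    """Modal counts for one Brown genre, e.g. 'news' or 'romance'."""
    from nltk.corpus import brown

    return modal_counts(brown.words(categories=genre))
```

Comparing brown_modal_counts('news') with brown_modal_counts('romance') is one way to check the claim about 'will' and 'could'.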
1.4 Reuters Corpus
What is the Reuters Corpus?
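A collection of 10,788 news documents classified into 90 topics and split into training and test sets, intended for training and testing algorithms that automatically detect the topic of a document. Unlike the Brown Corpus, its categories overlap, because a news story often covers several topics.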
from nltk.corpus import reuters
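The overlap between categories can be inspected by asking for the categories of each file. A minimal sketch, assuming a reader whose categories method accepts a single fileid (as the Reuters reader does); multi_topic_documents and reuters_overlap are hypothetical names, and reuters_overlap assumes the Reuters data has been downloaded.

```python
def multi_topic_documents(reader):
    """Map each fileid that carries more than one category to its categories."""
    result = {}
    for fileid in reader.fileids():
        topics = reader.categories(fileid)
        if len(topics) > 1:
            result[fileid] = list(topics)
    return result


def reuters_overlap():
    """Documents in the Reuters corpus that belong to several topics."""
    from nltk.corpus import reuters

    return multi_topic_documents(reuters)
```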
1.5 Inaugural Address Corpus
What is the Inaugural Address Corpus?
A temporal corpus: a collection of U.S. presidential inaugural addresses, one text per address, representing language use over time.
from nltk.corpus import inaugural
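Because each address is a separate file, word usage can be tracked over time. The sketch below assumes that each fileid starts with the four-digit year of the address (for example '1789-Washington.txt'); prefix_counts_by_year and inaugural_trend are hypothetical helpers, and inaugural_trend assumes the inaugural data has been downloaded.

```python
from collections import Counter


def prefix_counts_by_year(reader, prefix):
    """Count words starting with `prefix` (case-insensitive), keyed by year.

    Assumes the first four characters of each fileid are the year.
    """
    prefix = prefix.lower()
    counts = Counter()
    for fileid in reader.fileids():
        year = fileid[:4]
        counts[year] += sum(
            1 for w in reader.words(fileid) if w.lower().startswith(prefix)
        )
    return dict(counts)


def inaugural_trend(prefix="america"):
    """Per-year counts of words such as 'America' or 'American'."""
    from nltk.corpus import inaugural

    return prefix_counts_by_year(inaugural, prefix)
```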
1.6 Annotated Text Corpora
How to get a list of all NLTK corpora?
Visit http://nltk.org/data
1.7 Corpora in Other Languages
The udhr corpus contains the Universal Declaration of Human Rights in over 300 languages.
from nltk.corpus import udhr
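Since the same document is available in many languages, simple statistics such as word length can be compared across them. A minimal sketch: word_length_counts and udhr_word_lengths are hypothetical helpers, and 'English-Latin1' is used as the example fileid on the assumption that the udhr data has been downloaded.

```python
from collections import Counter


def word_length_counts(words):
    """Count how many words there are of each length."""
    return Counter(len(w) for w in words)


def udhr_word_lengths(fileid="English-Latin1"):
    """Word-length counts for one language version of the declaration."""
    from nltk.corpus import udhr

    return word_length_counts(udhr.words(fileid))
```

Calling udhr_word_lengths with other entries from udhr.fileids() allows the distributions to be compared between languages.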
1.8 Text Corpus Structure
How to access the categories of a corpus?
corpus.categories()
How to list the words contained in the corpus?
corpus.words()
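For a categorized corpus, categories() and words() can be combined, since words() accepts a categories argument. The sketch below uses a hypothetical helper, category_sizes, that works with any reader offering those two methods; brown_category_sizes assumes the Brown data has been downloaded.

```python
def category_sizes(reader):
    """Return the number of word tokens in each category of a corpus."""
    return {cat: len(reader.words(categories=cat)) for cat in reader.categories()}


def brown_category_sizes():
    """Token counts per genre for the Brown Corpus."""
    from nltk.corpus import brown

    return category_sizes(brown)
```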
1.9 Loading your own Corpus
How to load your own corpus?
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
Or, for a local copy of the Penn Treebank:
from nltk.corpus import BracketParseCorpusReader
corpus_root = r'C:\corpora\penntreebank\parsed\mrg\wsj'
file_pattern = r'.*/wsj_.*\.mrg'
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
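To try PlaintextCorpusReader without depending on files that may not exist on your machine, a throwaway corpus can be written to a temporary directory first. This is a sketch under assumptions: write_demo_corpus and load_demo_corpus are hypothetical helpers, and the file names and contents are made up for illustration.

```python
import os
import tempfile


def write_demo_corpus(root):
    """Write two small text files into `root` and return their names, sorted."""
    samples = {"a.txt": "Hello corpus world.", "b.txt": "A second file."}
    for name, text in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as f:
            f.write(text)
    return sorted(samples)


def load_demo_corpus(root):
    """Build a reader over every .txt file directly under `root`."""
    from nltk.corpus import PlaintextCorpusReader

    return PlaintextCorpusReader(root, r".*\.txt")
```

A possible usage: create a directory with tempfile.TemporaryDirectory(), call write_demo_corpus on it, then call fileids() and raw('a.txt') on the reader returned by load_demo_corpus.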