Notes on "Natural Language Processing with Python" (the NLTK book) - Chapter 1: Language Processing and Python
1. Computing with Language: Texts and Words
1.1 Getting Started with Python
1.2 Getting Started with NLTK
How to download corpora?
import nltk
nltk.download()
How to load corpora?
from nltk.book import *
1.3 Searching Text
What is a concordance view?
A concordance view displays every occurrence of a given word in its context (= preceding and following words).
text.concordance('word')
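To make the idea concrete, here is a rough sketch of a concordance on a plain Python list of tokens. The function name simple_concordance and the toy sentence are made up for illustration; NLTK's own concordance works on characters of context and formats its output differently.

```python
# Toy stand-in for an NLTK Text: a plain list of tokens (assumption).
tokens = "the whale swam and the whale dived under the boat".split()

def simple_concordance(tokens, word, width=2):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - width):i]      # preceding tokens
            right = tokens[i + 1:i + 1 + width]     # following tokens
            lines.append(" ".join(left + [tok] + right))
    return lines

for line in simple_concordance(tokens, "whale"):
    print(line)
```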
What other words appear in a similar range of contexts?
text.similar('word')
How to examine shared contexts between words?
text.common_contexts(['word', 'writer'])
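A simplified sketch of what "shared contexts" means, assuming a context is just the pair (previous token, next token). The helper name contexts and the toy sentence are hypothetical, and NLTK's implementation differs in its details.

```python
# Toy token list standing in for an NLTK Text (assumption).
tokens = "a very big dog and a very small dog".split()

def contexts(tokens, word):
    """Set of (previous token, next token) pairs around each occurrence of `word`."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

# Contexts that both words appear in.
shared = contexts(tokens, "big") & contexts(tokens, "small")
print(shared)
```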
What’s a dispersion plot? How to obtain it?
A dispersion plot displays the locations of a word in the text: each stripe represents one occurrence of the word, and each row represents the entire text.
text.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])
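The data behind such a plot is just the list of positions at which each word occurs. A small sketch on a toy token list (the sentence and the helper name offsets are made up for illustration):

```python
# Toy token list standing in for an NLTK Text (assumption).
tokens = "we the people of the united states".split()

def offsets(tokens, word):
    """Positions (counted from 0) at which `word` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == word]

for word in ["the", "people", "freedom"]:
    print(word, offsets(tokens, word))
```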
1.4 Counting Vocabulary
What’s a token?
A sequence of characters treated as a single unit, such as a word or a punctuation symbol.
How to obtain the number of tokens?
len(text)
What’s the vocabulary of a text?
The vocabulary of a text is the set of distinct tokens it contains.
set(text)
How to sort an array in Python?
sorted(array)
What’s a word type?
A word considered as a unique item of the vocabulary, independently of how many times it occurs in the text.
How do you quantify the lexical richness of a text?
Divide the number of distinct words by the total number of words. The result is the proportion of distinct words, a value between 0 and 1 (multiply by 100 for a percentage).
len(set(text)) / len(text)
How to count occurrences of a specific word within a text?
text.count('word')
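A small sketch of these counting operations on a toy token list, so it runs without downloading any corpus (the list itself is made up for illustration):

```python
# Toy token list standing in for an NLTK Text (assumption).
tokens = ["to", "be", "or", "not", "to", "be", "."]

total = len(tokens)            # number of tokens
distinct = len(set(tokens))    # size of the vocabulary
richness = distinct / total    # proportion of distinct tokens

print(f"{distinct} distinct / {total} total = {richness:.2f}")
print(tokens.count("to"))      # occurrences of one specific token
```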
How to create a function in Python?
def function_x(param):
    return True
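For example, the lexical richness calculation can be wrapped in a function. The names lexical_diversity and percentage follow the book's examples as far as I recall; treat them as illustrative.

```python
def lexical_diversity(tokens):
    """Proportion of distinct tokens in a list of tokens."""
    return len(set(tokens)) / len(tokens)

def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

sample = ["a", "b", "a", "a"]          # toy data (assumption)
print(lexical_diversity(sample))
print(percentage(sample.count("a"), len(sample)))
```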
2. A Closer Look at Python: Texts as Lists of Words
2.1 Lists
How to create a list in Python?
['item1', 'item2', 'item3']
How to concatenate lists in Python?
list1 + list2
How to append an element to a list in Python?
list.append(element)
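A short sketch of the three list operations together (the variable names are arbitrary). Note that + builds a new list, while append changes the list in place and returns None.

```python
sent1 = ["Call", "me"]
sent2 = ["Ishmael", "."]

combined = sent1 + sent2     # concatenation creates a new list
combined.append("!")         # append modifies `combined` in place

print(combined)
print(sent1)                 # the original lists are unchanged
```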
2.2 Indexing lists
What’s an index?
It’s the position of an item inside a list, counted from 0.
How to find the index of a value?
text.index(value)
How to access a value given an index?
text[index]
What is slicing?
It’s retrieving a sublist. The slice starts at index1 and stops just before index2.
text[index1:index2] ; text[index1:] ; text[:index2]
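A quick sketch of indexing and slicing on a toy sentence, showing that the end index of a slice is excluded:

```python
sent = ["Call", "me", "Ishmael", "."]

position = sent.index("Ishmael")   # index of the first occurrence of a value
first = sent[0]                    # value at a given index
middle = sent[1:3]                 # items at indexes 1 and 2 (3 is excluded)
head = sent[:2]                    # from the start up to, not including, index 2
tail = sent[2:]                    # from index 2 to the end

print(position, first, middle, head, tail)
```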
2.3 Variables
2.4 Strings
What’s a string?
Strings are sequences of characters, so a string supports the same indexing and slicing operations as a list. Unlike lists, strings are immutable: they cannot be modified in place.
How to convert a list to a string?
' '.join(list)
How to convert a string to a list?
string.split()
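A small sketch of the round trip between a list of words and a string, plus the indexing behaviour strings share with lists (the example words are arbitrary):

```python
words = ["Monty", "Python"]

joined = " ".join(words)        # list -> string
split_again = joined.split()    # string -> list (splits on whitespace)

name = "Monty"
first_char = name[0]            # indexing works as on a list
prefix = name[:3]               # so does slicing

# Strings are immutable: item assignment raises TypeError.
try:
    name[0] = "m"
    mutable = True
except TypeError:
    mutable = False

print(joined, split_again, first_char, prefix, mutable)
```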
3. Computing with Language: Simple Statistics
3.1 Frequency Distributions
What is a frequency distribution?
It’s a mapping from each vocabulary item to the number of times it occurs in a given text.
fdist = FreqDist(text)
How to get the most frequent tokens?
fdist.most_common(<number of tokens to get>)
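NLTK's FreqDist is built on Python's collections.Counter, so a plain Counter is used here as a stand-in to sketch the idea without installing NLTK. The toy sentence is made up for illustration.

```python
from collections import Counter

# Toy token list standing in for an NLTK Text (assumption).
tokens = "the cat sat on the mat and the cat slept".split()

fdist = Counter(tokens)       # maps each token to its count

print(fdist["the"])           # count of one token
print(fdist.most_common(2))   # the two most frequent tokens with their counts
```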
How to obtain a cumulative frequency plot?
A cumulative frequency plot shows what proportion of a text is accounted for by the most common tokens:
fdist.plot(50, cumulative=True)
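The numbers behind such a plot can be computed by hand. A sketch, again using collections.Counter as a stand-in for FreqDist and a made-up toy sentence:

```python
from collections import Counter
from itertools import accumulate

tokens = "the cat sat on the mat and the cat slept".split()
fdist = Counter(tokens)

total = sum(fdist.values())    # total number of tokens
top = fdist.most_common(2)     # most frequent tokens first

# Running share of the text covered by the top 1, top 2, ... tokens.
cumulative = [c / total for c in accumulate(count for _, count in top)]

print(cumulative)
```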
What’s a hapax?
A hapax is a word that occurs only once in a text. Hapaxes are numerous and, seen out of context, usually say little about what the text is about.
fdist.hapaxes()
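With a plain Counter standing in for FreqDist (an assumption made so the sketch runs without NLTK), the hapaxes are simply the items whose count is 1:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()
fdist = Counter(tokens)

# Tokens that occur exactly once in the toy text.
hapaxes = sorted(w for w, count in fdist.items() if count == 1)

print(hapaxes)
```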
3.2 Fine-grained Selection of Words
How to operate a fine-grained word selection by word length and frequency?
Obtain the words that are more than 7 characters long and that appear more than 7 times in the text:
sorted(w for w in set(text) if len(w) > 7 and fdist[w] > 7)
The result helps identify the words that characterize the content of a text.
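The same selection on a toy token list, with the frequency threshold lowered to 1 so that the tiny example returns something. The data and the thresholds are made up for illustration.

```python
from collections import Counter

tokens = ["constitution", "of", "the", "government", "constitution",
          "the", "people", "government", "constitution", "amendment"]
fdist = Counter(tokens)   # stand-in for nltk's FreqDist (assumption)

# Long words (more than 7 characters) that occur more than once.
selected = sorted(w for w in set(tokens) if len(w) > 7 and fdist[w] > 1)

print(selected)
```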
3.3 Collocations and Bigrams
What’s a bigram?
A pair of adjacent words in a text.
list(bigrams(['more', 'is', 'said', 'than', 'done']))
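The same pairs can be built in plain Python by zipping the list with itself shifted by one position, which makes clear what a bigram list contains:

```python
words = ["more", "is", "said", "than", "done"]

# Pair each word with the word that follows it.
pairs = list(zip(words, words[1:]))

print(pairs)
```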
What is a collocation?
A sequence of words that occur together unusually often and that resist substitution with words of similar meaning (for example "red wine", but not "maroon wine"). Collocations are essentially frequent bigrams, with extra weight given to pairs that involve rare words.
text.collocations()
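A deliberately simplified sketch of the "frequent bigram" part only: it counts adjacent pairs and keeps the most common one. It does not account for how rare the individual words are, so it is not a substitute for NLTK's collocations method. The toy sentence is made up.

```python
from collections import Counter

tokens = "red wine and red wine with white wine".split()

# Count every pair of adjacent tokens.
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(bigram_counts.most_common(1))
```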
3.4 Counting Other Things
How to get the frequency distribution of the different word lengths?
fdist = FreqDist(len(w) for w in text)
How to obtain the most frequent sample in a frequency distribution?
fdist.max()
How to get the relative frequency of a given sample in a frequency distribution?
fdist.freq(sample)
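A sketch of the word-length distribution on a toy token list, with collections.Counter standing in for FreqDist (an assumption). The most frequent length and its relative frequency are computed by hand to show what max() and freq() report.

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()

lengths = Counter(len(w) for w in tokens)       # word length -> count
total = sum(lengths.values())

most_frequent_length = lengths.most_common(1)[0][0]
relative_frequency = lengths[most_frequent_length] / total

print(dict(lengths))
print(most_frequent_length, relative_frequency)
```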
4. Back to Python: Making Decisions and Taking Control
How to create an if statement in Python?
if len(token) < 5:
    print(token, 'has fewer than 5 characters')
elif token.istitle():
    print(token, 'is a titlecase word')
else:
    print(token, 'is neither short nor titlecase')
How to create a loop in Python?
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)
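Loops and conditions are usually combined. A sketch that classifies each token of a toy sentence (the labels list is added only so the result can be inspected):

```python
sent = ["Call", "me", "Ishmael", "."]

labels = []
for token in sent:
    if token.islower():
        label = "lowercase"
    elif token.istitle():
        label = "titlecase"
    else:
        label = "punctuation"   # holds for this toy sentence only
    labels.append((token, label))
    print(token, "is", label)
```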
How to apply an operation to every element of a list?
[function(w) for w in text]
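A few list comprehensions on a toy sentence, including one with a condition to filter out punctuation before building a case-insensitive vocabulary:

```python
sent = ["Call", "me", "Ishmael", "."]

lengths = [len(w) for w in sent]        # apply len() to every token
upper = [w.upper() for w in sent]       # apply a string method to every token

# Keep alphabetic tokens only, lowercase them, then deduplicate and sort.
vocab = sorted(set(w.lower() for w in sent if w.isalpha()))

print(lengths, upper, vocab)
```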
5. Automatic Natural Language Understanding
What is Word Sense Disambiguation?
It’s an area of NLP where we want to discover the intended meaning of a word in a given context.
What is Pronoun Resolution?
It’s about working out "who did what to whom": detecting the subjects and objects of verbs and finding the antecedent of each pronoun.
What is Anaphora Resolution?
It’s a technique used in pronoun resolution where we identify what a pronoun or noun phrase refers to.
What is Semantic Role Labeling?
It’s about identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on). It is also used in pronoun resolution.
What is Text Alignment?
Given a document available in two languages, it’s the task of automatically pairing up the sentences that correspond to each other. Once we have a million or more sentence pairs, we can detect corresponding words and phrases and build a model that can be used, for example, to translate new text.
What is a Spoken Dialogue System?
It’s a pipeline of language understanding components that takes a spoken question as input and generates a spoken answer.
What is RTE (Recognizing Textual Entailment)?
It’s a challenge in language understanding where you try to automatically decide whether a hypothesis is supported by a given short text.