Notes on "Natural Language Toolkit" - Chapter 1: Language Processing and Python

1. Computing with Language: Texts and Words

1.1 Getting Started with Python

1.2 Getting Started with NLTK

How to download corpora?

import nltk
nltk.download()

How to load corpora?

from nltk.book import *

1.3 Searching Text

What is a concordance view?
A concordance view displays every occurrence of a given word in its context (= preceding and following words).

text.concordance('word')
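To make the idea concrete, here is a minimal plain-Python sketch of a concordance over a list of tokens. The helper name `concordance_lines` and its `width` parameter (a window counted in tokens) are assumptions made for illustration; NLTK's own `concordance` formats its output differently.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]      # preceding words
            right = tokens[i + 1:i + 1 + width]     # following words
            lines.append(' '.join(left + [token] + right))
    return lines

tokens = 'the whale surfaced and the crew saw the whale dive again'.split()
lines = concordance_lines(tokens, 'whale', width=2)
for line in lines:
    print(line)
```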

What other words appear in a similar range of contexts?

text.similar('word')

How to examine shared contexts between words?

text.common_contexts(['word', 'writer'])
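A rough sketch of what "shared contexts" means, assuming a context is simply the pair (previous word, next word). The helper `contexts` is hypothetical and ignores the normalisation and ranking that NLTK may apply.

```python
def contexts(tokens, word):
    """Return the set of (previous, next) word pairs surrounding `word`."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

tokens = 'a very heavy rain and a very light rain'.split()
# Contexts in which both words occur
shared = contexts(tokens, 'heavy') & contexts(tokens, 'light')
print(shared)
```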

What’s a dispersion plot? How to obtain it?
A dispersion plot displays the positions at which a word occurs in the text, counted in words from the beginning.
Each stripe represents one instance of a word, and each row represents the entire text.

text.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])
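The data behind such a plot is just a list of positions per word. The following sketch computes those positions in plain Python; the helper name `word_offsets` is an assumption, and the actual plotting is left out.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    return {t: [i for i, w in enumerate(tokens) if w == t] for t in targets}

tokens = 'we the citizens value freedom and citizens defend freedom'.split()
offsets = word_offsets(tokens, ['citizens', 'freedom', 'democracy'])
print(offsets)
```

A word that never occurs gets an empty list, which corresponds to an empty row in the plot.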

1.4 Counting Vocabulary

What’s a token?
A sequence of characters treated as a unit: a word or a punctuation symbol.

How to obtain the number of tokens?

len(text)

What’s the vocabulary of a text?
The vocabulary of a text is the set of distinct tokens it contains.

set(text)

How to sort an array in Python?

sorted(array)

What’s a word type?
A word considered as a unique item of vocabulary, regardless of how many times it occurs in the text.

How do you quantify the lexical richness of a text?
Divide the number of distinct words by the total number of words. The result is the proportion of distinct words (multiply by 100 to get a percentage).

len(set(text)) / len(text)
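A small worked example of the calculation on a toy token list (the list is made up for illustration; an NLTK text behaves the same way for `len` and `set`):

```python
tokens = ['the', 'cat', 'saw', 'the', 'dog', '.']

num_tokens = len(tokens)        # every occurrence counts
num_types = len(set(tokens))    # duplicates collapse, so 'the' counts once
richness = num_types / num_tokens

print(num_tokens, num_types, round(richness, 3))
```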

How to count occurrences of a specific word within a text?

text.count('word')

How to create a function in Python?

def function_x(param):
    return True
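As a slightly more useful illustration, the sketch below wraps a calculation in a function and reuses it with `count`. The name `percentage` and the toy token list are chosen here for illustration.

```python
def percentage(count, total):
    """Return `count` as a percentage of `total`."""
    return 100 * count / total

tokens = ['the', 'cat', 'saw', 'the', 'dog', '.']
# How much of the text is taken up by the word 'the'?
share_of_the = percentage(tokens.count('the'), len(tokens))
print(round(share_of_the, 1))
```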

2. A Closer Look at Python: Texts as Lists of Words

2.1 Lists

How to create a list in Python?

['item1', 'item2']  # and so on

How to concatenate lists in Python?

[ ] + [ ]

How to append an element to a list in Python?

list.append(element)
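The three list operations above can be tried together on toy data. Note the difference, shown in the comments, between `+` (builds a new list) and `append` (changes the list in place).

```python
sent1 = ['Call', 'me', 'Ishmael', '.']
sent2 = ['The', 'family', 'of', 'Dashwood']

combined = sent1 + sent2    # concatenation creates a new list
sent1.append('Some')        # append modifies sent1 itself

print(combined)
print(sent1)
```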

2.2 Indexing Lists

What’s an index?
It’s the position of an item inside an array/list.

How to access an index from a value?

text.index(value)

How to access a value given an index?

text[index]

What is slicing?
It’s retrieving a subpart of an array/list.

text[index1:index2] ; text[index1:] ; text[:index2]
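A short example of indexing and slicing on a toy list. Indexes start at 0, and a slice includes its start index but stops just before its end index.

```python
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

position = tokens.index('Ishmael')   # index of the first occurrence
value = tokens[position]             # back from index to value

middle = tokens[1:3]    # items at indexes 1 and 2
tail = tokens[4:]       # from index 4 to the end
head = tokens[:2]       # from the start up to, not including, index 2

print(position, value, middle, tail, head)
```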

2.3 Variables

2.4 Strings

What’s a string?
Strings are sequences of characters, so they support many of the same operations as lists, such as indexing and slicing. Unlike lists, strings are immutable.

How to convert a list to a string?

' '.join(list)

How to convert a string to a list?

string.split()
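A quick round trip between a list and a string, with indexing and slicing applied to the string as well (toy data):

```python
words = ['Monty', 'Python']

joined = ' '.join(words)       # list -> string, separated by a space
split_back = joined.split()    # string -> list, split on whitespace

first_char = joined[0]         # indexing a string gives a character
first_five = joined[:5]        # slicing a string gives a substring

print(joined, split_back, first_char, first_five)
```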

3. Computing with Language: Simple Statistics

3.1 Frequency Distributions

What is a frequency distribution?
It’s a table that records, for each vocabulary item, how many times it occurs in a given text.

fdist = FreqDist(text)

How to get the most frequent tokens?

fdist.most_common(<number of tokens to get>)
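For experimenting without NLTK installed, `collections.Counter` from the standard library can serve as a stand-in, since `FreqDist` builds on it. The sketch below uses `Counter` directly on a toy sentence; it is not the NLTK call itself.

```python
from collections import Counter

tokens = 'the cat and the dog and the bird'.split()
counts = Counter(tokens)          # item -> number of occurrences

top_two = counts.most_common(2)   # the two most frequent tokens with their counts
print(top_two)
print(counts['the'])
```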

How to obtain a cumulative frequency plot?
A cumulative frequency plot shows what proportion of a text is taken up by the most common tokens:

fdist.plot(50, cumulative=True)
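The numbers behind a cumulative plot can be computed by hand. This sketch, using `Counter` as a stand-in for the frequency distribution, ranks tokens by frequency and accumulates their share of the text; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

tokens = 'the cat and the dog and the bird'.split()
ranked = Counter(tokens).most_common()    # most frequent first

# Running total of counts, then expressed as a share of all tokens
running_totals = list(accumulate(count for _, count in ranked))
cumulative_share = [total / len(tokens) for total in running_totals]

print(cumulative_share)
```

Here the single most common token already covers 37.5% of this toy text, and the two most common cover 62.5%.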

What’s a hapax?
A hapax is a word that occurs only once in a text. Hapaxes are often treated as outliers in data analysis, and so are generally not useful for characterizing a text.

fdist.hapaxes()
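The same idea in plain Python, assuming `Counter` as a stand-in for the frequency distribution: keep the items whose count is exactly 1.

```python
from collections import Counter

tokens = 'the cat and the dog and the bird'.split()
counts = Counter(tokens)

# Words that occur exactly once
hapaxes = sorted(word for word, count in counts.items() if count == 1)
print(hapaxes)
```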

3.2 Fine-grained Selection of Words

How to operate a fine-grained word selection by word length and frequency?
Obtain the words that are more than 7 characters long and that occur more than 7 times in the text:

[w for w in set(text) if len(w) > 7 and fdist[w] > 7]

The result helps identify the words that characterize the content of a text.
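A self-contained sketch of this selection on made-up data. The helper `frequent_long_words` and its two threshold parameters are assumptions for illustration, and `Counter` stands in for the frequency distribution.

```python
from collections import Counter

def frequent_long_words(tokens, length_threshold=7, count_threshold=7):
    """Return sorted word types longer than and more frequent than the thresholds."""
    counts = Counter(tokens)
    return sorted(
        w for w in set(tokens)
        if len(w) > length_threshold and counts[w] > count_threshold
    )

# Toy data: only 'question' and 'remember' are both long and frequent enough
tokens = ['question'] * 8 + ['cat'] * 20 + ['extraordinary'] * 2 + ['remember'] * 9
selected = frequent_long_words(tokens)
print(selected)
```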

3.3 Collocations and Bigrams

What’s a bigram?
A pair of adjacent words.

list(bigrams(['more', 'is', 'said', 'than', 'done']))
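Bigrams can also be produced in plain Python by pairing each word with its successor. The helper name `adjacent_pairs` is hypothetical; this is a sketch of the idea, not NLTK's implementation.

```python
def adjacent_pairs(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

pairs = adjacent_pairs(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

A list of n tokens yields n - 1 bigrams.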

What is a collocation?
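A sequence of words that occur together unusually often and that resist substitution with words of similar meaning (for example, "red wine" is a collocation, while "maroon wine" sounds odd). Collocations are essentially frequent bigrams, with more attention paid to cases that involve rare words.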

text.collocations()
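As a first approximation, frequent bigrams can be found by counting adjacent pairs. This sketch uses raw counts only; it does not reproduce the scoring NLTK applies when it selects collocations, and the sentence is made up.

```python
from collections import Counter

tokens = 'red wine and red wine or white wine with red wine'.split()

# Count every pair of adjacent words
bigram_counts = Counter(zip(tokens, tokens[1:]))
most_frequent = bigram_counts.most_common(1)
print(most_frequent)
```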

3.4 Counting Other Things

How to get the frequency distribution of the different word lengths?

fdist = FreqDist(len(w) for w in text)

How to obtain the most frequent item in a frequency distribution?

fdist.max()

How to get the relative frequency of a given item in a frequency distribution?

fdist.freq(sample)
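The three steps of this section (distribution of word lengths, most frequent length, relative frequency) can be reproduced on toy data. `Counter` stands in for the frequency distribution here, so the most frequent item and its relative frequency are computed by hand.

```python
from collections import Counter

tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

# Count how many tokens have each length
length_counts = Counter(len(w) for w in tokens)

# Most frequent length and its count
most_common_length, count = length_counts.most_common(1)[0]

# Relative frequency: count divided by the total number of tokens
relative_freq = count / sum(length_counts.values())

print(most_common_length, count, round(relative_freq, 3))
```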

4. Back to Python: Making Decisions and Taking Control

How to create an if statement in Python?

if token.islower():
    print(token, 'is a lowercase word')
elif token.istitle():
    print(token, 'is a titlecase word')
else:
    print(token, 'is punctuation')

How to create a loop in Python?

for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)

How to operate on every element of a list?

[function(w) for w in text]
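To show how the bracket form relates to an ordinary loop, here is the same operation written both ways on toy data, plus a variant with a condition that keeps only alphabetic tokens.

```python
tokens = ['Call', 'me', 'Ishmael', '.']

# Explicit loop
lengths_loop = []
for w in tokens:
    lengths_loop.append(len(w))

# Equivalent list comprehension
lengths_comp = [len(w) for w in tokens]

# Comprehension with a filter: lowercase the words, skip punctuation
words_only = [w.lower() for w in tokens if w.isalpha()]

print(lengths_loop, lengths_comp, words_only)
```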

5. Automatic Natural Language Understanding

What is Word Sense Disambiguation?
It’s an area of NLP where we want to discover the intended meaning of a word in a given context.

What is Pronoun Resolution?
It’s about working out "who did what to whom": detecting the subjects and objects of verbs and finding the antecedent of a pronoun.

What is Anaphora Resolution?
It’s a technique used in pronoun resolution: identifying what a pronoun or noun phrase refers to.

What is Semantic Role Labeling?
It’s about identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on). It is also used in pronoun resolution.

What is Text Alignment?
It’s the process of automatically pairing up the sentences of a document with the sentences of its translation. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and build a model that can be used, for example, to translate new text.

What is a Spoken Dialogue System?
It’s a pipeline of language understanding components that takes a spoken question as input and generates a spoken answer.

What is RTE (Recognizing Textual Entailment)?
It’s a challenge in language understanding where you try to decide automatically whether a given text provides enough evidence to accept a hypothesis.