Python 3 Text Processing with NLTK 3 Cookbook

Python 3 Text Processing with NLTK 3 Cookbook
August 26, 2014
by Jacob Perkins

Python Text Processing with NLTK 2.0 Cookbook
November 11, 2010
by Jacob Perkins

Python Source Code Encoding

2.2. The Interpreter and Its Environment
2.2.1. Source Code Encoding
By default, Python source files are treated as encoded in UTF-8.
In that encoding, characters of most languages in the world can be used simultaneously in string literals, identifiers and comments — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow.

PEP 0263 — Defining Python Source Code Encodings

Python will default to ASCII as standard encoding if no other encoding hints are given. (This describes Python 2; Python 3 defaults to UTF-8.)

The default encoding was set to “ascii” in version 2.5.

ANSI format


# In Python 3 only bytes objects have .decode(), so the BOM is written as bytes here
b'\xef\xbb\xbfChapter'.decode('utf-8-sig')
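A minimal sketch of what the utf-8-sig codec does with a byte order mark, assuming the text arrives as bytes. The sample bytes are made up for illustration.

```python
import codecs

# Hypothetical sample: UTF-8 bytes that start with a byte order mark (BOM).
raw = codecs.BOM_UTF8 + "Chapter".encode("utf-8")

with_bom = raw.decode("utf-8")         # plain utf-8 keeps the BOM as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # utf-8-sig drops a leading BOM

print(repr(with_bom))
print(repr(without_bom))
```

Opening a file with encoding='utf-8-sig' should have the same effect on the first line read.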

Guide to writing

Plural Noun Forms
by Capital Community College Foundation
used by:
Action Is Character: Exploring Character Traits with Adjectives
Grades: 6 – 8

base form of the verb

4. Built-in Types


s[i] = x item i of s is replaced by x

del s[i:j] same as s[i:j] = []
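A small sketch of the two mutable-sequence operations noted above, using made-up list contents. It checks that deleting a slice and assigning an empty list to it give the same result.

```python
s = [10, 20, 30, 40, 50]
s[1] = 99        # item 1 of s is replaced by 99
del s[2:4]       # removes the items at indices 2 and 3

t = [10, 99, 30, 40, 50]
t[2:4] = []      # same effect as del t[2:4]

print(s)
print(t)
```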

4.4. break and continue Statements
The break statement, like in C, breaks out of the smallest enclosing for or while loop.

4.10. Mapping Types — dict

loop that searches for prime numbers
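A sketch of a prime-searching loop in the spirit of the tutorial example, showing that break leaves only the innermost loop. The upper bound of 20 and the name primes are arbitrary choices for illustration.

```python
primes = []
for n in range(2, 20):
    for d in range(2, n):
        if n % d == 0:
            break              # a divisor was found: leave the inner loop only
    else:
        # the else clause of a for loop runs when the loop ended without break
        primes.append(n)

print(primes)
```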


AttributeError: 'str' object has no attribute 'append'
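One way this error typically arises, sketched with made-up values: strings are immutable and have no append method, so pieces are usually collected in a list and joined afterwards.

```python
text = "spam"
try:
    text.append("!")           # str has no append method
except AttributeError as err:
    message = str(err)
    print(message)

# Collect the pieces in a list, then join them into one string.
parts = ["spam"]
parts.append("!")
joined = "".join(parts)
print(joined)
```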

Built-in Functions
sorted(iterable[, key][, reverse])
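A short sketch of the key and reverse arguments of sorted(), using a made-up word list. sorted() returns a new list and leaves its input unchanged.

```python
words = ["pear", "Apple", "fig", "banana"]

by_default = sorted(words)                            # uppercase sorts before lowercase
by_length = sorted(words, key=len)                    # shortest first
longest_first = sorted(words, key=len, reverse=True)  # longest first

print(by_default)
print(by_length)
print(longest_first)
```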

Escape sequence

str.find(sub[, start[, end]])
str.partition(sep) -> tuple
str.split(sep=None, maxsplit=-1) -> list
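The three string methods listed above, tried on a made-up sample line. This is only a sketch of their return values.

```python
line = "key = value = more"

first = line.find("=")            # index of the first "="
second = line.find("=", 5)        # search again, starting at index 5
missing = line.find("#")          # -1 when the substring is absent

head = line.partition(" = ")      # always a 3-tuple: before, separator, after
pieces = line.split(" = ")        # split at every separator
limited = line.split(" = ", 1)    # at most one split
by_space = "  a  b ".split()      # sep=None: runs of whitespace, no empty strings

print(first, second, missing)
print(head, pieces, limited, by_space)
```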

Text Processing Services

6.3 The dir() Function

>>> import string

# cf dir(str)
# dir(''.join)
# print(''.join.__doc__)

[… 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'capwords', 'digits', 'punctuation', 'whitespace']

# cf "Hola"[0].islower()
>>> "Hola"[0] in string.ascii_lowercase
>>> "Hola"[1] in string.ascii_lowercase

# cf print(chr(i), end='')
>>> for i in range(ord("a"), ord("z") + 1):
...     print(chr(i), end='')

>>> l = ''

>>> for i in range(ord('a'), ord('z') + 1):
...     l += chr(i)

>>> l == string.ascii_lowercase

>>> counter = 0
>>> for i in string.punctuation:
...     print("This is string.punctuation[", counter, "]: ", i)
...     counter += 1

>>> counter = 0
>>> for i in string.whitespace:
...     print("This is string.whitespace[", counter, "]: ", i)
...     counter += 1
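The manual counter in the two loops above could also be handled by the built-in enumerate(), which yields (index, item) pairs. A sketch follows; repr() is an added choice here so that the whitespace characters are visible when printed.

```python
import string

# enumerate() supplies the running index, so no separate counter is needed.
for index, char in enumerate(string.punctuation):
    print("This is string.punctuation[", index, "]: ", char)

pairs = list(enumerate(string.whitespace))
for index, char in pairs:
    print("This is string.whitespace[", index, "]: ", repr(char))
```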

16.2. io — Core tools for working with streams

file object

>>> with open('spamspam.txt', 'w', opener=opener) as f:
...     print('This will be written to somedir/spamspam.txt', file=f)

The Python Tutorial > 7. Input and Output > 7.2.1. Methods of File Objects
# f is a file object
>>> for line in f:
...     print(line, end='')
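A self-contained sketch of iterating over a file object line by line. The file name and contents are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spam.txt")   # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        print("first line", file=f)
        print("second line", file=f)

    with open(path, encoding="utf-8") as f:
        lines = []
        for line in f:                     # a file object yields one line at a time
            lines.append(line)             # each line keeps its trailing newline
            print(line, end='')            # end='' avoids printing a blank line after each
```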



How do I iterate through the alphabet in Python?

Find all occurrences of a substring in Python
[m.start() for m in re.finditer('test', 'test test test test')]

>>> help(str.find)
“we can build it ourselves!”
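A sketch of building it ourselves with str.find and its start argument, compared against the re.finditer one-liner above. The helper name find_all is hypothetical. Unlike re.finditer, this version also reports overlapping matches, because it resumes the search one character after each hit.

```python
import re


def find_all(text, sub):
    """Return the start index of every occurrence of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        start = text.find(sub, start + 1)   # resume just after the last hit
    return positions


sample = 'test test test test'
with_find = find_all(sample, 'test')
with_re = [m.start() for m in re.finditer('test', sample)]
print(with_find, with_re)
```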

Sequences also support slicing: a[i:j]

>>> "".find("a")

Beginner’s Guide to Python

Python for Beginners

print(b, end=' ')
The end=' ' keyword argument replaces the trailing newline with a space, so successive outputs appear on the same line …
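A sketch of the difference the end argument makes. The values of b are made up, and the output is collected in an in-memory io.StringIO buffer (an illustrative choice) so that the exact characters written can be inspected.

```python
import io

out = io.StringIO()                 # in-memory stand-in for the terminal

for b in (1, 2, 3):
    print(b, end=' ', file=out)     # a space instead of a newline after each value
print(file=out)                     # finish the line

for b in (1, 2, 3):
    print(b, file=out)              # default end='\n': one value per line

print(out.getvalue(), end='')
```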


to nest lists

operator: in, is
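A sketch contrasting the two operators, with made-up lists: in tests membership, while is tests whether two names refer to the same object, which is not the same as equality.

```python
a = [1, 2, 3]
b = a            # a second name for the same list object
c = list(a)      # a new list with equal contents

contains = 2 in a       # membership test
same_object = b is a    # identity: both names point to one object
copy_is_same = c is a   # a copy is a different object
copy_is_equal = c == a  # but it compares equal

print(contains, same_object, copy_is_same, copy_is_equal)
```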

& | used with sets
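A sketch of the two set operators on made-up data: & gives the intersection and | gives the union.

```python
vowels = set("aeiou")
letters = set("nltk python")   # includes the space character

common = vowels & letters      # elements present in both sets
either = vowels | letters      # elements present in at least one set

print(common)
print(sorted(either))
```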


NLP with Python [BOOK]
By Steven Bird, Ewan Klein, Edward Loper
O’Reilly Media
June 2009
The Natural Language Toolkit (NLTK) is a Python package for
natural language processing.  NLTK requires Python 2.6, 2.7, or 3.2+.
Author: Steven Bird

CHILDES Corpus Readers
Count the number of words and sentences of each file.

The (deliberately naive) grammar sql.fcfg translates from English to SQL:
What cities are in China?

Lessons From The Language Boot Camp

Lessons From The Language Boot Camp For Mormon Missionaries
June 07, 2014

On a sunny Wednesday in Provo, Utah, a long line of cars spits out about 300 new arrivals to the Missionary Training Center. The facility, known as MTC, is the largest language training school for members of the Church of Jesus Christ of Latter-day Saints.

Every year, about 36,000 students come to the center before they leave on missions around the world to spread the Mormon faith.

The approach has also gained traction in the U.S. military. In fact, the ties between the U.S. military and the MTC run pretty deep. The Army’s Intelligence Brigade, made up of linguists, is based in Utah and draws on former missionaries to fill its ranks.

The military trains soldiers in much the same way the church trains missionaries; they’re not conjugating verbs, they’re acting out real situations.

“I’m not going to give you multiple-choice questions. I’m not going to give you fill-in-the-blanks,” says Betty Lou Leaver, the provost at the Defense Language Institute in Monterey, Calif. “Instead, we’re going to actually do something. So a task is something you might actually do in your life.”