Python 3 Text Processing with NLTK 3 Cookbook

Python 3 Text Processing with NLTK 3 Cookbook
August 26, 2014
by Jacob Perkins
http://www.amazon.com/Python-Text-Processing-NLTK-Cookbook/dp/1782167854

Python Text Processing with NLTK 2.0 Cookbook
November 11, 2010
by Jacob Perkins
http://www.amazon.com/Python-Text-Processing-NLTK-Cookbook/dp/1849513600

Python Source Code Encoding

2.2. The Interpreter and Its Environment
2.2.1. Source Code Encoding
https://docs.python.org/3/tutorial/interpreter.html
By default, Python source files are treated as encoded in UTF-8.
In that encoding, characters of most languages in the world can be used simultaneously in string literals, identifiers and comments — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow.

PEP 0263 — Defining Python Source Code Encodings
http://legacy.python.org/dev/peps/pep-0263

In Python 2, Python will default to ASCII as standard encoding if no other encoding hints are given. (Python 3 defaults to UTF-8, as quoted above.)


The default encoding was set to “ascii” in version 2.5.

ANSI format
http://stackoverflow.com/questions/701882/what-is-ansi-format

encoding123

b'\xef\xbb\xbfChapter'.decode('utf-8-sig')  # 'Chapter'; plain 'utf-8' would leave the BOM as '\ufeff'
http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
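
A minimal sketch of the BOM issue discussed in the link above, assuming the text arrives as UTF-8 bytes. The sample bytes are made up for illustration; they are not taken from any particular file.

```python
import codecs

# Hypothetical sample: UTF-8 bytes that start with a byte order mark (BOM).
raw = codecs.BOM_UTF8 + "Chapter".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(with_bom))     # '\ufeffChapter'
print(repr(without_bom))  # 'Chapter'
```

The same codec name can be passed to open(), as in open(path, encoding='utf-8-sig'), when reading such a file.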

Guide to writing

Plural Noun Forms
by Capital Community College Foundation
http://grammar.ccc.commnet.edu/grammar/plurals.htm

http://www.scrabblefinder.com/ends-with/s

http://grammar.ccc.commnet.edu/grammar/adjectives.htm
used by:
Action Is Character: Exploring Character Traits with Adjectives
Grades: 6 – 8
http://www.readwritethink.org/classroom-resources/lesson-plans/action-character-exploring-character-175.html

base form of the verb
http://grammar.ccc.commnet.edu/grammar/tenses/simple_future.htm

4. Built-in Types
https://docs.python.org/3.4/library/stdtypes.html

s[i] = x      item i of s is replaced by x

del s[i:j]    same as s[i:j] = []
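
A short sketch of the two mutable-sequence operations quoted above, using a made-up list. It only illustrates that del s[i:j] and s[i:j] = [] leave the list in the same state.

```python
s = [10, 20, 30, 40, 50]

s[1] = 99        # item 1 of s is replaced by 99
print(s)         # [10, 99, 30, 40, 50]

t = list(s)      # independent copy, to compare the two spellings
del s[1:3]       # removes items 1 and 2
t[1:3] = []      # same effect through slice assignment
print(s)         # [10, 40, 50]
print(s == t)    # True
```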

4.4. break and continue Statements
The break statement, like in C, breaks out of the smallest enclosing for or while loop.
https://docs.python.org/3.4/tutorial/controlflow.html

4.10. Mapping Types — dict

loop that searches for prime numbers
https://docs.python.org/2/tutorial/controlflow.html
http://www.programiz.com/python-programming/examples/prime-number-intervals
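
A sketch in the spirit of the tutorial's prime search, showing break together with the else clause of a for loop. The function name primes_below is my own and does not come from either link.

```python
def primes_below(limit):
    """Return the primes smaller than limit using trial division."""
    primes = []
    for n in range(2, limit):
        for x in range(2, n):
            if n % x == 0:
                break          # n has a divisor, leave the inner loop
        else:
            # Runs only when the inner loop finished without break.
            primes.append(n)
    return primes

print(primes_below(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```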

array.append(x)
https://docs.python.org/3.4/library/array.html

lista.append("a")
AttributeError: 'str' object has no attribute 'append'
(lista was bound to a str here, not a list)
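
A minimal reproduction of that error, assuming the name lista had been bound to a string by mistake. The sample values are invented.

```python
lista = "abc"                # a str, not a list
try:
    lista.append("a")        # str has no append method
except AttributeError as err:
    message = str(err)
    print(message)

lista = list("abc")          # now a real list
lista.append("a")
print(lista)                 # ['a', 'b', 'c', 'a']
```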

Built-in Functions
https://docs.python.org/3.4/library/functions.html
len()
sorted(iterable[, key][, reverse])
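
A quick sketch of len() and sorted() with the key and reverse arguments. The word list is made up for illustration.

```python
words = ["pear", "Fig", "banana", "kiwi"]

print(len(words))                            # 4
print(sorted(words))                         # ['Fig', 'banana', 'kiwi', 'pear']
print(sorted(words, key=str.lower))          # ['banana', 'Fig', 'kiwi', 'pear']
print(sorted(words, key=len, reverse=True))  # ['banana', 'pear', 'kiwi', 'Fig']
print(words)                                 # the original list is unchanged
```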

Keywords
Escape sequence
https://docs.python.org/3.4/reference/lexical_analysis.html

str.find(sub[, start[, end]])
str.join(iterable)
str.lower()
str.partition(sep) -> tuple
str.split(sep=None, maxsplit=-1) -> list
https://docs.python.org/3.4/library/stdtypes.html
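
A small sketch that exercises the string methods listed above on one sample string. The sample text is my own choice.

```python
text = "Natural Language Toolkit"

print(text.find("Lang"))        # 8
print(text.find("xyz"))         # -1 (not found)
print(text.lower())             # natural language toolkit
print(text.partition(" "))      # ('Natural', ' ', 'Language Toolkit')
print(text.split())             # ['Natural', 'Language', 'Toolkit']
print(text.split(" ", 1))       # ['Natural', 'Language Toolkit']
print("-".join(text.split()))   # Natural-Language-Toolkit
```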

Text Processing Services
https://docs.python.org/3.4/library/text.html

6.3 The dir() Function
>>> import sys
>>> dir(sys)

>>> import string

# cf dir(str)
# dir(''.join)
# print(''.join.__doc__)

>>> dir(string)
[… 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'capwords', 'digits', 'punctuation', 'whitespace']
related:
https://franzcalvo.wordpress.com/2015/12/06/dir-globals-locals-vars

# cf "Hola"[0].islower()
>>> "Hola"[0] in string.ascii_lowercase
False
>>> "Hola"[1] in string.ascii_lowercase
True

# cf print(chr(i), end='')
>>> for i in range(ord("a"), ord("z") + 1):
...     print(chr(i))

>>> l = ''

>>> for i in range(ord('a'), ord('z') + 1):
...     l += chr(i)

>>> l == string.ascii_lowercase
True

>>> counter = 0
>>> for i in string.punctuation:
...     print("This is string.punctuation[", counter, "]: ", i)
...     print("")
...     counter += 1

>>> counter = 0
>>> for i in string.whitespace:
...     print("This is string.whitespace[", counter, "]: ", i)
...     print("")
...     counter += 1
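
A possible more compact variant of the loops above, offered as a sketch and not as a replacement: enumerate() supplies the counter, and the !r conversion makes whitespace characters visible. The name alphabet is my own.

```python
import string

# enumerate() yields (index, character) pairs, so no manual counter is needed.
for index, char in enumerate(string.whitespace):
    print("string.whitespace[{}]: {!r}".format(index, char))

# Build the lowercase alphabet from code points and compare with the constant.
alphabet = "".join(chr(i) for i in range(ord("a"), ord("z") + 1))
print(alphabet == string.ascii_lowercase)  # True
```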

16.2. io — Core tools for working with streams
https://docs.python.org/3.4/library/io.html
readline()

file object
https://docs.python.org/3/glossary.html#term-file-object

>>> # opener must already be defined: a callable (path, flags) that returns a file descriptor
>>> with open('spamspam.txt', 'w', opener=opener) as f:
...     print('This will be written to somedir/spamspam.txt', file=f)

The Python Tutorial > 7. Input and Output > 7.2.1. Methods of File Objects
https://docs.python.org/3/tutorial/inputoutput.html
# f is a file object
>>> for line in f:
...     print(line, end='')
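
A self-contained sketch of readline() and line-by-line iteration over a file object. The file name sample.txt and its contents are invented, and the file is written to a temporary directory so the example does not depend on existing files.

```python
import os
import tempfile

# Hypothetical sample file created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, encoding="utf-8") as f:
    first = f.readline()          # reads one line, newline included
    rest = [line for line in f]   # iteration continues where readline stopped

print(repr(first))  # 'first line\n'
print(rest)         # ['second line\n']
```

Because each line keeps its trailing newline, print(line, end='') avoids printing a blank line between lines.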

====================

split()
http://www.dotnetperls.com/split-python

How do I iterate through the alphabet in Python?
http://stackoverflow.com/questions/228730/how-do-i-iterate-through-the-alphabet-in-python

Find all occurrences of a substring in Python
http://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python
[m.start() for m in re.finditer('test', 'test test test test')]
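
A sketch comparing the re.finditer one-liner with a hand-built loop over str.find. The helper name find_all is my own, and it reports non-overlapping matches only.

```python
import re

def find_all(text, sub):
    """Return the start indexes of non-overlapping occurrences of sub in text."""
    if not sub:
        raise ValueError("sub must not be empty")
    positions = []
    start = text.find(sub)
    while start != -1:                 # str.find returns -1 when nothing is left
        positions.append(start)
        start = text.find(sub, start + len(sub))
    return positions

sample = "test test test test"
print(find_all(sample, "test"))                          # [0, 5, 10, 15]
print([m.start() for m in re.finditer("test", sample)])  # [0, 5, 10, 15]
```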

>>> help(str.find)
“we can build it ourselves!”

Sequences also support slicing: a[i:j]
https://docs.python.org/3/reference/datamodel.html
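
A few slicing examples on a sample string, as a quick illustration of a[i:j]. The string is arbitrary.

```python
a = "Python"

print(a[0:2])   # Py    (items 0 and 1)
print(a[2:])    # thon  (from item 2 to the end)
print(a[-2:])   # on    (the last two items)
print(a[::2])   # Pto   (every second item)
```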

>>> "".find("a")
-1

Beginner’s Guide to Python
https://wiki.python.org/moin/BeginnersGuide

Python for Beginners
https://www.python.org/about/gettingstarted

print(b, end=' ')
The end=' ' command will display the output on the same line …
http://www.wikihow.com/Start-Programming-in-Python

============================

to nest lists

operator: in, is
https://docs.python.org/3.5/library/operator.html
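
A sketch tying together nested lists and the in and is operators, along with their function forms operator.contains and operator.is_. The variable names and values are made up.

```python
import operator

matrix = [[1, 2], [3, 4]]      # a list nested inside a list
print(matrix[1][0])            # 3

same = matrix                          # another name for the same object
copy = [row[:] for row in matrix]      # equal contents, different object

print([1, 2] in matrix)                # True  (membership compares by equality)
print(same is matrix)                  # True  (identity: same object)
print(copy == matrix, copy is matrix)  # True False

# Function equivalents from the operator module.
print(operator.contains(matrix, [3, 4]))  # True
print(operator.is_(copy, matrix))         # False
```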

& | used with sets
http://stackoverflow.com/questions/6488928/where-are-the-ampersand-and-vertical-bar-characters-used-in-python
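
A minimal sketch of & and | applied to sets (intersection and union). The sample sets are arbitrary, and sorted() is used only to make the printed order predictable.

```python
a = {1, 2, 3}
b = {3, 4}

common = a & b     # intersection: items in both sets
combined = a | b   # union: items in either set

print(sorted(common))    # [3]
print(sorted(combined))  # [1, 2, 3, 4]
```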

NLTK

NLP with Python [BOOK]
http://www.nltk.org/book
By Steven Bird, Ewan Klein, Edward Loper
O’Reilly Media
June 2009

http://www.nltk.org
The Natural Language Toolkit (NLTK) is a Python package for
natural language processing.  NLTK requires Python 2.6, 2.7, or 3.2+.
Author: Steven Bird

CHILDES Corpus Readers
http://www.nltk.org/howto/childes.html
Count the number of words and sentences of each file.

Chat-80
http://www.nltk.org/howto/chat80.html
The (deliberately naive) grammar sql.fcfg translates from English to SQL:
What cities are in China?

Lessons From The Language Boot Camp

Lessons From The Language Boot Camp For Mormon Missionaries
June 07, 2014
http://www.npr.org/2014/06/07/319805068/lessons-from-the-language-boot-camp-for-mormon-missionaries

On a sunny Wednesday in Provo, Utah, a long line of cars spits out about 300 new arrivals to the Missionary Training Center. The facility, known as MTC, is the largest language training school for members of the Church of Jesus Christ of Latter-day Saints.

Every year, about 36,000 students come to the center before they leave on missions around the world to spread the Mormon faith.

The approach has also gained traction in the U.S. military. In fact, the ties between the U.S. military and the MTC run pretty deep. The Army’s Intelligence Brigade, made up of linguists, is based in Utah and draws on former missionaries to fill its ranks.

The military trains soldiers in much the same way the church trains missionaries; they’re not conjugating verbs, they’re acting out real situations.

“I’m not going to give you multiple-choice questions. I’m not going to give you fill-in-the-blanks,” says Betty Lou Leaver, the provost at the Defense Language Institute in Monterey, Calif. “Instead, we’re going to actually do something. So a task is something you might actually do in your life.”