>>> from nltk import word_tokenize
>>> with open('h.txt') as f:
Source code for nltk.collocations()
>>> with open('Histology14_Ch01_ALL.txt') as f:
labeled secondary; <=======
situ hybridization; <=======
>>> with open('Histology14_Ch01_i.txt') as f:
Naturally, the quality of the collocations is also higher than computer-generated lists – as we would expect from a manually produced compilation.
Phrasal verbs: a good example of collocations whose words are often non-adjacent
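A minimal sketch of why non-adjacent collocations need a window rather than plain adjacent bigrams. This is a stdlib-only illustration, not NLTK's collocation finder; the function name `windowed_pairs`, the `window` parameter, and the sample sentence are all assumptions made up for the example.

```python
from collections import Counter


def windowed_pairs(tokens, window=2):
    """Count ordered pairs (w1, w2) where w2 follows w1 inside a span of
    `window` tokens. window=2 means strictly adjacent words."""
    counts = Counter()
    for i, first in enumerate(tokens):
        # Look at the next (window - 1) tokens after `first`.
        for second in tokens[i + 1 : i + window]:
            counts[(first, second)] += 1
    return counts


# Hypothetical sample text: the phrasal verb "turned ... off" is split
# by its object in both clauses.
tokens = "she turned the light off and he turned the radio off".split()

adjacent = windowed_pairs(tokens, window=2)
wider = windowed_pairs(tokens, window=4)

print(adjacent[("turned", "off")])  # the pair is never adjacent
print(wider[("turned", "off")])     # the pair is found inside the wider window
```

Raw counts like these only locate candidate pairs; ranking them as collocations would still need an association measure on top.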
Morphological similarity: Stemming
Stemming and Lemmatization with Python NLTK
Martin Porter’s official site:
A Perl module that implements a variety of semantic similarity and relatedness measures based on information found in the lexical database WordNet.
A lexical database for English
UMLS::Similarity v1.41 released! (July 17, 2014)
North American Chapter of the Association for Computational Linguistics
How Strong Is Your Vocabulary?
Introduction to Natural Language Processing
University of Michigan
Coursera, October 5 – December 27, 2015
The Python Standard Library > 6. Text Processing Services > re
Regular expressions HOWTO:
For … science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.
Metacharacters are not active inside classes: … '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
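A small check of the point about `'$'`, using only the standard `re` module. The sample string is an assumption chosen for illustration.

```python
import re

sample = "cost: $5"  # hypothetical sample text

# Outside a character class, '$' is an anchor: it matches the empty
# position at the end of the string and consumes no characters.
end_anchor = re.search(r"$", sample)

# Inside a character class, '$' is just the literal dollar sign.
literal_dollar = re.search(r"[$]", sample)

print(end_anchor.span())
print(literal_dollar.span(), literal_dollar.group())
```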
Perhaps the most important metacharacter is the backslash, \.
Some of the special sequences beginning with '\' represent predefined sets of characters
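To make the predefined sets concrete, here is a short sketch with three of the common sequences (`\d`, `\w`, `\s`). The sample string is made up for the example.

```python
import re

text = "Ch01 has 14 figures"  # hypothetical sample text

digits = re.findall(r"\d+", text)  # runs of decimal digits
words = re.findall(r"\w+", text)   # runs of letters, digits, or underscore
spaces = re.findall(r"\s", text)   # individual whitespace characters

print(digits)
print(words)
print(len(spaces))
```

Raw strings (`r"..."`) keep the backslash from being interpreted by Python's own string escaping before the pattern reaches `re`.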
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Pay careful attention to the difference between * and +;
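The difference between the two repetition operators can be seen in a quick comparison. The patterns and test strings below are illustrative assumptions.

```python
import re

star = re.compile(r"ca*t")  # 'a' may appear zero or more times
plus = re.compile(r"ca+t")  # 'a' must appear at least once

for candidate in ("ct", "cat", "caaat"):
    star_ok = star.fullmatch(candidate) is not None
    plus_ok = plus.fullmatch(candidate) is not None
    print(candidate, star_ok, plus_ok)
```

Only `"ct"` separates the two patterns: `ca*t` accepts it, `ca+t` does not.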
match() versus search()
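A short illustration of the distinction: `re.match()` only tries the pattern at the start of the string, while `re.search()` scans forward for the first position where it matches.

```python
import re

word = "insuperable"

at_start = re.match("super", word)   # pattern must match at index 0
anywhere = re.search("super", word)  # pattern may match at any index

print(at_start)         # None
print(anywhere.span())  # (2, 7)
```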
Python Regular Expressions
Christopher Potts emoticons
Python: Regular expressions
University of Cambridge
Python 3 Text Processing with NLTK 3 Cookbook
August 26, 2014
by Jacob Perkins
Python Text Processing with NLTK 2.0 Cookbook
November 11, 2010
by Jacob Perkins
NLP with Python [BOOK]
By Steven Bird, Ewan Klein, Edward Loper
The Natural Language Toolkit (NLTK) is a Python package for
natural language processing. NLTK requires Python 2.6, 2.7, or 3.2+.
Author: Steven Bird
CHILDES Corpus Readers
Count the number of words and sentences of each file.
The (deliberately naive) grammar sql.fcfg translates from English to SQL:
What cities are in China?