NLTK word_tokenize()

http://www.nltk.org/book/ch03.html#tokenization_index_term

>>> from nltk import word_tokenize

>>> word_tokenize('cells structure')
['cells', 'structure']

>>> with open('h.txt') as f:
...     word_tokenize(f.read())

Source code for the nltk.collocations module
http://www.nltk.org/api/nltk.html
http://www.nltk.org/_modules/nltk/collocations.html

>>> import nltk
>>> dir(nltk.collocations)

>>> print(nltk.collocations.__doc__)

>>> with open('Histology14_Ch01_ALL.txt') as f:
...     nltk.Text(word_tokenize(f.read())).collocations()

cell nuclei;
electron microscopy;
fluorescent compounds;
glass slides;
gold particles;
labeled secondary; <=======
light microscope;
light microscopy;
MEDICAL APPLICATION;
nucleic acids;
objective lens;
organic solvents;
primary antibody;
resolving power;
secondary antibody;
secretory granules;
situ hybridization; <=======
tissue components;
tissue section;
tissue sections
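
The two entries flagged with arrows look like fragments of longer phrases ("in situ hybridization", and probably "labeled secondary antibody"). As far as I can tell, Text.collocations() only considers word pairs, and it drops pairs containing a stopword or a word shorter than three characters, which would explain why "in situ" never appears. Below is a minimal plain-Python sketch of that filtering idea. It is not NLTK's implementation: the stopword set, the function name filtered_bigrams, and the sample sentence are made up for illustration, and it ranks by raw count rather than by an association score.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption; NLTK ships its own stopword corpus).
STOPWORDS = {"in", "the", "of", "a", "is", "by", "and", "with"}


def filtered_bigrams(tokens, min_len=3, stopwords=STOPWORDS):
    """Count adjacent word pairs, skipping any pair that contains
    a stopword or a word shorter than min_len characters."""

    def keep(word):
        return len(word) >= min_len and word.lower() not in stopwords

    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if keep(left) and keep(right):
            counts[(left, right)] += 1
    return counts


# Hypothetical sample text, split on whitespace to avoid needing tokenizer data.
text = ("in situ hybridization detects nucleic acids and "
        "in situ hybridization uses labeled probes")
counts = filtered_bigrams(text.split())
print(counts.most_common(3))
```

Because "in" is filtered out, the three-word phrase can only surface as the pair ("situ", "hybridization").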

>>> with open('Histology14_Ch01_i.txt') as f:
...     nltk.Text(word_tokenize(f.read())).collocations()

matrix components;
tissue biology

 

Naturally, the quality of the collocations is also higher than computer-generated lists – as we would expect from a manually produced compilation.
p. 174
http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Phrasal verbs are a good example of collocations whose words are often non-adjacent.
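
To catch non-adjacent pairs such as "turned ... off", the pair counting has to look past the immediately following word. To my knowledge, both Text.collocations() and BigramCollocationFinder.from_words() accept a window_size argument for this, with 2 meaning adjacent words only. The sketch below shows the idea in plain Python; the function name windowed_pairs and the sample sentence are hypothetical, and no scoring or filtering is applied.

```python
from collections import Counter


def windowed_pairs(tokens, window_size=2):
    """Count ordered pairs (w1, w2) where w2 occurs after w1 within a window
    of window_size tokens starting at w1 (window_size=2 means adjacent)."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window_size]:
            counts[(left, right)] += 1
    return counts


# Hypothetical sample sentence with a split phrasal verb.
tokens = "she turned the light off then turned the radio off".split()

adjacent = windowed_pairs(tokens, window_size=2)
wider = windowed_pairs(tokens, window_size=4)

print(adjacent[("turned", "off")])  # pair is never adjacent
print(wider[("turned", "off")])     # pair is found inside the wider window
```

A wider window also admits more accidental pairs, so in practice it is usually combined with a frequency or association-score filter.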

related:
Collocations dictionary
https://franzcalvo.wordpress.com/2015/09/07/collocations

 

 
