NLTK word_tokenize()


>>> from nltk import word_tokenize

>>> word_tokenize('cells structure')
['cells', 'structure']

>>> with open('h.txt') as f:
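A minimal sketch of reading a text file and tokenizing its contents. The helper name `tokenize_file` and the sample sentence are hypothetical, introduced only for illustration. The default tokenizer here is `wordpunct_tokenize`, which is an assumption chosen because it needs no downloaded models; it splits punctuation somewhat differently from `word_tokenize`.

```python
import os
import tempfile

from nltk.tokenize import wordpunct_tokenize


def tokenize_file(path, tokenizer=wordpunct_tokenize, encoding="utf-8"):
    """Read a whole text file and return its tokens as a list.

    `tokenizer` is any callable that maps a string to a list of tokens.
    """
    with open(path, encoding=encoding) as f:
        return tokenizer(f.read())


# Demo on a throwaway file (hypothetical content, not the chapter text).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Tissue sections are placed on glass slides.")

tokens = tokenize_file(tmp.name)
os.remove(tmp.name)
print(tokens)
```

If the Punkt models are installed, `word_tokenize` can be passed instead: `tokenize_file(path, tokenizer=word_tokenize)`.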

Source code for nltk.collocations

>>> import nltk
>>> dir(nltk.collocations)

>>> print(nltk.collocations.__doc__)

>>> with open('Histology14_Ch01_ALL.txt') as f:

cell nuclei;
electron microscopy;
fluorescent compounds;
glass slides;
gold particles;
labeled secondary; <=======
light microscope;
light microscopy;
nucleic acids;
objective lens;
organic solvents;
primary antibody;
resolving power;
secondary antibody;
secretory granules;
situ hybridization; <=======
tissue components;
tissue section;
tissue sections

>>> with open('Histology14_Ch01_i.txt') as f:

matrix components;
tissue biology
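Lists like the ones above can be produced in more than one way. The following is a sketch, not necessarily the method used for this post: it ranks adjacent word pairs with `BigramCollocationFinder` and the likelihood-ratio measure. The sample text, the helper name `top_collocations`, and the frequency threshold are all assumptions made for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text standing in for the chapter file.
sample = (
    "The light microscope uses an objective lens. "
    "A light microscope has limited resolving power. "
    "Resolving power depends on the objective lens. "
    "Tissue sections are placed on glass slides. "
    "Glass slides hold the tissue sections."
)


def top_collocations(text, n=5, min_freq=2):
    """Return up to n adjacent word pairs ranked by likelihood ratio."""
    # Lowercase and keep alphabetic tokens only, so punctuation is dropped.
    tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]
    finder = BigramCollocationFinder.from_words(tokens)
    # Discard pairs seen fewer than min_freq times.
    finder.apply_freq_filter(min_freq)
    return finder.nbest(BigramAssocMeasures.likelihood_ratio, n)


pairs = top_collocations(sample)
print("; ".join(" ".join(p) for p in pairs))
```

Because punctuation is dropped before pairing, pairs can span sentence boundaries; the frequency filter removes most of those accidental pairs in this small sample.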


Naturally, the quality of the collocations is also higher than computer-generated lists – as we would expect from a manually produced compilation.
p. 174
Phrasal verbs: a good example of collocations whose words are often non-adjacent.
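Non-adjacent pairs such as phrasal verbs can be counted by widening the window that `BigramCollocationFinder.from_words` looks at. A small sketch, with a made-up token sequence and an assumed window size of 4:

```python
from nltk.collocations import BigramCollocationFinder

# Hypothetical sentence in which "turn" and "off" are always separated.
tokens = "turn the light off then turn the radio off and turn it off".split()

# Default window_size=2 pairs each word only with its immediate neighbour.
adjacent = BigramCollocationFinder.from_words(tokens)

# window_size=4 pairs each word with any of the next three words.
windowed = BigramCollocationFinder.from_words(tokens, window_size=4)

adjacent_count = adjacent.ngram_fd[("turn", "off")]
windowed_count = windowed.ngram_fd[("turn", "off")]
print(adjacent_count, windowed_count)
```

With adjacent pairs only, "turn off" is never counted in this sequence; with the wider window it is counted once for each occurrence of "turn".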

Collocations dictionary


