scikit-learn 0.15.2


>>> import sklearn
>>> sklearn.__version__
'0.15.2'

scikit-learn is tested to work under Python 2.6, Python 2.7, and Python 3.4.
The required dependencies to build the software are:

  • NumPy and SciPy
  • a working C/C++ compiler

Create a separate environment


NumPy (4.6 MB) download
The notes below are about building NumPy, which for most users is *not* the recommended way to install NumPy. Instead, use either a complete scientific Python distribution or a binary installer.


Dragomir Radev, September 2015
Here is how to install a specific older version of a Python library:
pip uninstall scikit-learn
pip uninstall sklearn
pip install scikit-learn==0.15.2

Hint: the following packages conflict with each other:
  – scikit-learn ==0.15.2
  – python 3.5*



.tar.gz files

Assignment on word similarity

CS224d: Deep Learning for Natural Language Processing
March-June 2015

International Workshop on Semantic Evaluation 2015

SDP 2015: Broad-Coverage Semantic Dependency Parsing

NL generation & information extraction
NL generation
NACLO for Week 7
week 6
week 5
week 3
week 2

Assignment 2, part 3A

some good papers in NLP
NLP libraries in Java

this course is more introductory than …

The assignments for this class have been developed and tested on Python 2.7 and NLTK 2.

volunteers: covering installation of Python and NLTK on different platforms





LTAG! is an absurd, irreverent card game based on Lexicalized Tree Adjoining Grammar

Compete and co-operate to generate offensive yet grammatical English sentences made of partial syntactic trees.
The first player to use up all of their cards wins!

The Birth of a Word

The Birth of a Word
(a doctoral thesis)
by Brandon Cain Roy
MIT. February 2013

using a small number of words (52 actively used words by 16 months of age) in concert with situational context to communicate effectively in a wide range of day-to-day situations

From a more technical standpoint, the fundamental meaningful unit in a language isn’t the word but the morpheme, one of a word’s more basic constituents.

The Word Birth Browser was developed in Java, retrieving data from a local SQLite3 database containing a version of the corpus.

Dromi notes that in the weeks of decreasing vocabulary growth rate, her daughter seemed to be exploring the words she had already learned, refining their use, and generally consolidating the lexicon. We find this a compelling idea, and since our first analysis in (Roy et al., 2009) we have wondered whether the drop in word birth rate could coincide with an increase in the child’s use of syntax.

It is hard to imagine that with 669 words the child’s communicative needs are satisfied. Then again, the child has responsive caregivers and the range of activities in a 9–24 month old’s life are limited. The introduction of a new toy, activity or other experience (such as going to the zoo) could contribute new words in the child’s lexicon, but at a certain point the child’s vocabulary may be sufficient for the activities of everyday life.
p. 95

If the drop in vocabulary growth rate is not a statistical artifact, as suggested in the previous section, what else could contribute to the “vocabulary implosion” observed? Before 19 months of age, the child has 444 words in his productive vocabulary. If word learning is partly fueled by “communicative need”, does the decrease in vocabulary growth rate indicate that the  child has achieved some level of communicative sufficiency at 18 months? Or does communicative growth transition from learning new words to combining words together in new ways?

By 24 months of age, the child had learned 669 words. He learned these words through exposure to them in his environment. But why did he learn these words, and in the order that he learned them? In the next chapter, we consider the relationship between lexical acquisition and the rich linguistic environment of a young child’s first years.

Children’s early language learning is sometimes described as “effortless”, and to adults witnessing the seemingly autonomous birth and growth of language it may indeed appear so. But a better adjective might be “remarkable” when one accounts for the numerous challenges that young learners face in acquiring their first language.

Children’s exposure to language is primarily through speech, and unlike text there are no “spaces” marking word boundaries. As Peters (1983) discusses, although the units of speech are words, children do not necessarily partition the speech stream into their final adult word forms. Even assuming the words and the concepts are available to the child, the mapping between them must be learned.

Elizabeth Spelke and her colleagues argue that children come into the world equipped with systems of core knowledge about objects, agents, number, geometry as well as social knowledge (Spelke, 1994; Spelke and Kinzler, 2007).
Such systems of core knowledge may provide a necessary substrate for early learning, including language acquisition. Children are also sensitive to statistical regularities in the speech they hear, which can help in segmenting words (Saffran et al., 1996). Another skill children bring to bear, of particular relevance to word learning, is the ability to infer the referential intent of others. In the case of learning names for objects, a child must associate the name to what the speaker is referring to, even if that is not the child’s focus of attention when the name is uttered (Baldwin, 1991).

Paul Bloom (2000, p. 90) says, “People cannot learn words unless they are exposed to them. We can explain much of the character of children’s vocabularies in terms of this banal fact”, and as such, characterizing the learning environment is crucial in understanding early word learning.

In the case of word learning, strong evidence for the positive link between the total amount of maternal speech and children’s vocabulary size was provided by Hart and Risley (1995).

Exposure to caregiver speech affects more than just the words that are learned. In recent work, Hurtado et al. (2008) showed that it also positively impacts children’s speech processing efficiency.
Children exposed to more caregiver speech at 18 months knew more words and were faster at word recognition at 24 months. One of the interesting results of this study was the substantial overlap in the effect of maternal speech input on these two outcomes, suggesting that increased processing efficiency supports faster lexical learning, but also that greater lexical knowledge contributed to faster processing efficiency. To use Snow’s analogy, these findings suggest that the developmental “strands” of speech processing skill and lexical knowledge are both entangled and mutually supportive.

whether a word is salient in particular contexts. It need not be salient in all contexts to have a high recurrence, but if it is salient in some situations …

general argument for the role of structured, predictable context as supporting word learning.

But frequency is the weakest predictor in the ensemble of variables we have considered. Instead, in the purely linguistic domain, a word’s recurrence better predicts its age of acquisition.
Recurrence measures how clustered a word is in time; a high recurrence word is one that, when it is used, is used repeatedly over a short duration. For learners with a limited working memory, a word with high recurrence may occur frequently enough in a short duration to take hold in memory.
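The thesis has its own precise definition of recurrence; purely to illustrate the idea of a word's uses being "clustered in time", here is a minimal Python sketch using a stand-in burstiness measure (the coefficient of variation of the gaps between uses) on made-up timestamps:

```python
from statistics import mean, stdev

def burstiness(timestamps):
    """Coefficient-of-variation of inter-use gaps.

    A word whose uses come in tight bursts separated by long
    silences gets a high score; a word used at a steady, even
    rate gets a score near zero. This is an illustrative proxy,
    not the thesis's recurrence measure.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return 0.0
    m = mean(gaps)
    return stdev(gaps) / m if m > 0 else 0.0

# Invented timestamps (e.g. minutes): a bursty word vs. a steady one.
bursty = [0, 1, 2, 3, 100, 101, 102, 200, 201]
steady = [0, 25, 50, 75, 100, 125, 150, 175, 200]

print(burstiness(bursty) > burstiness(steady))  # the bursty word scores higher
```

Under this rough measure, the bursty word's repeated uses over a short duration are exactly the pattern that, per the passage above, could help a limited working memory take hold of the word.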

KL-divergence is measuring a word’s scope or “groundedness”, with the idea that more grounded words are more strongly tied to other aspects of experience and are more tightly woven into the child’s understanding.
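As a hedged illustration of the idea (not the thesis's actual computation), the sketch below compares a hypothetical "grounded" word, whose uses concentrate in one activity context, against a word used evenly everywhere, via KL-divergence from the overall context distribution; all distributions and context labels are invented:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits, for distributions over the same finite support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical usage distributions over four daily-activity contexts
# (mealtime, bathtime, play, bedtime) -- illustrative numbers only.
overall    = [0.25, 0.25, 0.25, 0.25]   # baseline: speech spread evenly
grounded   = [0.85, 0.05, 0.05, 0.05]   # e.g. "spoon": tied to mealtime
ungrounded = [0.30, 0.20, 0.25, 0.25]   # e.g. "that": used everywhere

print(kl_divergence(grounded, overall))    # large: strongly grounded
print(kl_divergence(ungrounded, overall))  # near zero: weakly grounded
```

A word whose distribution diverges strongly from the baseline is, in this sense, more tightly tied to particular aspects of the child's experience.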


Hurtado, N., Marchman, V., and Fernald, A. (2008). Does input influence uptake? Links between maternal talk, processing speed and vocabulary size in Spanish-learning children. Developmental Science, 11(6):F31–F39.

TED video: semantic analysis, influencer, it’s like building a microscope …

How To Write A Sentence

Think You Know ‘How To Write A Sentence’?
July 14, 2011

Most people know a good sentence when they read one, but New York Times columnist Stanley Fish says most of us don’t really know how to write them ourselves. His new book, How To Write A Sentence: And How To Read One, is part ode, part how-to guide to the art of the well-constructed sentence.

Mining the Web for Synonyms (2001)

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney
Institute for Information Technology, National Research Council of Canada
Proceedings of the Twelfth European Conference on Machine Learning, (2001), Freiburg, Germany, 491-502

The task of recognizing synonyms is: given a problem word and a set of alternative words, choose the member of the set of alternatives that is most similar in meaning to the problem word.

The quality of the algorithm’s performance depends on:
– the size of the document collection that is indexed by the search engine and
– the expressive power of the search engine’s query language.
The results presented here are based on queries to the AltaVista search engine.

Recognizing synonyms is often used as a test to measure a (human) student’s mastery of a language.

Latent Semantic Analysis (LSA) is another unsupervised learning algorithm that has been applied to the task of recognizing synonyms.

LSA is a statistical algorithm based on Singular Value Decomposition (SVD). A variation on this algorithm has been applied to information retrieval, where it is known as Latent Semantic Indexing (LSI)
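As a toy illustration of LSA's mechanics (not the TOEFL setup from the paper), the following sketch builds a small term-document count matrix, truncates its SVD, and compares terms by cosine similarity in the latent space; the vocabulary and counts are invented:

```python
import numpy as np

# Toy term-document counts: rows = terms, columns = documents.
terms = ["car", "automobile", "flower", "petal"]
X = np.array([
    [2, 3, 0, 0, 1],   # car
    [1, 2, 0, 0, 0],   # automobile
    [0, 0, 3, 2, 0],   # flower
    [0, 0, 2, 3, 0],   # petal
], dtype=float)

# Truncated SVD: keep k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # term vectors in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" should land closer to "automobile" than to "flower",
# because they co-occur in the same documents.
print(cosine(term_vecs[0], term_vecs[1]) > cosine(term_vecs[0], term_vecs[2]))
```

The dimensionality reduction is what lets terms that never co-occur directly, but share document contexts, end up near each other in the latent space.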

synonym recognition

Statistical approaches to synonym recognition are based on co-occurrence [9].
Manning and Schütze distinguish between co-occurrence (or association) and collocation: collocation refers to “grammatically bound elements that occur in a particular order”, but co-occurrence and association refer to “the more general phenomenon of words that are likely to be used in the same context” [9].
Order does not matter for synonyms, so we say that they co-occur, rather than saying that they are collocated.

PMI-IR was implemented as a simple, short Perl program.
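Turney's paper spells out the exact query formulations; as a rough Python sketch of the simplest scoring variant, score(choice) = hits(problem AND choice) / hits(choice), here is the paper's "levied" TOEFL item scored against invented hit counts standing in for AltaVista queries:

```python
# Hypothetical web hit counts (the real PMI-IR queried AltaVista;
# these numbers are made up for illustration).
hits = {
    "imposed": 1_100_000,
    "believed": 3_000_000,
    "requested": 2_500_000,
    "correlated": 400_000,
    ("levied", "imposed"): 9_000,
    ("levied", "believed"): 800,
    ("levied", "requested"): 700,
    ("levied", "correlated"): 40,
}

def pmi_ir_score(problem, choice):
    """Simplest PMI-IR score: p(problem AND choice) / p(choice).

    Dividing by hits(choice) alone suffices for ranking, since the
    full PMI denominator's hits(problem) factor is constant across
    the alternatives for a fixed problem word.
    """
    return hits[(problem, choice)] / hits[choice]

choices = ["imposed", "believed", "requested", "correlated"]
best = max(choices, key=lambda c: pmi_ir_score("levied", c))
print(best)  # "imposed" wins under these made-up counts
```

The paper's stronger variants refine the joint count with NEAR queries and context words, but the ranking idea is the same.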

cited by:
Introduction to Natural Language Processing
University of Michigan
Coursera, October 5 – December 27, 2015