for a pre-crawled SQLite db: see README.txt

Capstone: Retrieving, Processing, and Visualizing Data with Python
University of Michigan. June 2016




import sqlite3

conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()

# Start from an empty table on every run
cur.execute('DROP TABLE IF EXISTS Counts')

cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)')

fname = input('Enter file name: ')
if len(fname) < 1:
    fname = 'mbox.txt'
fh = open(fname)
for line in fh:
    # Only the "From: " header lines carry the sender address
    if not line.startswith('From: '):
        continue
    pieces = line.split()
    email = pieces[1]
    organization = email.split('@')[1]
    cur.execute('SELECT count FROM Counts WHERE org = ?', (organization,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
            VALUES (?, 1)''', (organization,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?',
                    (organization,))
fh.close()

# Write the counts to disk; without a commit the changes are lost on exit
conn.commit()

sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'

for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])

cur.close()
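
A possible variant of the script above, as a sketch only: the same tally wrapped in a function so it can be tried on a few lines in an in-memory database. It assumes org is declared PRIMARY KEY (the script does not do this), which lets INSERT OR IGNORE replace the SELECT-then-branch step. The function name count_orgs and the sample addresses are made up for illustration.

```python
import sqlite3


def count_orgs(lines, conn):
    """Tally sender domains from 'From: ' lines into a Counts table."""
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS Counts')
    # Assumption: org is a PRIMARY KEY so INSERT OR IGNORE can detect duplicates.
    cur.execute('CREATE TABLE Counts (org TEXT PRIMARY KEY, count INTEGER)')
    for line in lines:
        if not line.startswith('From: '):
            continue
        pieces = line.split()
        if len(pieces) < 2 or '@' not in pieces[1]:
            continue  # skip malformed headers instead of raising IndexError
        org = pieces[1].split('@')[1]
        # Create the row with a zero count if it is missing, then increment it.
        cur.execute('INSERT OR IGNORE INTO Counts (org, count) VALUES (?, 0)',
                    (org,))
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?', (org,))
    conn.commit()
    # Ties are broken alphabetically so the ordering is deterministic.
    return cur.execute(
        'SELECT org, count FROM Counts ORDER BY count DESC, org LIMIT 10'
    ).fetchall()


# Hypothetical sample input standing in for mbox.txt.
sample = [
    'From: alice@umich.edu\n',
    'Subject: hello\n',
    'From: bob@umich.edu\n',
    'From: carol@mit.edu\n',
]
top = count_orgs(sample, sqlite3.connect(':memory:'))
print(top)
```

Passing the connection in keeps the function usable with either ':memory:' or the emaildb.sqlite file.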



The Birth of a Word
(a doctoral thesis)
by Brandon Cain Roy
MIT. February 2013

using a small number of words (52 actively used words by 16 months of age) in concert with situational context to communicate effectively in a wide range of day-to-day situations

From a more technical standpoint, the fundamental meaningful unit in a language isn’t the word but the morpheme, a word’s more basic constituents.

The Word Birth Browser was developed in Java, retrieving data from a local SQLite3 database containing a version of the corpus.

Dromi notes that in the weeks of decreasing vocabulary growth rate, her daughter seemed to be exploring the words she had already learned, refining their use, and generally consolidating the lexicon. We find this a compelling idea, and since our first analysis in (Roy et al., 2009) we have wondered whether the drop in word birth rate could coincide with an increase in the child’s use of syntax.

It is hard to imagine that with 669 words the child’s communicative needs are satisfied. Then again, the child has responsive caregivers and the range of activities in a 9–24 month old’s life are limited. The introduction of a new toy, activity or other experience (such as going to the zoo) could contribute new words in the child’s lexicon, but at a certain point the child’s vocabulary may be sufficient for the activities of everyday life.
p. 95

If the drop in vocabulary growth rate is not a statistical artifact, as suggested in the previous section, what else could contribute to the “vocabulary implosion” observed? Before 19 months of age, the child has 444 words in his productive vocabulary. If word learning is partly fueled by “communicative need”, does the decrease in vocabulary growth rate indicate that the child has achieved some level of communicative sufficiency at 18 months? Or does communicative growth transition from learning new words to combining words together in new ways?

By 24 months of age, the child had learned 669 words. He learned these words through exposure to them in his environment. But why did he learn these words, and in the order that he learned them? In the next chapter, we consider the relationship between lexical acquisition and the rich linguistic environment of a young child’s first years.

Children’s early language learning is sometimes described as “effortless”, and to adults witnessing the seemingly autonomous birth and growth of language it may indeed appear so. But a better adjective might be “remarkable” when one accounts for the numerous challenges that young learners face in acquiring their first language.

Children’s exposure to language is primarily through speech, and unlike text there are no “spaces” marking word boundaries. As Peters (1983) discusses, although the units of speech are words, children do not necessarily partition the speech stream into their final adult word forms. Even assuming the words and the concepts are available to the child, the mapping between them must be learned.

Elizabeth Spelke and her colleagues argue that children come into the world equipped with systems of core knowledge about objects, agents, number, geometry as well as social knowledge (Spelke, 1994; Spelke and Kinzler, 2007).
Such systems of core knowledge may provide a necessary substrate for early learning, including language acquisition. Children are also sensitive to statistical regularities in the speech they hear, which can help in segmenting words (Saffran et al., 1996). Another skill children bring to bear, of particular relevance to word learning, is the ability to infer the referential intent of others. In the case of learning names for objects, a child must associate the name to what the speaker is referring to, even if that is not the child’s focus of attention when the name is uttered (Baldwin, 1991).

Paul Bloom (2000, p. 90) says, “People cannot learn words unless they are exposed to them. We can explain much of the character of children’s vocabularies in terms of this banal fact” and as such, characterizing the learning environment is crucial in understanding early word learning.

In the case of word learning, strong evidence for the positive link between the total amount of maternal speech and children’s vocabulary size was provided by Hart and Risley (1995).

Exposure to caregiver speech affects more than just the words that are learned. In recent work, Hurtado et al. (2008) showed that it also positively impacts children’s speech processing efficiency.
Children exposed to more caregiver speech at 18 months knew more words and were faster at word recognition at 24 months. One of the interesting results of this study was the substantial overlap in the effect of maternal speech input on these two outcomes, suggesting that increased processing efficiency supports faster lexical learning, but also that greater lexical knowledge contributed to faster processing efficiency. To use Snow’s analogy, these findings suggest that the developmental “strands” of speech processing skill and lexical knowledge are both entangled and mutually supportive.

whether a word is salient in particular contexts. It need not be salient in all contexts to have a high recurrence, but if it is salient in some situations …

general argument for the role of structured, predictable context as supporting word learning.

But frequency is the weakest predictor in the ensemble of variables we have considered. Instead, in the purely linguistic domain, a word’s recurrence better predicts its age of acquisition.
Recurrence measures how clustered a word is in time; a high recurrence word is one that, when it is used, is used repeatedly over a short duration. For learners with a limited working memory, a word with high recurrence may occur frequently enough in a short duration to take hold in memory.
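
To make the frequency vs. recurrence distinction concrete, here is a rough sketch. It is not the thesis's definition: it assumes recurrence can be approximated as the average number of other uses of a word within a fixed time window around each use. The 60-second window, the function name, and the timestamps are all made up for illustration.

```python
from bisect import bisect_left, bisect_right


def recurrence(timestamps, window=60.0):
    """Average number of other uses within `window` seconds of each use.

    Assumption: this windowed count is only a stand-in for the thesis's
    recurrence measure, whose exact formula is not given in these notes.
    """
    times = sorted(timestamps)
    if not times:
        return 0.0
    total = 0
    for t in times:
        lo = bisect_left(times, t - window)
        hi = bisect_right(times, t + window)
        total += (hi - lo) - 1  # exclude the occurrence itself
    return total / len(times)


# Two hypothetical words, each used six times over about an hour (seconds).
clustered = [10, 15, 22, 30, 41, 3600]
spread = [0, 700, 1400, 2100, 2800, 3500]
print(recurrence(clustered), recurrence(spread))
```

Both words have the same frequency (six uses), but only the first is clustered in time, so only the first gets a nonzero score under this proxy.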

KL-divergence is measuring a word’s scope or “groundedness”, with the idea that more grounded words are more strongly tied to other aspects of experience and are more tightly woven into the child’s understanding.
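
A small sketch of how a KL-divergence score of this kind could be computed. The assumption here (not stated in the excerpt) is that the divergence is taken between the distribution of contexts in which one word is used and the background distribution of contexts over all speech. The context labels and counts are invented for illustration.

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    n = sum(counts.values())
    return {k: v / n for k, v in counts.items()}


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions over the same keys.

    Requires q[key] > 0 wherever p[key] > 0.
    """
    total = 0.0
    for key, p_k in p.items():
        if p_k > 0:
            total += p_k * math.log2(p_k / q[key])
    return total


# Hypothetical counts of utterances by location.
background = normalize({'kitchen': 40, 'living room': 40, 'bedroom': 20})
word_tied = normalize({'kitchen': 18, 'living room': 1, 'bedroom': 1})
word_everywhere = normalize({'kitchen': 8, 'living room': 8, 'bedroom': 4})

print(kl_divergence(word_tied, background))
print(kl_divergence(word_everywhere, background))
```

Under this reading, a word used mostly in one context diverges strongly from the background (more “grounded”), while a word used in proportion to the background scores close to zero.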


Hurtado, N., Marchman, V., and Fernald, A. (2008). Does input influence uptake? Links between maternal talk, processing speed and vocabulary size in Spanish-learning children. Developmental Science, 11(6):F31–F39.

TED video: semantic analysis, influencer, it’s like building a microscope …