All example source code

- Word counting — source code
- Parts of Speech Concordance — source code
- Keyword extraction - TF-IDF — source code
- Text Classification - Naive Bayes — source code
- Node concordance

- Secret Life of Pronouns, Pennebaker Ted Talk
- TF-IDF Single Page Tutorial
- Paul Graham’s A Plan for Spam and Better Bayesian Filtering
- Introduction to Bayesian Filtering
- Monty Hall and Bayes
- An Intuitive Explanation of Bayes’ Theorem by Eliezer S. Yudkowsky

- SPEECH COMPARISON by Rune Madsen
- Book-Book by Sarah Groff-Palermo
- Word Tree by Jason Davies
- Wordle
- OK GO album covers by Stefanie Posavec
- Luke Dubois’ Missed Connections
- Luke Dubois’ HindSight is always 20/20
- Nicholas Felton’s 2013 Annual Report, NY Times Article
- Lyrical Indicators and Parsing the State of the Union by Jonathan Corum

- Visualize the results of a concordance using canvas (or some other means).
- Expand the information the concordance holds so that it keeps track of word positions (i.e. not only how many times each word appears in the source text, but where it appears each time).
- Apply some of the ideas specific to spam filtering to the Bayesian classification example.
- In James W. Pennebaker’s book The Secret Life of Pronouns, Pennebaker describes his research into how the frequency of words that have little to no meaning on their own (I, you, they, a, an, the, etc.) are a window into the emotional state or personality of an author or speaker. For example, heavy use of the pronoun “I” is an indicator of “depression, stress or insecurity”. Create a page sketch that analyzes the use of pronouns. For more, visit analyzewords.com.
- Use the ideas to find similarities between people. For example, if you look at all the e-mails on the ITP student list, can you determine who is similar? Consider using properties in addition to word count, such as time of e-mails, length of e-mails, who writes to whom, etc.

You know that thing we call an array? Yes, that’s right, an ordered list of data. Each element of an array is numbered and accessed by its numeric index.

What if, however, instead of numbering the elements of an array we could name them? This element is named “Sue”, this one “Bob”, this one “Jane”, and so on and so forth. In programming, this kind of data structure is often referred to as an “associative array”, “map”, “hash” or “dictionary.” It’s a collection of `key/value` pairs. The key is `Sue`, the value is `24`. It’s just like having a dictionary of words and when you look up, say, `Sue`, the definition is `24`.

Associative arrays can be incredibly convenient for various applications. For example, you could keep a list of student IDs (`student name/id`) or a list of prices (`product name/price`) in a dictionary. The fundamental building block of just about every text analysis application is a concordance, a list of all words in a document along with how many times each word occurred. A dictionary is the perfect data structure to hold this information. Each element of the dictionary consists of a String paired with a number.

Most programming languages and environments have specific classes or objects for a variety of data structures (a dictionary is just one example). JavaScript, however, does not. But all is not lost. Remember that thing called a JavaScript object?

That’s right. A JavaScript object is a collection of name-value pairs. And so while it might be more convenient to have a custom-tailored dictionary object, we’re going to be able to get all the functionality we need out of just a plain old object itself.

To start writing a concordance all we need is an empty object.
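In JavaScript, that starting point might simply be:

```javascript
// An empty object to serve as our concordance
let concordance = {};
```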

A value (in this case a count) can be paired with a word by naming the key as a String.
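For example (the word and count here are made up):

```javascript
let concordance = {};
// The key is the word itself, written as a String
concordance['the'] = 16;
```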

The above is just another way of writing:
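That is, bracket notation with a String does the same job as dot notation:

```javascript
let concordance = {};
// Dot notation: identical in effect to concordance['the'] = 16;
concordance.the = 16;
```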

We’ll need this new way since we’ll be pulling the names for the object as strings from a source text.

In the case of our examples, we’re going to take a text document, split it into an array of Strings and increase the value associated with a particular key (i.e. word) each time we encounter the same String. Let’s assume we have some text in a variable named `data`. First, we’ll split it into word “tokens”.
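One possible way to split, using a regular expression that breaks on any run of non-word characters (the sample text here is made up):

```javascript
// Hypothetical source text
let data = 'It is the east and Juliet is the sun';
// Split into tokens wherever non-word characters appear
let tokens = data.split(/\W+/);
```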

Then we’ll go through the tokens one at a time.

The tricky thing here is we have to determine if each token (each element of the resulting array) is a new word or one we’ve already encountered. If it’s new, we need to set its initial count at 1. If it’s not, we need to increase its count by one.
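A sketch of that loop (lowercasing each token is my own choice, so that ‘The’ and ‘the’ count as the same word):

```javascript
let tokens = ['The', 'rainbow', 'the']; // stand-in for the real token array
let concordance = {};
for (let i = 0; i < tokens.length; i++) {
  let token = tokens[i].toLowerCase(); // fold case so 'The' matches 'the'
  if (concordance[token] === undefined) {
    concordance[token] = 1;                      // a brand new word
  } else {
    concordance[token] = concordance[token] + 1; // seen before, count up
  }
}
// concordance is now { the: 2, rainbow: 1 }
```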

There, we now have a concordance object that stores all the words and their counts! The question, however, remains: what do we do with this thing?

The first thing you might want to do is simply examine the results. For example, let’s say we wanted to display the most frequent words (in sorted order). Unfortunately, the fields of a JavaScript object have no order to them and cannot be easily sorted. One solution to this problem is to keep a separate array of all the keys. This array can be sorted and used to iterate over all the name/value pairs in the concordance object.
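One way to build that separate array (here from a small, made-up concordance):

```javascript
let concordance = { the: 16, rainbow: 4, droplet: 2 };
// A separate array of keys that we are free to sort
let keys = Object.keys(concordance);
```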

While we could write our own sorting algorithm in JavaScript to sort the array, we might as well make use of the `sort()` function available as part of the Array prototype. The tricky thing here is that `sort()` expects as an argument a function itself!

This is pretty typical of JavaScript and functional programming. Here we have an anonymous function that we pass into the `sort()` function itself. This function takes two arguments: `a` and `b`. The function is a **comparison** function and should return a positive number if element `b` should appear before `a` in the sorted result.
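Such a comparator might look like this (assuming the concordance and keys array from above; the counts are made up):

```javascript
let concordance = { the: 16, rainbow: 4, droplet: 2 };
let keys = Object.keys(concordance);
keys.sort(function(a, b) {
  if (concordance[b] > concordance[a]) {
    return 1;   // b's count is bigger, b comes first
  } else if (concordance[b] < concordance[a]) {
    return -1;  // a's count is bigger, a comes first
  } else {
    return 0;   // equal counts, order doesn't matter
  }
});
// keys is now sorted from most to least frequent
```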

This can be condensed, since `sort()` only cares about the sign of the number returned: subtracting one count from the other produces exactly the positive, negative, or zero result we need.
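The condensed version:

```javascript
let concordance = { the: 16, rainbow: 4, droplet: 2 };
let keys = Object.keys(concordance);
// The sign of the subtraction does all the comparing for us
keys.sort(function(a, b) {
  return concordance[b] - concordance[a];
});
```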

Now that we have sorted keys, we can iterate over the concordance.
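Putting it together (with a made-up concordance):

```javascript
let concordance = { the: 16, rainbow: 4, droplet: 2 };
let keys = Object.keys(concordance);
keys.sort(function(a, b) {
  return concordance[b] - concordance[a];
});
// Walk the sorted keys, looking each word's count back up in the concordance
for (let i = 0; i < keys.length; i++) {
  console.log(keys[i] + ': ' + concordance[keys[i]]);
}
```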

Here is a text concordance example and its source code.

One common application of a text concordance is TF-IDF or term frequency–inverse document frequency. Let’s consider a corpus of wikipedia articles. Is there a way we could automatically generate keywords or tags for an article based on its word counts?

TF-IDF has two components. Term frequency is one that we are already quite familiar with. How frequent is a given term in a document? This is exactly what we calculated in the concordance. We could stop here and say that keyword generation is: “The words that appear most frequently are most important in a document.” While there is some merit to this idea, what we’ll see is that the most frequent words are just the words that appear frequently in all text: junk words like ‘to’, ‘a’, ‘and’, ‘you’, ‘me’, etc. Ironically, these junk words may hold the key to unlocking a world of information about a particular text. Nevertheless, these are clearly not related to a document’s subject matter as keywords.

TF-IDF takes a different approach. Yes, a word that appears frequently in a document (TF) is one key indicator. But adding in another indicator such as inverse document frequency (is it a word that rarely appears in other documents?) takes the junk words out of the equation. Let’s consider a wikipedia article about rainbows. Here are some of the counts:

Using this as a keyword score alone is not enough since the most “important” word would be ‘the’. Now let’s say we looked at five other wikipedia articles and counted how many articles each of these words appears in at least once.

This is a somewhat obvious result: ‘the’ and ‘and’ appear in all the articles while ‘rainbow’ and ‘droplet’ appear in far fewer. We could therefore compute a score for each of these as:

Now we’re getting somewhere!

TF-IDF is meant to be run on a much larger corpus and in order to dampen the effect of the IDF value, a common solution is to use the logarithm of IDF.

If logarithmic scale is new to you, this Khan Academy video may help. (Note how if a term appears in every single document the tf-idf score is always zero.)

We can improve this one more step by using not just the raw count of how many times a term (such as “rainbow”) appears in a document, but the ratio of its count to the total number of words in the document. This normalizes the score by document length. So if the total number of words in the article is 100, the score would now be:

In the case of only examining this document it makes no difference, but if we were looking at the score for “rainbow” across multiple documents without this change the score would be biased towards longer documents.
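Here is one way the full score might be computed, with both the log IDF and the length normalization folded in (the function name and the sample numbers are my own invention):

```javascript
// TF-IDF sketch: (term count / total words in document) *
// log(total documents / documents containing the term)
function tfidf(termCount, totalWords, totalDocs, docsWithTerm) {
  let tf = termCount / totalWords;
  let idf = Math.log(totalDocs / docsWithTerm);
  return tf * idf;
}
```

A term that appears in every single document gets `Math.log(1)`, which is zero, so junk words like ‘the’ score nothing.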

For a wonderful example of TF-IDF out in the world, take a look at Nicholas Felton’s 2013 Annual Report.

Consider the following scenario:

- 1% of all ITP students are afflicted with a rare disease known as ITPosis
- There is a test you can take to determine if you have it, known as a TID (Test for Interactive Disease).
- 90% of all students with ITPosis will receive a positive TID (i.e. 10% that have the disease will receive a false negative).
- 95% of students without ITPosis will receive a negative TID (i.e. 5% will receive false positives).

You have received a positive TID, what is the likelihood you have ITPosis?

As you might expect, there is a very precise answer to this question but it’s probably not what you initially guess. Bayesian reasoning is counter-intuitive and takes quite a bit of getting used to. In fact, when given a similar question related to breast cancer and mammograms, only 15% of doctors get the answer correct.

The answer, roughly 15.4%, is calculated via Bayes’ Theorem. Let’s look at it again with this scenario:

- There are 1000 students.
- 10 of them have ITPosis.
- 9 of those 10 with the disease will receive a positive TID.
- Out of the 990 w/o ITPosis, ~50 will receive positive TIDs.
- Therefore, 59 total students receive positive TIDs, 9 of which actually have the disease, 50 do not.
- The chance one has the disease if the test is positive is therefore 9 / 59 ≈ 15.3% (off slightly from the exact result b/c of rounding).

This video illustrates the problem quite nicely.

The problem our brains run into is those rascally 90% and 95% numbers. If 90% of students with the disease test positive and 95% without it test negative, then if I test positive, I should probably have it, right?! The important thing to remember is that only 1% of students actually have the disease. Sure, testing positive increases the likelihood, but because 5% of students without the disease receive a false positive, it only increases the chances to about 15%. All of this is explained in incredibly thorough and wonderful detail in Eliezer Yudkowsky’s article An Intuitive Explanation of Bayesian Reasoning. My explanation is simply adapted from his.

By the way, we could have calculated it as follows:

P (ITPosis | Positive TID) = (90% * 1%) / (90% * 1% + 5% * 99%)

This reads as “the probability that a positive TID means you have ITPosis” equals the probability of a true positive divided by the probability of any positive result, true or false.
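The same arithmetic in code form:

```javascript
let pDisease = 0.01;         // 1% of students have ITPosis
let pPosGivenDisease = 0.9;  // true positive rate (90%)
let pPosGivenHealthy = 0.05; // false positive rate (5%)
// Bayes' Theorem
let pDiseaseGivenPos =
  (pPosGivenDisease * pDisease) /
  (pPosGivenDisease * pDisease + pPosGivenHealthy * (1 - pDisease));
// roughly 0.154
```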

So why do we care? This type of reasoning can be applied quite nicely to text analysis. A common example is spam filtering. If we know the probability that a spam e-mail contains specific words, we can calculate the likelihood that an e-mail is spam based on its concordance.

A wonderful resource for this approach is Paul Graham’s A Plan for Spam as well as Better Bayesian Filtering.

The example code that follows is not a perfect text classifier by any means. It’s a simple implementation of the idea that outlines the basic steps one might take to apply Bayesian Filtering to text.

The first thing we need to do is expand on the concordance example that stores a single number associated with each word. For classification, we’ll need to know things like how many times that word appears in spam e-mails versus good (aka ‘ham’) e-mails. And then we’ll need to use these values to calculate the probability that each word would appear in a spam or ham e-mail.

Instead of storing a single number like `dictionary['the'] = 16;` we now need to associate an object containing multiple data points with each key. The process of running the filter works as follows:
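One possible shape for such an entry (the field names here are my own):

```javascript
let dictionary = {};
dictionary['the'] = {
  countA: 16, // occurrences in category A (spam) training documents
  countB: 3,  // occurrences in category B (ham) training documents
  probA: 0,   // probability the word appears in an A document, filled in later
  probB: 0
};
```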

- Train the filter with known category A (for example: spam) e-mails and known category B (ham) e-mails.
- For every word, check if it’s new. If it is, add it; if not, simply increase the counter for “A” or “B” (depending on whether it’s found in A or B).

Here’s how this might look:
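A sketch of the training step for a single known category A document (the tokens are made up):

```javascript
let dictionary = {};
let tokens = ['free', 'money', 'free']; // tokens from a known spam (A) document
for (let i = 0; i < tokens.length; i++) {
  let token = tokens[i];
  if (dictionary[token] === undefined) {
    dictionary[token] = { countA: 0, countB: 0 }; // a brand new word
  }
  // This document is category A; a B document would increment countB instead
  dictionary[token].countA++;
}
```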

The above steps are repeated over and over again for all training documents. Once all the “training” files are read, the probabilities can be calculated for every word.

Once we’ve gone through the process of counting the occurrences in each category (‘A’ or ‘B’, spam or ham, etc.), we can then calculate the probabilities according to Bayes’ rule.
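With equal priors, the per-word probability reduces to the word’s relative frequency in A over the sum of its relative frequencies in A and B. A sketch (the counts and document totals are hypothetical):

```javascript
let totalDocsA = 10; // number of category A training documents
let totalDocsB = 10; // number of category B training documents
let entry = { countA: 8, countB: 2, probA: 0 }; // counts for one word
let freqA = entry.countA / totalDocsA;
let freqB = entry.countB / totalDocsB;
// P(A | word), assuming a 50/50 prior
entry.probA = freqA / (freqA + freqB);
```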

The above formula might look a little bit simpler to you than the original Bayes rule. This is because I am leaving out the “prior probability” and assuming that any document has a 50% chance of being category A or B.

Now, all that is left to do is take a new document, and compute the total probability for that document according to the formula specified in Graham’s essay. For this step, we need to calculate combined probability as outlined by Graham. For more about combined probability, here’s another resource.
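One way to write the combined probability Graham describes (this is the naive product form, not his exact code):

```javascript
// Combine individual word probabilities into one document-level probability
function combinedProbability(probs) {
  let productA = 1;    // product of the probabilities
  let productNotA = 1; // product of (1 - probability)
  for (let i = 0; i < probs.length; i++) {
    productA *= probs[i];
    productNotA *= (1 - probs[i]);
  }
  return productA / (productA + productNotA);
}
```

Words with probabilities near 1 push the result toward category A; words near 0 push it away.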

Now we know the probability the document is in category A!

One important aspect of this analysis that I’ve left out is the “interesting-ness” of any given word. An interesting rating is defined as how different, say, the spam probability is from 0.5 (i.e. 50/50 is as boring as it gets), or the absolute value of `probA - 0.5`. Graham’s spam filter, for example, only uses the probabilities of the top 15 most interesting words. If you are looking for an exercise, you might try adding this feature to the Bayesian classifier example.