.. Created by Adam Cunningham on Fri Jun 3 2016.
.. Report Description added Thu Nov 24 2016.

**Statistical Language Models**
===============================

A statistical language model assigns a probability :math:`P(S)` to a sentence of n words, :math:`S = w_1 w_2 \ldots w_n`. The chain rule in probability theory describes how to split :math:`P(w_1, w_2, \ldots, w_n)` into the probabilities of each word given the words that precede it.

.. math::

    P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) \cdots P(w_n | w_1, w_2, \ldots, w_{n - 1})

Here :math:`P(A | B)` denotes conditional probability: if we know that an event *B* has occurred, then the probability of an event *A* given that *B* has already occurred is called the *conditional probability of A given B*.

N-Gram Models
-------------

In practice, most possible word sequences are never observed, even in a very large *corpus* (a body of text). One solution when constructing a language model is to make the *Markov assumption* that the probability of a word depends only on the most recent words that preceded it. An n-gram model does just this, predicting the probability of a word based only on the n - 1 words that precede it. For n = 1, 2 and 3 we have:

- Unigram: :math:`P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i)`
- Bigram: :math:`P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})`
- Trigram: :math:`P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1}, w_{i-2})`

Report Description
------------------

In this report we download texts from online sources, analyze the distribution of words in these texts, and use them to generate statistical language models of increasing complexity. The report consists of the following exercises:

**Exercise 1. Word Frequency Analysis**

- Download the text of Shakespeare's “Hamlet” from Project Gutenberg.
- Make a list of all the words in this text, ordered by frequency.
- Identify the most commonly used words.
- Make a pie chart of the words, showing their frequency.
- Plot the frequency of words against their rank in frequency order and comment on any patterns.
- Perform the word frequency analysis on another text of your choice and comment on any differences.

.. important::

    Please do not print out the complete text of "Hamlet" (or any other text) in the report you submit. It would serve no purpose and would make your report unnecessarily long.

**Exercise 2. N-Gram Models**

- Select an author of your choice and download either their collected works or as many individual works as possible from Project Gutenberg.

.. tip::

    Strings can be concatenated simply using '+'. For example:

    .. code-block:: python

        mytext = "MTH" + "337"

    If you download multiple files from Project Gutenberg, you can use this to create a single string that contains all the texts before you start generating the models.

- Construct unigram, bigram, trigram and quadrigram models from this corpus.

.. tip::

    The unigram model consists of one list of words and another list of their associated probabilities. All other models are stored as dictionaries.

    - For the bigram model, the dictionary keys are single words.
    - For the trigram model, the dictionary keys are (word1, word2) tuples.
    - For the quadrigram model, the dictionary keys are (word1, word2, word3) tuples.

    The dictionary values associated with these keys are the lists of possible following words and their conditional probabilities. A minimal sketch of this structure is given after Exercise 3 below.

- Print the number of distinct words, bigrams, trigrams and quadrigrams found in this corpus.

**Exercise 3. Generate "Random" Sentences**

- Generate at least a paragraph of "random" sentences using each model created in Exercise 2.

.. tip::

    Each word in a random sentence is generated by passing a list of words and their associated probabilities to **numpy.random.choice**, as in the sketch after this exercise.

- How large does n need to be before these sentences seem grammatically correct?
- How large do you think n needs to be before your chosen author could be identified from these sentences?
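As a concrete illustration of the dictionary structure described in Exercise 2 and the **numpy.random.choice** sampling suggested in Exercise 3, here is a minimal sketch, not a required implementation. The toy ``corpus`` string and all variable names are illustrative assumptions; in the report they would be replaced by the texts you download and your own code.

.. code-block:: python

    from collections import defaultdict

    import numpy as np

    # Toy corpus standing in for the text(s) downloaded from Project Gutenberg.
    corpus = "the cat sat on the mat and the cat ate the rat"
    words = corpus.split()

    # Unigram model: one list of words and one list of their probabilities.
    unigram_words = sorted(set(words))
    unigram_probs = [words.count(w) / len(words) for w in unigram_words]

    # Bigram model: a dictionary mapping each word to the list of possible
    # following words and the list of their conditional probabilities.
    follower_counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(words, words[1:]):
        follower_counts[w1][w2] += 1

    bigram_model = {}
    for w1, followers in follower_counts.items():
        total = sum(followers.values())
        bigram_model[w1] = (list(followers),
                            [count / total for count in followers.values()])

    print("Distinct words:  ", len(unigram_words))
    print("Distinct bigrams:", sum(len(nexts) for nexts, _ in bigram_model.values()))

    # Generate a short "random" sentence: draw the first word from the unigram
    # model, then repeatedly sample the next word with numpy.random.choice.
    word = np.random.choice(unigram_words, p=unigram_probs)
    sentence = [word]
    for _ in range(10):
        if word not in bigram_model:   # the last word may have no recorded follower
            break
        next_words, next_probs = bigram_model[word]
        word = np.random.choice(next_words, p=next_probs)
        sentence.append(word)

    print(" ".join(sentence))

The trigram and quadrigram models can be built in the same way, with (word1, word2) and (word1, word2, word3) tuples as the dictionary keys and the same word/probability lists as values.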
**Exercise 4. Smoothing**