Michael Collins' NLP Class Note 1
This article is a collection of notes for Michael Collins’ NLP Class on Coursera.
Week1
SECTION1 Intro
1.1 The Definition of Natural Language Processing (NLP)
Computers using natural language as input and/or output.
- Natural Language Understanding (NLU): natural language as input
- Natural Language Generation (NLG): natural language as output
1.2 Applications of Natural Language Processing
- machine translation
- information extraction
- text summarization
- dialogue systems
1.3 Major Problems of NLP
- Tagging problems (part-of-speech tagging, named-entity recognition)
- Parsing problems
1.4 Sources of Ambiguity (where the difficulties lie)
- Acoustic
- Syntactic
- Semantic
- Discourse
SECTION2 Language Model
2.1 Formal Definition of Vocabulary and Sentence
- Vocabulary set $\mathcal{V}$: all words in the language; finite.
- Sentence: a sequence of words $x_1 x_2 \ldots x_n$ where $n$ is an integer with $n\ge1$, $x_i\in\mathcal{V}$ for $i \in \{1,\ldots,n-1\}$, and $x_n$ is a special symbol, STOP.
- Sentence set $\mathcal{V}^\dagger$: all sentences over the vocabulary $\mathcal{V}$; infinite.
(KEY) 2.2 Defining a Language Model
- Two major parts: a finite set $\mathcal{V}$ and a function $p(x_1,x_2,\ldots,x_n)$
- $\forall\langle x_1,\ldots,x_n\rangle\in\mathcal{V}^\dagger$, $p(x_1,x_2,\ldots,x_n)\ge0$, and $\sum_{\langle x_1,\ldots,x_n\rangle\in\mathcal{V}^\dagger}p(x_1,\ldots,x_n)=1$; that is, $p$ is a probability distribution over $\mathcal{V}^\dagger$.
Personally, I think of it as trying to grasp an overall picture of how words and sentences are distributed in a training set.
- It is not hard to get this distribution via a maximum-likelihood method; see the sketch below.
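A minimal sketch of what this naive maximum-likelihood estimate looks like (my reconstruction, following the lecture's counting notation): $p(x_1,x_2,\ldots,x_n)=\frac{c(x_1,x_2,\ldots,x_n)}{N}$, where $c(x_1,\ldots,x_n)$ is the number of times the whole sentence appears in the training corpus and $N$ is the total number of training sentences. This fits the training set exactly but assigns probability zero to every unseen sentence, which is what motivates the Markov models below.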
2.3 Markov Processes
- Chain rule of joint probability: $P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n)=\prod_{i=1}^{n}P(X_i=x_i\mid X_1=x_1,\ldots,X_{i-1}=x_{i-1})$
- First-order Markov assumption: each word depends only on the previous word. Hence, $P(X_1=x_1,\ldots,X_n=x_n)=P(X_1=x_1)\prod_{i=2}^{n}P(X_i=x_i\mid X_{i-1}=x_{i-1})$
- Second-order Markov assumption: each word depends only on the previous two words. Hence, $P(X_1=x_1,\ldots,X_n=x_n)=\prod_{i=1}^{n}P(X_i=x_i\mid X_{i-2}=x_{i-2},X_{i-1}=x_{i-1})$
- General form: the same product $\prod_{i=1}^{n}P(X_i=x_i\mid X_{i-2}=x_{i-2},X_{i-1}=x_{i-1})$, where $x_0=x_{-1}=*$, a special "start" symbol.
- Pushing it into a more flexible form: the length $n$ itself becomes a random variable, which is handled by requiring the last word to be the STOP symbol.
(KEY) 2.4 Trigram Language Model
- Parameters to estimate: $q(w\mid u,v)$, the probability of word $w$ given the two preceding words $u,v$
- The probability of a sentence under $p$: $p(x_1,\ldots,x_n)=\prod_{i=1}^{n}q(x_i\mid x_{i-2},x_{i-1})$, where $x_0=x_{-1}=*$
- Key problem: how to estimate the parameters $q(w\mid u,v)$ (a worked example follows below)
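As a worked illustration (the sentence is my own toy example, not quoted from the note), a trigram model scores the sentence "the dog barks STOP" as $p(\mbox{the dog barks STOP})=q(\mbox{the}\mid *,*)\times q(\mbox{dog}\mid *,\mbox{the})\times q(\mbox{barks}\mid \mbox{the},\mbox{dog})\times q(\mbox{STOP}\mid \mbox{dog},\mbox{barks})$.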
2.5 Using the Maximum-Likelihood Method to Estimate the Parameters

Estimating formula: $q_{ML}(w\mid u,v)=\frac{c(u,v,w)}{c(u,v)}$, where $c(u,v,w)$ and $c(u,v)$ are trigram and bigram counts in the training corpus.

Problems (a code sketch illustrating these failure modes follows the list):
- a very large number of parameters to estimate (one for each possible trigram)
- the numerator $c(u,v,w)$ is often zero, so many reasonable trigrams get probability zero
- the denominator $c(u,v)$ can be zero, leaving the estimate undefined
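To make these failure modes concrete, here is a minimal Python sketch of the maximum-likelihood trigram estimator (the function and variable names are mine, not from the course):

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Return a maximum-likelihood estimator q(w | u, v) built from raw counts.

    `sentences` is an iterable of word lists; each sentence is padded with two
    start symbols '*' and terminated with 'STOP', as in the lecture.
    """
    trigram_counts = defaultdict(int)  # c(u, v, w)
    bigram_counts = defaultdict(int)   # c(u, v)
    for words in sentences:
        padded = ['*', '*'] + list(words) + ['STOP']
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(u, v, w)] += 1
            bigram_counts[(u, v)] += 1

    def q(w, u, v):
        # Zero numerator: unseen trigram over a seen context -> probability 0.
        # Zero denominator: unseen context bigram -> the estimate is undefined.
        if bigram_counts[(u, v)] == 0:
            raise ZeroDivisionError("bigram (%s, %s) never seen" % (u, v))
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    return q

q = train_trigram_mle([["the", "dog", "barks"], ["the", "dog", "runs"]])
print(q("barks", "the", "dog"))  # 0.5
```

Calling `q` with a context bigram that never occurred raises an error (denominator zero), and any unseen trigram over a seen context gets probability exactly zero (numerator zero); the smoothing methods in SECTION3 address both.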
2.6 Evaluation Metric: Perplexity
- Apply held-out data to test the model's ability to deal with unseen data.
- What perplexity measures: the model's effective vocabulary size. The smaller it is, the better the model performs.
For example, if the model assigns every word a probability of $0.1$, as a uniform distribution over ten words would, then the perplexity is ten: on average the model is as uncertain as if it had to choose uniformly among ten words, so the effective vocabulary size is ten.
- Definition of perplexity: $\mbox{perplexity}=2^{-l}$, where $l=\frac{1}{M}\sum_{i=1}^{m}\log_2 p(x^{(i)})$, $m$ is the number of sentences in the test corpus, and $M$ is the total number of words in it (a small code sketch follows this list).
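A small Python sketch of this evaluation (my own helper; it assumes a `log2_prob` function that returns $\log_2 p(x^{(i)})$ for a test sentence under the model):

```python
def perplexity(test_sentences, log2_prob):
    """Perplexity = 2^(-l), where l is the average per-word log2 probability.

    `test_sentences` is a list of word lists (each ending with STOP);
    `log2_prob(sentence)` returns log_2 p(sentence) under the language model.
    """
    total_log2_prob = sum(log2_prob(sentence) for sentence in test_sentences)
    total_words = sum(len(sentence) for sentence in test_sentences)  # M
    l = total_log2_prob / total_words
    return 2.0 ** (-l)
```

As a sanity check, a model that assigns every word probability $1/N$ yields perplexity exactly $N$, matching the effective-vocabulary-size intuition above.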
SECTION3 Smoothed Estimation of Trigram Models
3.1 Linear Interpolation
- Core idea: estimate the trigram probability with a linear combination of the trigram, bigram, and unigram maximum-likelihood estimates: $q(w\mid u,v)=\lambda_1 q_{ML}(w\mid u,v)+\lambda_2 q_{ML}(w\mid v)+\lambda_3 q_{ML}(w)$, where $\lambda_1+\lambda_2+\lambda_3=1$ and $\lambda_i\ge0$.
- Define the objective on a development set: $L(\lambda_1,\lambda_2,\lambda_3)=\sum_{u,v,w}c'(u,v,w)\log q(w\mid u,v)$, where $c'(u,v,w)$ is the count of the trigram $(u,v,w)$ in the development data.
- To choose the parameters $\lambda_1,\lambda_2,\lambda_3$: maximize $L(\lambda_1,\lambda_2,\lambda_3)$ subject to the constraints above.
- A practical way of choosing these $\lambda$s (bucketing): let the $\lambda$s vary with the bigram count $c(u,v)$, partitioning the possible counts into a small number of buckets and estimating one set of $\lambda$s per bucket. A code sketch of the basic interpolated estimator follows below.
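A minimal sketch of the interpolated estimator in Python, with a fixed (not optimized or bucketed) choice of $\lambda$s; the count-dictionary and parameter names are my own assumptions:

```python
def make_interpolated_q(trigram_c, bigram_c, unigram_c, total_words,
                        lambdas=(0.4, 0.4, 0.2)):
    """Return q(w | u, v) as a weighted mix of ML trigram/bigram/unigram estimates.

    trigram_c, bigram_c, unigram_c are raw count dictionaries from training data;
    total_words is the number of word tokens; the lambdas are non-negative and
    sum to 1 so that q remains a probability distribution.
    """
    l1, l2, l3 = lambdas

    def safe_ratio(num, den):
        return num / den if den else 0.0

    def q(w, u, v):
        tri = safe_ratio(trigram_c.get((u, v, w), 0), bigram_c.get((u, v), 0))
        bi = safe_ratio(bigram_c.get((v, w), 0), unigram_c.get(v, 0))
        uni = unigram_c.get(w, 0) / total_words
        return l1 * tri + l2 * bi + l3 * uni

    return q
```

Because the unigram term is non-zero for every word seen in training, the interpolated estimate avoids the zero-numerator and zero-denominator problems of the pure trigram MLE.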
3.2 Discounting Methods
- Discounted counts: $c^*(v,w)=c(v,w)-\beta$ for some discount $\beta$ (e.g., $\beta=0.5$).
- Missing probability mass: $\alpha(v)=1-\sum_{w:\,c(v,w)>0}\frac{c^*(v,w)}{c(v)}$.
- Two sets: $\mathcal{A}(v)=\{w:c(v,w)>0\}$ and $\mathcal{B}(v)=\{w:c(v,w)=0\}$.
- Bigram back-off estimate: $q_{BO}(w\mid v)=\frac{c^*(v,w)}{c(v)}$ if $w\in\mathcal{A}(v)$, and $q_{BO}(w\mid v)=\alpha(v)\times\frac{q_{ML}(w)}{\sum_{w'\in\mathcal{B}(v)}q_{ML}(w')}$ if $w\in\mathcal{B}(v)$. A sketch of this estimator in code follows below.
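A rough Python sketch of the discounted bigram estimate with a fixed discount $\beta=0.5$; the function and argument names are my own, not from the course:

```python
def backoff_bigram_q(w, v, bigram_c, unigram_c, total_words, beta=0.5):
    """q_BO(w | v): discounted mass for seen bigrams, redistributed mass otherwise.

    bigram_c and unigram_c are raw count dictionaries from the training corpus;
    total_words is the number of word tokens (used for the unigram ML estimate).
    """
    c_v = unigram_c[v]
    # A(v): words seen after v in training; B(v): all other words.
    seen_after_v = {w2 for (v2, w2), c in bigram_c.items() if v2 == v and c > 0}
    if w in seen_after_v:
        return (bigram_c[(v, w)] - beta) / c_v  # c*(v, w) / c(v)
    # Missing probability mass: alpha(v) = 1 - sum over A(v) of c*(v, w') / c(v)
    alpha = 1.0 - sum(bigram_c[(v, w2)] - beta for w2 in seen_after_v) / c_v
    # Redistribute alpha(v) over B(v) in proportion to the unigram ML estimates.
    unseen_count_mass = sum(c for w2, c in unigram_c.items() if w2 not in seen_after_v)
    if unseen_count_mass == 0:
        return 0.0
    return alpha * unigram_c.get(w, 0) / unseen_count_mass
```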
Week2
Applying Generative Models to Tagging Problems
An overview of how each problem in generative tagging is handled:
- Probabilities → Trigram HMMs
- Estimation → Maximum likelihood
- Maximization → The Viterbi Algorithm
SECTION4 Elaborating Tagging Problems
4.1 Definition of Tagging Problems
From a set of training examples $(x^{(i)},y^{(i)})$ for $i=1,\ldots,m$, where each $x^{(i)}$ is a sentence $x^{(i)}_1\ldots x^{(i)}_{n_i}$ and each $y^{(i)}$ is a tag sequence $y^{(i)}_1\ldots y^{(i)}_{n_i}$ (the length $n_i$ may differ from sentence to sentence), the task is to learn a function (or, more formally, a mapping) from sentences to tag sequences using these training examples.
4.2 Two major classes of Tagging Problems
- Part-of-speech (POS) tagging
- Named-entity recognition
4.3 Two Sources of Constraints

- Local: for an individual word, one tag (meaning) is often much more probable than the others, regardless of context.
- Contextual: given the surrounding words, some tags become much more probable than others.
- Sometimes these two sources of constraints conflict.
SECTION5 Generative Models
5.1 Introducing Supervised Learning
(omitted)
5.2 Discriminative Models

- Estimate $p(y \mid x)$ directly.
- Learn a conditional model: a distribution $p(y\mid x)$ for any pair $(x,y)$.
- Use the following function $f(x)$ to predict: $f(x)=\arg\max_{y}p(y\mid x)$.
5.3 Generative Models (Noisy Channel Model)

- Estimate the joint probability $p(x,y)$ instead.
- Learn a joint distribution model: $p(x,y)=p(y)\,p(x\mid y)$, i.e. a prior $p(y)$ together with a conditional model $p(x\mid y)$.
- Bayes' rule: $p(y\mid x)=\frac{p(y)\,p(x\mid y)}{p(x)}$, where $p(x)=\sum_{y}p(y)\,p(x\mid y)$.
- Use the following function $f(x)$ to predict: $f(x)=\arg\max_{y}p(y\mid x)=\arg\max_{y}\frac{p(y)\,p(x\mid y)}{p(x)}=\arg\max_{y}p(y)\,p(x\mid y)$, since the denominator $p(x)$ does not depend on $y$.
(KEY) 5.4 Generative Tagging Models
- Learning process: learn a probability distribution $p$ such that:
  - $\forall\ s=\langle x_1\ldots x_n,y_1\ldots y_n\rangle\in\mathcal{S},\ p(s)\geq0$
  - $\sum_{s\in\mathcal{S}}p(s)=1$
- Decision process: use the following function $f$ to make a decision: $f(x_1\ldots x_n)=\arg\max_{y_1\ldots y_n}p(x_1\ldots x_n,y_1\ldots y_n)$
- Notation: $\arg\max_{y_1\ldots y_n}$ denotes the tag sequence that achieves the maximum.
5.5 Major Problems with Generative Models
- How should the joint probability $p(x_1\ldots x_n,y_1\ldots y_n)$ be decomposed?
- How can the parameters of the model be estimated from training examples?
- How can the maximum in the decision process be found efficiently?
SECTION6 Solving the Problems in Generative Models
6.1 Trigram Hidden Markov Models > Probability

Contains a finite set $\mathcal{V}$ of all possible words and a finite set $\mathcal{K}$ of all possible tags. Define $\mathcal{S}$ to be the set of all sentence/tag-sequence pairs $s=\langle x_1\ldots x_n,y_1\ldots y_{n+1}\rangle$, and note that $y_{n+1}=\mbox{STOP}$.

Parameter 1: $q(s\mid u,v)$ for any trigram $(u,v,s)$ such that $s\in\mathcal{K}\cup\{\mbox{STOP}\}$ and $u,v\in\mathcal{K}\cup\{*\}$.

Parameter 2: $e(x\mid s)$ for any $x\in\mathcal{V}$, $s\in\mathcal{K}$.

Formula: for any $s=\langle x_1\ldots x_n,y_1\ldots y_{n+1}\rangle\in\mathcal{S}$, $p(x_1\ldots x_n,y_1\ldots y_{n+1})=\prod_{i=1}^{n+1}q(y_i\mid y_{i-2},y_{i-1})\prod_{i=1}^{n}e(x_i\mid y_i)$, where $y_0=y_{-1}=*$.

The derivation of the formula is essentially the same as for the noisy-channel model shown in Section 5.3, except for the independence assumption (equation 2.7 in the note): each word $x_i$ is assumed to depend only on its own tag $y_i$, i.e. $P(X_i=x_i\mid Y_i=y_i,\ \mbox{everything else})=P(X_i=x_i\mid Y_i=y_i)$; the dependencies on the preceding words and on all other tags are ignored.
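To make the formula concrete with a toy example of my own (not from the note): for the sentence "the dog barks" tagged as $\mbox{D N V}$, the model gives $p(\mbox{the dog barks},\mbox{D N V STOP})=q(\mbox{D}\mid *,*)\,q(\mbox{N}\mid *,\mbox{D})\,q(\mbox{V}\mid \mbox{D},\mbox{N})\,q(\mbox{STOP}\mid \mbox{N},\mbox{V})\times e(\mbox{the}\mid \mbox{D})\,e(\mbox{dog}\mid \mbox{N})\,e(\mbox{barks}\mid \mbox{V})$.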
6.2 Maximum Likelihood Method > Estimating Parameters
Just count occurrences: $q(s\mid u,v)=\frac{c(u,v,s)}{c(u,v)}$ and $e(x\mid s)=\frac{c(s\rightsquigarrow x)}{c(s)}$, where $c(s\rightsquigarrow x)$ is the number of times the word $x$ is seen paired with the tag $s$ in the training data.
More advanced estimation tricks (smoothing) are shown above in SECTION3. A brief counting sketch follows below.
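A brief counting sketch in Python (my own function and variable names; it assumes training data given as lists of `(word, tag)` pairs, and would need the smoothing from SECTION3 before handling unseen contexts):

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood estimates of trigram-HMM parameters q and e.

    q(s | u, v) = c(u, v, s) / c(u, v) over tag trigrams (with '*' padding
    and a final STOP tag); e(x | s) = c(s ~> x) / c(s) over word/tag pairs.
    """
    tag_trigram_c = defaultdict(int)
    tag_bigram_c = defaultdict(int)
    emission_c = defaultdict(int)
    tag_c = defaultdict(int)
    for pairs in tagged_sentences:
        tags = ['*', '*'] + [t for _, t in pairs] + ['STOP']
        for u, v, s in zip(tags, tags[1:], tags[2:]):
            tag_trigram_c[(u, v, s)] += 1
            tag_bigram_c[(u, v)] += 1
        for x, s in pairs:
            emission_c[(s, x)] += 1
            tag_c[s] += 1

    def q(s, u, v):
        return tag_trigram_c[(u, v, s)] / tag_bigram_c[(u, v)]

    def e(x, s):
        return emission_c[(s, x)] / tag_c[s]

    return q, e
```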
6.3 The Viterbi Algorithm > Finding the Maximum
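The body of this section stops here in my notes, so below is only a rough sketch of the standard dynamic program $\pi(k,u,v)=\max_{w}\pi(k-1,w,u)\,q(v\mid w,u)\,e(x_k\mid v)$ in Python (my own implementation, not code from the course); `q` and `e` are assumed to be smoothed so that unseen events return 0 rather than raising errors:

```python
def viterbi(sentence, tag_set, q, e):
    """Find the most likely tag sequence for `sentence` under a trigram HMM.

    pi[(k, u, v)] is the max probability of any tag sequence ending in tags
    (u, v) at position k; bp stores backpointers for recovering the sequence.
    """
    n = len(sentence)

    def K(k):
        # Tags allowed at position k; positions 0 and -1 carry the '*' padding.
        return ['*'] if k <= 0 else tag_set

    pi = {(0, '*', '*'): 1.0}
    bp = {}
    for k in range(1, n + 1):
        x = sentence[k - 1]
        for u in K(k - 1):
            for v in K(k):
                best_w, best_p = None, 0.0
                for w in K(k - 2):
                    p = pi.get((k - 1, w, u), 0.0) * q(v, w, u) * e(x, v)
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w
    # Pick the final tag pair that also accounts for the STOP transition.
    u, v = max(((u, v) for u in K(n - 1) for v in K(n)),
               key=lambda uv: pi[(n, uv[0], uv[1])] * q('STOP', uv[0], uv[1]))
    tags = [None] * (n + 1)
    tags[n - 1], tags[n] = u, v
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]
```

The running time is $O(n|\mathcal{K}|^3)$, compared with the exponential cost of brute-force enumeration over all tag sequences.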
LAST UPDATED ON Feb 17, 2017