# Michael Collins' NLP Class Note 1

January 27, 2017 - 5 minute read -

# Week1

## SECTION1 Intro

### 1.1 The Definition of Natural Language Processing (NLP)

Computers using natural language as input and/or output.

• Natural Language Understanding (NLU)
• Natural Language Generation (NLG)

### 1.2 Applications of Natural Language Processing

• machine translation
• information extraction
• text summarization
• dialogue systems

### 1.3 Major Problems of NLP

• Tagging problem (part-of-speech tagging, named entity recognition)
• Parsing problem

• Acoustic
• Syntactic
• Semantic
• Discourse

## SECTION2 Language Model

### 2.1 Formal Definition of Vocabulary and Sentence

• Vocabulary set $\mathcal{V}$: all words in the language; finite.
• Sentence: a sequence of words $x_1x_2\dots x_n$ where $n\ge1$ is an integer, $x_i\in\mathcal{V}$ for $i \in \{1,\dots,n-1\}$, and $x_n$ is a special symbol, STOP.
• Sentence set $\mathcal{V}^\dagger$: all sentences over the vocabulary $\mathcal{V}$; infinite.

### (KEY) 2.2 Defining Language Model

• Two major parts: a finite set $\mathcal{V}$ and a function $p(x_1,x_2,\dots,x_n)$
• For all $\langle x_1,\dots,x_n\rangle\in\mathcal{V}^\dagger$, $p(x_1,x_2,\dots,x_n)\ge0$, and $\sum_{\langle x_1,\dots,x_n\rangle\in\mathcal{V}^\dagger} p(x_1,x_2,\dots,x_n)=1$

Personally, I see it as trying to grasp an overall picture of how words and sentences are distributed in a training set.

• It’s not hard to get this distribution via a maximum likelihood method.

### 2.3 Markov Process

• Chain rule of joint probability: $P(A,B) = P(A)\times P(B\mid A)$
• First-order: $P(X_n=x_n \mid X_{n-1}=x_{n-1},\dots,X_1=x_1) = P(X_n=x_n \mid X_{n-1}=x_{n-1})$. Hence, $P(X_1=x_1,X_{2}=x_2,\dots,X_n=x_n) = P(X_1=x_1)\prod_{i=2}^n P(X_i=x_i \mid X_{i-1}=x_{i-1})$
• Second-order: $P(X_n=x_n \mid X_{n-1}=x_{n-1},\dots,X_1=x_1) = P(X_n=x_n \mid X_{n-1}=x_{n-1},X_{n-2}=x_{n-2})$. Hence, $P(X_1=x_1,X_{2}=x_2,\dots,X_n=x_n) = \prod_{i=1}^n P(X_i=x_i \mid X_{i-2}=x_{i-2},X_{i-1}=x_{i-1})$ where $x_0=x_{-1}=*$, a special "start" symbol.
• Pushing it into a more general form: the length $n$ itself becomes a random variable, with the sequence ending in $x_n=\mbox{STOP}$.
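The chain rule under a first-order Markov assumption can be sketched in a few lines of Python; the transition probabilities below are made-up toy values, purely for illustration:

```python
# P(X_i = x_i | X_{i-1} = x_{i-1}); hypothetical toy values.
trans = {
    ("*", "the"): 0.5,
    ("the", "dog"): 0.2,
    ("dog", "STOP"): 0.3,
}

def sequence_prob(words):
    """Score a sequence by the chain rule under the first-order Markov assumption."""
    p, prev = 1.0, "*"  # x_0 = * is the start symbol
    for w in words:
        p *= trans.get((prev, w), 0.0)  # unseen transitions get probability 0
        prev = w
    return p

print(sequence_prob(["the", "dog", "STOP"]))  # 0.5 * 0.2 * 0.3 ≈ 0.03
```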

### (KEY) 2.4 Trigram Language Model

• Parameters to estimate: $q(w\mid u,v)$
• The probability of a sentence: $p(x_1\dots x_n)=\prod_{i=1}^n q(x_i\mid x_{i-2},x_{i-1})$ where $x_0=x_{-1}=*$
• Key problem: how to estimate the parameters
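The product above translates directly into code; a minimal sketch, with hypothetical $q$ values:

```python
# q(w | u, v): hypothetical trigram parameters for illustration only.
q = {
    ("*", "*", "the"): 0.6,
    ("*", "the", "dog"): 0.4,
    ("the", "dog", "STOP"): 0.5,
}

def sentence_prob(sentence):
    """p(x_1...x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}); sentence ends with STOP."""
    p, u, v = 1.0, "*", "*"  # x_0 = x_{-1} = *
    for w in sentence:
        p *= q.get((u, v, w), 0.0)
        u, v = v, w
    return p

print(sentence_prob(["the", "dog", "STOP"]))  # 0.6 * 0.4 * 0.5 ≈ 0.12
```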

### 2.5 Using the Maximum Likelihood Method to Estimate

• Estimating Formula: $q(w\mid u,v) = \frac{c(u,v,w)}{c(u,v)}$

• Problems:

1. a very large number of parameters (one for each possible trigram)
2. the numerator may be zero, forcing the probability of any sentence containing that trigram to zero
3. the denominator may be zero, leaving the estimate undefined
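The estimate is literally a ratio of counts; a sketch over a made-up two-sentence corpus (the zero-denominator guard below papers over problem 3 rather than solving it):

```python
from collections import Counter

# Toy corpus, already padded with * and terminated by STOP (made up).
corpus = [["*", "*", "the", "dog", "STOP"],
          ["*", "*", "the", "cat", "STOP"]]

tri, bi = Counter(), Counter()
for s in corpus:
    for i in range(2, len(s)):
        tri[(s[i - 2], s[i - 1], s[i])] += 1  # c(u, v, w)
        bi[(s[i - 2], s[i - 1])] += 1         # c(u, v)

def q_ml(w, u, v):
    """q(w | u, v) = c(u, v, w) / c(u, v)."""
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

print(q_ml("dog", "*", "the"))  # 1/2 = 0.5
```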

### 2.6 Evaluation Metric: Perplexity

• Use held-out data to test the model’s ability to deal with unseen data
• What perplexity measures: the model’s effective vocabulary size. The smaller it is, the better the model performs.

For example, if the model predicts every word with uniform probability $\frac{1}{10}$, then each prediction is as uncertain as a blind choice among ten candidate words, and the perplexity works out to ten. So the effective vocabulary size is ten.

• Definition of perplexity: $2^{-l}=\frac{1}{t}$, where $l=\frac{1}{M}\sum_{i=1}^m\log_2 p(x^{(i)})$, $t = \sqrt[M]{\prod_{i=1}^m p(x^{(i)})}$, $m$ is the number of sentences in the test corpus, and $M$ is its total number of words.
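The definition translates directly into code; a sketch, where the sentence probabilities passed in are hypothetical:

```python
import math

def perplexity(sentence_probs, M):
    """2^{-l} with l = (1/M) * sum_i log2 p(x^(i)); M = total word count."""
    l = sum(math.log2(p) for p in sentence_probs) / M
    return 2 ** (-l)

# One 10-word test sentence whose words are each predicted with prob 1/10:
print(perplexity([0.1 ** 10], M=10))  # ≈ 10, the effective vocabulary size
```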

## SECTION3 Smoothed Estimation of Trigram Models

### 3.1 Linear Interpolation

• Core idea: using a linear combination of probabilities to estimate trigram probability. $q(w\mid u,v) = \lambda_1\times q_{ml}(w\mid u,v)+\lambda_2\times q_{ml}(w\mid v)+\lambda_3\times q_{ml}(w)$ where $\lambda_1+\lambda_2+\lambda_3=1$
• Define the objective function: $L(\lambda_1,\lambda_2,\lambda_3)=\sum_{u,v,w}c'(u,v,w)\log q(w\mid u,v)$, where $c'(u,v,w)$ is the count of the trigram $(u,v,w)$ in held-out data
• To choose the parameter $\lambda_1,\lambda_2,\lambda_3$: $\mathop{argmax}_{\lambda_1,\lambda_2,\lambda_3} L(\lambda_1,\lambda_2,\lambda_3)$
• A practical way of choosing these $\lambda$s (bucketing): partition the contexts by their count $c(u,v)$ into a few buckets and estimate a separate $(\lambda_1,\lambda_2,\lambda_3)$ for each bucket, so that frequent contexts put more weight on $q_{ml}(w\mid u,v)$ while rare contexts fall back toward the bigram and unigram estimates.
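The interpolation itself is one line of arithmetic; a sketch with hypothetical $\lambda$ values (in practice they would be tuned on held-out data):

```python
def q_interp(q3, q2, q1, lambdas=(0.5, 0.3, 0.2)):
    """q(w|u,v) = l1*q_ml(w|u,v) + l2*q_ml(w|v) + l3*q_ml(w); lambdas sum to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * q3 + l2 * q2 + l3 * q1

# Even when the trigram estimate is 0, the result stays positive
# as long as the bigram or unigram estimate is:
print(q_interp(0.0, 0.2, 0.1))  # 0.3*0.2 + 0.2*0.1 ≈ 0.08
```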

### 3.2 Discounting Methods

• Discounted counts: $c^*(v,w)=c(v,w)-\beta$, where $0<\beta<1$ (e.g. $\beta=0.5$)
• Missing probability mass: $\alpha(v)=1-\sum_{w:c(v,w)>0}\frac{c^*(v,w)}{c(v)}$
• Two sets: $\mathcal{A}(v) = \{ w\mid c(v,w)>0 \}$ and $\mathcal{B}(v) = \{ w\mid c(v,w)=0 \}$
• Bigram estimate: $q_D(w\mid v)=\begin{cases}\frac{c^*(v,w)}{c(v)} & \text{if } w\in\mathcal{A}(v)\\ \alpha(v)\times\frac{q_{ml}(w)}{\sum_{w'\in\mathcal{B}(v)}q_{ml}(w')} & \text{if } w\in\mathcal{B}(v)\end{cases}$
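A sketch of the discounted bigram estimate with back-off to unigrams; $\beta$ and all the counts below are made up for illustration:

```python
from collections import Counter

beta = 0.5  # discount; hypothetical choice
c_bigram = Counter({("the", "dog"): 3, ("the", "cat"): 1})
c_unigram = Counter({"the": 4, "dog": 3, "cat": 1, "fish": 2})
N = sum(c_unigram.values())  # total tokens

def q_discounted(w, v):
    seen = {x for (a, x) in c_bigram if a == v}         # A(v)
    if w in seen:
        return (c_bigram[(v, w)] - beta) / c_unigram[v]  # c*(v,w) / c(v)
    # missing mass alpha(v), redistributed over B(v) by unigram ML estimates
    alpha = 1 - sum((c_bigram[(v, x)] - beta) / c_unigram[v] for x in seen)
    unseen = [x for x in c_unigram if x not in seen]     # B(v)
    denom = sum(c_unigram[x] / N for x in unseen)
    return alpha * (c_unigram[w] / N) / denom

print(q_discounted("dog", "the"))   # (3 - 0.5) / 4 = 0.625
print(q_discounted("fish", "the"))  # gets a share of alpha("the") = 0.25
```

A quick sanity check: summed over all four words, $q_D(\cdot\mid\mbox{the})$ comes back to 1, so the discounting only moves mass around rather than losing it.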

# Week2

Applying Generative Models to Tagging Problems

• Probabilities → Trigram HMMs
• Estimation → Maximum likelihood
• Maximization → The Viterbi Algorithm

## SECTION4 Elaborating Tagging Problems

### 4.1 Definition of Tagging Problems

From a set of training examples $(x^{(i)},y^{(i)})$ for $i = 1,\dots,m$, where each $x^{(i)}$ is a sentence $x^{(i)}_1\dots x^{(i)}_{n_i}$ and each $y^{(i)}$ is a tag sequence $y^{(i)}_1\dots y^{(i)}_{n_i}$ (the lengths $n_i$ may differ across sentences), the task is to learn a function (more formally, a mapping) from sentences to tag sequences.

### 4.2 Two major classes of Tagging Problems

• Part-of-speech (POS) tagging
• Named-entity recognition (NER)

### 4.3 Two Sources of Constraints

• Local: a word taken on its own strongly prefers one tag over the others
• Contextual: the surrounding words and tags make some tags for a word much more likely than others
• Sometimes the two sources conflict

## SECTION5 Generative Models

(omitted)

### 5.2 Discriminative Models

• Estimating $p(y \mid x)$ directly

• Learning a conditional model:$p(y \mid x)$

• Using the following function $f(x)$ to predict: $f(x)=\mathop{argmax}_{y\in\mathcal{Y}} \ p(y\mid x)$

### 5.3 Generative Models (Noisy Channel Model)

• Estimating $p(x,y)$ instead

• Learning a joint distribution model:$p(x,y)=p(y)p(x\mid y)$

• Bayes’ rule: $p(y\mid x)=\frac{p(y)p(x\mid y)}{p(x)}$

• Using the following function $f(x)$ to predict: $f(x)=\mathop{argmax}_{y\in\mathcal{Y}} \ p(y)p(x\mid y)$

### (KEY) 5.4 Generative Tagging Models

• Learning Process: learn a probability distribution $p$:
1. $\forall \ s=\langle x_1…x_n,y_1…y_n \rangle\in\mathcal{S},p(s)\geq0$
2. $\sum_{s\in\mathcal{S}}p(s)=1$
• Decision process: using the following function $f$ to make a decision: $f(x_1...x_n)=\mathop{argmax}_{y_1...y_n}\ p(x_1...x_n,y_1...y_n)$


### 5.5 Major Problems with Generative Models

1. How to elaborate the probability $p(x_1…x_n,y_1…y_n)$?
2. How to estimate the parameters of the model from training examples?
3. How to efficiently find the maximum in the decision process?

## SECTION6 Solving the Problems in Generative Models

### 6.1 Trigram Hidden Markov Models -> Probability

• Contains a finite set $\mathcal{V}$ of all possible words and a finite set $\mathcal{K}$ of all possible tags. Define $\mathcal{S}$ to be the set of all sentence/tag-sequence pairs $s=\langle x_1\dots x_n,y_1\dots y_{n+1} \rangle$, where $y_{n+1}=\mbox{STOP}$.

• Parameter 1: $q(s\mid u,v)$ for any trigram $(u,v,s)$ such that $s\in\mathcal{K}\cup\{\mbox{STOP}\}$ and $u,v\in\mathcal{K}\cup\{\mbox{*}\}$.

• Parameter 2: $e(x\mid s)$ for any $x\in\mathcal{V}$, $s\in\mathcal{K}$.

• Formula: for any $s=\langle x_1…x_n,y_1…y_{n+1} \rangle\in\mathcal{S}$, $p(s)=\prod_{i=1}^{n+1}{q(y_i\mid y_{i-2},y_{i-1})}\prod_{i=1}^{n}{e(x_i\mid y_i)}$, where $y_0=y_{-1}=*$

• The derivation of the formula is largely the same as for the noisy channel model in section 5.3, except for the independence assumptions below (equation 2.7 in the note): $P(Y_i=y_i\mid \mbox{everything before position } i)=q(y_i\mid y_{i-2},y_{i-1})$ and $P(X_i=x_i\mid \mbox{all tags and the other words})=e(x_i\mid y_i)$. It ignores the dependencies both on the preceding words and on the other, irrelevant tags.
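The joint probability formula translates directly; a minimal sketch, where the $q$ and $e$ tables are hypothetical toy parameters:

```python
# Hypothetical HMM parameters for a tiny D/N tag set.
q = {("*", "*", "D"): 1.0, ("*", "D", "N"): 1.0, ("D", "N", "STOP"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.4}

def hmm_joint(words, tags):
    """p(x_1..x_n, y_1..y_{n+1}) with y_0 = y_{-1} = * and y_{n+1} = STOP."""
    full = ["*", "*"] + tags + ["STOP"]
    p = 1.0
    for i in range(2, len(full)):               # transition terms q(y_i | y_{i-2}, y_{i-1})
        p *= q.get((full[i - 2], full[i - 1], full[i]), 0.0)
    for w, t in zip(words, tags):               # emission terms e(x_i | y_i)
        p *= e.get((w, t), 0.0)
    return p

print(hmm_joint(["the", "dog"], ["D", "N"]))  # 1 * 1 * 1 * 0.5 * 0.4 ≈ 0.2
```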

### 6.2 Maximum Likelihood Method -> Estimating Parameters

Just count occurrences: $q(s\mid u,v) = \frac{c(u,v,s)}{c(u,v)}$ and $e(x\mid s) = \frac{c(s\leadsto x)}{c(s)}$, where $c(s\leadsto x)$ is the number of times word $x$ is tagged $s$ in the training data.

Advanced estimation tricks are shown above in Section 3.
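The counting can be sketched over a made-up two-sentence tagged corpus:

```python
from collections import Counter

# Toy tagged corpus (made up): lists of (word, tag) pairs.
corpus = [
    [("the", "D"), ("dog", "N")],
    [("the", "D"), ("cat", "N")],
]

tag_tri, tag_bi = Counter(), Counter()
emit, tag_uni = Counter(), Counter()
for sent in corpus:
    tags = ["*", "*"] + [t for _, t in sent] + ["STOP"]
    for i in range(2, len(tags)):
        tag_tri[(tags[i - 2], tags[i - 1], tags[i])] += 1  # c(u, v, s)
        tag_bi[(tags[i - 2], tags[i - 1])] += 1            # c(u, v)
    for w, t in sent:
        emit[(t, w)] += 1                                  # c(s -> x)
        tag_uni[t] += 1                                    # c(s)

def q_est(s, u, v):
    return tag_tri[(u, v, s)] / tag_bi[(u, v)]

def e_est(x, s):
    return emit[(s, x)] / tag_uni[s]

print(q_est("D", "*", "*"), e_est("dog", "N"))  # 1.0 and 0.5
```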

### 6.3 The Viterbi Algorithm -> Finding the Maximum
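The Viterbi algorithm finds the argmax by dynamic programming over tag bigrams with backpointers; a compact sketch (the $q$ and $e$ tables in the usage are hypothetical toy values):

```python
def viterbi(words, tagset, q, e):
    """Return (best tag sequence, its joint probability) under a trigram HMM."""
    n = len(words)
    pi = {(0, "*", "*"): 1.0}  # pi[(k, u, v)]: best prob of any sequence ending in (u, v)
    bp = {}                    # backpointers
    S = lambda k: ["*"] if k <= 0 else tagset  # allowed tags at position k
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                best, arg = -1.0, None
                for w in S(k - 2):
                    p = (pi.get((k - 1, w, u), 0.0)
                         * q.get((w, u, v), 0.0)
                         * e.get((words[k - 1], v), 0.0))
                    if p > best:
                        best, arg = p, w
                pi[(k, u, v)], bp[(k, u, v)] = best, arg
    # include the transition to STOP, then backtrack
    best, pair = -1.0, (None, None)
    for u in S(n - 1):
        for v in S(n):
            p = pi[(n, u, v)] * q.get((u, v, "STOP"), 0.0)
            if p > best:
                best, pair = p, (u, v)
    y = [None] * (n + 1)
    y[n - 1], y[n] = pair
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:], best

# Hypothetical toy parameters:
q = {("*", "*", "D"): 1.0, ("*", "D", "N"): 1.0, ("D", "N", "STOP"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.4}
print(viterbi(["the", "dog"], ["D", "N"], q, e))  # → (['D', 'N'], 0.2)
```

The dynamic program runs in $O(n|\mathcal{K}|^3)$ time instead of the $O(|\mathcal{K}|^n)$ brute-force search over all tag sequences.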

LAST UPDATED ON Feb 17, 2017