Michael Collins' NLP Class Note 1

January 27, 2017 - 5 minute read -

This article is a collection of notes for Michael Collins’ NLP Class on Coursera.

Week1

SECTION1 Intro

1.1 The Definition of Natural Language Processing(NLP)

Computers using natural language as input and/or output.

  • Natural Language Understanding(NLU)
  • Natural Language Generating(NLG)

1.2 Applications of Natural Language Processing

  • machine translation
  • information extraction
  • text summarization
  • dialogue systems

1.3 Major Problems of NLP

  • Tagging problem(part-of-speech tagging, named entity recognition)
  • Parsing problem

1.4 Sources of Ambiguity, where the difficulties lie

  • Acoustic
  • Syntactic
  • Semantic
  • Discourse

SECTION2 Language Model

2.1 Formal Definition of Vocabulary and Sentence

  • Vocabulary set $\mathcal{V}$: all words in the language, finite.
  • Sentence: a sequence of words $\mathcal{x}_1\mathcal{x}_2…\mathcal{x}_n$ where $n$ is integer and $n>1$, $x_i\in\mathcal{V}$ for $i \in {1,…,n-1}$, and $x_n$ is a special symbol, STOP.
  • Sentence set $\mathcal{V}^\dagger$: all sentences with the vocabulary $\mathcal{V}$, infinite.

(KEY) 2.2 Defining Language Model

  • Two major parts: a finite set $\mathcal{V}$ and a function $p(x_1,x_2,…,x_n)$
  • $\forall\langle x_1,…,x_n\rangle\in\mathcal{V}^\dagger$, $p(x_1,x_2,…,x_n)\ge0$, and

Personally, it is trying to grasp an overall understanding of how words and sentences are distributed in a training set.

  • It’s not hard to get this distribution via a maximum likelihood method.

2.2 Markov Process

  • Chain Rule in Joint Probability
  • First-Order Hence,
  • Second-Order Hence,
  • General Form where $x_0=x_{-1}=*$, which is a special “start” symbol.
  • Push it into a more varying form: the length $n$ becomes a random variable.

(KEY) 2.3 Trigram Language Model

  • Estimating parameter: $q(w\mid u,v)$
  • The probability of sentences $p$:
  • Key Problem: to estimate the parameters

2.4 Using Maximum Likelihood Method to Estimate

  • Estimating Formula:

  • Problems:

    1. large sum of bigrams
    2. numerator being zero
    3. denominator being zero

2.5 Evaluation Metric: Perplexity

  • Applying Held-out data to test the model’s ability to deal with unseen data
  • What is perplexity: to estimate the model’s efficient vocabulary size. The smaller it is, the better the model performs.

For example, if there is only one sentence $x^{(1)}$ in the test corpus, and it’s probability $p(x^{(1)})=0.1$, then it means approximately in ten words the sentences will appear, which means the model can detect the sentences in ten words. So the effective vocabulary is ten.

  • Definition of Perplexity: Where

SECTION3 Smoothed Estimation of Trigram Models

3.1 Linear Interpolation

  • Core idea: using a linear combination of probabilities to estimate trigram probability. where
  • Define the loss function:
  • To choose the parameter $\lambda_1,\lambda_2,\lambda_3$:
  • A practical way of choosing these $\lambda$s (bucketing):

3.2 Discounting Methods

  • Discounted counts:
  • Missing Probability:
  • Two sets:
  • Bigram:

Week2

Applying Generative Models to Tagging Problems

op1=>operation: Probabilities
op2=>subroutine: Trigram HMMs
op1(right)->op2

op3=>operation: Estimation
op4=>subroutine: Maximum likelihood
op3(right)->op4
op5=>operation: Maximization
op6=>subroutine: The Viterbi Algorithm
op5(right)->op6

SECTION4 Elaborating Tagging Problems

4.1 Definition of Tagging Problems

From a set of training examples, $(x^{(i)},y^{(i)})\ \forall i = 1,…,m$, where each $x^{(i)}$ is a sentence $x^{(i)}1…x^{(i)}{n_i}$, and each $y^{(i)}$ is a tag sequence $y^{(i)}1…y^{(i)}{n_i}$, with the assumption that the length of sentences are varying in different sentences, the task is to learn a function(or more formally, a mapping) that maps sentences to tag sequences from these training examples.

4.2 Two major classes of Tagging Problems

  • Part of Speech Recognition
  • Named-Entity Recognition

4.3 Two Sources of Constraints

  • Local the probability of a meaning of a word appears is much higher
  • Contextual the probability of a meaning of a word appears another word is much higher
  • Sometimes there are conflicts

SECTION5 Generative Models

5.1 Introducing the Supervised Learning

(omitted)

5.2 Discriminative Models

  • Estimating $p(y \mid x)$ directly

  • Learning a conditional model:

  • Using the following function $f(x)$ to predict:

5.3 Generative Models (Noisy Channel Model)

  • Estimating $p(x,y)$ instead

  • Learning a joint distribution model:

  • Bayes Equation:

  • Using the following function $f(x)$ to predict:

(KEY) 5.4 Generative Tagging Models

  • Learning Process: learn a probability distribution $p$:
    1. $\forall \ s=\langle x_1…x_n,y_1…y_n \rangle\in\mathcal{S},p(s)\geq0$
    2. $\sum_{s\in\mathcal{S}}p(s)=1$
  • Decision process: using the following function $f$ to make a decision:

  • Notations

5.5 Major Problems with Generative Models

  1. How to elaborate the probability $p(x_1…x_n,y_1…y_n)$?
  2. How to estimate the parameters of the model from training examples?
  3. How to efficiently find the maximum in the decision process?

SECTION6 Solving the Problems in Generative Models

6.1 Trigram Hidden Markov Models -> Probability

  • Contains a finite set $\mathcal{V}$ of all possible words and a finite set $\mathcal{K}$ of all possible tages. Define $\mathcal{S}$ to be the set of all tag/sequence pairs $s=\langle x_1…x_n,y_1…y_{n+1} \rangle$ and note that $y_{n+1}=\mbox{STOP}$.

  • Parameter1: for any trigram $(u,v,s)$ such that $s\in\mathcal{K}\cup{\mbox{STOP}}$ and $u,v\in\mathcal{K}\cup{\mbox{*}}$.

  • Parameter2: for any $x\in\mathcal{v}, s\in\mathcal{K}$.

  • Formula: for any $s=\langle x_1…x_n,y_1…y_{n+1} \rangle\in\mathcal{S}$, , where $y_0=y_{-1}=*$

  • Derivation of the formula is generally same as the noisy channel model shown in section 5.3 except the independent assumption below(equation2.7 in the note): It ignores the dependencies both for the precedent words and for other irrelevant tags.

6.2 Maximum Likelihood Method -> Estimating Parameters

Just count the appearance number:

Advanced estimation tricks are shown above in Section3.

6.3 The Viterbi Algorithm -> Finding the Maximum

LAST UPDATED ON Feb 17, 2017