Deep Learning: An overview

Ahmed Touati

Data Scientist @ AXA Data Innovation Lab

Success stories

  • Record performance
    • MNIST (1988, 2003, 2012)
    • ImageNet (since 2012) and Object Recognition
    • ...
  • Real applications
    • Check reading (AT&T Bell Labs, 1995 – 2005)
    • Optical character recognition (Microsoft OCR, 2000)
    • Cancer detection from medical images (NEC, 2010)
    • Object recognition (Google and Baidu’s photo taggers, 2013)
    • Speech recognition (Microsoft, Google, IBM switched in 2012)
    • Natural Language Processing (NEC 2010)
    • ...

Plan

  1. Why deep learning? history and motivation
  2. Learning algorithm and regularization
  3. Computer vision: convolutional neural network
  4. Natural language processing: recurrent neural network

Historical waves of artificial neural networks

  • Cybernetics in the 1940s-1960s, with the development of theories of biological learning and the implementation of the perceptron
  • Connectionism approach with back-propagation to train neural networks with one or two hidden layers
  • Deep learning since 2006.

Cybernetics (1948)

  • Cybernetics connects control (actions taken in hope of achieving goals) with communication (connection and information flow between the actor and the environment).
  • Systems achieve goals through iterative processes or “feedback” loops

McCulloch & Pitts (1943)

  • A simplified neuron model: the Linear Threshold Unit.

Perceptron: first trained machine

  • First learning machine, built at Cornell in 1960
  • The perceptron was a linear classifier on top of a simple feature extractor
  • $y = \text{sign}(\sum_{i=1}^Nw_iF_i(x)+b)$
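As an illustration, a minimal Python sketch of this decision rule (the feature extractors and weights below are made-up placeholders, not the ones used in the original machine):

import numpy as np

def perceptron_predict(x, w, b, F):
    """Linear classifier on top of a fixed feature extractor F."""
    features = np.array([F_i(x) for F_i in F])      # F_i(x), i = 1..N
    return np.sign(np.dot(w, features) + b)

# toy usage with two hand-picked features of a 2-d input
F = [lambda x: x[0], lambda x: x[1] ** 2]
w = np.array([1.0, -0.5])
b = 0.1
print(perceptron_predict(np.array([0.3, 0.8]), w, b, F))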

Limitation of linear classifier

  • Cover theorem (1966): Let $\{x_1, x_2, ..., x_P\}$ be vectors in $\mathbb{R}^N$; the number of distinct dichotomies that can be realized by a hyperplane is: $C(N, P) = 2\sum_{k=0}^{N-1}\binom{P-1}{k}$
  • Abandon perceptrons and other analog computers.
  • Develop symbolic computers and symbolic AI techniques.

Symbolic AI

  • Development of knowledge-based systems capable of giving advice in a particular domain, utilizing knowledge provided by human experts.
  • Two main components:
    • Knowledge represented in a number of ways such as rules, frames and cases.
    • Inference engine which uses the knowledge to arrive at conclusions.

Quillian’s hierarchical propositional model (1968)

Connectionism

  • From psychological ideas of the XIXth and XXth centuries.
  • The central connectionist principle is that mental phenomena can be described by interconnected networks of simple and often uniform units.
  • Parallel Distributed Processing:
    • Neural representations are distributed.
    • Neural computation is parallel.
    • Processing units, connectivity, propagation rule, learning rule.

  Rumelhart's propositional network

Training Network

  • Replace threshold unit by sigmoid unit.
  • Collect training examples: $$(..(\text{Item}_{i}, \text{Relation}_{i}, \text{DesiredOutput}_i)..)$$
  • Form the mean squared error: $$ E = \sum_k(\text{DesiredOutput}(k) - \text{Output}(k))^2$$
  • Initialize with random weights and optimize by gradient descent.

Activations

Representations

Representations

Idea: Basis Function

  • $f(x, w) = \sum_{i=1}^mw_ih_i(x)$
  • Basis functions may be sample-centered (kernels) or centered in areas of high sample density (using an unsupervised clustering algorithm)
  • An interesting type of basis function is $\sigma(U^{T}x) = \frac{1}{1+\exp(-U^{T}x)}$ (single hidden layer neural network)
  • Universal approximation theorem (Cybenko 1989):
  • Let $\phi$ be a bounded, monotonically increasing continuous function. The set of functions of the form $$ f(x) = \sum_{i=1}^Nv_i\phi(w_i^Tx + b_i), \quad N \in \mathbb{N},\; v_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^m$$ is dense in the space of continuous functions on $[0, 1]^m$
  • Theoretician's dilemma: we can approximate any function as closely as we want with a shallow architecture. Why do we need deep ones?

The need of depth

  • N-bit parity: requires N-1 XOR gates arranged in a tree of depth $\log(N)$, whereas it requires an exponential number of gates if we restrict ourselves to 2 layers.

The need of depth

  • A series of hidden layers extracts increasingly abstract features from the image.
  • The first layer can easily identify edges by comparing the brightness of neighboring pixels.
  • The second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges.

Why has deep learning become so popular now?

  • Deep learning has become more useful as the amount of available training data has increased.
  • Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved (Nvidia GPUs, Google TPUs).

Dataset Over Algorithm

The average elapsed time between key algorithm proposals and corresponding advances was about 18 years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than 3 years, or about 6 times faster.

Imagenet Large Scale Visual Recognition Challenge

Error rates on the ImageNet dataset have fallen dramatically since the introduction of deep learning systems for classification and localization of objects in images.

Imagenet Large Scale Visual Recognition Challenge

Multilayer perceptron

Neural network with L hidden layers

  • layer pre-activation for $k>0$ ($h^{(0)}(x)=x$): $$ a^{(k)}(x) = b^{(k)} + W^{(k)}h^{(k-1)}(x)$$
  • hidden layer activation ($k \in \{1,..L\}$): $$ h^{(k)}(x) = g(a^{(k)}(x))$$
  • output layer activation (k=L+1): $$ h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$$

Activation function

  • sigmoid function: $$g(x) = \frac{1}{1+\exp(-x)}$$
  • Tanh function: $$g(x) = \tanh(x) = \frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$$
  • ReLU (rectified linear unit) function: $$g(x) = \max(x, 0)$$
  • We use the softmax activation function at the output layer: $$o(a) = \text{softmax}(a) = [\frac{\exp(a_1)}{\sum_c \exp(a_c)},\dots,\frac{\exp(a_C)}{\sum_c \exp(a_c)} ]$$
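To make the notation above concrete, here is a minimal numpy sketch of the forward pass, assuming ReLU hidden activations and a softmax output (the layer sizes in the toy example are arbitrary):

import numpy as np

def relu(a):
    return np.maximum(a, 0)

def softmax(a):
    e = np.exp(a - a.max())                         # shift for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """weights, biases: lists [W^(1)..W^(L+1)], [b^(1)..b^(L+1)]."""
    h = x                                           # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):     # hidden layers k = 1..L
        a = b + W @ h                               # pre-activation a^(k)(x)
        h = relu(a)                                 # activation h^(k)(x)
    return softmax(biases[-1] + weights[-1] @ h)    # output f(x)

# toy network: 4 inputs -> 5 hidden units -> 3 classes
rng = np.random.RandomState(0)
weights = [rng.randn(5, 4), rng.randn(3, 5)]
biases = [np.zeros(5), np.zeros(3)]
print(forward(rng.randn(4), weights, biases))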

Gradient descent algorithm

  • initialize $\theta, (\theta = \{W^{(1)}, b^{(1)},.. W^{(L+1)}, b^{(L+1)}\} )$
  • for K iterations: $$ \delta = - \frac{1}{N}\sum_{t \in \{1,..N\}}\nabla_{\theta}l(f(x^{(t)}; \theta), y^{(t)})$$ $$ \theta_{k+1} = \theta_k + \eta \delta $$
  • For classification, the loss function is the negative log-likelihood $f(x)_c = p(y=c|x)$ (sometimes referred to as cross-entropy): $$l(f(x), y) = - \sum_c 1_{(y=c)}\log f(x)_c = - \log f(x)_y $$

Backpropagation algorithm

Rumelhart et al., 1986: the first time back-propagation became popular

Gradient computation

Loss gradient at output pre-activation:

  • Partial derivative: $$\frac{\partial}{\partial a^{(L+1)}(x)_c}- \log f(x)_y$$ $$= - (1_{(y=c)} - f(x)_c)$$
  • Gradient: $$\nabla_{a^{(L+1)}(x)} (-\log f(x)_y)$$ $$= - (e(y) - f(x))$$

Gradient computation

Chain rule
  • if a function p(a) can be written as a function of intermediate results $q_i(a)$ then we have: $$ \frac{\partial p(a)}{\partial a} = \sum_i \frac{\partial p(a)}{\partial q_i(a)} \frac{\partial q_i(a)}{\partial a} $$

Gradient computation

Loss gradient at hidden layers: let's apply chain rule
  • Partial derivative: $$ \frac{\partial}{\partial h^{(k)}(x)_j} -\log f(x)_y = \sum_i \frac{\partial - \log f(x)_y}{\partial a^{(k+1)}(x)_i} \frac{\partial a^{(k+1)}(x)_i}{\partial h^{(k)}(x)_j}$$
  • Gradient: $$\nabla_{h^{(k)}(x)} - \log f(x)_y = (W^{(k+1)})^T (\nabla_{a^{(k+1)}(x)} - \log f(x)_y)$$

Gradient computation

Loss gradient at hidden layers pre-activation

  • Partial derivative: \begin{align} &\frac{\partial}{\partial a^{(k)}(x)_j} - \log f(x)_y\\ &= \frac{\partial - \log f(x)_y}{\partial h^{(k)}(x)_j} \frac{\partial h^{(k)}(x)_j}{\partial a^{(k)}(x)_j}\\ &= \frac{\partial - \log f(x)_y}{\partial h^{(k)}(x)_j} g'(a^{(k)}(x)_j) \end{align}
  • Gradient: $$\nabla_{a^{(k)}(x)} - \log f(x)_y = (\nabla_{h^{(k)}(x)}-\log f(x)_y) \odot [\dots, g'(a^{(k)}(x)_j), \dots]$$

Gradient computation

Loss gradient at parameters

  • Partial derivative: $$\frac{\partial}{\partial W^{(k)}_{i,j}} - \log f(x)_y = \frac{\partial - \log f(x)_y}{\partial a^{(k)}(x)_i} \frac{\partial a^{(k)}(x)_i}{\partial W^{(k)}_{i,j}}$$ $$\frac{\partial}{\partial b^{(k)}_{i}} - \log f(x)_y = \frac{\partial - \log f(x)_y}{\partial a^{(k)}(x)_i} \frac{\partial a^{(k)}(x)_i}{\partial b^{(k)}_i}$$
  • Gradient: $$\nabla_{W^{(k)}}-\log f(x)_y = (\nabla_{a^{(k)}}-\log f(x)_y)\,h^{(k-1)}(x)^T$$ $$\nabla_{b^{(k)}}-\log f(x)_y = \nabla_{a^{(k)}}-\log f(x)_y$$

Backpropagation algorithm

  • compute output gradient: $$\nabla_{a^{(L+1)}(x)} (-\log f(x)_y) = - (e(y) - f(x))$$
  • for k from L+1 to 1:
    • compute gradient of hidden layer parameters: $$\nabla_{W^{(k)}}-\log f(x)_y = (\nabla_{a^{(k)}}-\log f(x)_y)\,h^{(k-1)}(x)^T$$ $$\nabla_{b^{(k)}}-\log f(x)_y = \nabla_{a^{(k)}}-\log f(x)_y$$
    • compute gradient of the hidden layer below: $$\nabla_{h^{(k-1)}(x)} - \log f(x)_y = (W^{(k)})^T \nabla_{a^{(k)}(x)} - \log f(x)_y$$ $$\nabla_{a^{(k-1)}} - \log f(x)_y = (\nabla_{h^{(k-1)}(x)}-\log f(x)_y) \odot [\dots, g'(a^{(k-1)}(x)_j), \dots]$$
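A minimal numpy sketch of these recursions for a one-hidden-layer network with ReLU hidden units and a softmax output (names and shapes are illustrative, not from the slides):

import numpy as np

def backprop_one_hidden(x, y, W1, b1, W2, b2):
    """Gradients of -log f(x)_y for a 1-hidden-layer ReLU network."""
    # forward pass
    a1 = b1 + W1 @ x
    h1 = np.maximum(a1, 0)
    a2 = b2 + W2 @ h1
    f = np.exp(a2 - a2.max()); f /= f.sum()         # softmax output f(x)

    # output gradient: -(e(y) - f(x))
    grad_a2 = f.copy(); grad_a2[y] -= 1.0

    # parameter gradients at the output layer
    grad_W2 = np.outer(grad_a2, h1)
    grad_b2 = grad_a2

    # propagate to the hidden layer
    grad_h1 = W2.T @ grad_a2
    grad_a1 = grad_h1 * (a1 > 0)                    # elementwise g'(a^(1)(x))

    grad_W1 = np.outer(grad_a1, x)
    grad_b1 = grad_a1
    return grad_W1, grad_b1, grad_W2, grad_b2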

Gradient descent

Batch Optimization $$ \theta_{k+1} = \theta_k - \eta \frac{1}{N}\sum_{t \in \{1,..N\}}\nabla_{\theta}l(f(x^{(t)}; \theta), y^{(t)})$$
  • More expensive step
  • Can choose among a wide range of optimization algorithms
  • Opportunities for parallelism
Stochastic Optimization $$ \theta_{k+1} = \theta_k - \eta \nabla_{\theta}l(f(x^{(t)}; \theta), y^{(t)})$$ $$ t \in \{1,...,N\} \text{ is chosen randomly}$$
  • Very cheap iteration; gradient w.r.t. just 1 data point
  • Stochastic process dependent on the choice of t
  • Descent in expectation

Theoretical motivation: strongly convex case

  • Let $R_n(\theta) = \frac{1}{n}\sum_{t \in \{1,...,n\}}l(f(x^{(t)}; \theta), y^{(t)})$ and $\theta^*=\text{argmin}(R_n)$
  • Batch gradient: linear convergence $$R_n(\theta_k)-R_n(\theta^*) \leq O(\rho^k)$$ Per iteration cost proportional to n
  • Stochastic gradient: sublinear convergence $$\mathbb{E}[R_n(\theta_k)-R_n(\theta^*)] = O(\frac{1}{k})$$ Per iteration cost and convergence constant independent of n

                            Batch                            Stochastic
  convergence rate          $\rho^k$                         $\frac{1}{k}$
  $\epsilon$-optimality     $n\log(\frac{1}{\epsilon})$      $\frac{1}{\epsilon}$

Practical motivation

Fast initial progress of the stochastic gradient method, followed by a drastic slowdown

Beyond stochastic GD

Mini-batch gradient descent is widely used in practice: $$\theta_{k+1} = \theta_k - \eta \frac{1}{|S_k|}\sum_{t \in S_k}\nabla_{\theta}l(f(x^{(t)}; \theta), y^{(t)})$$ $$ S_k \subset \{1,...,n\} \text{ is chosen randomly in each iteration}$$
  • allows some degree of parallelization
  • reduces the variance of the stochastic gradient estimates (see the sketch below)
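A minimal sketch of the mini-batch loop, assuming a grad_loss(theta, X_batch, Y_batch) function (a placeholder, not defined in the slides) that returns the gradient averaged over the batch:

import numpy as np

def minibatch_sgd(theta, X, Y, grad_loss, eta=0.1, batch_size=32, epochs=10):
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)             # fresh random mini-batches each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]    # S_k
            theta = theta - eta * grad_loss(theta, X[idx], Y[idx])
    return theta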

Schematic of two-dimensional spectrum of optimization methods

Second-order methods

Batch gradient step $\eta \nabla_{\theta}F(\theta_k)$ ignores curvature of the function:

Newton scaling $B = (\nabla_{\theta}^2F(\theta_k))^{−1/2}$: gradient step moves to the minimizer:

...which corresponds to minimizing a quadratic model of F in the original space: $\theta_{k+1} = \theta_k - \eta s_k, \text{ where } \nabla_{\theta}^2F(\theta_k)s_k = -\nabla_{\theta}F(\theta_k)$

Gradient descent methods

  • We recall the gradient descent update: $$x_{t+1} = x_t + \delta_t, \text{where } \delta_t = -\eta g_t$$
  • Adagrad: large gradients have smaller learning rates and small gradients have large learning rates: $$ \delta_t = \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau}^2}}g_t$$
  • Momentum: accelerates progress along dimensions in which the gradients consistently point in the same direction, and slows progress along dimensions where the sign of the gradient keeps changing (see the sketch below). $$ \delta_t = \rho \delta_{t-1} - \eta g_t$$
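The two update rules above, written as a small sketch; g would come from backpropagation, and the epsilon in Adagrad is the usual small constant added for numerical stability (not shown on the slide):

import numpy as np

class Adagrad:
    def __init__(self, eta=0.01, eps=1e-8):
        self.eta, self.eps, self.sum_sq = eta, eps, None

    def delta(self, g):
        if self.sum_sq is None:
            self.sum_sq = np.zeros_like(g)
        self.sum_sq += g ** 2                       # accumulated squared gradients
        return -self.eta * g / (np.sqrt(self.sum_sq) + self.eps)

class Momentum:
    def __init__(self, eta=0.01, rho=0.9):
        self.eta, self.rho, self.prev = eta, rho, None

    def delta(self, g):
        if self.prev is None:
            self.prev = np.zeros_like(g)
        self.prev = self.rho * self.prev - self.eta * g   # delta_t
        return self.prev

# usage: theta = theta + optimizer.delta(g) at each step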

Gradient descent methods

  • Nesterov method: first make a big jump in the direction of the previously accumulated gradient; then measure the gradient where you end up and make a correction.
  • Other approaches: Adam, AdaDelta and RMSProp

What have we seen last time?

  • Multilayer perceptron: basic neural architecture
  • Learning algorithm: gradient descent
  • Backpropagation: efficient way to compute gradient
  • Gradient based optimization: Theoretical and practical motivation

What will we see today?

  • Non convex optimization
  • Regularization
  • Convolutional neural network: this is the main part
  • What does a neural network look like in TensorFlow?

Non-convex optimization

The mirage of convexity

  • Local minima dominate in low-D, but saddle points dominate in high-D
  • Most local minima are close to the bottom (global minimum error)

Are saddle points or local minima more common?

  • Imagine for each eigenvalue, you flip a coin
  • If heads, the eigenvalue is positive, if tails, negative
  • Need to get all heads to have a minimum
  • Higher dimensions -> exponentially less likely to get all heads
  • Random matrix theory: the coin is weighted; the lower the error is, the more likely it is to come up heads
  • So most local minima have low error!
  • Most critical points with high error are saddle points!
[Ian Goodfellow, Deep Learning Summer School Montreal]

Saddle points during training

    Oscillating between two behaviors:
  • Slowly approaching a saddle point.
  • Escaping it.

Regularization

Regularization: Dropout

A Simple Way to Prevent Neural Networks from Overfitting

Regularization: Dropout

  • Dropout is a form of model averaging: consider a single-hidden-layer network with H hidden units. Dropout is equivalent to sampling from $2^H$ models, each of which typically gets trained on only one training example.
  • At evaluation time, instead of sampling many different models and taking the geometric mean of their output distributions, we use one model with all the hidden units activated but halve their outgoing weights (see the sketch below).
Geoffrey Hinton's online course on neural networks for machine learning
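A minimal numpy sketch of the scheme described above: at training time each hidden unit is dropped with probability 0.5, and at test time all units are kept but their contribution is halved (equivalent to halving the outgoing weights):

import numpy as np

def dropout_hidden(h, p_drop=0.5, train=True):
    """Apply dropout to a vector of hidden activations h."""
    if train:
        mask = np.random.rand(*h.shape) > p_drop    # sample one of the 2^H sub-models
        return h * mask
    # evaluation: keep every unit but scale down, approximating the
    # geometric mean of the sampled models' output distributions
    return h * (1.0 - p_drop)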

Computer vision

Classification task example:
Challenges:
  • View point variation
  • Scale variation
  • Illumination conditions
  • Deformation
  • ...

Speed of processing in the human visual system

Hubel & Wiesel (1962)

    Biological insights:
  • Simple cells detect local features
  • Complex cells pool local features in a retinotopic neighborhood

The Neocognitron

LeNet5 (1998)

  • Convolutional neural networks use a sequence of 3 layers: convolution, pooling, non-linearity.
  • Use convolution to extract spatial features.
  • Subsample using spatial average of maps.
  • Non-linearity in the form of tanh or sigmoids.
  • Multi-layer neural network (MLP) as final classifier.
  • Sparse connection matrix between layers to avoid large computational cost.

Local connections

Convolution

Convolve the filter with the image: slide it over the image spatially and compute dot products (see the sketch below)
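A minimal numpy sketch of this sliding dot product, for a single-channel image and a single filter (no padding, stride 1):

import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation): slide the filter and take dot products."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)      # dot product with the filter
    return out

# toy example: a 2x2 vertical-edge filter on a 5x5 image
image = np.outer(np.ones(5), np.array([0., 0., 1., 1., 1.]))
kernel = np.array([[1., -1.], [1., -1.]])
print(conv2d(image, kernel))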

Multiple convolution

Convolution

Convolution

Pooling layer

  • introduces local translation invariance
  • reduces dimensionality

Convolutional neural network (Convnet) layers

  • Input layer: holds the raw pixel values of the image. 3 dimensions: width, height, and depth (color channels).
  • Convolutional layer: computes the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the region they are connected to in the input volume.
  • ReLU layer: applies an elementwise activation function.
  • Pooling layer: performs a downsampling operation along the spatial dimensions (width, height).
  • Fully connected layer: computes the class scores
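Putting these layers together, a minimal TensorFlow sketch of one convolution / ReLU / pooling block followed by a fully connected classifier (the filter size, number of filters and 10 output classes are arbitrary choices for illustration):

import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 28, 28, 1])   # input layer: raw pixel values

# convolutional layer: 4 filters of size 5x5
W_conv = tf.Variable(tf.truncated_normal([5, 5, 1, 4], stddev=0.1))
b_conv = tf.Variable(tf.zeros([4]))
conv = tf.nn.conv2d(X, W_conv, strides=[1, 1, 1, 1], padding='SAME') + b_conv

relu = tf.nn.relu(conv)                              # ReLU layer

# pooling layer: 2x2 max pooling halves width and height
pool = tf.nn.max_pool(relu, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# fully connected layer: class scores
flat = tf.reshape(pool, [-1, 14 * 14 * 4])
W_fc = tf.Variable(tf.truncated_normal([14 * 14 * 4, 10], stddev=0.1))
b_fc = tf.Variable(tf.zeros([10]))
Y = tf.nn.softmax(tf.matmul(flat, W_fc) + b_fc)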

Traditional approaches

Manually designed features are often over-specified, incomplete, and take a long time to design and validate. Example descriptor: Scale-Invariant Feature Transform (SIFT) with vector quantization

End to end training

ImageNet challenge

14 million images gathered from the internet and labeled via Amazon Mechanical Turk, 20k classes.

ImageNet challenge

  • Winner: SuperVision team: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, University of Toronto.
  • The next best model was not a convnet

AlexNet architecture

  • 7 layers, 650 000 neurons, 60 000 000 parameters
  • Trained on 2 GPUs in two weeks
  • Dropout with 0.5 probability
  • In order to avoid overfitting, they trained on 224x224 patches extracted randomly from 256x256 images, and also their horizontal reflections

Convnet visualizing

  • Use deconvolution to map activations at higher layers back to the input pixel space
  • same operations as Convnet but in reverse:
    • Unpool feature maps
    • Rectify unpooled maps with the same non-linearity
    • Convolve rectified unpooled maps with transposed learned filters

Convnet visualizing

  • For a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map.
  • Alongside these visualizations we show the corresponding image patches.

Convnet visualizing

What does a neural network look like in TensorFlow?

TensorFlow is Google's open-source machine learning library
Hello world: handwritten digits classification

Simple model: Softmax classification

$$ P(y=k) = \frac{\exp(\sum_{i=1}^{784}w_{k,i}x_i + b_k)}{\sum_{k'=0}^9\exp(\sum_{i=1}^{784}w_{k',i}x_i + b_{k'})} = \text{softmax}(\sum_{i=1}^{784}w_{k,i}x_i + b_k)$$

matrix multiplication: a batch of 100 images at a time

X : Images [100, 784], W: Weights [784, 10], b : Biases [10], Y : predictions [100, 10]


					Y = tf.nn.softmax(tf.matmul(X, W) + b)
				

Loss function


					cross_entropy = -tf.reduce_sum(Y_*tf.log(Y))
				

TensorFlow: Initialisation


import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 28, 28, 1])
# None will become the batch size, 100
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

init = tf.initialize_all_variables()
				

TensorFlow: loss function and accuracy


# model
Y = tf.nn.softmax(tf.matmul(tf.reshape(X, [-1, 784]), W) + b)
# placeholder for correct labels
Y_ = tf.placeholder(tf.float32, [None, 10])

# loss function
cross_entropy = -tf.reduce_sum(Y_*tf.log(Y))

# % of correct label in batch
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
				

TensorFlow: training


optimizer = tf.train.GradientDescentOptimizer(0.001)
train_step = optimizer.minimize(cross_entropy)
				

TensorFlow: RUN


# mnist comes from the classic TF 1.x tutorial loader, e.g.:
# from tensorflow.examples.tutorials.mnist import input_data
# mnist = input_data.read_data_sets("MNIST_data", one_hot=True, reshape=False)

sess = tf.Session()
sess.run(init)
for i in range(1000):
  # load a batch of images and labels
  batch_X, batch_Y = mnist.train.next_batch(100)
  train_data = {X: batch_X, Y_: batch_Y}

  # train
  sess.run(train_step, feed_dict=train_data)

  # success on the training batch
  a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)

  # success on test data
  test_data = {X: mnist.test.images, Y_: mnist.test.labels}
  a, c = sess.run([accuracy, cross_entropy], feed_dict=test_data)

				

TensorFlow: All code

Go deeper

TensorFlow: Initialisation


K = 200
L = 100
M = 60
N = 30
W1 = tf.Variable(tf.truncated_normal([28*28, K]))
b1 = tf.Variable(tf.zeros([K]))
W2 = tf.Variable(tf.truncated_normal([K, L]))
b2 = tf.Variable(tf.zeros([L]))
W3 = tf.Variable(tf.truncated_normal([L, M]))
b3 = tf.Variable(tf.zeros([M]))
W4 = tf.Variable(tf.truncated_normal([M, N]))
b4 = tf.Variable(tf.zeros([N]))
W5 = tf.Variable(tf.truncated_normal([N, 10]))
b5 = tf.Variable(tf.zeros([10]))
				

TensorFlow: the model


X = tf.reshape(X, [-1, 28*28])

Y1 = tf.nn.sigmoid(tf.matmul(X, W1) + b1)
Y2 = tf.nn.sigmoid(tf.matmul(Y1, W2) + b2)
Y3 = tf.nn.sigmoid(tf.matmul(Y2, W3) + b3)
Y4 = tf.nn.sigmoid(tf.matmul(Y3, W4) + b4)

Y = tf.nn.softmax(tf.matmul(Y4, W5) + b5)
				

Deep learning in natural language processing

Deep learning in natural language processing

  • Goal: for computers to process or "understand" natural language in order to perform tasks that are useful.
  • Applications range from simple to complex:
    • Spell checking, keyword search, finding synonyms
    • Extracting information from websites such as product prices, dates, ...
    • Classifying the reading level of school texts, or positive/negative sentiment.
    • Machine translation
    • Question Answering

What we will see today

  • Word embedding: word2vec
  • Language model: Recurrent neural network and LSTM
  • Basic NLP tasks: part of speech and named entity
  • Sequence to sequence framework: machine translation
  • Attention mechanism: machine translation and image captioning

Word Embedding

Vector Representations of Words

  • Image and audio processing systems work with high-dimensional datasets encoded as vectors of individual pixel intensities or power spectral density coefficients
  • Natural language processing systems traditionally treat words as discrete atomic symbols

Representing Words - Classical View: One Hot Encoding

  • Obama: 10000
  • Dolphin: 01000
  • Paris: 00100
  • Berlin: 00010
  • SeaWorld: 00001

⇒ Every word is orthogonal to every other $$(w_{\text{Paris}} \cdot w_{\text{Berlin}} = 0)$$

Word embedding

  • The Distributional Hypothesis in linguistics: words that are used and occur in the same contexts tend to have similar meanings
  • Word embeddings represent words in a continuous vector space where semantically similar words are mapped to nearby points.

Word embedding

Count based methods

  • We compute statistics of how often words co-occur with their neighbor words in a large text corpus.
  • Then we map these count-statistics down to a small, dense vector for each word.

Word embedding

Predictive Method: Word2vec (Mikolov, 2013)

  • The Continuous Bag Of Words (CBOW) architecture predicts the current word based on its context.
  • Skip-gram predicts the surrounding words given the current word.
  • Skip-gram treats each context-target pair as a new observation, while CBOW smoothes over a lot of the distributional information

Skip-gram

  • Let $W$ be the embedding matrix and $W'$ the output weight matrix
  • Let w be a word one-hot encoded by x such that $x_k=1$ and $x_{k'}=0 \text{ for } k' \neq k$; then the hidden layer is $$h = x^TW = W_{(k,.)} = v_w$$
  • If $v'_{w_j}$ is the j-th column of the matrix $W'$, the output pre-activation is: $$ u_j= (v'_{w_j})^T h$$

Skip-gram

  • The probability of a context word $w_t$ (target) given the context's center $w_c$ is: \begin{align} P(w_t| w_c) &= \text{softmax}(u_t) \\ &= \frac{\exp((v'_{w_t})^Tv_{w_c})}{\sum_{\text{word w in Vocab}} \exp((v'_{w})^Tv_{w_c})} \end{align}
  • The negative log-likelihood to minimize (very expensive to compute): \begin{align} -\log P(w_t| w_c) &= - (v'_{w_t})^Tv_{w_c} \\ &+ \log\big(\sum_{\text{word w in Vocab}} \exp((v'_{w})^Tv_{w_c})\big) \end{align}
  • Negative sampling: word2vec is instead trained using a binary classification objective: \begin{align} loss_{neg} &= -\log P(D=1| v'_{w_t} v_{w_c}) \\ &- k\mathbb{E}_{w \sim P_{noise}}\log P(D=0|v'_{w}, v_{w_c}) \end{align}
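A minimal numpy sketch of the negative-sampling loss for one (center, target) pair; the expectation over the noise distribution is approximated by a sum over k sampled negative words (how those words are drawn is left out here):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(v_center, v_target_out, v_neg_out):
    """v_center: row of W for w_c; v_target_out: column of W' for w_t;
    v_neg_out: list of columns of W' for the k sampled noise words."""
    loss = -np.log(sigmoid(np.dot(v_target_out, v_center)))    # -log P(D=1 | w_t, w_c)
    for v_w in v_neg_out:
        loss -= np.log(sigmoid(-np.dot(v_w, v_center)))        # -log P(D=0 | w, w_c)
    return loss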

Linguistic regularities in the word space

Analogies testing dimensions of similarity, both syntactic and semantic, can be solved quite well just by doing vector arithmetic in the embedding space.

Expression                        Nearest token
Paris - France + Italy            Rome
bigger - big + cold               colder
sushi - Japan + Germany           bratwurst
Windows - Microsoft + Google      Android
apple - apples + cars             car

Baroni et al., “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”, ACL 2014

“We set out to conduct this study [on context-counting vs. context-predicting] because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of a proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. [...] Instead we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.”

Language Model: Recurrent neural network

Language Model

  • A language model computes a probability for a sequence of words: $p(w_1,...,w_T)$
  • Useful for machine translation
    • Word ordering: p(the cat is small) > p(small the cat is)
    • Word choice: p(walking home after school) > p(walking house after school)

Traditional language model

  • The probability is usually conditioned on a window of the n-1 previous words (Markov assumption): \begin{align} &p(w_1,...,w_T) = \prod_{i=1}^T p(w_i|w_1,...,w_{i-1}) \\ & \simeq \prod_{i=1}^T p(w_i|w_{i-(n-1)},...,w_{i-1}) \end{align}
  • Estimate probabilities by counting: $p(w_2|w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$ $$p(w_3| w_2, w_1) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}$$
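A minimal sketch of bigram estimation by counting (the toy corpus below is made up):

from collections import Counter

corpus = "the cat is small . the cat is cute .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """p(w2 | w1) = count(w1, w2) / count(w1)"""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("cat", "the"))   # 1.0: "the" is always followed by "cat" in this toy corpus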

N-gram: lack of Generalization

  • Consider three trigrams observed in the training corpus: "chases a cat", "chases a dog" and "chases a rabbit"
  • There is a clear pattern here: "chases a" will be followed by an animal.
  • Based on this concept, we can generalize this knowledge to unseen trigrams.
  • An n-gram language model does not capture this concept
  • If the trigram "chases a llama" never occurred in the training corpus, the conditional probability given by the n-gram language model will be zero.

Recurrent neural network

[Credit of slide to Fei-Fei Li & Andrej Karpathy & Justin Johnson]

Recurrent neural network

  • Given list of word vectors: $x_1, x_2,...,x_T$
  • $h_t = \text{sigmoid}(W^{(hh)} h_{t-1} + W^{(hx)}x_t)$
  • $y_t = \text{softmax}(W^{(yh)} h_t) $
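A minimal numpy sketch of this recurrence (dimensions are arbitrary; the word vectors x_t are assumed given):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W_hh, W_hx, W_yh):
    """One step of the vanilla RNN: new hidden state and output distribution."""
    h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)
    y_t = softmax(W_yh @ h_t)
    return h_t, y_t

def rnn_forward(xs, h0, W_hh, W_hx, W_yh):
    """Unroll the recurrence over a list of word vectors xs."""
    h, ys = h0, []
    for x_t in xs:
        h, y = rnn_step(x_t, h, W_hh, W_hx, W_yh)
        ys.append(y)
    return ys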

The difficulty of training recurrent neural networks: for hackers :)

[Credit of the slide to Fei-Fei Li & Andrej Karpathy & Justin Johnson]

The difficulty of training recurrent neural networks: for hackers :)

[Credit of the slide to Fei-Fei Li & Andrej Karpathy & Justin Johnson]

The difficulty of training recurrent neural networks: with maths

  • Let's take a simpler RNN: $$h_t = Wf(h_{t-1}) + W^{(hx)}x_t$$ $$y_t = W^{(yh)}f(h_t)$$
  • The total error is the sum of the errors at each time step t: $$\frac{\partial E}{\partial W} = \sum_{t=1}^T\frac{\partial E_t}{\partial W}$$
  • Let's apply the chain rule: $\frac{\partial E_t}{\partial W} = \sum_{k=1}^t \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$

The difficulty of training recurrent neural networks: with maths

  • $\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$
  • $\text{As } h_i = Wf(h_{i-1}) + W^{(hx)}x_i$, $$\frac{\partial h_i}{\partial h_{i-1}} = W^T\text{diag}(f'(h_{i-1}))$$
  • As $f'$ is bounded, $||\frac{\partial h_i}{\partial h_{i-1}}|| \leq \|W\| \, \| \text{diag}(f'(h_{i-1}))\| \leq \beta$
  • $\|\frac{\partial h_t}{\partial h_k}\| \leq \prod_{i=k+1}^{t} ||\frac{\partial h_i}{\partial h_{i-1}}|| \leq \beta^{(t-k)}$
  • This can become very large or very small ⇒ exploding or vanishing gradients

Trick for exploding gradients: gradient clipping
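A minimal TensorFlow sketch of the clipping trick: rescale the whole gradient whenever its global norm exceeds a threshold (the stand-in loss and the threshold of 5.0 are illustrative choices, not values from the slides):

import tensorflow as tf

w = tf.Variable([2.0, -3.0])
loss = tf.reduce_sum(tf.square(w))        # stand-in loss; in practice the RNN's loss

optimizer = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)

# rescale all gradients together if their global norm exceeds the threshold
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_step = optimizer.apply_gradients(list(zip(clipped_grads, variables)))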

Long Short Term Memory network

Long Short Term Memory network

  • An LSTM defines an extra memory cell $c_t$.
  • The memory cell $c_t$ is combined linearly at each time step: $$c_t = f_{t}*c_{t-1} + i_t * \tilde{c}_t$$
  • The information propagated from $c_{t-1}$ to $c_t$ is controlled by the input and forget gates $i_t$ and $f_t$.
  • $\tilde{c}_t$ is the memory cell candidate: $\tilde{c}_t = \text{tanh}(W_c[h_{t-1}, x_t] + b_c)$
  • The information propagated to the hidden layer is controlled by the output gate: $$ h_t = o_t * \text{tanh}(c_t)$$
  • The input gate determines what to include from the input $x_t$.
  • The forget gate determines what to forget from $c_{t−1}$.
  • The output gate determines what is relevant to the current state $h_t$

Memory cell

forget gate

input gate

output gate

RNN vs LSTM

RNN

$$h_t = \text{tanh}(W \left( \begin{array}{c}x_t \\ h_{t-1} \end{array} \right)) $$ $$ h \in \textbf{R}^n, \text{ } W [n \text{ x } 2n] $$

LSTM

$$ \left( \begin{array}{c} i_t \\ f_t \\ o_t \\ \tilde{c}_t \end{array} \right) = \left( \begin{array}{c} \text{sigmoid} \\ \text{sigmoid} \\ \text{sigmoid} \\ \text{tanh} \end{array} \right) W \left( \begin{array}{c}x_t \\ h_{t-1} \end{array} \right)$$ $$ c_t = f_{t}*c_{t-1} + i_t * \tilde{c}_t $$ $$ h_t = o_t * \text{tanh}(c_t) $$ $$ h \in \textbf{R}^n, \text{ } W [4n \text{ x }2n] $$
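The same LSTM equations as a minimal numpy step; W is the single [4n x 2n] matrix from the slide, split into the four gate blocks, and a bias term b is added (the slide omits it):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; x_t and h_prev have dimension n, W has shape [4n, 2n], b has shape [4n]."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])           # input gate
    f = sigmoid(z[1 * n:2 * n])           # forget gate
    o = sigmoid(z[2 * n:3 * n])           # output gate
    c_tilde = np.tanh(z[3 * n:4 * n])     # memory cell candidate
    c_t = f * c_prev + i * c_tilde        # new memory cell
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t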

Language model benchmark: Google News Benchmark (1B tokens)

Model                                        Perplexity
5-gram + KN regularized                      74.4
RNN [Chelba et al., 2014]                    51.3
LSTM [Brain team, submitted to ICML 2016]    30

[Samy Bengio: Collège de France presentation]

Basic Information Extraction tasks

  • Part of speech tagging (POS)
  • Named entity recognition (NER)

Information Extraction

Part Of Speech tags

Tag    Description
DT     Determiner
JJ     Adjective
VB     Verb, base form
PRP    Personal pronoun
...

Named Entity recognition

  • Organisation
  • Date
  • Location
  • Person

Classical approach

[Stanford NLP, Chris Manning]

Beam Search for inference

  • Intuition: follow a single path at a time, but switch paths whenever some competing path looks better
  • Beam Search: keep only the k most promising paths (k highest or lowest scores).
  • Beam Search example k = 3 (better = lower score): S={A} → S={B, C, D} → S={B, C, E} → S={C, E, H}
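A minimal Python sketch of this procedure over a hypothetical successors(node) interface returning (next_node, step_score) pairs, with lower cumulative score = better, as in the example above:

import heapq

def beam_search(start, successors, is_goal, k=3, max_steps=100):
    """Keep only the k lowest-score paths at each step."""
    beam = [(0.0, [start])]                          # (cumulative score, path)
    for _ in range(max_steps):
        candidates = []
        for score, path in beam:
            if is_goal(path[-1]):
                candidates.append((score, path))     # finished paths stay in the beam
                continue
            for node, step_score in successors(path[-1]):
                candidates.append((score + step_score, path + [node]))
        beam = heapq.nsmallest(k, candidates, key=lambda sp: sp[0])
        if all(is_goal(p[-1]) for _, p in beam):
            break
    return beam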

Classical approach

  • At each time step, a classifier makes a single decision conditioned on evidence from observations and previous decisions
  • Extract from the sentence a rich set of hand-designed features, which are then fed to a standard classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel
  • The choice of features is a completely empirical process, mainly based first on linguistic intuition and then on trial and error, and the feature selection is task dependent
Sentence
Position   -3      -2      -1     0     +1
Tag        ORG     ORG     O      ???   ???
Word       Xerox   Corp.   fell   26    %

Features
$W_0$              26
$W_{+1}$           %
$W_{-1}$           fell
$C_{-1}$           O
$C_{-2}-C_{-1}$    ORG-O
hasdigit?          true
...                ...

Natural Language Processing (almost) from Scratch

  • Use a neural network to learn features
  • Embedding layer: embedding vectors are stored in a lookup table $LT_W$. We embed words in the same way as in word2vec. The embedding can be extended to discrete features such as word suffixes.
  • For a given input x, the network outputs a score $f_i(x)$ for each tag $i$.
  • Word-level log-likelihood: the conditional tag probability is: $$ p(i| x) = \text{softmax}(f(x))_i$$
[Collobert et al., 2011]

Natural Language Processing (almost) from Scratch

Sentence-Level Log-Likelihood:
  • We know that there are dependencies between word tags in a sentence
  • Tags are organized in chunks, and some tags cannot follow other tags
  • We introduce a transition score $A_{i,j}$ for jumping from tag i to tag j in successive words. The transition scores are trained.
  • The score of a sentence $[x]_1^T$ along a path of tags $[i]_1^T$ is then given by the sum of transition scores and network scores: $$ s([x]_1^T, [i]_1^T) = \sum_{t=1}^T (A_{i_{t-1}, i_t} + f_{i_t})$$
  • Then we normalize this score over all possible tag paths using a softmax
Results

Task                               Benchmark    SENNA
Part of Speech (Accuracy)          97.24        97.29
Named Entity Recognition (F1)      89.31        89.59

[Collobert et al., 2011]

Sequence to Sequence Framework: machine translation and image captioning as applications

Statistical Machine translation: classical approach

  • Translation from source to target ($s\rightarrow t$)
  • Source sentence: $s = s_1^m = s_1..s_j..s_m$
  • Target sentence: $t = t_1^I = t_1..t_i..t_I$
  • Search for the best translation: $$ t^* = \text{argmax}_t p(t|s)$$
  • Bayes rule: $$ t^* = \text{argmax}_t \frac{p(s|t) p(t)}{p(s)} = \text{argmax}_t p(s|t) p(t)$$
  • p(s|t): translation model trained on parallel data (bitexts)
  • p(t): language model trained on monolingual text: controls whether the target sentence is well-formed or not

Statistical Machine translation: classical approach

  • Perform word alignment in both directions
  • Extract phrases (heuristics)
  • Score phrases (relative frequencies)
  • Train the language and reordering models
  • Optimize the weights of each model and combine all models
⇒ many steps which are optimized individually, no global training
[credit of the slide to Holger Schwenk, Collège de France seminar]

Sequence to sequence: Encoder/decoder approach

  • An encoder processes the source sentence and creates a compact representation
  • This representation is the input to the decoder, which generates a sequence in the target language
    Some details:
  • Encoder/decoder: huge LSTM: 4 layers, 384M parameters and 8,000 dimensional state each.
  • Encoder: no output layer, no loss function.
  • SMT baseline: 33.30 BLEU.
  • Beam search for inference: beam size 12.
  • Neural Machine Translation system (NMT): 30.59 BLEU, and 34.91 with an ensemble of 5.
  • $\text{BLEU} = \text{BP} \cdot \exp\big(\sum_{n=1}^N \frac{\log(p_n)}{N}\big)$ where $p_n$ is the n-gram precision and BP is a brevity penalty

A figure of a 2-dimensional PCA projection of the LSTM hidden states that are obtained after processing the phrases

The Trouble with Simple Encoder-Decoder Architectures

  • The encoder compresses the input sequence as a fixed-size vector from which the decoder needs to generate a full translation.
  • Encoder-Decoder does not perform well with long sentences.

A solution: Attention mechanism

  • Let's imagine that you are translating the source sentence $(x_1,x_2...,x_n)$
  • You have already written the first $i-1$ target words $(y_1,y_2,..y_{i-1})$
  • In addition to the already generated target words, a translator looks at each source word $x_j$ and decides whether the source word $x_j$ is relevant for the next target word.
  • This is called the attention mechanism.
  • The attention mechanism is implemented as a neural network with a single hidden layer and a single scalar output $e_j$, applied to every source word $x_j$
  • The attention mechanism takes as input the previous target hidden state $z_i$ and the source hidden state $h_j$
  • The relevance score of each source word is $\alpha_j = \frac{\exp(e_j)}{\sum_{j'}\exp(e_{j'})}$
  • The sentence representation vector is $c_i = \sum_{j}\alpha_jh_j$, which summarizes the information of the whole sentence with different emphasis on different words.
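A minimal numpy sketch of this computation, using a one-hidden-layer additive scoring network (the weight names W_att, U_att, v_att and their shapes are illustrative assumptions):

import numpy as np

def attention_context(z_prev, H, W_att, U_att, v_att):
    """z_prev: previous target hidden state; H: source hidden states h_j, one per row."""
    # scalar relevance score e_j for every source position j
    e = np.array([v_att @ np.tanh(W_att @ z_prev + U_att @ h_j) for h_j in H])
    # normalize the scores into attention weights alpha_j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # context vector c_i: weighted sum of the source hidden states
    return alpha @ H, alpha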

Attention mechanism results

Edge thicknesses represent the attention weights found by the attention model.

Image captioning

Image caption with attention mechanism

Image caption with attention mechanism

Examples of mistakes where we can use attention to gain intuition into what the model saw

Recurrent neural network: different architectures

  • One to one: Vanilla Neural Networks
  • One to many: e.g. Image Captioning
  • many to one: e.g. Sentiment Classification
  • many to many: e.g. Machine Translation
  • many to many: e.g. Video classification on frame level

CONCLUSION

  • Deep learning provides a flexible, «almost» universal, learnable framework for representing world, visual, and linguistic information.
  • It is hard to train, with a lot of hyperparameters to tune.
  • It is as much about theory as about engineering.