Presenting the Importance of Random Initialization of the Weights

The problem of weight initialization is explained here.

“This turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.”

Basically, if all weights start out with the same value, the neurons cannot learn distinct features. This post is intended to provide some simple evidence of the importance of asymmetry in weight initialization.

Configuration of the neural network:

Learning loop:

Predictions if initialized with asymmetrical weights:

Predictions if all weights are initialized to 0.1:

After 10,000 iterations the network failed to solve the simple XOR problem, which is kind of embarrassing.
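The symmetry argument from the quote can also be checked directly. Below is a minimal C++ sketch, not the code used for the results above: it assumes a hypothetical 2-2-1 network without biases (`TinyNet`), plain full-batch gradient descent, and a squared-error loss. The names `train_xor` and `hidden_units_identical` are made up for this illustration.

```cpp
#include <cmath>

// Hypothetical 2-2-1 network without biases, used only for this illustration.
struct TinyNet {
    double w1[2][2];  // w1[i][j]: weight from input i to hidden unit j
    double w2[2];     // w2[j]: weight from hidden unit j to the output
};

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Run `steps` full-batch gradient-descent steps on the XOR truth table.
void train_xor(TinyNet& net, int steps, double lr) {
    const double X[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    const double y[4] = {0, 1, 1, 0};
    for (int s = 0; s < steps; ++s) {
        double g1[2][2] = {{0, 0}, {0, 0}};  // accumulated gradient for w1
        double g2[2] = {0, 0};               // accumulated gradient for w2
        for (int n = 0; n < 4; ++n) {
            double h[2];
            for (int j = 0; j < 2; ++j)
                h[j] = sigmoid(X[n][0] * net.w1[0][j] + X[n][1] * net.w1[1][j]);
            const double out = sigmoid(h[0] * net.w2[0] + h[1] * net.w2[1]);
            const double d_out = (out - y[n]) * out * (1.0 - out);
            for (int j = 0; j < 2; ++j) {
                const double d_h = d_out * net.w2[j] * h[j] * (1.0 - h[j]);
                g2[j] += d_out * h[j];
                g1[0][j] += d_h * X[n][0];
                g1[1][j] += d_h * X[n][1];
            }
        }
        for (int j = 0; j < 2; ++j) {
            net.w2[j] -= lr * g2[j];
            net.w1[0][j] -= lr * g1[0][j];
            net.w1[1][j] -= lr * g1[1][j];
        }
    }
}

// True if both hidden units still carry (numerically) the same weights.
bool hidden_units_identical(const TinyNet& net, double tol = 1e-12) {
    return std::fabs(net.w1[0][0] - net.w1[0][1]) < tol &&
           std::fabs(net.w1[1][0] - net.w1[1][1]) < tol &&
           std::fabs(net.w2[0] - net.w2[1]) < tol;
}
```

If every weight starts at 0.1, both hidden units receive the same gradient at every step, so `hidden_units_identical` should keep returning true however long `train_xor` runs. The network then behaves as if it had a single hidden unit. Starting from distinct weights breaks that tie from the first update.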


Complete code:


Downloading more than 20 years of The New York Times

Articles for the period from 1987 to present are available without subscription. Their copyright notice is web scraping friendly:

“… you may download material from The New York Times on the Web (one machine readable copy and one print copy per page) for your personal, noncommercial use only.”

Why waste the opportunity to download these articles then?


Please read their terms of service here.
Please subscribe to The New York Times here.

Next time, I’ll modify the code so you can download articles from some other major online newspaper.

Downloading all English books from Project Gutenberg with Python


Project Gutenberg (PG) is probably the second most popular source of text corpora for NLP (after Wikipedia; here you will find a torrent file for the latest Wikipedia dump, by the way). The code below downloads all available English-language books in .txt format. It works in two steps: (1) it collects the direct URLs of all the books, and (2) it downloads them one by one, extracts the text files from the archives, and then deletes the .zip files.

After you run the code, you will get approximately 16,486,020,098 bytes (16.57 GB on disk) for 41,599 items.

Next time, I will build word embeddings with the word2vec model, based on the PG text corpus.

A Neural Network in 10 lines of C++ Code

Purpose: For educational purposes only. The code demonstrates a supervised learning task using a very simple neural network. In my next post, I am going to replace the vast majority of subroutines with CUDA kernels.

Reference: Andrew Trask’s post.

The core component of the code, the learning algorithm, is only 10 lines:

The loop above runs for 50 iterations (epochs) and fits the vector of attributes X to the vector of classes y through the vector of weights W. I am going to use 4 records from the Iris flower dataset. The attributes (X) are sepal length, sepal width, petal length, and petal width. My example uses 2 of the 3 classes found in the original dataset: Iris Setosa (0) and Iris Virginica (1). Predictions are stored in the vector pred.

Neural network architecture. Values of vectors W and pred change over the course of training the network, while vectors X and y must not be changed:

The size of matrix X is the size of the batch by the number of attributes.

Line 3. Make predictions:

In order to calculate predictions, first of all, we will need to multiply a 4 x 4 matrix X by a 4 x 1 matrix W. Then, we will need to apply an activation function; in this case, we will use a sigmoid function.

A subroutine for matrix multiplication:
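A minimal sketch of such a subroutine, assuming matrices are stored as flat, row-major `std::vector<double>` values. The name `dot` and the explicit dimension arguments are assumptions of this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply an (a_rows x a_cols) matrix by an (a_cols x b_cols) matrix.
// Both inputs and the result are flat vectors in row-major order.
std::vector<double> dot(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::size_t a_rows, std::size_t a_cols,
                        std::size_t b_cols) {
    assert(a.size() == a_rows * a_cols);
    assert(b.size() == a_cols * b_cols);
    std::vector<double> out(a_rows * b_cols, 0.0);
    for (std::size_t r = 0; r < a_rows; ++r)
        for (std::size_t k = 0; k < a_cols; ++k)
            for (std::size_t c = 0; c < b_cols; ++c)
                out[r * b_cols + c] += a[r * a_cols + k] * b[k * b_cols + c];
    return out;
}
```

For the 4 x 4 matrix X and the 4 x 1 matrix W, the call would look like `dot(X, W, 4, 4, 1)` and return a vector of 4 values.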

A subroutine for the sigmoid function:
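One way to write it, as a sketch that applies the logistic function 1 / (1 + e^(-z)) to every element of a vector. The signature is an assumption, chosen to match a flat-vector matrix representation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Apply the sigmoid function element by element.
std::vector<double> sigmoid(const std::vector<double>& z) {
    std::vector<double> out(z.size());
    for (std::size_t i = 0; i < z.size(); ++i)
        out[i] = 1.0 / (1.0 + std::exp(-z[i]));
    return out;
}
```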

Sigmoid function (red) and its first derivative (blue graph):

Line 4. Calculate pred_error, which is simply the difference between the predictions and the truth:

In order to subtract one vector from another, we will need to overload the “-” operator:
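A sketch of such an overload for `std::vector<double>`. Defining operators on a standard container is acceptable in a toy example like this one, though a dedicated matrix type would be the more usual choice in production code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise difference of two vectors of equal length.
std::vector<double> operator-(const std::vector<double>& a,
                              const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] - b[i];
    return out;
}
```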

Line 5. Determine the vector of deltas pred_delta:

In order to perform elementwise multiplication of two vectors, we will need to overload the “*” operator:
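A sketch of the elementwise (Hadamard) product, again assuming flat `std::vector<double>` storage. Note that this is not a matrix product; each output element depends only on the two inputs at the same index:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise product of two vectors of equal length.
std::vector<double> operator*(const std::vector<double>& a,
                              const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i];
    return out;
}
```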

A subroutine for the derivative of the sigmoid function (d_sigmoid):

Basically, we use the first derivative to find the slope of the line tangent to the graph of the sigmoid function. At x = 0 the slope equals 0.25. The further x is from 0, the closer the slope is to 0: at x = ±10 the slope equals 0.000045. Hence, the deltas will be small if either the error is small or the network is very confident about its prediction (i.e. abs(x) is greater than 4).
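The slope values quoted above can be reproduced with a short sketch. It assumes `d_sigmoid` receives the sigmoid's input x (the pre-activation) and returns s(x) * (1 - s(x)). Some implementations instead pass the sigmoid's output and compute out * (1 - out), which gives the same number:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// First derivative of the sigmoid, evaluated element-wise at the inputs x.
std::vector<double> d_sigmoid(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double s = 1.0 / (1.0 + std::exp(-x[i]));
        out[i] = s * (1.0 - s);
    }
    return out;
}
```

Evaluating `d_sigmoid({0.0, 10.0, -10.0})` gives 0.25 for the first element and roughly 0.000045 for the other two.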

Line 6. Calculate W_delta:

This line computes weight updates. In order to do that, we need to perform matrix multiplication of transposed matrix X by matrix pred_delta.

The subroutine that transposes matrices:
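A possible version, sketched for a flat row-major matrix. The signature, with explicit `rows` and `cols`, is an assumption of this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose a (rows x cols) row-major matrix into a (cols x rows) one.
std::vector<double> transpose(const std::vector<double>& m,
                              std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<double> out(m.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = m[r * cols + c];
    return out;
}
```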

Line 7. Update the weights W:

In order to perform matrix addition operation, we need to overload the “+” operator:
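A sketch of that overload, assuming both operands are flat vectors of the same length, so matrix addition reduces to element-wise addition:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise sum of two vectors of equal length.
std::vector<double> operator+(const std::vector<double>& a,
                              const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] + b[i];
    return out;
}
```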

Complete code:
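What follows is a compact sketch of how the pieces described above could fit together, not the original listing. It assumes flat row-major `std::vector<double>` storage, a hypothetical `train` function wrapping the loop, and an error computed as truth minus prediction, so that adding W_delta to W reduces the error. The derivative is evaluated at the pre-activation z.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // flat, row-major storage (assumption)

Vec dot(const Vec& a, const Vec& b, std::size_t a_rows, std::size_t a_cols,
        std::size_t b_cols) {
    Vec out(a_rows * b_cols, 0.0);
    for (std::size_t r = 0; r < a_rows; ++r)
        for (std::size_t k = 0; k < a_cols; ++k)
            for (std::size_t c = 0; c < b_cols; ++c)
                out[r * b_cols + c] += a[r * a_cols + k] * b[k * b_cols + c];
    return out;
}

Vec transpose(const Vec& m, std::size_t rows, std::size_t cols) {
    Vec out(m.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = m[r * cols + c];
    return out;
}

Vec sigmoid(const Vec& z) {
    Vec out(z.size());
    for (std::size_t i = 0; i < z.size(); ++i)
        out[i] = 1.0 / (1.0 + std::exp(-z[i]));
    return out;
}

Vec d_sigmoid(const Vec& z) {
    Vec s = sigmoid(z);
    for (double& v : s) v = v * (1.0 - v);
    return s;
}

Vec operator+(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

Vec operator-(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] - b[i];
    return out;
}

Vec operator*(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i];
    return out;
}

// Fit the weights W of a single-layer network and return them.
// X is (rows x cols), y has `rows` entries, W has `cols` entries.
Vec train(const Vec& X, const Vec& y, Vec W, std::size_t rows,
          std::size_t cols, int epochs) {
    assert(X.size() == rows * cols && y.size() == rows && W.size() == cols);
    const Vec Xt = transpose(X, rows, cols);
    for (int i = 0; i < epochs; ++i) {
        const Vec z = dot(X, W, rows, cols, 1);            // weighted sums
        const Vec pred = sigmoid(z);                       // predictions
        const Vec pred_error = y - pred;                   // truth minus prediction
        const Vec pred_delta = pred_error * d_sigmoid(z);  // scaled by the slope
        const Vec W_delta = dot(Xt, pred_delta, cols, rows, 1);
        W = W + W_delta;                                   // weight update
    }
    return W;
}
```

With a 4 x 4 matrix X and a 4-element y, a call such as `train(X, y, W, 4, 4, 50)` runs the 50 epochs described above. Predictions for the fitted weights are then `sigmoid(dot(X, W_fitted, 4, 4, 1))`.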