Downloading all English books from Project Gutenberg with Python


Project Gutenberg (PG) is probably the second most popular source of text corpora for NLP (after Wikipedia, for which a torrent file of the latest dump is also available, btw). The code below downloads all available English-language books in .txt format. It works in two steps: (1) it collects the direct URLs to all the books, and (2) it downloads them one by one, extracts the text files from the archives, and then deletes the .zip files.
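A minimal sketch of these two steps, using only the standard library. It is an illustration under stated assumptions, not necessarily the exact script: the start URL is assumed to be PG's robot harvest listing, the "next page" link is assumed to contain `offset=`, and the names `split_links`, `collect_urls`, `extract_and_delete` and `download_all` are made up for this example.

```python
import os
import time
import zipfile
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

# Assumption: PG's robot harvest listing for English .txt files.
START_URL = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, base_url):
    """Return (zip_urls, next_page_url or None) for one listing page."""
    parser = LinkParser()
    parser.feed(html)
    zip_urls, next_url = [], None
    for href in parser.hrefs:
        if href.lower().endswith(".zip"):
            zip_urls.append(urljoin(base_url, href))
        elif "harvest" in href and "offset=" in href:
            # Assumption: this is how the "next page" link looks.
            next_url = urljoin(base_url, href)
    return zip_urls, next_url


def collect_urls(start_url=START_URL, delay=2.0):
    """Step 1: walk the listing pages and gather all direct .zip URLs."""
    urls, page, seen = [], start_url, set()
    while page and page not in seen:
        seen.add(page)
        with urlopen(page) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        found, page = split_links(html, page)
        urls.extend(found)
        time.sleep(delay)  # be polite to the server
    return urls


def extract_and_delete(zip_path, out_dir):
    """Unpack the .txt members of one archive, then remove the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".txt")]
        archive.extractall(out_dir, members=names)
    os.remove(zip_path)
    return names


def download_all(urls, out_dir="books", delay=1.0):
    """Step 2: download each archive, extract its text, delete the .zip."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        zip_path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        try:
            urlretrieve(url, zip_path)
            extract_and_delete(zip_path, out_dir)
        except (OSError, zipfile.BadZipFile) as exc:
            print(f"skipped {url}: {exc}")
        time.sleep(delay)
```

Calling `download_all(collect_urls())` runs both steps in order. Deleting each archive right after extraction keeps peak disk usage close to the size of the extracted text alone.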

After you run the code, you should end up with roughly 16,486,020,098 bytes (16.57 GB on disk) across 41,599 items.
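To check what you actually got, the files can be counted and their sizes summed. This is a small sketch; the helper name `corpus_size` and the `books` directory are assumptions for illustration, and it counts only .txt files, so the number may not match an "items" count that also includes folders.

```python
import os


def corpus_size(root):
    """Return (file_count, total_bytes) for all .txt files under root."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".txt"):
                count += 1
                total += os.path.getsize(os.path.join(dirpath, name))
    return count, total
```

For example, `corpus_size("books")` returns the number of text files and their combined size in bytes; the on-disk figure is usually a bit larger because of filesystem block allocation.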

Next time, I will build word embeddings with a word2vec model trained on the PG text corpus.


