Downloading all English books from Project Gutenberg with Python


Project Gutenberg (PG) is probably the second most popular source of text corpora for NLP, after Wikipedia (a torrent file for the latest Wikipedia dump is also available, btw). The code below downloads all available English-language books in .txt format. It works in two steps: (1) it collects the direct URLs to the books, and (2) it downloads them one by one, extracts the text files from the archives, and then deletes the .zip files.
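As a rough illustration of the two steps described above, here is a minimal sketch using only the standard library. It is not necessarily the exact script behind the numbers in this post. It assumes the books are listed through PG's robot "harvest" endpoint and that the link to each following listing page contains "offset="; the function names (parse_listing, collect_zip_urls, extract_txt_and_remove, download_all) and the pause parameter are made up for this example. Please check PG's current robot-access policy before running anything like it.

```python
import os
import time
import urllib.request
import zipfile
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: PG's robot "harvest" listing, filtered to English .txt files.
HARVEST_URL = "https://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def parse_listing(html, base_url):
    """Return (zip_urls, next_page_url or None) for one listing page."""
    collector = LinkCollector()
    collector.feed(html)
    links = [urljoin(base_url, h) for h in collector.hrefs]
    zip_urls = [u for u in links if u.lower().endswith(".zip")]
    # Assumption: the link to the next listing page contains "offset=".
    next_pages = [u for u in links if "offset=" in u]
    return zip_urls, (next_pages[0] if next_pages else None)


def collect_zip_urls(start_url, pause=2.0):
    """Step 1: follow the listing pages and gather every archive URL."""
    urls, page, seen = [], start_url, set()
    while page and page not in seen:
        seen.add(page)
        with urllib.request.urlopen(page) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        found, page = parse_listing(html, page)
        urls.extend(found)
        time.sleep(pause)  # be polite to the server
    return urls


def extract_txt_and_remove(zip_path, dest_dir):
    """Extract the .txt members of an archive, then delete the archive."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".txt"):
                # Flatten any internal folders; keep only the file name.
                target = os.path.join(dest_dir, os.path.basename(member))
                with archive.open(member) as src, open(target, "wb") as dst:
                    dst.write(src.read())
                extracted.append(target)
    os.remove(zip_path)
    return extracted


def download_all(start_url, dest_dir, pause=2.0):
    """Step 2: fetch each archive, unpack its text files, drop the .zip."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in collect_zip_urls(start_url, pause):
        zip_path = os.path.join(dest_dir, os.path.basename(url))
        try:
            urllib.request.urlretrieve(url, zip_path)
            extract_txt_and_remove(zip_path, dest_dir)
        except (OSError, zipfile.BadZipFile) as err:
            print(f"skipped {url}: {err}")
        time.sleep(pause)
```

To start a full run you would call download_all(HARVEST_URL, "gutenberg_txt"), where the folder name is arbitrary. Nothing is downloaded until that call is made. The pause between requests is there to keep the load on the server low, and a failed or corrupt archive is reported and skipped rather than stopping the whole run.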

After you run the code, you should end up with approximately 16,486,020,098 bytes (16.57 GB on disk) across 41,599 items.

Next time, I will build word embeddings with the word2vec model, using the PG text corpus.

