Predicting genres of 45,000 Project Gutenberg books using NLP - BoW Approach

Project Gutenberg is a website that offers more than 58,000 free eBooks whose U.S. copyright has expired. It makes for very interesting Natural Language Processing (NLP) data: a huge body of text with fairly reliable labels such as genre, author, and publication year. Here, I’ll attempt to process approximately 45,000 English books from Project Gutenberg to find patterns between the words in a book and its genre, using a Bag-of-Words (BoW) approach.
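
For readers new to the term, here is a minimal sketch of what a Bag-of-Words representation looks like: each text is reduced to word counts, and word order is discarded. The two snippets, their labels, and the `bag_of_words` helper are made up for illustration; this is not the actual pipeline or data used in this post.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text, keep runs of the letters a-z as tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical snippets standing in for books of two genres.
docs = {
    "detective": "The detective found the clue.",
    "romance": "She loved the garden and the sea.",
}

# One word-count "bag" per text; word order is no longer available.
bows = {label: bag_of_words(text) for label, text in docs.items()}

for label, bow in bows.items():
    print(label, dict(bow))
```

Words such as "the" show up in both bags, while "detective" or "garden" appear in only one. That difference in counts is the kind of word-to-genre pattern a BoW approach tries to exploit.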