My Gmail inbox had grown to nearly 30,000 emails. Not a pretty number at all, so I decided to do something about it.

As any good data analyst would, I wanted to clean up my inbox in a systematic, data-driven manner.

I wanted something easy, quick, and high-payoff, not a full-blown spam classification algorithm using Natural Language Processing, although that’d be fun as well. So I settled on a 50% rule: get rid of enough unnecessary emails to cut my inbox by at least 50%.

You’d be surprised to learn that the Pareto principle, where 20% of causes explain 80% of the effect, also holds in my Gmail inbox. In my case, 12% of senders accounted for 80% of all emails, and only 18 senders accounted for approximately 50% of all my emails!
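For the curious, here is a minimal sketch of how shares like these could be computed from per-sender email counts. The function name `senders_to_cover` and the toy counts are made up for illustration; they are not the real inbox data or the original script.

```python
def senders_to_cover(counts, fraction):
    """Return how many of the busiest senders are needed to cover
    at least `fraction` of all emails."""
    total = sum(counts)
    running = 0
    # Walk through senders from most to least frequent, keeping a running total.
    for n_senders, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n_senders
    return len(counts)


# Hypothetical per-sender email counts (100 emails in total), not real data.
toy_counts = [40, 25, 17, 8, 5, 3, 2]

print(senders_to_cover(toy_counts, 0.5))  # senders needed for half of all emails
print(senders_to_cover(toy_counts, 0.8))  # senders needed for 80% of all emails
```

Dividing the returned number by the total number of unique senders gives the kind of "X% of senders account for Y% of emails" figure quoted above.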

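[Figure: cumulative email count by unique sender, with a red dotted horizontal line at 50% of total mail]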

Here you can see how concentrated my inbox is across unique senders. The curve is steep up to about the 200th unique sender, meaning that roughly the top 200 senders contribute most of the clutter in my inbox. The red dotted horizontal line marks 50% of my total mail count, to give a sense of how quickly the curve reaches that point.

Good to know, but I’m not going to review the top 200 senders because that’s already too many. Instead, I’m going to focus on the dotted square portion of the distribution for now.

[Figure: emails per sender (blue bars) and cumulative emails (grey bars) for the top 18 senders]

Zooming into the dotted square portion of the chart above, we can see that 50% of all my emails (~15,000) are from the top 18 most frequent senders. Now that’s a number that I can manage to review. The blue bars show the number of emails sent by each unique sender, and the grey bars show the cumulative total.
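The two bar series are easy to reproduce once the per-sender counts are known. A small sketch, using the counts from the table of top senders in this post; the variable names are my own and the plotting itself is left out:

```python
from itertools import accumulate

# Emails per sender for the 18 most frequent senders (the blue bars),
# copied from the table of top senders in this post.
per_sender = [5559, 2354, 902, 523, 371, 361, 354, 335, 323,
              318, 302, 299, 288, 287, 286, 274, 266, 240]

# Running total across senders (the grey bars).
cumulative = list(accumulate(per_sender))

for rank, (count, total) in enumerate(zip(per_sender, cumulative), start=1):
    print(f"{rank:>2}  {count:>5}  {total:>6}")
```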

I wonder who the most frequent sender is (first blue bar), as it has sent me a whopping 5,000-plus emails so far. The next one (second blue bar) is also persistent, at more than 2,000 emails. I wonder who that is as well…

    name                            organization        count
 1  newtech-1                       meetup.com           5559
 2  info                            meetup.com           2354
 3  hello                           mealpal.com           902
 4  support                         streeteasy.com        523
 5  usmail                          expediamail.com       371
 6  notify                          buildinglink.com      361
 7  travelocity                     ac.travelocity.com    354
 8  alert                           indeed.com            335
 9  Top-Free-Events-Today-announce  meetup.com            323
10  godiva                          e.godiva.com          318
11  email                           usa.uniqlo.com        302
12  XXX                             uchicago.edu          299
13  groups-noreply                  linkedin.com          288
14  zales                           em.zales.com          287
15  XXX                             uchicago.edu          286
16  katespade                       em.katespade.com      274
17  noreply                         r.groupon.com         266
18  info                            twitter.com           240

Turns out, the two most frequent senders are both from Meetup! Specifically, “newtech-1”, which I remember as a New York Tech Meetup group, and “info”, which is probably the general update on all the Meetup events that I subscribe to.

Mealpal, a discounted email-based lunch subscription service that I used for a few months, also sent me quite a few emails, and so did StreetEasy, Expedia, Uniqlo, Godiva, etc. Maybe I shouldn’t have given my email out to all those websites…

There were some individuals from UChicago (my alma mater) as well, whom I anonymized with “XXX”. Brings back memories.
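A table like the one above boils down to splitting each sender address into the part before the “@” (name) and the part after it (organization), then counting. Here is a rough sketch using only the standard library; the helper names and the sample "From" headers are hypothetical, and the actual download script may do this differently.

```python
from collections import Counter
from email.utils import parseaddr


def split_sender(from_header):
    """Split a From header into (name, organization) around the '@'."""
    _, address = parseaddr(from_header)  # drops the display name, keeps the address
    name, _, organization = address.partition("@")
    return name, organization.lower()


def count_senders(from_headers):
    """Count emails per (name, organization) pair."""
    return Counter(split_sender(header) for header in from_headers)


# Hypothetical From headers for illustration only.
sample = [
    "Meetup <info@meetup.com>",
    "info@meetup.com",
    "Some Group <newtech-1@meetup.com>",
]

for (name, organization), count in count_senders(sample).most_common():
    print(name, organization, count)
```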

I figured I could safely delete all emails from the top 18 senders, so I extracted all mail IDs corresponding to these senders and used Python and the imaplib module to automatically delete ~13,000 emails.
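To give a feel for what that looks like, here is a minimal sketch of deleting by sender with imaplib. It is not the exact script used for this post: the helper names, the `imap.gmail.com` host, and the search-by-sender step are assumptions for illustration. Also note that, depending on your Gmail IMAP settings, flagging a message as deleted in the inbox may archive it rather than move it to the trash, so try it on a handful of emails first.

```python
import imaplib


def find_ids_from(conn, sender):
    """Return the message ids in the selected mailbox sent from `sender`."""
    status, data = conn.search(None, "FROM", f'"{sender}"')
    if status != "OK" or not data or not data[0]:
        return []
    return data[0].split()


def delete_messages(conn, message_ids):
    """Flag each message as deleted, then expunge. Returns how many were flagged."""
    flagged = 0
    for msg_id in message_ids:
        status, _ = conn.store(msg_id, "+FLAGS", "\\Deleted")
        if status == "OK":
            flagged += 1
    conn.expunge()  # permanently remove the flagged messages from this mailbox
    return flagged


def clean_inbox(user, password, senders, host="imap.gmail.com"):
    """Connect, delete everything from the given senders, and log out.
    The host is an assumption; nothing here runs until you call it."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select("INBOX")
        return sum(delete_messages(conn, find_ids_from(conn, s)) for s in senders)
    finally:
        conn.logout()
```

Calling something like `clean_inbox(address, app_password, ["info@meetup.com"])` would then remove every inbox email from that sender. Wrapping the loop in a progress bar is a nice touch when there are thousands of messages to go through.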

Voila! I just reduced my inbox count to half what it was without worrying about deleting any important emails! Below is a progress bar showing that I deleted 12,862 emails in about 1 hour and 21 minutes.

100%|██████████| 12862/12862 [1:21:03<00:00,  4.23it/s]


More

If you’d like to see the Python script to download email data used in this process, please click here.

For the Jupyter notebook that walks through the email deletion process laid out on this page (once the email data has been downloaded), please click here.