Finding top senders in Gmail

tl;dr: I created a Python script for finding the top email senders in your Gmail inbox

It's the start of 2024 and a good time to do some spring (or should it be winter?) cleaning. If you're anything like me, you have thousands of unread emails in your Gmail inbox that you kept as unread to remind yourself to go back and read them later, but then never managed to do so. In the meantime, more emails come in and the problem gets worse.

To help remedy the situation, I wanted a count of the emails in my inbox per sender, so that I could work from the top down and make the most impact by deleting the emails I don't care about (and unsubscribing at the same time). Unfortunately, there isn't any way to do this in Gmail - the only thing you can do is search by sender, then delete all the emails that come back, which is not a very efficient solution.

I figured the best thing to do would be to crack open the Python SDK for Gmail (see the quickstart guide) and put together a script that does exactly what I want. The first task was to create a new OAuth application, enable the Gmail API and assign the appropriate permission to read from a Gmail inbox. Once I had done this, I downloaded the credentials.json file, which the script uses when initialising to link the user's Gmail account to the application and grant the required permission. The first time the script runs, after you sign in to your Google account, you get a consent screen confirming the permissions you're about to grant to your Google app (I called mine GmailCleaner).
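For illustration, here is a minimal sketch of that authentication step, modelled on the pattern in Google's Python quickstart rather than copied from my script. The read-only scope, the token.json cache file and the function name build_gmail_service are assumptions; the third-party imports (google-auth, google-auth-oauthlib, google-api-python-client) sit inside the function so they are only needed when it is called.

```python
import os

# Assumption: read-only access is enough to list messages and read their headers.
SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]


def build_gmail_service(credentials_path="credentials.json", token_path="token.json"):
    """Return an authorised Gmail API client, asking for consent when needed."""
    # Imported here so the third-party packages are only required at call time.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    creds = None
    # token_path (a hypothetical file name) caches the tokens from an earlier run.
    if os.path.exists(token_path):
        creds = Credentials.from_authorized_user_file(token_path, SCOPES)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            # Silently refresh an expired token.
            creds.refresh(Request())
        else:
            # Opens the browser and shows the consent screen.
            flow = InstalledAppFlow.from_client_secrets_file(credentials_path, SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the tokens so later runs can skip the consent screen.
        with open(token_path, "w") as token_file:
            token_file.write(creds.to_json())

    return build("gmail", "v1", credentials=creds)
```

Calling gmail_service = build_gmail_service() once at the top of a script would then give the object used in the calls further down.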

Next, I wrote the Python script, which involved the following steps:

  1. Authenticate and authorise the app to read the user's Gmail inbox

  2. Get a listing of all emails

  3. For each message, get the metadata and add it to a dictionary of email sender to email count

  4. Once all email metadata has been downloaded, order the dictionary descending on the email count and print to screen

Gmail refers to emails as messages and the SDK gives you the option to get either messages or threads; for the purposes of finding top email senders I used messages, as threads are collections of emails handily grouped together and exist to represent individual conversations.

The first step to finding top senders is to get all emails, which you can do by calling gmail_service.users().messages().list(userId='me', q=query).execute(). This returns a paged collection of Message objects representing the matching emails, in which only the id and threadId fields are populated.
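Because the listing is paged, a single call only returns the first page. A minimal sketch of walking every page might look like the following; the function name list_message_ids is hypothetical, and the page size of 500 is an assumption about the largest value the API accepts.

```python
def list_message_ids(gmail_service, query=""):
    """Collect the ids of every message matching `query`, following all pages."""
    message_ids = []
    page_token = None  # None asks for the first page
    while True:
        response = (
            gmail_service.users()
            .messages()
            .list(userId="me", q=query, pageToken=page_token, maxResults=500)
            .execute()
        )
        # A page with no matches has no "messages" key at all.
        message_ids.extend(m["id"] for m in response.get("messages", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return message_ids
```

For example, list_message_ids(gmail_service, "in:inbox") would restrict the listing to the inbox.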

Next, for each of the message ids received, the message metadata can be retrieved by calling gmail_service.users().messages().get(userId='me', id=message_id, format='metadata').execute(). The response is still a fairly sparsely populated Message object, but we only need to parse the headers to find who the message is from. Once we have this, we add the email sender to our in-memory dictionary sender_count (or increment the count if the sender already exists) and move on to the next message.
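A sketch of that per-message step follows. The helper names get_sender and count_senders are hypothetical, asking only for the From header via metadataHeaders is an optional narrowing, and reducing the header to a lower-cased address with email.utils.parseaddr is my assumption about how senders should be grouped; the script itself may key on the raw header value.

```python
from collections import Counter
from email.utils import parseaddr


def get_sender(gmail_service, message_id):
    """Return the sender address of one message, based on its From header."""
    message = (
        gmail_service.users()
        .messages()
        .get(userId="me", id=message_id, format="metadata", metadataHeaders=["From"])
        .execute()
    )
    for header in message.get("payload", {}).get("headers", []):
        if header["name"].lower() == "from":
            # "Alice <alice@example.com>" becomes "alice@example.com".
            _, address = parseaddr(header["value"])
            # Fall back to the raw header if no address could be parsed.
            return address.lower() or header["value"]
    return "(unknown sender)"


def count_senders(gmail_service, message_ids):
    """Build the sender -> email count mapping, one API call per message."""
    sender_count = Counter()
    for message_id in message_ids:
        sender_count[get_sender(gmail_service, message_id)] += 1
    return sender_count
```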

Nuance of the Python SDK
One of the main nuances of the Python SDK for Gmail at present is that you cannot fetch the metadata for all emails in a single query (unless you make a REST call directly to the API and forfeit the Python SDK). This means making one call per message to get its metadata, potentially resulting in hundreds or thousands of calls. I haven't looked at the underlying SDK implementation to see if there's some smartness going on here (such as internal batching), but given how long the script took to run, I'd wager that there isn't.
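One thing that may be worth trying: the google-api-python-client library provides a generic batching helper, new_batch_http_request, which packs several individual get calls into one HTTP request. I have not measured whether it speeds this script up, so treat the following as an untested sketch. The function name get_senders_batched is hypothetical and the batch size of 50 is an assumption chosen to stay well within per-batch limits.

```python
def get_senders_batched(gmail_service, message_ids, batch_size=50):
    """Return {message_id: raw From header}, fetching metadata in batches."""
    senders = {}

    def on_response(request_id, response, exception):
        # Called once per message when the batch completes.
        if exception is not None:
            return  # failed messages are simply skipped in this sketch
        for header in response.get("payload", {}).get("headers", []):
            if header["name"].lower() == "from":
                senders[request_id] = header["value"]

    for start in range(0, len(message_ids), batch_size):
        batch = gmail_service.new_batch_http_request(callback=on_response)
        for message_id in message_ids[start:start + batch_size]:
            request = gmail_service.users().messages().get(
                userId="me", id=message_id, format="metadata", metadataHeaders=["From"]
            )
            # The message id doubles as the request id passed to the callback.
            batch.add(request, request_id=message_id)
        batch.execute()
    return senders
```

Each batch still counts as many individual API calls for quota purposes, so rate limits continue to apply.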

Finally, once all message metadata has been downloaded, it's time to sort the dictionary in descending order of email count: dict(sorted(sender_count.items(), key=lambda x: x[1], reverse=True)).
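An equivalent way to express the same ordering is collections.Counter.most_common, which also makes it easy to cap the output at the top few senders. This is an alternative to the sorted() call above, not what the script uses, and print_top_senders is a hypothetical helper name.

```python
from collections import Counter


def print_top_senders(sender_count, limit=20):
    """Print and return the `limit` senders with the most emails."""
    # most_common sorts by count, highest first.
    top = Counter(sender_count).most_common(limit)
    for sender, count in top:
        print(f"{count:>6}  {sender}")
    return top
```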

One of the optimisations I worked on was to serialise the output of the main steps (the initial call to list all messages and the complete dictionary of email sender to email count) to speed up future runs and let me quickly re-order the final output without making further API calls. As I manually deleted emails from Gmail using the output of this Python script, the underlying serialised data became outdated, but for the purposes of tinkering with the output it was useful regardless.
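As a rough sketch of that caching idea, assuming JSON files on disk (the script may well use a different format, and load_or_compute is a hypothetical helper):

```python
import json
import os


def load_or_compute(cache_path, compute):
    """Return the cached JSON value at `cache_path`, or compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as cache_file:
            return json.load(cache_file)
    # No cache yet: do the expensive work once and save the result.
    result = compute()
    with open(cache_path, "w", encoding="utf-8") as cache_file:
        json.dump(result, cache_file)
    return result
```

The expensive step is passed in as a function, so it only runs when there is no cache file. Deleting the file forces a fresh download, which is the way to refresh the data once it has gone stale.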

It is possible to delete emails using the Python SDK for Gmail, and this would have made for a much more automated and less time-consuming process, but I wanted to sense-check by hand the emails I was about to delete, so I didn't pursue this option.
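For anyone who does want to automate it, a cautious sketch could move a sender's emails to the bin rather than deleting them outright, with a dry-run mode on by default. This is not part of my script: trash_from_sender is a hypothetical helper, and it assumes the app has been granted a scope that allows modifying messages (for example gmail.modify) instead of read-only access.

```python
def trash_from_sender(gmail_service, sender, dry_run=True):
    """Move every message from `sender` to the bin; return the affected ids."""
    affected = []
    page_token = None
    # Collect all matching ids first, so paging is not disturbed by the trashing.
    while True:
        response = (
            gmail_service.users()
            .messages()
            .list(userId="me", q=f"from:{sender}", pageToken=page_token)
            .execute()
        )
        affected.extend(m["id"] for m in response.get("messages", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break

    if not dry_run:
        for message_id in affected:
            # trash is reversible from the Gmail UI, unlike a permanent delete.
            gmail_service.users().messages().trash(userId="me", id=message_id).execute()
    return affected
```

Running it with dry_run=True first and checking the returned ids keeps a manual sense check in the loop.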

You can find the full script available as a Gist at https://gist.github.com/karlbaker02/91fd7be091f7085bb2be54eab78796db.