Skip to content

Exploring your Gmail social network (Python)

10 January 2011

In Malcolm Gladwell’s bestseller The Tipping Point, he outlines a theory called “The Law of the Few” that identifies key types of people in social networks. One of the important ones is the connector. Connectors are hubs in the network. The idea is that, rather than society consisting of one large sparsely connected network, it consists of many smaller, densely connected networks, and these smaller networks are in turn connected together by the connectors forming a network of networks known as a “small-world network”. Most people know at least one connector, and it is through these key people and their fellow connectors that we are connected to the rest of the world.

An interesting experiment is to explore your own personal social network to identify the connectors. An easy way to get an idea of your network is from your email mailbox. If we assume that the sender and all of the recipients of each group email we send or receive know each other, we can scan our mailbox and build a graph with email addresses as the nodes and edges joining the email addresses of people who know each other. In a previous post, I showed how Gmail messages can be easily accessed from many programming languages using an SQLite driver. Using this, we can find our social network from our Gmail mailbox.

Generating the graph

First we need to connect to the Gmail SQLite database with the standard sqlite3 library in Python. You need to set up Gmail Offline and locate your Gmail data file, as described in my previous post. I also suggest that you increase the Gmail Offline recent message range and re-sync your messages so that you have a larger collection of messages to work with (instructions can be found here – I set mine to 3 months).

To represent and analyse the network, I have used the excellent Python NetworkX library from the Los Alamos National Lab, which you will have to install, as well as matplotlib to render the network figures.

The following function will read your Gmail messages and build a NetworkX Graph object:

import email.utils
import itertools
import networkx
import sqlite3

def get_email_graph(db, ignored_addresses=[]):

    cur = db.cursor()
    cur.execute("SELECT c4FromAddress, c5ToAddresses, c6CcAddresses, c7BccAddresses FROM MessagesFT_content")

    graph = networkx.Graph()

    for address_field in cur: # loop through from, to, cc and bcc address fields
        addresses = set(address for name, address in email.utils.getaddresses(address_field) if address != '') # parse the addresses
        if len(addresses) <= 2 or len(addresses) > 10:
            continue # ignore any emails with a single recipient or very large group emails
        for address in ignored_addresses:
            addresses.discard(address) # remove any ignored addresses
        graph.add_edges_from(itertools.combinations(addresses, 2))

    return graph

You can then call the function this to build your social network graph (remembering to change the path to the location of your Gmail SQLite database):

db = sqlite3.connect("C:/path/to/mail.google.com/http_80/myemail@gmail.com-GoogleMail#database[1]")
graph = get_email_graph(db)

Excluding yourself from the graph

The get_email_graph function above has a second parameter, ignored_addresses, which can be used for excluding particular email addresses from the graph. Because the social network is built from your own mailbox, you will be at the centre of the graph and connected to every other node, and will always appear to be a large connector. For this reason, I removed myself from the network to see how everyone else in the network is connected to one another.

In addition to my primary Gmail address, I have multiple additional email addresses that forward to my Gmail account. I have set up my Gmail to be able to send from these addresses (if you are not familiar with this feature, see this article). Gmail Offline stores your primary email address and any additional outgoing email addresses as JSON-encoded records in the DataArrays table. If you have Gmail set up with your additional email addresses, the function below will return a list of all of your email addresses so that you can exclude them from the graph:

import json
def get_my_addresses(db):
    cur = db.cursor()
    # fetch the primary Gmail address
    cur.execute("SELECT Value FROM DataArrays WHERE Type = 'ui'")
    json_str = cur.fetchone()[0]
    primary_address = json.loads(json_str)[1]
    # get any additional email addresses
    cur.execute("SELECT Value FROM DataArrays WHERE Type = 'cfs'")
    json_str = cur.fetchone()[0]
    additional_addresses = [address for name, address, misc1, misc2 in json.loads(json_str)[1]]
    return set([primary_address,] + additional_addresses)

You should then use this list when calling get_email_graph, like this:

db = sqlite3.connect("C:/path/to/mail.google.com/http_80/myemail@gmail.com-GoogleMail#database[1]")
ignored_addresses = get_my_addresses(db)
graph = get_email_graph(db, ignored_addresses)

If you have multiple email addresses but don’t have Gmail configured with all of your forwarded email addresses, you can manually specify all of your email addresses as a list and then pass it to the get_email_graph function as above.

Visualising the graph

Now that we have loaded our social network into a graph object, we can display it as a figure with the following code:

from matplotlib import pyplot
networkx.draw(graph, node_size=5, font_size=8, width=.2)
pyplot.show()

This is the output from my graph (with the email addresses obfuscated using a hash function):

Clearly we can see that, rather than a single network, it is composed of multiple separate connected components. In my case, it was immediately apparent that these components correspond to different groups of people that I regularly deal with – one is my friends and family, another my work colleagues, and the remaining two are classmates and staff at two different universities I am associated with. Above I have labelled what each component corresponds to.

We can use the connected_components function to extract these distinct components in order to plot and analyse them separately. The code below will loop through them generating two plots for each component – one using the default layout and the other using a circular layout – and with the nodes sized proportionally to the number of connections it has:

for connected_group in networkx.connected_components(graph):
    subgraph = networkx.subgraph(graph, connected_group)
    node_list, node_size = zip(*networkx.degree(subgraph).items())
    pyplot.figure()
    pyplot.subplot(121)
    networkx.draw(subgraph, nodelist=node_list, node_size=node_size, font_size=5, width=.2)
    pyplot.subplot(122)
    pyplot.axis('equal')
    networkx.draw_circular(subgraph, nodelist=node_list, node_size=node_size, font_size=8, width=.2)
pyplot.show()

Below are examples of two component plots generated for my social network. The first is my & family network and the second is my work network.

Friends & family network

Work network

Analysing your social network

The NetworkX library has a wealth of graph algorithms that can be used to calculate interesting statistics on the graphs, such as the shortest path between two nodes and various centrality measures. In the figure below, the friends and family network has been enlarged and the email address labels removed. We can see that the network contains a handful of large nodes with large numbers of connections, with the majority having only a few connections. These large nodes are clearly the connectors amongst my friends and family network that Gladwell discussed.

The number of connections an email address has is its degree, and can be calculated using the degree function. Below is a histogram of the node degrees:

The majority of the nodes have a degree of less than 20. The “six degrees of separation” theory claims that every person in the world is connected to every other person by a chain of no more than approximately six connections. In graph theory terms, this means that the maximum shortest path between any two nodes in the graph is no more than 6. My friends & family graph has a maximum shortest path length of 5 and an average of 2.1 (calculated with the shortest_path_length function), which seems to agree with this theory.

Closing thoughts

Social networking websites such as Friendster and LinkedIn have leveraged the idea of “six degrees of separation” by allowing you to build your social network by connecting to your friends’ connections. Calculating the degrees of separation has been solved by multiple well-studied algorithms. However, these algorithms often do not scale well – for instance, the Floyd-Warshall algorithm has O(n3) time and O(n2) space complexity. Implementing an algorithm to find the n-level network of each person to every other person in a parallel and dynamic manner and on an extremely large network consisting of many thousands or millions of nodes, as must be the case for many social networking websites, is certainly an interesting and challenging problem – one which I plan to write a future article on.

The approach I have used represents the social network as an unweighted and undirected graph. One extension to the approach above could be to use a weighted graph, where the weights are the total number of emails between each pair of people. This would have the advantage of giving a higher weighting to connections between people who frequently communicate and would prevent a single large group email from giving all of its recipients a high degree. Another extension could be to use a directed graph, where each node points to the node it received the email from. This would result in the edges tending to point to frequent communicators, and an algorithm such as PageRank (which is implemented in NetworkX) could be used to analyse the most “important” nodes.

Advertisements
3 Comments
  1. shotgunapproach permalink

    Wow, tremendously informative post. NetworkX looks very useful. Thanks!

  2. thiemehennis permalink

    very informative. I put it on my todo list and hope I have time for it one day ( I want to learn both programming and social network analysis, and this seems a good assignment for myself).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: