Skip to content

Accessing your Gmail messages in MATLAB

16 December 2009

Lately I’ve been working on a Bayesian spam filter. There are spam email databases available for training and testing spam filters (e.g. HP’s Spambase), but I wanted to try it out on my own real-world email messages from my Gmail mailbox. In this post, I’ll show you how to access your Gmail data using the MATLAB Database Toolbox in Windows.

Gmail Offline is an interesting feature of Gmail that allows you to syncronise your email to your computer so you can access your Gmail messages while you are not connected to the internet. Gmail Offline is based on Google Gears, an open source library developed by Google to allow web applications to function offline. As it turns out, Gmail Offline stores all of the synchronised data using an SQLite database on the client computer. This means we can easily access our synced Gmail programmatically using standard database connectors (unlike Outlook’s proprietary .pst format, for example).

The first step is to locate the Gmail SQLite data file on your computer. Depending on your operating system and browser, the data files can be stored in different locations on your computer. The Google help page does not specify exactly which file within the Google Gears folders contains the actual email messages, and the file itself does not have an .sqlite extension. Just search around the sub-directories until you find the largest file (assuming you have synced a reasonable number of emails). I found my data file at the following location within the Google Gears directory, where “myemail” is my Google username:

\mail.google.com\http_80\myemail@gmail.com-GoogleMail#database[1]

Once you have found your data file, the next step is to install the database driver:

  1. Download and install the SQLite ODBC driver.
  2. Open the ODBC Data Source Administrator tool: Start → Run → type "odbcad32"
  3. Add a new User DSN using the “SQLite3 ODBC Driver”, set the Data Source Name to “gmail” and set the Database Name to the location of your Gmail Offline data file

Now we’re ready to use it in MATLAB. The first step is to open a connection to the database:

% Connect to the database
conn = database('gmail', '', '');
if ~isconnection(conn)
	error(conn.Message)
end

Assuming it connected successfully, we are ready to execute our queries. We will fetch the message ID, timestamp and a message “snippet” for each email and store it in the data variable:

% Fetch message data
e = exec(conn, 'SELECT MessageId,Timestamp,CAST(SnippetHtml as VARCHAR) FROM Messages LIMIT 100');
e = fetch(e);
data = e.Data;
close(e);

You may have noticed the type cast in the SQL query above. For some reason, I was not able to fetch the data from any columns of type “text”; they were always coming back as empty arrays, even though I confirmed they did contain data using SQLite Manager, and I was not able to find any solutions on the internet. I solved the problem simply by casting any “text” columns to “varchar”, but there may be better ways of getting around this.

Finally, we should always disconnect gracefully from the database:

% Disconnect from the database
close(conn);

I was not able to find any descriptions of what the various database tables in the Gmail Offline database contain or what their relationships are. The contents of some of them, like the “Attachments” or “Contacts” tables are obvious; others not so much.

To obtain the schema of the database, run the following:

% Fetch schema information
e = exec(conn, 'SELECT CAST(sql as VARCHAR) FROM SQLITE_MASTER WHERE type = ''table'' ORDER BY name');
e = fetch(e);
data = e.Data;
close(e);

% Display the results
fprintf('%s\n\n', data{:});

The MATLAB Database Toolbox is required to perform all of the above, and unfortunately is not included in MATLAB Student Version. However, many other languages have other similarly easy methods of accessing SQLite databases. Many languages have support for ODBC or you can sometimes use language-specific drivers. For instance, Perl provides the DBD::SQLite driver. Below is some example Perl code to obtain the database schema using DBD::SQLite; don’t forget to update the $dbFile variable to point to the location of your own Gmail data file.

use strict;
use DBI;

# Connect to the database
my $dbFile = 'myemail@gmail.com-GoogleMail#database[1]';
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbFile","","");

# Fetch schema information
my $sth = $dbh->prepare("SELECT sql FROM SQLITE_MASTER WHERE type = 'table' ORDER BY name");
$sth->execute or die $sth->errstr;

# Display the results
while(my $row = $sth->fetch) {
	print $row->[0], "\n\n";
}

# Disconnect from the database
$dbh->disconnect;

In a future post I will give some details about the spam filter that I used this for. I also plan to use this functionality for some interesting data mining experiments, so stay tuned!

Advertisements

From → Data mining, MATLAB, Perl

10 Comments
  1. tim permalink

    Cool stuff. One question that’s been nagging me is how the attachment files on the file system relate to the rows in the attachments table. I’m not sure what the numbers between the brackets are for? I can see which attachment belongs to which message in the attachments table but I can’t figure out exactly which file the attachment row belongs to…

    • Alister permalink

      I’m not really sure – the numbers don’t seem to correspond to anything in the database. I’d love to know if you find out though.

  2. I read a article under the same title some time ago, but this articles quality is much, much better. How you do this?

  3. So, this is really great, but how do I access the whole email — I only get a snippet of 80 characters! Help! I think the problem is here :::

    e = exec(conn, ‘SELECT MessageId,Timestamp,CAST(SnippetHtml as VARCHAR) FROM Messages LIMIT 100’);

    does this limit the third field containing the message to 80 characters? “CAST(SnippetHtml as VARCHAR)”

    • I think the SnippetHtml column only stores the first 80 characters of each message because it is probably used for message previews. If you are after the full body of the email, take a look at the MessagesFT_content table. Something like this should work:

      SELECT c1Body FROM MessagesFT_content LIMIT 100

  4. Heya i’m for the first time here. I came across this board and I find It truly useful & it helped me out a lot. I hope to give something back and help others like you aided me.

  5. Hello There. I found your blog the use of msn.
    That is an extremely neatly written article.
    I will make sure to bookmark it and come back to read more of your useful information.
    Thanks for the post. I’ll definitely comeback.

  6. It’s the best time to make some plans for the long run and it
    is time to be happy. I’ve learn this put up and if
    I may just I desire to counsel you some attention-grabbing things or advice.
    Maybe you can write next articles regarding this article. I desire to read more things approximately
    it!

  7. I am regular reader, how are you everybody?
    This piece of writing posted at this website is really fastidious.

  8. It’s hard to find educated people for this topic, butt you seem like you
    know what you’re talking about! Thanks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: