Skip to content

Mapping urban noise levels using smartphones

Participatory sensingThe Sony Lab in Paris recently released a free smartphone app called NoiseTube which uses your smartphone’s microphone and GPS to measure noise levels as you walk around. This data is combined with data collected from other users in order to plot the current noise levels on a city map, a technique dubbed “participatory sensing”. Anyone can sign up to download the application and contribute data. I doubt it will really take off, but either way it’s an interesting concept that makes very clever use of crowdsourcing and repurposing of existing technology. The goal is to meet EU requirements of member countries to periodically measure noise pollution levels, but the website is open to users from any country. Currently you can view noise data for individual users (making your data public is optional) and you can download Google Earth KML data for various cities, but I’d love to see someone create a Google Maps mashup of this!

Configuring TortoiseCVS to use TortoiseMerge

I’m a big fan of TortoiseSVN for working with Subversion repositories in Windows. Recently I’ve had to work on a software project using a CVS repository, so I naturally decided to use TortoiseCVS. TortoiseSVN was originally inspired by TortoiseCVS, but despite the similarity in the names, they have no affiliation and function quite differently.

One of the first things that I noticed was that TortoiseCVS does not seem to come bundled with a Windows diff tool, unlike the handy TortoiseMerge that comes with TortoiseSVN. The TortoiseCVS FAQ recommends a couple of third-party diff tools which integrate with it quite nicely. However, if you already have TortoiseSVN installed, there’s an easy way to configure TortoiseCVS to use TortoiseMerge.

First, open TortoiseCVS’s settings by right-clicking on the desktop, and then “CVS” > “Preferences…”:

Click on the “Tools” tab. Under “Diff application”, browse to the TortoiseMerge.exe executable, which is in the TortoiseSVN bin folder. In my installation, this was:

C:\Program Files\TortoiseSVN\bin\TortoiseMerge.exe

For “two-way diff parameters”, enter the following:

/base:"%1" /mine:"%2"

Click OK and that’s it! The image below shows what your preferences should look like:

Now you can right-click on any modified text files in a checked out CVS repository, click “CVS Diff” and it will fire up TortoiseMerge to show you the differences between your local modified copy and the last commited version in the repository.

Note: You should also be able to use TortoiseMerge as your TortoiseCVS merge application too. I haven’t tested this out, but the “two-way merge parameters” should be similar to those used above for the diff application.

Parsing English numbers with Perl

Note: The problem described here has already been solved with libraries such as Lingua::EN::FindNumber and Lingua::EN::Words2Nums. For production software, I’d recommend you look at using those modules instead of re-inventing the wheel. This article is only intended for those interested in learning how this type of parsing works.

In a project I was recently working on, there was a need to perform named entity recognition on natural language text fields in order to convert natural language numbers in the field, such as “two hundred and thirteen”, into their numerical values (e.g. 213). Google does a great job of this; for example, try out this search to convert a string to a number, and the reverse. In this post, I’ll discuss how this conversion functionality can be achieved with the nifty Perl recursive descent parser generator Parser::RecDescent.

The parsing is a two-step process. First, each of the number words need to be matched and their values looked up in a dictionary. For example, the word “two” needs to be matched and recognised as “2”, “hundred” as “100” and “thirteen” as “13”. In parsing parlance, this step is known as lexical analysis. Second, we need to calculate the total number that the entire sentence represents by accumulating all of the matched values. This is known as syntactic analysis.

Consider the following input string:

“two hundred and seven million thirteen thousand two hundred and ninety eight”

The first step is to tokenise the string and remove the word “and” anywhere it appears. (Breaking a number up with the word “and” is a British English convention; American English usually omits the instances of “and”.) We then end up with the following list of token values:

“two”, “hundred”, “seven”, “million”, “thirteen”, “thousand”, “two”, “hundred”, “ninety”, “eight”

If we convert each token value into its numeric equivalent, this becomes:

2, 100, 7, 1000000, 13, 1000, 2, 100, 90, 8

Finally, in order to find the total, we calculate:

((2 × 100 + 7) × 1000000) + (13 × 1000) + (2 × 100 + 90 + 8) = 207,013,298

This matching and conversion is achieved with a parser generator. A parser generator takes a formal grammar as its input and outputs a parse tree, which is an abstract representation of the input text. The grammar refers to the rules for expressing numbers in English.

Syntactic analysis

For the syntactic analysis, I based my approach on an excellent discussion on this topic on Stackoverflow. One of the posters suggested a very simple algorithm to perform the calculation. I found the provided pseudocode a little confusing, so here is my own version:

total = 0, prior = 0

for each word in sentence

   value = dictionary[word]

   if prior = 0
      prior = value
   else if prior > value
      prior = prior + value
   else
      prior = prior * value

   if value >= 1000 or last word in sentence
      total = total + prior
      prior = 0

This algorithm works by retaining two variables, prior and total. The prior variable stores the current value of the current order of magnitude; how many billions, millions or thousands. This is then added back to the total when we step down an order of magnitude. The table below shows the algorithm in action for the input string of “two hundred and seven million thirteen thousand two hundred and ninety eight”.

Word Value Prior Total
0 0
two 2 2 0
hundred 100 200 0
seven 7 207 0
million 1,000,000 0 207,000,000
thirteen 13 13 207,000,000
thousand 1,000 0 207,013,000
two 2 2 207,013,000
hundred 100 200 207,013,000
ninety 90 290 207,013,000
eight 8 298 207,013,000
0 207,013,298

Lexical analysis

Lexical analysis involves defining a grammar for the formation of English words and matching an input string to this grammar. A simplistic approach is to define a dictionary of all possible number words, such as “two”, “hundred” and “million”, and then match any string if it contains only these words. If we use this approach, the algorithm for syntactic analysis described above will still work, however the lexical analysis stage will match invalid sentences that don’t mean anything in English, such as “hundred thirteen seven”, and feed these into the syntactic analyser with unpredictable results.

A more intelligent, but more complicated approach is to ensure that the English number words appear in a valid sequence. This can be defined by the grammar. An example of such a grammar can be found in the appendix of a paper by Stuart Shieber. Inspired by this approach, I wrote the following grammar, which matches numbers smaller than one billion (109):

<number> ::= ((number_1to999 number_1e6)? (number_1to999 number_1e3)? number_1to999?) | number_0

<number_0> ::= "zero"
<number_1to9> ::= "one" | "two" | "three" | "four" | "five" | "six" | "seven" | "eight" | "nine"
<number_10to19> ::= "ten" | "eleven" | "twelve" | "thirteen" | "fourteen" | "fifteen"
	| "sixteen" | "seventeen" | "eighteen" | "nineteen"
<number_1to999> ::= (number_1to9? number_100)? (number_1to9 | number_10to19 | (number_tens number_1to9))?
<number_tens> ::= "twenty" | "thirty" | "fourty" | "fifty" | "sixty" | "seventy" | "eighty" | "ninety"
<number_100> ::= "hundred"
<number_1e3> ::= "thousand"
<number_1e6> ::= "million"

To visualise what it is doing, the syntax diagram below shows the main parts of the grammar (generated using Franz Braun’s CGI diagram generator):

Number parser syntax diagram

Wrapping it all up

The Perl implementation is shown below. There are a few caveats to this implementation (many of them are identified in the Stackoverflow discussion). Because it simply discards the word “and” anywhere in the sentence, it doesn’t distinguish between separate numbers; for example, “twenty and five” will be treated as “twenty five”. The implementation only recognises numbers up to the millions; if it were extended to billions and above, it would need some method of dealing with short and long scales. Furthermore, it only accepts integers and doesn’t accept ordinals. It also does not support vernacular forms of numbers, such as “fifteen hundred”, “three-sixty-five”, “a hundred” or “one point two million” (these are other nuances in English numerals can be found here).

#!/usr/bin/perl

use strict;
use Parse::RecDescent;

my $sentence = $ARGV[0] || die "Must pass an argument";

# Define the grammar
$::RD_AUTOACTION = q { [@item] };
my $parser = Parse::RecDescent->new(q(

	startrule : (number_1to999 number_1e6)(?) (number_1to999 number_1e3)(?) number_1to999(?) | number_0
	number_0: "zero" {0}
	number_1to9: "one" {1} | "two" {2} | "three" {3} | "four" {4} | "five" {5} | "six" {6} | "seven" {7}
		| "eight" {8} | "nine" {9}
	number_10to19: "ten" {10} | "eleven" {11} | "twelve" {12} | "thirteen" {13} | "fourteen" {14}
		| "fifteen" {15} | "sixteen" {16} | "seventeen" {17} | "eighteen" {18} | "nineteen" {19}
	number_1to999: (number_1to9(?) number_100)(?)
		(number_1to9 | number_10to19 | number_10s number_1to9(?))(?)
	number_10s: "twenty" {20} | "thirty" {30} | "fourty" {40} | "fifty" {50} | "sixty" {60} |
		"seventy" {70} | "eighty" {80} | "ninety" {90}
	number_100: "hundred" {100}
	number_1e3: "thousand" {1e3}
	number_1e6: "million" {1e6}

));

# Perform lexical analysis
$sentence =~ s/(\W)and(\W)/$1$2/gi; #remove the word "and"
my $parseTree = $parser->startrule(lc $sentence);

# Perform syntactic analysis
my @numbers = flattenParseTree($parseTree); # flatten the tree to a sequence of numbers
my $number = combineNumberSequence(\@numbers); # process the sequence of numbers to find the total

print $number;

sub flattenParseTree($) {

	my $parseTree = shift || return;
	my @tokens = ();
	if(UNIVERSAL::isa( $parseTree, "ARRAY")) {
		push(@tokens, flattenParseTree($_)) foreach(@{$parseTree});
	} elsif($parseTree > 0) {
		return $parseTree;
	}
	return @tokens;

}

sub combineNumberSequence($) {

	my $numbers = shift || return;
	my $prior = 0;
	my $total = 0;

	for(my $i=0; $i <= $#$numbers; $i++) {
 		if($prior == 0) {
 			$prior = $numbers->[$i];
		} elsif($prior > $numbers->[$i]) {
			$prior += $numbers->[$i];
		} else {
			$prior *= $numbers->[$i];
		}

		if(($numbers->[$i] >= 1e3) || ($i == $#$numbers)) {
			$total += $prior;
			$prior = 0;
		}

	}

	return $total;

}

Accessing your Gmail messages in MATLAB

Lately I’ve been working on a Bayesian spam filter. There are spam email databases available for training and testing spam filters (e.g. HP’s Spambase), but I wanted to try it out on my own real-world email messages from my Gmail mailbox. In this post, I’ll show you how to access your Gmail data using the MATLAB Database Toolbox in Windows.

Gmail Offline is an interesting feature of Gmail that allows you to syncronise your email to your computer so you can access your Gmail messages while you are not connected to the internet. Gmail Offline is based on Google Gears, an open source library developed by Google to allow web applications to function offline. As it turns out, Gmail Offline stores all of the synchronised data using an SQLite database on the client computer. This means we can easily access our synced Gmail programmatically using standard database connectors (unlike Outlook’s proprietary .pst format, for example).

The first step is to locate the Gmail SQLite data file on your computer. Depending on your operating system and browser, the data files can be stored in different locations on your computer. The Google help page does not specify exactly which file within the Google Gears folders contains the actual email messages, and the file itself does not have an .sqlite extension. Just search around the sub-directories until you find the largest file (assuming you have synced a reasonable number of emails). I found my data file at the following location within the Google Gears directory, where “myemail” is my Google username:

\mail.google.com\http_80\myemail@gmail.com-GoogleMail#database[1]

Once you have found your data file, the next step is to install the database driver:

  1. Download and install the SQLite ODBC driver.
  2. Open the ODBC Data Source Administrator tool: Start → Run → type "odbcad32"
  3. Add a new User DSN using the “SQLite3 ODBC Driver”, set the Data Source Name to “gmail” and set the Database Name to the location of your Gmail Offline data file

Now we’re ready to use it in MATLAB. The first step is to open a connection to the database:

% Connect to the database
conn = database('gmail', '', '');
if ~isconnection(conn)
	error(conn.Message)
end

Assuming it connected successfully, we are ready to execute our queries. We will fetch the message ID, timestamp and a message “snippet” for each email and store it in the data variable:

% Fetch message data
e = exec(conn, 'SELECT MessageId,Timestamp,CAST(SnippetHtml as VARCHAR) FROM Messages LIMIT 100');
e = fetch(e);
data = e.Data;
close(e);

You may have noticed the type cast in the SQL query above. For some reason, I was not able to fetch the data from any columns of type “text”; they were always coming back as empty arrays, even though I confirmed they did contain data using SQLite Manager, and I was not able to find any solutions on the internet. I solved the problem simply by casting any “text” columns to “varchar”, but there may be better ways of getting around this.

Finally, we should always disconnect gracefully from the database:

% Disconnect from the database
close(conn);

I was not able to find any descriptions of what the various database tables in the Gmail Offline database contain or what their relationships are. The contents of some of them, like the “Attachments” or “Contacts” tables are obvious; others not so much.

To obtain the schema of the database, run the following:

% Fetch schema information
e = exec(conn, 'SELECT CAST(sql as VARCHAR) FROM SQLITE_MASTER WHERE type = ''table'' ORDER BY name');
e = fetch(e);
data = e.Data;
close(e);

% Display the results
fprintf('%s\n\n', data{:});

The MATLAB Database Toolbox is required to perform all of the above, and unfortunately is not included in MATLAB Student Version. However, many other languages have other similarly easy methods of accessing SQLite databases. Many languages have support for ODBC or you can sometimes use language-specific drivers. For instance, Perl provides the DBD::SQLite driver. Below is some example Perl code to obtain the database schema using DBD::SQLite; don’t forget to update the $dbFile variable to point to the location of your own Gmail data file.

use strict;
use DBI;

# Connect to the database
my $dbFile = 'myemail@gmail.com-GoogleMail#database[1]';
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbFile","","");

# Fetch schema information
my $sth = $dbh->prepare("SELECT sql FROM SQLITE_MASTER WHERE type = 'table' ORDER BY name");
$sth->execute or die $sth->errstr;

# Display the results
while(my $row = $sth->fetch) {
	print $row->[0], "\n\n";
}

# Disconnect from the database
$dbh->disconnect;

In a future post I will give some details about the spam filter that I used this for. I also plan to use this functionality for some interesting data mining experiments, so stay tuned!

Hello world!

After months of contemplating and procrastinating, I have finally decided to start my own blog!

First a little about myself: I am an Australian software engineer who has been working in the field of image processing for the past few years. Like many developers, it’s a passion as well as a job, and I enjoy spending time working on my own various hobby projects and reading and participating on websites like Slashdot, Stackoverflow and Joel on Software. I read a number of excellent software-related blogs and they have been the inspiration for me to start my own.

This blog will mainly be about things that interest me and that I read about in my spare time: software development techniques and project management; algorithms, particularly in the domain of image/signal processing and machine learning; and book reviews and technological news that interests me.  I work in research, so I’ll also discuss conferences I attend and interesting papers I’ve read or written. Currently my main development languages are C++, C#, Java, MATLAB and Perl, but this changes regularly and I always enjoy learning new languages.

I hope to update my blog at least once a month, but hopefully more if I can find the time. So feel free to drop me a line if you have any comments or feedback, and I hope you enjoy reading my blog. Watch this space!