Skip to content

Parsing English numbers with Perl

2 January 2010

Note: The problem described here has already been solved with libraries such as Lingua::EN::FindNumber and Lingua::EN::Words2Nums. For production software, I’d recommend you look at using those modules instead of re-inventing the wheel. This article is only intended for those interested in learning how this type of parsing works.

In a project I was recently working on, there was a need to perform named entity recognition on natural language text fields in order to convert natural language numbers in the field, such as “two hundred and thirteen”, into their numerical values (e.g. 213). Google does a great job of this; for example, try out this search to convert a string to a number, and the reverse. In this post, I’ll discuss how this conversion functionality can be achieved with the nifty Perl recursive descent parser generator Parser::RecDescent.

The parsing is a two-step process. First, each of the number words need to be matched and their values looked up in a dictionary. For example, the word “two” needs to be matched and recognised as “2”, “hundred” as “100” and “thirteen” as “13”. In parsing parlance, this step is known as lexical analysis. Second, we need to calculate the total number that the entire sentence represents by accumulating all of the matched values. This is known as syntactic analysis.

Consider the following input string:

“two hundred and seven million thirteen thousand two hundred and ninety eight”

The first step is to tokenise the string and remove the word “and” anywhere it appears. (Breaking a number up with the word “and” is a British English convention; American English usually omits the instances of “and”.) We then end up with the following list of token values:

“two”, “hundred”, “seven”, “million”, “thirteen”, “thousand”, “two”, “hundred”, “ninety”, “eight”

If we convert each token value into its numeric equivalent, this becomes:

2, 100, 7, 1000000, 13, 1000, 2, 100, 90, 8

Finally, in order to find the total, we calculate:

((2 × 100 + 7) × 1000000) + (13 × 1000) + (2 × 100 + 90 + 8) = 207,013,298

This matching and conversion is achieved with a parser generator. A parser generator takes a formal grammar as its input and outputs a parse tree, which is an abstract representation of the input text. The grammar refers to the rules for expressing numbers in English.

Syntactic analysis

For the syntactic analysis, I based my approach on an excellent discussion on this topic on Stackoverflow. One of the posters suggested a very simple algorithm to perform the calculation. I found the provided pseudocode a little confusing, so here is my own version:

total = 0, prior = 0

for each word in sentence

   value = dictionary[word]

   if prior = 0
      prior = value
   else if prior > value
      prior = prior + value
   else
      prior = prior * value

   if value >= 1000 or last word in sentence
      total = total + prior
      prior = 0

This algorithm works by retaining two variables, prior and total. The prior variable stores the current value of the current order of magnitude; how many billions, millions or thousands. This is then added back to the total when we step down an order of magnitude. The table below shows the algorithm in action for the input string of “two hundred and seven million thirteen thousand two hundred and ninety eight”.

Word Value Prior Total
0 0
two 2 2 0
hundred 100 200 0
seven 7 207 0
million 1,000,000 0 207,000,000
thirteen 13 13 207,000,000
thousand 1,000 0 207,013,000
two 2 2 207,013,000
hundred 100 200 207,013,000
ninety 90 290 207,013,000
eight 8 298 207,013,000
0 207,013,298

Lexical analysis

Lexical analysis involves defining a grammar for the formation of English words and matching an input string to this grammar. A simplistic approach is to define a dictionary of all possible number words, such as “two”, “hundred” and “million”, and then match any string if it contains only these words. If we use this approach, the algorithm for syntactic analysis described above will still work, however the lexical analysis stage will match invalid sentences that don’t mean anything in English, such as “hundred thirteen seven”, and feed these into the syntactic analyser with unpredictable results.

A more intelligent, but more complicated approach is to ensure that the English number words appear in a valid sequence. This can be defined by the grammar. An example of such a grammar can be found in the appendix of a paper by Stuart Shieber. Inspired by this approach, I wrote the following grammar, which matches numbers smaller than one billion (109):

<number> ::= ((number_1to999 number_1e6)? (number_1to999 number_1e3)? number_1to999?) | number_0

<number_0> ::= "zero"
<number_1to9> ::= "one" | "two" | "three" | "four" | "five" | "six" | "seven" | "eight" | "nine"
<number_10to19> ::= "ten" | "eleven" | "twelve" | "thirteen" | "fourteen" | "fifteen"
	| "sixteen" | "seventeen" | "eighteen" | "nineteen"
<number_1to999> ::= (number_1to9? number_100)? (number_1to9 | number_10to19 | (number_tens number_1to9))?
<number_tens> ::= "twenty" | "thirty" | "fourty" | "fifty" | "sixty" | "seventy" | "eighty" | "ninety"
<number_100> ::= "hundred"
<number_1e3> ::= "thousand"
<number_1e6> ::= "million"

To visualise what it is doing, the syntax diagram below shows the main parts of the grammar (generated using Franz Braun’s CGI diagram generator):

Number parser syntax diagram

Wrapping it all up

The Perl implementation is shown below. There are a few caveats to this implementation (many of them are identified in the Stackoverflow discussion). Because it simply discards the word “and” anywhere in the sentence, it doesn’t distinguish between separate numbers; for example, “twenty and five” will be treated as “twenty five”. The implementation only recognises numbers up to the millions; if it were extended to billions and above, it would need some method of dealing with short and long scales. Furthermore, it only accepts integers and doesn’t accept ordinals. It also does not support vernacular forms of numbers, such as “fifteen hundred”, “three-sixty-five”, “a hundred” or “one point two million” (these are other nuances in English numerals can be found here).

#!/usr/bin/perl

use strict;
use Parse::RecDescent;

my $sentence = $ARGV[0] || die "Must pass an argument";

# Define the grammar
$::RD_AUTOACTION = q { [@item] };
my $parser = Parse::RecDescent->new(q(

	startrule : (number_1to999 number_1e6)(?) (number_1to999 number_1e3)(?) number_1to999(?) | number_0
	number_0: "zero" {0}
	number_1to9: "one" {1} | "two" {2} | "three" {3} | "four" {4} | "five" {5} | "six" {6} | "seven" {7}
		| "eight" {8} | "nine" {9}
	number_10to19: "ten" {10} | "eleven" {11} | "twelve" {12} | "thirteen" {13} | "fourteen" {14}
		| "fifteen" {15} | "sixteen" {16} | "seventeen" {17} | "eighteen" {18} | "nineteen" {19}
	number_1to999: (number_1to9(?) number_100)(?)
		(number_1to9 | number_10to19 | number_10s number_1to9(?))(?)
	number_10s: "twenty" {20} | "thirty" {30} | "fourty" {40} | "fifty" {50} | "sixty" {60} |
		"seventy" {70} | "eighty" {80} | "ninety" {90}
	number_100: "hundred" {100}
	number_1e3: "thousand" {1e3}
	number_1e6: "million" {1e6}

));

# Perform lexical analysis
$sentence =~ s/(\W)and(\W)/$1$2/gi; #remove the word "and"
my $parseTree = $parser->startrule(lc $sentence);

# Perform syntactic analysis
my @numbers = flattenParseTree($parseTree); # flatten the tree to a sequence of numbers
my $number = combineNumberSequence(\@numbers); # process the sequence of numbers to find the total

print $number;

sub flattenParseTree($) {

	my $parseTree = shift || return;
	my @tokens = ();
	if(UNIVERSAL::isa( $parseTree, "ARRAY")) {
		push(@tokens, flattenParseTree($_)) foreach(@{$parseTree});
	} elsif($parseTree > 0) {
		return $parseTree;
	}
	return @tokens;

}

sub combineNumberSequence($) {

	my $numbers = shift || return;
	my $prior = 0;
	my $total = 0;

	for(my $i=0; $i <= $#$numbers; $i++) {
 		if($prior == 0) {
 			$prior = $numbers->[$i];
		} elsif($prior > $numbers->[$i]) {
			$prior += $numbers->[$i];
		} else {
			$prior *= $numbers->[$i];
		}

		if(($numbers->[$i] >= 1e3) || ($i == $#$numbers)) {
			$total += $prior;
			$prior = 0;
		}

	}

	return $total;

}
Advertisements
One Comment
  1. Excellent web site. Plenty of helpful information here.
    I’m sending it to a few pals ans additionally sharing in delicious.
    And of course, thank you to your sweat!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: