# Parsing English numbers with Perl

**Note:** The problem described here has already been solved with libraries such as Lingua::EN::FindNumber and Lingua::EN::Words2Nums. For production software, I’d recommend you look at using those modules instead of re-inventing the wheel. *This article is only intended for those interested in learning how this type of parsing works.*

In a project I was recently working on, there was a need to perform named entity recognition on natural language text fields in order to convert natural language numbers in the field, such as “two hundred and thirteen”, into their numerical values (e.g. 213). Google does a great job of this; for example, try out this search to convert a string to a number, and the reverse. In this post, I’ll discuss how this conversion functionality can be achieved with the nifty Perl recursive descent parser generator Parser::RecDescent.

The parsing is a two-step process. First, each of the number words need to be matched and their values looked up in a dictionary. For example, the word “two” needs to be matched and recognised as “2”, “hundred” as “100” and “thirteen” as “13”. In parsing parlance, this step is known as *lexical analysis*. Second, we need to calculate the total number that the entire sentence represents by accumulating all of the matched values. This is known as *syntactic analysis*.

Consider the following input string:

“two hundred and seven million thirteen thousand two hundred and ninety eight”

The first step is to tokenise the string and remove the word “and” anywhere it appears. (Breaking a number up with the word “and” is a British English convention; American English usually omits the instances of “and”.) We then end up with the following list of token values:

“two”, “hundred”, “seven”, “million”, “thirteen”, “thousand”, “two”, “hundred”, “ninety”, “eight”

If we convert each token value into its numeric equivalent, this becomes:

2, 100, 7, 1000000, 13, 1000, 2, 100, 90, 8

Finally, in order to find the total, we calculate:

((2 × 100 + 7) × 1000000) + (13 × 1000) + (2 × 100 + 90 + 8) = 207,013,298

This matching and conversion is achieved with a parser generator. A parser generator takes a formal grammar as its input and outputs a parse tree, which is an abstract representation of the input text. The grammar refers to the rules for expressing numbers in English.

### Syntactic analysis

For the syntactic analysis, I based my approach on an excellent discussion on this topic on Stackoverflow. One of the posters suggested a very simple algorithm to perform the calculation. I found the provided pseudocode a little confusing, so here is my own version:

total = 0, prior = 0 for each word in sentence value = dictionary[word] if prior = 0 prior = value else if prior > value prior = prior + value else prior = prior * value if value >= 1000 or last word in sentence total = total + prior prior = 0

This algorithm works by retaining two variables, `prior`

and `total`

. The `prior`

variable stores the current value of the current order of magnitude; how many billions, millions or thousands. This is then added back to the `total`

when we step down an order of magnitude. The table below shows the algorithm in action for the input string of “two hundred and seven million thirteen thousand two hundred and ninety eight”.

Word | Value | Prior | Total |
---|---|---|---|

– | – | 0 | 0 |

two | 2 | 2 | 0 |

hundred | 100 | 200 | 0 |

seven | 7 | 207 | 0 |

million | 1,000,000 | 0 | 207,000,000 |

thirteen | 13 | 13 | 207,000,000 |

thousand | 1,000 | 0 | 207,013,000 |

two | 2 | 2 | 207,013,000 |

hundred | 100 | 200 | 207,013,000 |

ninety | 90 | 290 | 207,013,000 |

eight | 8 | 298 | 207,013,000 |

– | – | 0 | 207,013,298 |

### Lexical analysis

Lexical analysis involves defining a grammar for the formation of English words and matching an input string to this grammar. A simplistic approach is to define a dictionary of all possible number words, such as “two”, “hundred” and “million”, and then match any string if it contains only these words. If we use this approach, the algorithm for syntactic analysis described above will still work, however the lexical analysis stage will match invalid sentences that don’t mean anything in English, such as “hundred thirteen seven”, and feed these into the syntactic analyser with unpredictable results.

A more intelligent, but more complicated approach is to ensure that the English number words appear in a valid sequence. This can be defined by the grammar. An example of such a grammar can be found in the appendix of a paper by Stuart Shieber. Inspired by this approach, I wrote the following grammar, which matches numbers smaller than one billion (10^{9}):

<number> ::= ((number_1to999 number_1e6)? (number_1to999 number_1e3)? number_1to999?) | number_0 <number_0> ::= "zero" <number_1to9> ::= "one" | "two" | "three" | "four" | "five" | "six" | "seven" | "eight" | "nine" <number_10to19> ::= "ten" | "eleven" | "twelve" | "thirteen" | "fourteen" | "fifteen" | "sixteen" | "seventeen" | "eighteen" | "nineteen" <number_1to999> ::= (number_1to9? number_100)? (number_1to9 | number_10to19 | (number_tens number_1to9))? <number_tens> ::= "twenty" | "thirty" | "fourty" | "fifty" | "sixty" | "seventy" | "eighty" | "ninety" <number_100> ::= "hundred" <number_1e3> ::= "thousand" <number_1e6> ::= "million"

To visualise what it is doing, the syntax diagram below shows the main parts of the grammar (generated using Franz Braun’s CGI diagram generator):

### Wrapping it all up

The Perl implementation is shown below. There are a few caveats to this implementation (many of them are identified in the Stackoverflow discussion). Because it simply discards the word “and” anywhere in the sentence, it doesn’t distinguish between separate numbers; for example, “twenty and five” will be treated as “twenty five”. The implementation only recognises numbers up to the millions; if it were extended to billions and above, it would need some method of dealing with short and long scales. Furthermore, it only accepts integers and doesn’t accept ordinals. It also does not support vernacular forms of numbers, such as “fifteen hundred”, “three-sixty-five”, “a hundred” or “one point two million” (these are other nuances in English numerals can be found here).

#!/usr/bin/perl use strict; use Parse::RecDescent; my $sentence = $ARGV[0] || die "Must pass an argument"; # Define the grammar $::RD_AUTOACTION = q { [@item] }; my $parser = Parse::RecDescent->new(q( startrule : (number_1to999 number_1e6)(?) (number_1to999 number_1e3)(?) number_1to999(?) | number_0 number_0: "zero" {0} number_1to9: "one" {1} | "two" {2} | "three" {3} | "four" {4} | "five" {5} | "six" {6} | "seven" {7} | "eight" {8} | "nine" {9} number_10to19: "ten" {10} | "eleven" {11} | "twelve" {12} | "thirteen" {13} | "fourteen" {14} | "fifteen" {15} | "sixteen" {16} | "seventeen" {17} | "eighteen" {18} | "nineteen" {19} number_1to999: (number_1to9(?) number_100)(?) (number_1to9 | number_10to19 | number_10s number_1to9(?))(?) number_10s: "twenty" {20} | "thirty" {30} | "fourty" {40} | "fifty" {50} | "sixty" {60} | "seventy" {70} | "eighty" {80} | "ninety" {90} number_100: "hundred" {100} number_1e3: "thousand" {1e3} number_1e6: "million" {1e6} )); # Perform lexical analysis $sentence =~ s/(\W)and(\W)/$1$2/gi; #remove the word "and" my $parseTree = $parser->startrule(lc $sentence); # Perform syntactic analysis my @numbers = flattenParseTree($parseTree); # flatten the tree to a sequence of numbers my $number = combineNumberSequence(\@numbers); # process the sequence of numbers to find the total print $number; sub flattenParseTree($) { my $parseTree = shift || return; my @tokens = (); if(UNIVERSAL::isa( $parseTree, "ARRAY")) { push(@tokens, flattenParseTree($_)) foreach(@{$parseTree}); } elsif($parseTree > 0) { return $parseTree; } return @tokens; } sub combineNumberSequence($) { my $numbers = shift || return; my $prior = 0; my $total = 0; for(my $i=0; $i <= $#$numbers; $i++) { if($prior == 0) { $prior = $numbers->[$i]; } elsif($prior > $numbers->[$i]) { $prior += $numbers->[$i]; } else { $prior *= $numbers->[$i]; } if(($numbers->[$i] >= 1e3) || ($i == $#$numbers)) { $total += $prior; $prior = 0; } } return $total; }

Excellent web site. Plenty of helpful information here.

I’m sending it to a few pals ans additionally sharing in delicious.

And of course, thank you to your sweat!