Note: The problem described here has already been solved with libraries such as Lingua::EN::FindNumber and Lingua::EN::Words2Nums. For production software, I’d recommend you look at using those modules instead of re-inventing the wheel. This article is only intended for those interested in learning how this type of parsing works.

In a project I was recently working on, there was a need to perform named entity recognition on natural language text fields in order to convert natural language numbers in the field, such as “two hundred and thirteen”, into their numerical values (e.g. 213). Google does a great job of this; for example, try out this search to convert a string to a number, and the reverse. In this post, I’ll discuss how this conversion functionality can be achieved with the nifty Perl recursive descent parser generator Parser::RecDescent.

The parsing is a two-step process. First, each of the number words need to be matched and their values looked up in a dictionary. For example, the word “two” needs to be matched and recognised as “2”, “hundred” as “100” and “thirteen” as “13”. In parsing parlance, this step is known as lexical analysis. Second, we need to calculate the total number that the entire sentence represents by accumulating all of the matched values. This is known as syntactic analysis.

Consider the following input string:

“two hundred and seven million thirteen thousand two hundred and ninety eight”

The first step is to tokenise the string and remove the word “and” anywhere it appears. (Breaking a number up with the word “and” is a British English convention; American English usually omits the instances of “and”.) We then end up with the following list of token values:

“two”, “hundred”, “seven”, “million”, “thirteen”, “thousand”, “two”, “hundred”, “ninety”, “eight”

If we convert each token value into its numeric equivalent, this becomes:

2, 100, 7, 1000000, 13, 1000, 2, 100, 90, 8

Finally, in order to find the total, we calculate:

((2 × 100 + 7) × 1000000) + (13 × 1000) + (2 × 100 + 90 + 8) = 207,013,298

This matching and conversion is achieved with a parser generator. A parser generator takes a formal grammar as its input and outputs a parse tree, which is an abstract representation of the input text. The grammar refers to the rules for expressing numbers in English.

### Syntactic analysis

For the syntactic analysis, I based my approach on an excellent discussion on this topic on Stackoverflow. One of the posters suggested a very simple algorithm to perform the calculation. I found the provided pseudocode a little confusing, so here is my own version:

```total = 0, prior = 0

for each word in sentence

value = dictionary[word]

if prior = 0
prior = value
else if prior > value
prior = prior + value
else
prior = prior * value

if value >= 1000 or last word in sentence
total = total + prior
prior = 0```

This algorithm works by retaining two variables, `prior` and `total`. The `prior` variable stores the current value of the current order of magnitude; how many billions, millions or thousands. This is then added back to the `total` when we step down an order of magnitude. The table below shows the algorithm in action for the input string of “two hundred and seven million thirteen thousand two hundred and ninety eight”.

Word Value Prior Total
0 0
two 2 2 0
hundred 100 200 0
seven 7 207 0
million 1,000,000 0 207,000,000
thirteen 13 13 207,000,000
thousand 1,000 0 207,013,000
two 2 2 207,013,000
hundred 100 200 207,013,000
ninety 90 290 207,013,000
eight 8 298 207,013,000
0 207,013,298

### Lexical analysis

Lexical analysis involves defining a grammar for the formation of English words and matching an input string to this grammar. A simplistic approach is to define a dictionary of all possible number words, such as “two”, “hundred” and “million”, and then match any string if it contains only these words. If we use this approach, the algorithm for syntactic analysis described above will still work, however the lexical analysis stage will match invalid sentences that don’t mean anything in English, such as “hundred thirteen seven”, and feed these into the syntactic analyser with unpredictable results.

A more intelligent, but more complicated approach is to ensure that the English number words appear in a valid sequence. This can be defined by the grammar. An example of such a grammar can be found in the appendix of a paper by Stuart Shieber. Inspired by this approach, I wrote the following grammar, which matches numbers smaller than one billion (109):

```<number> ::= ((number_1to999 number_1e6)? (number_1to999 number_1e3)? number_1to999?) | number_0

<number_0> ::= "zero"
<number_1to9> ::= "one" | "two" | "three" | "four" | "five" | "six" | "seven" | "eight" | "nine"
<number_10to19> ::= "ten" | "eleven" | "twelve" | "thirteen" | "fourteen" | "fifteen"
| "sixteen" | "seventeen" | "eighteen" | "nineteen"
<number_1to999> ::= (number_1to9? number_100)? (number_1to9 | number_10to19 | (number_tens number_1to9))?
<number_tens> ::= "twenty" | "thirty" | "fourty" | "fifty" | "sixty" | "seventy" | "eighty" | "ninety"
<number_100> ::= "hundred"
<number_1e3> ::= "thousand"
<number_1e6> ::= "million"```

To visualise what it is doing, the syntax diagram below shows the main parts of the grammar (generated using Franz Braun’s CGI diagram generator): ### Wrapping it all up

The Perl implementation is shown below. There are a few caveats to this implementation (many of them are identified in the Stackoverflow discussion). Because it simply discards the word “and” anywhere in the sentence, it doesn’t distinguish between separate numbers; for example, “twenty and five” will be treated as “twenty five”. The implementation only recognises numbers up to the millions; if it were extended to billions and above, it would need some method of dealing with short and long scales. Furthermore, it only accepts integers and doesn’t accept ordinals. It also does not support vernacular forms of numbers, such as “fifteen hundred”, “three-sixty-five”, “a hundred” or “one point two million” (these are other nuances in English numerals can be found here).

```#!/usr/bin/perl

use strict;
use Parse::RecDescent;

my \$sentence = \$ARGV || die "Must pass an argument";

# Define the grammar
\$::RD_AUTOACTION = q { [@item] };
my \$parser = Parse::RecDescent->new(q(

startrule : (number_1to999 number_1e6)(?) (number_1to999 number_1e3)(?) number_1to999(?) | number_0
number_0: "zero" {0}
number_1to9: "one" {1} | "two" {2} | "three" {3} | "four" {4} | "five" {5} | "six" {6} | "seven" {7}
| "eight" {8} | "nine" {9}
number_10to19: "ten" {10} | "eleven" {11} | "twelve" {12} | "thirteen" {13} | "fourteen" {14}
| "fifteen" {15} | "sixteen" {16} | "seventeen" {17} | "eighteen" {18} | "nineteen" {19}
number_1to999: (number_1to9(?) number_100)(?)
(number_1to9 | number_10to19 | number_10s number_1to9(?))(?)
number_10s: "twenty" {20} | "thirty" {30} | "fourty" {40} | "fifty" {50} | "sixty" {60} |
"seventy" {70} | "eighty" {80} | "ninety" {90}
number_100: "hundred" {100}
number_1e3: "thousand" {1e3}
number_1e6: "million" {1e6}

));

# Perform lexical analysis
\$sentence =~ s/(\W)and(\W)/\$1\$2/gi; #remove the word "and"
my \$parseTree = \$parser->startrule(lc \$sentence);

# Perform syntactic analysis
my @numbers = flattenParseTree(\$parseTree); # flatten the tree to a sequence of numbers
my \$number = combineNumberSequence(\@numbers); # process the sequence of numbers to find the total

print \$number;

sub flattenParseTree(\$) {

my \$parseTree = shift || return;
my @tokens = ();
if(UNIVERSAL::isa( \$parseTree, "ARRAY")) {
push(@tokens, flattenParseTree(\$_)) foreach(@{\$parseTree});
} elsif(\$parseTree > 0) {
return \$parseTree;
}
return @tokens;

}

sub combineNumberSequence(\$) {

my \$numbers = shift || return;
my \$prior = 0;
my \$total = 0;

for(my \$i=0; \$i <= \$#\$numbers; \$i++) {
if(\$prior == 0) {
\$prior = \$numbers->[\$i];
} elsif(\$prior > \$numbers->[\$i]) {
\$prior += \$numbers->[\$i];
} else {
\$prior *= \$numbers->[\$i];
}

if((\$numbers->[\$i] >= 1e3) || (\$i == \$#\$numbers)) {
\$total += \$prior;
\$prior = 0;
}

}

return \$total;

}```
1. Excellent web site. Plenty of helpful information here.
I’m sending it to a few pals ans additionally sharing in delicious.
And of course, thank you to your sweat!

2. bug in parser
must be something like
number_1to999: (number_1to9(?) number_100)(?) (number_10to19 | number_10s number_1to9(?) | number_1to9)(?)

3. http://palepurpur.ru/onlayn-zaym-na-kartu-s-prosrochkami-ukraina.html – РћРЅР»Р°Р№РЅ Р·Р°Р№Рј РЅР° РєР°СЂС‚Сѓ СЃ РїСЂРѕСЃСЂРѕС‡РєР°РјРё СѓРєСЂР°РёРЅР°

4. В нашей базе – только те новостройки Столицы, Новой Москвы и Московской области,
в которых можно купить недвижимость от застройщика или уполномоченного им продавца.
Мы сосредоточились на тех объектах недвижимости, которые интересны дольщикам здесь и сейчас.
Причем речь не всегда будет идти о квартирах в привычном понимании этого термина – мы не обходим вниманием и лофты.
Мы с пристрастием изучаем каждый новый ЖК, прежде чем представить его на Ваш суд.

Мы не специалисты в сфере строительства, но мы в состоянии осветить его ход:
помимо традиционных отзывов покупателей и анализа цены
мы организуем для Вас контрольные поездки на площадки строительства.
Мы готовы адресовать девелоперу любые Ваши претензии –
поверьте, те, кто заинтересован в реализации Вам своей квартиры, от ответов убегать не станут.
Смысл нашей деятельности заключается в том, чтобы Вы имели возможность найти у нас ту новостройку,
в которой находится именно Ваша новая квартира.