Quality of early 1970s approximate string matching algorithms

Approximate string matching algorithms (ASMAs) determine whether two strings are identical, “close enough”, or distinct. Spelling error detection and correction were early applications of these algorithms, and today ASMAs (or similar techniques) are used in natural language understanding applications. Hundreds of papers have been written on the subject (1980 survey, 2001 survey), but this post focuses on the subset of algorithms that were considered commercially viable for the field of computer-assisted instruction (CAI), an early hardware/software-as-a-service business.

Background

In 1985 and 1986, John C. Nesbit wrote a pair of articles [1] [2] for the Journal of Computer-Based Instruction examining the accuracy of ASMAs. His interest was in using the algorithms for student response matching in computer-assisted instruction. A student response is a typed answer to a question posed by an educational program. For example, a program asking “who was the second president of the United States?” might need to accept variants such as:

  1. John Adams
  2. John A Adams
  3. John A. Adams
  4. President Adams
  5. Pres. Adams
  6. Adams

If the student responded with “Jon Adams” (misspelling John), the developer may wish that answer to be accepted, or to coax the student to fix their answer rather than rejecting it outright. Similarly, if the student answered “Thomas Jefferson” (or a variant thereof), the developer may want to reject the answer with a suggestive message such as “Jefferson was the third president”. As you can probably intuit, enumerating the sheer range of possible variants would be very tedious for the developer unless the CAI system provided labor-saving techniques.

Some CAI systems would go beyond simple spell-correction and provide response normalization features. For instance, PLATO would treat “three” and “3” as equivalent, as well as “1/2”, “0.5”, “2/4”, and “50/100”. There were also facilities to treat mathematical expressions such as “100 + (a * b)/(c + d)” as equivalent to all the other possible forms. Given the limited CPU and memory available (and the fact that response matching was a very common operation), these features used a number of tricks and heuristics, rather than implementing a computer algebra system.
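
PLATO's actual judging heuristics are not reproduced here; the sketch below (in Python, with an illustrative normalize_numeric helper and a small, hypothetical number-word table) only shows the general idea of reducing numeric responses to a canonical value before comparison.

```python
from fractions import Fraction
from typing import Optional

# Illustrative sketch only: reduce a typed numeric response to a canonical
# value so that "three", "3", "1/2", "0.5", and "50/100" all compare equal.
# PLATO's real judging code used its own tricks and heuristics.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def normalize_numeric(response: str) -> Optional[Fraction]:
    """Return a canonical Fraction for a numeric response, or None."""
    text = response.strip().lower()
    if text in NUMBER_WORDS:
        return Fraction(NUMBER_WORDS[text])
    try:
        return Fraction(text)  # accepts "3", "2/4", and "0.5"
    except (ValueError, ZeroDivisionError):
        return None

assert normalize_numeric("three") == normalize_numeric("3")
assert normalize_numeric("1/2") == normalize_numeric("0.5") == normalize_numeric("50/100")
```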

Nesbit selected a group of ten algorithms that were either being used for CAI applications or had performed well in previous studies. He implemented and evaluated all of the algorithms on a DEC VAX-11/780 except for PLATO, which ran on its native CDC CYBER 170/720.

There were two operative theories for how words became misspelled. The first was that words were often misspelled in a way that remained phonetically correct. The second was that misspellings arose from random errors that drop, insert, substitute, and occasionally transpose characters. Blair, PLANIT, Soundex, and Symonds were based on the phonetic theory. Alberga, EDIST1, EDIST2, and Damerau were based on edit distance. Damerau-Symonds combined the two theories. PLATO projected words into a bitstring via a variety of feature extraction rules.
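
To make the phonetic theory concrete, here is a compact sketch of American Soundex, the oldest and simplest of the phonetic codes in the group (see footnote 4). It follows the commonly published rules; the exact variant Nesbit implemented may differ in details such as how h and w are handled.

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate identical codes
            prev = code
    return (result + "000")[:4]

# Phonetically similar spellings collapse to the same four-character code.
assert soundex("Adams") == soundex("Addams") == "A352"
```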

Results

Table 1 is based on the four figures from (Nesbit 1986). We extracted the Type I and Type II errors from the figures via visual examination and, using data about the four datasets (Table 2), computed an F1 score. For Alberga, EDIST1, and EDIST2, whose figures showed a range of values, we took the value closest to the origin. Table 3 summarizes the results for each algorithm using the harmonic mean of its F1 scores across the four datasets.
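
For reference, the F1 score and the summary statistic in Table 3 use the standard definitions:

\[
F_1 = \frac{2PR}{P + R},
\qquad
\text{harmonic mean} = \frac{4}{\sum_{d} 1/F_{1,d}}
\]

where \(P\) and \(R\) are precision and recall derived by applying the extracted error percentages to the dataset sizes in Table 2, and the sum runs over the four datasets.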

Table 1: Qualitative measurements from (Nesbit 1986) plus F1 Score

| Dataset | Algorithm | Type I error (FP %) | Type II error (FN %) | F1 |
|---|---|---|---|---|
| Blair | Alberga | 6 | 6 | 0.94 |
| Blair | Blair | 24 | 4 | 0.87 |
| Blair | Damerau | 27 | 0 | 0.86 |
| Blair | Damerau-Symonds | 6 | 4 | 0.95 |
| Blair | EDIST1 | 2 | 3 | 0.98 |
| Blair | EDIST2 | 2 | 2 | 0.98 |
| Blair | PLANIT | 8 | 19 | 0.87 |
| Blair | PLATO | 6 | 2 | 0.96 |
| Blair | Soundex | 6 | 21 | 0.86 |
| Blair | Symonds | 17 | 2 | 0.91 |
| Damerau | Alberga | 9 | 4 | 0.88 |
| Damerau | Blair | 29 | 3 | 0.66 |
| Damerau | Damerau | 18 | 0 | 0.79 |
| Damerau | Damerau-Symonds | 9 | 4 | 0.88 |
| Damerau | EDIST1 | 2 | 2 | 0.97 |
| Damerau | EDIST2 | 2 | 3 | 0.96 |
| Damerau | PLANIT | 30 | 21 | 0.60 |
| Damerau | PLATO | 12 | 2 | 0.85 |
| Damerau | Soundex | 21 | 14 | 0.71 |
| Damerau | Symonds | 46 | 2 | 0.49 |
| Masters | Alberga | 11 | 8 | 0.93 |
| Masters | Blair | 34 | 4 | 0.86 |
| Masters | Damerau | 28 | 0 | 0.90 |
| Masters | Damerau-Symonds | 14 | 5 | 0.93 |
| Masters | EDIST1 | 7 | 4 | 0.96 |
| Masters | EDIST2 | 5 | 4 | 0.97 |
| Masters | PLANIT | 19 | 18 | 0.85 |
| Masters | PLATO | 13 | 4 | 0.94 |
| Masters | Soundex | 12 | 21 | 0.85 |
| Masters | Symonds | 22 | 3 | 0.91 |
| Nesbit | Alberga | 29 | 21 | 0.81 |
| Nesbit | Blair | 59 | 2 | 0.74 |
| Nesbit | Damerau | 47 | 2 | 0.82 |
| Nesbit | Damerau-Symonds | 20 | 17 | 0.86 |
| Nesbit | EDIST1 | 21 | 17 | 0.86 |
| Nesbit | EDIST2 | 14 | 12 | 0.90 |
| Nesbit | PLANIT | 17 | 37 | 0.75 |
| Nesbit | PLATO | 27 | 8 | 0.88 |
| Nesbit | Soundex | 17 | 20 | 0.85 |
| Nesbit | Symonds | 27 | 7 | 0.89 |

Table 2: Datasets used in (Nesbit 1986)

| Dataset | Words | Misspellings | Total | Description |
|---|---|---|---|---|
| Blair | 117 | 117 | 234 | Common misspellings |
| Damerau | 41 | 44 | 85 | Newspaper errors |
| Masters | 179 | 320 | 499 | Grades 8, 12, 16 |
| Nesbit | 213 | 524 | 737 | Grades 2-6 |

Table 3: Harmonic Mean of F1 Scores

| Algorithm | Harmonic mean of F1 | Publication year |
|---|---|---|
| Alberga | 0.89 | 1967 |
| Blair | 0.77 | 1960 |
| Damerau | 0.84 | 1964 |
| Damerau-Symonds | 0.90 | 1970 |
| EDIST1 | 0.94 | 1974 |
| EDIST2 | 0.95 | 1983 |
| PLANIT | 0.75 | 1966 |
| PLATO | 0.91 | 1972 / 1985 [3] |
| Soundex | 0.81 | 1930 [4] |
| Symonds | 0.75 | 1970 |

In descending order of harmonic-mean F1, the algorithms ranked EDIST2, EDIST1, PLATO, Damerau-Symonds, Alberga, Damerau, Soundex, Blair, PLANIT, and Symonds, with the last two effectively tied.

Conclusion

By the early 1970s, edit-distance-based algorithms were showing their qualitative superiority over phonetically based algorithms. Edit-distance algorithms were also more amenable to multiple languages, while the phonetic algorithms evaluated were all tuned for English text (although all of the datasets used English examples). The trade-off was cost: edit-distance algorithms were more expensive to compute than the others in the sample, except for Damerau's, which kept the cost down by restricting the error count to one. EDIST1 and EDIST2 both required computing the full edit distance, which costs \(O(mn)\) time and \(O(mn)\) space, where \(m\) and \(n\) are the lengths of the two strings, although in 1980 a solution with \(O(m)\) space was found (Navarro 2001). Solutions with \(O(kn)\) time, where \(k\) is the error threshold, followed in the 1980s and 1990s. Today, edit distance is the standard technique for handling errors in information retrieval, conversational analytics, and bioinformatics.
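
As a concrete illustration of that \(O(mn)\) recurrence and the easy space reduction, here is a minimal sketch of the textbook Wagner-Fischer dynamic program (not a reconstruction of EDIST1 or EDIST2):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program.

    The full table is O(mn) time and space; keeping only two rows reduces
    space to O(min(m, n)) while the time bound stays O(mn).
    """
    if len(a) < len(b):
        a, b = b, a                    # iterate over the shorter string
    prev = list(range(len(b) + 1))     # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert edit_distance("Jon Adams", "John Adams") == 1  # one dropped letter
```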

The PLATO algorithm stands apart from the group due to its excellent performance, both in quality and in computational speed. Its design was inspired by numerical taxonomy and uses techniques from cluster analysis. Given the novelty and effectiveness of the approach, we will discuss the details of its implementation in a future post.


  1. Nesbit, John C. “Approximate string matching in response analysis.” Journal of Computer-Based Instruction 12.3 (1985): 71-75. archive.org ↩︎

  2. Nesbit, John C. “The accuracy of approximate string matching algorithms.” Journal of Computer-Based Instruction 13.3 (1986): 80-83. archive.org ↩︎

  3. The PLATO matching algorithm evolved over time. 1972 was the date of the first publication describing its design in the TUTOR language, although some version of the algorithm may date back as far as 1965. The algorithm that was tested by Nesbit was based on the 1985/1986 version. ↩︎

  4. Soundex was first patented in 1918. The U.S. Census uses a version called “American Soundex” which was defined in 1930. ↩︎