Quality of early 1970s approximate string matching algorithms
Approximate string matching algorithms (ASMAs) determine whether two strings are the same, “close enough”, or distinct. Spelling error detection and correction were early applications of these algorithms, and today ASMAs (or similar techniques) are used in natural language understanding applications. Hundreds of papers have been written on the subject (1980 survey, 2001 survey), but this post focuses on the subset of algorithms that were considered commercially viable for the field of computer-assisted instruction (CAI), an early hardware/software-as-a-service business.
Background
Between 1985 and 1986, John C. Nesbit wrote a pair of articles[^1][^2] for the Journal of Computer-Based Instruction examining the accuracy of ASMAs. His interest was in using the algorithms for student response matching in computer-assisted instruction. A student response is a typed answer to a question posed by an educational program. For example, for the question “who was the second president of the United States?”, the developer might wish to accept variants such as:
- John Adams
- John A Adams
- John A. Adams
- President Adams
- Pres. Adams
- Adams
If the student responded with “Jon Adams” (misspelling John), the developer may wish for that answer to be accepted, or to coax the student to fix their answer rather than rejecting it outright. Similarly, if the student answered “Thomas Jefferson” (or variants thereof), the developer may want to reject the answer with a suggestive message such as “Jefferson was the third president”. As you are likely intuiting, handling the sheer range of possible variants would be very tedious for the developer unless the CAI system provided labor-saving techniques.
Some CAI systems would go beyond simple spell correction and provide response normalization features. For instance, PLATO would treat “three” and “3” as equivalent, as well as “1/2”, “0.5”, “2/4”, and “50/100”. There were also facilities to treat mathematical expressions such as “100 + (a * b)/(c + d)” as equivalent to any algebraically equivalent form. Given the limited CPU and memory available (and the fact that response matching was a very common operation), these features relied on a number of tricks and heuristics rather than implementing a computer algebra system.
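To make the idea concrete, here is a minimal sketch of numeric response normalization, written in the spirit of the PLATO features described above. It is not PLATO's actual implementation; the function name `normalize_number` and the word list are purely illustrative.

```python
from fractions import Fraction

# Hypothetical sketch of numeric response normalization (not PLATO's implementation).
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def normalize_number(response):
    """Return a canonical Fraction for a numeric response, or None if unrecognized."""
    text = response.strip().lower()
    if text in NUMBER_WORDS:
        return Fraction(NUMBER_WORDS[text])
    try:
        return Fraction(text)  # handles "3", "0.5", "2/4", "50/100"
    except ValueError:
        return None

assert normalize_number("three") == normalize_number("3")
assert normalize_number("1/2") == normalize_number("0.5") == normalize_number("50/100")
```

Reducing each response to a canonical form means the developer only has to author one accepted answer instead of enumerating every equivalent spelling of a number.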
Nesbit selected a group of ten algorithms that were either being used for CAI applications or had performed well in previous studies. He implemented and evaluated all of the algorithms on a DEC VAX-11/780 except for PLATO, which ran on its native CDC CYBER 170/720.
There were two operative theories for how words became misspelled. The first theory held that misspelled words were often still correct phonetically. The second held that misspellings come from random errors that drop, insert, substitute, and sometimes transpose characters. Blair, PLANIT, Soundex, and Symonds were based on the phonetic theory. Alberga, EDIST1, EDIST2, and Damerau were based on edit distance. Damerau-Symonds combined the two theories. PLATO projected words into a bitstring via a variety of feature extraction rules.
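To make the phonetic theory concrete, below is a minimal sketch of American Soundex (the 1930 census variant). This is the textbook formulation, not necessarily the exact variant Nesbit implemented.

```python
def soundex(word):
    """American Soundex: first letter plus three digits (e.g. "Robert" -> "R163")."""
    digits = {}
    for letters, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            digits[ch] = d
    word = word.upper()
    code = word[0]
    prev = digits.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue                # H and W do not separate repeated digits
        d = digits.get(ch, "")      # vowels and Y get no digit but reset `prev`
        if d and d != prev:
            code += d
        prev = d
    return (code + "000")[:4]

# Phonetically similar spellings share a code, so "Jon Adams" would match "John Adams".
assert soundex("Jon") == soundex("John") == "J500"
assert soundex("Robert") == soundex("Rupert") == "R163"
```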
Results
Table 1 is based on the four figures from (Nesbit 1986). We extracted the Type I and Type II errors from the figures via visual examination and, based on data about the four datasets (Table 2), computed an F1 score. Since the figures showed a range of values for Alberga, EDIST1, and EDIST2, we took the value closest to the origin. Table 3 summarizes the results for each algorithm using the harmonic mean of the F1 scores from the four datasets.
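For reference, the F1 score is the harmonic mean of precision and recall, computed from the true positive (TP), false positive (FP), and false negative (FN) counts implied by the error rates and the dataset sizes in Table 2:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
\]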
Table 1: Error rates from (Nesbit 1986) plus F1 score
Dataset | Algorithm | Type I Error: FP % | Type II Error: FN % | F1 |
---|---|---|---|---|
Blair | Alberga | 6 | 6 | 0.94 |
Blair | Blair | 24 | 4 | 0.87 |
Blair | Damerau | 27 | 0 | 0.86 |
Blair | Damerau-Symonds | 6 | 4 | 0.95 |
Blair | EDIST1 | 2 | 3 | 0.98 |
Blair | EDIST2 | 2 | 2 | 0.98 |
Blair | PLANIT | 8 | 19 | 0.87 |
Blair | PLATO | 6 | 2 | 0.96 |
Blair | Soundex | 6 | 21 | 0.86 |
Blair | Symonds | 17 | 2 | 0.91 |
Damerau | Alberga | 9 | 4 | 0.88 |
Damerau | Blair | 29 | 3 | 0.66 |
Damerau | Damerau | 18 | 0 | 0.79 |
Damerau | Damerau-Symonds | 9 | 4 | 0.88 |
Damerau | EDIST1 | 2 | 2 | 0.97 |
Damerau | EDIST2 | 2 | 3 | 0.96 |
Damerau | PLANIT | 30 | 21 | 0.60 |
Damerau | PLATO | 12 | 2 | 0.85 |
Damerau | Soundex | 21 | 14 | 0.71 |
Damerau | Symonds | 46 | 2 | 0.49 |
Masters | Alberga | 11 | 8 | 0.93 |
Masters | Blair | 34 | 4 | 0.86 |
Masters | Damerau | 28 | 0 | 0.90 |
Masters | Damerau-Symonds | 14 | 5 | 0.93 |
Masters | EDIST1 | 7 | 4 | 0.96 |
Masters | EDIST2 | 5 | 4 | 0.97 |
Masters | PLANIT | 19 | 18 | 0.85 |
Masters | PLATO | 13 | 4 | 0.94 |
Masters | Soundex | 12 | 21 | 0.85 |
Masters | Symonds | 22 | 3 | 0.91 |
Nesbit | Alberga | 29 | 21 | 0.81 |
Nesbit | Blair | 59 | 2 | 0.74 |
Nesbit | Damerau | 47 | 2 | 0.82 |
Nesbit | Damerau-Symonds | 20 | 17 | 0.86 |
Nesbit | EDIST1 | 21 | 17 | 0.86 |
Nesbit | EDIST2 | 14 | 12 | 0.90 |
Nesbit | PLANIT | 17 | 37 | 0.75 |
Nesbit | PLATO | 27 | 8 | 0.88 |
Nesbit | Soundex | 17 | 20 | 0.85 |
Nesbit | Symonds | 27 | 7 | 0.89 |
Table 2: Datasets used in (Nesbit 1986)
Dataset | Words | Misspellings | Total | Description |
---|---|---|---|---|
Blair | 117 | 117 | 234 | Common misspellings |
Damerau | 41 | 44 | 85 | Newspaper errors |
Masters | 179 | 320 | 499 | Grades 8, 12, 16 |
Nesbit | 213 | 524 | 737 | Grades 2-6 |
Table 3: Harmonic Mean of F1 Scores
Algorithm | Harmonic Mean | Publication Year |
---|---|---|
Alberga | 0.89 | 1967 |
Blair | 0.77 | 1960 |
Damerau | 0.84 | 1964 |
Damerau-Symonds | 0.90 | 1970 |
EDIST1 | 0.94 | 1974 |
EDIST2 | 0.95 | 1983 |
PLANIT | 0.75 | 1966 |
PLATO | 0.91 | 1972 / 1985[^3] |
Soundex | 0.81 | 1930[^4] |
Symonds | 0.75 | 1970 |
In descending order, the best performing algorithms by quality were EDIST2, EDIST1, PLATO, Damerau-Symonds, Alberga, Damerau, Soundex, Blair, PLANIT, and Symonds (the last two essentially tied).
Conclusion
By the early 1970s, edit-distance-based algorithms were showing their qualitative superiority to phonetically based algorithms. Edit-distance-based algorithms were also more amenable to multiple languages, while the phonetic algorithms evaluated were all tuned for English text (although all the datasets used English examples). However, edit distance algorithms were more costly to compute than the others in the sample (except Damerau, which restricted the error count to one). EDIST1 and EDIST2 both required computing the edit distance, which cost \(O(mn)\) time and \(O(mn)\) space, where \(m\) and \(n\) are the lengths of the two strings, although a solution with \(O(m)\) space was found in 1980 (Navarro 2001). Solutions with \(O(kn)\) time were found in the 1980s and 1990s, where \(k\) is the error threshold. Today, edit distance is the standard technique for handling errors in information retrieval, conversational analytics, and bioinformatics.
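As a concrete illustration of the cost discussed above, here is a minimal sketch of the classic Levenshtein dynamic program, keeping only two rows of the table so that space drops to \(O(m)\). This is the textbook formulation, not Nesbit's EDIST1 or EDIST2 code.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic O(mn)-time dynamic program.

    Only two rows of the DP table are kept at a time, giving O(m) space.
    """
    prev = list(range(len(a) + 1))            # distances from prefixes of a to ""
    for j, cb in enumerate(b, start=1):
        curr = [j]                            # distance from "" to b[:j]
        for i, ca in enumerate(a, start=1):
            curr.append(min(
                curr[i - 1] + 1,              # delete a[i-1]
                prev[i] + 1,                  # insert b[j-1]
                prev[i - 1] + (ca != cb),     # substitute (free when equal)
            ))
        prev = curr
    return prev[len(a)]

assert edit_distance("kitten", "sitting") == 3
assert edit_distance("Jon", "John") == 1
```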
The PLATO algorithm stands apart in the group due to its excellent performance, both in terms of quality and computational speed. Its design was inspired by a numerical taxonomy approach and uses techniques from cluster analysis. Due to the novelty and effectiveness of the approach, we will discuss details of the implementation in a future production.
[^1]: Nesbit, John C. “Approximate string matching in response analysis.” Journal of Computer-Based Instruction 12.3 (1985): 71-75. archive.org

[^2]: Nesbit, John C. “The accuracy of approximate string matching algorithms.” Journal of Computer-Based Instruction 13.3 (1986): 80-83. archive.org

[^3]: The PLATO matching algorithm evolved over time. 1972 was the date of the first publication describing its design in the TUTOR language, although some version of the algorithm may date back as far as 1965. The algorithm that was tested by Nesbit was based on the 1985/1986 version.

[^4]: Soundex was first patented in 1918. The U.S. Census uses a version called “American Soundex” which was defined in 1930.