Prototyping & Fuzzy String Searching

April 8, 2018

Introduction

I work at a company where Confluence is the go to place for documentation, and information sharing. However, a few points of contension exist with Confluence. One is the ability to search and locate the needed information. Second is the time it takes can be non trivial, especially if the data is not clearly tagged, organized, or buried in a very long page of documentation. This leads to using Confluene as a last resort, and typically folks will resort to keeping an external list of bookmarks in their browser or some other method.

A collegue came to me at one point, curious about building a UI that would behave similar to Chrome’s omnibar, but only contain links to resources of interest. This collegue expected things to match some metadata and accept malformed input, to quote him: “I should be able to mash some keys and see a close match.” This is not unlike the functionality one sees in Chrome’s omnibar, or a code editors quick search functionality.

On a day to day I build web UIs, so I quickly envisioned there being a single input where one could “…mash some keys and see a close match.” But before setting out to build our first prototype for this idea we had to ask how we could solve this problem of partial text search. Our options were to try and solve the problem outselves, but were quick to recognize that, that is not the problem we were after solving…and it’s a pretty well solved problem in computer science. At least well enough for our use-case.

Doing a quick search on the internet you can find various algorithms and implementations that can be used. The two algorithms that we would most likely to incorporate are Levenstein Distance and Dice’s Coefficient.

The entries on wikibooks.org contains an implementation of each of these algorithms and there are a number of libraries that implement and properly test the implementation for a number of languages.

Shallow Dive into the Algorithms

Before we dive into building a prototype we need to make a choice about which algorithm we should be using for our use cases. We’d have to consider which algorithm would give us what we wanted with least amount of additional work.

This is the test data that we’ll be testing our algorithms with:

stargate: sg-1
stargate: universe
star trek: discovery
star wars: a new hope

Levenshtein Distance

Let us have a look at the Levenshtein Distance algorithm first. The Levenshtein Distance computes the number of changes needed to transform one string into another. For instance we can transform the string abc into abd by substituting c with d. Giving us the measurement of 1. Another example is bonjour vs hello, which has an edit distance of 6. We are able to keep a one shared letter, o, and we need to make 6 other modifications(additions, removals, or substitutions) to go from one word to the other and vice-versa.

Given this description, you’d be right to assume that we’d have to do some more leg work before we could provide a search term and a list of potential matches to retrieve a sorted list that behave like an fuzzy search.

Let us consider a few examples sorted by their Levenshtein distance:

function levenshteinDistance(searchTerm, candidateString) {
/*
Assume some off the shelf implementation from a library.
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance
Input strings will always be lower-cased to maximize a potential match.
*/
}

// Ordered matches for "stargate" by edit distance
levenshteinDistance("stargate", "stargate: sg-1"); // 6
levenshteinDistance("stargate", "stargate: universe"); // 10
levenshteinDistance("stargate", "star trek: discovery"); // 15
levenshteinDistance("stargate", "star wars: a new hope"); // 15

// Ordered matches for "star trek" by edit distance
levenshteinDistance("star trek", "stargate: sg-1"); // 9
levenshteinDistance("star trek", "star trek: discovery"); // 11
levenshteinDistance("star trek", "stargate: universe"); // 12
levenshteinDistance("star trek", "star wars: a new hope"); // 14

// Ordered matches for "star wars" by edit distance
levenshteinDistance("star wars", "stargate: sg-1"); // 9
levenshteinDistance("star wars", "stargate: universe"); // 11
levenshteinDistance("star wars", "star wars: a new hope"); // 12
levenshteinDistance("star wars", "star trek: discovery"); // 14

The thing you’ll notice about the edit distance from these examples is that the edit score can not be used alone to determine similarity to a longer string, when there are multiple potential matches; each of varying length. The Levenshtein algorithm inherently has an minimum edit distance equal to the difference in length of the strings(1); consequently resulting in shorter candidate strings having a lower Levenshtein distance than longer strings which may contain the entire search term.

In order to circumvent this issue, comparison should only be done on words, rather than the entire input sequences at once. Below is an implementation that splits both the search term and candidate string into words, which are compared against eachother. Only the best scores are taken, ensuring that we have a score that reflects the best matches regardless of position in the string.

Note that in this naïve implementation we make the assumption that the searchTerm will always be shorter than the candidate string and that the word order of the search term does not matter. eg: ‘star wars’ and ‘wars star’ would return the same score against the same candidate string.

function getMinLevenshteinDistance(searchTerm, candidateString) {
let searchTerms = searchTerm.split(' ');
let candidateStrings = candidateString.split(' ');

return searchTerms
.map(term =>
candidateStrings
.map(candidate => levenshteinDistance(term, candidate))
.sort() // Sort by ascending order.
.shift() // Get edit distance for best match.
)
.reduce((totalScore, score) => totalScore + score, 0)
}

// Ordered matches for "stargate" by edit distance
getMinLevenshteinDistance("stargate", "stargate: sg-1"); // 1
getMinLevenshteinDistance("stargate", "stargate: universe"); // 1
getMinLevenshteinDistance("stargate", "star trek: discovery"); // 4
getMinLevenshteinDistance("stargate", "star wars: a new hope"); // 4

// Ordered matches for "star trek" by edit distance
getMinLevenshteinDistance("star trek", "star trek: discovery"); // 1
getMinLevenshteinDistance("star trek", "star wars: a new hope"); // 3
getMinLevenshteinDistance("star trek", "stargate: sg-1"); // 7
getMinLevenshteinDistance("star trek", "stargate: universe"); // 11

// Ordered matches for "star wars" by edit distance
getMinLevenshteinDistance("star wars", "star wars: a new hope"); // 1
getMinLevenshteinDistance("star wars", "star trek: discovery"); // 3
getMinLevenshteinDistance("star wars", "stargate: sg-1"); // 7
getMinLevenshteinDistance("star wars", "stargate: universe"); // 11

Given this additional processing we can now consider using the Levenshtein distance in our prototype. Let’s have a look at the Dice Coefficient algorithm.

Sørensen-Dice Coefficient

The Sørensen-Dice Coefficient,(or Dice Coefficient DSC), is a measurement of how similar two sets are. The core algorithm looks at the number of adjecent-pairs, or bigrams, that occur in the strings being compared. Another name for this measuremen is the F1 Score; used in statistical analysis of binary classification to measure a test’s accuracy. DSC has the following formula:(2).

$DSC = {2 |X \cap Y| \over |X| + |Y|}$

Which means that the similarity score is equal to twice the ratio of intercepting bigrams, to the total number of possible bigrams. Giving us a range from 0 to 1. Where zero indicates no similarity at all, and 1 indicates identical strings. To illustrate how the algorithm works lets look at how the similarity of the following words stargate and star wars might be computed.

The word stargate(X) has the following bigram set: $$X = \{st|ta|ar|rg|ga|at|te\}$$ and the word star wars(Y) has the following bigram set: $$Y = \{st|ta|ar|r␣|␣w|wa|ar|rs\}$$. Which give us the following values for our equation:

$|X \cap Y| = 3 \\ |X| = 7 \\ |Y| = 8$

If we plug in our values into the DSC equation, left as an excecise to the reader, we arrive at a value of: 0.40, telling us that there is roughly a 40% similarity between our two strings. If we inspect the input X and Y we can at a glace see that indeed seems like an accurate measurement. If we think about what this algorithm is doing, we can begin to make assumptions about how it will behave with out input data. We can assume that when a search term is compared to a shorter candidate string we will arrive at a higher score(better match), than when compared to longer string. We can also see that unlike our Levenshtein distance algorithm, we do not need to further pre or post process our strings to find an accurate measurement we can use. Let us see how the DSC algorithm might score our previous examples:

function diceCoefficient(searchTerm, candidateString) {
/*
Assume some off the shelf implementation from a library etc.
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
Input strings will always be lower-cased to maximize a potential match.
*/
}

// Ordered matches for "stargate" by similarity
diceCoefficient("stargate", "stargate: sg-1"); // 0.7
diceCoefficient("stargate", "stargate: universe"); // 0.58
diceCoefficient("stargate", "star trek: discovery"); // 0.23
diceCoefficient("stargate", "star wars: a new hope"); // 0.22

// Ordered matches for "star trek" by similarity
diceCoefficient("star trek", "star trek: discovery"); // 0.59
diceCoefficient("star trek", "stargate: sg-1"); // 0.28
diceCoefficient("star trek", "star wars: a new hope"); // 0.28
diceCoefficient("star trek", "stargate: universe"); // 0.24

// Ordered matches for "star wars" by similarity
diceCoefficient("star wars", "star wars: a new hope"); // 0.57
diceCoefficient("star wars", "stargate: universe"); // 0.32
diceCoefficient("star wars", "star trek: discovery"); // 0.29
diceCoefficient("star wars", "stargate: sg-1"); // 0.28

If we compare the output of the DSC and Levenshtein Distance algorithms, we can see that the top results are the same, but there is variance amoung the results that follow. DSC gives a wider possible range of values, continous instead of discrete, and also gives higher priority to candidate strings that match on a individual bigrams; for example: the bigram rs in star wars gives higher priority to stargate: universe in the DSC version; unlike the Levenshstein distance which gives priority to star trek: discovery because the word wars has a smaller edit distance to trek than it does other words in some of the other possible candidates.

Picking an Algorithm

Ultimately, the Sørensen-Dice Coefficient gives us a much better measurement for fuzzy searching. It also has the potential to allow for future feature development. We can choose to take advantage of the way it utilizes bigrams to visualize the portions of the strings that we are matching on. And while for our prototype this is not important, it is to our advantage to understand the algorithm and how it can make future feature development simpler.

Data Selection

So we’ve picked our algorithm, and we know the possibilities, we’ve even got some sample code from our investigaion. The next step is to decide what type of data we want to collect, and perhaps given our problem is simple, we should strive to keep the data structure simple too.

Since this is going to be a bookmark service, let us mimic the fields that users are already familiar with. Consider Chrome’s bookmark dialog. The dialog is a simple box that asks for a title, a URL, and a folder to place this bookmark. For our page let us mimic this structure and save a title, a URL, and a field for tags. The tag behaving similar to a folder, but allowing us more flexibility for future feature development.

At the end of the day our small application would consume data of the following format:

[
{
"title": "Star Wars: A New Hope",
"url": "http://www.imdb.com/title/tt0076759/"
"tags": ["Science Fantasy"]
},
// More entries...
]

Assembly Required

Now that we have identified our algorithm and we have our data format, we are ready to asseble our prototype. For this prototype, I’ll be using HTML, vanilla Javascript, and CSS. We’ve taken an off the shelf implementation of DSC from wikibooks.org, and inlined some bookmark data so we can see our algorithm in action.