# Prototyping & Fuzzy String Searching

April 8, 2018

## Introduction

I work at a company where we use a wiki as the go-to place for documentation and information sharing. A few points of contention exist with our wiki. Searching for and locating the needed information is not always fruitful, and the time it takes can be non-trivial. Data is not tagged or organized, and may be buried in a very long page of documentation. Using the wiki has become a last resort; keeping an external list of bookmarks is the norm.

A colleague came to me curious about building a UI that would behave like Chrome’s Omnibar, but would only contain links to resources of interest. He expected entries to match on some metadata and to tolerate malformed input. “I should be able to mash some keys and see a close match,” he said. This is like Chrome’s Omnibar, or a code editor’s quick-search functionality.

I build web UIs day to day, so I envisioned a single input where one could “…mash some keys and see a close match.” Before setting out to build a first prototype, we had to ask how we could solve this problem of partial text search. The options were to solve the problem ourselves, or to find an existing solution. Text search is not the problem we were after solving; it is a pretty well-solved problem in computer science, at least well enough for our use case.

We are not trying to solve this problem ourselves and will use someone else’s implementation. But in this post we’ll take a closer look at two algorithms: Levenshtein distance and Dice’s coefficient.

The entries on wikibooks.org contain an implementation of each of these algorithms. There are also many libraries that implement and test them.

## Shallow Dive into the Algorithms

Before diving into building a prototype we need to choose which algorithm we want to use. The primary criteria being: accuracy and least amount of additional work.

This is the test data that we’ll be testing our algorithms with:

```
stargate: sg-1
stargate: universe
star trek: discovery
star wars: a new hope
```

### Levenshtein Distance

Let us have a look at the Levenshtein distance algorithm first. The Levenshtein distance counts the number of edits needed to transform one string into another, an edit being an addition, removal, or substitution. For instance, we can transform the string abc into abd by substituting c with d, giving us a distance of 1. Another example is bonjour vs hello, which has an edit distance of 6, meaning 6 edits are required to transform one word into the other.

We now have a feeling that retrieving a sorted list that behaves like a fuzzy search might take additional work. After all, we’re trying to check whether a target string contains something like the input string, not transform one into the other.

Let us examine a few examples sorted by their Levenshtein distance:

```javascript
function levenshteinDistance(searchTerm, candidateString) {
  /*
  Assume some off-the-shelf implementation from a library.
  https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance
  Input strings will always be lower-cased to maximize a potential match.
  */
}
```
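For a concrete sense of what such an off-the-shelf implementation does, here is a minimal dynamic-programming sketch in the spirit of the wikibooks page linked above; the name `simpleLevenshtein` is ours, not a library’s:

```javascript
// Minimal dynamic-programming sketch of Levenshtein distance.
// dist[i][j] holds the edit distance between the first i characters
// of `a` and the first j characters of `b`.
function simpleLevenshtein(a, b) {
  a = a.toLowerCase();
  b = b.toLowerCase();
  const dist = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitution = a[i - 1] === b[j - 1] ? 0 : 1;
      dist[i][j] = Math.min(
        dist[i - 1][j] + 1,               // removal
        dist[i][j - 1] + 1,               // addition
        dist[i - 1][j - 1] + substitution // substitution
      );
    }
  }
  return dist[a.length][b.length];
}

simpleLevenshtein("abc", "abd");       // 1
simpleLevenshtein("bonjour", "hello"); // 6
```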

```javascript
// Ordered matches for "stargate" by edit distance
levenshteinDistance("stargate", "stargate: sg-1"); // 6
levenshteinDistance("stargate", "stargate: universe"); // 10
levenshteinDistance("stargate", "star trek: discovery"); // 15
levenshteinDistance("stargate", "star wars: a new hope"); // 15

// Ordered matches for "star trek" by edit distance
levenshteinDistance("star trek", "stargate: sg-1"); // 9
levenshteinDistance("star trek", "star trek: discovery"); // 11
levenshteinDistance("star trek", "stargate: universe"); // 12
levenshteinDistance("star trek", "star wars: a new hope"); // 14

// Ordered matches for "star wars" by edit distance
levenshteinDistance("star wars", "stargate: sg-1"); // 9
levenshteinDistance("star wars", "stargate: universe"); // 11
levenshteinDistance("star wars", "star wars: a new hope"); // 12
levenshteinDistance("star wars", "star trek: discovery"); // 14
```

Notice that the edit distance in these examples cannot be used alone to determine similarity to a longer string, particularly when there are many potential matches of varying length. The Levenshtein algorithm has a minimum edit distance equal to the difference in length of the two strings (1). This results in shorter candidate strings having a lower distance than longer strings which may contain the entire search term.

To circumvent this issue, comparison should only be done on words, rather than on the entire input sequences at once. Below is an implementation that splits both the search term and the candidate string into lists of words. The words in each list are then compared, and the best scores are taken. This ensures that the score reflects the best matches regardless of position in the string.

Note that this naïve implementation assumes that the search term will always be shorter than the candidate string, and that the word order of the search term does not matter, e.g. ‘star wars’ and ‘wars star’ would return the same score against the same candidate string.

```javascript
function getMinLevenshteinDistance(searchTerm, candidateString) {
  let searchTerms = searchTerm.split(' ');
  let candidateStrings = candidateString.split(' ');

  return searchTerms
    .map(term =>
      candidateStrings
        .map(candidate => levenshteinDistance(term, candidate))
        .sort((a, b) => a - b) // Sort numerically in ascending order.
        .shift() // Get edit distance for best match.
    )
    .reduce((totalScore, score) => totalScore + score, 0);
}
```

```javascript
// Ordered matches for "stargate" by edit distance
getMinLevenshteinDistance("stargate", "stargate: sg-1"); // 1
getMinLevenshteinDistance("stargate", "stargate: universe"); // 1
getMinLevenshteinDistance("stargate", "star trek: discovery"); // 4
getMinLevenshteinDistance("stargate", "star wars: a new hope"); // 4

// Ordered matches for "star trek" by edit distance
getMinLevenshteinDistance("star trek", "star trek: discovery"); // 1
getMinLevenshteinDistance("star trek", "star wars: a new hope"); // 3
getMinLevenshteinDistance("star trek", "stargate: sg-1"); // 7
getMinLevenshteinDistance("star trek", "stargate: universe"); // 11

// Ordered matches for "star wars" by edit distance
getMinLevenshteinDistance("star wars", "star wars: a new hope"); // 1
getMinLevenshteinDistance("star wars", "star trek: discovery"); // 3
getMinLevenshteinDistance("star wars", "stargate: sg-1"); // 7
getMinLevenshteinDistance("star wars", "stargate: universe"); // 11
```

Given this additional processing, we can now consider using the Levenshtein distance in our prototype. Let’s have a look at the Dice coefficient algorithm.

### Sørensen-Dice Coefficient

The Sørensen-Dice coefficient (or Dice coefficient, DSC) is a measurement of how similar two sets are. The core algorithm looks at the number of adjacent character pairs, or bigrams, that occur in the strings being compared. Another name for this measurement is the F1 score, used in statistical analysis of binary classification to measure a test’s accuracy. DSC has the following formula (2):

$DSC = {2 |X \cap Y| \over |X| + |Y|}$

This means that the similarity score is equal to twice the number of bigrams shared by both strings, divided by the total number of bigrams in the two strings. That gives us a range from 0 to 1, where 0 indicates no similarity at all, and 1 indicates identical strings. To illustrate how the algorithm works, let’s look at how the similarity of the words stargate and star wars might be computed.

The word stargate (X) has the following bigram set: $$X = \{st|ta|ar|rg|ga|at|te\}$$ and the word star wars (Y) has the following bigram set: $$Y = \{st|ta|ar|r␣|␣w|wa|ar|rs\}$$. This gives us the following values for our equation:

$|X \cap Y| = 3 \\ |X| = 7 \\ |Y| = 8$
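These counts can be checked with a small helper that extracts bigrams from a string; the `bigrams` name here is a hypothetical illustration, not part of the referenced implementation:

```javascript
// Return the list of adjacent character pairs (bigrams) in a string.
function bigrams(input) {
  const pairs = [];
  for (let i = 0; i < input.length - 1; i++) {
    pairs.push(input.slice(i, i + 2));
  }
  return pairs;
}

bigrams("stargate");  // ["st", "ta", "ar", "rg", "ga", "at", "te"]
bigrams("star wars"); // ["st", "ta", "ar", "r ", " w", "wa", "ar", "rs"]
```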

If we plug our values into the DSC equation, we arrive at $DSC = {2 \times 3 \over 7 + 8} = 0.40$, telling us that there is roughly a 40% similarity between our two strings. Inspecting the inputs X and Y, we can see at a glance that this indeed seems like an accurate measurement.

If we think about what this algorithm is doing, we can begin to make assumptions about how it will behave with our input data. We can assume that when a search term is compared to a shorter candidate string, we will arrive at a higher score (a better match) than when it is compared to a longer string. We can also see that, unlike with the Levenshtein distance algorithm, we do not need to further pre- or post-process our strings to arrive at an accurate measurement. Let us see how the DSC algorithm might score our previous examples:

```javascript
function diceCoefficient(searchTerm, candidateString) {
  /*
  Assume some off-the-shelf implementation from a library etc.
  https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
  Input strings will always be lower-cased to maximize a potential match.
  */
}
```
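For a sense of what such an implementation involves, here is a compact sketch along the lines of the wikibooks version, counting shared bigrams as a multiset intersection; `diceCoefficientSketch` is an illustrative name, not the library’s:

```javascript
// Compact sketch of Dice's coefficient over character bigrams.
function diceCoefficientSketch(searchTerm, candidateString) {
  const bigramsOf = s => {
    const pairs = [];
    for (let i = 0; i < s.length - 1; i++) pairs.push(s.slice(i, i + 2));
    return pairs;
  };
  const first = bigramsOf(searchTerm.toLowerCase());
  const second = bigramsOf(candidateString.toLowerCase());

  // Count each shared bigram at most once per occurrence
  // (a multiset intersection).
  const pool = [...second];
  let matches = 0;
  for (const pair of first) {
    const index = pool.indexOf(pair);
    if (index !== -1) {
      pool.splice(index, 1);
      matches++;
    }
  }
  return (2 * matches) / (first.length + second.length);
}

diceCoefficientSketch("stargate", "stargate: sg-1"); // 0.7
```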

```javascript
// Ordered matches for "stargate" by similarity
diceCoefficient("stargate", "stargate: sg-1"); // 0.7
diceCoefficient("stargate", "stargate: universe"); // 0.58
diceCoefficient("stargate", "star trek: discovery"); // 0.23
diceCoefficient("stargate", "star wars: a new hope"); // 0.22

// Ordered matches for "star trek" by similarity
diceCoefficient("star trek", "star trek: discovery"); // 0.59
diceCoefficient("star trek", "stargate: sg-1"); // 0.28
diceCoefficient("star trek", "star wars: a new hope"); // 0.28
diceCoefficient("star trek", "stargate: universe"); // 0.24

// Ordered matches for "star wars" by similarity
diceCoefficient("star wars", "star wars: a new hope"); // 0.57
diceCoefficient("star wars", "stargate: universe"); // 0.32
diceCoefficient("star wars", "star trek: discovery"); // 0.29
diceCoefficient("star wars", "stargate: sg-1"); // 0.28
```

If we compare the output of the DSC and Levenshtein distance algorithms, we can see that the top results are the same, but there is variance among the results that follow. DSC gives a wider range of possible values, continuous instead of discrete, and also gives higher priority to candidate strings that match on individual bigrams. For example, the bigram rs in star wars gives higher priority to stargate: universe in the DSC version, unlike the Levenshtein distance, which prioritizes star trek: discovery because the word wars has a smaller edit distance to trek than to words in some of the other candidates.

### Picking an Algorithm

Ultimately, the Sørensen-Dice coefficient gives us a much better measurement for fuzzy searching. It also has the potential to enable future feature development: we could take advantage of the way it uses bigrams to visualize the portions of the strings that we match on. While this is not important for our prototype, understanding the algorithm helps us see how it could make future feature development simpler.

## Data Selection

So we’ve picked our algorithm, we know the possibilities, and we’ve even got some sample code from our investigation. The next step is to decide what type of data we want to collect, and given that our problem is simple, we should strive to keep the data structure simple too.

Since this is going to be a bookmark service, let us mimic the fields that users are already familiar with. Consider Chrome’s bookmark dialog: a simple box that asks for a title, a URL, and a folder in which to place the bookmark. For our page, let us mimic this structure and save a title, a URL, and a field for tags. A tag behaves similarly to a folder, but allows us more flexibility for future feature development.

At the end of the day our small application would consume data of the following format:

```
[
  {
    "title": "Star Wars: A New Hope",
    "url": "http://www.imdb.com/title/tt0076759/",
    "tags": ["Science Fantasy"]
  },
  // More entries...
]
```

## Assembly Required

Now that we have identified our algorithm and settled on our data format, we are ready to assemble our prototype. For this prototype, I’ll be using HTML, vanilla JavaScript, and CSS. We’ve taken an off-the-shelf implementation of DSC from wikibooks.org and inlined some bookmark data so we can see our algorithm in action.
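The scoring-and-sorting step of such a prototype can be sketched roughly as follows. The `searchBookmarks` name and the injectable `similarity` parameter are illustrative assumptions, not the actual code in the pen:

```javascript
// Score each bookmark's title against the query with a supplied
// similarity function (e.g. a Dice coefficient implementation),
// then return the bookmarks ordered from best to worst match.
function searchBookmarks(query, bookmarks, similarity) {
  return bookmarks
    .map(bookmark => ({
      bookmark,
      score: similarity(query.toLowerCase(), bookmark.title.toLowerCase()),
    }))
    .sort((a, b) => b.score - a.score) // Highest similarity first.
    .map(entry => entry.bookmark);
}
```

Injecting the similarity function keeps the UI glue independent of the algorithm choice, so swapping DSC for another measure later would not touch this code.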

See the Pen Protobar by Bladymir Tellez (@blad) on CodePen.

## Conclusion

In this post we’ve explored how we took an idea for an interactive bookmark landing page, identified a subproblem (fuzzy string searching), and looked at known algorithms and their differences in order to choose one. The analysis led us to a better understanding of how the algorithms work, and surfaced other potential features for the future.

In our prototype (displayed above), we’ve seeded the data into the JS file; however, the data could easily come from an API service that generates the same JSON. To maximize the utility of this prototype, the next steps are to deploy the landing page to a location where interested parties can access it, and ideally one where the curated list can be updated.

What makes this landing page for bookmarks powerful is not just its ability to sort based on fuzzy search criteria, but also the inherent consistency of a curated list in applying naming conventions. By following consistent and predictable naming criteria, we provide a clarity that may not exist where the links originated.