Sublime Text Ctrl+P Like Fuzzy Matching In Few Lines of Python

There were a few people curious in the ST forums on how Sublime Text’s fuzzy matching works.

Sublime Text Fuzzy Search

In reality, I don’t know either. There were some interesting solutions given, but I haven’t researched their methods and workability.

I needed to implement this for a small project I’m working on. I went with a solution that worked decently (I think I saw this within a JS fuzzy searching library). I’m not claiming it to be the best or the fastest (far from it, actually).

However, just in case you need to implement something like this, this might come in handy.

The FuzzyMatcher class

class FuzzyMatcher():

    def __init__(self):
        self.pattern = ''

    def setPattern(self, pattern):
        self.pattern = '.*?'.join(map(re.escape, list(pattern)))

    def score(self, string):
        match = re.search(self.pattern, string)
        if match is None:
            return 0
        else:
            return 100.0 / ((1 + match.start()) * (match.end() - match.start() + 1))

There are two methods, one sets the pattern to match against, and the other is used to calculate the score of the candidates.

You would use them as:

fuzzyMatcher = FuzzyMatcher()
fuzzyMatcher.setPattern('aad') # Set a pattern to match aganinst
score1 = fuzzyMatcher.score('This is string one') # Score for this string against the pattern
score2 = fuzzyMatcher.score('This is string two') # And so on...

The idea

First, let’s understand the principles we’re working with:

  1. The pattern would be few characters
  2. A match would mean that the characters in the pattern appear in the same order as in the matched string
  3. A match found near the beginning of a string is scored more than a match found near the end
  4. A match is scored more if the characters in the patterns are closer to each other, while the score is lower if they are more spread out
  5. We are not assigning more weights to certain characters than the other

That’s the criterias we’re going with. As you can see, we can construct a regex to see if a pattern appears in a string.

If the pattern is aad, our regex would look like a.*?a.*?d. We’re looking for the shortest substring that has the characters aad in that order, with arbitrary characters in between them.

We can easily figure out the position of the match, as well as the length of the match. If the length of the match is larger, it automatically implies that the characters are more spread out.

Explanation of the Python program

Let’s look at the setPattern function that creates the regex. We have a separate function to create the regex for clarity.

def setPattern(self, pattern):
    self.pattern = '.*?'.join(map(re.escape, list(pattern)))

We first convert the string into a character array using list(pattern), and then map re.escape to the characters. Some characters have special meaning in regex, but we want the user’s input to be treated as a literal.

Then, we join the escaped characters using .*?. This would turn something like aad into a.*?a.*?d.

Now, let’s look at the score function.

def score(self, string):
    match = re.search(self.pattern, string)
    if match is None:
        return 0
    else:
        return 100.0 / ((1 + match.start()) * (match.end() - match.start() + 1))

Here, we’re doing a regex search for the input string, against the pattern we previously set. If we have no match, return 0 as the score.

Else:

  1. The score is inversely proportional to the start of the match. We add a 1 since strings are zero-indexed, and we don’t want a division by 0.
  2. The score is inversely proportional to the length of the match.
  3. We scale the results by multiplying it by 100.0. This value is arbitrary. Since the string lengths I dealt with were small, this worked. You may have to adjust this.

Conclusion

That was a simple (possibly inefficient) solution that works well. Of course, it can be optimized by pre-compiling the regex and stuff like that. I left it out just for clarity.

And this is in no way restricted to Python. Any language with a decent regex implementation would work.

Also, the score calculation can be improved by assigning more weight to characters appearing closer together than the match being farther away in a string. I’m researching on that as we speak.

Comments

  1. Adam Forsyth says:

    Nice post. I have a couple small notes. What’s the purpose of the call to “list”? “map” should iterate over a string just as well, I don’t think there is a need to convert it. I would rewrite “setPattern” as:

    def set_pattern(self, pattern):
    self.pattern = ‘.*?’.join(re.escape(char) for char in pattern)

    to follow Python conventions and also consider using a “property” around “pattern” rather than having to call a setter.

    • Raahul Seshadri says:

      You’re right. The list is unnecessary (since even join works directly on a string). I need to get more acquainted with the Python conventions.

  2. Ivan JĂșnior says:

    Great post!

    BTW, A key difference in their implementation is that they probably create a inverted index, avoiding linear iteration and allowing a bigger number of files in a small amout of time.

    I’m not sure about it, but I’ve been using a VIM plugin with a similiar feature and it does that way.

Speak Your Mind

*

Current day month ye@r *