I am hoping to replace a hand-rolled solution based on the “damerauLevenshtein” algorithm with an FTS counterpart.
How does one design the index, or construct a query, so that it returns only the matches whose similarity to the search term exceeds a given percentage, say 50%?
If I search for “Joseph Public” in a name field, I would like all matches returned whose similarity to that search term exceeds 50% (or whatever similarity threshold is provided).
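To make the threshold concrete: by “similarity” I mean a normalized score in [0, 1] derived from the edit distance. A minimal sketch of the semantics I have in mind (difflib is used here purely as a stand-in for our actual damerauLevenshtein code, and all names are illustrative):

from difflib import SequenceMatcher  # stand-in only; our real code computes Damerau-Levenshtein

def similar_enough(term: str, value: str, threshold: float = 0.5) -> bool:
    # "Similarity exceeding 50%" means this normalized score is above 0.5.
    return SequenceMatcher(None, term.lower(), value.lower()).ratio() > threshold

# similar_enough("Joseph Public", "Josef Publik")  -> True (close spellings pass)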
I can’t think of a direct/explicit way to achieve this.
But a couple of related options come to mind:
Try a match query with a “prefix_length” parameter set high enough that the exact-prefix portion of each token guarantees the minimum amount (percentage) of matching you need, so that much of every token is already matched exactly.
The match query also accepts a “fuzziness” parameter (an edit distance of at most 2), which is then applied to the remainder of each token beyond the specified prefix_length.
e.g.:
{
  "query": {
    "match": "Joseph Public",
    "field": "name",
    "operator": "and",
    "fuzziness": 2,
    "prefix_length": 7
  }
}
Another way to achieve a similar result, when there are always multiple tokens to search for, is to use boosting based on the number of tokens searched.
For example, you can have a disjunction query with multiple child match_phrase/phrase queries (depending on your requirements), giving the highest boost to the child query with the maximum number of tokens to search for, as in the example below (concrete boost values of 3.0, 2.0 and 1.0 standing in for N, 2N/3 and N/3).
e.g.:
{
  "query": {
    "disjuncts": [
      { "match_phrase": "term1 term2 term3", "field": "name", "boost": 3.0 },
      { "match_phrase": "term1 term2", "field": "name", "boost": 2.0 },
      { "match_phrase": "term2 term3", "field": "name", "boost": 1.0 }
    ]
  }
}
But all of these are approximations rather than a precise answer to your requirement.
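If you need the cutoff to be exact, one pragmatic pattern is to let FTS do loose candidate retrieval (fuzziness/prefix_length as above, with “fields” requested in the search body so the stored field values come back in the hits) and then enforce the percentage threshold client-side. A minimal sketch, assuming hits are taken from the FTS REST response and using the optimal-string-alignment variant of Damerau-Levenshtein; the function and variable names here are illustrative, not an FTS API:

def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def similarity(a: str, b: str) -> float:
    """Normalize the distance into a 0..1 similarity score."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest

def filter_hits(hits, query: str, field: str, threshold: float = 0.5):
    """Keep only hits whose stored field value is similar enough to the query."""
    return [h for h in hits
            if similarity(query.lower(), h["fields"][field].lower()) >= threshold]

# Example shape of usage (response parsed from the FTS REST query endpoint,
# with "fields": ["name"] included in the search request body):
# good = filter_hits(resp["hits"], "Joseph Public", "name", threshold=0.5)

This keeps the index and query simple while reusing the exact similarity measure you already trust; FTS narrows the candidate set so the expensive distance computation only runs on a handful of hits.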
I appreciate your effort. I will review your suggestions to determine whether they get us closer to our objective.
The more code I can replace with features and solutions already available in FTS, the better. A 25% reduction or more would be a nice first round of refactoring and optimization.