Character Filter

gizmo74 · September 20, 2018, 2:18pm

Hi,

I’m playing with character filters. The idea is to filter some German umlauts and other characters. I can’t use the standard “de” analyzer, because I need prefix/wildcard search. So the idea is to replace them in Couchbase via a character filter (ü -> u etc.) and manually do the same with the query string (because FTS doesn’t apply the analyzer to wildcard/prefix queries).
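To make the "manually do the same with the query string" part concrete, here is a minimal sketch of what I mean on the client side. The mapping table and the function name `fold_query` are only illustrative, and the lowercasing step assumes the index analyzer also lowercases its tokens; the mapping has to mirror whatever the character filter in the index actually does.

```python
import unicodedata

# Hypothetical mapping; it must match the character filter(s) defined in the index.
CHAR_MAP = {"ü": "u", "ö": "o", "ä": "a", "ß": "ss"}


def fold_query(term: str) -> str:
    """Apply the same replacements to a query term that the index applies to text."""
    # Normalize to NFC so that "u" + combining diaeresis becomes the single char "ü".
    normalized = unicodedata.normalize("NFC", term)
    # Assumption: the index analyzer lowercases tokens, so the query is lowercased too.
    lowered = normalized.lower()
    # Replace mapped characters; wildcard characters such as "*" pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in lowered)


if __name__ == "__main__":
    print(fold_query("Müll*"))
```

The folded term would then be sent as the wildcard or prefix query instead of the raw user input.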

I indexed some documents with texts like “hello mister müller”.

Standard (no filter): wildcard query with müll* works.
Character filter with “regular expression = ü, replace = u”: wildcard query with mull* does NOT work.
Character filter with “regular expression = ü, replace = [empty]”: wildcard query with mll* works.
Character filter with “regular expression = e, replace = a”: wildcard query with hall* works.

So something seems to be wrong with the ü -> u replacement, while ü -> empty and e -> a work perfectly.

Am I doing something wrong? Or could this be a UTF-8 problem, because ü is a 2-byte character while u is a 1-byte character?
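For reference, this is the byte-length difference I mean, shown with a small Python sketch. It only illustrates the UTF-8 sizes involved; it does not show what Couchbase does internally, and the helper name `utf8_len` is just for this example. Note that the ü -> empty replacement also changes the byte length and still works, so this is only a guess.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    # Per-character sizes: "ü" needs two bytes, plain ASCII letters need one.
    for ch in ("ü", "u", "e", "a"):
        print(repr(ch), utf8_len(ch))

    # Effect of each replacement on the byte length of the indexed token.
    original = "müller"
    for replacement in ("u", ""):
        replaced = original.replace("ü", replacement)
        print(original, utf8_len(original), "->", replaced, utf8_len(replaced))
```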

Thanks, Pascal

gizmo74 · September 21, 2018, 3:13pm

I have now created an index with an edge_ngram token filter. It works as expected with a match query and is also faster than prefix queries… I’ll continue testing that for my use case.
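As a rough sketch of why this works: an edge n-gram filter stores the leading prefixes of each token in the index, so a plain match query for "mull" hits an indexed term directly instead of scanning for a prefix. The Python below only illustrates the idea; the function name and the `min_len`/`max_len` values are illustrative and not taken from my actual index definition.

```python
def edge_ngrams(token: str, min_len: int = 2, max_len: int = 5) -> list[str]:
    """Return the leading prefixes of a token, from min_len up to max_len characters."""
    upper = min(max_len, len(token))
    return [token[:n] for n in range(min_len, upper + 1)]


if __name__ == "__main__":
    # Token as it would look after a character filter has replaced "ü" with "u".
    indexed_terms = edge_ngrams("muller")
    print(indexed_terms)
    # A match query for a prefix is an exact term lookup against these n-grams.
    print("mull" in indexed_terms)
```

The trade-off is a larger index, since every token is stored several times at different lengths.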
