@flaviu To support phrase search that respects the order of the terms, you will need to enable “include term vectors” for the field in question. You can choose the analyzer as needed, but for your example text I’ll go with the standard analyzer so the numbers don’t get dropped.
With the above settings, the term dictionary will also include array positions for your text. Here’s a sample …
dictionary:
word1 - 204 (cc) posting byteSize: 20 cardinality: 2
word2 - 249 (f9) posting byteSize: 20 cardinality: 2
word3 - 294 (126) posting byteSize: 20 cardinality: 2
word4 - 331 (14b) posting byteSize: 18 cardinality: 1
word5 - 366 (16e) posting byteSize: 18 cardinality: 1
Now you can run a match_phrase query against this field, and it will take the order of the terms into account. Note that “term vectors” are a requirement for the match_phrase query.
Here are queries that would work …
{"query": {"field": "fieldX", "match_phrase": "word1 word2 word3"}}
{"query": {"field": "fieldX", "match_phrase": "word1 word2 word3 word4 word5"}}
and here are those that won’t …
{"query": {"field": "fieldX", "match_phrase": "word5 word1"}}
{"query": {"field": "fieldX", "match_phrase": "word1 word2 word3 word5 word4"}}
Remember that a match_phrase query is an analytic query, so the analyzer for the text field (from the index definition) is applied to the search criteria before the search is executed.
If you have 2 documents with these contents in fieldX:
"word1 word2 word3"
"word1 word2 word3 word4 word5"
Then running a match_phrase search for word1 word2 word3 will return both documents as hits, scoring the first above the second because it is an exact match.
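To make the ordering behaviour concrete, here is a minimal Python sketch of phrase matching over a positional index. It is only an illustration of the idea, not how the search engine is actually implemented: the `analyze`, `build_index` and `match_phrase` helpers and the `doc1`/`doc2` ids are hypothetical names, the analyzer is a simple lowercase-and-split stand-in, and scoring is left out entirely.

```python
from collections import defaultdict

def analyze(text):
    # Stand-in for a real analyzer (assumption): lowercase, split on whitespace.
    return text.lower().split()

def build_index(docs):
    # term -> doc id -> list of token positions (a toy version of term vectors)
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term][doc_id].append(pos)
    return index

def match_phrase(index, phrase):
    # The search criteria go through the same analyzer as the indexed text.
    terms = analyze(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # A hit needs the terms at consecutive positions, in the given order.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {
    "doc1": "word1 word2 word3",
    "doc2": "word1 word2 word3 word4 word5",
}
index = build_index(docs)

for phrase in ("word1 word2 word3", "word5 word1", "word1 word2 word3 word5 word4"):
    print(phrase, "->", match_phrase(index, phrase))
```

In this sketch the first phrase matches both documents, while the other two match nothing because the positions of the terms do not line up in the requested order.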
Alternatively, you could look into applying a custom analyzer with a shingle token filter while indexing your data and using a non-analytic query such as term to search for your data.
Here’s the definition of a custom analyzer …
"analysis": {
  "analyzers": {
    "temp_shingle": {
      "token_filters": [
        "shingle_min_5_max_5"
      ],
      "tokenizer": "whitespace",
      "type": "custom"
    }
  },
  "token_filters": {
    "shingle_min_5_max_5": {
      "filler": "",
      "max": 5,
      "min": 5,
      "output_original": false,
      "separator": " ",
      "type": "shingle"
    }
  }
}
With this definition, the index only holds shingles of exactly 5 tokens; nothing shorter or longer is indexed, meaning the text word1 word2 word3 is not indexed at all. Here’s a sample term dictionary for the above 2 documents …
dictionary:
word1 word2 word3 word4 word5 - 9223372039002259457 (8000000080000001) -- docNum: 1, norm: 0.000000
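As a rough illustration of why only one term ends up in that dictionary, here is a small Python sketch of a whitespace tokenizer followed by a fixed-size shingle step. The `shingles`, `build_dictionary` and `term_query` helpers are hypothetical names used for illustration and only approximate the behaviour of the analyzer defined above (min 5, max 5, no original tokens emitted).

```python
def shingles(text, size=5, separator=" "):
    # Whitespace tokenizer, then every run of `size` consecutive tokens.
    tokens = text.split()
    return [separator.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]

def build_dictionary(docs):
    # term -> set of doc ids; each shingle is stored as a single term.
    dictionary = {}
    for doc_id, text in docs.items():
        for term in shingles(text):
            dictionary.setdefault(term, set()).add(doc_id)
    return dictionary

def term_query(dictionary, term):
    # Non-analytic lookup: the term is matched exactly as given.
    return sorted(dictionary.get(term, set()))

docs = {
    "doc1": "word1 word2 word3",
    "doc2": "word1 word2 word3 word4 word5",
}
dictionary = build_dictionary(docs)

print(shingles(docs["doc1"]))  # fewer than 5 tokens, so no shingles
print(shingles(docs["doc2"]))  # exactly one 5-token shingle
print(term_query(dictionary, "word1 word2 word3 word4 word5"))
```

Because the term lookup is exact, a search for word1 word2 word3 finds nothing here, which is the trade-off of this approach compared with match_phrase.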
Remember to hook this analyzer to your fieldX while defining the index. Now here’s a query …
{"query": {"field": "fieldX", "term": "word1 word2 word3 word4 word5"}}
Hope this helps.