In this blog post, we’ll have a look at the preview API for full text search in Couchbase 4.5. Please note that this API, released in the latest Java SDK (2.2.4), is still @Experimental.
We’ll cover:
- Full Text Search in Couchbase?
- The Java API
- Various Types of Queries
- Getting Hit Explanations
- Conclusion
This experimental API can be used with Couchbase Server 4.5 Developer Preview, provided you use the 2.2.4 Java SDK client, which you can get through Maven. Add the following dependency to your pom.xml:

```xml
<dependency>
    <groupId>com.couchbase.client</groupId>
    <artifactId>java-client</artifactId>
    <version>2.2.4</version>
</dependency>
```
Full Text Search in Couchbase?
Yes! The upcoming 4.5 server release (codename Watson) will include a full text indexer (FTS, also known as CBFT) based on the open-source Bleve project. Bleve is all about full-text search and indexing in Go (shoutout to our very own Marty Schoch for initiating this project).
The idea is to leverage Bleve to provide off-the-shelf full text search in Couchbase Server, without having to use connectors to external software that runs on its own cluster. If that off-the-shelf solution doesn’t meet all your needs, you can of course still use these connectors, but for simpler needs you are good to go with a single solution.
FTS offers a host of capabilities that are provided by Bleve: Text Analyzers, Tokenizers and post-processing Token Filters that are beyond the scope of this post, as well as the numerous types of queries that you can run on the resulting indexes. Let’s see what those types are and how you can expect to use them in the context of the Java SDK.
In the rest of this blog post, we’ll use three indexes that you will be able to build through the web administrative console in the upcoming 4.5 Developer Preview.
[Screenshot: the list of indexes in the UI]
We have:
- a beerIndex that indexes the whole content of each document in the beer-sample bucket.
- a travelIndex that indexes the whole content of each document in the travel-sample bucket.
- an alias index, commonIndex, that is a union of the two indexes above.
The Java API
The entry point of the full text search feature in the Java SDK is on the Bucket, using the query(SearchQuery ftq) method. This is consistent with the existing querying methods already present in the API to run a ViewQuery or a N1qlQuery.
The API for full text search follows the builder pattern. Identify the type of query you want and use the corresponding builder to construct it, get the SearchQuery out of it using build(), and execute it using bucket.query(searchQuery).
Let’s take a (very simple) example and see how it can be consumed:
```java
// we'll use that Cluster and Bucket for the remainder of the examples
Cluster cluster = CouchbaseCluster.create("127.0.0.1");
Bucket bucket = cluster.openBucket("beer-sample");

// we use a simple form of query:
SearchQuery ftq = MatchQuery.on("beerIndex").match("national").limit(3).build();

// we fire the query and look at results
SearchQueryResult result = bucket.query(ftq);
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(row);
}
```
If we look at each section individually, here’s what happened:
- We create a simple MatchQuery on a single term.
- It runs on the beer sample index (.on("beerIndex")) and looks for textual occurrences of the word “national” (.match("national")) or close terms.
- Additional configuration is done to limit the number of results to 3 (limit(3)) and the actual query is created at this point (.build()).
- The query is executed (bucket.query(ftq)) and returns a SearchQueryResult.
- We output the result’s totalHits() and individual rows (also accessible as a list through hits()).
Running that code outputs:
```
totalHits: 31
SearchQueryHit{id='dc_brau', score=0.09068310490562362, fragments={}}
SearchQueryHit{id='brouwerij_nacional_balashi', score=0.12085760187148556, fragments={}}
SearchQueryHit{id='cervecera_nacional', score=0.09863195902067363, fragments={}}
```
We see that totalHits() gives us the actual number of hits before the limit was applied. The hits() method returns 3 SearchQueryRow objects, as requested.
Each hit contains the key to the associated document in Couchbase (id()), as well as more information on the matching, e.g. a score for the match (score()). If you want, you can retrieve the associated document using bucket.get(row.id()):
```java
result = bucket.query(ftq);
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(row);
    System.out.println(bucket.get(row.id()).content());
}
```
This gives us, for the first hit:
```
SearchQueryHit{id='dc_brau', score=0.09068310490562362, fragments={}}
{"country":"United States","website":"http://www.dcbrau.com/","code":"20018","address":["3178-B Bladensburg Rd. NE"],"city":"Washington","phone":"","name":"DC Brau",
"description":"The first brewery to open in the nation's capital since Prohibition.","state":"DC","type":"brewery","updated":"2011-08-08 19:02:40"}
```
If we look closely at the document’s JSON, we notice where the document probably matched. In the “description” field of the document, there is this sentence:
The first brewery to open in the nation’s capital since Prohibition.
Also notice that the text query looked for the requested word and for derived words that share the same root. It actually applied a fuzziness of 2 (see the next section).
This pattern can be applied to the other types of queries as well, so let’s have a look at a few more and see what kinds of search can be performed.
Various Types of Queries
Fuzzy Querying
Fuzzy querying can be performed with the MatchQuery, specifying a Levenshtein distance as the maximum fuzziness() to allow on the term:

```java
result = bucket.query(MatchQuery.on("beerIndex")
    .match("sammar")
    .field("name")
    .fuzziness(2) //actually the default
    .build());

System.out.println("\nFuzzy Match Query");
System.out.println("totalHits (fuzziness = 2): " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(bucket.get(row.id()).content().get("name"));
}
```
At a fuzziness of 2, this matches words like “hammer”, “mamma” or “summer”:
```
Fuzzy Match Query
totalHits (fuzziness = 2): 45
Mamma Mia! Pizza Beer
Redhook Long Hammer IPA
Summer Wheat
```
At a fuzziness of 1, no match is found:
```
Fuzzy Match Query
totalHits (fuzziness = 1): 0
```
A query type dedicated to fuzziness, which does not apply any analyzer, is also provided: the FuzzyQuery.
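To make the fuzziness numbers above more concrete, here is a minimal, self-contained sketch of the Levenshtein distance (the number of single-character insertions, deletions and substitutions needed to turn one word into another). The class and method names are ours, for illustration only; this is not the code that Bleve or the SDK runs.

```java
final class LevenshteinSketch {

    /** Classic dynamic-programming edit distance, keeping only two rows in memory. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Each of the three matched words is exactly two edits away from "sammar".
        for (String word : new String[] {"hammer", "mamma", "summer"}) {
            System.out.println("sammar -> " + word + ": " + distance("sammar", word));
        }
    }
}
```

All three words sit at a distance of exactly 2 from “sammar”, which is consistent with them matching at a fuzziness of 2 and not at a fuzziness of 1.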
Multiple Terms: MatchPhrase
As we saw, MatchQuery is a term-based query that lets you optionally specify fuzziness. It also applies to the searched term the same filters that may have been applied to the field (e.g. stemming):

```java
MatchQuery.on("beerIndex")
    .match("sesonal")
    .fuzziness(2)
    .field("description").build();
```

You can search for multiple terms in a single query by using a Match Phrase query. Terms are analyzed and fuzziness can be optionally activated:

```java
MatchPhraseQuery.on("beerIndex").matchPhrase("summer seasonal").field("description");
```
Regexp Query
A RegexpQuery doesn’t only do literal matching: it lets you match terms using a regular expression. Take this example:

```java
result = bucket.query(RegexpQuery.on("beerIndex")
    .regexp("[tp]ale")
    .field("name")
    .build());

System.out.println("\nRegexp Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(bucket.get(row.id()).content().get("name"));
}
```
Notice this query targets a particular field in the JSON (field("name")). We want all names that contain either “tale” or “pale”. Here are a few names that match this query:

```
Regexp Query
totalHits: 408
Tall Tale Pale Ale
Bard's Tale Beer Company
Pale Ale
```
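As a rough illustration of what such a query does, the sketch below applies the same pattern to a few names with plain java.util.regex. It rests on two assumptions that the post does not spell out: that the pattern has to match a whole indexed term rather than a substring, and that terms are lower-cased words. The tokenizing here is a crude stand-in for a real analyzer, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

final class RegexpTermSketch {

    /** Crude stand-in for an analyzer: lower-case, then split on anything that is not a letter or digit. */
    static String[] terms(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(term -> !term.isEmpty())
                .toArray(String[]::new);
    }

    /** True if at least one term matches the pattern in full (assumption: whole-term matching). */
    static boolean anyTermMatches(String text, Pattern pattern) {
        return Arrays.stream(terms(text)).anyMatch(term -> pattern.matcher(term).matches());
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[tp]ale");
        // The last name is made up, to show a near miss ("stale" is not a full match).
        for (String name : new String[] {"Tall Tale Pale Ale", "Bard's Tale Beer Company", "Stale Ale"}) {
            System.out.println(name + " -> " + anyTermMatches(name, pattern));
        }
    }
}
```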
Prefix Query
A PrefixQuery looks for word occurrences that start with the given string:

```java
result = bucket.query(PrefixQuery.on("beerIndex")
    .prefix("weiss")
    .field("name")
    .build());

System.out.println("\nPrefix Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(bucket.get(row.id()).content().get("name"));
}
```

Once again we only look inside the name field, this time for words that start with “weiss”:

```
Prefix Query
totalHits: 74
Bavarian-Weissbier Hefeweisse / Weisser Hirsch
Münchner Kindl Weissbier / Münchner Weisse
Franziskaner Hefe-Weissbier Hell / Franziskaner Club-Weiss
Weissenheimer Wheat
```
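The “start with” semantics apply to individual words, not to the whole name. Here is a small sketch of that idea, assuming (as a simplification of what an analyzer does) that names are lower-cased and split on non-alphanumeric characters. The class is hypothetical and does not touch the SDK.

```java
import java.util.Arrays;
import java.util.Locale;

final class PrefixTermSketch {

    /** True if any lower-cased word of the text starts with the given prefix. */
    static boolean hasTermWithPrefix(String text, String prefix) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .anyMatch(term -> !term.isEmpty() && term.startsWith(prefix));
    }

    public static void main(String[] args) {
        // "hefeweisse" contains "weiss" but does not start with it, so it does not qualify on its own.
        System.out.println(hasTermWithPrefix("Hefeweisse", "weiss"));
        // "Hefe-Weissbier" is split into "hefe" and "weissbier", and the latter qualifies.
        System.out.println(hasTermWithPrefix("Franziskaner Hefe-Weissbier Hell", "weiss"));
        System.out.println(hasTermWithPrefix("Weissenheimer Wheat", "weiss"));
    }
}
```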
Range and Date Queries
FTS is also good with non-textual data. For instance, the NumericRangeQuery allows you to look for numerical values within a provided range:

```java
result = bucket.query(NumericRangeQuery.on("beerIndex")
    .min(3)
    .max(4)
    .field("abv")
    .fields("name", "abv")
    .build());

System.out.println("\nNumeric Range Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    JsonDocument doc = bucket.get(row.id());
    System.out.println("\"" + doc.content().get("name") + "\", abv: " + doc.content().get("abv"));
}
```

Which outputs:

```
Numeric Range Query
totalHits: 62
"Stud Service Stout", abv: 3.1
"Blonde", abv: 3.0
"Locke Mountain Light", abv: 3.7
```
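The output includes a beer at exactly 3.0, so the lower bound is clearly inclusive. The sketch below reproduces that filter in memory, with the added assumption (not confirmed by this post) that the upper bound is exclusive by default. The class is hypothetical and the last beer is made up to show a value outside the range.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NumericRangeSketch {

    /** Assumption: min is inclusive and max is exclusive. */
    static boolean inRange(double value, double min, double max) {
        return value >= min && value < max;
    }

    /** Returns the names whose abv falls within the range, in insertion order. */
    static List<String> namesInRange(Map<String, Double> abvByName, double min, double max) {
        return abvByName.entrySet().stream()
                .filter(entry -> inRange(entry.getValue(), min, max))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> beers = new LinkedHashMap<>();
        beers.put("Stud Service Stout", 3.1);
        beers.put("Blonde", 3.0);
        beers.put("Locke Mountain Light", 3.7);
        beers.put("Hypothetical Strong Ale", 8.5); // made-up entry, outside the range
        System.out.println(namesInRange(beers, 3, 4));
    }
}
```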
Dates are covered as well with the DateRangeQuery:

```java
// requires java.util.Calendar and java.util.Date
Calendar calendar = Calendar.getInstance();
calendar.set(2011, Calendar.MARCH, 1);
Date start = calendar.getTime();
calendar.set(2011, Calendar.APRIL, 1);
Date end = calendar.getTime();

result = bucket.query(DateRangeQuery.on("beerIndex")
    .start(start)
    .end(end)
    .field("updated")
    .fields("name", "updated")
    .build());

System.out.println("\nDate Range Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    JsonDocument doc = bucket.get(row.id());
    System.out.println("\"" + doc.content().get("name") + "\", updated: " + doc.content().get("updated"));
}
```

Which outputs:

```
Date Range Query
totalHits: 4
"Dank", updated: 2011-03-16 09:06:54
"Oso", updated: 2011-03-16 09:05:15
"Summer Teeth", updated: 2011-03-08 12:22:14
"Columbus Brewing Company", updated: 2011-03-08 12:19:07
```
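One subtlety in the snippet above: Calendar.getInstance() starts from the current date and time, and calendar.set(year, month, day) leaves the time-of-day fields untouched, so both boundaries fall at whatever time of day the program happens to run. If you would rather have boundaries at midnight, a possible approach (assuming Java 8 or later; the helper name is ours) is to build the Date from java.time:

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Date;

final class DateBoundsSketch {

    /** Builds a java.util.Date at 00:00:00 of the given day in the JVM's default time zone. */
    static Date startOfDay(int year, int month, int day) {
        return Date.from(LocalDate.of(year, month, day)
                .atStartOfDay(ZoneId.systemDefault())
                .toInstant());
    }

    public static void main(String[] args) {
        Date start = startOfDay(2011, 3, 1); // months are 1-based in java.time
        Date end = startOfDay(2011, 4, 1);
        System.out.println(start + " -> " + end);
    }
}
```

The resulting Date objects can then be passed to start(...) and end(...) exactly as in the original example.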
Generic Querying
FTS also offers a more generic form of querying that combines phrases, terms and more using the String Query syntax. This is accessible in the API through the StringQuery.
Combining
Additionally, you can combine simple criteria like MatchQuery using combination queries. Taking these two simple term queries:

```java
MatchQuery bitterQuery = MatchQuery.on("beerIndex").match("bitter").field("description").build();
MatchQuery maltyQuery = MatchQuery.on("beerIndex").match("malty").field("description").build();
```

You could combine them in different manners:
- a conjunction looks for all the terms

```java
ConjunctionQuery.on("beerIndex").conjuncts(bitterQuery, maltyQuery)
```

- a disjunction looks for at least one term

```java
DisjunctionQuery.on("beerIndex").disjuncts(bitterQuery, maltyQuery)
```

- a boolean query allows you to combine the two approaches

```java
BooleanQuery.on("beerIndex").must(bitterQuery).mustNot(maltyQuery)
```
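If the three combinators feel abstract, it may help to see the same logic expressed with plain java.util.function.Predicate over a description string. This is only an analogy: the class is hypothetical, and the naive substring test stands in for a real analyzed term match.

```java
import java.util.Locale;
import java.util.function.Predicate;

final class CombineSketch {

    /** Naive stand-in for a term match: case-insensitive substring test. */
    static Predicate<String> mentions(String term) {
        return text -> text.toLowerCase(Locale.ROOT).contains(term);
    }

    static final Predicate<String> BITTER = mentions("bitter");
    static final Predicate<String> MALTY = mentions("malty");

    // conjunction: every criterion must hold
    static final Predicate<String> CONJUNCTION = BITTER.and(MALTY);
    // disjunction: at least one criterion must hold
    static final Predicate<String> DISJUNCTION = BITTER.or(MALTY);
    // boolean query from the example: must(bitter) combined with mustNot(malty)
    static final Predicate<String> MUST_AND_MUST_NOT = BITTER.and(MALTY.negate());

    public static void main(String[] args) {
        String description = "A bitter ale with a dry finish"; // made-up description
        System.out.println("conjunction: " + CONJUNCTION.test(description));
        System.out.println("disjunction: " + DISJUNCTION.test(description));
        System.out.println("must/mustNot: " + MUST_AND_MUST_NOT.test(description));
    }
}
```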
Getting Hit Explanations
If you want to get insights into the scoring and matching of a particular SearchQueryRow, you can build your query using the .explain(true) parameter and get details from the index in the result’s explanation() field:

```json
{"message":"sum of:","children":[{"message":"product of:","children":[{"message":"sum of:","children":[{"message":"product of:","children":[{"message":"sum of:","children":[
{
  "message": "weight(_all:national^1.000000 in penn_brewery-penn_marzen), product of:",
  "children": [
    {
      "message": "queryWeight(_all:national^1.000000), product of:",
      "children": [
        {
          "message": "boost",
          "value": 1
        },
        {
          "message": "idf(docFreq=17, maxDocs=7303)",
          "value": 7.005668743723945
        },
        {
          "message": "queryNorm",
          "value": 0.1427415478209491
        }
      ],
      "value": 0.9999999999999999
    },
    {
      "message": "fieldWeight(_all:national in penn_brewery-penn_marzen), product of:",
      "children": [
        {
          "message": "tf(termFreq(_all:national)=1",
          "value": 1
        },
        {
          "message": "fieldNorm(field=_all, doc=penn_brewery-penn_marzen)",
          "value": 0.10000000149011612
        },
        {
          "message": "idf(docFreq=17, maxDocs=7303)",
          "value": 7.005668743723945
        }
      ],
      "value": 0.7005668848116544
    }
  ],
  "value": 0.7005668848116543
}
],"value":0.7005668848116543},{"message":"coord(1/1)","value":1}],"value":0.7005668848116543}],"value":0.7005668848116543},{"message":"coord(1/1)","value":1}],"value":0.7005668848116543}],"value":0.7005668848116543}
```
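The explanation tree can be read as plain arithmetic: each “product of:” node is the product of its children. The sketch below redoes that arithmetic with the values from the output above. The idf formula, 1 + ln(maxDocs / (docFreq + 1)), is our assumption (it is the classic TF-IDF form and it reproduces the reported value closely), not something the explanation itself states. Class and method names are hypothetical.

```java
final class ScoreSketch {

    /** Assumed inverse document frequency formula: 1 + ln(maxDocs / (docFreq + 1)). */
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    /** queryWeight node: product of boost, idf and queryNorm. */
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    /** fieldWeight node: product of tf, fieldNorm and idf. */
    static double fieldWeight(double tf, double fieldNorm, double idf) {
        return tf * fieldNorm * idf;
    }

    public static void main(String[] args) {
        double idf = 7.005668743723945; // value reported in the explanation
        double qw = queryWeight(1, idf, 0.1427415478209491);
        double fw = fieldWeight(1, 0.10000000149011612, idf);
        System.out.println("idf (recomputed) = " + idf(17, 7303));
        System.out.println("queryWeight      = " + qw);
        System.out.println("fieldWeight      = " + fw);
        // The weight node is the product of the two; the coord(1/1) factors multiply by 1.
        System.out.println("score            = " + qw * fw);
    }
}
```

Up to floating-point rounding, the product lands on the 0.7005668848116543 reported at the root of the tree.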
Conclusion
We hope that this preview of the API has piqued your interest!
Go ahead and download the first Developer Preview of Couchbase 4.5 with embedded Full Text Search service. We hope that you’ll be able to quickly start searching using the associated Java SDK API.
And until then… Happy coding!
– The Java SDK Team