Preview of Full Text Search in Couchbase using the Java SDK

In this blog post, we’ll have a look at the preview API for full text search in Couchbase 4.5. Please note that this API, released in the latest Java SDK (2.2.4), is still @Experimental.

We’ll cover:

Full Text Search in Couchbase?
The Java API
Various Types of Queries
Getting Hit Explanations
Conclusion

This experimental API can be used with Couchbase Server 4.5 Developer Preview, provided you use the 2.2.4 Java SDK client, which you can get through Maven. Add the following dependency to your pom.xml:


    <dependency>
        <groupId>com.couchbase.client</groupId>
        <artifactId>java-client</artifactId>
        <version>2.2.4</version>
    </dependency>

Full Text Search in Couchbase?

Yes! The upcoming 4.5 server release (codename Watson) will include a full text indexer (FTS, also known as CBFT) based on the open-source Bleve project. Bleve is all about full-text search and indexing in Go (shoutout to our very own Marty Schoch for initiating this project).

The idea is to leverage Bleve to provide off-the-shelf full text search in Couchbase Server, without having to use connectors to external software that runs on its own cluster. If that off-the-shelf solution doesn’t fully meet your needs, you can of course still use these connectors, but for simpler needs you are good to go with a single solution.

FTS offers a host of capabilities that are provided by Bleve: Text Analyzers, Tokenizers and post-processing Token Filters that are beyond the scope of this post, as well as the numerous types of queries that you can run on the resulting indexes. Let’s see what those types are and how you can expect to use them in the context of the Java SDK.

In the rest of this blog post, we’ll use three indexes that you will be able to build through the web administrative console in the upcoming 4.5 Developer Preview.

We have:

a beerIndex that indexes the whole content of each document in the beer-sample bucket.
a travelIndex that indexes the whole content of each document in the travel-sample bucket.
an alias index, commonIndex, that is a union of the two indexes above.

The Java API

The entry point of the full text search feature in the Java SDK is on the Bucket, using the query(SearchQuery ftq) method. This is consistent with the existing querying methods already present in the API to run a ViewQuery or an N1qlQuery.

The API for full text search follows the builder pattern. Identify the type of query you want and use the corresponding builder to construct it, get the SearchQuery out of it using build() and execute it using bucket.query(searchQuery).
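
To make the chaining concrete, here is a minimal, self-contained sketch of the builder pattern itself. ToySearchQuery and its Builder are hypothetical stand-ins written for this post: they only mimic the on(...), limit(...) and build() shape described above and are not the SDK's classes.

```java
// Toy illustration of the builder pattern; these are NOT the SDK's classes.
final class ToySearchQuery {
    final String index;
    final String term;
    final int limit;

    private ToySearchQuery(Builder b) {
        this.index = b.index;
        this.term = b.term;
        this.limit = b.limit;
    }

    // Static entry point, mirroring the shape of MatchQuery.on("beerIndex").
    static Builder on(String index) {
        return new Builder(index);
    }

    static final class Builder {
        private final String index;
        private String term = "";
        private int limit = 10; // hypothetical default, chosen only for this sketch

        private Builder(String index) {
            this.index = index;
        }

        Builder match(String term) {
            this.term = term;
            return this; // returning "this" is what makes chaining possible
        }

        Builder limit(int limit) {
            this.limit = limit;
            return this;
        }

        // Freezes the collected options into an immutable query object.
        ToySearchQuery build() {
            return new ToySearchQuery(this);
        }
    }

    public static void main(String[] args) {
        ToySearchQuery q = ToySearchQuery.on("beerIndex").match("national").limit(3).build();
        System.out.println(q.index + " / " + q.term + " / " + q.limit);
    }
}
```

Each option method returns the builder, so options can be chained in any order, and nothing is final until build() is called.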

Let’s take a (very simple) example and see how it can be consumed:

//we'll use that Cluster and Bucket for the remainder of the examples
Cluster cluster = CouchbaseCluster.create("127.0.0.1");
Bucket bucket = cluster.openBucket("beer-sample");

//we use a simple form of query:
SearchQuery ftq = MatchQuery.on("beerIndex").match("national").limit(3).build();

//we fire the query and look at results
SearchQueryResult result = bucket.query(ftq);
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(row);
}

If we look at each section individually, here’s what happened:

We create a simple MatchQuery on a single term.
It runs on the beer sample index (.on("beerIndex")) and looks for textual occurrences of the word “national” (.match("national")) or close terms.
Additional configuration is done to limit the number of results to 3 (limit(3)) and the actual query is created at this point (.build()).
The query is executed (bucket.query(ftq)) and returns a SearchQueryResult.
We output the result’s totalHits() and individual rows (also accessible as a list through hits()).

Running that code outputs:

totalHits: 31
SearchQueryHit{id='dc_brau', score=0.09068310490562362, fragments={}}
SearchQueryHit{id='brouwerij_nacional_balashi', score=0.12085760187148556, fragments={}}
SearchQueryHit{id='cervecera_nacional', score=0.09863195902067363, fragments={}}

We see that total hits gives us the actual number of hits before the limit was applied. The hits() method returns 3 SearchQueryRow objects, as requested.

Each hit contains the key to the associated document in Couchbase (id()), as well as more information on the match, e.g. a score (score()). If you want, you can retrieve the associated document using bucket.get(row.id()):

result = bucket.query(ftq);
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(row);
    System.out.println(bucket.get(row.id()).content());
}

This gives us, for the first hit:

SearchQueryHit{id='dc_brau', score=0.09068310490562362, fragments={}}
{"country":"United States","website":"http://www.dcbrau.com/","code":"20018","address":["3178-B Bladensburg Rd. NE"],"city":"Washington","phone":"","name":"DC Brau",
"description":"The first brewery to open in the nation's capital since Prohibition.","state":"DC","type":"brewery","updated":"2011-08-08 19:02:40"}

If we look closely at the document’s JSON, we notice where the document probably matched. In the “description” field of the document, there is this sentence:

The first brewery to open in the nation's capital since Prohibition.

Also notice that the text query looked for the word requested and derived words that have the same root. It actually applied a fuzziness of 2 (see the next section).

This pattern can be applied to the other types of queries as well, so let’s have a look at a few more and see what kinds of searches can be performed.

Various Types of Queries

Fuzzy Querying

Fuzzy querying can be performed with the MatchQuery, specifying a Levenshtein distance as the maximum fuzziness() to allow on the term:

result = bucket.query(MatchQuery.on("beerIndex")
    .match("sammar")
    .field("name")
    .fuzziness(2) //actually the default
    .build());

System.out.println("\nFuzzy Match Query");
System.out.println("totalHits (fuzziness = 2): " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(bucket.get(row.id()).content().get("name"));
}

At a fuzziness of 2, this matches words like “hammer”, “mamma” or “summer”:

Fuzzy Match Query
totalHits (fuzziness = 2): 45
Mamma Mia! Pizza Beer
Redhook Long Hammer IPA
Summer Wheat

At a fuzziness of 1, no match is found:

Fuzzy Match Query
totalHits (fuzziness = 1): 0

The FuzzyQuery is another query type dedicated to fuzziness. Unlike MatchQuery, it does not apply any analyzer to the term.
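
To get a feel for what a fuzziness of 2 means, here is a small sketch that computes the Levenshtein distance between two terms. It is a plain dynamic-programming implementation written for illustration (LevenshteinSketch is a hypothetical name), not the code Bleve runs, and it ignores analyzers entirely.

```java
// Illustrative only: a textbook edit-distance computation, not Bleve's implementation.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions or
    // substitutions needed to turn a into b (two-row dynamic programming).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String term : new String[] {"hammer", "mamma", "summer"}) {
            System.out.println("sammar -> " + term + ": " + distance("sammar", term));
        }
    }
}
```

The three terms “hammer”, “mamma” and “summer” are each exactly 2 edits away from “sammar”, which is consistent with getting hits at a fuzziness of 2 and none at a fuzziness of 1.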

Multiple Terms: MatchPhrase

As we saw, MatchQuery is a term-based query that lets you optionally specify a fuzziness. It also applies to the searched term the same filters that may have been applied to the field (e.g. stemming):

MatchQuery.on("beerIndex")
    .match("sesonal")
    .fuzziness(2)
    .field("description").build();

You can search for multiple terms in a single query by using a Match Phrase query. Terms are analyzed and fuzziness can be optionally activated:

MatchPhraseQuery.on("beerIndex").matchPhrase("summer seasonal").field("description");
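
Conceptually, a phrase match requires the terms to appear next to each other and in order. The sketch below illustrates that idea on plain strings. It assumes a naive lowercasing tokenizer that splits on anything that is not a letter or digit, and it skips stemming and fuzziness, so it is a simplification rather than what the index actually does. PhraseSketch and the sample sentences are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative only: naive phrase matching on tokens, without real text analysis.
final class PhraseSketch {

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True when the phrase tokens occur consecutively, in order, in the text.
    static boolean containsPhrase(String text, String phrase) {
        List<String> wanted = tokens(phrase);
        return !wanted.isEmpty() && Collections.indexOfSubList(tokens(text), wanted) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("A crisp summer seasonal ale", "summer seasonal"));
        System.out.println(containsPhrase("A seasonal ale for summer", "summer seasonal"));
    }
}
```

The second sentence contains both words but not as an adjacent, ordered pair, so it does not count as a phrase match in this sketch.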

Regexp Query

A RegexpQuery goes beyond literal matching and lets you match terms using a regular expression. Take this example:

result = bucket.query(RegexpQuery.on("beerIndex")
    .regexp("[tp]ale")
    .field("name")
    .build());

System.out.println("\nRegexp Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(bucket.get(row.id()).content().get("name"));
}

Notice this query targets a particular field in the JSON (field("name")). We want all names that contain either “tale” or “pale”. Here are a few names that match this query:

Regexp Query
totalHits: 408
Tall Tale Pale Ale
Bard's Tale Beer Company
Pale Ale
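
To see why these names match, the following sketch applies the same pattern to each word of a name using java.util.regex. It rests on two assumptions that are ours, not the post's: that names are indexed as lowercased terms split on non-alphanumeric characters, and that the pattern has to match a whole term. RegexpTermSketch is a hypothetical helper, and Java's regex dialect is not identical to the one used by the index, although this simple pattern reads the same in both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative only: tests a regular expression against each term of a name.
final class RegexpTermSketch {

    // Returns the lowercased terms of the text that the regex matches entirely.
    static List<String> matchingTerms(String text, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // matches() requires the whole term to match, not just a substring
            if (!term.isEmpty() && pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchingTerms("Tall Tale Pale Ale", "[tp]ale"));       // [tale, pale]
        System.out.println(matchingTerms("Bard's Tale Beer Company", "[tp]ale")); // [tale]
        System.out.println(matchingTerms("Stale Ale", "[tp]ale"));                // []
    }
}
```

Under the whole-term assumption, a word such as “stale” does not match even though it contains “tale”.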

Prefix Query

A PrefixQuery looks for word occurrences that start with the given string:

result = bucket.query(PrefixQuery.on("beerIndex")
    .prefix("weiss")
    .field("name")
    .build());

System.out.println("\nPrefix Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    System.out.println(bucket.get(row.id()).content().get("name"));
}

Once again we only look inside the name field, this time for words that start with “weiss”:

Prefix Query
totalHits: 74
Bavarian-Weissbier Hefeweisse / Weisser Hirsch
Münchner Kindl Weissbier / Münchner Weisse
Franziskaner Hefe-Weissbier Hell  / Franziskaner Club-Weiss
Weissenheimer Wheat
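
The same kind of sketch works for prefixes. Assuming again that names are indexed as lowercased terms split on non-alphanumeric characters (our assumption, not something the post states), a prefix match boils down to a startsWith test on each term. PrefixTermSketch is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: finds the terms of a name that start with a given prefix.
final class PrefixTermSketch {

    static List<String> termsWithPrefix(String text, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty() && term.startsWith(wanted)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [weissbier, weisser]: "hefeweisse" contains "weiss" but does not start with it
        System.out.println(termsWithPrefix("Bavarian-Weissbier Hefeweisse / Weisser Hirsch", "weiss"));
        // prints [weissenheimer]
        System.out.println(termsWithPrefix("Weissenheimer Wheat", "weiss"));
    }
}
```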

Range and Date Queries

FTS is also good with non-textual data. For instance, the NumericRangeQuery allows you to look for numerical values within a provided range:

result = bucket.query(NumericRangeQuery.on("beerIndex")
    .min(3)
    .max(4)
    .field("abv")
    .fields("name", "abv")
    .build());

System.out.println("\nNumeric Range Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    JsonDocument doc = bucket.get(row.id());
    System.out.println("\"" + doc.content().get("name") + "\", abv: " + doc.content().get("abv"));
}

Which outputs:

Numeric Range Query
totalHits: 62
"Stud Service Stout", abv: 3.1
"Blonde", abv: 3.0
"Locke Mountain Light", abv: 3.7

Dates are covered as well with the DateRangeQuery:

Calendar calendar = Calendar.getInstance();
calendar.set(2011, Calendar.MARCH, 1);
Date start = calendar.getTime();
calendar.set(2011, Calendar.APRIL, 1);
Date end = calendar.getTime();

result = bucket.query(DateRangeQuery.on("beerIndex")
    .start(start)
    .end(end)
    .field("updated")
    .fields("name", "updated")
    .build());

System.out.println("\nDate Range Query");
System.out.println("totalHits: " + result.totalHits());
for (SearchQueryRow row : result) {
    JsonDocument doc = bucket.get(row.id());
    System.out.println("\"" + doc.content().get("name") + "\", updated: " + doc.content().get("updated"));
}

Which outputs:

Date Range Query
totalHits: 4
"Dank", updated: 2011-03-16 09:06:54
"Oso", updated: 2011-03-16 09:05:15
"Summer Teeth", updated: 2011-03-08 12:22:14
"Columbus Brewing Company", updated: 2011-03-08 12:19:07

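One detail to keep in mind with java.util.Calendar: Calendar.getInstance() starts from the current date and time, and set(year, month, day) leaves the time-of-day fields untouched, so the range boundaries in the DateRangeQuery snippet carry whatever time the code happened to run at. If you want boundaries that fall exactly at midnight, a helper along these lines can be used. DateBoundarySketch is a hypothetical name, and using UTC is an assumption made for the sketch, not something the API requires.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Illustrative only: builds range boundaries that fall exactly at midnight UTC.
final class DateBoundarySketch {

    // month uses the Calendar constants (Calendar.MARCH, ...), which are zero-based.
    static Date startOfDayUtc(int year, int month, int day) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear();               // resets every field, including hours, minutes, seconds, millis
        cal.set(year, month, day); // only the date fields are set afterwards
        return cal.getTime();
    }

    public static void main(String[] args) {
        Date start = startOfDayUtc(2011, Calendar.MARCH, 1);
        Date end = startOfDayUtc(2011, Calendar.APRIL, 1);
        System.out.println(start.toInstant() + " to " + end.toInstant());
    }
}
```

The resulting Date objects can be passed to start(...) and end(...) exactly like the ones built in the snippet above.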

Generic Querying

FTS also offers a more generic form of querying that combines phrases, terms and more using the String Query syntax. This is accessible in the API through the StringQuery.

Combining

Additionally, you can combine simple criteria like MatchQuery using combination queries. Taking these two simple term queries:

MatchQuery bitterQuery = MatchQuery.on("beerIndex").match("bitter").field("description").build();
MatchQuery maltyQuery = MatchQuery.on("beerIndex").match("malty").field("description").build();

You could combine them in different manners:

a conjunction looks for all the terms

 ConjunctionQuery.on("beerIndex").conjuncts(bitterQuery, maltyQuery)

a disjunction looks for at least one term

 DisjunctionQuery.on("beerIndex").disjuncts(bitterQuery, maltyQuery)

a boolean query allows you to combine the two approaches

 BooleanQuery.on("beerIndex").must(bitterQuery).mustNot(maltyQuery)
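
One way to picture these three combinations is as set operations on the document ids matched by each sub-query. The sketch below does just that with plain Java sets. CombineSketch and the document ids are made up for the illustration, and real combination queries also merge scores, which this sketch ignores.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: combination queries pictured as set operations on matching ids.
final class CombineSketch {

    // Conjunction: documents matched by both sub-queries (intersection).
    static Set<String> conjunction(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Disjunction: documents matched by at least one sub-query (union).
    static Set<String> disjunction(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.addAll(b);
        return result;
    }

    // Boolean must / mustNot: documents in "must" that are not in "mustNot" (difference).
    static Set<String> mustButNot(Set<String> must, Set<String> mustNot) {
        Set<String> result = new LinkedHashSet<>(must);
        result.removeAll(mustNot);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical ids of documents whose description matches each term.
        Set<String> bitter = new LinkedHashSet<>(Arrays.asList("beer_a", "beer_b", "beer_c"));
        Set<String> malty = new LinkedHashSet<>(Arrays.asList("beer_b", "beer_d"));

        System.out.println("conjunction: " + conjunction(bitter, malty)); // [beer_b]
        System.out.println("disjunction: " + disjunction(bitter, malty)); // [beer_a, beer_b, beer_c, beer_d]
        System.out.println("must/mustNot: " + mustButNot(bitter, malty)); // [beer_a, beer_c]
    }
}
```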

Getting Hit Explanations

If you want to get insights into the scoring and matching of a particular SearchQueryRow, you can build your query using the .explain(true) parameter and get details from the index in result’s explanation() field:

{"message":"sum of:","children":[{"message":"product of:","children":[{"message":"sum of:","children":[{"message":"product of:","children":[{"message":"sum of:","children":[
{
    "message": "weight(_all:national^1.000000 in penn_brewery-penn_marzen), product of:",
    "children": [
        {
            "message": "queryWeight(_all:national^1.000000), product of:",
            "children": [
                {
                    "message": "boost",
                    "value": 1
                },
                {
                    "message": "idf(docFreq=17, maxDocs=7303)",
                    "value": 7.005668743723945
                },
                {
                    "message": "queryNorm",
                    "value": 0.1427415478209491
                }
            ],
            "value": 0.9999999999999999
        },
        {
            "message": "fieldWeight(_all:national in penn_brewery-penn_marzen), product of:",
            "children": [
                {
                    "message": "tf(termFreq(_all:national)=1",
                    "value": 1
                },
                {
                    "message": "fieldNorm(field=_all, doc=penn_brewery-penn_marzen)",
                    "value": 0.10000000149011612
                },
                {
                    "message": "idf(docFreq=17, maxDocs=7303)",
                    "value": 7.005668743723945
                }
            ],
            "value": 0.7005668848116544
        }
    ],
    "value": 0.7005668848116543
}    ],"value":0.7005668848116543},{"message":"coord(1/1)","value":1}],"value":0.7005668848116543}],"value":0.7005668848116543},{"message":"coord(1/1)","value":1}],"value":0.7005668848116543}],"value":0.7005668848116543}
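
The numbers in this explanation can be checked by hand, since each “product of:” node is simply the product of its children. The sketch below redoes the arithmetic for the innermost weight node. The idf helper uses the classic 1 + ln(maxDocs / (docFreq + 1)) formula, which happens to reproduce the value shown here; that the index uses exactly this formula is our assumption, not something the explanation states. ExplainSketch is a hypothetical name.

```java
// Illustrative only: re-derives the values shown in the explanation tree.
final class ExplainSketch {

    // Assumed formula: classic inverse document frequency, 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // A "product of:" node is the product of its children's values.
    static double product(double... factors) {
        double result = 1.0;
        for (double f : factors) {
            result *= f;
        }
        return result;
    }

    public static void main(String[] args) {
        double idf = idf(17, 7303);                                    // close to 7.0056687
        double queryWeight = product(1.0, idf, 0.1427415478209491);    // boost * idf * queryNorm
        double fieldWeight = product(1.0, 0.10000000149011612, idf);   // tf * fieldNorm * idf
        double weight = product(queryWeight, fieldWeight);

        System.out.println("idf         = " + idf);
        System.out.println("queryWeight = " + queryWeight);
        System.out.println("fieldWeight = " + fieldWeight);
        System.out.println("weight      = " + weight);
    }
}
```

With a single term and a coord of 1/1 at every level, the enclosing “sum of:” and “product of:” nodes leave the value unchanged, which is why 0.7005668848116543 propagates all the way up to the root.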

Conclusion

We hope that this preview of the API has piqued your interest!

Go ahead and download the first Developer Preview of Couchbase 4.5 with embedded Full Text Search service. We hope that you’ll be able to quickly start searching using the associated Java SDK API.

And until then… Happy coding!
– The Java SDK Team

Simon Basle, Software Engineer, Pivotal
