Introduction
Couchbase Full Text Search (FTS) is a great tool for indexing and querying geospatial data. In this article, I’ll present a geospatial search use case and demonstrate the various ways that we can perform a search of location data using the Couchbase Full Text Search service. I’ll be using Couchbase Server Enterprise Edition 6.6 (running locally in Docker) to create an FTS index on my sample geospatial dataset and then run geospatial queries against the index.
What makes this case interesting is that we aren’t using a spatial database. What is a spatial database? Unlike Couchbase Server, which is a NoSQL document database, spatial databases are specially optimized for data that describes geometric spaces such as lines, points of interest, or even 3-D topology in advanced instances. As we’ll see, Couchbase’s Full Text Search capabilities make it just as useful for handling and querying geospatial data as anything we might expect from a more specialized solution.
Use Case
My family has always enjoyed visiting and exploring Great Smoky Mountains National Park (or GRSM, the National Park Service’s abbreviation), and one day we might be interested in relocating there. But you can’t live in the national park, so we need to consider the various cities and towns near the park and make a short list of the ones to evaluate and possibly visit.
The main objective is to be within close proximity to the national park, but we’ll consider other factors like the size (population) of the towns, too.
Sample Dataset
To support my GRSM use case, I’ve decided to use a public dataset from GeoNames that includes states, cities, towns, and other landmarks from various nations around the world. I downloaded their United States data file and imported into Couchbase only the “populated places” data (records with feature codes of ‘PPL’, ‘PPLA’, ‘PPLA2’, ‘PPLA3’, ‘PPLA4’, and ‘PPLC’) for cities/towns with a non-zero population. The result is a Couchbase bucket, “cities”, with 30,734 documents.
Each city’s document data model includes some attributes of interest for my GRSM use case: the name, state, population, elevation, and, most importantly, the latitude and longitude. Here are a couple of sample JSON documents:
city::4699066
    {
      "geonameid": 4699066,
      "featureclass": "P",
      "featurecode": "PPLA2",
      "name": "Houston",
      "state": "TX",
      "population": 2296224,
      "elevation": 12,
      "geo": {
        "lat": 29.76328,
        "lon": -95.36327
      }
    }
city::4649251
    {
      "geonameid": 4649251,
      "featureclass": "P",
      "featurecode": "PPL",
      "name": "Pigeon Forge",
      "state": "TN",
      "population": 6171,
      "elevation": 305,
      "geo": {
        "lat": 35.78842,
        "lon": -83.55433
      }
    }
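The import step itself isn’t shown in this article, so the following is only a rough Python sketch of how one row of the GeoNames file could be filtered and reshaped into the document model above. The column positions are an assumption based on the GeoNames readme for the tab-separated country files, and `to_city_doc` is a hypothetical helper name, not the script actually used to load the bucket.

```python
# Feature codes the article keeps ("populated places").
POPULATED_CODES = {"PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC"}


def to_city_doc(line):
    """Turn one tab-separated GeoNames row into (key, document).

    Returns None for rows the article filters out. Column positions are
    assumed from the GeoNames readme; verify them against your download.
    """
    cols = line.rstrip("\n").split("\t")
    feature_class, feature_code = cols[6], cols[7]
    population = int(cols[14] or 0)
    if feature_code not in POPULATED_CODES or population == 0:
        return None
    doc = {
        "geonameid": int(cols[0]),
        "featureclass": feature_class,
        "featurecode": feature_code,
        "name": cols[1],
        "state": cols[10],  # admin1 code, assumed to be the state for US rows
        "population": population,
        "elevation": int(cols[15]) if cols[15] else None,
        "geo": {"lat": float(cols[4]), "lon": float(cols[5])},
    }
    # The "city::" key prefix matches the sample document keys shown above.
    return "city::%d" % doc["geonameid"], doc
```

Each returned key and document pair could then be written to the cities bucket with whichever SDK or import tool you prefer.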
Creating the Index
With the city data loaded into the cities bucket in Couchbase, we can build an FTS index that suits the “live near GRSM” use case at hand. I’ll briefly cover the highlights of creating the index here; the full index definition is in the appendix at the end of the post. (For a more detailed explanation of creating FTS indexes, please refer to my blog post on the topic.)
Index creation key points:
- Name: city_geo
- Bucket: cities
- Type identifier: “Doc ID up to separator” and enter “::” as the delimiter (note the keys of the sample documents above)
- Type mappings:
  - Uncheck “default”
  - Create a mapping for “city” type documents, indexing only these specified fields:
    - name: I’ll use the keyword analyzer for this field (because we want to sort by name later), and I’ll check index, store, _all, term vectors, and docvalues so that, in addition to searching by this field, I can test the index with highlights and sort by this field.
    - state: Just store this text field so we can retrieve it in the search results.
    - population: Set the type to number, and check index, store, and docvalues so that I can sort results based on population later.
    - elevation: Set the type to number and check only store so that this value is included in the search results.
    - geo: Set the type to geopoint (since each document has the “lat” and “lon” properties in the “geo” subdocument), and check index, store, and _all.
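To illustrate the “Doc ID up to separator” setting, here is a minimal Python sketch of the idea: the document type is taken from the part of the key before the delimiter. The function name is hypothetical, and how the Search service treats keys that contain no delimiter is not covered here, so the `None` branch is just a placeholder.

```python
def doc_type_from_key(key, delimiter="::"):
    """Return the part of a document key before the first delimiter.

    Returns None when the delimiter is absent (placeholder behaviour only).
    """
    prefix, found, _rest = key.partition(delimiter)
    return prefix if found else None


# The sample keys in this article map to the "city" type mapping.
print(doc_type_from_key("city::4649251"))
```

With the sample keys above, both documents resolve to the type `city`, which is why the index only needs a single “city” type mapping.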
We’ll wait until the indexing process is 100% complete:
Now, let’s quickly test the index in the Couchbase UI to verify the index is working as expected. The result looks good!
Geospatial Searches
Now that the dataset is loaded and indexed, I can get to the heart of the subject at hand and execute some geospatial queries against the index. For demonstration purposes, I’ll query the Couchbase Search Service REST API with cURL, but the same Search queries can also be executed through any of the Couchbase SDKs as part of your application or service. N1QL also supports Full Text Search from within SQL-style queries, so you can search without writing application code.
I’ll format the REST API response for readability using jq, an open source command-line JSON processor.
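The jq filter used with the queries in this article reads `total_hits` and each hit’s `fields` from the response. For readers who would rather post-process the response in Python, a rough equivalent might look like the sketch below. The response shape is inferred from that jq filter, and the sample response is a trimmed, hand-written stand-in rather than real server output.

```python
import json


def format_hits(response):
    """Build one summary line per hit, in the same layout as the jq filter."""
    lines = ["result_count: " + str(response["total_hits"])]
    for hit in response["hits"]:
        fields = hit["fields"]
        lines.append(
            fields["name"] + ", " + fields["state"]
            + " - population: " + str(fields["population"])
            + ", elevation: " + str(fields["elevation"])
        )
    return lines


# Hand-written stand-in for a Search response (assumed shape, trimmed).
sample = json.loads("""
{
  "total_hits": 1,
  "hits": [
    {"fields": {"name": "Pigeon Forge", "state": "TN",
                "population": 6171, "elevation": 305}}
  ]
}
""")

for line in format_hits(sample):
    print(line)
```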
Search Method 1: Point and Radius
Often we want to know what’s nearby or within a specified distance of a specific point. In my use case, I’d like to know what cities and towns are near the GRSM national park, maybe within 50 miles as a starting point. This first geospatial search method is referred to as “point and radius”, “point and distance”, or “radius-based”.
For my “point”, I’ve chosen Newfound Gap, which is a pass over the Smoky Mountains on the border of Tennessee and North Carolina, as well as a popular lookout point and a trailhead for the Appalachian Trail. It’s a must-do for my family when we visit GRSM. Let’s look for towns/cities within 50 miles of Newfound Gap.
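To make the “point and radius” idea concrete, here is a small Python sketch that computes great-circle distance with the haversine formula and checks it against a 50-mile radius. This is an illustration of the concept only; it is not how the Search service computes distance internally, and its results may differ slightly from the service’s. The Newfound Gap coordinates are the ones used in this article’s radius query, and the two city coordinates come from the sample documents.

```python
import math

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def within_radius(point, center, miles):
    """True if point (lat, lon) lies within the given distance of center."""
    return haversine_miles(point[0], point[1], center[0], center[1]) <= miles


NEWFOUND_GAP = (35.6067, -83.4217)   # from the radius query in this article
PIGEON_FORGE = (35.78842, -83.55433)  # from the sample document
HOUSTON = (29.76328, -95.36327)       # from the sample document

print(within_radius(PIGEON_FORGE, NEWFOUND_GAP, 50))
print(within_radius(HOUSTON, NEWFOUND_GAP, 50))
```

Pigeon Forge falls inside the 50-mile circle and Houston falls far outside it, which is the same inclusion test the radius-based search applies to every indexed geopoint.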
Here’s the radius-based query:

    $ curl -s -XPOST -H "Content-Type: application/json" \
        -u Administrator:password \
        http://localhost:8094/api/index/city_geo/query -d '
    {
      "fields": ["name","state","elevation","population"],
      "size": 500,
      "query": {
        "location": {
          "lon": -83.4217,
          "lat": 35.6067
        },
        "distance": "50mi",
        "field": "geo"
      },
      "sort": [
        {
          "by": "geo_distance",
          "field": "geo",
          "unit": "mi",
          "location": {
            "lon": -83.4217,
            "lat": 35.6067
          }
        }
      ]
    }' | jq '("result_count: "+ (.total_hits | tostring)), (.hits[]| (.fields.name + ", " + .fields.state + " - population: " + (.fields.population | tostring) + ", elevation: " + (.fields.elevation | tostring)))'
The result is 79 cities, sorted by distance from Newfound Gap. I’ve included the first 15 results here:

    "result_count: 79"
    "Gatlinburg, TN - population: 4184, elevation: 394"
    "Pittman Center, TN - population: 565, elevation: 393"
    "Cherokee, NC - population: 2138, elevation: 605"
    "Bryson City, NC - population: 1458, elevation: 528"
    "Pigeon Forge, TN - population: 6171, elevation: 305"
    "Dillsboro, NC - population: 240, elevation: 607"
    "Maggie Valley, NC - population: 1251, elevation: 918"
    "Townsend, TN - population: 452, elevation: 326"
    "Sylva, NC - population: 2617, elevation: 623"
    "Sevierville, TN - population: 16490, elevation: 275"
    "Fair Garden, TN - population: 529, elevation: 340"
    "Webster, NC - population: 375, elevation: 656"
    "Cove Creek, NC - population: 1171, elevation: 767"
    "Walland, TN - population: 259, elevation: 281"
    "Cullowhee, NC - population: 6228, elevation: 645"
Search Method 2: Bounding Box
Seventy-nine is a lot of towns and cities to consider, so let’s think about another way to look at this. From my visits to the national park over the years, I know roughly that I want to live somewhere between Knoxville, TN and Waynesville, NC. Given those two locations, I can query against my GeoNames dataset using the “bounding box” or “rectangle-based” geospatial search method.
I can supply the coordinates of places near Knoxville and Waynesville as parameters to my search, and those will be used as the upper left and lower right corners of a rectangle. Any cities located within that rectangle will be returned by the query.
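The rectangle test itself is simple enough to sketch in a few lines of Python. The corner coordinates below are the ones used in this article’s rectangle-based query, and the function is an illustration of the concept rather than the Search service’s implementation. It assumes the box does not cross the antimeridian, which is fine for this region.

```python
def in_bounding_box(lat, lon, top_left, bottom_right):
    """True if (lat, lon) lies inside the rectangle defined by two corners.

    Assumes the box does not wrap around the +/-180 degree meridian.
    """
    return (bottom_right["lat"] <= lat <= top_left["lat"]
            and top_left["lon"] <= lon <= bottom_right["lon"])


# Corners from the rectangle-based query in this article.
TOP_LEFT = {"lon": -83.937408, "lat": 36.032024}
BOTTOM_RIGHT = {"lon": -82.947580, "lat": 35.401835}

# Coordinates from the two sample documents.
print(in_bounding_box(35.78842, -83.55433, TOP_LEFT, BOTTOM_RIGHT))  # Pigeon Forge
print(in_bounding_box(29.76328, -95.36327, TOP_LEFT, BOTTOM_RIGHT))  # Houston
```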
Here’s the rectangle-based query:

    $ curl -s -XPOST -H "Content-Type: application/json" \
        -u Administrator:password \
        http://localhost:8094/api/index/city_geo/query -d '
    {
      "fields": ["name","state","elevation","population"],
      "size": 50,
      "query": {
        "top_left": {
          "lon": -83.937408,
          "lat": 36.032024
        },
        "bottom_right": {
          "lon": -82.947580,
          "lat": 35.401835
        },
        "field": "geo"
      },
      "sort": [ "name" ]
    }' | jq '("result_count: "+ (.total_hits | tostring)), (.hits[]| (.fields.name + ", " + .fields.state + " - population: " + (.fields.population | tostring) + ", elevation: " + (.fields.elevation | tostring)))'
The result is 21 cities, sorted by name:

    "result_count: 21"
    "Bryson City, NC - population: 1458, elevation: 528"
    "Cherokee, NC - population: 2138, elevation: 605"
    "Cove Creek, NC - population: 1171, elevation: 767"
    "Dandridge, TN - population: 2924, elevation: 304"
    "Eagleton Village, TN - population: 5052, elevation: 289"
    "Fair Garden, TN - population: 529, elevation: 340"
    "Gatlinburg, TN - population: 4184, elevation: 394"
    "Hazelwood, NC - population: 1655, elevation: 842"
    "Knoxville, TN - population: 185291, elevation: 276"
    "Lake Junaluska, NC - population: 2734, elevation: 778"
    "Maggie Valley, NC - population: 1251, elevation: 918"
    "Newport, TN - population: 6834, elevation: 321"
    "Parrottsville, TN - population: 261, elevation: 363"
    "Pigeon Forge, TN - population: 6171, elevation: 305"
    "Pittman Center, TN - population: 565, elevation: 393"
    "Sevierville, TN - population: 16490, elevation: 275"
    "Seymour, TN - population: 10919, elevation: 285"
    "Townsend, TN - population: 452, elevation: 326"
    "Walland, TN - population: 259, elevation: 281"
    "Waynesville, NC - population: 9809, elevation: 837"
    "Wildwood, TN - population: 1098, elevation: 278"
Search Method 3: Polygon
After some additional research, I’ve decided that I would prefer to live within Sevier County, but south of Interstate 40 and north of the national park boundary.
To do this, I will need to run a polygon-based search against my FTS index. This third method was recently added in Couchbase Server 6.5.1. The areas for geospatial search queries can now be specified as polygons, in addition to circles and rectangles. The polygon is expressed as a series of latitude-longitude coordinates, each determining the location of one corner of the polygon.
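As a rough illustration of what “inside the polygon” means, here is a Python sketch of the classic ray-casting (even-odd) point-in-polygon test. It treats latitude and longitude as flat x/y coordinates, which is a reasonable simplification for an area this small, and it is not a description of how the Search service evaluates polygon queries. The corner strings are the ones used in this article’s polygon query, in “lat, lon” order.

```python
POLYGON_POINTS = [
    "35.987374, -83.658937", "35.971769, -83.654212",
    "35.887168, -83.793874", "35.686403, -83.678068",
    "35.704374, -83.505435", "35.769145, -83.275637",
    "35.868423, -83.290819", "35.919168, -83.350486",
    "35.948053, -83.510420", "35.990925, -83.568382",
]


def parse_points(points):
    """Convert "lat, lon" strings into (lat, lon) float tuples."""
    return [tuple(float(part) for part in p.split(",")) for p in points]


def point_in_polygon(lat, lon, corners):
    """Even-odd ray casting: count edges crossed by a ray heading east."""
    inside = False
    j = len(corners) - 1
    for i in range(len(corners)):
        lat_i, lon_i = corners[i]
        lat_j, lon_j = corners[j]
        # Does this edge straddle the point's latitude?
        if (lat_i > lat) != (lat_j > lat):
            # Longitude at which the edge crosses that latitude.
            lon_cross = lon_i + (lat - lat_i) * (lon_j - lon_i) / (lat_j - lat_i)
            if lon < lon_cross:
                inside = not inside
        j = i
    return inside


corners = parse_points(POLYGON_POINTS)
print(point_in_polygon(35.78842, -83.55433, corners))  # Pigeon Forge
print(point_in_polygon(29.76328, -95.36327, corners))  # Houston
```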
On the map of Sevier County above (the light red line is the county boundary), I’ve overlaid a polygon that roughly corresponds to the area that I’m interested in, and I’ve captured the coordinates of the points of the polygon. I’ll use these coordinates to form my geospatial polygon-based query:

    $ curl -s -XPOST -H "Content-Type: application/json" \
        -u Administrator:password \
        http://localhost:8094/api/index/city_geo/query -d '
    {
      "fields": ["name","state","elevation","population"],
      "size": 50,
      "query": {
        "field": "geo",
        "polygon_points": [
          "35.987374, -83.658937",
          "35.971769, -83.654212",
          "35.887168, -83.793874",
          "35.686403, -83.678068",
          "35.704374, -83.505435",
          "35.769145, -83.275637",
          "35.868423, -83.290819",
          "35.919168, -83.350486",
          "35.948053, -83.510420",
          "35.990925, -83.568382"
        ]
      },
      "sort": [
        { "by" : "field", "field" : "population", "missing" : "last", "type": "number" }
      ]
    }' | jq '("result_count: "+ (.total_hits | tostring)), (.hits[]| (.fields.name + ", " + .fields.state + " - population: " + (.fields.population | tostring) + ", elevation: " + (.fields.elevation | tostring)))'
The result is a very manageable list of 6 cities, sorted by population in ascending order:

    "result_count: 6"
    "Fair Garden, TN - population: 529, elevation: 340"
    "Pittman Center, TN - population: 565, elevation: 393"
    "Gatlinburg, TN - population: 4184, elevation: 394"
    "Pigeon Forge, TN - population: 6171, elevation: 305"
    "Seymour, TN - population: 10919, elevation: 285"
    "Sevierville, TN - population: 16490, elevation: 275"
Summary and Next Steps
With these three search methods, Couchbase offers a comprehensive geospatial search capability for you to include in your applications. I encourage you to create an index with geopoint data and run some radius-based, rectangle-based, or polygon-based queries of your own. You can easily do this with one of our Couchbase sample datasets, travel-sample, which has a lot of location-based data to use for this purpose.
Take it one step further and visualize the JSON search results in real time using web-based geospatial platforms like Mapbox or Esri. You also benefit from managing the data in a distributed database that supports horizontal scaling, key-value access, data consistency, and more.
Geospatial search is just one of the capabilities of Full Text Search in Couchbase. You can also try out queries on arrays and natural language queries with scoring, faceting, and boosting. For more information on this topic, take a look at the application developer documentation and training links in the reference section below.
References
- Couchbase Search Resources: https://www.couchbase.com/products/full-text-search
- What is a Spatial Database?
- Couchbase FTS Documentation: https://docs.couchbase.com/server/current/fts/full-text-intro.html
- Couchbase FTS Blog Posts: https://www.couchbase.com/blog/category/full-text-search/
- Couchbase NoSQL Search – Online Training: https://learn.couchbase.com/store/509465-cb121-intro-to-couchbase-full-text-search-fts
- Couchbase NoSQL SDK – Geospatial Search – Java, .NET, Python, Node.js, Go, Scala, Ruby, C
Appendix
Index Creation cURL Command and JSON Definition:
    $ curl -XPUT -H "Content-Type: application/json" \
      -u <username>:<password> http://localhost:8094/api/index/city_geo -d \
      '{
      "type": "fulltext-index",
      "name": "city_geo",
      "sourceType": "couchbase",
      "sourceName": "cities",
      "planParams": {
        "maxPartitionsPerPIndex": 171,
        "indexPartitions": 6
      },
      "params": {
        "doc_config": {
          "docid_prefix_delim": "::",
          "docid_regexp": "",
          "mode": "docid_prefix",
          "type_field": "type"
        },
        "mapping": {
          "analysis": {},
          "default_analyzer": "standard",
          "default_datetime_parser": "dateTimeOptional",
          "default_field": "_all",
          "default_mapping": {
            "dynamic": true,
            "enabled": false
          },
          "default_type": "_default",
          "docvalues_dynamic": true,
          "index_dynamic": true,
          "store_dynamic": false,
          "type_field": "_type",
          "types": {
            "city": {
              "dynamic": false,
              "enabled": true,
              "properties": {
                "elevation": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_term_vectors": true,
                      "name": "elevation",
                      "store": true,
                      "type": "number"
                    }
                  ]
                },
                "geo": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "geo",
                      "store": true,
                      "type": "geopoint"
                    }
                  ]
                },
                "name": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "docvalues": true,
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "name",
                      "store": true,
                      "type": "text"
                    }
                  ]
                },
                "population": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "population",
                      "store": true,
                      "type": "number"
                    }
                  ]
                },
                "state": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "name": "state",
                      "store": true,
                      "type": "text"
                    }
                  ]
                }
              }
            }
          }
        },
        "store": {
          "indexType": "scorch"
        }
      },
      "sourceParams": {}
    }'