Following the support for Analytics external collections for Amazon Web Services (AWS) Simple Storage Service (S3) in the Couchbase Server 6.6 release, Couchbase is excited to announce the forthcoming support and general availability for Azure Blob Storage for external Analytics collections. This will further enable customers to combine data from external sources (e.g., AWS S3 and Azure Blob Storage) with local data (Analytics Collections) as well as remote Couchbase data (Remote Analytics Collections).
Customer use case
Some customers use Azure Blob Storage to reduce storage costs and store large volumes of data (e.g., multiple years of historical data, offline business data for machine learning, product reviews, etc.). They have expressed a desire to combine, query, and utilize Azure Blob Storage data in real time to make this data available to business users for data analytics. Read here for more about other Analytics use cases.
How do external Analytics collections work?
This section was previously covered in AWS S3 Analytics external datasets blog. If you have used external Analytics collections using S3 before, this part will already be familiar to you.
External Analytics collections provide the ability to dynamically query and analyze data residing in external sources (e.g., AWS S3 and Azure Blob Storage), allowing users to easily combine data in real time from both inside and outside of their Couchbase Analytics nodes. This is achieved in three simple steps:
-
- Set up an external link using a REST API call, command-line interface (CLI), or Couchbase Server Workbench.
- Create an external Analytics collection on the external link.
- Query the Analytics collection using SQL++ for Analytics (or use Tableau with Couchbase Analytics Tableau connector).
Let’s walk through a simple example. iMaz, an e-commerce company, sells consumer products online. Their order, product, and user data are stored on a Couchbase cluster with both Data and Analytics services (on separate sets of nodes in the cluster). They use the Analytics Service to run ad hoc and complex queries to analyze their business. iMaz also stores their product reviews on Azure Blob Storage, and they would like to combine and analyze the top 3 most highly rated products using the Couchbase Analytics Service.
Sample product data:
1 2 3 4 5 6 7 8 9 |
[{ "id": "Product_1", "docType": "Product", "productId": 1, "price": 811.76, "salePrice": 70.14, "productName": "Ergonomic Cotton Ball", "desc": "Plastic fused metallic Ergonomic Cotton Ball" }] |
Sample review data:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
{ "id": "Review_0001764a17a844279a2227e137cc4e36", "docType": "Review", "reviewId": "0001764a17a844279a2227e137cc4e36", "productId": 1, "userId": 5862, "reviewerName": "M. Schaefer", "reviewerEmail": "...@mmail.com", "rating": 5, "title": "Works well and meets expectations.", "review": "Product works great and will buy one more for my extended family.", "reviewDate": 1597273484 } |
Let’s walk through each of the three steps from above with sample setup code along with a SQL++ query. Here is a demo video for you as well:
Step 1: Set up the links
We’ll create an Azure Blob Storage link using a REST API call or using the Couchbase Server Workbench. We’ll need to provide:
-
- Analytics Service hostname
- Analytics user credentials
- Azure Blob Storage link name, in this case myAzureLink
- Scope name (previously known as dataverse), in this case Default
- Link type (AzureBlob)
- Credentials (only one is allowed):
- Anonymous (no credentials)
- Shared key, or
- Shared access signature, or
- Managed identity id, or
- Azure Active Directory client secret, or
- Azure Active Directory client certificate (and client certificate password if password-protected)
1 2 3 4 5 6 |
curl -u <username>:<pwd> \ -X POST "http://<analytics_hostname>/analytics/link/Default/myAzureLink" \ -d type=AzureBlob \ -d endpoint=https://<account_name>.blob.core.windows.net/ \ -d accountName=... \ -d accountKey=... |
Below is a snapshot on how to create the AzureBlob link using the Shared Key authentication method using the Couchbase Server Workbench:
For more details on Azure Blob supported authentication methods, check Azure Blob Link.
Step 2: Create an external Analytics collection
After the external links are created, external Analytics collections can be created using DDL statements that refer to the previously created link names. The following is the DDL statement to create the Analytics collection on the Azure Blob Storage link created previously:
1 2 3 4 5 |
CREATE EXTERNAL ANALYTICS COLLECTION azureproductreviews ON `cb-analytics-7-0-0-demo` AT myAzureLink USING “reviews” WITH { "format": "json", "include": "*.json" } ; |
Where the above settings reflect the following:
-
- azureproductreviews: is the name of the Analytics external collection.
- cb-analytics-7-0-0-demo: is the name of the Azure Blob container the Analytics external collection is reading from.
- myAzureLink: is the name of the link the external collection is created on (created in the previous step).
- reviews: is the prefix used to read the data from. So only files under the reviews prefix will be read.
- format: is the property telling Analytics that the data being read is of type json.
- include (or exclude): is the property telling Analytics to apply the filter on the files, so only file names satisfying the filter condition are included,
As shown above, after the links are created, the creation of the external Analytics collection is independent of the link type. Multiple Analytics collections can be created on the same external link to point to different external data containers.
Currently, the external Analytics collection feature supports the JSON, CSV (comma-separated values), and TSV (tab-separated values) file formats, including compressed GZIP files (filenames ending with .gz or .gzip). Both the CSV and TSV formats require you to specify an inline type definition (more about this below). Additional file formats like parquet will be supported in future releases.
More details about inline type definitions can be found here.
Step 3: Query using SQL++
As the last step, we can now run the SQL++ query listed below. It joins the existing products analytics collection from the Couchbase Analytics Service with the product reviews data from Azure Blob Storage to retrieve the top 3 most highly rated products.
1 2 3 4 5 6 |
SELECT p.productName, AVG(s.rating) AS `Rating` FROM azureproductreviews s, products p WHERE s.productId = p.productId GROUP BY p.productName ORDER BY AVG(s.rating) DESC LIMIT 3; |
Here are the JSON query results:
1 2 3 4 5 |
[ { "Rating": 4.33, "productName": "Licensed Rubber Tuna"}, { "Rating": 4.29, "productName": "Gorgeous Plastic Salad"}, { "Rating": 3.86, "productName": "Intelligent Cotton Bike"} ] |
This is great – we’re now able to combine and analyze external data located in Azure Blob Storage using the Couchbase Analytics Service. Notice how few steps it took to enable us to analyze our data; no ETL was involved and the data was immediately available! Moreover, if the data changes, we will see those changes when we rerun the query, as external data is accessed on demand at query execution time.
You might now be wondering, “How would this have worked if the Azure Blob Storage reviews file format had been of type CSV instead of JSON?” The answer is simple; you simply would have defined your external Analytics collection accordingly. Below, we show what the create external Analytics collection statement from above would look like to support CSV:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
CREATE EXTERNAL ANALYTICS COLLECTION azureproductreviews ( id STRING NOT UNKNOWN, docType STRING NOT UNKNOWN, reviewId STRING NOT UNKNOWN, productId BIGINT, userId BIGINT, reviewerName STRING NOT UNKNOWN, reviewerEmail STRING NOT UNKNOWN, rating BIGINT, title STRING NOT UNKNOWN, review STRING NOT UNKNOWN, reviewDate BIGINT ) ON `cb-analytics-7.0.0-demo` AT myAzureLink USING "reviews" WITH { "format": "csv", "include": "*.csv", "header": false }; |
Notice how the create statement now includes inline type information. The type is needed to tell Analytics how to interpret the CSV data (e.g., not just as strings) in order to transform it internally into JSON form.
The SQL++ query remains exactly the same. That’s right, it requires no change at all! External Analytics collections are easy to set up, flexible, and simple to use thanks to the power of the SQL++ language. Users can develop complex ad hoc queries for interactive data exploration, answer new business questions, and combine external data with data from Remote Links to involve other Couchbase data sources as well.
Benefits
Here are some key benefits that come from using external Analytics collections:
-
- Data enrichment. Couchbase data can now be enriched with additional information obtained from data that resides in external data stores.
- Dynamic data access. The most current data can be dynamically retrieved, streamed, combined, and analyzed from any S3 or Azure Blob Storage (DP) containers in any regions during Analytics query execution.
- Parallel query processing. Users can configure and arrange access to external data using Analytics’ massively parallel processing (MPP) query processing architecture for fast response to queries involving external data.
Next steps
I hope you are excited about this new functionality of Couchbase Analytics 7.1. External Analytics collections unlock the value of external live and archived data residing in external data stores. Users can combine and analyze data in real time, sourced from AWS S3, Azure Blob Storage, and the Couchbase Analytics Service. This enables faster and more comprehensive data analyses and agile decision-making.
Learn more about external Analytics collections DDL.
Here are more resources to help you get started:
-
- We look forward to your feedback via Couchbase Forums.
- Manage External links using Azure Blob Storage.
- What’s new in Couchbase 7.1.
- Download Couchbase
Acknowledgements
This post was co-authored with Idris Motiwala. Idris is a Principal Product Manager, Analytics at Couchbase with 20+ years experience in design, development and execution of software products at both Fortune 500s and startups leading teams in digital transformation, cloud and analytics. Idris holds an MS in Technology Management and certifications in product management .