Good day, community. I had been posting in Topic 3825 (“Rebalance is stuck”) because I thought we were experiencing an offshoot of that problem, but it turns out our issue is more fundamental to how we’re using Couchbase, so I’d like to formally start a fresh Topic and ask for assistance. Essentially we’re trying to do a UNION DISTINCT on the elements of arrays within documents, returning an array of short strings from the Reduce function; it happily returns 10 or 100 elements but completely chokes when returning 1000.
Picture a bunch of customers, and we want to restrict each one to downloading only a certain limited number of some resource per 24-hour sliding time window. (Could be electronic documents, data rows in a database, electronically purchased songs, or anything where we want to keep track of the items downloaded and not let the customer have more than X of them in a day… after that number, they have to wait until tomorrow, or at least until some number of hours go by.) But to be polite to the customer, we don’t want to double-count items if they download them and then re-download them because their browser crashed, or they wanted to open the query in a new window to print it, or their Internet failed 98% of the way through the progress bar. I.e., if they download a specific set of 15 items, and then download those same 15 again, they still have a total of only 15 for the day, not 30. And if they download a set of 25 items, and then a different set of 10 items that includes 5 overlapping the first set, the total so far for the day is 30, not 35. What matters is the set of all unique keys of all the items downloaded by that customer in a 24-hour window. I hope this makes sense and seems like a reasonable thing to want to do in Couchbase or in any storage/query engine.
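To put that counting rule in code form, here is a minimal sketch in plain JavaScript (outside Couchbase entirely, with made-up item IDs): the daily figure we want is the size of the set-wise union of all the key arrays, not the sum of their lengths.

function unionSize(downloads) {
  var seen = {};
  for (var d = 0; d < downloads.length; d++) {
    for (var k = 0; k < downloads[d].length; k++) {
      seen[downloads[d][k]] = true; // duplicate keys collapse onto the same property
    }
  }
  return Object.keys(seen).length;
}

// 25 keys, then 10 more of which 5 overlap the first set: the total is 30, not 35
var first = [], second = [];
for (var i = 0; i < 25; i++) { first.push("item-" + i); }
for (var j = 20; j < 30; j++) { second.push("item-" + j); }
unionSize([first, second]); // 30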
So each of our little source documents is inserted like this…
{
  "actor": "kou:1073745525",
  "action": "songDownloads",
  "timestamp": 1433688093,
  "count": 0,
  "keys": [
    "-2143267641",
    "-2143270000",
    "-2143026020",
    "-2143031946"
  ]
}
“actor” is basically a user ID (the human / customer who’s asking for the items). “action” is the category of thing the customer is downloading — we want to make this system multi-tenant, so that the same general “how many things in a day” facility can be used internally across different applications. “timestamp” is obviously when the user pulled the items. Then there are two varieties of user-limiting available: a plain “count”, which is just an integer, and a “keys”, which is a list (array) of unique keys. We want this facility to be flexible enough to do either the unique-keys-based limiting I’m posting about here, or a more traditional rate-limiting system where the only thing that matters is the sum of all the counts of hits per day, depending on the application. So an inserted document will always either fill in the “count” field with a positive integer and leave “keys” as an empty array, or leave “count” as zero and fill in “keys” with an array of one or more keys.
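For concreteness, a keys-based document looks exactly like the example above (“count” zero, “keys” filled in), while a count-based document from some other application would look something like this (the values and the “reportHits” action name here are just made up for illustration):

{
  "actor": "kou:1073745525",
  "action": "reportHits",
  "timestamp": 1433688100,
  "count": 4,
  "keys": []
}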
The regular “count” thing we already have working… we can do a nice MapReduce view and get a sum of all the counts for that user over a time window, and it’s quite fast. That’s not the problem. What we’re trying to do is return the set-wise union of all of the “keys” arrays of all the matching documents (i.e., for one “actor”, for one “action”, and for all “timestamp”s over a certain time period), and also make it distinct, i.e., eliminate duplicates from the list — as in the example above, 25 keys plus 10 more with 5 repeats equals a total of only 30. So here is our map/reduce function pair:
function (doc, meta) {
  // emit one row per source document, keyed so we can range-scan on [action, actor, timestamp]
  emit([ doc.action, doc.actor, doc.timestamp ], { keys: doc.keys, count: doc.count });
}
function( keys, values, rereduce ) {
  var keys = [];
  var count = 0;
  var keyMap = {};
  // loop through all values
  for (var valueIndex = 0; valueIndex < values.length; valueIndex++) {
    // access the value
    var value = values[valueIndex];
    // add the counts
    count += value.count;
    // compute the union of the keys
    for (var keyIndex = 0; keyIndex < value.keys.length; keyIndex++) {
      // access the keys
      var key = value.keys[keyIndex];
      // add the key if not already added
      if (!keyMap[key]) {
        keys.push(key);
        keyMap[key] = true;
      }
    }
  }
  // return the combined result
  return { keys: keys, count: count };
}
I apologize on behalf of my developer for using the same variable name “keys” for different purposes — it means either the name of the original field in the document, the passed-in parameter, or a temporary local-scope array used within the reduce function simply to accumulate the de-duplicated list of keys before returning them. Sorry about that. Anyway, basically what it’s doing is going through each incoming value (the “values” array) and, for each one, going through that value’s “keys” array and, for each key, putting it into a temp array for output unless a small associative array says it has already been inserted there. So the associative array (“keyMap”) is just the duplicate-detecting mechanism, while the “keys” array becomes the new output of the reduce step.
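In case it is useful to see how we consume this: a 24-hour window for one actor and one action is queried roughly like the following against the view REST API, with reduce on (the default), so the whole key range collapses into a single row whose value is the merged { keys, count } object. The bucket, design-document, and view names below are placeholders, the key values are illustrative, and the real request is of course URL-encoded.

GET /mybucket/_design/limits/_view/by_actor_action
    ?startkey=["songDownloads","kou:1073745525",1433601693]
    &endkey=["songDownloads","kou:1073745525",1433688093]
    &reduce=true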
This entire logic works great as long as the end-result “keys” has fewer than 10 elements, or even fewer than 100 elements. But if we try to send back 1000 elements, the Couchbase service goes off into the ozone, chewing up RAM until it hits an OOM condition, crashes, and starts over. (It’s either the Indexing process or the Compacting process, but either way that process is calling our Reduce code, I guess, to perform its work.) If it will help, I can post a couple of the altered code snippets from our attempts to figure out what part of it Couchbase doesn’t like, but in plain English:
- we added a circuit-breaker to simply break out of the for loops when keys.length became >10, or >100. Then it completed, but at 1000 it does not, and we need to return 1000 (or more) in our application.
- thinking that maybe the associative array / hash implementation was the problem, we tried just not doing the duplicate checks, i.e., adding together 25 elements and 10 more elements and returning all 35, even with the duplicates. But this still choked in the same way, so it seems to be the final “return keys: keys” part that Couchbase doesn’t like. When we changed it to go ahead and perform the associative-array set for all the elements, but still only return 10 of them, it works, so the associative array isn’t the trouble. In fact, here is that code snippet:
function( keysParm, values, rereduce ) {
  var keys = [];
  var count = 0;
  var keyMap = {};
  var cont = true;
  // loop through all values
  for (var valueIndex = 0; valueIndex < values.length; valueIndex++) {
    // break if appropriate
    if (!cont) { break; }
    // access the value
    var value = values[valueIndex];
    // add the counts
    count += value.count;
    // compute the union of the keys
    for (var keyIndex = 0; keyIndex < value.keys.length; keyIndex++) {
      // break if appropriate
      if (!cont) { break; }
      // access the keys
      var key = value.keys[keyIndex];
      // add the key if not already added
      if (!keyMap[key]) {
        keys.push(key);
        keyMap[key] = true;
      }
      if (keys.length >= 10) { cont = false; }
    }
  }
  // return the combined result
  return { keys: keys, count: count };
}
So “cont = false” is our circuit-breaker. If we change the length test to >=1000 and then put in several source documents with a few hundred array elements, then Couchbase goes insane.
Thanks to Robert Hamon (@robert_hamon_) for kindly showing me a tuning parameter called “max_kv_size_per_doc”, which looks like a self-protection mechanism within Couchbase to protect the server from users (i.e., us!) trying to do something dumb. He recommends setting the value lower than the default of 1 MiB, so that it will error out and abort before it gets into the memory-hogging collapse state. That seems as if it would give cleaner failure behavior than we have now, but it still doesn’t tell me how to actually succeed at what we’re trying to do. Is what we’re doing reasonable? Is there a better way to do it that will cooperate with Couchbase’s MapReduce implementation? Thank you in advance, everyone.
– Jeff Saxe
SNL Financial, Charlottesville, Virginia, USA