I want to find any and all documents where a particular party appears across a large dataset.
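For context, the documents look roughly like this (the shape is inferred from my queries below; the values are made up) and the relevant document keys start with “a:”:

{
  "allies":  [ { "party": "p1" }, { "party": "p2" } ],
  "neutral": [ { "party": "p3" } ],
  "enemy":   [ { "party": "p1" } ]
}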
I have tried the following query, which “fails” by eating all of my available memory:
CREATE INDEX adv_ALL_array_concat_d_allies_d_neutral_d_enemy_party
ON `default`(ALL ARRAY `data`.`party`
    FOR data IN ARRAY_CONCAT(`allies`, `neutral`, `enemy`) END)
SELECT data.party
FROM `default` AS d
UNNEST ARRAY_CONCAT(d.allies, d.neutral, d.enemy) AS data
WHERE data.party IS VALUED
AND META(d).id LIKE 'a:%'
GROUP BY data.party
HAVING COUNT(1) > 1
To try to fix the memory issue I used a join to “stream” the results, but memory is still being consumed:
SELECT biggerThan.party,
       ARRAY_AGG(META(d2).id) AS appear_in
FROM (
    SELECT data.party
    FROM `default` AS d
    UNNEST ARRAY_CONCAT(d.allies, d.neutral, d.enemy) AS data
    WHERE data.party IS VALUED
      AND META(d).id LIKE 'a:%'
    GROUP BY data.party
    HAVING COUNT(1) > 1
) AS biggerThan
JOIN `default` AS d2
  ON biggerThan.party IN ARRAY_CONCAT(d2.allies, d2.neutral, d2.enemy)[*].party
     AND biggerThan.party IS VALUED
     AND META(d2).id LIKE 'a:%'
GROUP BY biggerThan.party
What would be the recommended way of executing a query like this against a large database? I'm guessing my problem is around the UNNEST and ARRAY_CONCAT?
Make sure the query uses a covering index. On Enterprise Edition (EE) it can also use index aggregation.
CREATE INDEX ixt1 ON default(ALL ARRAY_CONCAT(allies[*].party, enemy[*].party, neutral[*].party))
WHERE meta().id LIKE 'a:%';
SELECT party
FROM default AS d
UNNEST ARRAY_CONCAT(d.allies[*].party, d.enemy[*].party, d.neutral[*].party) AS party
WHERE META(d).id LIKE "a:%" AND party IS VALUED
GROUP BY party
HAVING COUNT(1) > 1;
Note that ARRAY_CONCAT() produces MISSING/NULL when any of its arguments is MISSING/NULL, so if any of the three arrays can be absent from a document, rewrite

ARRAY_CONCAT(allies[*].party, enemy[*].party, neutral[*].party)

as

ARRAY v.party FOR v IN ARRAY_CONCAT(IFMISSINGORNULL(allies, []), IFMISSINGORNULL(enemy, []), IFMISSINGORNULL(neutral, [])) END
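A quick way to see that behaviour (constants only, runnable as-is):

SELECT ARRAY_CONCAT([1, 2], MISSING) AS unguarded,
       ARRAY_CONCAT([1, 2], IFMISSING(MISSING, [])) AS guarded;
-- unguarded evaluates to MISSING (the field is omitted from the result); guarded returns [1, 2]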
With the guards in place the index becomes:

CREATE INDEX ixt ON default(
    ALL ARRAY_CONCAT(IFMISSING(allies[*].party,[]), IFMISSING(enemy[*].party,[]), IFMISSING(neutral[*].party,[])))
WHERE meta().id LIKE 'a:%';
This index should cover:
SELECT party
FROM default AS d
UNNEST ARRAY_CONCAT(IFMISSING(d.allies[*].party,[]), IFMISSING(d.enemy[*].party,[]), IFMISSING(d.neutral[*].party,[])) AS party
WHERE META(d).id LIKE "a:%" AND party IS VALUED
GROUP BY party
HAVING COUNT(1) > 1;
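You can check this with EXPLAIN: in the plan, the IndexScan should list the index keys under "covers", and on EE an "index_group_aggs" section indicates the grouping/COUNT has been pushed down to the index:

EXPLAIN SELECT party
FROM default AS d
UNNEST ARRAY_CONCAT(IFMISSING(d.allies[*].party,[]), IFMISSING(d.enemy[*].party,[]), IFMISSING(d.neutral[*].party,[])) AS party
WHERE META(d).id LIKE "a:%" AND party IS VALUED
GROUP BY party
HAVING COUNT(1) > 1;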
It isn’t possible to stream results early because of the GROUP BY: all input has to be seen before any group is complete, so that only distinct values are returned. For a large number of distinct “party” values this carries some overhead; I don’t see a way to change the statement to reduce the overhead and still produce the same result. (All variants require gathering data to count parties across all array elements across all documents.)
When you say it is “eating memory”, what usage do you observe (and where, i.e. with which metric or tool), and what is the result count from the statement?
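If it helps, a running statement can also be watched through the system catalog (a general technique, nothing specific to this query):

SELECT * FROM system:active_requests;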
Also, how many UNNESTed items are there? I.e. what does this return:
SELECT count(1)
FROM default AS d
UNNEST ARRAY_CONCAT(IFMISSING(d.allies[*].party,[]), IFMISSING(d.enemy[*].party,[]), IFMISSING(d.neutral[*].party,[])) AS party
WHERE META(d).id LIKE "a:%" AND party IS VALUED
;
Are you using Enterprise Edition or Community Edition, and which version?
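If it is easier, the server version can also be fetched with a query (DS_VERSION() is a standard N1QL function):

SELECT ds_version();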