Hello,
I am trying to speed up an index join, but I’m unfortunately unable to achieve acceptable speed
My situation is as follows:
Every user has two documents that both are in the same bucket (“data”):
- a data doc (id=‘ud/THEUSERID/data’) that has a year and a lastLogin property
- a user data doc (id=‘ud/THEUSERID/uData’) that has a sex property
Now I want to do a join on those two documents for each user and filter by their properties:
SELECT META(d).id
FROM `data` d
JOIN `data` ud ON KEY REPLACE(META(ud).id, 'uData', 'data') FOR d
WHERE d.type = "data"
AND d.year BETWEEN 0 AND 2000
AND ud.sex = 1
ORDER BY d.lastLogin
LIMIT 100
I created two indices for speeding this query up:
CREATE INDEX `type_year_lastLogin` ON `data`(`type`,`year`,`lastLogin`) using GSI
CREATE INDEX `id_and_sex` ON `data`(replace((meta(self).`id`), "uData", "data"),`sex`) using GSI
However, the query is still quite slow (more than a second for a few thousand documents whereas the query without JOIN runs in about 50ms).
This is what EXPLAIN looks like:
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Sequence",
"~children": [
{
"#operator": "IndexScan",
"covers": [
"cover ((`d`.`type`))",
"cover ((`d`.`year`))",
"cover ((`d`.`lastLogin`))",
"cover ((meta(`d`).`id`))"
],
"index": "type_year_lastLogin",
"index_id": "cfa9e2f340ecde40",
"keyspace": "data",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"\"data\"",
"successor(2000)"
],
"Inclusion": 1,
"Low": [
"\"data\"",
"0"
]
}
}
],
"using": "gsi"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "IndexJoin",
"as": "ud",
"for": "d",
"keyspace": "data",
"namespace": "default",
"on_key": "replace((meta(`ud`).`id`), \"uData\", \"data\")",
"scan": {
"covers": [
"cover (replace((meta(`ud`).`id`), \"uData\", \"data\"))",
"cover ((`ud`.`sex`))"
],
"index": "id_and_sex",
"index_id": "d8bf9bddb553643c",
"using": "gsi"
}
},
{
"#operator": "Filter",
"condition": "(((cover ((`d`.`type`)) = \"data\") and (cover ((`d`.`year`)) between 0 and 2000)) and ((`ud`.`sex`) = 1))"
},
{
"#operator": "InitialProject",
"result_terms": [
{
"expr": "cover ((meta(`d`).`id`))"
}
]
}
]
}
}
]
},
{
"#operator": "Order",
"limit": "100",
"sort_terms": [
{
"expr": "cover ((`d`.`lastLogin`))"
}
]
},
{
"#operator": "Limit",
"expr": "100"
},
{
"#operator": "FinalProject"
}
]
},
"text": "\nSELECT META(d).id\n\nFROM `data` d\nJOIN `data` ud ON KEY REPLACE(META(ud).id, 'uData', 'data') FOR d\n\nWHERE d.type = \"data\"\nAND d.year BETWEEN 0 AND 2000\n\nAND ud.sex = 1\n\nORDER BY d.lastLogin\nLIMIT 100"
}
]
Apparently, both indices are used, but the query is slow nonetheless. Apparently I’m doing something wrong, but I can’t find a way to make it fast. Do you have an idea how I could achieve acceptable speed for this query?
Thank you in advance!