Graph Time Series Data From the Comfort of Your Terminal

If you have been following the earlier posts on the Couchbase blog, you have seen content about Couchbase Shell latest release. I am trying different things and today I wanted to see what I could do with time series datasets.

Ingest time series data

Browsing Kaggle, I found temperature data per City. I have downloaded it. It’s about 500Mb and that makes it hard to manipulate as a whole. But, of course, it’s a text file so we can easily peek at the data structure like so:

> open GlobalLandTemperaturesByCity.csv | first 10

╭───┬────────────┬────────────────────┬───────────────────────────────┬───────┬─────────┬──────────┬───────────╮
│ # │     dt     │ AverageTemperature │ AverageTemperatureUncertainty │ City  │ Country │ Latitude │ Longitude │
├───┼────────────┼────────────────────┼───────────────────────────────┼───────┼─────────┼──────────┼───────────┤
│ 0 │ 1743-11-01 │               6.07 │                          1.74 │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 1 │ 1743-12-01 │                    │                               │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 2 │ 1744-01-01 │                    │                               │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 3 │ 1744-02-01 │                    │                               │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 4 │ 1744-03-01 │                    │                               │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 5 │ 1744-04-01 │               5.79 │                          3.62 │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 6 │ 1744-05-01 │              10.64 │                          1.28 │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 7 │ 1744-06-01 │              14.05 │                          1.35 │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 8 │ 1744-07-01 │              16.08 │                          1.40 │ Århus │ Denmark │ 57.05N   │ 10.33E    │
│ 9 │ 1744-08-01 │                    │                               │ Århus │ Denmark │ 57.05N   │ 10.33E    │
╰───┴────────────┴────────────────────┴───────────────────────────────┴───────┴─────────┴──────────┴───────────╯

> open GlobalLandTemperaturesByCity.csv | first 10

╭───┬────────────┬────────────────────┬───────────────────────────────┬───────┬─────────┬──────────┬───────────╮

│ # │ dt │ AverageTemperature │ AverageTemperatureUncertainty │ City │ Country │ Latitude │ Longitude │

├───┼────────────┼────────────────────┼───────────────────────────────┼───────┼─────────┼──────────┼───────────┤

│ 0 │ 1743-11-01 │ 6.07 │ 1.74 │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 1 │ 1743-12-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 2 │ 1744-01-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 3 │ 1744-02-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 4 │ 1744-03-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 5 │ 1744-04-01 │ 5.79 │ 3.62 │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 6 │ 1744-05-01 │ 10.64 │ 1.28 │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 7 │ 1744-06-01 │ 14.05 │ 1.35 │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 8 │ 1744-07-01 │ 16.08 │ 1.40 │ Århus │ Denmark │ 57.05N │ 10.33E │

│ 9 │ 1744-08-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │

╰───┴────────────┴────────────────────┴───────────────────────────────┴───────┴─────────┴──────────┴───────────╯

> open GlobalLandTemperaturesByCity.csv | length

8599212

> open GlobalLandTemperaturesByCity.csv | length

8599212

8599212 lines, this is going to be 8599211 documents. My end goal is to see a temperature graph of several cities over the years. To do that, I am going to import everything to an import bucket first, then I will transform the data into time series.

If I import this naïvely, it gives me one doc per city/month group. So the key of my doc will be like Århus:1743-11-01. Say I only want to upload the first line, it should look like this:

> open GlobalLandTemperaturesByCity.csv | first 1 | wrap content | insert id { |x| $"($x.content.City):($x.content.dt)" }

╭───┬───────────────────┬──────────────────╮
│ # │      content      │        id        │
├───┼───────────────────┼──────────────────┤
│ 0 │ {record 7 fields} │ Århus:1743-11-01 │
╰───┴───────────────────┴──────────────────╯

> open GlobalLandTemperaturesByCity.csv | first 1 | wrap content | insert id { |x| $"($x.content.City):($x.content.dt)" } |doc upsert

╭───┬───────────┬─────────┬────────┬──────────┬─────────╮
│ # │ processed │ success │ failed │ failures │ cluster │
├───┼───────────┼─────────┼────────┼──────────┼─────────┤
│ 0 │         1 │       1 │      0 │          │ capella │
╰───┴───────────┴─────────┴────────┴──────────┴─────────╯

> open GlobalLandTemperaturesByCity.csv | first 1 | wrap content | insert id { |x| $"($x.content.City):($x.content.dt)" }

╭───┬───────────────────┬──────────────────╮

│ # │ content │ id │

├───┼───────────────────┼──────────────────┤

│ 0 │ {record 7 fields} │ Århus:1743-11-01 │

╰───┴───────────────────┴──────────────────╯

> open GlobalLandTemperaturesByCity.csv | first 1 | wrap content | insert id { |x| $"($x.content.City):($x.content.dt)" } |doc upsert

╭───┬───────────┬─────────┬────────┬──────────┬─────────╮

│ # │ processed │ success │ failed │ failures │ cluster │

├───┼───────────┼─────────┼────────┼──────────┼─────────┤

│ 0 │ 1 │ 1 │ 0 │ │ capella │

╰───┴───────────┴─────────┴────────┴──────────┴─────────╯

And now that I know it works, let’s get everything, and batch it:

open GlobalLandTemperaturesByCity.csv | par-each -t 5 {|x| wrap content | insert id {$"($x.City):($x.dt)" } } | doc upsert

1	open GlobalLandTemperaturesByCity.csv \| par-each -t 5 {\|x\| wrap content \| insert id {$"($x.City):($x.dt)" } } \| doc upsert

I have created a series collection in which I import the result of a SELECT aggregation, getting all the dates as timestamps and all the AverageTemperature as value:

query '

INSERT INTO series(KEY _k, VALUE _v)
    SELECT a.City _k,
    {"City": a.City ,
      "ts_start" : MIN(STR_TO_MILLIS(a.dt)),
      "ts_end" : MAX(STR_TO_MILLIS(a.dt)),
      "ts_data" : ARRAY_AGG([STR_TO_MILLIS(a.dt), a.AverageTemperature]) } _v
    FROM import a
    WHERE a.dt BETWEEN "1700-01-01" AND "2020-12-31"
    GROUP BY a.City;
    '

    INSERT INTO series(KEY _k, VALUE _v)
    SELECT a.City _k,
    {"City": a.City ,
      "ts_start" : MIN(STR_TO_MILLIS(a.dt)),
      "ts_end" : MAX(STR_TO_MILLIS(a.dt)),
      "ts_data" : ARRAY_AGG([STR_TO_MILLIS(a.dt), a.AverageTemperature]) } _v
    FROM import a
    WHERE SUBSTR(a.City,0,1) = "Z" AND a.dt BETWEEN "2010-01-01" AND "2020-12-31" 
    GROUP BY a.City;

query '

INSERT INTO series(KEY _k, VALUE _v)

SELECT a.City _k,

{"City": a.City ,

"ts_start" : MIN(STR_TO_MILLIS(a.dt)),

"ts_end" : MAX(STR_TO_MILLIS(a.dt)),

"ts_data" : ARRAY_AGG([STR_TO_MILLIS(a.dt), a.AverageTemperature]) } _v

FROM import a

WHERE a.dt BETWEEN "1700-01-01" AND "2020-12-31"

GROUP BY a.City;

INSERT INTO series(KEY _k, VALUE _v)

SELECT a.City _k,

{"City": a.City ,

"ts_start" : MIN(STR_TO_MILLIS(a.dt)),

"ts_end" : MAX(STR_TO_MILLIS(a.dt)),

"ts_data" : ARRAY_AGG([STR_TO_MILLIS(a.dt), a.AverageTemperature]) } _v

FROM import a

WHERE SUBSTR(a.City,0,1) = "Z" AND a.dt BETWEEN "2010-01-01" AND "2020-12-31"

GROUP BY a.City;

And now all the data is available as time series. Say I want the data from Paris, I can use the _timeseries function like this:

> let ts_ranges = [946684800000,1375315200000]; query $"SELECT t._t time, t._v0 `value` FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city = \"Paris\" AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \);" | reject cluster | to csv  

time,value
946684800000,3.845
949363200000,6.587000000000001
951868800000,7.872000000000001
954547200000,10.067
957139200000,15.451
959817600000,17.666
962409600000,16.954
965088000000,19.512
967766400000,16.548000000000002
970358400000,11.675999999999998
....

> let ts_ranges = [946684800000,1375315200000]; query $"SELECT t._t time, t._v0 `value` FROM series AS d UNNEST _timeseries$d, {\"ts_ranges\":($ts_ranges)}$ AS t WHERE d.city = \"Paris\" AND $d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) $;" | reject cluster | to csv

time,value

946684800000,3.845

949363200000,6.587000000000001

951868800000,7.872000000000001

954547200000,10.067

957139200000,15.451

959817600000,17.666

962409600000,16.954

965088000000,19.512

967766400000,16.548000000000002

970358400000,11.675999999999998

....

To speed things up you can create the following index: CREATE INDEX ix1 ON series(City, ts_end, ts_start);

Notice the use of the ts_ranges variable at the beginning. You can easily reuse those values in a template String. They start with a $ sign and variables must be in parenthesis like ($my_variable). Which also means that now you need to escape parenthesis character as well as double quotes.

Plot time series

There are a variety of terminal plotting libraries, here I am using youplot:

> let ts_ranges = [946684800000,1375315200000]; query $"SELECT t._t date, t._v0 `value` FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city = \"Paris\" AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \);" | reject cluster | to csv  | uplot line -d, --xlim 946684800000,1375315200000 --ylim -20,35 --title "global temperature in Paris" --xlabel date --ylabel temperature`

> let ts_ranges = [946684800000,1375315200000]; query $"SELECT t._t date, t._v0 `value` FROM series AS d UNNEST _timeseries$d, {\"ts_ranges\":($ts_ranges)}$ AS t WHERE d.city = \"Paris\" AND $d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) $;" | reject cluster | to csv | uplot line -d, --xlim 946684800000,1375315200000 --ylim -20,35 --title "global temperature in Paris" --xlabel date --ylabel temperature`

Now that is all great but ideally I want to have several cities in there to compare them. Any additional CSV columns are picked up automatically as long as you run youplot lines. Doing things gradually, let’s start with the support of multiple cities in the query. So a couple changes here:

- grouping the data by time
- adding d.city IN ($city) to the where clause. This works because the city array is a String literal of a JSON array. Let’s take a look at the answer as a JSON doc:

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, \(  ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}\) \)  FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" | reject cluster | first |to json

[
  {
    "date": "2000-01-01",
    "$1":
    [
      {
        "city": "Aba",
        "temp": 26.985000000000007
      },
      {
        "city": "Berlin",
        "temp": 1.3239999999999998
      },
      {
        "city": "London",
        "temp": 4.6930000000000005
      },
      {
        "city": "Paris",
        "temp": 3.845
      }
    ]
  }
]

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, $ ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}$ \) FROM series AS d UNNEST _timeseries$d, {\"ts_ranges\":($ts_ranges)}$ AS t WHERE d.city IN ($city) AND $d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) $ GROUP BY t._t ORDER BY t._t;" | reject cluster | first |to json

[

{

"date": "2000-01-01",

"$1":

[

{

"city": "Aba",

"temp": 26.985000000000007

{

"city": "Berlin",

"temp": 1.3239999999999998

{

"city": "London",

"temp": 4.6930000000000005

{

"city": "Paris",

"temp": 3.845

}

]

}

]

But this cannot turn into a CSV, even if you flatten everything like this:

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, \(  ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}\) \)  FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |reject cluster | first |flatten|flatten|to csv

date,city,temp
2000-01-01,Aba,26.985000000000007
2000-01-01,Berlin,1.3239999999999998
2000-01-01,London,4.6930000000000005
2000-01-01,Paris,3.845

date,city,temp

2000-01-01,Aba,26.985000000000007

2000-01-01,Berlin,1.3239999999999998

2000-01-01,London,4.6930000000000005

2000-01-01,Paris,3.845

So we can transform it into an object like so: Object v.city : v.temp FOR v IN ARRAY_AGG({"city": d.city, "temp":t._v0}) when v IS NOT MISSING END

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, \( Object v.city : v.temp FOR v IN ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}\) when v IS NOT MISSING END\)  FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |reject cluster | first | to json

{
  "date": "2000-01-01",
  "$1":
  {
    "Aba": 26.985000000000007,
    "Berlin": 1.3239999999999998,
    "London": 4.6930000000000005,
    "Paris": 3.845
  }
}

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, $ Object v.city : v.temp FOR v IN ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}$ when v IS NOT MISSING END\) FROM series AS d UNNEST _timeseries$d, {\"ts_ranges\":($ts_ranges)}$ AS t WHERE d.city IN ($city) AND $d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) $ GROUP BY t._t ORDER BY t._t;" |reject cluster | first | to json

{

"date": "2000-01-01",

"$1":

{

"Aba": 26.985000000000007,

"Berlin": 1.3239999999999998,

"London": 4.6930000000000005,

"Paris": 3.845

}

Which can now be flatten into a CSV:

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, \( Object v.city : v.temp FOR v IN ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}\) when v IS NOT MISSING END\)  FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |reject cluster | first | flatten| to csv

date,Aba,Berlin,London,Paris

2000-01-01,26.985000000000007,1.3239999999999998,4.6930000000000005,3.845

date,Aba,Berlin,London,Paris

2000-01-01,26.985000000000007,1.3239999999999998,4.6930000000000005,3.845

And with that we are ready to plot several lines:

> let ts_ranges = [946684800000,1375315200000]; let city = ['"Aba"','"Berlin"','"London"','"Paris"']; query $"SELECT t._t date, \( Object v.city : v.temp FOR v IN ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}\) when v IS NOT MISSING END\)  FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |reject cluster | flatten | to csv | uplot lines -H -d, --xlim 946684800000,1375315200000 --ylim -10,50 -w 50 --title "temperature per month" --xlabel date --ylabel temperature

To make that easier, you can use functions. Create a .nu file, like temp.nu, with the following content:

def tempGraph [$ts_ranges, $cities] {
    cb-env bucket cbsh
    cb-env scope globaltemp
    cb-env collection series
    
    let query = ($"SELECT t._t date, \( Object v.city : v.temp FOR v IN ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}\) when v IS NOT MISSING END\)"
     + $" FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t "
     + $" WHERE d.city IN ($city) " 
     + $" AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \)"
     + $" GROUP BY t._t ORDER BY t._t;" )
    let csv = query $query |reject cluster | flatten | to csv
    $csv | uplot lines -H -d, --xlim 946684800000,1375315200000 --ylim -10,50 -w 50 --title "temperature per month" --xlabel date --ylabel temperature
}

def tempGraph [$ts_ranges, $cities] {

cb-env bucket cbsh

cb-env scope globaltemp

cb-env collection series

let query = ($"SELECT t._t date, $ Object v.city : v.temp FOR v IN ARRAY_AGG\({\"city\": d.city, \"temp\":t._v0}$ when v IS NOT MISSING END\)"

+ $" FROM series AS d UNNEST _timeseries$d, {\"ts_ranges\":($ts_ranges)}$ AS t "

+ $" WHERE d.city IN ($city) "

+ $" AND $d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) $"

+ $" GROUP BY t._t ORDER BY t._t;" )

let csv = query $query |reject cluster | flatten | to csv

$csv | uplot lines -H -d, --xlim 946684800000,1375315200000 --ylim -10,50 -w 50 --title "temperature per month" --xlabel date --ylabel temperature

}

Then if you source it it becomes much easier to use:

> source ./testingcbshell/temp.nu

> tempGraph  [946684800000,1375315200000] ['"Aba"','"Berlin"','"London"','"Paris"']

> source ./testingcbshell/temp.nu

> tempGraph [946684800000,1375315200000] ['"Aba"','"Berlin"','"London"','"Paris"']

I hope this gave you a quick overview of Couchbase time series support, of Couchbase Shell data manipulation, and how you can use other shell commands like youplot to make things more integrated and interesting. You could easily generate a full report in various formats from a GitHub Actions for instance, endless possibilities!

- Read more posts about Couchbase Shell.

Laurent Doguin

Platform

Self-Managed

Services

Capabilities

Why Couchbase?

Migrate to Capella

By Use Case

By Industry

By Application Need

Popular Docs

By Developer Role

COMMUNITY

Join the Developer Community

Resource Center

Education

Compare

About

Partnerships

Our Services

Partners: Register a Deal

Ready to register a deal with Couchbase?

Marriott

All Posts

Graph Time Series Data From the Comfort of Your Terminal

Ingest time series data

Plot time series

Author

Posted by Laurent Doguin

Leave a reply Cancel reply