Showing posts with label sql. Show all posts
Showing posts with label sql. Show all posts

2020-11-23

Tarantool: In-memory SQL Hybrid Database

Tarantool is an awesome product that I've evaded for a while (because it's using Lua), but after a long time seeking a awesome free silver-bullet database (2016's benchmark here), I think only Tarantool fits the bill (so I need to learn about Lua 5.1 first since Tarantool uses LuaJIT that implements Lua 5.1 and some part of Lua 5.2). The benefits of using Tarantool are:
  • ACID Transaction, like standard RDBMS and Aerospike
  • Replication, very easy load balance, like Aerospike, Redis, etc
  • On-Disk Persistence, like Redis Enterprise, Aerospike, KeyDB Pro
  • Extension Modules, like Redis (RediGraph, RediSearch, RediSQL, etc)
  • Can use SQL, so no need to learn new language, unlike MongoDB, ArangoDB, or databases that only support partial/subset of SQL: ScyllaDB CQL and GridDB TQL, eg. no WHERE-IN.
  • Combining both in-memory and disk storage, like paid version of Redis, KeyDB
  • Can use Go :3 unlike GridDB that only have partial support
  • Extremely customizable using Lua, even on the server part
  • Less memory requirement when using vinyl engine (for large datasets that are larger than RAM) compared to in-memory solutions like Aerospike (64 byte per key requirement) or Redis swap,  sacrificing the I/O performance.
Some of the cons of using Tarantool is that you cannot prepend or drop fields/column, you can only append and the field should be nullable until ever rows have all values.
To install Tarantool on ubuntu (without Docker) you can use these commands:

curl -L https://tarantool.io/odVEwx/release/2.4/installer.sh | bash
curl -L https://tarantool.io/odVEwx/live/2.5/installer.sh | bash
sudo apt install tarantool
sudo systemctl enable tarantool
sudo systemctl start tarantool
netstat -antp | grep 3301

If it doesn't work or stuck just ctrl-C the curl script then install manually from these repo:

echo '
deb https://download.tarantool.org/odVEwx/tarantool/2.5/ubuntu/ focal main
deb-src https://download.tarantool.org/odVEwx/tarantool/2.5/ubuntu/ focal main
deb https://download.tarantool.org/tarantool/modules/ubuntu/ focal main
deb-src https://download.tarantool.org/tarantool/modules/ubuntu/ focal main
' | sudo tee -a /etc/apt/sources.list.d/tarantool_2_5.list
sudo apt update
sudo apt install tarantool

To connect to existing tarantool, we can use this command:

tarantoolctl connect 3301

-- create "table"
s = box.schema.space.create('tester')
-- s equal to box.space.tester

-- create "schema"
s:format({
  {name = 'id', type = 'unsigned'},
  {name = 'band_name', type = 'string'},
  {name = 'year', type = 'unsigned'}
})

-- create primary index
s:create_index('primary', {
  type = 'hash',
  parts = {'id'}
})

-- insert records
s:insert{1, 'Roxette', 1986}
s:insert{2, 'Scorpions', 2015}
s:insert{3, 'Ace of Base', 1993}

-- query using primary index
s:select{3}

-- create secondary index
s:create_index('secondary', {
  type = 'hash',
  parts = {'band_name'}
})

-- query using secondary index
s.index.secondary:select{'Scorpions'}

-- grant guest user to access anywhere
box.schema.user.grant('guest', 'read,write,execute', 'universe')

-- reset the admin password, only when listen, 
-- eg. from /etc/tarantool/instances.enabled/example.lua
box.schema.user.passwd('pass')

-- alter table 
box.space.tester:format({
  {name = 'id', type = 'unsigned'},
  {name = 'band_name', type = 'string'},
  {name = 'year', type = 'unsigned'},
  {name = 'rate', type = 'unsigned', is_nullable=true}})

-- update record id=1, 4th column to 5
box.space.tester:update(1, {{'=', 4, 5}})

-- update record id=2, 4th column to 4
box.space.tester:update(2, {{'=', 4, 4}})

-- search id>1
box.space.tester:select(1, {iterator = 'GT'})
-- can be LT, LE, EQ, REQ (reverse equal), GE, GT.
-- must have TREE index

Next, how to connect from golang? We need to install first go-tarantoll library:

go get -u -v github.com/tarantool/go-tarantool
# or: go get -u -v github.com/viciious/go-tarantool

Next let's create a go file:

package main
import (
    "log"
    "fmt"
    "github.com/tarantool/go-tarantool"
)
func main() {
    conn, err := tarantool.Connect("127.0.0.1:3301", tarantool.Opts{
        User: "guest",
        Pass: "",
    })
    if err != nil {
        log.Fatalf("Connection refused "+err.Error())
    }
    defer conn.Close()

    // insert
    resp, err := conn.Insert("tester", []interface{}{4, "ABBA", 1972})
    fmt.Printf("Insert: %#v %v\n",resp.Data,err)

    // select offset=0, limit=1
    resp, err = conn.Select("tester", "primary", 0, 1, tarantool.IterEq, []interface{}{4})
    fmt.Printf("Select: %#v %v\n",resp.Data,err)
    resp, err = conn.Select("tester","secondary", 0, 1, tarantool.IterEq, []interface{}{"ABBA"})
    fmt.Printf("Select: %#v %v\n",resp.Data,err)

    // update col 2 by 3
    resp, err = conn.Update("tester", "primary", []interface{}{4}, []interface{}{[]interface{}{"+", 2, 3}})
    fmt.Printf("Update: %#v %v\n",resp.Data,err)

    // replace
    resp, err = conn.Replace("tester", []interface{}{4, "New band", 2011})
    fmt.Printf("Replace: %#v %v\n",resp.Data,err)

    // upsert: update increment col 2 by 5, or insert if not exists
    // does not return data back
    resp, err = conn.Upsert("tester", []interface{}{4, "Another band", 2000}, []interface{}{[]interface{}{"+", 2, 5}})
    fmt.Printf("Upsert: %#v %v\n",resp.Data,err)

    // delete
    resp, err = conn.Delete("tester", "primary", []interface{}{4})
    fmt.Printf("Delete: %#v %v\n",resp.Data,err)

    // call directly, do not add () when calling
    resp, err = conn.Call("box.space.tester:count", []interface{}{})
    fmt.Printf("Call: %#v %v\n",resp.Data,err)

    // eval Lua 
    resp, err = conn.Eval("return 4 + 5", []interface{}{})
    fmt.Printf("Eval: %#v %v\n",resp.Data,err)
}

It would give an output something like this:

Insert: []interface {}{[]interface {}{0x4, "ABBA", 0x7b4}} <nil>
Select: []interface {}{[]interface {}{0x4, "ABBA", 0x7b4}} <nil>
Select: []interface {}{[]interface {}{0x4, "ABBA", 0x7b4}} <nil>
Update: []interface {}{[]interface {}{0x4, "ABBA", 0x7b7}} <nil>
Replace: []interface {}{[]interface {}{0x4, "New band", 0x7db}} <nil>
Upsert: []interface {}{} <nil>
Delete: []interface {}{[]interface {}{0x4, "New band", 0x7e0}} <nil>
Call: []interface {}{[]interface {}{0x3}} <nil>
Eval: []interface {}{0x9} <nil>

See more info here and APIs here. Next you can use cartridge to manage the cluster.
Each row/tuple in Tarantool stored as MsgPack and when displayed in console it uses YAML format. TREE is the default index in tarantool engine, memtx engine support few more: HASH, RTREE, BITSET. The TREE or HASH may only index certain types: integer, number, double, varbinary, boolean, uuid, scalar (null, bool, integer, double, decimal, string, varbinary). TREE or HASH or BITSET may only index string and unsigned. RTREE may only index array type. If we have multiple parts/columns on TREE index, we can do partial search starting from starting column of the index. When using string index, we may specify the collation (ordering): unicode or unicode_ci (case ignore, also ignore accented/diacritic alphabet). Sequence can be accessed through box.schema.sequence.create() with options (start, min, max, cycle, step, if_not_exists), we could call next method to get next value. Tarantool persist data in two modes: WAL and snapshot (can be forced using box.snapshot()). List of available operators for update/upsert:
  • + to add, - to subtract
  • & bitwise and, | bitwise or,  ^ bitwise xor
  • : string splice
  • ! insertion of a new field
  • # deletion
  • = assignment
This snippet shows how to update table and check it:

r, err := conn.Call(`box.schema.space.create`, []interface{}{TableName})
if err != nil {
  if err.Error() != `Space '`+TableName+`' already exists (0xa)` {
    log.Println(err)
    return
  }
}
fmt.Println(r.Tuples())
fmt.Println(`---------------------------------------------------`)
r, err = conn.Call(`box.space.`+TableName+`:format`, []interface{}{
  []map[string]interface{}{
    {`name`: `id`, `type`: `unsigned`},
    {`name`: `name`, `type`: `string`},
  },
})
if err != nil {
  log.Println(err)
  return
}
fmt.Println(r.Tuples())
r, err = conn.Call(`box.space.`+TableName+`:format`, []interface{}{})
if err != nil {
  log.Println(err)
  return
}
fmt.Println(r.Tuples())


func ExecSql(conn *tarantool.Connection, query string, parameters ...M.SX) map[interface{}]interface{} {
  params := A.X{query}
  for _, v := range parameters {
    params = append(params, v)
  }
  L.Describe(params)
  res, err := conn.Call(`box.execute`, params)
  if L.IsError(err) {
    L.Describe(`ERROR ExecSql !!! ` + err.Error())
    L.DescribeSql(query, parameters)
    L.Describe(err.Error())
    return map[interface{}]interface{}{`error`: err.Error()}
  }
  tup := res.Tuples()
  if len(tup) > 0 {
    if len(tup[0]) > 0 {
      if tup[0][0] != nil {
        kv, ok := tup[0][0].(map[interface{}]interface{})
        // row_count for UPDATE
        // metadata, rows for SELECT
        if ok {
          return kv
        }
      }
    }
  }
  // possible error
  if len(tup) > 1 {
    if len(tup[1]) > 0 {
      if tup[1][0] != nil {
        L.Describe(`ERROR ExecSql syntax: ` + X.ToS(tup[1][0]))
        L.Describe(query, parameters)
        L.Describe(tup[1][0])
        return map[interface{}]interface{}{`error`: tup[1][0]}
      }
    }
  }
  return map[interface{}]interface{}{}
}

func QuerySql(conn *tarantool.Connection, query string, callback func(row A.X), parameters ...M.SX) []interface{} {
  kv := ExecSql(conn, query, parameters...)
  rows, ok := kv[`rows`].([]interface{})
  if ok {
    for _, v := range rows {
      callback(v.([]interface{}))
    }
    return rows
  }
  return nil
}


The tarantool have some limitations:
  • no datetime (use unsigned if greater than 1970-01-01)
  • can only append column at the end, cannot delete column
  • cannot alter datatype, except if there’s no data yet
  • alter table when there’s data exists must have not_null=true flag 
Tarantool SQL limitations and gotchas:
  • table names or column names must be quoted or they will be automatically capitalized (and then error column/space=table not found)
  • concat just like postgresql (|| operator), but you must convert both operands to string first 
  • does not support right join (not needed anyway), can do natural join or left join, also support foreign key
  • no date/time data type/functions (tarantool problem)
  • NULLIF (tarantool), IFNULL (mysql), COALESCE (standard SQL) all works normally
  • cannot run alter table add column (must use APIs)
  • no information_schema
  • calling box.execute to execute SQL (supported since version 2.x), the result is on first row (row_count for UPDATE, metadata and rows for SELECT), the error is on second row.
I think that's it for now, to learn Lua (needed for stored procedure) you can use official documentation. For example project using Tarantool and Meilisearch, you can clone kmt1 dummy project. 


2016-02-10

Simpler Way to find rows with Largest/Smallest value for each group in SQL

Sometimes we need to find largest or smallest value for each column and also associated column. Recently I found out that DISTINCT ON is quite good, than my previous way (LEFT JOIN less IS NULL, or WHERE IN SELECT MAX/MIN), see this example:

CREATE TABLE test1 ( id bigserial primary key, val int, name text );

INSERT INTO test1(val,name) VALUES(6,'a'),(9,'a'),(5,'b'),(11,'b'),(2,'c'),(8,'c');

SELECT * FROM test1;

 id | val | name 
----+-----+------
  1 |   6 | a
  2 |   9 | a
  3 |   5 | b
  4 |  11 | b
  5 |   2 | c
  6 |   8 | c

To find which id for each name that has greatest value, we usually do something like this (WHERE IN SELECT MAX):

SELECT *
FROM test1 x1
WHERE (name, val) IN (
  SELECT name, MAX(val) 
  FROM test1
  GROUP BY 1
)
ORDER BY x1.name;


 id | val | name
----+-----+------
  2 |   9 | a
  4 |  11 | b
  6 |   8 | c

 Sort  (cost=73.95..74.64 rows=275 width=44)
   Sort Key: test1.name
   ->  Hash Join  (cost=33.50..62.81 rows=275 width=44)
         Hash Cond: ((test1.name = test1_1.name) AND (test1.val = (max(test1_1.val))))
         ->  Seq Scan on test1  (cost=0.00..21.00 rows=1100 width=44)
         ->  Hash  (cost=30.50..30.50 rows=200 width=36)
               ->  HashAggregate  (cost=26.50..28.50 rows=200 width=36)
                     Group Key: test1_1.name
                     ->  Seq Scan on test1 test1_1  (cost=0.00..21.00 rows=1100 width=36)

Or this (LEFT JOIN less IS NULL):

SELECT x1.*
FROM test1 x1
  LEFT JOIN test1 x2
    ON x1.name = x2.name
    AND x1.val < x2.val
WHERE x2.id IS NULL
GROUP BY x1.id, x1.name
ORDER BY x1.name; 


 id | val | name 
----+-----+------
  2 |   9 | a
  4 |  11 | b
  6 |   8 | c

 Group  (cost=264.68..264.75 rows=10 width=44)
   Group Key: x1.name, x1.id
   ->  Sort  (cost=264.68..264.70 rows=10 width=44)
         Sort Key: x1.name, x1.id
         ->  Merge Left Join  (cost=153.14..264.51 rows=10 width=44)
               Merge Cond: (x1.name = x2.name)
               Join Filter: (x1.val < x2.val)
               Filter: (x2.id IS NULL)
               ->  Sort  (cost=76.57..79.32 rows=1100 width=44)
                     Sort Key: x1.name
                     ->  Seq Scan on test1 x1  (cost=0.00..21.00 rows=1100 width=44)
               ->  Sort  (cost=76.57..79.32 rows=1100 width=44)
                     Sort Key: x2.name
                     ->  Seq Scan on test1 x2  (cost=0.00..21.00 rows=1100 width=44)


There some other way, that I found simpler (and faster in real life) also return only one row if there's duplicate value, that is DISTINCT ON!

SELECT DISTINCT ON (x1.name) *
FROM test1 x1
ORDER BY x1.name, x1.val DESC;

 id | val | name 
----+-----+------
  2 |   9 | a
  4 |  11 | b
  6 |   8 | c

 Unique  (cost=76.57..82.07 rows=200 width=44)
   ->  Sort  (cost=76.57..79.32 rows=1100 width=44)
         Sort Key: name, val
         ->  Seq Scan on test1 x1  (cost=0.00..21.00 rows=1100 width=44)

So in this case, we look for highest value, distinct by name. Well, that's all for now.

2015-10-15

State of JSON in SQLite and NoSQL

Another news for those who like JSON format, there's JSON1 extension for SQLite, this is a great news for those who only need embedded database for their app. Their functions are similar to PostgreSQL's JSON/JSONB functions. On other hand, there's already a lot of BSON solution in NoSQL world, such as the infamous MongoDB  (and TokuMX) RethinkDB, ArangoDB (SQL-like syntax), PouchDB (Javascript version of CouchDB),  SequoiaDBDjonDB, etc. For more information about those databases, see the video below.

SQLite 3 Tutorial

PostgreSQL 9.4's JSONB

What's new in MongoDB 3

Fractal Tree Index (TokuMX)

RethinkDB

ArangoDB

EDIT Hey, apparently there are also JSON data type support in MySQL 5.7.8 (2015-08-03, Release Candidate).

2015-09-08

Query for Student Transcript (and Cumulative GPA)

There are some rules to generate transcript, some universities will use bet the highest grade, but some other will use get the latest grade. For example if we have this schema

CREATE TABLE schedules (
  id BIGSERIAL PRIMARY KEY,
  subject_id BIGINT FOREIGN KEY REFERENCES subjects(id),
  semester VARCHAR(5) -- YYYYS -- year and semester
);
CREATE TABLE enrollments (
  id BIGSERIAL PRIMARY KEY,
  student_id BIGINT FOREIGN KEY REFERENCES students(id),
  schedule_id BIGINT FOREIGN KEY REFERENCES schedules(id),
  grade VARCHAR(2) -- letter
);

If the university rule for transcript is get the highest score, the query is quite simple, first you'll need to make sure that grade is ordered in certain way on MAX agregate (for example: A+ > A > A- > B+ > ...) , then you'll need to do this query:

SELECT s.subject_id
  , MAX(e.grade)
FROM enrollments e
  JOIN schedules s
    ON e.schedule_id = s.id
WHERE e.student_id = ?
GROUP BY 1

If the university rules is get the latest score, first of all, you'll need to make sure that the semester is ordered (for example: 2015A < 2015B < 2015C < 2015A < ...), then you'll need to find the latest semester of each subject, for example:

SELECT s.subject_id
  , MAX(s.semester)
FROM enrollments e
  JOIN schedules s
    ON e.schedule_id = s.id
WHERE e.student_id = ?
GROUP BY 1

Then you'll need to fetch the credits that is the latest semester, for example:

SELECT s.subject_id
  , e.grade
FROM enrollments e
  JOIN schedules s
    ON e.schedule_id = s.id
WHERE e.student_id = ?
  AND (s.subject_id, s.semester) IN
    ( SELECT s.subject_id
        , MAX(s.semester)
      FROM enrollments e
        JOIN schedules s
          ON e.schedule_id = s.id
      WHERE e.student_id = ?
      GROUP BY 1
    )
GROUP BY 1

That way, it you'll only get the latest grade (as defined on the sub-query).

There are another alternative for this, you can use aggregate with OVER and PARTITION BY syntax, then compare the result with the aggregate to remove the previous scores, for example:

SELECT r.subject_id
  , r.grade
FROM (
  SELECT s.subject_id
    , e.grade
    , s.semester
    , MAX(s.semester) OVER (PARTITION BY s.subject_id) max_sem
  FROM enrollments e
    JOIN schedules s
      ON e.schedule_id = s.id
  WHERE e.student_id = ?) r
WHERE r.semester_id = r.max_sem

That's it, that's the way to get the transcript. Those queries can be modified to fetch the Cumulative GPA. just add one more join with subjects table to get the units/credits, then just calculate the SUM(subjects.unit * enrollments.num_grade) / SUM(subjects.unit)
You can also calculate the Cumulative GPA per semester by adding the semester for the partition/group by part. If you wonder, why do I write this post? because I have googled and not found any posts that write about this.