Programming Rants: scylladb


2022-03-17

Getting started with Cassandra or ScyllaDB

Today we're gonna learn Cassandra (it's been 5 years since I last used ScyllaDB, the C++ version of Cassandra). To install Cassandra and ScyllaDB, you can use this docker-compose:

version: '3.3'
services:
  testcassandra:
    image: cassandra:3.11 # or latest
    environment:
      - HEAP_NEWSIZE=256M
      - MAX_HEAP_SIZE=1G
      - "JVM_OPTS=-XX:+PrintGCDateStamps"
      - CASSANDRA_BROADCAST_ADDRESS
    ports:
      - "9042:9042"
  testscylla:
    image: scylladb/scylla:4.5.1 # because latest 4.6 is broken
    command: --smp 2 --memory 1G --overprovisioned 1 --api-address 0.0.0.0 --developer-mode 1
    ports:
      - 19042:9042
    #  - 9142:9142
    #  - 7000:7000
    #  - 7001:7001
    #  - 7199:7199
    #  - 10000:10000
  # scylla-manager:
  #   image: scylladb/scylla-manager
  #   depends_on:
  #     - testscylla

Using docker we can spawn multiple nodes to test NetworkTopologyStrategy, consistency levels, and replication factor (or even multiple datacenters):

docker run --name NodeX -d scylladb/scylla:4.5.1
docker run --name NodeY -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"
docker run --name NodeZ -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"

docker exec -it NodeZ nodetool status
# wait for UJ (up joining) to become UN (up normal)
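
The "wait for UJ to become UN" step above can be scripted. Below is a minimal sketch that parses the text printed by nodetool status and reports whether every node is up and normal. The function names and the SAMPLE output are hypothetical (the sample is trimmed and made up for illustration), so treat this as a starting point, not as the exact format your version prints.

```python
def node_states(status_output: str) -> dict:
    """Map each node address to its two-letter code (e.g. UN, UJ, DN)."""
    states = {}
    for line in status_output.splitlines():
        parts = line.split()
        # Node rows start with two letters: status (U/D) + state (N/L/J/M).
        if (len(parts) >= 2 and len(parts[0]) == 2
                and parts[0][0] in "UD" and parts[0][1] in "NLJM"):
            states[parts[1]] = parts[0]
    return states


def all_up_normal(status_output: str) -> bool:
    """True only if at least one node is listed and all of them are UN."""
    states = node_states(status_output)
    return bool(states) and all(code == "UN" for code in states.values())


# Hypothetical output, trimmed to the columns that matter here.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load    Tokens  Owns  Host ID  Rack
UN  172.17.0.2  212 KB  256     ?     aaaa     rack1
UN  172.17.0.3  198 KB  256     ?     bbbb     rack1
UJ  172.17.0.4  ?       256     ?     cccc     rack1
"""

print(node_states(SAMPLE))
print(all_up_normal(SAMPLE))
```

In practice you would feed it the output of the docker exec command above (for example via subprocess.run) and retry until it returns True.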

I failed to run the latest ScyllaDB, so we use 4.5.1. To install cqlsh locally, you can use this command:

pip3 install cqlsh

cqlsh 127.0.0.1 9042 # cassandra
cqlsh 127.0.0.1 19042 # scylladb

node=`docker ps | grep /scylla: | head -n 1 | cut -f 1 -d ' '`

docker exec -it $node cqlsh # using cqlsh inside scylladb
# ^ must wait 30s+ before docker ready

docker exec -it $node nodetool status
# ^ show node status

As we already know, Cassandra is a wide-column database: we have to choose a partition key (which decides where the rows are located) and a clustering key (the ordering of the data inside the partition). The SSTable part works similarly to Clickhouse merges.

To create a keyspace (much like a "database" or collection of tables, but we can also set how it is replicated), use this command:

CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- ^ single node
-- {'class' : 'NetworkTopologyStrategy', 'replication_factor': '3'};
-- ^ multiple node but in a single datacenter and/or rack
-- {'class' : 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'};
-- ^ multiple datacenter
USE my_keyspace;

CONSISTENCY; -- how many read/write ack
-- ANY
-- ONE, TWO, THREE
-- LOCAL_ONE
-- QUORUM = replication_factor / 2 + 1
-- LOCAL_QUORUM
-- EACH_QUORUM -- only for write
-- ALL -- will fail if live nodes < replication_factor
CONSISTENCY new_level;
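
To make the QUORUM formula above concrete, here is a small sketch of the arithmetic. The function names are made up for illustration. The second function encodes the usual rule of thumb that a read is guaranteed to overlap the latest acknowledged write when read acks + write acks > replication factor.

```python
def quorum(replication_factor: int) -> int:
    """QUORUM = replication_factor / 2 + 1, using integer division."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True if every read replica set must share a node with every write replica set."""
    return write_acks + read_acks > replication_factor


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}, "
          f"QUORUM read + QUORUM write overlap: {read_overlaps_write(rf, q, q)}, "
          f"ONE + ONE overlap: {read_overlaps_write(rf, 1, 1)}")
```

So with RF=3, QUORUM reads plus QUORUM writes always overlap, while ONE plus ONE does not.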

To create a table with a partition key and a clustering/ordering key:

CREATE TABLE users ( -- or TYPE for custom type, [keyspace.]
fname text,
lname text,
title text,
PRIMARY KEY (lname, fname) -- first column = partition key, the rest = clustering key
);
DESCRIBE TABLE users; -- only for 4.0+
CREATE TABLE foo (
pkey text,
okey text,
PRIMARY KEY ((pkey), okey) -- same layout, partition key written explicitly
-- add WITH CLUSTERING ORDER BY (okey DESC) for descending
); -- add WITH cdc = {'enabled': true, 'preimage': true} (ScyllaDB only)
DESC SCHEMA; -- show all tables and materialized views
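
As a mental model for the two kinds of key, here is a toy in-memory sketch: the partition key picks a bucket, and the clustering key keeps rows sorted inside that bucket. ToyTable and its methods are hypothetical names, and the real storage engine is of course far more involved.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self, descending: bool = False):
        self.partitions = defaultdict(list)  # pkey -> sorted [(okey, row), ...]
        self.descending = descending  # like WITH CLUSTERING ORDER BY (okey DESC)

    def insert(self, pkey, okey, row):
        rows = self.partitions[pkey]
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, okey)
        if i < len(rows) and rows[i][0] == okey:
            rows[i] = (okey, row)  # same primary key: overwrite (upsert)
        else:
            rows.insert(i, (okey, row))  # keep clustering order

    def select(self, pkey):
        rows = self.partitions.get(pkey, [])
        return list(reversed(rows)) if self.descending else list(rows)


users = ToyTable()
users.insert("B", "Zed", {"title": "x"})
users.insert("B", "Amy", {"title": "y"})
users.insert("C", "Bob", {"title": "z"})
print([fname for fname, _ in users.select("B")])
print([fname for fname, _ in users.select("C")])
```

Reading one partition returns its rows already ordered by the clustering key, which is why queries by partition key are cheap.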

To upsert, use the INSERT or UPDATE command (last write wins):

INSERT INTO users (fname, lname, title)
VALUES ('A', 'B', 'C');
INSERT INTO users (fname, lname, title)
VALUES ('A', 'B', 'D'); -- add IF NOT EXISTS to prevent replace
SELECT * FROM users; -- USING TIMEOUT XXms

UPDATE users SET title = 'E' WHERE fname = 'A' AND lname = 'C';
-- ^ no such row yet, so this inserts it; add IF EXISTS to prevent insert
SELECT * FROM users;

-- INSERT INTO users( ... ) VALUES ( ... ) USING TTL 600
-- UPDATE users USING TTL 600 SET ...
-- SELECT TTL(title) FROM users WHERE ...
-- set TTL to 0 to remove TTL
-- column becomes NULL once its TTL expires
-- whole row is deleted once the TTL of all non-PK columns has expired
-- ALTER TABLE users WITH default_time_to_live = 3600;

-- SELECT * FROM users LIMIT 3
-- SELECT * FROM users PER PARTITION LIMIT 2
-- SELECT * FROM users PER PARTITION LIMIT 1 LIMIT 3

CREATE TABLE stats(city text PRIMARY KEY,total COUNTER);
UPDATE stats SET total = total + 6 WHERE city = 'Kuta';
SELECT * FROM stats;
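
The "last write wins" upsert behaviour of INSERT and UPDATE described earlier can be modelled with a per-column write timestamp. This is a toy sketch with hypothetical names (write, read). Ties on equal timestamps go to the later call here, which is a simplification of what the real databases do.

```python
def write(table: dict, primary_key: tuple, columns: dict, timestamp: int) -> None:
    """INSERT and UPDATE are both upserts: create the row if missing."""
    row = table.setdefault(primary_key, {})
    for name, value in columns.items():
        current = row.get(name)
        # Keep whichever cell carries the newest timestamp.
        if current is None or timestamp >= current[0]:
            row[name] = (timestamp, value)


def read(table: dict, primary_key: tuple):
    row = table.get(primary_key)
    if row is None:
        return None
    return {name: value for name, (_, value) in row.items()}


users = {}
write(users, ("B", "A"), {"title": "C"}, timestamp=1)      # first INSERT
write(users, ("B", "A"), {"title": "D"}, timestamp=2)      # second INSERT replaces
write(users, ("B", "A"), {"title": "stale"}, timestamp=1)  # older write loses
write(users, ("C", "A"), {"title": "E"}, timestamp=3)      # UPDATE on a missing row inserts
print(read(users, ("B", "A")))
print(read(users, ("C", "A")))
```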

To change the schema, use the usual ALTER TABLE command:

ALTER TABLE users ADD mname text;
-- tinyint, smallint, int, bigint (= long)
-- varint (= the real bigint)
-- float, double
-- decimal
-- text/varchar, ascii
-- timestamp
-- date, time
-- uuid
-- timeuuid (with mac address, conflict free, set now())
-- boolean
-- inet
-- counter
-- set<type> (set {val,val}, +{val}, -{val})
-- list<type> (set [idx]=, [val,val], +[], []+, -[], DELETE [idx])
-- map<type,type> (set {key: val}, [key]=, DELETE [key] FROM)
-- tuple<type,...> (set (val,...))
SELECT * FROM users;

UPDATE users SET mname = 'F' WHERE fname = 'A' AND lname = 'D';
-- add IF col=val to make the update conditional (aka lightweight transaction)
-- or IF EXISTS (IF NOT EXISTS is for INSERT)
SELECT * FROM users;

Complex nested type example from this page:

CREATE TYPE phone (
    country_code int,
    number text,
);
CREATE TYPE address (
    street text,
    city text,
    zip text,
    phones map<text, frozen<phone>> -- must be frozen, cannot be updated
);
CREATE TABLE pets_v4 (
    name text PRIMARY KEY,
    addresses map<text, frozen<address>>
);
INSERT INTO pets_v4 (name, addresses)
VALUES ('Rocky', {
    'home' : {
      street: '1600 Pennsylvania Ave NW',
      city: 'Washington',
      zip: '20500',
      phones: {
        'cell' : { country_code: 1, number: '202 456-1111' },
        'landline' : { country_code: 1, number: '202 456-1234' }
      }
    },
    'work' : {
      street: '1600 Pennsylvania Ave NW',
      city: 'Washington',
      zip: '20500',
      phones: { 'fax' : { country_code: 1, number: '202 5444' } }
    }
});

To create an index (since Cassandra only allows retrieval by partition and clustering key, or a full scan):

CREATE INDEX ON users(title); -- global index (2 hops per query)
SELECT * FROM users WHERE title = 'E';
DROP INDEX users_title_idx;
SELECT * FROM users WHERE title = 'E' ALLOW FILTERING; -- full scan
CREATE INDEX ON users((lname),title); -- local index (1 hop per query), ScyllaDB only

To create a materialized view (which works similarly to Clickhouse's materialized view):

CREATE MATERIALIZED VIEW users_by_title AS
SELECT * -- columns added later with ALTER TABLE are added to this VIEW too
FROM users
WHERE title IS NOT NULL
AND fname IS NOT NULL
AND lname IS NOT NULL
PRIMARY KEY ((title),lname,fname);
SELECT * FROM users_by_title;
INSERT INTO users(lname,fname,title) VALUES('A','A','A');
SELECT * FROM users_by_title WHERE title = 'A';
DROP MATERIALIZED VIEW users_by_title;
-- docker exec -it NodeZ nodetool viewbuildstatus
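
Conceptually, the view above is a second copy of the same rows, re-partitioned by title. Here is a toy sketch of that idea. build_view is a hypothetical helper, and a real view is maintained incrementally by the database instead of being rebuilt from scratch.

```python
def build_view(base_rows: list) -> dict:
    """Group rows by title, ordered by (lname, fname) inside each partition."""
    view = {}
    for row in base_rows:
        # Mirrors the IS NOT NULL conditions in the view definition.
        if any(row.get(col) is None for col in ("title", "lname", "fname")):
            continue
        view.setdefault(row["title"], []).append(row)
    for rows in view.values():
        rows.sort(key=lambda r: (r["lname"], r["fname"]))
    return view


base = [
    {"lname": "B", "fname": "A", "title": "D"},
    {"lname": "C", "fname": "A", "title": "E"},
    {"lname": "A", "fname": "A", "title": "A"},
    {"lname": "A", "fname": "Z", "title": "D"},
    {"lname": "Q", "fname": "Q", "title": None},  # filtered out of the view
]
view = build_view(base)
print(sorted(view))
print([(r["lname"], r["fname"]) for r in view["D"]])
```

The price is the extra storage for the duplicated rows, which is why this option only makes sense when storage doesn't matter.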

To create a "transaction", use a BATCH statement (atomic, but without isolation across partitions):

BEGIN BATCH
INSERT INTO ... ;
UPDATE ... ;
DELETE ... ;
APPLY BATCH;

To import from file, use COPY command:

COPY users FROM 'users.csv' WITH HEADER=true;
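
To try the COPY command, you need a CSV file with a header row. Here is a minimal sketch that generates one with the standard csv module. The sample rows are made up, and the column order assumes you spell the columns out explicitly, as in COPY users (fname, lname, title) FROM 'users.csv' WITH HEADER=true; so that it does not depend on the table's internal column order.

```python
import csv
import os
import tempfile


def write_users_csv(path: str, users: list) -> None:
    """Write (fname, lname, title) tuples with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["fname", "lname", "title"])
        writer.writerows(users)


def read_csv(path: str) -> list:
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Written to a temp dir here; point it at your own path in practice.
path = os.path.join(tempfile.mkdtemp(), "users.csv")
write_users_csv(path, [("A", "B", "C"), ("X", "Y", "Z, with comma")])
rows = read_csv(path)
print(rows)
```

The csv module takes care of quoting values that contain commas.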

Tips for performance optimization:
1. for multi-DC, use LOCAL_QUORUM on reads, and TokenAware+DCAwareRoundRobin to avoid reading from nodes in a different DC
2. ALLOW FILTERING when the number of records is small or the column has low cardinality (e.g. values are only true vs false) -- 0 hops
3. global INDEX when the partition key doesn't need to be included in the query and latency doesn't matter (2 hops)
4. local INDEX when the partition key can be included in the query (1 hop)
5. MATERIALIZED VIEW when you want a different partitioning for the same data and storage doesn't matter
6. always use prepared statements

2021-11-14

Databases with Automatic Rebalance Benchmark (TIDB vs YugabyteDB vs CockroachDB)

Automatic rebalance/repair/self-healing means we can remove or add a node and the database will redistribute the data and rebalance itself, with data replicated to more than 1 node. The previous benchmark didn't really cover this awesome feature (no more cutover downtime to kill the master instance, promote a slave to master, then switch every client to the new master, if not using any proxy).

Some databases that I found that support this feature:

Aerospike (in-mem kv database, community edition max 4 billion object) https://aerospike.com/products/product-matrix/
Couchbase (document database, there's community edition) https://www.couchbase.com/products/editions
MongoDB (document database, without SQL syntax)
Cassandra (wide-column database, CQL is an SQL-like language)
ScyllaDB (wide-column database, cassandra-compatible, oss version max 5 node) https://www.scylladb.com/pricing/
CockroachDB (postgresql compatible, there's core edition) https://www.cockroachlabs.com/compare/
TiDB (mysql compatible)
YugaByteDB (postgresql/cassandra/redis compatible)
RavenDB (community edition max ram 6GB) https://ravendb.net/buy
SingleStore/MemSQL (mysql compatible, free edition max 4 node) https://www.singlestore.com/free-software/

Reproducibility

The repository is here: https://github.com/kokizzu/hugedbbench in the 2021 folder. We're going to test local single-server (if possible) and multi-server deployments using docker. Why docker? Because I don't want to ruin my computer/server with trash files they create in system directories (if any). Some databases are not included because they don't support SQL or require a license key to start. Why only benchmark 2 columns? Because it fits my projects' most common use case, where there's 1 PK (bigint or string), 1 unique key (mostly string), and the rest are mostly indexed or non-indexed columns. Why are you even doing this? I just want to select the best thing for my next side project's tech stack (and because the past companies I've worked with seem to love moving their database servers around a lot).

The specs of the server used in this benchmark: 32 cores, 128GB RAM, 500GB NVMe disk.
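
The shape of the workload (many concurrent workers, each inserting, updating, then selecting rows of a two-column table) can be sketched as below. This is only an illustration: the real benchmark lives in the repository above and runs against the actual databases, while ToyStore here is a hypothetical in-memory stand-in, and threads stand in for the concurrent workers.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyStore:
    """In-memory stand-in for a table with 1 PK and 1 unique string column."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_one(self, pk, uniq):
        with self._lock:
            self._rows[pk] = uniq

    def update_one(self, pk, uniq):
        with self._lock:
            if pk in self._rows:  # unlike a CQL upsert, only touch existing rows
                self._rows[pk] = uniq

    def select_one(self, pk):
        with self._lock:
            return self._rows.get(pk)

    def count(self):
        with self._lock:
            return len(self._rows)


def run_phase(workers, per_worker, op):
    """Run op(pk) for every pk, split across `workers` threads; return seconds."""
    def work(worker_id):
        base = worker_id * per_worker
        for i in range(per_worker):
            op(base + i)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, range(workers)))  # list() re-raises worker errors
    return time.perf_counter() - start


def benchmark(workers, per_worker):
    store = ToyStore()
    timings = {
        "InsertOne": run_phase(workers, per_worker,
                               lambda pk: store.insert_one(pk, f"user{pk}")),
        "UpdateOne": run_phase(workers, per_worker,
                               lambda pk: store.update_one(pk, f"renamed{pk}")),
        "SelectOne": run_phase(workers, per_worker, store.select_one),
    }
    return store.count(), timings


count, timings = benchmark(workers=20, per_worker=50)
print("Count", count)
for phase, seconds in timings.items():
    print(f"{phase} {seconds:.6f}s")
```

The numbers it prints say nothing about any real database; only the phase structure matches the results listed in this post.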

CockroachDB

CockroachDB is part of the NewSQL movement and supports PostgreSQL syntax. To deploy a single node we can use docker compose. The cluster monitoring UI on port 8080 is quite OK :3 better than nothing.

Here's the result for 100 inserts x 1000 goroutines:

Version used: v21.1.11

CockroachDB InsertOne 10.034616078s

CockroachDB Count 42.326487ms

CockroachDB UpdateOne 12.804722812s

CockroachDB Count 78.221432ms

CockroachDB SelectOne 2.281355728s

CockroachDB Total 25.2420225s

$ sudo du -h --max-depth 0 2021/cockroachdb/cockroach1

442M 2021/cockroachdb/cockroach1

CockroachDB InsertOne 7.125466063s

CockroachDB Count 39.753102ms

CockroachDB UpdateOne 10.221870484s

CockroachDB Count 70.624908ms

CockroachDB SelectOne 2.196985441s

CockroachDB Total 19.655920219s

432M 2021/cockroachdb/cockroach1

# multiple cockroach docker but connect only into one

# seems high availability (>1 RF) turned on by default

# but you have to init the cluster manually after docker-compose up

CockroachDB InsertOne 13.979257573s

CockroachDB Count 46.824883ms

CockroachDB UpdateOne 1m22.941738013s

CockroachDB Count 42.374814ms

CockroachDB SelectOne 2.676679427s

CockroachDB Total 1m39.687566436s

433M 2021/cockroachdb/cockroach1

292M 2021/cockroachdb/cockroach2

222M 2021/cockroachdb/cockroach3

TiDB

TiDB is part of the NewSQL movement and supports MySQL syntax. The recommended way to deploy it is the tiup command, but we're going to use docker so the comparison stays fair with the other database products. The official docker setup uses 3 placement drivers and 3 KV servers, so I tried that first. The cluster monitor is on port 10080, but that port is blocked by Chrome, so I moved it to 10081. It's very plaintexty compared to other products.

Version used: 5.7.25-TiDB-v5.0.1

TiDB InsertOne 14.063953386s

TiDB Count 32.523526ms

TiDB UpdateOne 11.329688001s

TiDB Count 49.320725ms

TiDB SelectOne 2.110410282s

TiDB Total 27.601866351s

$ sudo du -h --max-depth 0 2021/tidb/t*/

24G 2021/tidb/tikv0/

24G 2021/tidb/tikv1/

24G 2021/tidb/tikv2/

123M 2021/tidb/tipd0/

123M 2021/tidb/tipd1/

123M 2021/tidb/tipd2/

TiDB InsertOne 13.434256392s

TiDB Count 44.192782ms

TiDB UpdateOne 12.575839233s

TiDB Count 63.126285ms

TiDB SelectOne 2.00257672s

TiDB Total 28.134319527s

24G 2021/tidb/tikv0/
24G 2021/tidb/tikv1/
24G 2021/tidb/tikv2/
123M 2021/tidb/tipd0/
62M 2021/tidb/tipd1/
62M 2021/tidb/tipd2/

# reducing to single server mode (1 pd, 1 kv, 1 db), first run:

TiDB InsertOne 3.216365486s

TiDB Count 34.30629ms

TiDB UpdateOne 3.913131711s

TiDB Count 62.202395ms

TiDB SelectOne 1.991229179s

TiDB Total 9.233077269s
24G 2021/tidb/tikv0/
62M 2021/tidb/tipd0/

YugaByteDB

YugaByteDB is part of the NewSQL movement and supports PostgreSQL syntax. To deploy a single node we can use docker compose too. The cluster monitor on port 7000 is quite OK. The tmp directory is mounted because otherwise it gets stuck starting the second time unless the temporary files are deleted manually. limits.conf was applied.

Version used: 2.9.1.0

YugaByteDB InsertOne 11.402609701s

YugaByteDB Count 159.357304ms

YugaByteDB UpdateOne 19.232827282s

YugaByteDB Count 214.389496ms

YugaByteDB SelectOne 2.778803557s

YugaByteDB Total 33.834838111s

$ sudo du -h --max-depth 0 2021/yugabytedb/yb*1

25M 2021/yugabytedb/ybmaster1

519M 2021/yugabytedb/ybtserver1

YugaByteDB InsertOne 13.536083917s

YugaByteDB Count 202.381009ms

YugaByteDB UpdateOne 20.78337085s

YugaByteDB Count 190.119437ms

YugaByteDB SelectOne 2.849347721s

YugaByteDB Total 37.607747856s

25M 2021/yugabytedb/ybmaster1

519M 2021/yugabytedb/ybtserver1

# multiple ybtserver but only connect to one

# replication factor 1, first run:

YugaByteDB InsertOne 15.260747636s

YugaByteDB Count 66.599257ms

YugaByteDB UpdateOne 26.246382158s

YugaByteDB Count 63.119089ms

YugaByteDB SelectOne 3.213271599s

YugaByteDB Total 44.90095282s

25M 2021/yugabytedb/ybmaster1

242M 2021/yugabytedb/ybtserver1

156M 2021/yugabytedb/ybtserver2

132M 2021/yugabytedb/ybtserver3

# after changing replication factor to 2, first run:

YugaByteDB InsertOne 38.614091068s

YugaByteDB Count 76.615212ms

YugaByteDB UpdateOne 56.796680169s

YugaByteDB Count 84.35411ms

YugaByteDB SelectOne 3.14747611s

YugaByteDB Total 1m38.756226195s

26M 2021/yugabytedb/ybmaster1

343M 2021/yugabytedb/ybtserver1

349M 2021/yugabytedb/ybtserver2

349M 2021/yugabytedb/ybtserver3

# after changing replication factor to 3, first run:

YugaByteDB InsertOne 45.289805293s

YugaByteDB Count 97.112383ms

YugaByteDB UpdateOne 54.665380464s

YugaByteDB Count 64.206741ms

YugaByteDB SelectOne 3.125693618s
YugaByteDB Total 1m43.290014042s
26M 2021/yugabytedb/ybmaster1/
513M 2021/yugabytedb/ybtserver1/
512M 2021/yugabytedb/ybtserver2/
512M 2021/yugabytedb/ybtserver3/

Conclusion

Here's the recap of the insert/update/select durations for 100 records x 1000 goroutines, single instance only:

Only Best of 2 runs	CockroachDB (single)	TiDB (single)	YugaByteDB (single)
InsertOne (s)	7.1	(best) 2.9	(worst) 11.4
UpdateOne (s)	10.2	(best) 3.9	(worst) 19.2
SelectOne (s)	2.1	(best) 1.9	(worst) 2.7
Total (s)	19.6	(best) 9.2	(worst) 33.8
Disk Usage (MB)	(best) 432	(worst) 24062	544

So, at best, it takes on average roughly 29 μs to insert, 39 μs to update, and 19 μs to select one record.
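
The per-record figures come from dividing the best total duration by the number of records (100 x 1000 = 100,000). A small sketch of that arithmetic, using the best single-instance durations from the table above (per_record_us is a made-up helper name):

```python
RECORDS = 100 * 1000  # 100 records x 1000 goroutines


def per_record_us(total_seconds: float, records: int = RECORDS) -> float:
    """Average wall-clock microseconds per record for a whole phase."""
    return total_seconds / records * 1_000_000


for name, seconds in (("insert", 2.9), ("update", 3.9), ("select", 1.9)):
    print(f"{name}: {per_record_us(seconds):.0f} μs per record")
```

Note that this is wall-clock time divided by record count with 1000 workers running concurrently, so it measures throughput, not the latency of a single request.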

Comparing only multi (RF=2+):

Only Best of 2 runs	CockroachDB (multi)	TiDB (multi)	YugaByteDB (multi)	YugaByteDB (multi 2)	YugaByteDB (multi 3)
InsertOne (s)	12.5	(best) 3.1	14.3	34.9	(worst) 45.2
UpdateOne (s)	(worst) 60.1	(best) 4.1	19.6	52.7	54.6
SelectOne (s)	(worst) 3.1	(best) 2.1	2.8	(worst) 3.1	2.9
Total (s)	77.2	(best) 9.5	40.8	91.7	(worst) 103.2
Disk Usage (MB)	1015	(worst) 72247	(best) 521	1046	1563

So, at best, it takes on average roughly 31 μs to insert, 41 μs to update, and 21 μs to select one record.

Comparing only the multi-node setups whose replication factor gives true HA:

Only Best of 2 runs	CockroachDB (multi)	TiDB (multi)	YugaByteDB (multi 3)
InsertOne (s)	13.9	(best) 3.1	(worst) 45.2
UpdateOne (s)	(worst) 82.9	(best) 4.1	54.6
SelectOne (s)	2.6	(best) 2.1	2.9
Total (s)	99.6	(best) 9.5	(worst) 103.2
Disk Usage (MB)	(best) 1015	(worst) 72247	1563

It seems TiDB has the most balanced performance, at the expense of needing pre-allocated disk space, while CockroachDB has the worst performance on the multi-instance update task, and YugabyteDB has the worst performance on the multi-instance insert task.

What happens if we run the benchmark once more, remove one storage node (docker stop), then redo the benchmark (only for RF=2+)?

The YugabyteDB test didn't even enter the insert stage after 5 minutes '__') maybe because truncate is slow? So I changed the benchmark scenario, only for YugabyteDB, to kill 1 node 2 seconds into the insertion phase, but YugabyteDB still gave an error, "ERROR: Timed out: Write RPC (request call id 3873) to 172.21.0.5:9100 timed out after 60.000s (SQLSTATE XX000)", and could not complete. EDIT: YugabyteDB staff on Slack suggested using RF=3 so that it would still survive when one node dies.
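
Why RF=3 and not RF=2? Assuming majority-based replication (a write must reach more than half of the replicas, which is how Raft-style systems work), the number of node failures a replica group survives can be sketched like this. The function names are made up for illustration.

```python
def majority(replication_factor: int) -> int:
    """Smallest number of replicas that is more than half."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """How many replicas can die while a majority is still alive."""
    return replication_factor - majority(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: majority={majority(rf)}, "
          f"survives {tolerated_failures(rf)} dead node(s)")
```

Under that assumption RF=2 survives zero dead nodes, which matches the timeout above once one node was killed.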

Only 1 run	CockroachDB (multi, kill 1)	TiDB (multi, kill 1 tikv)	TiDB (multi, kill 1 tipd)	TiDB (multi, kill 1 tikv 1 tipd)	YugaByteDB (multi 3, kill 1)
InsertOne (s)	(worst) 35.9	19.2	14.3	(best) 9.2	34.8
UpdateOne (s)	18.6	11.6	16.0	(best) 9.9	(worst) 68.8
SelectOne (s)	4.0	1.9	(best) 1.8	2.3	3.1
Total (s)	58.8	32.9	34.0	(best) 21.5	(worst) 106.9
Disk Usage (MB)	(best) 1076	(worst) 72364	72308	72308	2659

TiDB seems to be the winner for the case when a node dies too, at the expense of needing 7 initial nodes (1 tidb [should be at least 2 for HA], 3 tipd, 3 tikv, but it can probably be squeezed to 1 tidb, 1 tipd, 2 tikv, since apparently the default replication factor is 3), whereas CockroachDB only needs 3, and YugabyteDB needs 4 (1 ybmaster, 3 ybtserver). Not sure though what would happen if the single tidb/ybmaster instance died. The recap spreadsheet is here.

Next time we're gonna test how simple it is to add and remove nodes (and securely: if possible, only a limited set of servers can join, without having to set up a firewall/DMZ to restrict unprivileged servers), then re-benchmark with more complex common use cases (like UPSERT, range queries, WHERE-IN, JOIN, and secondary indexes). If automatic rebalance is not a requirement, I would still use Tarantool (since 2020.09) and Clickhouse (since 2021.04), but now I've found one more new favorite automatic-rebalance database other than Aerospike (since 2016.11), :3 myahaha! So this is probably the reason lots of companies are moving to TiDB.

Btw, do not comment on this blog (there are too many spammy comments and there's no notification when a new comment is added), just use a GitHub issue or Reddit instead.

UPDATE #1: redid the benchmark for all databases after updating limits.conf; TiDB improved by a lot, while CockroachDB stayed the same except for the update benchmark.

UPDATE #2: TiKV can be set to use only a little initial disk space by setting "reserve-space"; it would then only use 4.9GB per TiKV after the test.
