Last time we dumped from Cassandra and restored it to ClickHouse for proper analytics. To do that we need cqlsh and cassandra-driver:
pip install -U cqlsh
pip install cassandra-driver
# list all tables
tables=`cqlsh -k keyspace1 -e 'describe keyspace' | grep TABLE | cut -d '.' -f 2 | cut -d ' ' -f 1`
# run copy to file
mkdir backup
echo "$tables" | while read x; do time cqlsh -k keyspace1 -e "COPY $x TO STDOUT WITH DELIMITER='|' AND HEADER=TRUE AND NULL=NULL" | grep -v 'WARNING: cqlsh' | sed 's|\\\\|\\|g' > backup/$x.csv ; done
# run copy from file
echo "$tables" | while read x; do time cat backup/$x.csv | sed 's|\\\\|\\|g' | cqlsh -k keyspace1 -e "COPY $x FROM STDIN WITH DELIMITER='|' AND HEADER=TRUE AND NULL=NULL" ; done
That's it, you can update the rows before restoring for your integration test. There's a quirk in Cassandra's CSV where a backslash gets escaped with another backslash and cannot be removed even with the tr command (probably because tr can't handle more than 1 character), so it gets restored with extra backslashes, and NULL gets imported as zero.
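For example (a minimal sketch, assuming a hypothetical users table with an email column you want to neutralize for the integration test), you can rewrite values in the dumped CSV before restoring:
# mask emails with a dummy domain in the dump (hypothetical table/column)
sed -i 's|@[A-Za-z0-9.-]*|@example.com|g' backup/users.csv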
2024-06-13
Backup and Restore Cassandra Database for Development
2024-02-20
Dump/export Cassandra/BigQuery tables and import to Clickhouse
Today we're gonna dump a Cassandra table and put it into ClickHouse. Cassandra is a columnar database but is used as OLTP since it has really good distributed capability (customizable replication factor, multi-cluster/region, clustered/partitioned by default -- so good for multitenant applications), but if we need to do analytics queries or some complex query, it becomes super painful, even with ScyllaDB's materialized views (which are only good for recaps/summaries). To dump a Cassandra database, all you need to do is construct a query and use dsbulk, something like this:
./dsbulk unload -delim '|' -k "KEYSPACE1" \
-query "SELECT col1,col2,col3 FROM table1" -c csv \
-u 'USERNAME1' -p 'PASSWORD1' \
-b secure-bundle.zip | tr '\\' '"' |
gzip -9 > table1_dump_YYYYMMDD.csv.gz ;
The tr command above is used to unescape backslashes, since dsbulk doesn't export CSV in the proper format (\" instead of ""); after that you can just restore it by running something like this:
CREATE TABLE table1 (
col1 String,
col2 Int64,
col3 UUID
) ENGINE = ReplacingMergeTree()
ORDER BY (col1, col2);
SET format_csv_delimiter = '|';
SET input_format_csv_skip_first_lines = 1;
INSERT INTO table1
FROM INFILE 'table1_dump_YYYYMMDD.csv.gz'
FORMAT CSV;
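To sanity-check the import, you can roughly compare the dump's line count (minus the header) with the row count in ClickHouse (assuming no multi-line quoted fields), something like:
zcat table1_dump_YYYYMMDD.csv.gz | tail -n +2 | wc -l
clickhouse-client -q 'SELECT count() FROM table1'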
BigQuery
Similar to ClickHouse, BigQuery is one of the best analytical engines (because of unlimited compute and massively parallel storage), but it comes with a cost: improper partitioning/clustering (and even with proper ones, because it's limited to only 1 column, unlike ClickHouse which can do more) on a large table will do a huge scan ($6.25 per TiB) and use a lot of compute slots; combined with materialized views or periodic queries on cron, it would definitely kill your wallet. To dump from BigQuery, all you need to do is create a GCS (Google Cloud Storage) bucket, then run a query something like this:
EXPORT DATA
OPTIONS (
uri = 'gs://BUCKET1/table2_dump/1-*.parquet',
format = 'PARQUET',
overwrite = true
--, compression = 'GZIP' -- import failed: ZLIB_INFLATE_FAILED
)
AS (
SELECT * FROM `dataset1.table2`
);
-- it's better to create a snapshot table
-- if you do a WHERE filter on the query above, e.g.
CREATE TABLE dataset1.table2_filtered_snapshot AS
SELECT * FROM `dataset1.table2` WHERE col1 = 'yourFilter';
Not using compression because the import failed with it, not sure why. The parquet files will show up in your bucket; click on "Remove public access prevention" and allow them to be publicly readable with this gcloud command:
gcloud storage buckets add-iam-policy-binding gs://BUCKET1 --member=allUsers --role=roles/storage.objectViewer
# remove-iam-policy-binding to undo this
Then just restore it:
CREATE TABLE table2 (
Col1 String,
Col2 DateTime,
Col3 Int32
) ENGINE = ReplacingMergeTree()
ORDER BY (Col1, Col2, Col3);
SET parallel_distributed_insert_select = 1;
INSERT INTO table2
SELECT Col1, Col2, Col3
FROM s3Cluster(
'default',
'https://storage.googleapis.com/BUCKET1/table2_dump/1-*.parquet',
'', -- s3 access id, remove or leave empty if public
'' -- s3 secret key, remove or leave empty if public
);
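Note that ReplacingMergeTree only deduplicates rows with the same ORDER BY key during background merges, so if you need exact numbers right after the import you can force a merge or query with FINAL, for example:
OPTIMIZE TABLE table2 FINAL;
SELECT count() FROM table2 FINAL;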
2022-05-07
Getting started with Trino
# Docker
docker run -d -p 8080:8080 --name trino1 trinodb/trino
# web UI only for monitoring, use random username
docker exec -it trino1 trino
# Ubuntu 22.04
java --version
python3 --version
# download and extract from https://trino.io/download.html
mkdir ./trino-server-379/etc
cd ./trino-server-379/etc
SRCURL=https://raw.githubusercontent.com/trinodb/trino-the-definitive-guide/master/single-installation/etc
wget -c $SRCURL/jvm.config
wget -c $SRCURL/log.properties
wget -c $SRCURL/node.properties
echo '
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8081
discovery.uri=http://localhost:8081
' > config.properties
echo '
node.data-dir=/tmp/
' >> node.properties
mkdir catalog
echo '
connector.name=cassandra
cassandra.contact-points=127.0.0.1
' > catalog/cassandra.properties
# example values; more connector options here https://trino.io/docs/current/connector/cassandra.html
cd .. && python3 ./bin/launcher.py run # to run in background: start
# CLI/Client
EXELOC=/usr/bin/trino
sudo curl -L -o $EXELOC https://repo1.maven.org/maven2/io/trino/trino-cli/379/trino-cli-379-executable.jar
sudo chmod a+x $EXELOC
trino --server http://localhost:8081
These are some commands that can be used in Trino (other than standard SQL):
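For example (assuming the cassandra catalog from catalog/cassandra.properties above, and a hypothetical keyspace1.table1):
SHOW CATALOGS;
SHOW SCHEMAS FROM cassandra;
SHOW TABLES FROM cassandra.keyspace1;
USE cassandra.keyspace1;
DESCRIBE table1;
SHOW FUNCTIONS;
EXPLAIN SELECT COUNT(*) FROM table1;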
That's it, you can add more database connections by creating more etc/catalog/*.properties files with the proper configuration (username, password, port, etc).
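For example, a PostgreSQL catalog could look like this (hypothetical host/credentials; property names are from the Trino PostgreSQL connector docs):
echo '
connector.name=postgresql
connection-url=jdbc:postgresql://127.0.0.1:5432/db1
connection-user=USERNAME1
connection-password=PASSWORD1
' > etc/catalog/postgresql.properties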
2022-03-17
Getting started with Cassandra or ScyllaDB
docker run --name NodeX -d scylladb/scylla:4.5.1
docker run --name NodeY -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"
docker run --name NodeZ -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"
docker exec -it NodeZ nodetool status
# wait for UJ (up joining) became UN (up normal)
Since I failed to run the latest ScyllaDB, we use 4.5 here. To install cqlsh locally, you can use this command:
pip3 install cqlsh
cqlsh 127.0.0.1 19042 # scylladb
# ^ must wait 30s+ before docker ready
docker exec -it $node nodetool status
# ^ show node status
As we already know, Cassandra is a columnar database, where we have to define a partition key (which determines where the rows will be located) and a clustering key (the ordering of the data inside the partition); the SSTable part works similarly to ClickHouse merges.
CREATE KEYSPACE my_keyspace WITH replication =
{'class' : 'SimpleStrategy', 'replication_factor': 1};
-- ^ single node
-- {'class' : 'NetworkTopologyStrategy', 'replication_factor': '3'};
-- ^ multiple node but in a single datacenter and/or rack
-- {'class' : 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'};
-- ^ multiple datacenter
USE my_keyspace;
CONSISTENCY; -- how many read/write ack
-- ANY
-- ONE, TWO, THREE
-- LOCAL_ONE
-- QUORUM = replication_factor / 2 + 1 (example below)
-- LOCAL_QUORUM
-- EACH_QUORUM -- only for write
-- ALL -- will fail if nodes < replication_factor
CONSISTENCY new_level;
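For example, with replication_factor = 3, a quorum is 3/2 + 1 = 2, so:
CONSISTENCY QUORUM; -- reads/writes now wait for acks from 2 of the 3 replicas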
CREATE TABLE users (
lname text,
fname text,
title text,
PRIMARY KEY (lname, fname)
);
DESCRIBE TABLE users; -- only for 4.0+
CREATE TABLE foo (
pkey text,
okey text,
PRIMARY KEY ((pkey), okey) -- different partition and ordering
-- add WITH CLUSTERING ORDER BY (okey DESC) for descending
); -- add WITH cdc = {'enabled': true, 'preimage': true}
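A quick check of how partition and clustering keys behave (rows within one partition come back ordered by the clustering key):
INSERT INTO foo (pkey, okey) VALUES ('p1', 'b');
INSERT INTO foo (pkey, okey) VALUES ('p1', 'a');
SELECT * FROM foo WHERE pkey = 'p1'; -- returns 'a' then 'b' (ascending by okey)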
DESC SCHEMA; -- show all tables and materialized views
SELECT * FROM users; -- USING TIMEOUT XXms
UPDATE users SET title = 'B' WHERE lname = 'A' AND fname = 'A'; -- add IF EXISTS to prevent it from inserting when the row doesn't exist
# INSERT INTO users( ... ) VALUES ( ... ) USING TTL 600
# UPDATE users USING TTL 600 SET ...
# SELECT TTL(fname) FROM users WHERE ...
-- set TTL to 0 to remove TTL
# ALTER TABLE users WITH default_time_to_live = 3600;
# SELECT * FROM users LIMIT 3
# SELECT * FROM users PER PARTITION LIMIT 2
# SELECT * FROM users PER PARTITION LIMIT 1 LIMIT 3
CREATE TABLE stats(city text PRIMARY KEY,total COUNTER);
UPDATE stats SET total = total + 6 WHERE city = 'Kuta';
SELECT * FROM stats;
-- tinyint, smallint, int, bigint (= long)
-- varint (= the real bigint, arbitrary precision)
-- float, double
-- decimal
-- text/varchar, ascii
-- timestamp
-- date, time
-- uuid
-- timeuuid (with mac address, conflict free, set now())
-- boolean
-- inet
-- counter
-- set<type> (set {val,val}, +{val}, -{val})
-- list<type> (set [idx]=, [val,val], +[], []+, -[], DELETE [idx])
-- map<type,type> (set {key: val}, [key]=, DELETE [key] FROM)
-- tuple<type,...> (set (val,...)) -- collection usage examples below
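A minimal sketch of those collection operations, using a hypothetical table:
CREATE TABLE col_demo (id int PRIMARY KEY, tags set<text>, nums list<int>, attrs map<text,int>);
UPDATE col_demo SET tags = tags + {'new'} WHERE id = 1;
UPDATE col_demo SET nums = nums + [1, 2] WHERE id = 1;
UPDATE col_demo SET nums[0] = 9 WHERE id = 1;
UPDATE col_demo SET attrs['k'] = 7 WHERE id = 1;
DELETE attrs['k'] FROM col_demo WHERE id = 1;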
SELECT * FROM users;
-- add IF col = val to an UPDATE to make it conditional (aka lightweight transaction)
-- IF NOT EXISTS / IF EXISTS also work
CREATE TYPE phone (
country_code int,
number text
);
CREATE TYPE address (
street text,
city text,
zip text,
phones map<text, frozen<phone>> -- must be frozen, cannot be updated
);
CREATE TABLE pets_v4 (
name text PRIMARY KEY,
addresses map<text, frozen<address>>
);
INSERT INTO pets_v4 (name, addresses)
VALUES ('Rocky', {
'home' : {
street: '1600 Pennsylvania Ave NW',
city: 'Washington',
zip: '20500',
phones: {
'cell' : { country_code: 1, number: '202 456-1111' },
'landline' : { country_code: 1, number: '202 456-1234' }
}
},
'work' : {
street: '1600 Pennsylvania Ave NW',
city: 'Washington',
zip: '20500',
phones: { 'fax' : { country_code: 1, number: '202 5444' } }
}
});
CREATE INDEX ON users(title); -- global secondary index, auto-named users_title_idx
SELECT * FROM users WHERE title = 'E';
DROP INDEX users_title_idx;
SELECT * FROM users WHERE title = 'E' ALLOW FILTERING; -- full scan
CREATE INDEX ON users((lname),title); -- local index (1 hop per query)
CREATE MATERIALIZED VIEW users_by_title AS
SELECT * -- ALTER TABLE will automatically add the new column to this VIEW too
FROM users
WHERE title IS NOT NULL
AND fname IS NOT NULL
AND lname IS NOT NULL
PRIMARY KEY ((title),lname,fname);
SELECT * FROM users_by_title;
INSERT INTO users(lname,fname,title) VALUES('A','A','A');
SELECT * FROM users_by_title WHERE title = 'A';
DROP MATERIALIZED VIEW users_by_title;
-- docker exec -it NodeZ nodetool viewbuildstatus
BEGIN BATCH
INSERT INTO ...
UPDATE ...
DELETE ...
APPLY BATCH;
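For example, a concrete batch against the users table above:
BEGIN BATCH
INSERT INTO users (lname, fname, title) VALUES ('B', 'B', 'B');
UPDATE users SET title = 'C' WHERE lname = 'B' AND fname = 'B';
DELETE FROM users WHERE lname = 'A' AND fname = 'A';
APPLY BATCH;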
1. for multi-DC, use LOCAL_QUORUM on reads, and TokenAware + DCAwareRoundRobin load balancing to prevent reading from nodes in a different DC
2021-11-14
Databases with Automatic Rebalance Benchmark (TIDB vs YugabyteDB vs CockroachDB)
- Aerospike (in-mem kv database, community edition max 4 billion objects) https://aerospike.com/products/product-matrix/
- Couchbase (document database, there's a community edition) https://www.couchbase.com/products/editions
- MongoDB (document database, without SQL syntax)
- Cassandra (columnar database, CQL is a subset of SQL)
- ScyllaDB (columnar database, cassandra-compatible, oss version max 5 nodes) https://www.scylladb.com/pricing/
- CockroachDB (postgresql compatible, there's a core edition) https://www.cockroachlabs.com/compare/
- TiDB (mysql compatible)
- YugaByteDB (postgresql/cassandra/redis compatible)
- RavenDB (community edition max RAM 6GB) https://ravendb.net/buy
- SingleStore/MemSQL (mysql compatible, free edition max 4 nodes) https://www.singlestore.com/free-software/
# multiple cockroach docker containers, but connect only to one
YugaByteDB Total 1m43.290014042s
Conclusion
Only Best of 2 runs | CockroachDB (single) | TiDB (single) | YugaByteDB (single) |
InsertOne (s) | 7.1 | (best) 2.9 | (worst) 11.4 |
UpdateOne (s) | 10.2 | (best) 3.9 | (worst) 19.2 |
SelectOne (s) | 2.1 | (best) 1.9 | (worst) 2.7 |
Total (s) | 19.6 | (best) 9.2 | (worst) 33.8 |
Disk Usage (MB) | (best) 432 | (worst) 24062 | 544 |
Only Best of 2 runs | CockroachDB (multi) | TiDB (multi) | YugaByteDB (multi) | YugaByteDB (multi 2) | YugaByteDB (multi 3) |
InsertOne (s) | 12.5 | (best) 3.1 | 14.3 | 34.9 | (worst) 45.2 |
UpdateOne (s) | (worst) 60.1 | (best) 4.1 | 19.6 | 52.7 | 54.6 |
SelectOne (s) | (worst) 3.1 | (best) 2.1 | 2.8 | (worst) 3.1 | 2.9 |
Total (s) | 77.2 | (best) 9.5 | 40.8 | 91.7 | (worst) 103.2 |
Disk Usage (MB) | 1015 | (worst) 72247 | (best) 521 | 1046 | 1563 |
Only Best of 2 runs | CockroachDB (multi) | TiDB (multi) | YugaByteDB (multi 3) |
InsertOne (s) | 13.9 | (best) 3.1 | (worst) 45.2 |
UpdateOne (s) | (worst) 82.9 | (best) 4.1 | 54.6 |
SelectOne (s) | 2.6 | (best) 2.1 | 2.9 |
Total (s) | (worst) 99.6 | (best) 9.5 | (worst) 103.2 |
Disk Usage (MB) | (best) 1015 | (worst) 72247 | 1563 |
Only 1 run | CockroachDB (multi, kill 1) | TiDB (multi, kill 1 tikv) | TiDB (multi, kill 1 tipd) | TiDB (multi, kill 1 tikv 1 tipd) | YugaByteDB (multi 3, kill 1) |
InsertOne (s) | (worst) 35.9 | 19.2 | 14.3 | (best) 9.2 | 34.8 |
UpdateOne (s) | 18.6 | 11.6 | 16.0 | (best) 9.9 | (worst) 68.8 |
SelectOne (s) | 4.0 | 1.9 | (best) 1.8 | 2.3 | 3.1 |
Total (s) | 58.8 | 32.9 | 34.0 | (best) 21.5 | (worst) 106.9 |
Disk Usage (MB) | (best) 1076 | (worst) 72364 | 72308 | 72308 | 2659 |
UPDATE #1: redid the benchmark for all databases after updating limits.conf; TiDB improved by a lot, while CockroachDB remained the same except for the update benchmark.