2024-02-20

Dump/export Cassandra/BigQuery tables and import to Clickhouse

Today we're gonna dump Cassandra table and put it to Clickhouse. Cassandra is columnar database but use as OLTP since it have really good distributed capability (customizable replication factor, multi-cluster/region, clustered/partitioned by default -- so good for multitenant applications), but if we need to do analytics queries or some complex query, it became super sucks, even with ScyllaDB's materialized view (which only good for recap/summary). To dump Cassandra database, all you need to do just construct a query and use dsbulk, something like this:

./dsbulk unload -delim '|' -k "KEYSPACE1" \
   -query "SELECT col1,col2,col3 FROM table1" -c csv \
   -u 'USERNAME1' -p 'PASSWORD1' \
   -b secure-bundle.zip | tr '\\' '"' |
    gzip -9 > table1_dump_YYYYMMDD.csv.gz ;


tr command above used to unescape backslash, since dsbulk export csv not in proper format (\" not ""), after than you can just restore it by running something like this:

CREATE TABLE table1 (
    col1 String,
    col2 Int64,
    col3 UUID,
) ENGINE = ReplacingMergeTree()
ORDER BY (col1, col2);

SET format_csv_delimiter = '|';
SET input_format_csv_skip_first_lines = 1;

INSERT INTO table1
FROM INFILE 'table1_dump_YYYYMMDD.csv.gz'
FORMAT CSV;


BigQuery

 Similar to Clickhouse, BigQuery is one of the best analytical engine (because of unlimited compute and massively parallel storage), but it comes with cost, improper partitioning/clustering (even with proper one, because it's limited to only 1 column unlike Clickhouse that can do more) with large table will do a huge scan ($6.25 per TiB) and a lot of compute slot, if combined with materialized view or periodic query on cron, it would definitely kill your wallet. To dump from BigQuery all you need to do just create GCS (Google Cloud Storage) bucket then run some query something like this:

EXPORT DATA
  OPTIONS (
    uri = 'gs://
BUCKET1/table2_dump/1-*.parquet',
    format = 'PARQUET',
    overwrite = true
    --, compression = 'GZIP' -- import failed: ZLIB_INFLATE_FAILED
  )
AS (
  SELECT * FROM `dataset1.table2`
);

-- it's better to create snapshot table
-- if you do WHERE filter on above query, eg.
CREATE TABLE dataset1.table2_filtered_snapshot AS
  SELECT * FROM `
dataset1.table2` WHERE col1 = 'yourFilter';

Not using compression because it's failed to import, not sure why. The parquet files will be shown on your bucket, click on "Remove public access prevention", and allow it to be publicly available with gcloud command:

gcloud storage buckets add-iam-policy-binding gs://BUCKET1 --member=allUsers --role=roles/storage.objectViewer
# remove-iam-policy-binding to undo this

 

 Then just restore it:

CREATE TABLE table2 (
          Col1 String,
          Col2 DateTime,
          Col3 Int32
) ENGINE = ReplacingMergeTree()
ORDER BY (Col1, Col2, Col3);

SET parallel_distributed_insert_select = 1;

INSERT INTO table2
SELECT Col1, Col2, Col3
FROM s3Cluster(
    'default',
    'https://storage.googleapis.com/BUCKET1/table2_dump/1-*.parquet',
    '', -- s3 access id, remove or leave empty if public
    '' -- s3 secret key,
remove or leave empty if public
);