2022-03-26

Move docker Data Directory to Another Partition

My first NVMe (I hate this brand, S*****P****) sometimes hangs periodically (marking the filesystem read-only), pegging 50-100% CPU so that nothing more can be done. So I had to move some parts of it to another NVMe drive; here's how I moved the docker data directory to another partition.

sudo systemctl stop docker

Then edit the daemon configuration:

sudo vim /etc/docker/daemon.json
# where /media/asd/nvme2 is the mount point of your other partition
# add something like this:
{
  "dns": ["8.8.8.8","1.1.1.1"],
  "data-root": "/media/asd/nvme2/docker"
}

Copy the old data directory to the new partition (no need to mkdir the docker folder, rsync will create it), then move the original out of the way:

sudo rsync -aP --progress /var/lib/docker/ /media/asd/nvme2/docker
mv /var/lib/docker /var/lib/docker.backup

Then start the docker service again:

sudo systemctl start docker 
sudo systemctl status docker 

If it all works (docker info should now show the new Docker Root Dir), you can delete the backup of the original data directory.

2022-03-20

1 million Go goroutines vs C# tasks

Let's compare 1 million goroutines with 1 million tasks: which one is more efficient in CPU and memory usage? The code is forked from kjpgit's techdemo.
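
The measured pattern is roughly the sketch below (my minimal reconstruction, not kjpgit's exact code; the sleep duration, the polling loop, and mapping the Nt suffix to GOMAXPROCS are assumptions):

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	runtime.GOMAXPROCS(4) // assumption: the _1t.._32t suffix = how many cores may be used
	const N = 1_000_000
	var done int64
	for i := 0; i < N; i++ {
		go func() { // each goroutine starts with a ~2KB stack
			time.Sleep(10 * time.Second) // park, simulating a waiting task
			atomic.AddInt64(&done, 1)
		}()
	}
	for atomic.LoadInt64(&done) < N { // main goroutine polls for completion
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("done:", N)
}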

Name     UserCPU    SysCPU     AvgRSS       MaxRSS    Wall
c#_1t      38.96    0.68      582,951      636,136    1:00
c#_2t      88.33    0.95      623,956      820,620    1:02
c#_4t     142.86    1.09      687,365      814,028    1:03
c#_8t     235.80    1.71      669,882      820,704    1:05
c#_16t    434.76    4.01      734,545      771,240    1:08
c#_32t    717.39    4.81      720,235      769,888    1:11
go_1t      58.77    0.65    2,635,380    2,635,632    1:04
go_2t      64.48    0.71    2,639,206    2,642,752    1:00
go_4t      72.55    1.42    2,651,086    2,654,972    1:00
go_8t      80.87    2.82    2,641,664    2,643,392    1:00
go_16t     83.18    4.03    2,673,404    2,681,100    1:00
go_32t     86.65    4.30    2,645,494    2,657,580    1:00


Apparently the result is as expected: because all tasks/goroutines are spawned first before processing, Go's scheduler is more efficient in CPU usage, but the C# runtime is more efficient in memory usage. That's normal, though, because goroutines require a minimum of 2KB of overhead per goroutine, a much higher cost than spawning a task. What if we increase to 10 million tasks/goroutines, and let the spawning be done in another task/goroutine, so that a finished goroutine can give memory back to the GC? Here's the result:

Name     UserCPU    SysCPU     AvgRSS        MaxRSS    Wall
c#_1t      12.78    1.28    2,459,190     5,051,528    0:13
c#_2t      22.60    1.54    2,692,439     5,934,796    0:18
c#_4t      42.09    1.54    2,370,239     5,538,280    0:21
c#_8t      88.54    2.29    2,522,053     6,334,176    0:29
c#_16t    204.39    3.32    2,395,001     5,803,808    0:34
c#_32t    259.09    3.25    1,842,458     4,710,012    0:28
go_1t      13.97    0.97    4,514,200     6,151,088    0:14
go_2t      12.35    1.51    5,595,418     9,506,076    0:07
go_4t      22.09    2.40    6,394,162    12,517,848    0:07
go_8t      31.00    3.09    7,115,281    13,428,344    0:06
go_16t     40.32    3.52    7,126,851    13,764,940    0:06
go_32t     58.58    3.58    7,104,882    12,145,396    0:06

The result seems normal: the high memory usage is caused by a lot of goroutines being spawned at the same time on different threads without blocking the main thread, and after they are done they get collected by the GC (previously the exit condition was time-based; this time it exits after all processing is done, since I moved the sleep before the atomic increment). What if we lower it back to 1 million, with the same exit rule, spawning executed in a different task/goroutine, and completion checked every 1s? Here's the result:
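
In Go that variant might look like this (again a hedged sketch, not the exact benchmark code; the detached spawner and the one-second ticker are the parts the paragraph above describes):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	const N = 1_000_000
	var done int64
	go func() { // spawner runs detached, so the main goroutine is never blocked
		for i := 0; i < N; i++ {
			go func() {
				time.Sleep(time.Second)   // sleep first...
				atomic.AddInt64(&done, 1) // ...then mark completion
			}()
		}
	}()
	for range time.Tick(time.Second) { // check completion every 1s
		if atomic.LoadInt64(&done) >= N {
			break
		}
	}
	fmt.Println("all done")
}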

Name    UserCPU SysCPU    AvgRSS     MaxRSS    Wall
c#_1t    1.18    0.16     328,134     511,652    0:02
c#_2t    2.18    0.22     294,608     554,488    0:02
c#_4t    3.19    0.20     305,336     554,064    0:02
c#_8t    7.77    0.31     292,281     530,368    0:02
c#_16t  12.33    0.25     304,352     569,460    0:02
c#_32t  37.90    1.25     337,837     684,252    0:03
go_1t    2.72    0.42   1,592,978   2,519,040    0:03
go_2t    3.04    0.47   1,852,084   2,637,532    0:03
go_4t    3.65    0.54   1,936,626   2,637,272    0:03
go_8t    3.27    0.59   1,768,540   2,655,208    0:02
go_16t   4.01    0.71   1,770,673   2,664,504    0:02
go_32t   4.96    0.72   1,770,354   2,669,244    0:02

The difference in processing time is negligible, but the CPU usage and memory usage contrast quite sharply. Next, let's try to spawn in bursts (100K per second), so we add a 1 second sleep after every 100,000th task/goroutine, since it's not realistic even for a DDOS'ed server to receive that much (unless the server is finely tuned). Here's the result:
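
The throttling boils down to something like this in the spawner (a sketch; only the burst constant changes in the 200K/400K runs below):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	const N = 1_000_000
	const burst = 100_000 // 200_000 and 400_000 in the later runs
	var done int64
	go func() {
		for i := 0; i < N; i++ {
			if i > 0 && i%burst == 0 {
				time.Sleep(time.Second) // throttle: at most `burst` spawns per second
			}
			go func() {
				time.Sleep(time.Second)
				atomic.AddInt64(&done, 1)
			}()
		}
	}()
	for atomic.LoadInt64(&done) < N {
		time.Sleep(time.Second)
	}
	fmt.Println("done")
}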

Name    UserCPU  SysCPU    AvgRSS     MaxRSS     Wall
c#_1t     0.61    0.08    146,849     284,436    0:05
c#_2t     1.17    0.10    131,778     261,720    0:05
c#_4t     1.53    0.08    133,505     289,584    0:05
c#_8t     4.17    0.15    131,924     284,960    0:05
c#_16t   10.94    0.68    135,446     289,028    0:05
c#_32t   19.86    3.01    130,533     284,924    0:05
go_1t     1.84    0.24    731,872   1,317,796    0:06
go_2t     1.87    0.26    659,382   1,312,220    0:05
go_4t     2.00    0.30    661,296   1,322,152    0:05
go_8t     2.37    0.34    660,641   1,324,684    0:05
go_16t    2.82    0.39    660,225   1,323,932    0:05
go_32t    3.36    0.45    659,176   1,327,264    0:05

And for 5 million:

Name    UserCPU    SysCPU    AvgRSS       MaxRSS    Wall
c#_1t     3.39    0.24      309,103      573,772    0:11
c#_2t     8.30    0.26      278,683      553,592    0:11
c#_4t    13.65    0.32      274,679      658,104    0:11
c#_8t    23.20    0.46      286,336      641,376    0:12
c#_16t   45.85    1.32      286,311      640,336    0:12
c#_32t   64.83    2.46      264,866      615,552    0:12
go_1t     6.25    0.50    1,397,434    2,629,936    0:13
go_2t     6.20    0.56    1,386,336    2,631,580    0:11
go_4t     7.52    0.65    1,410,523    2,625,308    0:11
go_8t     8.21    0.86    1,441,080    2,779,456    0:11
go_16t   11.17    0.96    1,436,220    2,687,908    0:11
go_32t   12.97    1.06    1,430,573    2,668,816    0:11

And for 25 million:

Name    UserCPU   SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     15.94    0.69     590,411    1,190,340    0:24
c#_2t     34.88    0.84     699,288    1,615,372    0:32
c#_4t     59.95    0.89     761,308    1,794,116    0:34
c#_8t    100.64    1.36     758,161    1,845,944    0:36
c#_16t   199.56    2.99     765,791    2,014,856    0:38
c#_32t   332.02    4.07     811,809    1,972,400    0:41
go_1t     21.76    0.71   2,846,565    4,413,968    0:29
go_2t     25.77    1.03   2,949,433    5,553,608    0:25
go_4t     28.74    1.24   2,920,447    5,800,088    0:24
go_8t     37.28    1.96   2,869,074    5,502,776    0:23
go_16t    43.46    2.67   2,987,114    5,769,356    0:24
go_32t    43.77    2.92   3,027,179    5,867,084    0:24

How about 25 million with a sleep per 200K?

Name    UserCPU   SysCPU      AvgRSS       MaxRSS    Wall
c#_1t     18.47    0.91      842,492    1,820,788    0:22
c#_2t     40.32    0.93    1,070,555    2,454,324    0:31
c#_4t     62.39    1.16    1,103,741    2,581,476    0:33
c#_8t    100.84    1.34    1,074,820    2,377,580    0:34
c#_16t   218.26    2.91    1,062,642    2,726,700    0:37
c#_32t   339.00    6.51    1,042,254    2,275,644    0:40
go_1t     22.61    0.88    3,474,195    5,071,944    0:27
go_2t     25.83    1.20    3,912,071    6,964,640    0:20
go_4t     37.98    1.68    4,180,188    7,392,800    0:20
go_8t     38.56    2.44    4,189,265    8,481,852    0:18
go_16t    44.49    3.19    4,187,142    8,483,236    0:18
go_32t    48.82    3.44    4,218,591    8,424,200    0:18

And lastly, 25 million with a sleep per 400K?

Name    UserCPU    SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     18.66    0.98    1,183,313    2,622,464    0:20
c#_2t     41.27    1.14    1,326,415    3,155,948    0:31
c#_4t     67.21    1.11    1,436,280    3,015,212    0:33
c#_8t    107.14    1.56    1,492,179    3,378,688    0:35
c#_16t   233.50    2.45    1,498,421    3,732,368    0:41
c#_32t   346.87    3.74    1,335,756    2,882,676    0:39
go_1t     24.13    0.82    4,048,937    5,099,220    0:26
go_2t     28.85    1.41    4,936,677    8,023,568    0:18
go_4t     31.51    1.95    5,193,653    9,537,080    0:14
go_8t     45.27    2.65    5,461,107    9,499,308    0:14
go_16t    53.43    3.19    5,183,009    9,476,084    0:14
go_32t    61.98    3.86    5,589,156   10,587,788    0:14

How to read the results above? Wall = how much time was needed to complete, lower is better; AvgRSS/MaxRSS = average/max memory usage (in KB), lower is better; UserCPU = CPU time used, where >100% means more than 1 full core's compute time was used, lower is better. Versions used in this benchmark:

go version go1.17.6 linux/amd64
dotnet --version
6.0.201

2022-03-17

Getting started with Cassandra or ScyllaDB

Today we're gonna learn Cassandra (it's been 5 years since I last used ScyllaDB, the C++ version of Cassandra). To install Cassandra and ScyllaDB, you can use this docker-compose:

version: '3.3'
services:
  testcassandra:
    image: cassandra:3.11 # or latest
    environment:
      - HEAP_NEWSIZE=256M
      - MAX_HEAP_SIZE=1G
      - "JVM_OPTS=-XX:+PrintGCDateStamps"
      - CASSANDRA_BROADCAST_ADDRESS
    ports:
      - "9042:9042"
  testscylla:
    image: scylladb/scylla:4.5.1 # because latest 4.6 is broken
    command: --smp 2 --memory 1G --overprovisioned 1 --api-address 0.0.0.0 --developer-mode 1
    ports:
      - 19042:9042
#      - 9142:9142
#      - 7000:7000
#      - 7001:7001
#      - 7199:7199
#      - 10000:10000
#  scylla-manager:
#    image: scylladb/scylla-manager
#    depends_on:
#      - testscylla
 
Using docker we can spawn multiple nodes to test NetworkTopologyStrategy, consistency level, and replication factor (or even multiple datacenters):
 
docker run --name NodeX -d scylladb/scylla:4.5.1
docker run --name NodeY -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"
docker run --name NodeZ -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"

docker exec -it NodeZ nodetool status
# wait for UJ (up/joining) to become UN (up/normal)

Since I failed to run the latest ScyllaDB, we use 4.5. To install cqlsh locally, you can use this command:

pip3 install cqlsh
cqlsh 127.0.0.1 9042 # cassandra
cqlsh 127.0.0.1 19042 # scylladb

node=`docker ps | grep /scylla: | head -n 1 | cut -f 1 -d ' '`
docker exec -it $node cqlsh # using cqlsh inside scylladb
# ^ must wait 30s+ for the container to be ready

docker exec -it $node nodetool status
# ^ show node status

As we already know, Cassandra is a wide-column database, where we have to define a partition key (which determines where the rows will be located) and a clustering key (the ordering of the data inside the partition); the SSTable part works similarly to Clickhouse merges.

To create a keyspace (much like a "database", a collection of tables, but where we can set the replication strategy), use this command:

CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- ^ single node
-- {'class' : 'NetworkTopologyStrategy', 'replication_factor': '3'};
-- ^ multiple node but in a single datacenter and/or rack

-- {'class' : 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'};
-- ^ multiple datacenter

USE my_keyspace;

CONSISTENCY; -- how many read/write ack
-- ANY
-- ONE, TWO, THREE
-- LOCAL_ONE
-- QUORUM = replication_factor / 2 + 1
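--   (eg. with replication_factor 3, QUORUM = 2, so one replica may be down)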
-- LOCAL_QUORUM
-- EACH_QUORUM -- only for write
-- ALL -- will fail if live nodes < replication_factor
CONSISTENCY new_level;

To create a table (the first column of the primary key is the partition key, the rest are clustering/ordering keys):

CREATE TABLE users ( -- or TYPE for a custom type; [keyspace.] prefix optional
  fname text,
  lname text,
  title text,
  PRIMARY KEY (lname, fname)
);
DESCRIBE TABLE users; -- only for 4.0+
CREATE TABLE foo (
  pkey text,
  okey text,
  PRIMARY KEY ((pkey), okey) -- different partition and ordering
  -- add WITH CLUSTERING ORDER BY (okey DESC) for descending
); -- add WITH cdc = {'enabled': true, 'preimage': true}
DESC SCHEMA; -- show all tables and materialized views


To upsert, use the insert or update command (last write wins):

INSERT INTO users (fname, lname, title)
VALUES ('A', 'B', 'C');
INSERT INTO users (fname, lname, title)
VALUES ('A', 'B', 'D'); -- add IF NOT EXISTS to prevent replace
SELECT * FROM users; -- USING TIMEOUT XXms

UPDATE users SET title = 'E' WHERE fname = 'A' AND lname = 'C';
SELECT * FROM users; -- add IF EXISTS to prevent insert

-- INSERT INTO users ( ... ) VALUES ( ... ) USING TTL 600
-- UPDATE users USING TTL 600 SET ...
-- SELECT TTL(fname) FROM users WHERE ...
-- set TTL to 0 to remove the TTL
-- a column will be NULL if its TTL reaches 0
-- the whole row will be deleted if all non-PK columns' TTL reach zero
-- ALTER TABLE users WITH default_time_to_live = 3600;

-- SELECT * FROM users LIMIT 3
-- SELECT * FROM users PER PARTITION LIMIT 2
-- SELECT * FROM users PER PARTITION LIMIT 1 LIMIT 3

CREATE TABLE stats(city text PRIMARY KEY, total COUNTER);
UPDATE stats SET total = total + 6 WHERE city = 'Kuta';
SELECT * FROM stats;

To change the schema, use the usual alter table command:

ALTER TABLE users ADD mname text;
-- tinyint, smallint, int, bigint (= long)
-- varint (= the real bigint, arbitrary precision)
-- float, double
-- decimal
-- text/varchar, ascii
-- timestamp
-- date, time
-- uuid
-- timeuuid (with mac address, conflict free, set now())
-- boolean
-- inet
-- counter
-- set<type> (set {val,val}, +{val}, -{val})
-- list<type> (set [idx]=, [val,val], +[], []+, -[], DELETE [idx])
-- map<type,type> (set {key: val}, [key]=, DELETE [key] FROM)
-- tuple<type,...> (set (val,...))

SELECT * FROM users;

UPDATE users SET mname = 'F' WHERE fname = 'A' AND lname = 'D';
-- add IF col = val to prevent the update (aka lightweight transaction)
-- or IF NOT EXISTS
SELECT * FROM users;
 
Complex nested type example from this page:

CREATE TYPE phone (
    country_code int,
    number text
);
CREATE TYPE address (
  street text,
  city text,
  zip text,
  phones map<text, frozen<phone>> -- must be frozen, cannot be updated
);
CREATE TABLE pets_v4 (
  name text PRIMARY KEY,
  addresses map<text, frozen<address>>
);
INSERT INTO pets_v4 (name, addresses)
  VALUES ('Rocky', {
    'home' : {
      street: '1600 Pennsylvania Ave NW',
      city: 'Washington',
      zip: '20500',
      phones: {
        'cell' : { country_code: 1, number: '202 456-1111' },
        'landline' : { country_code: 1, number: '202 456-1234' }
      }
    },
    'work' : {
      street: '1600 Pennsylvania Ave NW',
      city: 'Washington',
      zip: '20500',
      phones: { 'fax' : { country_code: 1, number: '202 5444' } }
    }
  });

 
To create an index (since Cassandra only allows retrieval by partition and clustering key, or a full scan):

CREATE INDEX ON users(title); -- global index (2 hops per query)
SELECT * FROM users WHERE title = 'E';
DROP INDEX users_title_idx;
SELECT * FROM users WHERE title = 'E' ALLOW FILTERING; -- full scan
CREATE INDEX ON users((lname),title); -- local index (1 hop per query)

To create a materialized view (which works similarly to Clickhouse's materialized view):

CREATE MATERIALIZED VIEW users_by_title AS
SELECT * -- ALTER TABLE on users will automatically alter this VIEW too
FROM users
WHERE title IS NOT NULL
  AND fname IS NOT NULL
  AND lname IS NOT NULL
PRIMARY KEY ((title), lname, fname);
SELECT * FROM users_by_title;
INSERT INTO users(lname,fname,title) VALUES('A','A','A');
SELECT * FROM users_by_title WHERE title = 'A';
DROP MATERIALIZED VIEW users_by_title;
-- docker exec -it NodeZ nodetool viewbuildstatus

To create "transaction" use BATCH statement:

BEGIN BATCH
INSERT INTO ...;
UPDATE ...;
DELETE ...;
APPLY BATCH;

To import from file, use COPY command:

COPY users FROM 'users.csv' WITH HEADER=true;

Tips for performance optimization:
1. for multi-DC, use LocalQuorum on read, and TokenAware+DCAwareRoundRobin to prevent reading from nodes in a different DC
2. ALLOW FILTERING only for a small number of records with low cardinality (eg. values are only true vs false) -- 0 hops
3. global INDEX when the primary key doesn't need to be included and latency doesn't matter (2 hops)
4. local INDEX when the primary key can be included (1 hop)
5. MATERIALIZED VIEW when you want a different partition for the same data and storage doesn't matter
6. always use prepared statements (see the sketch below)
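
For tips 1 and 6, here's a minimal Go sketch using the gocql driver (assuming the docker-compose above and the my_keyspace/users schema; the datacenter name "DC1" is a placeholder). gocql prepares queries automatically and caches the prepared statement per node:

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Port = 19042 // the scylla port mapped in the docker-compose above
	cluster.Keyspace = "my_keyspace"
	cluster.Consistency = gocql.LocalQuorum // tip 1: local quorum reads
	// tip 1: token-aware + DC-aware routing ("DC1" is a placeholder DC name)
	cluster.PoolConfig.HostSelectionPolicy =
		gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("DC1"))
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// tip 6: gocql prepares this statement once and reuses the prepared id
	err = session.Query(`INSERT INTO users (fname, lname, title) VALUES (?, ?, ?)`,
		"A", "B", "C").Exec()
	if err != nil {
		log.Fatal(err)
	}

	var title string
	err = session.Query(`SELECT title FROM users WHERE lname = ? AND fname = ?`,
		"B", "A").Scan(&title)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(title)
}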

2022-02-22

C# vs Go in Simple Benchmark

Today we're gonna retry two of my favorite languages in an associative array and comb sort benchmark (compile and run, not just runtime performance, because the time spent waiting for compilation during development also matters), like in the past benchmark. For installing DotNet:

wget https://packages.microsoft.com/config/ubuntu/21.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt install apt-transport-https
sudo apt-get update
sudo apt-get install -y dotnet-sdk-6.0 aspnetcore-runtime-6.0

For installing Golang:

sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt-get update
sudo apt install -y golang-1.17

Result (best of 3 runs):

cd assoc; time dotnet run
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.16s     Real: 14.41s    RAM: 1945904KB

cd assoc; time go run map.go
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.80s     Real: 12.01s    RAM: 2305384KB

This is a bit weird; usually I see Go use less memory but run slower, but in this benchmark it's C# that uses less memory while being a bit slower (14.41s vs 12.01s), possibly because the compilation speed is also included.

cd num-assoc; time dotnet run
CPU: 2.21s      Real: 2.19s     RAM: 169208KB

cd num-assoc; time go run comb.go
CPU: 0.46s      Real: 0.44s     RAM: 83100KB

What if we increase the N from 1 million to 10 million?

cd num-assoc; time dotnet run
CPU: 19.25s     Real: 19.16s    RAM: 802296KB

cd num-assoc; time go run comb.go
CPU: 4.60s      Real: 4.67s     RAM: 808940KB
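
For reference, the comb sort part looks roughly like this (a minimal sketch, not necessarily identical to comb.go in the repo; the gap shrink factor ~1.3 is the textbook value and the spot-check output is mine):

package main

import (
	"fmt"
	"math/rand"
)

// comb sort: bubble sort with a shrinking gap (shrink factor ~1.3)
func combSort(a []int) {
	gap := len(a)
	swapped := true
	for gap > 1 || swapped {
		gap = gap * 10 / 13 // divide the gap by ~1.3 each pass
		if gap < 1 {
			gap = 1
		}
		swapped = false
		for i := 0; i+gap < len(a); i++ {
			if a[i] > a[i+gap] {
				a[i], a[i+gap] = a[i+gap], a[i]
				swapped = true
			}
		}
	}
}

func main() {
	const N = 1_000_000 // 10_000_000 for the second run
	a := make([]int, N)
	for i := range a {
		a[i] = rand.Intn(N)
	}
	combSort(a)
	fmt.Println(a[0], a[N/2], a[N-1]) // spot-check ordering
}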

If you want to contribute (if I made a mistake when coding the C# or Go version of the algorithm, or if there's a more efficient data structure), just fork and create a pull request, and I will redo the benchmark.