2022-04-05

Start/restart Golang or any other binary program automatically on boot/crash

There are several alternatives to make a program start on boot on Linux; the usual ways are:

1. SystemD: it can ensure dependencies are started before your service, and can also limit your CPU/RAM usage. Generate a template using this website or use kardianos/service

2. PM2 (requires NodeJS), or PMG

3. docker-compose (requires docker, but you can skip the build part and just copy the binary directly in the Dockerfile COPY command, which can be deployed using rsync); just set the restart property on docker-compose and it would restart when the computer boots -- the bad part: you cannot limit cpu/ram unless using docker swarm. But you can use docker directly to set limits and use the --restart flag.

4. lxc/lxd or multipass or other vm/lightweight vm (but you still need systemd inside it XD, at least it won't ruin your host); you can rsync directly to the container to redeploy, for example using overseer or tableflip; you must add a reverse proxy or NAT or proper routing/ip forwarding tho if you want it to be accessed from outside

5. supervisord (python) or ochinchina/supervisord (golang), tutorial here

6. create one daemon manager with systemd/docker-compose, then spawn the other services using goproc or pioz/god

7. monit: it can monitor and ensure a program is started/not dead

8. nomad (actually this one is a deployment tool, but it can also manage workloads)

9. kubernetes XD overkill

10. immortal.run, a supervisor; this one actually uses systemd

11. other containerization/VM workload orchestrators/managers that are usually already provided by the hoster/PaaS provider (Amazon ECS/Beanstalk/Fargate, Google AppEngine, Heroku, Jelastic, etc)


This is the systemd unit file that I usually use (you need to create a user named "web" and install "unbuffer"):

$ cat /usr/lib/systemd/system/xxx.service
[Unit]
Description=xxx
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
Type=simple
Restart=on-failure
User=web
Group=users
WorkingDirectory=/home/web/xxx
ExecStart=/home/web/xxx/run_production.sh
ExecStop=/usr/bin/killall xxx
LimitNOFILE=2097152
LimitNPROC=65536
ProtectSystem=full
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

$ cat /home/web/xxx/run_production.sh
#!/usr/bin/env bash

mkdir -p "$(pwd)/logs"
ofile="$(pwd)/logs/access_$(date +%F_%H%M%S).log"
echo "Logging into: $ofile"
unbuffer time ./xxx | tee "$ofile"



2022-04-04

Automatic Load Balancer Registration/Deregistration with NATS or FabioLB

Today we're gonna test 2 alternatives for automatic load balancing. Previously I always used Caddy or NginX with manual reverse proxy configuration (because most of my projects are single server -- the bottleneck is always the database, not the backend/compute part), but today we're gonna test 2 possible high-availability load-balancing strategies (without kubernetes of course): the first using NATS, the second using a standard load balancer, in this case FabioLB.

To use NATS, we're gonna use this strategy: the first thing we deploy is our custom reverse proxy, which should be able to convert any query string, form body with any kind of content-type, and any header if needed; we can use any serialization format (json, msgpack, protobuf, etc), but in this case we're just gonna use a normal string. We call this service "apiproxy". The apiproxy will send the serialized payload (from a map/object) into NATS using the request-reply mechanism. The other service is our backend "worker"/handler, which could be anything, but in this case it is our real handler that contains the business logic, so it needs to subscribe and return a reply to the apiproxy, which deserializes it back to the client with any serialization format and protocol (gRPC/Websocket/HTTP-REST/JSONP/etc). Here's the benchmark result of normal Fiber without any proxy, and apiproxy-nats-worker with a single NATS vs multiple NATS instances:

# no proxy
go run main.go apiserver
hey -n 1000000 -c 255 http://127.0.0.1:3000
  Average:      0.0011 secs
  Requests/sec: 232449.1716

# single nats
go run main.go apiproxy
go run main.go # worker
hey -n 1000000 -c 255 http://127.0.0.1:3000
  Average:      0.0025 secs
  Requests/sec: 100461.5866

# 2 worker
  Average:      0.0033 secs
  Requests/sec: 76130.4079

# 4 worker
  Average:      0.0051 secs
  Requests/sec: 50140.6288

# limit the apiserver CPU
GOMAXPROCS=2 go run main.go apiserver
  Average:      0.0014 secs
  Requests/sec: 184234.0106

# apiproxy 2 core
# 1 worker 2 core each
  Average:      0.0025 secs
  Requests/sec: 103007.4516

# 2 worker 2 core each
  Average:      0.0029 secs
  Requests/sec: 87522.6801

# 4 worker 2 core each
  Average:      0.0037 secs
  Requests/sec: 67714.5851

# seems that the bottleneck is spawning the producer's NATS
# spawning 8 connections using round-robin

# 1 worker 2 core each
  Average:      0.0021 secs
  Requests/sec: 121883.4324

# 4 worker 2 core each
  Average:      0.0030 secs
  Requests/sec: 84289.4330

# seems also the apiproxy is hogging all the CPU cores
# limiting to 8 core for apiproxy
# now synchronous handler changed into async/callback version
GOMAXPROCS=8 go run main.go apiserver

# 1 worker 2 core each
  Average:      0.0017 secs
  Requests/sec: 148298.8623

# 2 worker 2 core each
  Average:      0.0017 secs
  Requests/sec: 143958.4056

# 4 worker 2 core each
  Average:      0.0029 secs
  Requests/sec: 88447.5352

# limiting the NATS to 4 core using go run on the source
# 1 worker 2 core each
  Average:      0.0013 secs
  Requests/sec: 194787.6327

# 2 worker 2 core each
  Average:      0.0014 secs
  Requests/sec: 176702.0119

# 4 worker 2 core each
  Average:      0.0022 secs
  Requests/sec: 116926.5218

# same nats core count, increase worker core count
# 1 worker 4 core each
  Average:      0.0013 secs
  Requests/sec: 196075.4366

# 2 worker 4 core each
  Average:      0.0014 secs
  Requests/sec: 174912.7629

# 4 worker 4 core each
  Average:      0.0021 secs
  Requests/sec: 121911.4473 --> see update below


It could be better if it was tested on multiple servers, but it seems the bottleneck is the NATS connection when there are many subscribers; they could not scale linearly (16-66% overhead for a single API proxy) -- IT'S A BUG ON MY SIDE, SEE UPDATE BELOW. Next we're gonna try FabioLB with Consul; Consul is used for the service registry (it's a synchronous-consistent "database" like Zookeeper or Etcd). To install all of it, use these commands:

# setup:
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt install consul
go install github.com/fabiolb/fabio@latest

# start:
sudo consul agent -dev --data-dir=/tmp/consul
fabio
go run main.go -addr 172.17.0.1:5000 -name svc-a -prefix /foo -consul 127.0.0.1:8500

# benchmark:
# without fabio
  Average:      0.0013 secs
  Requests/sec: 197047.9124

# with fabio 1 backend
  Average:      0.0038 secs
  Requests/sec: 65764.9021

# with fabio 2 backend
go run main.go -addr 172.17.0.1:5001 -name svc-a -prefix /foo -consul 127.0.0.1:8500

# the bottleneck might be the cores, so we limit the cores to 2 for each worker
# with fabio 1 backend 2 core each
  Average:      0.0045 secs
  Requests/sec: 56339.5518

# with fabio 2 backend 2 core each
  Average:      0.0042 secs
  Requests/sec: 60296.9714

# what if we limit also the fabio
GOMAXPROCS=8 fabio

# with fabio 8 core, 1 backend 2 core each
  Average:      0.0042 secs
  Requests/sec: 59969.5206

# with fabio 8 core, 2 backend 2 core each
  Average:      0.0041 secs
  Requests/sec: 62169.2256

# with fabio 8 core, 4 backend 2 core each
  Average:      0.0039 secs
  Requests/sec: 64703.8253

All CPU cores were utilized at around 50% on a 32-core, 128GB RAM server; I can't find which part is the bottleneck for now, but for sure both strategies have around 16% vs 67% overhead compared to no proxy (which makes sense, because adding more layers adds more transport and more things to copy/transfer and transform/serialize-deserialize). The code used in this benchmark is here, in the 2022mid directory, and the code for fabio-consul registration was copied from eBay's github repository.

Why do we even need to do this? If we're using the api gateway pattern (one of the patterns used in my past company, but with Kubernetes on the worker part), we can deploy independently and communicate between services using the gateway (proxy) without knowing the IP address or domain name of the service itself; as long as it has the proper route and payload, it can be handled wherever the service is deployed. What if you want to do canary or blue-green deployment? You can just register a handler in nats or consul with a different route name (especially for communication between services, not public-to-service), and wait for all traffic to be moved there before killing the previous deployment.

So what should you choose? Both strategies require 3 moving parts (apiproxy-nats-worker, fabio-consul-worker), but the NATS strategy is simpler in development and can give better performance (especially if you make the apiproxy as flexible as possible), though it needs better serialization, since in this benchmark serialization was not measured; if you need better serialization performance you must use codegen, which may require you to deploy 2 times (one for apiproxy, one for worker, unless you split the raw response meta with jsonparser or use a map only for the apiproxy). The FabioLB strategy has more features, and you can also use Consul for service discovery (contacting other services directly by name without having to go thru FabioLB). The NATS strategy has some benefit in terms of security: the NATS cluster can be inside the DMZ, and workers can be on different subnets without the ability to connect to each other and it would still work, whereas if you use Consul to connect directly to another service, they must have a route or connection to access each other. The bad part about NATS is that you should not use it for file uploads, or it would hog a lot of resources; uploads should be handled by the apiproxy directly, then the reference to the uploaded file forwarded as payload to NATS. You can check NATS traffic statistics using nats-top.

What's next? Maybe we can try Traefik, a load balancer with built-in service discovery in one binary; it can also use Consul.

UPDATE: by changing the code from Subscribe (broadcast/fan-out) to QueueSubscribe (load balance), it has similar performance with 1/2/4 subscribers, so we can use NATS for high availability/fault tolerance in the api gateway pattern at the cost of 16% overhead.

TL;DR

no LB: 232K rps
-> LB with NATS request-reply: 196K rps (16% overhead)
no LB: 197K rps
-> LB with Fabio+Consul: 65K rps (67% overhead)

 



2022-04-01

Georeplicable Architecture Tech Stacks

Since I got a use case for geo-replication (multiple datacenters around the world), I need to create metrics, billing, and log aggregators that survive even when multiple datacenters are down. So we need to create an architecture where each cluster of databases in a datacenter can give correct answers on its own; the layers are:

1. Cache Layer

2. Source of Immediate Truth (OLTP) Layer

3. Source of Event Truth Layer (either CDC or subscribable events or both)

4. Metrics and Analytics (OLAP) Layer

For a single datacenter I might choose Tarantool or Aerospike (too bad that if you want to store more than your RAM you must use the paid version) or Redis (master-slave) for the cache layer, especially if the DNS is smart enough to route the client's request to the nearest datacenter.

Tarantool can also be the source of immediate truth, but the problem is it needs master-slave replication to be fault tolerant, and it seems manual intervention may be needed to promote a slave node to master and reroute the other slaves to recognize the new master.

So the other choice is either TiDB (for consistent use case, lower write rate, more complex query use case) or ScyllaDB (for partition tolerant use case, higher write rate). Both are good in terms of availability. TiDB's TiFlash also good for analytics use case.

For source of event truth, we can use RedPanda for pubsub/MQ use cases, and Debezium (that requires Kafka/RedPanda) for change data capture from ScyllaDB, or TiCDC for TiDB to stream to RedPanda.

Lastly for analytics, we can use Clickhouse, also to store all structured logs that can be easily queried, or can also be Loki. For metrics might aggregate from RedPanda using MaterializeIO (too bad that cluster is paid).

So.. what are the combinations possible?

1. Tarantool (manual counters for metrics) + Clickhouse (manually publish logs and analytics): this one is good only for a single location/datacenter, unless all clients can hit the proper server location (like game servers, or smart CDNs)

2. same as #1 but with RedPanda if you have multiple services; all logs and events published thru RedPanda

3. Aerospike/Tarantool/Redis (manual counter for metrics) + TiDB + TiFlash (analytics) + Loki for logs

4. Aerospike/Tarantool/Redis (manual counter for metrics) + TIDB + TiCDC + ClickHouse (for logs and analytics)

5. Aerospike/Tarantool/Redis (cache only) + TiDB + TiFlash (analytics) + TiCDC + MaterializeIO (for metrics) + Loki (logs)

6. Aerospike/Tarantool/Redis (cache only) + TiDB + TiCDC + Clickhouse (analytics and logs) + MaterializeIO (for metrics)

7. Aerospike/Tarantool/Redis (cache only) + ScyllaDB + Debezium + RedPanda + Clickhouse (analytics and logs) + MaterializeIO (for metrics) 

for numbers #5-#7 you can remove the Aerospike/Tarantool part if there's no need to cache (consistency matters, and you have a very large cluster that can handle peaks).

wait, why don't you include the full-text search use case? '__') ok, we can use Debezium to publish to ElasticSearch (might be overkill), or manually publish to TypeSense or MeiliSearch.

That's only for data part of the cluster, what about computation and presentation layer (backend/frontend)?

Backend of course I will use Golang or C#, frontend either Svelte (web only) or Flutter (mobile and web), Unity3D (game).

For object storage (if locally I would use MinIO) and CDN (CloudFlare for example), you can see my previous post.

So why you choose those?

1. Tarantool, one of the fastest in-memory databases, like 200K wps/qps, but with the cons I already mentioned above; the performance is similar to Redis, but this one supports SQL.

2. Aerospike, also one of the fastest in-memory databases, also around 200K wps/qps last time I checked; it can also do master-master replication, if I'm not mistaken limited to 4 nodes for the free version, and you can set the replication factor, but see the other cons I mentioned above.

3. TiDB, a newsql database with automatic rebalancing (one node dies, and it still works fast), can do around 30-50K rps on a single node last time I benchmarked; their benchmarks mostly show 30K-40K rps for writes, 120K rps for mixed read-write, 330K rps for read-only multi-node benchmarks. Need more space? Add more TiKV instances; need more compute? Add more TiDB instances; need to do faster analytics queries? Add a TiFlash instance. But.. the bad part is, so many moving parts: you have to deploy TiDB (query engine), TiKV (storage), and PD (placement driver), also TiCDC and TiFlash if you need those too; also I'm not sure how it would perform in multi-DC use-cases. What's the managed alternative? AlloyDB-PostgreSQL (GCP) or TiDB-Cloud (AWS)

4. ScyllaDB, a faster version of Cassandra; most benchmarks show 120K-400K inserts and queries per second. One of the best databases for multi-DC: each keyspace can be set to replicate to specific datacenters with a given replication factor, consistency is controlled client-side, and it has supported materialized views for about 2 years now, so we don't have to create and maintain multiple tables manually for each query pattern. What's the managed alternative? ScyllaDB-Cloud

5. RedPanda, a faster version of Kafka; last time I checked, one instance can receive 800K events per second and publish 4 million events per second.

6. Clickhouse, one of the best analytics databases; it can do 1M batched ingestions per second on a single node and run complex queries really fast, under a second (depending on how big your data is, but the kind of query that would take minutes in a normal RDBMS); the con is that one node can only handle 100 concurrent queries.

7. MaterializeIO, is like ksqldb but written in Rust, haven't checked the performance but they claimed they perform as fast as Redis.

What's the other alternative? Yugabyte looks ok, especially the YCQL part that works like Cassandra/ScyllaDB. Yugabyte especially seems to combine Redis, Postgres, and Cassandra in one deployment, but I like TiDB more because last time I checked Yugabyte, I needed to configure something to make it writable when one node died.

Proxy and load balancer? Caddy, Traefik, FabioLB (with Consul), or NATS (eg. 3 load balancers/custom api gateways deployed in front, which serialize requests into NATS inside the DMZ; the worker/handler receives them and returns a response that is deserialized back by the load balancer/api gateway. That way the load balancer doesn't need to know exactly how many workers/handlers there are, services can also communicate synchronously thru NATS without knowing each other's IP address, and the worker/handler part can be scaled independently)

Deployment? Nomad, just rsync it to Jelastic, DockerCompose for dependencies (don't forget to bind the volume or your data will be gone), Kubernetes

Tracker? maybe Pirsch

Why there's no __INSERT_DB_TYPE_HERE__? Graph database? because I rarely do recursive/graph queries, and especially I don't know which one that are best? Neo4J? DGraph? NebulaGraph? TigerGraph? AnzoGraph? TerminusDB? Age? JanusGraph? HugeGraph?

Have some other cool/high-performance tech stack suggestion? Chat with me at http://t.me/kokizzu (give proper intro tho, or I would think you are a crypto scam spammer XD)

2022-03-31

Getting started with Ansible

Ansible is one of the most popular server automation tools (other than Pulumi and Terraform); it's agentless and only needs SSH access to run. It can also help you provision servers or VM instances using cloud modules. You can also provision vagrant/virtualbox/qemu/docker/lxc/containers inside an already running server using Ansible. Other competitors in this category include Puppet and Chef, but they both require an agent to be installed inside the nodes that you want to control. To install Ansible in Ubuntu, run:

sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible


You can put list of servers you want to control in /etc/ansible/hosts
or any other inventory file, something like this:

[DC01]
hostname.or.ip

[DC02]
hostname.or.ip
hostname.or.ip

[DC01:vars]
ansible_user=foo
ansible_pass=bar
# it's preferred to use ssh-keygen ssh-copy-id (passwordless login)
# and sudoers set to ALL=(ALL) NOPASSWD:ALL for the ansible user
# instead of hardcoding the username and password

If you put it in another file, you can use -i inventoryFileName to set the inventory file; also don't forget to check the /etc/ansible/ansible.cfg default configs, for example you can set the default inventory file to another file there.

Example of checking whether all servers on DC02 are up:

ansible DC02 -m ping

To run an arbitrary command on all servers in DC01:

ansible DC01 -a "cat /etc/lsb-release" 
# add -f N, to run N forks in parallel

To create a set of commands, we can create a playbook file (which contains one or more plays, each with one or more tasks); it's just a yaml file with a specific structure, something like this:

---
  - name: make sure all tools installed # a play
    hosts: DC01 # or all or "*" or localhost
    become: yes # sudo
    tasks:
      - name: make sure micro installed # a task
        apt: # a module
          name: micro
          state: latest
      - name: make sure golang ppa added
        ansible.builtin.apt_repository:
          repo: deb http://ppa.launchpad.net/longsleep/golang-backports/ubuntu/ focal main
      - name: make sure latest golang installed
        apt: name=golang-1.18 state=present # absent to uninstall
      - name: make sure docker gpg key installed
        raw: 'curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o - > /usr/share/keyrings/docker-archive-keyring.gpg'
      - name: make sure docker-ce repo added
        ansible.builtin.apt_repository:
          repo: 'deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu impish stable'
      - name: make sure docker-ce installed
        apt: name=docker-ce

That playbook, for example if you save it as playbooks/ensure-tools-installed.yml, can be run using: ansible-playbook playbooks/ensure-tools-installed.yml

How do you find the list of modules and their key-value options? Visit this site: https://docs.ansible.com/ansible/2.9/modules/modules_by_category.html


2022-03-26

Move docker Data Directory to Another Partition

My first NVMe (I hate this brand S*****P****) sometimes hangs periodically (marking the filesystem readonly), causing 50-100% CPU usage where nothing more can be done. So I had to move some parts of it to another NVMe drive; here's how I moved docker to another partition

sudo systemctl stop docker

then you can edit

sudo vim /etc/docker/daemon.json
# where /media/asd/nvme2 is the mount point of your other partition
# add something like this
{
  "dns": ["8.8.8.8","1.1.1.1"],
  "data-root": "/media/asd/nvme2/docker"
}

Copy the docker data directory to the new partition (no need to mkdir the docker folder first):

sudo rsync -aP --progress /var/lib/docker/ /media/asd/nvme2/docker
sudo mv /var/lib/docker /var/lib/docker.backup

Then try to start again the docker service:

sudo systemctl start docker 
sudo systemctl status docker 

If it all works, you can delete the backup of original data directory.

2022-03-20

1 million Go goroutine vs C# task

Let's compare 1 million goroutines with 1 million tasks: which one is more efficient in cpu usage and memory usage? The code is forked from kjpgit's techdemo

Name     UserCPU    SysCPU     AvgRSS       MaxRSS    Wall
c#_1t      38.96    0.68      582,951      636,136    1:00
c#_2t      88.33    0.95      623,956      820,620    1:02
c#_4t     142.86    1.09      687,365      814,028    1:03
c#_8t     235.80    1.71      669,882      820,704    1:05
c#_16t    434.76    4.01      734,545      771,240    1:08
c#_32t    717.39    4.81      720,235      769,888    1:11
go_1t      58.77    0.65    2,635,380    2,635,632    1:04
go_2t      64.48    0.71    2,639,206    2,642,752    1:00
go_4t      72.55    1.42    2,651,086    2,654,972    1:00
go_8t      80.87    2.82    2,641,664    2,643,392    1:00
go_16t     83.18    4.03    2,673,404    2,681,100    1:00
go_32t     86.65    4.30    2,645,494    2,657,580    1:00


Apparently the result is as expected because all tasks/goroutines are spawned first before processing: Go's scheduler is more efficient in CPU usage, but the C# runtime is more efficient in memory usage, which is normal tho, because goroutines require a minimum of 2KB overhead per goroutine, a way higher cost than spawning a task. What if we increase to 10 million tasks/goroutines, and let the spawning be done in another task/goroutine, so when a goroutine is done it can return memory back to the GC? Here's the result:

Name     UserCPU    SysCPU     AvgRSS        MaxRSS    Wall
c#_1t      12.78    1.28    2,459,190     5,051,528    0:13
c#_2t      22.60    1.54    2,692,439     5,934,796    0:18
c#_4t      42.09    1.54    2,370,239     5,538,280    0:21
c#_8t      88.54    2.29    2,522,053     6,334,176    0:29
c#_16t    204.39    3.32    2,395,001     5,803,808    0:34
c#_32t    259.09    3.25    1,842,458     4,710,012    0:28
go_1t      13.97    0.97    4,514,200     6,151,088    0:14
go_2t      12.35    1.51    5,595,418     9,506,076    0:07
go_4t      22.09    2.40    6,394,162    12,517,848    0:07
go_8t      31.00    3.09    7,115,281    13,428,344    0:06
go_16t     40.32    3.52    7,126,851    13,764,940    0:06
go_32t     58.58    3.58    7,104,882    12,145,396    0:06

The results seem normal: the high memory usage is caused by a lot of goroutines being spawned at the same time in different threads, not blocking the main thread, but after they're done, they get collected by the GC (previously it was a time-based exit condition; this time it exits after all processing is done, since I moved the sleep before the atomic increment). What if we lower it back to 1 million, but with the same exit rule, spawning executed in a different task/goroutine, and checking completion every 1s? Here's the result:

Name    UserCPU SysCPU    AvgRSS     MaxRSS    Wall
c#_1t    1.18    0.16     328,134     511,652    0:02
c#_2t    2.18    0.22     294,608     554,488    0:02
c#_4t    3.19    0.20     305,336     554,064    0:02
c#_8t    7.77    0.31     292,281     530,368    0:02
c#_16t  12.33    0.25     304,352     569,460    0:02
c#_32t  37.90    1.25     337,837     684,252    0:03
go_1t    2.72    0.42   1,592,978   2,519,040    0:03
go_2t    3.04    0.47   1,852,084   2,637,532    0:03
go_4t    3.65    0.54   1,936,626   2,637,272    0:03
go_8t    3.27    0.59   1,768,540   2,655,208    0:02
go_16t   4.01    0.71   1,770,673   2,664,504    0:02
go_32t   4.96    0.72   1,770,354   2,669,244    0:02

The difference in processing time is negligible, but the CPU usage and memory usage contrast quite a bit. Next, let's try to spawn in bursts (100K per second), so we add a 1 second sleep every 100K-th task/goroutine, since it's not quite realistic even for a DDOS'ed server to receive that much (unless the server is finely tuned). Here's the result:

Name    UserCPU  SysCPU    AvgRSS     MaxRSS     Wall
c#_1t     0.61    0.08    146,849     284,436    0:05
c#_2t     1.17    0.10    131,778     261,720    0:05
c#_4t     1.53    0.08    133,505     289,584    0:05
c#_8t     4.17    0.15    131,924     284,960    0:05
c#_16t   10.94    0.68    135,446     289,028    0:05
c#_32t   19.86    3.01    130,533     284,924    0:05
go_1t     1.84    0.24    731,872   1,317,796    0:06
go_2t     1.87    0.26    659,382   1,312,220    0:05
go_4t     2.00    0.30    661,296   1,322,152    0:05
go_8t     2.37    0.34    660,641   1,324,684    0:05
go_16t    2.82    0.39    660,225   1,323,932    0:05
go_32t    3.36    0.45    659,176   1,327,264    0:05

And for 5 millions:

Name    UserCPU    SysCPU    AvgRSS       MaxRSS    Wall
c#_1t     3.39    0.24      309,103      573,772    0:11
c#_2t     8.30    0.26      278,683      553,592    0:11
c#_4t    13.65    0.32      274,679      658,104    0:11
c#_8t    23.20    0.46      286,336      641,376    0:12
c#_16t   45.85    1.32      286,311      640,336    0:12
c#_32t   64.83    2.46      264,866      615,552    0:12
go_1t     6.25    0.50    1,397,434    2,629,936    0:13
go_2t     6.20    0.56    1,386,336    2,631,580    0:11
go_4t     7.52    0.65    1,410,523    2,625,308    0:11
go_8t     8.21    0.86    1,441,080    2,779,456    0:11
go_16t   11.17    0.96    1,436,220    2,687,908    0:11
go_32t   12.97    1.06    1,430,573    2,668,816    0:11

And for 25 millions:

Name    UserCPU   SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     15.94    0.69     590,411    1,190,340    0:24
c#_2t     34.88    0.84     699,288    1,615,372    0:32
c#_4t     59.95    0.89     761,308    1,794,116    0:34
c#_8t    100.64    1.36     758,161    1,845,944    0:36
c#_16t   199.56    2.99     765,791    2,014,856    0:38
c#_32t   332.02    4.07     811,809    1,972,400    0:41
go_1t     21.76    0.71   2,846,565    4,413,968    0:29
go_2t     25.77    1.03   2,949,433    5,553,608    0:25
go_4t     28.74    1.24   2,920,447    5,800,088    0:24
go_8t     37.28    1.96   2,869,074    5,502,776    0:23
go_16t    43.46    2.67   2,987,114    5,769,356    0:24
go_32t    43.77    2.92   3,027,179    5,867,084    0:24

How about 25 millions and sleep per 200K?

Name    UserCPU   SysCPU      AvgRSS       MaxRSS    Wall
c#_1t     18.47    0.91      842,492    1,820,788    0:22
c#_2t     40.32    0.93    1,070,555    2,454,324    0:31
c#_4t     62.39    1.16    1,103,741    2,581,476    0:33
c#_8t    100.84    1.34    1,074,820    2,377,580    0:34
c#_16t   218.26    2.91    1,062,642    2,726,700    0:37
c#_32t   339.00    6.51    1,042,254    2,275,644    0:40
go_1t     22.61    0.88    3,474,195    5,071,944    0:27
go_2t     25.83    1.20    3,912,071    6,964,640    0:20
go_4t     37.98    1.68    4,180,188    7,392,800    0:20
go_8t     38.56    2.44    4,189,265    8,481,852    0:18
go_16t    44.49    3.19    4,187,142    8,483,236    0:18
go_32t    48.82    3.44    4,218,591    8,424,200    0:18

And lastly 25 millions and sleep per 400K?

Name    UserCPU    SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     18.66    0.98    1,183,313    2,622,464    0:20
c#_2t     41.27    1.14    1,326,415    3,155,948    0:31
c#_4t     67.21    1.11    1,436,280    3,015,212    0:33
c#_8t    107.14    1.56    1,492,179    3,378,688    0:35
c#_16t   233.50    2.45    1,498,421    3,732,368    0:41
c#_32t   346.87    3.74    1,335,756    2,882,676    0:39
go_1t     24.13    0.82    4,048,937    5,099,220    0:26
go_2t     28.85    1.41    4,936,677    8,023,568    0:18
go_4t     31.51    1.95    5,193,653    9,537,080    0:14
go_8t     45.27    2.65    5,461,107    9,499,308    0:14
go_16t    53.43    3.19    5,183,009    9,476,084    0:14
go_32t    61.98    3.86    5,589,156   10,587,788    0:14

How to read the results above? Wall = how much time needed to complete, lower is better; AvgRSS/MaxRSS = average/max memory usage, lower is better; UserCPU = CPU time used in percent, >100% means more than 1 full core of compute time was used, lower is better. Versions used in this benchmark:

go version go1.17.6 linux/amd64
dotnet --version
6.0.201

2022-03-17

Getting started with Cassandra or ScyllaDB

Today we're gonna learn Cassandra (it's been 5 years since I last used ScyllaDB, the C++ version of Cassandra); to install Cassandra and ScyllaDB, you can use this docker-compose:

version: '3.3'
services:
  testcassandra:
    image: cassandra:3.11 # or latest
    environment:
      - HEAP_NEWSIZE=256M
      - MAX_HEAP_SIZE=1G
      - "JVM_OPTS=-XX:+PrintGCDateStamps"
      - CASSANDRA_BROADCAST_ADDRESS
    ports:
      - "9042:9042"
  testscylla:
    image: scylladb/scylla:4.5.1 # because latest 4.6 is broken
    command: --smp 2 --memory 1G --overprovisioned 1 --api-address 0.0.0.0 --developer-mode 1
    ports:
      - 19042:9042
#      - 9142:9142
#      - 7000:7000
#      - 7001:7001
#      - 7199:7199
#      - 10000:10000
#  scylla-manager:
#    image: scylladb/scylla-manager
#    depends_on:
#      - testscylla
 
Using docker we can spawn multiple nodes to test NetworkTopologyStrategy, consistency level, and replication factor (or even multiple datacenters):
 
docker run --name NodeX -d scylladb/scylla:4.5.1
docker run --name NodeY -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"
docker run --name NodeZ -d scylladb/scylla:4.5.1 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' NodeX)"

docker exec -it NodeZ nodetool status
# wait for UJ (up joining) became UN (up normal)

Since I failed to run the latest ScyllaDB, we use 4.5. To install cqlsh locally, you can use these commands:

pip3 install cqlsh
cqlsh 127.0.0.1 9042 # cassandra
cqlsh 127.0.0.1 19042 # scylladb

node=`docker ps | grep /scylla: | head -n 1 | cut -f 1 -d ' '`
docker exec -it $node cqlsh # using cqlsh inside scylladb
# ^ must wait 30s+ before docker ready

docker exec -it $node nodetool status
# ^ show node status

As we already know, Cassandra is a wide-column database, so we have to define a partition key (which determines where rows will be located) and a clustering key (the ordering of data inside the partition); the SSTable part works similarly to Clickhouse merges.

To create a keyspace (much like a "database", a collection of tables, but where we can also set the replication strategy), use this command:

CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- ^ single node
-- {'class' : 'NetworkTopologyStrategy', 'replication_factor': '3'};
-- ^ multiple node but in a single datacenter and/or rack

-- {'class' : 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'};
-- ^ multiple datacenter

USE my_keyspace;

CONSISTENCY; -- how many read/write ack
-- ANY
-- ONE, TWO, THREE
-- LOCAL_ONE
-- QUORUM = replication_factor / 2 + 1
-- LOCAL_QUORUM
-- EACH_QUORUM -- only for write
-- ALL -- will fail if live nodes < replication_factor
CONSISTENCY new_level;

To create a table (the first column in PRIMARY KEY is the partition key, the rest are clustering/ordering keys):

CREATE TABLE users ( -- or TYPE for custom type, [keyspace.]
  fname text,
  lname text,
  title text,
  PRIMARY KEY (lname, fname)
);
DESCRIBE TABLE users; -- only for 4.0+
CREATE TABLE foo (
  pkey text,
  okey text,
  PRIMARY KEY ((pkey), okey) -- different partition and ordering
  -- add WITH CLUSTERING ORDER BY (okey DESC) for descending
); -- add WITH cdc = {'enabled': true, 'preimage': true}
DESC SCHEMA; -- show all tables and materialized views


To upsert, use insert or update command (last write wins):

INSERT INTO users (fname, lname, title)
VALUES ('A', 'B', 'C');
INSERT INTO users (fname, lname, title)
VALUES ('A', 'B', 'D'); -- add IF NOT EXISTS to prevent replace
SELECT * FROM users; -- USING TIMEOUT XXms

UPDATE users SET title = 'E' WHERE fname = 'A' AND lname = 'C';
SELECT * FROM users; -- add IF EXISTS to prevent insert

-- INSERT INTO users( ... ) VALUES ( ... ) USING TTL 600
-- UPDATE users USING TTL 600 SET ...
-- SELECT TTL(fname) FROM users WHERE ...
-- set TTL to 0 to remove TTL
-- column will be NULL if TTL became 0
-- whole row will be deleted if all non-PK column TTLs are zero
-- ALTER TABLE users WITH default_time_to_live = 3600;

-- SELECT * FROM users LIMIT 3
-- SELECT * FROM users PER PARTITION LIMIT 2
-- SELECT * FROM users PER PARTITION LIMIT 1 LIMIT 3

CREATE TABLE stats(city text PRIMARY KEY, total COUNTER);
UPDATE stats SET total = total + 6 WHERE city = 'Kuta';
SELECT * FROM stats;

To change the schema, use usual alter table command:

ALTER TABLE users ADD mname text;
-- tinyint, smallint, int, bigint (= long)
-- varint (= the real bigint, arbitrary precision)
-- float, double
-- decimal
-- text/varchar, ascii
-- timestamp
-- date, time
-- uuid
-- timeuuid (with mac address, conflict free, set now())
-- boolean
-- inet
-- counter
-- set<type> (set {val,val}, +{val}, -{val})
-- list<type> (set [idx]=, [val,val], +[], []+, -[], DELETE [idx])
-- map<type,type> (set {key: val}, [key]=, DELETE [key] FROM)
-- tuple<type,...> (set (val,...))

SELECT * FROM users;

UPDATE users SET mname = 'F' WHERE fname = 'A' AND lname = 'D';
-- add IF col=val to prevent update (aka lightweight transaction)
-- IF NOT EXISTS
SELECT * FROM users;
 
Complex nested type example from this page:

CREATE TYPE phone (
    country_code int,
    number text
);
CREATE TYPE address (
  street text,
  city text,
  zip text,
  phones map<text, frozen<phone>> -- must be frozen, cannot be updated
);
CREATE TABLE pets_v4 (
  name text PRIMARY KEY,
  addresses map<text, frozen<address>>
);
INSERT INTO pets_v4 (name, addresses)
  VALUES ('Rocky', {
    'home' : {
      street: '1600 Pennsylvania Ave NW',
      city: 'Washington',
      zip: '20500',
      phones: {
        'cell' : { country_code: 1, number: '202 456-1111' },
        'landline' : { country_code: 1, number: '202 456-1234' }
      }
    },
    'work' : {
      street: '1600 Pennsylvania Ave NW',
      city: 'Washington',
      zip: '20500',
      phones: { 'fax' : { country_code: 1, number: '202 5444' } }
    }
  });

 
To create an index (since Cassandra only allows retrieving by partition and clustering key, or doing a full scan):

CREATE INDEX ON users(title); -- global index (2 hops per query)
SELECT * FROM users WHERE title = 'E';
DROP INDEX users_title_idx;
SELECT * FROM users WHERE title = 'E' ALLOW FILTERING; -- full scan
CREATE INDEX ON users((lname),title); -- local index, 1 hop per query (ScyllaDB only)

To create a materialized view (which works similarly to ClickHouse's materialized view):

CREATE MATERIALIZED VIEW users_by_title AS
SELECT * -- ALTER TABLE will automatically add to this VIEW too
FROM users
WHERE title IS NOT NULL
  AND fname IS NOT NULL
  AND lname IS NOT NULL
PRIMARY KEY ((title),lname,fname);
SELECT * FROM users_by_title;
INSERT INTO users(lname,fname,title) VALUES('A','A','A');
SELECT * FROM users_by_title WHERE title = 'A';
DROP MATERIALIZED VIEW users_by_title;
-- docker exec -it NodeZ nodetool viewbuildstatus

To create a "transaction", use the BATCH statement:

BEGIN BATCH
INSERT INTO ...;
UPDATE ...;
DELETE ...;
APPLY BATCH;

To import from file, use COPY command:

COPY users FROM 'users.csv' WITH HEADER=true;

Tips for performance optimization:
1. for multi-DC, use LocalQuorum on read, and TokenAware+DCAwareRoundRobin to prevent reading from nodes in a different DC
2. ALLOW FILTERING only for a small number of records or low cardinality columns (eg. values are only true vs false) -- 0 hops
3. global INDEX when the primary key doesn't need to be included, and latency doesn't matter (2 hops)
4. local INDEX when the partition key can be included (1 hop)
5. MATERIALIZED VIEW when you want a different partition key for the same data, and storage doesn't matter
6. always use prepared statements

2022-02-22

C# vs Go in Simple Benchmark

Today we're gonna retry two of my favorite languages in an associative array and comb sort benchmark (compile and run, not just runtime performance, because the compilation time a developer waits for also matters), like in the past benchmark. For installing DotNet:

wget https://packages.microsoft.com/config/ubuntu/21.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt install apt-transport-https
sudo apt-get update
sudo apt-get install -y dotnet-sdk-6.0 aspnetcore-runtime-6.0

For installing Golang:

sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt-get update
sudo apt install -y golang-1.17

Result (best of 3 runs)

cd assoc; time dotnet run
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.16s     Real: 14.41s    RAM: 1945904KB

cd assoc; time go run map.go
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.80s     Real: 12.01s    RAM: 2305384KB

This is a bit weird: usually I see Go use less memory but run slower, but in this benchmark it's C# that uses less memory while being a bit slower (14.41s vs 12.01s), possibly because compilation time is also included.

cd num-assoc; time dotnet run
CPU: 2.21s      Real: 2.19s     RAM: 169208KB

cd num-assoc; time go run comb.go
CPU: 0.46s      Real: 0.44s     RAM: 83100KB

What if we increase the N from 1 million to 10 million?

cd num-assoc; time dotnet run
CPU: 19.25s     Real: 19.16s    RAM: 802296KB

cd num-assoc; time go run comb.go
CPU: 4.60s      Real: 4.67s     RAM: 808940KB

If you want to contribute (eg. if I made a mistake coding the C# or Go version of the algorithm, or if there's a more efficient data structure), just fork and create a pull request, and I will redo the benchmark.

2022-01-28

Getting started with Kubernetes

Since containerization got more popular, Kubernetes gained more traction than plain single-VM deployment. In a previous post I explained why or when you don't need Kubernetes and when you need it. From a deployment perspective we can categorize the options into 5 types (based on ownership, initial cost, granularity of the recurring cost, and the need for capacity planning):

1. On-Premise Dedicated Server, your own server, your own rack, or put it in a colocation; we own the hardware, we have to replace it when it breaks, and we have to maintain the network part (switches, routers). Usually this one is the best choice for internal services (software used only by internal staff), from the security and bandwidth perspective.

2. VM, we rent the "cloud" infrastructure, which can be considered IaaS (Infrastructure as a Service): we rent a virtual machine/server (sometimes named Virtual Private/Dedicated Server), so we pay monthly while the server is turned on (or based on contract, X per month/year), and sometimes also for bandwidth (unless the provider has unmetered billing). Some notable products in this category: Google Compute Engine, Amazon EC2, Azure VM, Contabo VPS/VDS, etc. Usually this one is the best for databases (unless you are using a managed database service) or other stateful applications, or when the maximum number of users is limited and can be estimated with capacity planning (not the whole world will be accessing this VM), because it is a bit insane to have your data moving/replicating automatically on high load; for stateful/database workloads, manually triggered or scheduled scale up/out/down/in is still the better option (eg. scale up/out before black friday, scale down/in when it ends).

3. Kubernetes, we rent or use managed kubernetes, or install kubernetes on top of our own on-premise dedicated servers. Usually the company will rent 3 huge servers (64 cores, 256GB RAM, very large harddisks) and let developers deploy containers/pods inside the kubernetes themselves, split by team or service namespace. This has a constant cost (those 3 huge VMs/on-prems, plus the managed kubernetes service's cost), and some providers also offer automatic node scale-out (so the kubernetes nodes/VMs [where the pods will be placed] can be spawned/deleted based on load). Some notable products in this category: GKE, Amazon EKS, AKS, DOKS, Jelastic Kubernetes Cluster, etc. This one is best if you have a large number of services. For a truly self-managed alternative: Mesos, Docker Swarm, or Nomad combined with some other services can work (since they only manage a fraction of Kubernetes' features: eg. Consul for service discovery, FabioLB for load balancing, Vault for secrets management, etc). The difference between this and number 4 below is that you still mostly have to provision the node VMs/hardware manually (though it can be automated with scripts as long as the provider supports it, eg. via Pulumi or the IaaS provider's provisioning/autoscaling group API).

4. Container Engine, in this type of deployment we use the infrastructure provider's platform, so we only need to supply a container without having to rent the VM manually; some providers deploy the container inside a single VM, others deploy it on a shared dedicated server/VM. All of them have the same feature, auto scale-out, but only some of them have auto scale-up. Some notable products in this category: Google AppEngine, Amazon ECS/Beanstalk/Fargate, Azure App Service, Jelastic Cloud, Heroku, etc. Usually this one is the best choice for most cases budget-wise and scalability-wise, especially if you have a small number of services or use a monolith; it can also be great for a large number of services if the billing is granular (pay only for the resources [CPU, RAM, disk] you utilize, not for the time the server is turned on) like in Jelastic.

5. Serverless/FaaS, we only need to supply the function (mostly based on a specific template) that will run on a specific event (eg. at a specific time like CRON, or when a load balancer receives a request, like in old CGI). Usually the function is put inside a container and kept as a standby instance, so scale-out only happens when it receives high load. If the function requires a database as a dependency, it's recommended to use a managed database that supports a high number of connections/connect-disconnects, or to offload to an MQ/PubSub service. Notable products in this category: Google CloudRun, AWS Lambda, Azure Functions, OpenFaaS, Netlify, Vercel, Cloudflare Workers, etc. We usually pay for this service based on CPU duration, number of calls, total RAM usage, bandwidth, and other metrics, so it is very cheap when the number of function calls is small, but can be really costly if you write inefficient functions or have a large number of calls. Usually lambdas are only used for handling spikes or as atomic CRON jobs.

Because of the hype, or because it fits their use case (a bunch of teams that want to do independent service deployments), and for the possibility of avoiding vendor lock-in, sometimes a company might decide to use kubernetes. But most companies can survive without following the hype, using only a managed database (or a database deployed on a VM, or even docker-compose with volume binding) + a container engine (for the scale-out strategy), without having to train everyone to learn Kubernetes and without a dedicated team/person to set up and manage it (security, policies, etc).

But today we're gonna try minikube, one of the fastest local kubernetes options for the development use-case (not for production).

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

minikube start

# use --driver=kvm2 or virtualbox if docker cannot connect internet
#sudo apt install virtualbox
#sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils virt-manager
#sudo adduser `id -un` libvirt
#sudo adduser `id -un` kvm

alias kubectl='minikube kubectl --'
alias k=kubectl

# will download kubectl if it's the first time
k

# get pods from all namespace
k get po -A

# open dashboard and authenticate
minikube dashboard 

# destroy minikube cluster
minikube ssh # sudo poweroff
minikube delete


Create the Dockerfile you want to deploy to the kubernetes cluster; or, if it's just a simple single-binary golang project, building locally and putting the binary into an alpine image (instead of building cleanly inside docker), then pushing to the image registry, will work just fine:

# build binary
CGO_ENABLED=0 GOOS=linux go build -o ./bla.exe

# create Dockerfile
echo '
FROM alpine:latest
WORKDIR /
COPY bla.exe .
CMD ./bla.exe # or ENTRYPOINT if you need to override
' > Dockerfile

# build docker image
VERSION=$(ruby -e 't = Time.now; print "v1.#{t.month+(t.year-2021)*12}%02d.#{t.hour}%02d" % [t.day, t.min]')
COMMIT=$(git rev-parse --verify HEAD)
APPNAME=local-bla
docker image build -f ./Dockerfile . \
  --build-arg "app_name=$APPNAME" \
  -t "$APPNAME:latest" \
  -t "$APPNAME:$COMMIT" \
  -t "$APPNAME:$VERSION"

# example how to test locally without kubernetes
docker image build -f ./Dockerfile -t docktest1 .
docker run --network=host \
  --env 'DBMAST=postgres://usr:pwd@127.1:5432/db1?sslmode=disable' \
  --env 'DBREPL=postgres://usr:pwd@127.1:5432/db1?sslmode=disable' \
  --env DEBUG=true \
  --env PORT=8082 \
  docktest1

# push image to minikube
minikube image load $APPNAME

# create deployment config
echo '
apiVersion: v1
kind: Pod
metadata:
  name: bla-pod
spec:
  containers:
    - name: bla
      image: bla
      imagePullPolicy: Never
      env:
      - name: BLA_ENV
        value: "ENV_VALUE_TO_INJECT"
      # if you need access to docker-compose outside the kube cluster
      # use minikube ssh, route -n, check the ip of the gateway
      # and use that ip as connection string
      # it should work, as long the port forwarded
  restartPolicy: Never
' > bla-pod.yaml

# deploy
kubectl apply -f bla-pod.yaml

# check
k get pods
k logs bla-pod

# delete deployment
kubectl delete pod bla-pod

If you need NewRelic log forwarding, it's as easy as adding a helm chart (it will automatically attach to new pods' logs and send them to NewRelic):

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add newrelic https://helm-charts.newrelic.com
helm search repo newrelic/
helm install newrelic-logging newrelic/newrelic-logging --set licenseKey=eu01xx2xxxxxxxxxxxxxxxRAL
kubectl get daemonset -o wide -w --namespace default newrelic-logging

The next step should be adding a load balancer or ingress so the pod can receive HTTP requests. As an alternative for a faster development workflow (auto-syncing to the pod after recompile), you can try Tilt.

2022-01-21

Easy minimal Ubuntu VM on any OS

Normally we use LXC/LXD, KVM, QEMU, Docker, Vagrant, VirtualBox, VMWare, or other virtualization and containerization software to spawn a VM-like instance locally. Today we're gonna try multipass, a tool to spawn and orchestrate Ubuntu VMs. To install multipass, it's as easy as running these commands:

snap install multipass
ls -al /var/snap/multipass/common/multipass_socket
snap connect multipass:libvirt # if error: ensure libvirt is installed and running
snap info multipass

To spawn a VM on Ubuntu (for other OSes, see the link above), we can run:

multipass find

Image        Aliases      Version   Description
...
18.04        bionic       20220104  Ubuntu 18.04 LTS
20.04        focal,lts    20220118  Ubuntu 20.04 LTS
21.10        impish       20220118  Ubuntu 21.10
daily:22.04  devel,jammy  20220114  Ubuntu 22.04 LTS
...
minikube                  latest    minikube is local Kubernetes

multipass launch --name groovy-lagomorph lts
# 20.04 --cpus 1 --disk 5G --mem 1G

multipass list
Name                    State             IPv4             Image
groovy-lagomorph        Running           10.204.28.99     Ubuntu 20.04 LTS

multipass info --all
Name:           groovy-lagomorph
State:          Running
IPv4:           10.204.28.99
Release:        Ubuntu 20.04.3 LTS
Image hash:     e1264d4cca6c (Ubuntu 20.04 LTS)
Load:           0.00 0.00 0.00
Disk usage:     1.3G out of 4.7G
Memory usage:   134.2M out of 976.8M
Mounts:         --


To run shell inside newly spawned VM, we can run:

multipass shell groovy-lagomorph

multipass exec groovy-lagomorph -- bash

If you need to simulate ssh, according to this issue you can either:

sudo ssh -i /var/snap/multipass/common/data/multipassd/ssh-keys/id_rsa ubuntu@10.204.28.99

# or add ssh key before launch on cloud-init.yaml
ssh_authorized_keys:
  - <your_ssh_key>

# or copy ssh key manually after launch
sudo ssh-copy-id -f -o 'IdentityFile=/var/snap/multipass/common/data/multipassd/ssh-keys/id_rsa' -i ~/.ssh/id_rsa.pub ubuntu@10.204.28.99

To stop/start/delete the VM:

multipass stop groovy-lagomorph
multipass start groovy-lagomorph
multipass delete groovy-lagomorph
multipass purge

What technology is used by multipass? It's QEMU, though it may differ on other platforms (it can run on Windows and macOS too).