
2022-06-07

How to profile your Golang Fiber server

Usually you need to load test your web server to find the memory leak or the bottleneck, and Golang already provides a tool for that, called pprof. The exact steps depend on the framework you use, but they are all similar: most frameworks already have a middleware that you can import and use. For example, Fiber has a pprof middleware; to use it:

// import
  "github.com/gofiber/fiber/v2/middleware/pprof"

// use
  app.Use(pprof.New())
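
Putting it together, a minimal runnable server would look roughly like this (a sketch; the port and the example route are my additions, not from the original post):

package main

import (
  "github.com/gofiber/fiber/v2"
  "github.com/gofiber/fiber/v2/middleware/pprof"
)

func main() {
  app := fiber.New()
  app.Use(pprof.New()) // mounts the /debug/pprof routes

  app.Get("/", func(c *fiber.Ctx) error {
    return c.SendString("hello")
  })

  _ = app.Listen(":3000")
}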

It registers a route called /debug/pprof that you can open after starting the server. To profile the CPU or check the heap, click the profile/heap link; it records for around 10 seconds, and while it waits you must hit your other endpoints to generate traffic/function calls. After 10 seconds it shows a download dialog to save your CPU profile or heap profile. From that file, you can run a command (similar to gops) to generate an svg or a web view of your profile:

pprof -web /tmp/profile # or
pprof -svg /tmp/profile # <-- file that you just downloaded
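
If you have the Go toolchain installed, the bundled pprof can also serve the visualization directly in an interactive web UI (an alternative invocation, not mentioned in the original post):

go tool pprof -http=:8080 /tmp/profile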

It would generate something like this:



So you can find out which function took most of the CPU time (or, for a heap profile, which function allocates the most memory). In my case the bottleneck was the default built-in pretty logger: it limited the server to ~9K rps at concurrency 255 on a database-write benchmark; after removing the built-in logging and replacing it with zerolog, it could handle ~57K rps on the same benchmark.
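
For illustration, replacing the built-in logger with zerolog could look roughly like this (a sketch; the middleware and the log fields are my assumptions, not the exact code used in the benchmark):

package main

import (
  "os"

  "github.com/gofiber/fiber/v2"
  "github.com/rs/zerolog"
)

func main() {
  logger := zerolog.New(os.Stderr).With().Timestamp().Logger()

  app := fiber.New()
  // instead of the default pretty logger middleware:
  app.Use(func(c *fiber.Ctx) error {
    err := c.Next()
    logger.Info().
      Str("path", c.Path()).
      Int("status", c.Response().StatusCode()).
      Msg("request")
    return err
  })
  _ = app.Listen(":3000")
}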

2022-04-05

Start/restart Golang or any other binary program automatically on boot/crash

There are some alternatives to make a program start on boot on Linux, the usual ways being:

1. SystemD, which can ensure that dependencies are started before your service and can also limit your CPU/RAM usage. Generate a template using this website or use kardianos/service

2. PM2 (requires NodeJS), or PMG

3. docker-compose (requires docker, but you can skip the build part and just copy the binary directly in the Dockerfile command (which can be deployed using rsync); set the restart property in docker-compose and it would restart when the computer boots) -- the bad part: you cannot limit cpu/ram unless using docker swarm. But you can use docker directly to limit resources and use the --restart flag.

4. lxc/lxd or multipass or other VM/lightweight VM (you still need systemd inside it XD, but at least it won't ruin your host); you can rsync directly into the container to redeploy, for example using overseer or tableflip. You must add a reverse proxy, NAT, or proper routing/IP forwarding though if you want it to be accessible from outside

5. supervisord (python) or ochinchina/supervisord (golang), tutorial here

6. create one daemon manager with systemd/docker-compose, then spawn the other services using goproc or pioz/god

7. monit, which can monitor and ensure a program is started/not dead

8. nomad (actually a deployment tool, but it can also manage workloads)

9. kubernetes XD overkill

10. immortal.run, a supervisor (this one actually uses systemd)

11. other containerization/VM workload orchestrators/managers that are usually already provided by the hoster/PaaS provider (Amazon ECS/Beanstalk/Fargate, Google AppEngine, Heroku, Jelastic, etc)


This is the systemd unit that I usually use (you need to create a user named "web" and install "unbuffer"):

$ cat /usr/lib/systemd/system/xxx.service
[Unit]
Description=xxx
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
Type=simple
Restart=on-failure
User=web
Group=users
WorkingDirectory=/home/web/xxx
ExecStart=/home/web/xxx/run_production.sh
ExecStop=/usr/bin/killall xxx
LimitNOFILE=2097152
LimitNPROC=65536
ProtectSystem=full
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

$ cat /home/web/xxx/run_production.sh
#!/usr/bin/env bash

mkdir -p `pwd`/logs
ofile=`pwd`/logs/access_`date +%F_%H%M%S`.log
echo Logging into: $ofile
unbuffer time ./xxx | tee $ofile
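
After placing both files, reload systemd, then enable and inspect the service like this:

sudo systemctl daemon-reload
sudo systemctl enable --now xxx   # start now and on every boot
systemctl status xxx
journalctl -u xxx -f              # follow the service logs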



2022-04-04

Automatic Load Balancer Registration/Deregistration with NATS or FabioLB

Today we're gonna test two alternatives for automatic load balancing. Previously I always used Caddy or NginX with manual reverse proxy configuration, because most of my projects run on a single server (the bottleneck is always the database, not the backend/compute part). But today we're gonna test two possible high-availability load balancing strategies (without Kubernetes, of course): the first using NATS, the second using a standard load balancer, in this case FabioLB.

To use NATS, we're gonna use this strategy:
the first thing we deploy is our custom reverse proxy, which should be able to convert any query string, form body of any content-type, and any header if needed. We can use any serialization format (json, msgpack, protobuf, etc), but in this case we'll just use a plain string; we call this service "apiproxy". The apiproxy sends the serialized payload (from a map/object) into NATS using the request-reply mechanism. The other service is our backend "worker"/handler, which could be anything, but in this case it is the real handler containing our business logic, so it subscribes and returns a reply to the apiproxy, which deserializes it back to the client with whatever serialization format and protocol (gRPC/Websocket/HTTP-REST/JSONP/etc).
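
The request-reply part of that strategy boils down to something like this minimal sketch (the subject name and payload are made up for illustration; the real apiproxy serializes the whole request into the payload):

package main

import (
  "log"
  "time"

  "github.com/nats-io/nats.go"
)

func main() {
  nc, err := nats.Connect(nats.DefaultURL)
  if err != nil {
    log.Fatal(err)
  }
  defer nc.Close()

  // worker side: a queue group load-balances messages between all
  // workers subscribed with the same group name (see the UPDATE below)
  _, _ = nc.QueueSubscribe("svc.hello", "workers", func(m *nats.Msg) {
    _ = m.Respond([]byte("hello " + string(m.Data)))
  })

  // apiproxy side: forward the serialized request, wait for the reply
  reply, err := nc.Request("svc.hello", []byte("world"), 2*time.Second)
  if err != nil {
    log.Fatal(err)
  }
  log.Println(string(reply.Data)) // hello world
}

Here's the benchmark result of plain Fiber without any proxy, versus apiproxy-nats-worker with a single NATS instance vs multiple NATS instances: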

# no proxy
go run main.go apiserver
hey -n 1000000 -c 255 http://127.0.0.1:3000
  Average:      0.0011 secs
  Requests/sec: 232449.1716

# single nats
go run main.go apiproxy
go run main.go # worker
hey -n 1000000 -c 255 http://127.0.0.1:3000
  Average:      0.0025 secs
  Requests/sec: 100461.5866

# 2 worker
  Average:      0.0033 secs
  Requests/sec: 76130.4079

# 4 worker
  Average:      0.0051 secs
  Requests/sec: 50140.6288

# limit the apiserver CPU
GOMAXPROCS=2 go run main.go apiserver
  Average:      0.0014 secs
  Requests/sec: 184234.0106

# apiproxy 2 core
# 1 worker 2 core each
  Average:      0.0025 secs
  Requests/sec: 103007.4516

# 2 worker 2 core each
  Average:      0.0029 secs
  Requests/sec: 87522.6801

# 4 worker 2 core each
  Average:      0.0037 secs
  Requests/sec: 67714.5851

# seems that the bottleneck is spawning the producer's NATS
# spawning 8 connections using round-robin

# 1 worker 2 core each
  Average:      0.0021 secs
  Requests/sec: 121883.4324

# 4 worker 2 core each
  Average:      0.0030 secs
  Requests/sec: 84289.4330

# seems also the apiproxy is hogging all the CPU cores
# limiting to 8 core for apiproxy
# now synchronous handler changed into async/callback version
GOMAXPROCS=8 go run main.go apiserver

# 1 worker 2 core each
  Average:      0.0017 secs
  Requests/sec: 148298.8623

# 2 worker 2 core each
  Average:      0.0017 secs
  Requests/sec: 143958.4056

# 4 worker 2 core each
  Average:      0.0029 secs
  Requests/sec: 88447.5352

# limiting the NATS to 4 core using go run on the source
# 1 worker 2 core each
  Average:      0.0013 secs
  Requests/sec: 194787.6327

# 2 worker 2 core each
  Average:      0.0014 secs
  Requests/sec: 176702.0119

# 4 worker 2 core each
  Average:      0.0022 secs
  Requests/sec: 116926.5218

# same nats core count, increase worker core count
# 1 worker 4 core each
  Average:      0.0013 secs
  Requests/sec: 196075.4366

# 2 worker 4 core each
  Average:      0.0014 secs
  Requests/sec: 174912.7629

# 4 worker 4 core each
  Average:      0.0021 secs
  Requests/sec: 121911.4473 --> see update below


It would be better if this was tested across multiple servers, but it seems the bottleneck is on the NATS connection when there are many subscribers; they could not scale linearly (16-66% overhead for a single API proxy) IT'S A BUG ON MY SIDE, SEE UPDATE BELOW. Next we're gonna try FabioLB with Consul. Consul is used as the service registry (a synchronous-consistent "database", like Zookeeper or Etcd). To install all of it, use these commands:

# setup:
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt install consul
go install github.com/fabiolb/fabio@latest

# start:
sudo consul agent -dev --data-dir=/tmp/consul
fabio
go run main.go -addr 172.17.0.1:5000 -name svc-a -prefix /foo -consul 127.0.0.1:8500

# benchmark:
# without fabio
  Average:      0.0013 secs
  Requests/sec: 197047.9124

# with fabio 1 backend
  Average:      0.0038 secs
  Requests/sec: 65764.9021

# with fabio 2 backend
go run main.go -addr 172.17.0.1:5001 -name svc-a -prefix /foo -consul 127.0.0.1:8500

# the bottleneck might be the cores, so we limit the cores to 2 for each worker
# with fabio 1 backend 2 core each
  Average:      0.0045 secs
  Requests/sec: 56339.5518

# with fabio 2 backend 2 core each
  Average:      0.0042 secs
  Requests/sec: 60296.9714

# what if we limit also the fabio
GOMAXPROCS=8 fabio

# with fabio 8 core, 1 backend 2 core each
  Average:      0.0042 secs
  Requests/sec: 59969.5206

# with fabio 8 core, 2 backend 2 core each
  Average:      0.0041 secs
  Requests/sec: 62169.2256

# with fabio 8 core, 4 backend 2 core each
  Average:      0.0039 secs
  Requests/sec: 64703.8253

All CPU cores were utilized around 50% on a 32-core, 128GB RAM server; I can't find which part is the bottleneck for now, but for sure both strategies have around 16% vs 67% overhead compared to no proxy (which makes sense, because adding more layers adds more transport, more things to copy/transfer, and more transform/serialize-deserialize steps). The code used in this benchmark is here, in the 2022mid directory, and the code for fabio-consul registration was copied from eBay's github repository.
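
For reference, registering a backend into Consul so Fabio can route to it looks roughly like this (a sketch using the hashicorp consul api client; the /health endpoint and service ID are my assumptions -- Fabio only routes to instances with a passing health check, and it picks up tags starting with "urlprefix-"):

package main

import (
  "log"

  "github.com/hashicorp/consul/api"
)

func main() {
  client, err := api.NewClient(api.DefaultConfig()) // 127.0.0.1:8500 by default
  if err != nil {
    log.Fatal(err)
  }
  err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
    ID:      "svc-a-1",
    Name:    "svc-a",
    Address: "172.17.0.1",
    Port:    5000,
    Tags:    []string{"urlprefix-/foo"}, // route /foo to this instance
    Check: &api.AgentServiceCheck{
      HTTP:     "http://172.17.0.1:5000/health",
      Interval: "10s",
    },
  })
  if err != nil {
    log.Fatal(err)
  }
}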

Why would we even need to do this? If we're using the api gateway pattern (one of the patterns used in my past company, but with Kubernetes on the worker side), we can deploy independently and communicate between services through the gateway (proxy) without knowing the IP address or domain name of the service itself; as long as the request has the proper route and payload, it can be handled wherever the service is deployed. What if you want to do canary or blue-green deployment? You can just register a handler in NATS or Consul with a different route name (especially for communication between services, not public-to-service), and wait for all traffic to move there before killing the previous deployment.

So what should you choose? Both strategies require 3 moving parts (apiproxy-nats-worker, fabio-consul-worker), but the NATS strategy is simpler to develop and can give better performance (especially if you make the apiproxy as flexible as possible), though it needs better serialization: since serialization is not measured in this benchmark, if you need better serialization performance you must use codegen, which may require you to deploy twice (once for the apiproxy, once for the worker, unless you split the raw response meta with jsonparser or use a map only in the apiproxy). The FabioLB strategy has more features, and you can also use Consul for service discovery (contacting other services directly by name without going through FabioLB). The NATS strategy has some benefit in terms of security: the NATS cluster can sit inside a DMZ, and workers can be on different subnets without the ability to connect to each other and it would still work, whereas if you use Consul to connect directly to another service, they must have a route or connection to reach each other. The bad part about NATS is that you should not use it for file uploads, or it would hog a lot of resources; uploads should be handled by the apiproxy directly, and only a reference to the uploaded file forwarded as payload through NATS. You can check NATS traffic statistics using nats-top.

What's next? Maybe we can try Traefik, a load balancer with built-in service discovery support in one binary; it can also use Consul.

UPDATE: by changing the code from Subscribe (broadcast/fan-out) to QueueSubscribe (load balance), it has similar performance with 1/2/4 subscribers, so we can use NATS for high availability/fault tolerance in the api gateway pattern at the cost of ~16% overhead.
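
In code, the difference is just the subscribe call (a fragment, continuing the sketch above):

// fan-out: every subscriber receives every message,
// so N workers multiply the work instead of sharing it
_, _ = nc.Subscribe("svc.hello", handler)

// queue group: each message is delivered to exactly one
// member of the "workers" group -> real load balancing
_, _ = nc.QueueSubscribe("svc.hello", "workers", handler)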

TL;DR

no LB: 232K rps
-> LB with NATS request-reply: 196K rps (16% overhead)
no LB: 197K rps
-> LB with Fabio+Consul: 65K rps (67% overhead)

 



2022-03-20

1 million Go goroutine vs C# task

Let's compare 1 million goroutines with 1 million C# tasks: which is more efficient in CPU and memory usage? The code is forked from kjpgit's techdemo.
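
The Go side is essentially this pattern (a simplified sketch, not the exact techdemo code; the sleep duration and polling interval are illustrative):

package main

import (
  "sync/atomic"
  "time"
)

func main() {
  const n = 1_000_000
  var done int64
  for i := 0; i < n; i++ {
    go func() {
      time.Sleep(10 * time.Second) // simulate a waiting task
      atomic.AddInt64(&done, 1)    // sleep first, then increment (see below)
    }()
  }
  for atomic.LoadInt64(&done) < n {
    time.Sleep(time.Second) // check completion every 1s
  }
}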

Name     UserCPU    SysCPU     AvgRSS       MaxRSS    Wall
c#_1t      38.96    0.68      582,951      636,136    1:00
c#_2t      88.33    0.95      623,956      820,620    1:02
c#_4t     142.86    1.09      687,365      814,028    1:03
c#_8t     235.80    1.71      669,882      820,704    1:05
c#_16t    434.76    4.01      734,545      771,240    1:08
c#_32t    717.39    4.81      720,235      769,888    1:11
go_1t      58.77    0.65    2,635,380    2,635,632    1:04
go_2t      64.48    0.71    2,639,206    2,642,752    1:00
go_4t      72.55    1.42    2,651,086    2,654,972    1:00
go_8t      80.87    2.82    2,641,664    2,643,392    1:00
go_16t     83.18    4.03    2,673,404    2,681,100    1:00
go_32t     86.65    4.30    2,645,494    2,657,580    1:00


The result is as expected, because all tasks/goroutines are spawned first before processing: Go's scheduler is more efficient in CPU usage, but the C# runtime is more efficient in memory usage, which is normal since goroutines require a minimum of 2KB overhead per goroutine, a way higher cost than spawning a task. What if we increase it to 10 million tasks/goroutines, and let the spawning be done in another task/goroutine, so that when a goroutine is done its memory can be returned to the GC? Here's the result:

Name     UserCPU    SysCPU     AvgRSS        MaxRSS    Wall
c#_1t      12.78    1.28    2,459,190     5,051,528    0:13
c#_2t      22.60    1.54    2,692,439     5,934,796    0:18
c#_4t      42.09    1.54    2,370,239     5,538,280    0:21
c#_8t      88.54    2.29    2,522,053     6,334,176    0:29
c#_16t    204.39    3.32    2,395,001     5,803,808    0:34
c#_32t    259.09    3.25    1,842,458     4,710,012    0:28
go_1t      13.97    0.97    4,514,200     6,151,088    0:14
go_2t      12.35    1.51    5,595,418     9,506,076    0:07
go_4t      22.09    2.40    6,394,162    12,517,848    0:07
go_8t      31.00    3.09    7,115,281    13,428,344    0:06
go_16t     40.32    3.52    7,126,851    13,764,940    0:06
go_32t     58.58    3.58    7,104,882    12,145,396    0:06

The result seems normal; the high memory usage is caused by a lot of goroutines spawned at the same time on different threads, not blocking the main thread, and after they're done the memory gets collected by the GC (previously the exit condition was time-based; this time it exits after all work is done, since I moved the sleep before the atomic increment). What if we go back down to 1 million but with the same exit rule, the spawning executed in a different task/goroutine, and completion checked every 1s? Here's the result:

Name    UserCPU SysCPU    AvgRSS     MaxRSS    Wall
c#_1t    1.18    0.16     328,134     511,652    0:02
c#_2t    2.18    0.22     294,608     554,488    0:02
c#_4t    3.19    0.20     305,336     554,064    0:02
c#_8t    7.77    0.31     292,281     530,368    0:02
c#_16t  12.33    0.25     304,352     569,460    0:02
c#_32t  37.90    1.25     337,837     684,252    0:03
go_1t    2.72    0.42   1,592,978   2,519,040    0:03
go_2t    3.04    0.47   1,852,084   2,637,532    0:03
go_4t    3.65    0.54   1,936,626   2,637,272    0:03
go_8t    3.27    0.59   1,768,540   2,655,208    0:02
go_16t   4.01    0.71   1,770,673   2,664,504    0:02
go_32t   4.96    0.72   1,770,354   2,669,244    0:02

The difference in processing time is negligible, but the CPU and memory usage contrast quite a bit. Next, let's try to spawn in bursts (100K per second), adding a 1 second sleep after every 100Kth task/goroutine, since it's not realistic even for a DDOS'ed server to receive that much (unless the server is finely tuned). Here's the result:

Name    UserCPU  SysCPU    AvgRSS     MaxRSS     Wall
c#_1t     0.61    0.08    146,849     284,436    0:05
c#_2t     1.17    0.10    131,778     261,720    0:05
c#_4t     1.53    0.08    133,505     289,584    0:05
c#_8t     4.17    0.15    131,924     284,960    0:05
c#_16t   10.94    0.68    135,446     289,028    0:05
c#_32t   19.86    3.01    130,533     284,924    0:05
go_1t     1.84    0.24    731,872   1,317,796    0:06
go_2t     1.87    0.26    659,382   1,312,220    0:05
go_4t     2.00    0.30    661,296   1,322,152    0:05
go_8t     2.37    0.34    660,641   1,324,684    0:05
go_16t    2.82    0.39    660,225   1,323,932    0:05
go_32t    3.36    0.45    659,176   1,327,264    0:05

And for 5 million:

Name    UserCPU    SysCPU    AvgRSS       MaxRSS    Wall
c#_1t     3.39    0.24      309,103      573,772    0:11
c#_2t     8.30    0.26      278,683      553,592    0:11
c#_4t    13.65    0.32      274,679      658,104    0:11
c#_8t    23.20    0.46      286,336      641,376    0:12
c#_16t   45.85    1.32      286,311      640,336    0:12
c#_32t   64.83    2.46      264,866      615,552    0:12
go_1t     6.25    0.50    1,397,434    2,629,936    0:13
go_2t     6.20    0.56    1,386,336    2,631,580    0:11
go_4t     7.52    0.65    1,410,523    2,625,308    0:11
go_8t     8.21    0.86    1,441,080    2,779,456    0:11
go_16t   11.17    0.96    1,436,220    2,687,908    0:11
go_32t   12.97    1.06    1,430,573    2,668,816    0:11

And for 25 million:

Name    UserCPU   SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     15.94    0.69     590,411    1,190,340    0:24
c#_2t     34.88    0.84     699,288    1,615,372    0:32
c#_4t     59.95    0.89     761,308    1,794,116    0:34
c#_8t    100.64    1.36     758,161    1,845,944    0:36
c#_16t   199.56    2.99     765,791    2,014,856    0:38
c#_32t   332.02    4.07     811,809    1,972,400    0:41
go_1t     21.76    0.71   2,846,565    4,413,968    0:29
go_2t     25.77    1.03   2,949,433    5,553,608    0:25
go_4t     28.74    1.24   2,920,447    5,800,088    0:24
go_8t     37.28    1.96   2,869,074    5,502,776    0:23
go_16t    43.46    2.67   2,987,114    5,769,356    0:24
go_32t    43.77    2.92   3,027,179    5,867,084    0:24

How about 25 million with a sleep per 200K?

Name    UserCPU   SysCPU      AvgRSS       MaxRSS    Wall
c#_1t     18.47    0.91      842,492    1,820,788    0:22
c#_2t     40.32    0.93    1,070,555    2,454,324    0:31
c#_4t     62.39    1.16    1,103,741    2,581,476    0:33
c#_8t    100.84    1.34    1,074,820    2,377,580    0:34
c#_16t   218.26    2.91    1,062,642    2,726,700    0:37
c#_32t   339.00    6.51    1,042,254    2,275,644    0:40
go_1t     22.61    0.88    3,474,195    5,071,944    0:27
go_2t     25.83    1.20    3,912,071    6,964,640    0:20
go_4t     37.98    1.68    4,180,188    7,392,800    0:20
go_8t     38.56    2.44    4,189,265    8,481,852    0:18
go_16t    44.49    3.19    4,187,142    8,483,236    0:18
go_32t    48.82    3.44    4,218,591    8,424,200    0:18

And lastly, 25 million with a sleep per 400K?

Name    UserCPU    SysCPU    AvgRSS        MaxRSS    Wall
c#_1t     18.66    0.98    1,183,313    2,622,464    0:20
c#_2t     41.27    1.14    1,326,415    3,155,948    0:31
c#_4t     67.21    1.11    1,436,280    3,015,212    0:33
c#_8t    107.14    1.56    1,492,179    3,378,688    0:35
c#_16t   233.50    2.45    1,498,421    3,732,368    0:41
c#_32t   346.87    3.74    1,335,756    2,882,676    0:39
go_1t     24.13    0.82    4,048,937    5,099,220    0:26
go_2t     28.85    1.41    4,936,677    8,023,568    0:18
go_4t     31.51    1.95    5,193,653    9,537,080    0:14
go_8t     45.27    2.65    5,461,107    9,499,308    0:14
go_16t    53.43    3.19    5,183,009    9,476,084    0:14
go_32t    61.98    3.86    5,589,156   10,587,788    0:14

How to read the results above: Wall = time needed to complete, lower is better; AvgRSS/MaxRSS = average/max memory usage (KB), lower is better; UserCPU/SysCPU = CPU time used, where more than the wall time means more than 1 full core of compute was used, lower is better. Versions used in this benchmark:

go version go1.17.6 linux/amd64
dotnet --version
6.0.201

2022-02-22

C# vs Go in Simple Benchmark

Today we're gonna revisit two of my favorite languages in an associative array and comb sort benchmark (compile and run, not just runtime performance, because the time a developer waits for compilation also matters), like in the past benchmark. For installing DotNet:

wget https://packages.microsoft.com/config/ubuntu/21.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt install apt-transport-https
sudo apt-get update
sudo apt-get install -y dotnet-sdk-6.0 aspnetcore-runtime-6.0

For installing Golang:

sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt-get update
sudo apt install -y golang-1.17

Result (best of 3 runs)

cd assoc; time dotnet run
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.16s     Real: 14.41s    RAM: 1945904KB

cd assoc; time go run map.go
6009354 6009348 611297
36186112 159701682 23370001

CPU: 14.80s     Real: 12.01s    RAM: 2305384KB

This is a bit weird: usually I see Go use less memory but run slower, but in this benchmark it's C# that uses less memory yet runs a bit slower (14.41s vs 12.01s), possibly because compilation speed is also included.

cd num-assoc; time dotnet run
CPU: 2.21s      Real: 2.19s     RAM: 169208KB

cd num-assoc; time go run comb.go
CPU: 0.46s      Real: 0.44s     RAM: 83100KB

What if we increase the N from 1 million to 10 million?

cd num-assoc; time dotnet run
CPU: 19.25s     Real: 19.16s    RAM: 802296KB

cd num-assoc; time go run comb.go
CPU: 4.60s      Real: 4.67s     RAM: 808940KB
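
For reference, the comb sort being benchmarked is roughly this algorithm (a sketch, not necessarily the exact comb.go):

package main

import "fmt"

// combSort sorts in place using a gap that shrinks by a factor of ~1.3
func combSort(a []int) {
  gap, swapped := len(a), true
  for gap > 1 || swapped {
    gap = gap * 10 / 13 // shrink the gap
    if gap < 1 {
      gap = 1
    }
    swapped = false
    for i := 0; i+gap < len(a); i++ {
      if a[i] > a[i+gap] {
        a[i], a[i+gap] = a[i+gap], a[i]
        swapped = true
      }
    }
  }
}

func main() {
  a := []int{5, 2, 9, 1, 7}
  combSort(a)
  fmt.Println(a) // [1 2 5 7 9]
}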

If you want to contribute (eg. if I made a mistake coding the C# or Go version of the algorithm, or if there's a more efficient data structure), just fork and create a pull request, and I will redo the benchmark.

2021-11-17

Alternative Strategy for Dependency Injection (lambda-returning vs function-pointer)

There are some common strategies for injecting a dependency (one or a set of functions) using an interface, something like this:

type Foo interface{ Bla() string }
type RealAyaya struct{}
func (a *RealAyaya) Bla() string { return `real` }
type MockAyaya struct{} // generated from gomock or others
func (a *MockAyaya) Bla() string { return `mock` }
// real usage:
var deps Foo = &RealAyaya{}
deps.Bla()
// test usage:
deps = &MockAyaya{}
deps.Bla()

and there's another one (dependency parameter on function returning a lambda):

type Bla func() string
type DepsIface interface{ /* ... */ }
func NewBla(deps DepsIface) Bla {
  return func() string {
    // do something with deps
  }
}
// real usage:
bla := NewBla(realDeps)
res := bla()
// test usage:
bla := NewBla(mockedOrFakeDeps)
res := bla()

And there's another way: combining both fake and real implementations like this, or alternatively using proxy/cache+codegen if it's for a 3rd party dependency.
And there's yet another way (pluggable at the per-function level):

type Bla func() string
type BlaCaller struct {
  BlaFunc Bla
}
// real usage:
bla := BlaCaller{ BlaFunc: deps.SomeMethod }
res := bla.BlaFunc()
// test usage:
bla := BlaCaller{ BlaFunc: func() string { return `fake` } }
res := bla.BlaFunc()

Analysis


The first one is the most popular way. The 2nd one I saw recently (it's also used in an openApi/swagger codegen library, I forgot which); the bad part is that we have to sanitize the stack trace manually, because it shows something like NewBla.func1 in the traces, and we have to use a generated mock or implement everything when testing. The last style is what I thought of when writing some task where the specs were still unclear on whether I should:
1. query from the local database
2. hit another service
3. or just return fake data (in the tests)
I can easily swap out any function without depending on a whole struct or interface, and it stays easy to debug (set breakpoints) and jump around the methods, compared to a generated mock or the interface version.
Probably the bad part is that we have to inject every function one by one for each function we want to call (which is nearly the same effort as the 2nd one). But if that's the case, when your function requires 10+ injected functions, maybe it's time to refactor?

The real use case would be something like this:

type LoanExpirationFunc func(userId string) time.Time

type InProcess1 struct {
  UserId string
  // add more input here
  LoanExpirationFunc LoanExpirationFunc
  // add more injectable functions, eg. 3rd party hit or db read/save
}
type OutProcess1 struct{}

func Process1(in *InProcess1) (out *OutProcess1) {
  // eg. validation here
  x := in.LoanExpirationFunc(in.UserId)
  // ... do something with x
}

func defaultLoanExpirationFunc(userId string) time.Time {
  // eg. query from database
}

type thirdParty struct{} // to put dependencies

func NewThirdParty() *thirdParty { return &thirdParty{} }

func (t *thirdParty) extLoanExpirationFunc(userId string) time.Time {
  // eg. hit another service
}

// init input:
func main() {
  http.HandleFunc("/case1", func(w http.ResponseWriter, r *http.Request) {
    in := InProcess1{LoanExpirationFunc: defaultLoanExpirationFunc}
    in.ParseFromRequest(r)
    out := Process1(&in)
    out.WriteToResponse(w)
  })
  tp := NewThirdParty()
  http.HandleFunc("/case2", func(w http.ResponseWriter, r *http.Request) {
    in := InProcess1{LoanExpirationFunc: tp.extLoanExpirationFunc}
    in.ParseFromRequest(r)
    out := Process1(&in)
    out.WriteToResponse(w)
  })
}

// on test:
func TestProcess1(t *testing.T) {
  t.Run(`test one year from now`, func(t *testing.T) {
    in := InProcess1{LoanExpirationFunc: func(string) time.Time {
      return time.Now().AddDate(1, 0, 0)
    }}
    out := Process1(&in)
    assert.Equal(t, out, ...)
  })
}

I haven't used this strategy extensively on a new project yet (since I only thought about it today and yesterday while creating a horrid integration test), but I'll update this post when I find annoyances with this strategy.
 
UPDATE 2022: after using this strategy extensively for a while, it is better than interfaces (especially when using IntelliJ); my tip: give the function pointer field and the injected function the same name.

2021-07-30

Mock vs Fake and Classical Testing

The motivation of this article is to promote a less painful way of testing and structuring code, with fewer broken tests when changing logic/implementation details (changing only the logic, not the input/output). This post recaps ~4 years' worth of articles from other developers who realized similar pain points with the popular approach (mock), concluding that Fake > Mock and Classical Test > Mock Test.

Mock Approach

Given a code like this:

type Obj struct {
  *sql.DB // or Provider
}

func (o *Obj) DoMultipleQuery(in InputStruct) (out OutputStruct, err error) {
  ... = o.DoSomeQuery()
  ... = o.DoOtherQuery()
}

I've seen code that tests with the mock technique like this:

func TestObjDoMultipleQuery(t *testing.T) {
  testCases := []struct {
    name      string
    mockFunc  func(sqlmock.Sqlmock, *gomock.Controller)
    in        InputStruct
    out       OutputStruct
    shouldErr bool
  }{
    {
      name: `best case`,
      mockFunc: func(db sqlmock.Sqlmock, c *gomock.Controller) {
        db.ExpectExec(`UPDATE t1 SET bla = \?, foo = \?, yay = \? WHERE bar = \? LIMIT 1`).
          WillReturnResult(sqlmock.NewResult(1, 1))
        db.ExpectQuery(`SELECT a, b, c, d, bar, bla, yay FROM t1 WHERE bar = \? AND state IN \(1,2\)`).
          WithArgs(3).
          WillReturnRows(sqlmock.NewRows([]string{"id", "channel_name", "display_name", "color", "description", "active", "updated_at"}).
            AddRow("2", "bla2", "Bla2", "#0000", "bla bla", "1", "2021-05-18T15:04:05Z").
            AddRow("3", "wkwk", "WkWk", "#0000", "wkwk", "1", "2021-05-18T15:04:05Z"))
        ...
      },
      in:        InputStruct{...},
      out:       OutputStruct{...},
      shouldErr: false,
    },
    {
      // ... other cases
    },
  }
  for _, tc := range testCases {
    t.Run(tc.name, func(t *testing.T) {
      ... // prepare the mock object
      o := Obj{mockProvider}
      out := o.DoMultipleQueryBusinessLogic(tc.in)
      assert.Equal(t, out, tc.out)
    })
  }
}

This approach has pros and cons:

+ can check for typos (eg. add one character to the original query and this test would detect the error)

+ can check whether some queries are properly called, or not called but expected to be called

+ a unit test is always faster than an integration test

- tests implementation details (easily breaks when the logic changes)

- cannot check whether the SQL statements are correct

- possible coupled implementation between data provider and business logic

- duplicated work between the original query and the regex version of the query: if we add a column, we must change both

For the last con, we can change it to something like this:

db.ExpectQuery(`SELECT.+FROM t1.+`).
WillReturnRows( ... )

This approach has pros and cons:

+ no duplicated work (since it's just a simplified regex of the full SQL statement)

+ can still check whether queries are properly called or not

+ a unit test is always faster than an integration test

- tests implementation details (easily breaks when the logic changes)

- cannot detect typos/whether the query no longer matches (eg. if we accidentally add one character to the original query that causes an sql error)

- cannot check the correctness of the SQL statement

- possible coupled implementation between data provider and business logic

We could also create a helper function that converts the original query into its regex version:

func SqlToRegexSql(sql string) string {
  return regexp.QuoteMeta(sql) // escapes regex special characters: ( ) ? . etc
}

db.ExpectQuery(SqlToRegexSql(ORIGINAL_QUERY)) ...

This approach has the same pros and cons as the previous approach.

Fake Approach

Fake testing uses the classical approach: instead of checking implementation details (expected calls to a dependency), we use a compatible implementation as the dependency (eg. a slice/map of structs standing in for a database table/DataProvider).

Given a code like this:

type Obj struct {
  FooDataProvider // interface{UpdateFoo,GetFoo,...}
}

func (o *Obj) DoBusinessLogic(in *Input) (out *Output, err error) {
  ... = o.UpdateFoo(in.bla)
  ... = o.GetFoo(in.bla)
  ...
}

It’s better to make a fake data provider like this:

type FakeFooDataProvider struct {
  Rows map[int]FooRow // or a slice
}

func (f *FakeFooDataProvider) UpdateFoo(a string) (...) { /* update Rows */ }
func (f *FakeFooDataProvider) GetFoo(a string) (...)    { /* get one row */ }
// ... insert, delete, count, get batched/paged
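
Filled in, such a fake can be as simple as this (the row shape and method signatures here are hypothetical; adjust them to the real FooDataProvider interface, and it assumes "fmt" is imported):

type FooRow struct {
  Id  int
  Bla string
}

type FakeFooDataProvider struct {
  Rows map[int]FooRow // in-memory stand-in for the table
}

func (f *FakeFooDataProvider) UpdateFoo(id int, bla string) error {
  row, ok := f.Rows[id]
  if !ok {
    return fmt.Errorf(`foo %d not found`, id)
  }
  row.Bla = bla
  f.Rows[id] = row
  return nil
}

func (f *FakeFooDataProvider) GetFoo(id int) (FooRow, bool) {
  row, ok := f.Rows[id]
  return row, ok
}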

So in the test, we can do something like this:

func TestObjDoBusinessLogic(t *testing.T) {
  o := Obj{FakeFooDataProvider{}}
  testCases := []struct {
    name      string
    in        Input
    out       Output
    shouldErr bool
  }{
    {
      name:      `best case`,
      in:        Input{...},
      out:       Output{...},
      shouldErr: false,
    },
    {
      // ...
    },
  }
  for _, tc := range testCases {
    t.Run(tc.name, func(t *testing.T) {
      out := o.DoBusinessLogic(tc.in)
      assert.Equal(t, tc.out, out)
    })
  }
}

This approach has pros and cons:

+ tests behavior (this input should give this output) instead of implementation details (doesn't break easily/no need to modify the test when the algorithm/logic changes)

+ a unit test is always faster than an integration test

- cannot check whether the queries are called, or not called but expected to be called

- double work in Golang (since there are no generics/templates yet; go 1.18 must wait until Feb 2022): you must create a minimal fake implementation (map/slice) that simulates basic database table logic, or if the data provider is not separated per table (repository/entity pattern), create the join logic too -- a better approach in this case is to always create Insert, Update, Delete, GetOne, GetBatch instead of joining

+ should be no coupling between queries and business logic

- cannot check whether the queries in the data provider are correct (which should not be this unit's problem; it's the DataProvider integration/unit test's problem, not this unit's)

Classical Approach for DataProvider

It's better to test the queries using the classical (black box) approach with an integration test instead of mocks (white box), since mock and fake testing can only test the correctness of the business logic, not the logic of the data provider, which mostly depends on a 2nd party (the database). Fake testing is also considered a classical approach, since it tests input/output, not implementation details.

Using dockertest when testing locally and a gitlab-ci service when testing in the pipeline can look something like this:

var testDbConn *sql.DB

func TestMain(m *testing.M) { // called before tests
  if env == `` || env == `development` { // spawn dockertest, pass the connection to the callback
    prepareDb(func(db *sql.DB) int {
      testDbConn = db
      if db == nil {
        return 0
      }
      return m.Run()
    })
  } else {
    // connect to the gitlab-ci service
    var err error
    testDbConn, err = ... // log error
  }
}

func TestDataProviderLogic(t *testing.T) {
  if testDbConn == nil {
    if env == `` || env == `development` || env == `test` {
      t.Fail()
    }
    return
  }
  f := FooDataProvider{testDbConn}
  f.InitTables()
  f.MigrateTables() // if testing migration
  // test f.UpdateFoo, f.GetFoo, ...
}

Where the prepareDb function can be something like this (taken from the dockertest example):

var globalPool *dockertest.Pool

func prepareDb(onReady func(db *sql.DB) int) {
  const dockerRepo = `yandex/clickhouse-server`
  const dockerVer = `latest`
  const chPort = `9000/tcp`
  const dbDriver = "clickhouse"
  const dbConnStr = "tcp://127.0.0.1:%s?debug=true"
  var err error
  if globalPool == nil {
    globalPool, err = dockertest.NewPool("")
    if err != nil {
      log.Printf("Could not connect to docker: %s\n", err)
      return
    }
  }
  resource, err := globalPool.Run(dockerRepo, dockerVer, []string{})
  if err != nil {
    log.Printf("Could not start resource: %s\n", err)
    return
  }
  var db *sql.DB
  if err := globalPool.Retry(func() error {
    var err error
    db, err = sql.Open(dbDriver, fmt.Sprintf(dbConnStr, resource.GetPort(chPort)))
    if err != nil {
      return err
    }
    return db.Ping()
  }); err != nil {
    log.Printf("Could not connect to docker: %s\n", err)
    return
  }
  code := onReady(db)
  if err := globalPool.Purge(resource); err != nil {
    log.Fatalf("Could not purge resource: %s", err)
  }
  os.Exit(code)
}

In the pipeline, the .gitlab-ci.yml file can be something like this for PostgreSQL (use a tmpfs/in-memory version for the database data directory to make it faster):

test:
  stage: test
  image: golang:1.16.4
  dependencies: []
  services:
    - postgres:13-alpine # TODO: create a tmpfs version
  tags:
    - cicd
  variables:
    ENV: test
    POSTGRES_DB: postgres
    POSTGRES_HOST: postgres
    POSTGRES_PASSWORD: postgres
    POSTGRES_PORT: "5432"
    POSTGRES_USER: postgres
  script:
    - source env.sample
    - go test

The Dockerfile for a tmpfs database, if using MySQL, can be something like this:

FROM circleci/mysql:5.5

RUN echo '\n\
[mysqld]\n\
datadir = /dev/inmemory/mysql\n\
' >> /etc/mysql/my.cnf

Or for MongoDB:

FROM circleci/mongo:3.6.9

RUN sed -i '/exec "$@"/i mkdir \/dev\/inmemory\/mongo' /usr/local/bin/docker-entrypoint.sh

CMD ["mongod", "--nojournal", "--noprealloc", "--smallfiles", "--dbpath=/dev/inmemory/mongo"]

The benefits of this classical integration test approach:

+ high confidence that your SQL statements are correct; can detect typos (wrong column, wrong table, etc)

+ isolated test, not testing business logic but only the data provider layer; can also test schema migrations

- not a good approach for databases with eventual consistency (eg. Clickhouse)

- since this is an integration test, it would be slower than a mock/fake unit test (1-3s+ total delay overhead when spawning docker)

Conclusion

  1. use mock for databases with eventual consistency

  2. prefer fake over mock for business logic correctness because it’s better for maintainability to test behavior (this input should give this output), instead of implementation details

  3. prefer classical testing over mock testing for checking data provider logic correctness

References

(aka confirmation bias :3)

https://martinfowler.com/articles/mocksArentStubs.html
https://stackoverflow.com/questions/1595166/why-is-it-so-bad-to-mock-classes
https://medium.com/javascript-scene/mocking-is-a-code-smell-944a70c90a6a
https://chemaclass.medium.com/to-mock-or-not-to-mock-af995072b22e
https://accu.org/journals/overload/23/127/balaam_2108/
https://news.ycombinator.com/item?id=7809402
https://philippe.bourgau.net/careless-mocking-considered-harmful/
https://debugged.it/blog/mockito-is-bad-for-your-code/
https://engineering.talkdesk.com/double-trouble-why-we-decided-against-mocking-498c915bbe1c
https://blog.thecodewhisperer.com/permalink/you-dont-hate-mocks-you-hate-side-effects
https://agilewarrior.wordpress.com/2015/04/18/classical-vs-mockist-testing/
https://www.slideshare.net/davidvoelkel/mockist-vs-classicists-tdd-57218553
https://www.thoughtworks.com/insights/blog/mockists-are-dead-long-live-classicists
https://stackoverflow.com/questions/184666/should-i-practice-mockist-or-classical-tdd
https://bencane.com/2020/06/15/dont-mock-a-db-use-docker-compose/
https://swizec.com/blog/what-i-learned-from-software-engineering-at-google/#stubs-and-mocks-make-bad-tests

https://www.freecodecamp.org/news/end-to-end-api-testing-with-docker/
https://medium.com/@june.pravin/mocking-is-not-practical-use-fakes-e30cc6eaaf4e 
https://www.c-sharpcorner.com/article/stub-vs-fake-vs-spy-vs-mock/

2021-03-13

Pyroscope: Continuous Tracing in Go, Python, or Ruby

Recently I stumbled upon a slow library/function problem and didn't know which part was causing it, and found out that there's an easy way to trace Go, Ruby, or Python code using Pyroscope. The feature set is a bit minimalist; there's no memory usage tracing yet, unlike in gops or pprof. Pyroscope consists of 2 parts: the server and the agent/client library (if using Golang) or executor (if using Ruby or Python). Here's how to run the Pyroscope server:

# run server using docker
docker run -it -p 4040:4040 pyroscope/pyroscope:latest server

And here's an example of how to use the client library/agent (modifying the Go source code, just like in DataDog or any other APM tool) or install the Pyroscope CLI to run Ruby/Python scripts:

// golang: add the agent inside the source code
import "github.com/pyroscope-io/pyroscope/pkg/agent/profiler"

func main() {
  profiler.Start(profiler.Config{
    ApplicationName: "my.app.server",
    ServerAddress:   "http://pyroscope:4040",
  })
  // rest of your code
}

# ruby or python, install CLI client 
cd /tmp
wget https://dl.pyroscope.io/release/pyroscope_0.0.28_amd64.deb
sudo apt-get install ./pyroscope_0.0.28_amd64.deb

# ruby
pyroscope exec ruby yourcode.rb

# python
pyroscope exec python yourcode.py

If you open the server URL (localhost:4040) in the browser, it shows something like this, so you can check which part of the code took most of the runtime.




2021-01-26

GOPS: Trace your Golang service with ease

GoPS is one alternative (also made by Google, other than pprof) to measure, trace, or diagnose the performance and memory usage of your Go-powered service/long-lived program. The usage is very easy: you just need to import it and add 3 lines in your main (so the gops command line can communicate with your program):

import "github.com/google/gops/agent"

func main() {
  if err := agent.Listen(agent.Options{}); err != nil {
    log.Fatal(err)
  }
  // remaining of your long-lived program logic
}

If you don't add those lines, you can still use gops, but only to get the list of Go programs running on your computer/server with limited statistics, using these commands:

$ go get -u -v github.com/google/gops

$ gops  # show the list of running golang program
1248    1       dnscrypt-proxy  go1.13.4  /usr/bin/dnscrypt-proxy
1259    1       containerd      go1.13.15 /usr/bin/containerd
18220   1       dockerd         go1.13.15 /usr/bin/dockerd
1342132 1306434 docker          go1.13.15 /usr/bin/docker

$ gops tree # show running process in tree

$ gops PID # check the stats and whether the program have GOPS agent

#########################################################
# these commands below only available
# if the binary compiled with GOPS agent
# PID can be replaced with GOPS host:port of that program

$ gops stack PID # get current stack trace of running PID

$ gops memstats PID # get memory statistics of running PID

$ gops gc PID # force garbage collection

$ gops setgc PID X # set GC percentage

$ gops pprof-cpu PID # get cpu profile graph
$ gops pprof-heap PID # get memory usage profile graph
profile saved at /tmp/heap_profile070676630
$ gops trace PID # get 5 sec execution trace

# you can install graphviz to visualize the cpu/memory profile
$ sudo apt install graphviz

# visualize the cpu/memory profile graph on the web browser
$ go tool pprof /tmp/heap_profile070676630
> web 

The next step is to analyze the call graph for memory leaks (which are mostly just a wrongly handled/forgotten defer of a response body/sql rows, holding a slice reference into a huge buffer, or a certain framework's cache trashing) or slow functions, whichever your mission is.

What if the Golang service you need to trace runs inside a Kubernetes pod whose GOPS address (host:port) is not exposed to the outside world? Kubernetes is a popular solution for companies that manage bunches of servers/microservices, or clouds (GKE, AKS, Amazon EKS, ACK, DOKS, etc), but it is obviously overkill for small companies that don't need to scale elastically (or whose servers number fewer than 10, or that don't use a microservice architecture).

First, you must compile gops statically so it can run inside an alpine container (which is what most people use):

$ cd $GOPATH/src/github.com/google/gops
$ export CGO_ENABLED=0
$ go mod vendor
$ go build

# copy gops to your kubernetes pod
$ export POD_NAME=blabla
$ kubectl cp ./gops $POD_NAME:/bin

# ssh/exec to your pod
$ kubectl exec -it $POD_NAME -- sh
$ gops

# for example you want to check heap profile for PID=1
$ gops pprof-heap 1
$ exit

# copy the trace file back to local, then you can analyze the dump
kubectl cp $POD_NAME:/tmp/heap_profile070676630 out.dump

But if the address and port are exposed, you can use gops directly from your computer to the pod, or create a tunnel inside the pod if it doesn't have a public IP, for example using ngrok.

Btw, if you know any companies migrating from/to a certain language (especially Go), framework, or database, you can contribute here.

2020-12-22

String Associative Array and CombSort Benchmark 2020 Edition

5 years after the last string associative benchmark and lesser string associative benchmark (measuring string concat operations and built-in associative array set and get), and the numeric comb sort benchmark and string comb sort benchmark (measuring basic array random access, string conversion, and array swap for numbers and strings), this year's edition uses a newer processor: an AMD Ryzen 3 3100 running 64-bit Ubuntu 20.04. Now with 10x more data, to hopefully make the benchmark run 10x slower (at least 1 sec); best of 3 runs.

alias time='/usr/bin/time -f "\nCPU: %Us\tReal: %es\tRAM: %MKB"'

$ php -v
PHP 7.4.3 (cli) (built: Oct  6 2020 15:47:56) ( NTS )

$ time php assoc.php 
637912 641149 67002
3808703 14182513 2343937
CPU: 1.25s      Real: 1.34s     RAM: 190644KB

$ python3 -V
Python 3.8.5

$ time python3 dictionary.py
637912 641149 67002
3808703 14182513 2343937
CPU: 5.33s      Real: 5.47s     RAM: 314564KB

$ ruby3.0 -v
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-linux-gnu]

$ time ruby3.0 --jit hash.rb 
637912 641149 67002
3808703 14182513 2343937
CPU: 6.50s      Real: 5.94s     RAM: 371832KB

$ go version
go version go1.14.7 linux/amd64

$ time go run map.go
637912 641149 67002
3808703 14182513 2343937
CPU: 1.79s      Real: 1.56s     RAM: 257440KB

$ node -v       
v14.15.2

$ time node object.js
637912 641149 67002
3808703 14182513 2343937
CPU: 2.24s      Real: 2.21s     RAM: 326636KB

$ luajit -v 
LuaJIT 2.1.0-beta3 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org/

$ time luajit table.lua
637912  641149  67002
3808703 14182513        2343937
CPU: 4.11s      Real: 4.22s     RAM: 250828KB

$ dart --version
Dart SDK version: 2.10.4 (stable) (Unknown timestamp) on "linux_x64"

$ time dart map.dart
637912 641149 67002
3808703 14182513 2343937
CPU: 2.99s      Real: 2.91s     RAM: 385496KB

$ v version
V 0.2 36dcace

$ time v run map.v
637912, 641149, 67002
3808703, 14182513, 2343937
CPU: 4.79s      Real: 5.28s     RAM: 1470668KB

$ tcc -v
tcc version 0.9.27 (x86_64 Linux)

$ time tcc -run uthash.c
637912 641149 67002
3808703 14182513 2343937
Command exited with non-zero status 25

CPU: 2.52s      Real: 2.61s     RAM: 291912KB

export GOPHERJS_GOROOT="$(go1.12.16 env GOROOT)"
$ npm install --global source-map-support

$ gopherjs version
GopherJS 1.12-3

$ time gopherjs 
637912 641149 67002
3808703 14182513 2343937

CPU: 14.13s     Real: 12.01s    RAM: 597712KB

$ java -version
java version "14.0.2" 2020-07-14
Java(TM) SE Runtime Environment (build 14.0.2+12-46)
Java HotSpot(TM) 64-Bit Server VM (build 14.0.2+12-46, mixed mode, sharing)

$ time java hashmap.java
637912 641149 67002
3808703 14182513 2343937

CPU: 5.18s      Real: 1.63s     RAM: 545412KB

The results show a huge improvement for PHP since the old 5.4. NodeJS also improved hugely compared to the old 0.10. The rest are pretty much the same. Also keep in mind that Golang and V include build/compile time, not just run duration, and it seems V's performance is really bad when it comes to string operations (the compile itself is really fast, less than 1s for 36dcace -- using gcc 9.3.0).
Next we benchmark the comb sort implementation. But this time we use the jit version of ruby 2.7, since it's far faster (19s vs 26s and 58s vs 66s for the string benchmark); for ruby 3.0 we always use the jit version since it's faster than non-jit. For C (TCC), which doesn't have a built-in associative array, I used uthash, because it's the most popular. TinyGo did not complete the first benchmark after more than 1000s, sometimes segfaulting. The XS JavaScript engine failed to give correct results; engine262 also failed to finish within 1000s.

Language   | Command & Flags           | Version               | Assoc (s) | RAM (KB)  | Num Comb (s) | RAM (KB)  | Str Comb (s) | RAM (KB)  | Total (s) | RAM (KB)
Go         | go run                    | 1.14.7                | 1.56      | 257,440   | 0.73         | 82,844    | 4.74         | 245,432   | 7.03      | 585,716
Go         | go run                    | 1.15.6                | 1.73      | 256,620   | 0.78         | 82,896    | 4.86         | 245,468   | 7.37      | 584,984
Nim        | nim r -d:release --gc:arc | 1.4.2                 | 1.56      | 265,172   | 0.79         | 79,284    | 5.77         | 633,676   | 8.12      | 978,132
Nim        | nim r -d:release --gc:orc | 1.4.2                 | 1.53      | 265,160   | 0.94         | 79,380    | 5.83         | 633,636   | 8.30      | 978,176
Javascript | node                      | 14.15.2               | 2.21      | 327,048   | 0.87         | 111,972   | 6.13         | 351,520   | 9.21      | 790,540
Crystal    | crystal run --release     | 0.35.1                | 1.81      | 283,648   | 1.44         | 146,700   | 6.09         | 440,796   | 9.34      | 871,144
Javascript | ~/.esvu/bin/v8            | 8.9.201               | 1.77      | 177,748   | 0.89         | 105,416   | 6.71         | 335,236   | 9.37      | 618,400
C          | tcc -run                  | 0.9.27                | 2.61      | 291,912   | 1.45         | 80,832    | 6.40         | 393,352   | 10.46     | 766,096
Java       | java                      | 14.0.2 2020-07-14     | 1.63      | 545,412   | 1.50         | 165,864   | 7.69         | 743,572   | 10.82     | 1,454,848
Nim        | nim r -d:release          | 1.4.2                 | 1.91      | 247,456   | 0.96         | 79,476    | 8.38         | 1,211,116 | 11.25     | 1,538,048
Dart       | dart                      | 2.10.4                | 2.91      | 385,496   | 1.61         | 191,916   | 7.31         | 616,716   | 11.83     | 1,194,128
Python     | pypy                      | 7.3.1+dfsg-2          | 2.19      | 331,776   | 2.83         | 139,740   | 8.04         | 522,648   | 13.06     | 994,164
Javascript | ~/.esvu/bin/chakra        | 1.11.24.0             | 2.73      | 487,400   | 1.27         | 102,192   | 11.27        | 803,168   | 15.27     | 1,392,760
Javascript | ~/.esvu/bin/jsc           | 271117                | 5.90      | 593,624   | 0.68         | 111,972   | 9.09         | 596,088   | 15.67     | 1,301,684
V          | v -prod run               | 0.2 32091dd gcc-10.2  | 4.78      | 1,469,932 | 1.86         | 79,376    | 14.06        | 1,560,516 | 20.70     | 3,109,824
Lua        | luajit                    | 2.1.0-beta3           | 4.11      | 250,828   | 3.76         | 133,424   | 12.91        | 511,196   | 20.78     | 895,448
Javascript | ~/.esvu/bin/sm            | JavaScript-C86.0a1    | 5.61      | 378,064   | 1.40         | 96,480    | 13.81        | 393,376   | 20.82     | 867,920
V          | v -prod run               | 0.2 32091dd gcc-9.3   | 5.05      | 1,469,936 | 2.14         | 79,408    | 14.62        | 1,560,484 | 21.81     | 3,109,828
Javascript | ~/.esvu/bin/graaljs       | CE Native 20.3.0      | 7.78      | 958,380   | 4.45         | 405,900   | 14.31        | 911,220   | 26.54     | 2,275,500
Go         | gopherjs run              | 1.12-3 (node 14.15.2) | 11.76     | 594,896   | 2.04         | 119,604   | 18.46        | 397,396   | 32.26     | 1,111,896
Nim        | nim r                     | 1.4.2                 | 6.60      | 247,444   | 3.05         | 79,332    | 31.85        | 1,211,208 | 41.50     | 1,537,984
PHP        | php                       | 7.4.3                 | 1.34      | 190,644   | 10.11        | 328,452   | 34.51        | 641,664   | 45.96     | 1,160,760
Ruby       | truffleruby               | 21.1.0-dev-c1517c55   | 14.54     | 2,456,156 | 3.09         | 453,152   | 29.27        | 3,660,284 | 46.90     | 6,569,592
Crystal    | crystal run               | 0.35.1                | 5.69      | 284,328   | 12.00        | 153,828   | 31.69        | 441,740   | 49.38     | 879,896
Javascript | ~/.esvu/bin/quickjs       | 2020-11-08            | 3.90      | 252,484   | 23.48        | 80,772    | 34.80        | 471,624   | 62.18     | 804,880
V          | v run                     | 0.2 36dcace gcc-9.3   | 5.28      | 1,470,668 | 6.60         | 80,232    | 58.99        | 1,561,176 | 70.87     | 3,112,076
Lua        | lua                       | 5.3.3                 | 5.98      | 366,516   | 27.26        | 264,648   | 46.05        | 864,300   | 79.29     | 1,495,464
Ruby       | ruby                      | 2.7.0p0               | 6.31      | 371,456   | 19.29        | 100,536   | 58.82        | 694,560   | 84.42     | 1,166,552
Python     | python3                   | 3.8.5                 | 5.47      | 314,564   | 33.96        | 404,976   | 47.79        | 722,820   | 87.22     | 1,442,360
Ruby       | jruby                     | 9.2.9.0               | 7.45      | 1,878,184 | 34.11        | 1,976,844 | 59.83        | 7,115,448 | 101.39    | 10,970,476
Ruby       | ruby                      | 3.0.0p0               | 5.94      | 371,832   | 24.87        | 92,844    | 74.32        | 1,015,096 | 105.13    | 1,479,772
Go         | tinygo run                | 0.16.0                | 999.99    | 318,148   | 3.68         | 300,548   | 252.34       | 711,340   | 1256.01   | 1,330,036

Golang is still the winner (obviously, since it's compiled), then Nim (compiled); the next best JIT or interpreter is NodeJS, then Crystal (compiled, not JIT), v8, followed by Java, TCC (compiled), Dart, PyPy, V (compiled, not JIT), LuaJIT, PHP, Ruby, and Python3. The recap spreadsheet can be accessed here.

FAQ:
1. Why do you measure the compile duration too? Because developer experience (the feedback loop) also matters, at least for me.
2. Why not warm up the VM first? Each implementation has its own advantages and disadvantages.
3. Why is there no C++, VB.NET, C#, D, or Object-Pascal? I don't want to compile things manually (since there's no build-and-run command in one flag).
4. Why is there no Kotlin, Scala, Rust, Pony, Swift, Groovy, Julia, Crystal, or Zig? Too lazy to add :3 you can contribute though (create a pull request, and I'll run the benchmark again, preferably when there's a precompiled binary/deb/apt/ppa repository for the compiler/interpreter).

Contributors
ilmanzo (Nim, Crystal, D)