Let's compare 1 million goroutines with 1 million tasks: which one is more efficient in CPU usage and memory usage? The code is forked from kjpgit's techdemo.
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 38.96 0.68 582,951 636,136 1:00
c#_2t 88.33 0.95 623,956 820,620 1:02
c#_4t 142.86 1.09 687,365 814,028 1:03
c#_8t 235.80 1.71 669,882 820,704 1:05
c#_16t 434.76 4.01 734,545 771,240 1:08
c#_32t 717.39 4.81 720,235 769,888 1:11
go_1t 58.77 0.65 2,635,380 2,635,632 1:04
go_2t 64.48 0.71 2,639,206 2,642,752 1:00
go_4t 72.55 1.42 2,651,086 2,654,972 1:00
go_8t 80.87 2.82 2,641,664 2,643,392 1:00
go_16t 83.18 4.03 2,673,404 2,681,100 1:00
go_32t 86.65 4.30 2,645,494 2,657,580 1:00
The result is as expected, because all tasks/goroutines are spawned first, before any processing starts: Go's scheduler is more efficient in CPU usage, but the C# runtime is more efficient in memory usage. That is normal, though, because each goroutine needs a minimum of 2KB (its initial stack), which costs far more than spawning a task. What if we increase to 10 million tasks/goroutines and do the spawning in another task/goroutine, so that finished goroutines can return their memory to the GC? Here's the result:
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 12.78 1.28 2,459,190 5,051,528 0:13
c#_2t 22.60 1.54 2,692,439 5,934,796 0:18
c#_4t 42.09 1.54 2,370,239 5,538,280 0:21
c#_8t 88.54 2.29 2,522,053 6,334,176 0:29
c#_16t 204.39 3.32 2,395,001 5,803,808 0:34
c#_32t 259.09 3.25 1,842,458 4,710,012 0:28
go_1t 13.97 0.97 4,514,200 6,151,088 0:14
go_2t 12.35 1.51 5,595,418 9,506,076 0:07
go_4t 22.09 2.40 6,394,162 12,517,848 0:07
go_8t 31.00 3.09 7,115,281 13,428,344 0:06
go_16t 40.32 3.52 7,126,851 13,764,940 0:06
go_32t 58.58 3.58 7,104,882 12,145,396 0:06
The result seems normal: the high memory usage comes from many goroutines being spawned at the same time from a separate thread without blocking the main thread, and once they finish, they get collected by the GC. (Previously the exit condition was time-based; this time the program exits once all processing is done, since I moved the sleep before the atomic increment.) What if we go back down to 1 million, keeping the same exit rule, spawning from a separate task/goroutine, and checking for completion every 1s? Here's the result:
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 1.18 0.16 328,134 511,652 0:02
c#_2t 2.18 0.22 294,608 554,488 0:02
c#_4t 3.19 0.20 305,336 554,064 0:02
c#_8t 7.77 0.31 292,281 530,368 0:02
c#_16t 12.33 0.25 304,352 569,460 0:02
c#_32t 37.90 1.25 337,837 684,252 0:03
go_1t 2.72 0.42 1,592,978 2,519,040 0:03
go_2t 3.04 0.47 1,852,084 2,637,532 0:03
go_4t 3.65 0.54 1,936,626 2,637,272 0:03
go_8t 3.27 0.59 1,768,540 2,655,208 0:02
go_16t 4.01 0.71 1,770,673 2,664,504 0:02
go_32t 4.96 0.72 1,770,354 2,669,244 0:02
The difference in processing time is negligible, but the CPU and memory usage are in stark contrast. Next, let's try spawning in bursts (100K per second) by adding a 1-second sleep after every 100K tasks/goroutines, since receiving that many at once is not realistic even for a DDoSed server (unless the server is finely tuned). Here's the result:
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 0.61 0.08 146,849 284,436 0:05
c#_2t 1.17 0.10 131,778 261,720 0:05
c#_4t 1.53 0.08 133,505 289,584 0:05
c#_8t 4.17 0.15 131,924 284,960 0:05
c#_16t 10.94 0.68 135,446 289,028 0:05
c#_32t 19.86 3.01 130,533 284,924 0:05
go_1t 1.84 0.24 731,872 1,317,796 0:06
go_2t 1.87 0.26 659,382 1,312,220 0:05
go_4t 2.00 0.30 661,296 1,322,152 0:05
go_8t 2.37 0.34 660,641 1,324,684 0:05
go_16t 2.82 0.39 660,225 1,323,932 0:05
go_32t 3.36 0.45 659,176 1,327,264 0:05
And for 5 million:
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 3.39 0.24 309,103 573,772 0:11
c#_2t 8.30 0.26 278,683 553,592 0:11
c#_4t 13.65 0.32 274,679 658,104 0:11
c#_8t 23.20 0.46 286,336 641,376 0:12
c#_16t 45.85 1.32 286,311 640,336 0:12
c#_32t 64.83 2.46 264,866 615,552 0:12
go_1t 6.25 0.50 1,397,434 2,629,936 0:13
go_2t 6.20 0.56 1,386,336 2,631,580 0:11
go_4t 7.52 0.65 1,410,523 2,625,308 0:11
go_8t 8.21 0.86 1,441,080 2,779,456 0:11
go_16t 11.17 0.96 1,436,220 2,687,908 0:11
go_32t 12.97 1.06 1,430,573 2,668,816 0:11
And for 25 million:
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 15.94 0.69 590,411 1,190,340 0:24
c#_2t 34.88 0.84 699,288 1,615,372 0:32
c#_4t 59.95 0.89 761,308 1,794,116 0:34
c#_8t 100.64 1.36 758,161 1,845,944 0:36
c#_16t 199.56 2.99 765,791 2,014,856 0:38
c#_32t 332.02 4.07 811,809 1,972,400 0:41
go_1t 21.76 0.71 2,846,565 4,413,968 0:29
go_2t 25.77 1.03 2,949,433 5,553,608 0:25
go_4t 28.74 1.24 2,920,447 5,800,088 0:24
go_8t 37.28 1.96 2,869,074 5,502,776 0:23
go_16t 43.46 2.67 2,987,114 5,769,356 0:24
go_32t 43.77 2.92 3,027,179 5,867,084 0:24
How about 25 million with a sleep every 200K?
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 18.47 0.91 842,492 1,820,788 0:22
c#_2t 40.32 0.93 1,070,555 2,454,324 0:31
c#_4t 62.39 1.16 1,103,741 2,581,476 0:33
c#_8t 100.84 1.34 1,074,820 2,377,580 0:34
c#_16t 218.26 2.91 1,062,642 2,726,700 0:37
c#_32t 339.00 6.51 1,042,254 2,275,644 0:40
go_1t 22.61 0.88 3,474,195 5,071,944 0:27
go_2t 25.83 1.20 3,912,071 6,964,640 0:20
go_4t 37.98 1.68 4,180,188 7,392,800 0:20
go_8t 38.56 2.44 4,189,265 8,481,852 0:18
go_16t 44.49 3.19 4,187,142 8,483,236 0:18
go_32t 48.82 3.44 4,218,591 8,424,200 0:18
And lastly, 25 million with a sleep every 400K:
Name UserCPU SysCPU AvgRSS MaxRSS Wall
c#_1t 18.66 0.98 1,183,313 2,622,464 0:20
c#_2t 41.27 1.14 1,326,415 3,155,948 0:31
c#_4t 67.21 1.11 1,436,280 3,015,212 0:33
c#_8t 107.14 1.56 1,492,179 3,378,688 0:35
c#_16t 233.50 2.45 1,498,421 3,732,368 0:41
c#_32t 346.87 3.74 1,335,756 2,882,676 0:39
go_1t 24.13 0.82 4,048,937 5,099,220 0:26
go_2t 28.85 1.41 4,936,677 8,023,568 0:18
go_4t 31.51 1.95 5,193,653 9,537,080 0:14
go_8t 45.27 2.65 5,461,107 9,499,308 0:14
go_16t 53.43 3.19 5,183,009 9,476,084 0:14
go_32t 61.98 3.86 5,589,156 10,587,788 0:14
How to read the results above: Wall = time needed to complete (m:ss), lower is better; AvgRSS/MaxRSS = average/maximum memory usage (resident set size, in KB), lower is better; UserCPU = CPU time used in percent, where more than 100% means more than one full core's worth of compute time was used, lower is better; SysCPU = the same for time spent in the kernel, lower is better. Versions used in this benchmark:
go version go1.17.6 linux/amd64
dotnet --version