2020-12-22

String Associative Array and CombSort Benchmark 2020 Edition

Five years after the last string associative array benchmark and lesser string associative array benchmark (measuring string concat operations and built-in associative array set and get), and the numeric comb sort benchmark and string comb sort benchmark (measuring basic array random access, string conversion, and array swap for numbers and strings), this year's edition uses a newer processor: an AMD Ryzen 3 3100 running 64-bit Ubuntu 20.04. Now with 10x more data to hopefully make the benchmark run 10x slower (at least 1 sec); results are the best of 3 runs.
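
The benchmark sources aren't reproduced here; as a rough illustration only (the real harness, key generation, and data sizes differ), the associative-array part measures this kind of workload, shown as a minimal Go sketch:

// Hypothetical sketch of the measured workload: string conversion/concatenation
// plus built-in associative array (map) set and get. Not the actual benchmark code.
package main

import (
    "fmt"
    "strconv"
)

func main() {
    const n = 1000000 // assumption: the real benchmark uses different sizes
    m := map[string]int{}
    for i := 0; i < n; i++ {
        key := "key" + strconv.Itoa(i) // string conversion + concatenation
        m[key] = i                     // associative array set
    }
    sum := 0
    for i := 0; i < n; i++ {
        sum += m["key"+strconv.Itoa(i)] // associative array get
    }
    fmt.Println(len(m), sum)
}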

alias time='/usr/bin/time -f "\nCPU: %Us\tReal: %es\tRAM: %MKB"'

$ php -v
PHP 7.4.3 (cli) (built: Oct  6 2020 15:47:56) ( NTS )

$ time php assoc.php 
637912 641149 67002
3808703 14182513 2343937
CPU: 1.25s      Real: 1.34s     RAM: 190644KB

$ python3 -V
Python 3.8.5

$ time python3 dictionary.py
637912 641149 67002
3808703 14182513 2343937
CPU: 5.33s      Real: 5.47s     RAM: 314564KB

$ ruby3.0 -v
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-linux-gnu]

$ time ruby3.0 --jit hash.rb 
637912 641149 67002
3808703 14182513 2343937
CPU: 6.50s      Real: 5.94s     RAM: 371832KB

$ go version
go version go1.14.7 linux/amd64

$ time go run map.go
637912 641149 67002
3808703 14182513 2343937
CPU: 1.79s      Real: 1.56s     RAM: 257440KB

$ node -v       
v14.15.2

$ time node object.js
637912 641149 67002
3808703 14182513 2343937
CPU: 2.24s      Real: 2.21s     RAM: 326636KB

$ luajit -v 
LuaJIT 2.1.0-beta3 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org/

$ time luajit table.lua
637912  641149  67002
3808703 14182513        2343937
CPU: 4.11s      Real: 4.22s     RAM: 250828KB

$ dart --version
Dart SDK version: 2.10.4 (stable) (Unknown timestamp) on "linux_x64"

$ time dart map.dart
637912 641149 67002
3808703 14182513 2343937
CPU: 2.99s      Real: 2.91s     RAM: 385496KB

$ v version
V 0.2 36dcace

$ time v run map.v
637912, 641149, 67002
3808703, 14182513, 2343937
CPU: 4.79s      Real: 5.28s     RAM: 1470668KB

$ tcc -v
tcc version 0.9.27 (x86_64 Linux)

$ time tcc -run uthash.c
637912 641149 67002
3808703 14182513 2343937
Command exited with non-zero status 25

CPU: 2.52s      Real: 2.61s     RAM: 291912KB

export GOPHERJS_GOROOT="$(go1.12.16 env GOROOT)"
$ npm install --global source-map-support

$ gopherjs version
GopherJS 1.12-3

$ time gopherjs 
637912 641149 67002
3808703 14182513 2343937

CPU: 14.13s     Real: 12.01s    RAM: 597712KB

$ java -version
java version "14.0.2" 2020-07-14
Java(TM) SE Runtime Environment (build 14.0.2+12-46)
Java HotSpot(TM) 64-Bit Server VM (build 14.0.2+12-46, mixed mode, sharing)

$ time java hashmap.java
637912 641149 67002
3808703 14182513 2343937

CPU: 5.18s      Real: 1.63s     RAM: 545412KB

The results show a huge improvement for PHP since the old 5.4. NodeJS also shows a huge improvement compared to the old 0.10. The rest are roughly the same. Also please note that the Golang and V timings include build/compile time, not just run duration, and it seems V's performance is really bad when it comes to string operations (the compile itself is really fast, less than 1s for 36dcace -- using gcc 9.3.0). 
Next we benchmark the comb sort implementations, but this time we use the JIT version of Ruby 2.7, since it's far faster (19s vs 26s, and 58s vs 66s for the string benchmark); for Ruby 3.0 we always use the JIT version since it's faster than non-JIT. For C (TCC), which doesn't have a built-in associative array, I used uthash, because it's the most popular. TinyGo did not complete the first benchmark after more than 1000s and sometimes segfaults. The XS JavaScript engine failed to give a correct result, and engine262 also failed to finish within 1000s.
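
For reference, comb sort is bubble sort with a shrinking gap; below is a minimal Go sketch of the algorithm (illustration only -- each benchmarked language implements the same idea, and the string variant additionally does string conversion and comparison):

// Minimal comb sort sketch, not the exact benchmarked code.
package main

import "fmt"

func combSort(a []int) {
    gap := len(a)
    swapped := true
    for gap > 1 || swapped {
        gap = gap * 10 / 13 // shrink factor ~1.3
        if gap < 1 {
            gap = 1
        }
        swapped = false
        for i := 0; i+gap < len(a); i++ {
            if a[i] > a[i+gap] {
                a[i], a[i+gap] = a[i+gap], a[i] // array swap
                swapped = true
            }
        }
    }
}

func main() {
    a := []int{5, 1, 4, 2, 8, 0, 2}
    combSort(a)
    fmt.Println(a) // [0 1 2 2 4 5 8]
}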

(times are Real seconds, RAM is peak usage in KB, sorted by total time)
Language | Command Flags | Version | Assoc | RAM | Num Comb | RAM | Str Comb | RAM | Total | RAM
Go | go run | 1.14.7 | 1.56 | 257,440 | 0.73 | 82,844 | 4.74 | 245,432 | 7.03 | 585,716
Go | go run | 1.15.6 | 1.73 | 256,620 | 0.78 | 82,896 | 4.86 | 245,468 | 7.37 | 584,984
Nim | nim r -d:release --gc:arc | 1.4.2 | 1.56 | 265,172 | 0.79 | 79,284 | 5.77 | 633,676 | 8.12 | 978,132
Nim | nim r -d:release --gc:orc | 1.4.2 | 1.53 | 265,160 | 0.94 | 79,380 | 5.83 | 633,636 | 8.30 | 978,176
Javascript | node | 14.15.2 | 2.21 | 327,048 | 0.87 | 111,972 | 6.13 | 351,520 | 9.21 | 790,540
Crystal | crystal run --release | 0.35.1 | 1.81 | 283,648 | 1.44 | 146,700 | 6.09 | 440,796 | 9.34 | 871,144
Javascript | ~/.esvu/bin/v8 | 8.9.201 | 1.77 | 177,748 | 0.89 | 105,416 | 6.71 | 335,236 | 9.37 | 618,400
C | tcc -run | 0.9.27 | 2.61 | 291,912 | 1.45 | 80,832 | 6.40 | 393,352 | 10.46 | 766,096
Java | java | 14.0.2 2020-07-14 | 1.63 | 545,412 | 1.50 | 165,864 | 7.69 | 743,572 | 10.82 | 1,454,848
Nim | nim r -d:release | 1.4.2 | 1.91 | 247,456 | 0.96 | 79,476 | 8.38 | 1,211,116 | 11.25 | 1,538,048
Dart | dart | 2.10.4 | 2.91 | 385,496 | 1.61 | 191,916 | 7.31 | 616,716 | 11.83 | 1,194,128
Python | pypy | 7.3.1+dfsg-2 | 2.19 | 331,776 | 2.83 | 139,740 | 8.04 | 522,648 | 13.06 | 994,164
Javascript | ~/.esvu/bin/chakra | 1.11.24.0 | 2.73 | 487,400 | 1.27 | 102,192 | 11.27 | 803,168 | 15.27 | 1,392,760
Javascript | ~/.esvu/bin/jsc | 271117 | 5.90 | 593,624 | 0.68 | 111,972 | 9.09 | 596,088 | 15.67 | 1,301,684
V | v -prod run | 0.2 32091dd gcc-10.2 | 4.78 | 1,469,932 | 1.86 | 79,376 | 14.06 | 1,560,516 | 20.70 | 3,109,824
Lua | luajit | 2.1.0-beta3 | 4.11 | 250,828 | 3.76 | 133,424 | 12.91 | 511,196 | 20.78 | 895,448
Javascript | ~/.esvu/bin/sm | JavaScript-C86.0a1 | 5.61 | 378,064 | 1.40 | 96,480 | 13.81 | 393,376 | 20.82 | 867,920
V | v -prod run | 0.2 32091dd gcc-9.3 | 5.05 | 1,469,936 | 2.14 | 79,408 | 14.62 | 1,560,484 | 21.81 | 3,109,828
Javascript | ~/.esvu/bin/graaljs | CE Native 20.3.0 | 7.78 | 958,380 | 4.45 | 405,900 | 14.31 | 911,220 | 26.54 | 2,275,500
Go | gopherjs run | 1.12-3 (node 14.15.2) | 11.76 | 594,896 | 2.04 | 119,604 | 18.46 | 397,396 | 32.26 | 1,111,896
Nim | nim r | 1.4.2 | 6.60 | 247,444 | 3.05 | 79,332 | 31.85 | 1,211,208 | 41.50 | 1,537,984
PHP | php | 7.4.3 | 1.34 | 190,644 | 10.11 | 328,452 | 34.51 | 641,664 | 45.96 | 1,160,760
Ruby | truffleruby | 21.1.0-dev-c1517c55 | 14.54 | 2,456,156 | 3.09 | 453,152 | 29.27 | 3,660,284 | 46.90 | 6,569,592
Crystal | crystal run | 0.35.1 | 5.69 | 284,328 | 12.00 | 153,828 | 31.69 | 441,740 | 49.38 | 879,896
Javascript | ~/.esvu/bin/quickjs | 2020-11-08 | 3.90 | 252,484 | 23.48 | 80,772 | 34.80 | 471,624 | 62.18 | 804,880
V | v run | 0.2 36dcace gcc-9.3 | 5.28 | 1,470,668 | 6.60 | 80,232 | 58.99 | 1,561,176 | 70.87 | 3,112,076
Lua | lua | 5.3.3 | 5.98 | 366,516 | 27.26 | 264,648 | 46.05 | 864,300 | 79.29 | 1,495,464
Ruby | ruby | 2.7.0p0 | 6.31 | 371,456 | 19.29 | 100,536 | 58.82 | 694,560 | 84.42 | 1,166,552
Python | python3 | 3.8.5 | 5.47 | 314,564 | 33.96 | 404,976 | 47.79 | 722,820 | 87.22 | 1,442,360
Ruby | jruby | 9.2.9.0 | 7.45 | 1,878,184 | 34.11 | 1,976,844 | 59.83 | 7,115,448 | 101.39 | 10,970,476
Ruby | ruby | 3.0.0p0 | 5.94 | 371,832 | 24.87 | 92,844 | 74.32 | 1,015,096 | 105.13 | 1,479,772
Go | tinygo run | 0.16.0 | 999.99 | 318,148 | 3.68 | 300,548 | 252.34 | 711,340 | 1256.01 | 1,330,036

Golang is still the winner (obviously, since it's compiled), followed by Nim (compiled); the next best JIT or interpreter is NodeJS, then Crystal (compiled, not JIT), v8, followed by Java, TCC (compiled), Dart, PyPy, V (compiled, not JIT), LuaJIT, PHP, Ruby, and Python3. The recap spreadsheet can be accessed here.

FAQ:
1. Why measure the compile duration too? Because developer experience (the feedback loop) is also important, at least for me.
2. Why not warm up the VM first? Each implementation has its own advantages and disadvantages.
3. Why is there no C++, VB.NET, C#, D, or Object-Pascal? I don't want to compile things separately (since there's no build-and-run command in one flag).
4. Why is there no Kotlin, Scala, Rust, Pony, Swift, Groovy, Julia, or Zig? Too lazy to add :3 but you can contribute (create a pull request, then I'll run the benchmark again, preferably when there's a precompiled binary/deb/apt/ppa repository for the compiler/interpreter).

Contributors
ilmanzo (Nim, Crystal, D)

2020-11-23

Tarantool: In-memory SQL Hybrid Database

Tarantool is an awesome product that I've avoided for a while (because it uses Lua), but after a long time seeking an awesome free silver-bullet database (2016's benchmark here), I think only Tarantool fits the bill (so I need to learn Lua 5.1 first, since Tarantool uses LuaJIT, which implements Lua 5.1 and some parts of Lua 5.2). The benefits of using Tarantool are:
  • ACID Transaction, like standard RDBMS and Aerospike
  • Replication, very easy load balance, like Aerospike, Redis, etc
  • On-Disk Persistence, like Redis Enterprise, Aerospike, KeyDB Pro
  • Extension Modules, like Redis (RediGraph, RediSearch, RediSQL, etc)
  • Can use SQL, so no need to learn new language, unlike MongoDB, ArangoDB, or databases that only support partial/subset of SQL: ScyllaDB CQL and GridDB TQL, eg. no WHERE-IN.
  • Combining both in-memory and disk storage, like paid version of Redis, KeyDB
  • Can use Go :3 unlike GridDB that only have partial support
  • Extremely customizable using Lua, even on the server part
  • Less memory required when using the vinyl engine (for datasets larger than RAM) compared to in-memory solutions like Aerospike (64 bytes per key requirement) or Redis swap, sacrificing I/O performance.
Some of the cons of using Tarantool: you cannot prepend or drop fields/columns, you can only append, and the new field should be nullable until every row has a value for it.
To install Tarantool on Ubuntu (without Docker) you can use these commands:

curl -L https://tarantool.io/odVEwx/release/2.4/installer.sh | bash
curl -L https://tarantool.io/odVEwx/live/2.5/installer.sh | bash
sudo apt install tarantool
sudo systemctl enable tarantool
sudo systemctl start tarantool
netstat -antp | grep 3301

If it doesn't work or gets stuck, just Ctrl-C the curl script and install manually from these repos:

echo '
deb https://download.tarantool.org/odVEwx/tarantool/2.5/ubuntu/ focal main
deb-src https://download.tarantool.org/odVEwx/tarantool/2.5/ubuntu/ focal main
deb https://download.tarantool.org/tarantool/modules/ubuntu/ focal main
deb-src https://download.tarantool.org/tarantool/modules/ubuntu/ focal main
' | sudo tee -a /etc/apt/sources.list.d/tarantool_2_5.list
sudo apt update
sudo apt install tarantool

To connect to an existing Tarantool instance, we can use this command:

tarantoolctl connect 3301

-- create "table"
s = box.schema.space.create('tester')
-- s equal to box.space.tester

-- create "schema"
s:format({
  {name = 'id', type = 'unsigned'},
  {name = 'band_name', type = 'string'},
  {name = 'year', type = 'unsigned'}
})

-- create primary index
s:create_index('primary', {
  type = 'hash',
  parts = {'id'}
})

-- insert records
s:insert{1, 'Roxette', 1986}
s:insert{2, 'Scorpions', 2015}
s:insert{3, 'Ace of Base', 1993}

-- query using primary index
s:select{3}

-- create secondary index
s:create_index('secondary', {
  type = 'hash',
  parts = {'band_name'}
})

-- query using secondary index
s.index.secondary:select{'Scorpions'}

-- grant guest user to access anywhere
box.schema.user.grant('guest', 'read,write,execute', 'universe')

-- reset the admin password; only works when a listen port is configured,
-- eg. from /etc/tarantool/instances.enabled/example.lua
box.schema.user.passwd('pass')

-- alter table 
box.space.tester:format({
  {name = 'id', type = 'unsigned'},
  {name = 'band_name', type = 'string'},
  {name = 'year', type = 'unsigned'},
  {name = 'rate', type = 'unsigned', is_nullable=true}})

-- update record id=1, 4th column to 5
box.space.tester:update(1, {{'=', 4, 5}})

-- update record id=2, 4th column to 4
box.space.tester:update(2, {{'=', 4, 4}})

-- search id>1
box.space.tester:select(1, {iterator = 'GT'})
-- can be LT, LE, EQ, REQ (reverse equal), GE, GT.
-- must have TREE index

Next, how do we connect from Golang? First we need to install the go-tarantool library:

go get -u -v github.com/tarantool/go-tarantool
# or: go get -u -v github.com/viciious/go-tarantool

Next, let's create a Go file:

package main
import (
    "log"
    "fmt"
    "github.com/tarantool/go-tarantool"
)
func main() {
    conn, err := tarantool.Connect("127.0.0.1:3301", tarantool.Opts{
        User: "guest",
        Pass: "",
    })
    if err != nil {
        log.Fatalf("Connection refused "+err.Error())
    }
    defer conn.Close()

    // insert
    resp, err := conn.Insert("tester", []interface{}{4, "ABBA", 1972})
    fmt.Printf("Insert: %#v %v\n",resp.Data,err)

    // select offset=0, limit=1
    resp, err = conn.Select("tester", "primary", 0, 1, tarantool.IterEq, []interface{}{4})
    fmt.Printf("Select: %#v %v\n",resp.Data,err)
    resp, err = conn.Select("tester","secondary", 0, 1, tarantool.IterEq, []interface{}{"ABBA"})
    fmt.Printf("Select: %#v %v\n",resp.Data,err)

    // add 3 to field 2 (0-based), i.e. the year
    resp, err = conn.Update("tester", "primary", []interface{}{4}, []interface{}{[]interface{}{"+", 2, 3}})
    fmt.Printf("Update: %#v %v\n",resp.Data,err)

    // replace
    resp, err = conn.Replace("tester", []interface{}{4, "New band", 2011})
    fmt.Printf("Replace: %#v %v\n",resp.Data,err)

    // upsert: update increment col 2 by 5, or insert if not exists
    // does not return data back
    resp, err = conn.Upsert("tester", []interface{}{4, "Another band", 2000}, []interface{}{[]interface{}{"+", 2, 5}})
    fmt.Printf("Upsert: %#v %v\n",resp.Data,err)

    // delete
    resp, err = conn.Delete("tester", "primary", []interface{}{4})
    fmt.Printf("Delete: %#v %v\n",resp.Data,err)

    // call directly, do not add () when calling
    resp, err = conn.Call("box.space.tester:count", []interface{}{})
    fmt.Printf("Call: %#v %v\n",resp.Data,err)

    // eval Lua 
    resp, err = conn.Eval("return 4 + 5", []interface{}{})
    fmt.Printf("Eval: %#v %v\n",resp.Data,err)
}

It would give an output something like this:

Insert: []interface {}{[]interface {}{0x4, "ABBA", 0x7b4}} <nil>
Select: []interface {}{[]interface {}{0x4, "ABBA", 0x7b4}} <nil>
Select: []interface {}{[]interface {}{0x4, "ABBA", 0x7b4}} <nil>
Update: []interface {}{[]interface {}{0x4, "ABBA", 0x7b7}} <nil>
Replace: []interface {}{[]interface {}{0x4, "New band", 0x7db}} <nil>
Upsert: []interface {}{} <nil>
Delete: []interface {}{[]interface {}{0x4, "New band", 0x7e0}} <nil>
Call: []interface {}{[]interface {}{0x3}} <nil>
Eval: []interface {}{0x9} <nil>

See more info here and APIs here. Next you can use cartridge to manage the cluster.
Each row/tuple in Tarantool is stored as MsgPack, and when displayed in the console it uses YAML format. TREE is the default index type; the memtx engine supports a few more: HASH, RTREE, and BITSET. TREE or HASH indexes may only cover certain types: integer, unsigned, number, double, varbinary, boolean, uuid, string, or scalar (null, bool, integer, double, decimal, string, varbinary); BITSET may only index string and unsigned; RTREE may only index the array type. If a TREE index has multiple parts/columns, we can do a partial search starting from the first column of the index. When using a string index, we may specify the collation (ordering): unicode or unicode_ci (case-insensitive, which also ignores accented/diacritic alphabets). Sequences can be created through box.schema.sequence.create() with options (start, min, max, cycle, step, if_not_exists); we can call the next method to get the next value. Tarantool persists data in two modes: WAL and snapshot (a snapshot can be forced using box.snapshot()). List of available operators for update/upsert:
  • + to add, - to subtract
  • & bitwise and, | bitwise or,  ^ bitwise xor
  • : string splice
  • ! insertion of a new field
  • # deletion
  • = assignment
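
As an illustration only (not from the original post), these operators can also be combined in a single go-tarantool Update call; note that field numbers are 0-based here, as in the earlier Go example where field 2 was the year:

// Sketch: combining update operators on the "tester" space from Go.
// Assumes the imports and *tarantool.Connection from the earlier example;
// fields are 0-based (2 = year, 3 = the nullable rate column added above).
func updateOps(conn *tarantool.Connection) {
    resp, err := conn.Update("tester", "primary", []interface{}{1}, []interface{}{
        []interface{}{"=", 2, 1987}, // = assignment
        []interface{}{"-", 3, 1},    // - subtraction
    })
    fmt.Printf("Update ops: %#v %v\n", resp.Data, err)
}
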
This snippet shows how to create a space and update its format from Go, then check it:

r, err := conn.Call(`box.schema.space.create`, []interface{}{TableName})
if err != nil {
  if err.Error() != `Space '`+TableName+`' already exists (0xa)` {
    log.Println(err)
    return
  }
}
fmt.Println(r.Tuples())
fmt.Println(`---------------------------------------------------`)
r, err = conn.Call(`box.space.`+TableName+`:format`, []interface{}{
  []map[string]interface{}{
    {`name`: `id`, `type`: `unsigned`},
    {`name`: `name`, `type`: `string`},
  },
})
if err != nil {
  log.Println(err)
  return
}
fmt.Println(r.Tuples())
r, err = conn.Call(`box.space.`+TableName+`:format`, []interface{}{})
if err != nil {
  log.Println(err)
  return
}
fmt.Println(r.Tuples())


func ExecSql(conn *tarantool.Connection, query string, parameters ...M.SX) map[interface{}]interface{} {
  params := A.X{query}
  for _, v := range parameters {
    params = append(params, v)
  }
  L.Describe(params)
  res, err := conn.Call(`box.execute`, params)
  if L.IsError(err) {
    L.Describe(`ERROR ExecSql !!! ` + err.Error())
    L.DescribeSql(query, parameters)
    L.Describe(err.Error())
    return map[interface{}]interface{}{`error`: err.Error()}
  }
  tup := res.Tuples()
  if len(tup) > 0 {
    if len(tup[0]) > 0 {
      if tup[0][0] != nil {
        kv, ok := tup[0][0].(map[interface{}]interface{})
        // row_count for UPDATE
        // metadata, rows for SELECT
        if ok {
          return kv
        }
      }
    }
  }
  // possible error
  if len(tup) > 1 {
    if len(tup[1]) > 0 {
      if tup[1][0] != nil {
        L.Describe(`ERROR ExecSql syntax: ` + X.ToS(tup[1][0]))
        L.Describe(query, parameters)
        L.Describe(tup[1][0])
        return map[interface{}]interface{}{`error`: tup[1][0]}
      }
    }
  }
  return map[interface{}]interface{}{}
}

func QuerySql(conn *tarantool.Connection, query string, callback func(row A.X), parameters ...M.SX) []interface{} {
  kv := ExecSql(conn, query, parameters...)
  rows, ok := kv[`rows`].([]interface{})
  if ok {
    for _, v := range rows {
      callback(v.([]interface{}))
    }
    return rows
  }
  return nil
}
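
A hypothetical usage sketch of the two helpers above (the SQL and table names here are made up; conn and A.X come from the snippets above); note the double-quoted identifiers, per the gotchas listed below:

// Hypothetical usage of ExecSql/QuerySql, assuming a "tester" table reachable from SQL.
res := ExecSql(conn, `UPDATE "tester" SET "year" = 1990 WHERE "id" = 1`)
fmt.Println(res[`row_count`]) // row_count is returned for UPDATE

QuerySql(conn, `SELECT "id", "band_name" FROM "tester"`, func(row A.X) {
    fmt.Println(row[0], row[1]) // called once per result row
})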


Tarantool has some limitations:
  • no datetime type (use unsigned if greater than 1970-01-01)
  • can only append columns at the end, cannot delete a column
  • cannot alter a column's datatype, except when there's no data yet
  • altering the table when data already exists requires the new field to have the is_nullable=true flag 
Tarantool SQL limitations and gotchas:
  • table names or column names must be quoted or they will be automatically capitalized (and then error column/space=table not found)
  • concat just like postgresql (|| operator), but you must convert both operands to string first 
  • does not support right join (not needed anyway), can do natural join or left join, also support foreign key
  • no date/time data type/functions (tarantool problem)
  • NULLIF (Tarantool), IFNULL (MySQL), and COALESCE (standard SQL) all work normally
  • cannot run alter table add column (must use APIs)
  • no information_schema
  • SQL is executed by calling box.execute (supported since version 2.x); the result is in the first row (row_count for UPDATE; metadata and rows for SELECT), and any error is in the second row.
I think that's it for now; to learn Lua (needed for stored procedures) you can use the official documentation. For an example project using Tarantool and Meilisearch, you can clone the kmt1 dummy project. 


2020-10-21

Machine Learning Toolkits

There are a bunch of machine learning toolkits for supervised (labelled), unsupervised (clustering), semi-supervised, or reinforcement learning (reward and punishment). Most of them are written in (or have bindings for) Python:
And Golang's alternatives: 
How about IDEs?
Things that we must take note of when doing data cleansing, because garbage in means garbage out:
  • Format consistency (eg. YYYY-MM-DD should not be mixed with DD/MM/YYYY or other formats)
  • Data scale (for example if a variable may only have values 0 to 100, there should be no negative numbers or values larger than 100)
  • Duplicated records (which may skew the learned weights)
  • Missing values (nulls should be normalized/imputed, or the column removed)
  • Skewness (imbalanced distribution, for example only 10 samples of class1 but 990 samples of class2); we could downsample or upweight to solve this problem.
The next step for ML is data preparation: we must convert the data types because some ML algorithms only support numerical values, eg. SVM or Linear Regression. One way to convert categorical values to numerical is One Hot Encoding, eg. taste=[sweet,salty,bitter] becomes 3 new columns: is_sweet=0|1, is_salty=0|1, is_bitter=0|1. Some other steps for data preparation:
  • removing outliers (values that are too different from the rest of the group)
  • normalization (changing the scale of values using the formula (val-min)/(max-min)*scale)
    or use MinMaxScaler from sklearn:
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]
    data = scaler.transform(data)
  • standardization, using the z-score formula (val-mean)/stddev
    or use StandardScaler from sklearn:
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]
    data = scaler.transform(data)
There are many kinds of storage tools that we could use to store data for ML: RDBMS, NoSQL (graph, key-value, columnar, time series, document-oriented databases). Some popular alternatives are: Firebase Realtime Database, Google Cloud Datastore, Amazon RDS, Spark ETL, Google BigQuery, etc.

We could reuse popular datasets and test the cross validation score, for example in sklearn:

from sklearn import datasets
idb = datasets.load_iris() # the infamous iris flower dataset
x = idb.data
y = idb.target

# split 20% for test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2) 

# use decision tree
from sklearn import tree
cla = tree.DecisionTreeClassifier()

# calculate cross validation score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(cla,x,y,cv=5)

To do supervised learning on the iris dataset using a decision tree (here idb is assumed to be the iris dataset loaded as a pandas DataFrame, e.g. read with read_csv from a CSV that has an Id column):

idb.head()

# remove id column
idb.drop('Id',axis=1,inplace=True) 

# take attributes
fn = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm' ]
x = idb[fn]

# take label
y = idb['Species']

# do training
from sklearn.tree import DecisionTreeClassifier
cla = DecisionTreeClassifier()
cla.fit(x,y)

# do prediction
cla.predict([[6,3,5,2]])

# use graphviz to visualize
from sklearn.tree import export_graphviz
cn = ['Iris-setosa','Iris-versicolor','Iris-virginica']
export_graphviz(cla,out_file="/tmp/iris.dot",feature_names=fn,class_names=cn,rounded=True,filled=True)

Then we could open the dot file or use an online tool to convert it to another format. The file above would show the decision tree rules.

There are many kinds of regression (predicting a continuous number); some of them are linear regression and logistic regression. This is an example of supervised learning using linear regression with numpy and sklearn:

import numpy as np
friends = np.array([1,1,2,2,3,3,4,4,5,5])
net_worth = np.array([123,125,234,250,345,360,456,470,567,590])

# plot as scatter chart
import matplotlib.pyplot as plot
# to show the chart inline, not in a new window
%matplotlib inline
plot.scatter(friends,net_worth)

# do training
from sklearn.linear_model import LinearRegression 
friends = friends.reshape(-1,1)
lr = LinearRegression()
lr.fit(friends,net_worth)
plot.plot(friends, lr.predict(friends))

This is another example, loading a CSV and doing one hot encoding:

import pandas as panda
dataframe = panda.read_csv('table1.csv')
dataframe.head()
dataframe.info()
dataframe.rename(columns={'YearLived':'Age'}, inplace=True)
dataframe['Gender'].replace(['F','M'],[0,1], inplace=True)
data = dataframe.drop(columns=['unneeded column'])

# do one hot encoding
data = panda.get_dummies(data)

# split attributes and labels
attrs = ['Age','Gender']
x = data[attrs]
y = data['BuyerType']

# split training set and test set
from sklearn.model_selection import train_test_split
xtr, xte, ytr, yte = train_test_split(x,y, test_size=0.2, random_state=1)

from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(xtr, ytr)
m.score(xte, yte)

How to do clustering using K-Means:

from sklearn.cluster import KMeans
clusters = []
for z in range(1,11):
   km = KMeans(n_clusters=z).fit(x)
   clusters.append(km.inertia_)

# plot based on inertia
import seaborn as sea
fig, ax = plot.subplots(figsize=(8,4))
sea.lineplot(x=list(range(1,11)), y=clusters, ax=ax)
ax.set_title('Look for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')

# do kmeans
km4 = KMeans(n_clusters=4).fit(x)
x['Labels'] = km4.labels_
plot.figure(figsize=(8,4))
sea.scatterplot(x['income'],x['spending'],hue=x['Labels'],palette=sea.color_palette('hls',4))
plot.title('KMeans with 4 clusters')
plot.show()

If our ML model has too many attributes, we could use PCA (Principal Component Analysis) to calculate the variance and reduce the cost/duration of ML training, or LDA (Linear Discriminant Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to reduce the dimensionality. This example shows how to train with and without PCA:

from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca_attr = pca.fit_transform(xtr)
pca.explained_variance_ratio_

# look at the array and find total variance which > 0.95, eg. 2
pca = PCA(n_components=2)
xtr = pca.fit_transform(xtr)
xte = pca.transform(xte) # transform only; PCA was already fit on the training set

# train again and test
m = cla.fit(xtr, ytr)
m.score(xte, yte)

SVM (Support Vector Machine) is an algorithm that calculates the difference between vectors to create a margin that separates clusters; sometimes it adds another dimension so that it can separate the data correctly. There are some popular kernel functions that can be used to add more dimensions: linear, polynomial, RBF, and sigmoid. This example shows how to do SVM classification using sklearn:

from sklearn.svm import SVC
cla = SVC()
cla.fit(xtr,ytr)
cla.score(xte,yte)

SVM can also be used for SVR (regression, non-linear), for example:

from sklearn.svm import SVR
m = SVR(C=1000,gamma=0.05,kernel='rbf')
m.fit(x,y)
plot.scatter(x,y)
plot.plot(x, m.predict(x))

When we train using a certain ML algorithm, we also need to set the parameters to get an optimal result. We could do a grid search, which tries combinations of parameters to find the best ones for that model, for example:

from sklearn.model_selection import GridSearchCV
model = SVR()
params = { 'kernel': ['rbf'],
  'C': [100,1000,10000],
  'gamma': [0.5,0.05,0.005],
}
gs = GridSearchCV(model,params)
gs.fit(x,y)
gs.best_params_

Artificial Neural Network is one of the techniques that imitates how the brain works, where every neuron/perceptron (brain cell) is activated (making a path when learning) by a certain function (eg. sigmoid, hyperbolic tangent, or rectified linear unit/ReLU). One of the techniques used in ANNs is backprop, which updates/adjusts the neuron weights based on a loss function (the difference between the network's output and the correct answer). CNN (Convolutional Neural Network) combines convolution layers/feature maps with max pooling (reducing resolution) to create hidden layers. Usually we use TensorFlow and Keras to implement a CNN. This code shows an example of using TensorFlow to detect whether a 150x150 image is a certain object or not:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# train from train_dir and validation_dir
datagen = ImageDataGenerator(
                        rescale=1./255,
                        rotation_range=20,
                        horizontal_flip=True,
                        shear_range = 0.2,
                        fill_mode = 'nearest')
traingen = datagen.flow_from_directory(
            train_dir,  
            target_size=(150, 150),
            batch_size=4,
            class_mode='binary')
valgen = datagen.flow_from_directory(
            validation_dir, 
            target_size=(150, 150), 
            batch_size=4,
            class_mode='binary')
m = tf.keras.models.Sequential([
   tf.keras.layers.Conv2D(128, (3,3), activation='relu', input_shape=(150, 150, 3)),
   tf.keras.layers.MaxPooling2D(2,2),
   tf.keras.layers.Flatten(),
   tf.keras.layers.Dense(512, activation='relu'),
   tf.keras.layers.Dense(1, activation='sigmoid')
])
m.compile(loss='binary_crossentropy',
          optimizer='Adam', # or tf.optimizers.Adam()
          metrics=['accuracy'])
m.fit(traingen,steps_per_epoch=25,epochs=20,validation_data=valgen,validation_steps=5,verbose=2)

# predict
import numpy as np
from keras.preprocessing import image
import matplotlib.image as mpimg
%matplotlib inline
img = image.load_img(path, target_size=(150,150))
imgplot = plot.imshow(img)
x = image.img_to_array(img)
x = np.expand_dims(x,axis=0)
images = np.vstack([x])
classes = m.predict(images, batch_size=10)

For a live demo in Indonesian, you can visit this YouTube video. For automatic training we can use IBM Watson's AutoAI. If you need more training in Indonesian, try DiCoding, since most of this article is taken from there (this is my personal note from following the course there).

2020-10-16

Cleanup git and docker Disk Usage

Sometimes our cloned repository becomes freaking large; for example Golang's google.golang.org/api currently consumes about 1.1GB on my machine. We can compact it with git's garbage collection:

git gc 
du -hs .
# 664 MB in about 20 seconds

Or if you have time you can use aggressive GC, like this:

git gc --aggressive
du -hs .
# 217 MB in about 5 minutes  

Or if you do not need any old history, you can clone then replace, like this:

git clone --mirror --depth=5  file://$PWD ../temp
rm -rf .git/objects
mv ../temp/{shallow,objects} .git
rm -rf ../temp
# 150 MB in about 2 seconds

Next, you can reclaim space from Docker using these commands:

sudo docker system prune -a -f
docker system df

For more disk usage analysis you can use baobab on Linux or WinDirStat on Windows.