There are a bunch of machine learning toolkits for supervised (labelled), unsupervised (clustering), semi-supervised, or reinforcement learning (reward and punishment). Most of them are written in (or have bindings for) Python:
And Golang alternatives:
How about IDEs?
Things that we must take note of when doing data cleansing, because garbage in, garbage out (a short pandas sketch follows the list below):
- Format consistency (eg. YYYY-MM-DD should not be mixed with DD/MM/YYYY or other formats)
- Data scale (for example, if a variable may only have values from 0 to 100, there should be no negative numbers or values larger than 100)
- Duplicated records (which may make the ML model weight those samples more heavily)
- Missing values (nulls should be imputed or normalized, or the affected columns removed)
- Skewness (imbalanced distribution, for example there are only 10 samples of class1 but 990 samples of class2); we could downsample the majority class or upweight the minority class to solve this problem.
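For example, a minimal cleansing sketch with pandas, assuming a hypothetical raw.csv with Date, Score, and Class columns:
import pandas as panda
df = panda.read_csv('raw.csv') # hypothetical file with Date, Score, and Class columns
# format consistency: parse the date strings into one format (unparseable ones become NaT)
df['Date'] = panda.to_datetime(df['Date'], errors='coerce')
# data scale: keep only rows where Score is inside the valid 0..100 range
df = df[(df['Score'] >= 0) & (df['Score'] <= 100)]
# duplicated records
df = df.drop_duplicates()
# missing values: drop rows that still contain nulls
df = df.dropna()
# skewness: downsample every class to the size of the smallest class
smallest = df['Class'].value_counts().min()
df = df.groupby('Class', group_keys=False).apply(lambda g: g.sample(smallest))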
The next step for ML is data preparation; we must convert the data types because some ML algorithms, eg. SVM or Linear Regression, can only work with numerical values. One way to convert categorical to numerical values is One Hot Encoding, eg. taste=[sweet,salty,bitter] becomes 3 new columns: is_sweet=0|1, is_salty=0|1, is_bitter=0|1 (see the get_dummies sketch after the scaler examples below). Some other steps for data preparation:
- removing outliers (values that differ too much from the rest of the group)
- normalization (changing the scale of values using this formula: (val-min)/(max-min)*scale)
or use MinMaxScaler from sklearn:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]
data = scaler.transform(data)
- standardization, using the z-score formula (val-mean)/stddev
or use StandardScaler from sklearn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]
data = scaler.transform(data)
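And for the One Hot Encoding example above (taste=[sweet,salty,bitter]), a minimal sketch using pandas get_dummies; the sample values are made up for illustration:
import pandas as panda
df = panda.DataFrame({'taste': ['sweet', 'salty', 'bitter', 'sweet']})
# one indicator column per category: is_sweet, is_salty, is_bitter
df = panda.get_dummies(df, columns=['taste'], prefix='is')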
There are many kinds of storage tools that we could use to store data for ML: RDBMS, NoSQL (graph, key-value, columnar, time series, document-oriented databases). Some popular alternatives are: Firebase Realtime Database, Google Cloud Datastore, Amazon RDS, Spark ETL, Google BigQuery, etc.
from sklearn import datasets
idb = datasets.load_iris() # the famous iris flower dataset
x = idb.data
y = idb.target
# split 20% for test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
# use decision tree
from sklearn import tree
cla = tree.DecisionTreeClassifier()
# calculate cross validation score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(cla,x,y,cv=5)
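To summarize the 5-fold result above, we could average the scores:
print(scores.mean()) # mean accuracy across the 5 folds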
To do supervised learning for the iris dataset using a decision tree:
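In the snippet below, idb is assumed to be a pandas DataFrame loaded from a CSV version of the iris dataset (with Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, and Species columns), for example:
import pandas as panda
idb = panda.read_csv('Iris.csv') # hypothetical path, any iris CSV with those columns works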
idb.head()
# remove id column
idb.drop('Id',axis=1,inplace=True)
# take attributes
fn = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm' ]
x = idb[fn]
# take label
y = idb['Species']
# do training
from sklearn.tree import DecisionTreeClassifier
cla = DecisionTreeClassifier()
cla.fit(x,y)
# do prediction
cla.predict([[6,3,5,2]])
# use graphviz to visualize
from sklearn.tree import export_graphviz
cn = ['Iris-setosa','Iris-versicolor','Iris-virginica']
export_graphviz(cla,out_file="/tmp/iris.dot",feature_names=fn,class_names=cn,rounded=True,filled=True)
Then we could open the dot file or use an online tool to convert it to another format. The file above would show the decision tree rules.
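Alternatively, assuming Graphviz is installed locally, its dot CLI could render it directly, eg. dot -Tpng /tmp/iris.dot -o /tmp/iris.png.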
There are many kinds of regression, some of them are: linear regression (predicting a continuous number) and logistic regression (which, despite the name, is usually used for classification). This is an example of supervised learning using linear regression with numpy and sklearn:
import numpy as np
friends = np.array([1,1,2,2,3,3,4,4,5,5])
net_worth = np.array([123,125,234,250,345,360,456,470,567,590])
# plot as scatter chart
import matplotlib.pyplot as plot
# show the chart inline instead of in a new window (Jupyter/IPython only)
%matplotlib inline
plot.scatter(friends,net_worth)
# do training
from sklearn.linear_model import LinearRegression
friends = friends.reshape(-1,1)
lr = LinearRegression()
lr.fit(friends,net_worth)
plot.plot(friends, lr.predict(friends))
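After fitting, we could read the fitted line and predict for a new value, continuing the snippet above:
print(lr.coef_, lr.intercept_) # slope and intercept of the fitted line
print(lr.predict([[6]])) # predicted net worth for someone with 6 friends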
This is another example that loads a CSV and does one-hot encoding:
import pandas as panda
dataframe = panda.read_csv('table1.csv')
dataframe.head()
dataframe.info()
dataframe.rename(columns={'YearLived':'Age'}, inplace=True)
dataframe['Gender'].replace(['F','M'],[0,1], inplace=True)
data = dataframe.drop(columns=['unneeded column'])
# do one hot encoding
data = panda.get_dummies(data)
# split attributes and labels
attrs = ['Age','Gender']
x = data[attrs]
y = data['BuyerType']
# split training set and test set
from sklearn.model_selection import train_test_split
xtr, xte, ytr, yte = train_test_split(x,y, test_size=0.2, random_state=1)
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(xtr, ytr)
m.score(xte, yte)
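Besides the accuracy from score(), we could also inspect the predictions with a confusion matrix, continuing the snippet above:
from sklearn.metrics import confusion_matrix
pred = m.predict(xte)
print(confusion_matrix(yte, pred)) # rows = actual classes, columns = predicted classes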
How to do clustering using K-Means:
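In the snippet below, x is assumed to be a DataFrame of customer data with income and spending columns (a different dataset from the one above), for example:
x = panda.read_csv('customers.csv')[['income','spending']] # hypothetical file name and columns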
from sklearn.cluster import KMeans
clusters = []
for z in range(1,11):
    km = KMeans(n_clusters=z).fit(x)
    clusters.append(km.inertia_)
# plot based on inertia
import seaborn as sea
fig, ax = plot.subplots(figsize=(8,4))
sea.lineplot(x=list(range(1,11)), y=clusters, ax=ax)
ax.set_title('Look for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
# do kmeans
km4 = KMeans(n_clusters=4).fit(x)
x['Labels'] = km4.labels_
plot.figure(figsize=(8,4))
sea.scatterplot(x=x['income'], y=x['spending'], hue=x['Labels'], palette=sea.color_palette('hls',4))
plot.title('KMeans with 4 clusters')
plot.show()
If our ML model has too many attributes, we could use PCA (Principal Component Analysis), which looks at the explained variance, to reduce the cost and duration of ML training, or LDA (Linear Discriminant Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to reduce the dimensionality. This example shows how to train with and without PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca_attr = pca.fit_transform(xtr)
pca.explained_variance_ratio_
# look at the ratios and find how many components are needed to reach total variance > 0.95, eg. 2
pca = PCA(n_components=2)
xtr = pca.fit_transform(xtr)
xte = pca.transform(xte) # transform only, so the test set reuses the projection fitted on the training set
# train again and test
m = cla.fit(xtr, ytr)
m.score(xte, yte)
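For the LDA and t-SNE alternatives mentioned above, a minimal sketch could look like this, reusing x_train and y_train from the iris split earlier (the parameter values are only illustrative):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
# LDA is supervised (it uses the labels) and allows at most (number of classes - 1) components
lda = LinearDiscriminantAnalysis(n_components=2)
x_train_lda = lda.fit_transform(x_train, y_train)
# t-SNE is mostly used for visualization, embedding the data into 2 dimensions
x_train_tsne = TSNE(n_components=2).fit_transform(x_train)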
SVM (Support Vector Machine) is an algorithm that finds the boundary with the widest margin separating the classes, based on the distances between vectors; sometimes it adds another dimension so that it could separate the data correctly and get an optimal result. There are some popular kernel functions that could be used to add more dimensions: linear, polynomial, RBF, and sigmoid. This example shows how to do SVM classification using sklearn:
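A minimal sketch, reusing the iris split (x_train, x_test, y_train, y_test) from earlier; the RBF kernel here is just one of the options listed above:
from sklearn.svm import SVC
svm = SVC(kernel='rbf') # other kernels: 'linear', 'poly', 'sigmoid'
svm.fit(x_train, y_train)
print(svm.score(x_test, y_test)) # accuracy on the held-out test set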