- NumPy - array processing
- Pandas - data cleansing/sanitizing
- Matplotlib - data visualization
- Scikit Learn - ML libraries
- TensorFlow - ML framework
- PyTorch - ML libraries
- Keras - Deep Learning library

And the Golang alternatives:

- Gonum - numerical and scientific algorithms
- Sanitize - data sanitizing
- Go-ECharts - data visualization
- GoLearn - scikit partially ported
- SKLearn - scikit-learn ported to golang
- TensorFlow-go - TensorFlow golang binding
- go-torch - LibTorch binding for golang

How about IDEs?

- PyCharm - community version available
- Jupyter Notebook - it's recommended to install Anaconda first (~500MB)
- IBM Watson Studio - web based IDE
- Google Colab - web based IDE

Things to watch for when doing data cleansing, because garbage in means garbage out:

- Format consistency (e.g. YYYY-MM-DD should not be mixed with DD/MM/YYYY or other formats)
- Data scale (for example, if a variable may only have values from 0 to 100, there should be no negative numbers or values larger than 100)
- Duplicated records (which may give some samples extra weight during ML training)
- Missing values (nulls should be filled in, or the affected columns removed)
- Skewness (imbalanced distribution, for example only 10 samples of class1 but 990 samples of class2); we could downsample or upweight to solve this problem.
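
The checks in the list above can be sketched with pandas. This is only a minimal illustration: the sample data, the column names (`date`, `score`, `label`), and the allowed 0 to 100 range are all made up for the example.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only
df = pd.DataFrame({
    "date": ["2021-01-05", "05/01/2021", "2021-02-10", "2021-02-10", "2021-03-01"],
    "score": [50, 120, -3, -3, None],
    "label": ["class2", "class2", "class1", "class1", "class2"],
})

# Format consistency: flag dates that do not match YYYY-MM-DD
bad_format = ~df["date"].str.match(r"^\d{4}-\d{2}-\d{2}$")

# Data scale: flag scores outside the allowed 0..100 range (nulls are handled separately)
out_of_range = df["score"].notna() & ~df["score"].between(0, 100)

# Duplicated records: rows that repeat an earlier row exactly
duplicated = df.duplicated()

# Missing values: count nulls per column
missing_per_column = df.isna().sum()

# Skewness: number of samples per class
class_counts = df["label"].value_counts()

print(bad_format.sum(), out_of_range.sum(), duplicated.sum())
print(missing_per_column)
print(class_counts)
```

Each check only reports the problem; what to do with the flagged rows (fix, drop, or fill in) still depends on the dataset.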

The next step for ML is data preparation. We must convert the data types because some ML algorithms only support numerical values, e.g. SVM or Linear Regression. One way to convert categorical values to numerical ones is One Hot Encoding, e.g. taste=[sweet,salty,bitter] becomes 3 new columns: is_sweet=0|1, is_salty=0|1, is_bitter=0|1. Some other steps for data preparation:

- removing outliers (values that differ too much from the rest of the group)
- normalization (changing the scale of values using the formula (val-min)/(max-min)*scale), or use MinMaxScaler from sklearn:

**from sklearn.preprocessing import MinMaxScaler**

**scaler = MinMaxScaler()**

**scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]**

**data = scaler.transform(data)**

- standardization (using the z-score formula (val-mean)/stddev), or use StandardScaler from sklearn:

**from sklearn.preprocessing import StandardScaler**

**scaler = StandardScaler()**

**scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]**

**data = scaler.transform(data)**
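
To see what the two formulas do without sklearn, here is a minimal numpy sketch. The function names `min_max_scale` and `standardize` and the sample values are made up for illustration; each column is scaled independently, which is also how the sklearn scalers behave.

```python
import numpy as np

def min_max_scale(values, scale=1.0):
    """Rescale each column to the range 0..scale using (val-min)/(max-min)*scale."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    return (values - col_min) / (col_max - col_min) * scale

def standardize(values):
    """Rescale each column to mean 0 and standard deviation 1 using (val-mean)/stddev."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0)

# Made-up 2 dimensional sample: [ [ x, y ], ... ]
data = [[1, 100], [2, 200], [3, 400]]
normalized = min_max_scale(data)
standardized = standardize(data)
print(normalized)
print(standardized)
```

Note that this sketch divides by zero when a column holds a single constant value, a case a real pipeline would need to handle.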

There are many kinds of storage tools that we could use to store data for ML: RDBMS and NoSQL (graph, key-value, columnar, time series, document-oriented databases). Some popular options are: Firebase Realtime Database, Google Cloud Datastore, Amazon RDS, Spark ETL, Google BigQuery, etc.

We could reuse popular datasets and test the cross validation score, for example in sklearn:

**from sklearn import datasets**

**idb = datasets.load_iris() # the well-known iris flower dataset**

**x = idb.data**

**y = idb.target**

**# split 20% for test**

**from sklearn.model_selection import train_test_split**

**x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)**

**# use decision tree**

**from sklearn import tree**

**cla = tree.DecisionTreeClassifier()**

**# calculate cross validation score**

**from sklearn.model_selection import cross_val_score**

**scores = cross_val_score(cla,x,y,cv=5)**
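
The `scores` array holds one accuracy value per fold, so it is usually summarized before comparing models. A minimal self-contained sketch of that step, where `random_state=0` is an assumption added only to make the run repeatable:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
# random_state is an assumption added here so repeated runs build the same tree
classifier = DecisionTreeClassifier(random_state=0)

# cv=5 trains and evaluates the model 5 times, each time on a different fold
scores = cross_val_score(classifier, iris.data, iris.target, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
print(scores.std())   # how much the folds disagree with each other
```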

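To do supervised learning on the iris dataset using a decision tree (the code below treats idb as a pandas DataFrame with Id and Species columns, such as one loaded from a CSV copy of the dataset, rather than the sklearn object above):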

**idb.head()**

**# remove id column**

**idb.drop('Id',axis=1,inplace=True)**

**# take attributes**

**fn =**

**['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm' ]**

**x = idb[fn]**

**# take label**

**y = idb['Species']**

**# do training**

**from sklearn.tree import DecisionTreeClassifier**

**cla = DecisionTreeClassifier()**

**cla.fit(x,y)**

**# do prediction**

**cla.predict([[6,3,5,2]])**

**# use graphviz to visualize**

**from sklearn.tree import export_graphviz**

**cn = ['Iris-setosa','Iris-versicolor','Iris-virginica']**

**export_graphviz(cla,out_file="/tmp/iris.dot",feature_names=fn,class_names=cn,rounded=True,filled=True)**

Then we could open the dot file or use an online tool to convert it to another format. The file shows the decision tree rules.
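
If Graphviz is not at hand, sklearn can also print the rules as plain text with `export_text`. A minimal sketch using the iris data bundled with sklearn; `max_depth=2` and `random_state=0` are assumptions chosen only to keep the output short and repeatable:

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
# max_depth=2 and random_state=0 are assumptions to keep the output short and repeatable
classifier = DecisionTreeClassifier(max_depth=2, random_state=0)
classifier.fit(iris.data, iris.target)

# export_text returns the learned rules as an indented string
rules = export_text(classifier, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one threshold test on an attribute, and the leaves show the predicted class.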

There are many kinds of regression (predicting a continuous number), for example linear regression. Logistic regression, despite its name, is normally used for classification. This is an example of supervised learning using linear regression on numpy arrays:

**import numpy as np**

**friends = np.array([1,1,2,2,3,3,4,4,5,5])**

**net_worth = np.array([123,125,234,250,345,360,456,470,567,590])**

**# plot as scatter chart**

**import matplotlib.pyplot as plot**

**# show charts inline, not in a new window**

**%matplotlib inline**

**plot.scatter(friends,net_worth)**

**# do training**

**from sklearn.linear_model import LinearRegression**

**friends = friends.reshape(-1,1)**

**lr = LinearRegression()**

**lr.fit(friends,net_worth)**

**plot.plot(friends, lr.predict(friends))**
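
The same straight line can be fitted with numpy alone, which makes the slope and intercept behind the plot visible. A minimal sketch on the same data; the person with 6 friends is a hypothetical input:

```python
import numpy as np

friends = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
net_worth = np.array([123, 125, 234, 250, 345, 360, 456, 470, 567, 590])

# Fit a degree 1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(friends, net_worth, 1)

# Predict the net worth for a hypothetical person with 6 friends
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

For this data the fitted line works out to roughly net_worth = 113 * friends + 13.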

This is another example that loads a CSV file and does one hot encoding:

**import pandas as panda**

**dataframe = panda.read_csv('table1.csv')**

dataframe.head()

**dataframe.info()**

**dataframe = dataframe.rename(columns={'YearLived':'Age'})**

**dataframe['Gender'] = dataframe['Gender'].replace(['F','M'],[0,1])**

**data = dataframe.drop(columns=['unneeded column'])**

# do one hot encoding

data = panda.get_dummies(data)

**# split attributes and labels**

**attrs = ['Age','Gender']**

**x = data[attrs]**

**y = data['BuyerType']**

**# split training set and test set**

**from sklearn.model_selection import train_test_split**

**xtr, xte, ytr, yte = train_test_split(x,y, test_size=0.2, random_state=1)**

**from sklearn.linear_model import LogisticRegression**

**m = LogisticRegression()**

**m.fit(xtr, ytr)**

**m.score(xte, yte)**
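
To look at the one hot encoding step in isolation, here is a minimal sketch on the taste example from earlier. The sample rows are made up; `prefix` and `dtype` are optional `get_dummies` parameters used here to get the is_sweet style names and 0|1 values.

```python
import pandas as pd

# Made-up sample with one categorical column
df = pd.DataFrame({"taste": ["sweet", "salty", "bitter", "sweet"]})

# prefix="is" names the new columns is_bitter, is_salty, is_sweet;
# dtype=int gives 0|1 instead of True|False
encoded = pd.get_dummies(df, columns=["taste"], prefix="is", dtype=int)
print(encoded)
```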

How to do clustering using K-Means:

**from sklearn.cluster import KMeans**

**clusters = []**

**for z in range(1,11):**

**km = KMeans(n_clusters=z).fit(x)**

**clusters.append(km.inertia_)**

**# plot based on inertia**

**import seaborn as sea**

**fig, ax = plot.subplots(figsize=(8,4))**

**sea.lineplot(x=list(range(1,11)), y=clusters, ax=ax)**

**ax.set_title('Look for Elbow')**

**ax.set_xlabel('Clusters')**

**ax.set_ylabel('Inertia')**

**# do kmeans**

**km4 = KMeans(n_clusters=4).fit(x)**

**x['Labels'] = km4.labels_**

**plot.figure(figsize=(8,4))**

**sea.scatterplot(x=x['income'],y=x['spending'],hue=x['Labels'],palette=sea.color_palette('hls',4))**

**plot.title('KMeans with 4 clusters')**

**plot.show()**
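
The elbow search above can be tried without any dataset at hand. This is a minimal sketch on synthetic points from `make_blobs`; the number of samples, the 4 centers, and the random seeds are all illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data drawn around 4 centers; all values here are illustrative assumptions
points, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=42)

inertias = []
for k in range(1, 11):
    # n_init and random_state are set explicitly so the result is repeatable
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias.append(km.inertia_)

# Inertia shrinks as k grows; the "elbow" is where the drop flattens out
for k, inertia in zip(range(1, 11), inertias):
    print(k, round(inertia, 1))
```

Inertia is the sum of squared distances from each point to its cluster center, so it keeps shrinking as clusters are added; the useful number of clusters is where adding more stops helping much.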

If our data has too many attributes, we could reduce the dimension to cut the cost and duration of ML training, using PCA (Principal Component Analysis), which keeps the components that explain most of the variance, LDA (Linear Discriminant Analysis), or t-SNE (t-Distributed Stochastic Neighbor Embedding). This example shows how to train with and without PCA:

**from sklearn.decomposition import PCA**

**pca = PCA(n_components=4)**

**pca_attr =pca.fit_transform(xtr)**

**pca.explained_variance_ratio_**

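**# look at the array and find how many components are needed for the total variance to exceed 0.95, eg. 2**

**pca = PCA(n_components=2)**

**xtr = pca.fit_transform(xtr)**

**# reuse the projection fitted on the training set, do not fit again on the test set**

**xte = pca.transform(xte)**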

**# train again and test**

**m = cla.fit(xtr, ytr)**

**m.score(xte, yte)**
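
Instead of reading the variance array by eye, the running total can be computed, or sklearn can pick the number of components itself when `n_components` is a float between 0 and 1. A minimal sketch on the iris data; the split settings are assumptions for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
xtr, xte, ytr, yte = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)

# Fit with all components first and look at the running total of variance
pca_all = PCA().fit(xtr)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
print(cumulative)

# A float n_components asks sklearn to keep enough components to pass that share of variance
pca = PCA(n_components=0.95)
xtr_reduced = pca.fit_transform(xtr)  # fit on the training set only
xte_reduced = pca.transform(xte)      # reuse the same projection for the test set
print(pca.n_components_, xtr_reduced.shape, xte_reduced.shape)
```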

SVM (Support Vector Machine) is an algorithm that looks for the boundary with the widest margin separating the classes, measured from the data points closest to it (the support vectors). When the data cannot be separated as is, it maps the data into a higher dimension so that it can be separated correctly. Popular kernel functions are: linear, polynomial, RBF, and sigmoid; the non-linear ones are what add the extra dimensions. This example shows how to do SVM classification using sklearn:

**from sklearn.svm import SVC**

cla = SVC()

**cla.fit(xtr,ytr)**

**cla.score(xte,yte)**
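
To get a feel for the kernels, each one can be tried on the same split. A minimal sketch on the iris data with default parameters otherwise; the split settings are assumptions, and the scores will differ on other datasets.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
xtr, xte, ytr, yte = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)

# Train one classifier per kernel and record its accuracy on the test set
kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    cla = SVC(kernel=kernel)
    cla.fit(xtr, ytr)
    kernel_scores[kernel] = cla.score(xte, yte)

print(kernel_scores)
```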

SVM can also be used for regression (SVR, Support Vector Regression), including non-linear cases, for example:

**from sklearn.svm import SVR**

**m = SVR(C=1000,gamma=0.05,kernel='rbf')**

**m.fit(x,y)**

**plot.scatter(x,y)**

**plot.plot(x, m.predict(x))**

When we train using a certain ML algorithm, we also need to set its parameters to get an optimal result. We could do a grid search, which tries every combination of the given values to find the best parameters for that model, for example:

**from sklearn.model_selection import GridSearchCV**

**model = SVR()**

**params = { 'kernel': ['rbf'],**

**'C': [100,1000,10000],**

**'gamma': [0.5,0.05,0.005],**

**}**

**gs = GridSearchCV(model,params)**

**gs.fit(x,y)**

gs.best_params_
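
Besides `best_params_`, the fitted search object also exposes the best score and a model already refit with the winning parameters. A minimal self-contained sketch; the noisy sine data and the smaller `C` values are assumptions made up to keep the run fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic non-linear data, made up for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.1, 80)

params = {
    "kernel": ["rbf"],
    "C": [1, 10, 100],
    "gamma": [0.5, 0.05, 0.005],
}
# cv=5 is stated explicitly; every combination (3 x 3 = 9) is cross validated
gs = GridSearchCV(SVR(), params, cv=5)
gs.fit(x, y)

print(gs.best_params_)                      # the winning combination
print(gs.best_score_)                       # its mean cross validated R^2 score
print(gs.best_estimator_.predict([[2.5]]))  # model refit on all data with the best parameters
```

The cost grows with the product of the list lengths times the number of folds, so large grids get slow quickly.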

An Artificial Neural Network (ANN) is a technique that imitates how the brain works: every neuron/perceptron (brain cell) is activated (forming a path while learning) by a certain function (e.g. sigmoid, hyperbolic tangent, or rectified linear unit/ReLU). One of the techniques used in ANN is backprop, which adjusts the neuron weights based on the loss function (the difference between our NN's own calculation and the correct answer). A CNN (Convolutional Neural Network) combines convolution layers/feature maps with max pooling (reducing resolution) to create hidden layers. Usually we use TensorFlow and Keras to implement a CNN. This code shows an example of using TensorFlow to detect whether a 150x150 image shows a certain object or not:

**import tensorflow as tf**

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# train from train_dir and validation_dir

datagen = ImageDataGenerator(

rescale=1./255,

rotation_range=20,

horizontal_flip=True,

shear_range = 0.2,

fill_mode = 'nearest')

traingen = datagen.flow_from_directory(

train_dir,

target_size=(150, 150),

batch_size=4,

class_mode='binary')

valgen = datagen.flow_from_directory(

validation_dir,

target_size=(150, 150),

batch_size=4,

class_mode='binary')

**model = tf.keras.models.Sequential([**

**tf.keras.layers.Conv2D(128, (3,3), activation='relu', input_shape=(150, 150, 3)),**

**tf.keras.layers.MaxPooling2D(2,2),**

**tf.keras.layers.Flatten(),**

**tf.keras.layers.Dense(512, activation='relu'),**

**tf.keras.layers.Dense(1, activation='sigmoid')**

])

model.compile(loss='binary_crossentropy',

optimizer='Adam', # or tf.optimizers.Adam()

metrics=['accuracy'])

model.fit(traingen,steps_per_epoch=25,epochs=20,validation_data=valgen,validation_steps=5,verbose=2)

# predict

import numpy as np

from tensorflow.keras.preprocessing import image

import matplotlib.image as mpimg

%matplotlib inline

img = image.load_img(path, target_size=(150,150))

imgplot = plot.imshow(img)

x = image.img_to_array(img)

x = np.expand_dims(x,axis=0)

images = np.vstack([x])

classes = model.predict(images, batch_size=10)
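
Going back to the activation functions and backprop described before the TensorFlow example, the idea can be sketched with a single neuron in plain Python. This is a toy illustration, not how TensorFlow works internally; the input, target, starting weight, bias, and learning rate are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# Hyperbolic tangent is available directly as math.tanh

# One neuron with one input, trained to output 1.0 for input 2.0.
# The starting weight, bias, and learning rate are arbitrary assumptions.
x, target = 2.0, 1.0
weight, bias, learning_rate = 0.1, 0.0, 0.5

losses = []
for step in range(50):
    # Forward pass: weighted sum, then activation
    output = sigmoid(weight * x + bias)
    # Loss: squared difference between our output and the correct answer
    loss = (output - target) ** 2
    losses.append(loss)
    # Backward pass: the chain rule gives the gradient of the loss
    # with respect to the weighted sum, then to the weight and the bias
    grad_z = 2 * (output - target) * output * (1 - output)
    weight -= learning_rate * grad_z * x
    bias -= learning_rate * grad_z

print(losses[0], losses[-1])
```

The loss shrinks with every step because each update nudges the weight and bias in the direction that brings the output closer to the target.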

For a live demo in Indonesian, you can visit this youtube video. For automatic training we can use IBM Watson's AutoAI. If you need more training in Indonesian, try DiCoding, since most of this article is taken from there (these are my personal notes from following the course there).