There are a bunch of machine learning toolkits for supervised (labelled), unsupervised (clustering), semi-supervised, or reinforcement learning (reward and punishment). Most of them are written in (or have bindings for) Python:
And Golang alternatives:
How about IDEs?
Things that we must take note of when doing data cleansing, because garbage in, garbage out (a short pandas sketch follows the list below):
- Format consistency (eg. YYYY-MM-DD should not be mixed with DD/MM/YYYY or other formats)
- Data scale (for example, if a variable may only have values from 0 to 100, there should be no negative numbers or values larger than 100)
- Duplicated records (which may make the ML model weight those samples more heavily)
- Missing values (nulls should be imputed or normalized, or the affected columns removed)
- Skewness (imbalanced distribution, for example there are only 10 samples of class1 but 990 samples of class2); we could downsample the majority class or upweight the minority class to solve this problem.
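For example, a minimal cleansing sketch with pandas, assuming a hypothetical raw.csv with Date, Score, and Class columns:
import pandas as panda
df = panda.read_csv('raw.csv') # hypothetical file with Date, Score, and Class columns
# format consistency: parse the date strings into one format (unparseable ones become NaT)
df['Date'] = panda.to_datetime(df['Date'], errors='coerce')
# data scale: keep only rows where Score is inside the valid 0..100 range
df = df[(df['Score'] >= 0) & (df['Score'] <= 100)]
# duplicated records
df = df.drop_duplicates()
# missing values: drop rows that still contain nulls
df = df.dropna()
# skewness: downsample every class to the size of the smallest class
smallest = df['Class'].value_counts().min()
df = df.groupby('Class', group_keys=False).apply(lambda g: g.sample(smallest))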
The next step for ML is data preparation; we must convert the data types because some ML algorithms, eg. SVM or Linear Regression, can only work with numerical values. One way to convert categorical to numerical values is One Hot Encoding, eg. taste=[sweet,salty,bitter] becomes 3 new columns: is_sweet=0|1, is_salty=0|1, is_bitter=0|1 (see the get_dummies sketch after the scaler examples below). Some other steps for data preparation:
- removing outliers (values that differ too much from the rest of the group)
- normalization (changing the scale of values using this formula: (val-min)/(max-min)*scale)
or use MinMaxScaler from sklearn:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]
data = scaler.transform(data)
- standardization, using the z-score formula (val-mean)/stddev
or use StandardScaler from sklearn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data) # 2 dimensional array [ [ x, y ], ... ]
data = scaler.transform(data)
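And for the One Hot Encoding example above (taste=[sweet,salty,bitter]), a minimal sketch using pandas get_dummies; the sample values are made up for illustration:
import pandas as panda
df = panda.DataFrame({'taste': ['sweet', 'salty', 'bitter', 'sweet']})
# one indicator column per category: is_sweet, is_salty, is_bitter
df = panda.get_dummies(df, columns=['taste'], prefix='is')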
There are many kinds of storage tools that we could use to store data for ML: RDBMS, NoSQL (graph, key-value, columnar, time series, document-oriented databases). Some popular alternatives are: Firebase Realtime Database, Google Cloud Datastore, Amazon RDS, Spark ETL, Google BigQuery, etc.
from sklearn import datasets
idb = datasets.load_iris() # the famous iris flower dataset
x = idb.data
y = idb.target
# split 20% for test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
# use decision tree
from sklearn import tree
cla = tree.DecisionTreeClassifier()
# calculate cross validation score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(cla,x,y,cv=5)
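To summarize the 5-fold result above, we could average the scores:
print(scores.mean()) # mean accuracy across the 5 folds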
To do supervised learning for the iris dataset using a decision tree:
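In the snippet below, idb is assumed to be a pandas DataFrame loaded from a CSV version of the iris dataset (with Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, and Species columns), for example:
import pandas as panda
idb = panda.read_csv('Iris.csv') # hypothetical path, any iris CSV with those columns works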
idb.head()
# remove id column
idb.drop('Id',axis=1,inplace=True)
# take attributes
fn = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm' ]
x = idb[fn]
# take label
y = idb['Species']
# do training
from sklearn.tree import DecisionTreeClassifier
cla = DecisionTreeClassifier()
cla.fit(x,y)
# do prediction
cla.predict([[6,3,5,2]])
# use graphviz to visualize
from sklearn.tree import export_graphviz
cn = ['Iris-setosa','Iris-versicolor','Iris-virginica']
export_graphviz(cla,out_file="/tmp/iris.dot",feature_names=fn,class_names=cn,rounded=True,filled=True)
Then we could open the dot file or use an online tool to convert it to another format. The file above would show the decision tree rules.
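Alternatively, assuming Graphviz is installed locally, its dot CLI could render it directly, eg. dot -Tpng /tmp/iris.dot -o /tmp/iris.png.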
There are many kinds of regression, some of them are: linear regression (predicting a continuous number) and logistic regression (which, despite the name, is usually used for classification). This is an example of supervised learning using linear regression with numpy and sklearn:
import numpy as np
friends = np.array([1,1,2,2,3,3,4,4,5,5])
net_worth = np.array([123,125,234,250,345,360,456,470,567,590])
# plot as scatter chart
import matplotlib.pyplot as plot
# show the chart inline instead of in a new window (Jupyter/IPython only)
%matplotlib inline
plot.scatter(friends,net_worth)
# do training
from sklearn.linear_model import LinearRegression
friends = friends.reshape(-1,1)
lr = LinearRegression()
lr.fit(friends,net_worth)
plot.plot(friends, lr.predict(friends))
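After fitting, we could read the fitted line and predict for a new value, continuing the snippet above:
print(lr.coef_, lr.intercept_) # slope and intercept of the fitted line
print(lr.predict([[6]])) # predicted net worth for someone with 6 friends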
This is another example that loads a CSV and does one-hot encoding:
import pandas as panda
dataframe = panda.read_csv('table1.csv')
dataframe.head()
dataframe.info()
dataframe.rename(columns={'YearLived':'Age'}, inplace=True)
dataframe['Gender'].replace(['F','M'],[0,1], inplace=True)
data = dataframe.drop(columns=['unneeded column'])
# do one hot encoding
data = panda.get_dummies(data)
# split attributes and labels
attrs = ['Age','Gender']
x = data[attrs]
y = data['BuyerType']
# split training set and test set
from sklearn.model_selection import train_test_split
xtr, xte, ytr, yte = train_test_split(x,y, test_size=0.2, random_state=1)
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(xtr, ytr)
m.score(xte, yte)
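Besides the accuracy from score(), we could also inspect the predictions with a confusion matrix, continuing the snippet above:
from sklearn.metrics import confusion_matrix
pred = m.predict(xte)
print(confusion_matrix(yte, pred)) # rows = actual classes, columns = predicted classes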
How to do clustering using K-Means:
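In the snippet below, x is assumed to be a DataFrame of customer data with income and spending columns (a different dataset from the one above), for example:
x = panda.read_csv('customers.csv')[['income','spending']] # hypothetical file name and columns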
from sklearn.cluster import KMeans
clusters = []
for z in range(1,11):
    km = KMeans(n_clusters=z).fit(x)
    clusters.append(km.inertia_)
# plot based on inertia
import seaborn as sea
fig, ax = plot.subplots(figsize=(8,4))
sea.lineplot(x=list(range(1,11)), y=clusters, ax=ax)
ax.set_title('Look for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
# do kmeans
km4 = KMeans(n_clusters=4).fit(x)
x['Labels'] = km4.labels_
plot.figure(figsize=(8,4))
sea.scatterplot(x=x['income'], y=x['spending'], hue=x['Labels'], palette=sea.color_palette('hls',4))
plot.title('KMeans with 4 clusters')
plot.show()
If our ML model has too many attributes, we could use PCA (Principal Component Analysis), which looks at the explained variance, to reduce the cost and duration of ML training, or LDA (Linear Discriminant Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to reduce the dimensionality. This example shows how to train with and without PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca_attr = pca.fit_transform(xtr)
pca.explained_variance_ratio_
# look at the ratios and find how many components are needed to reach total variance > 0.95, eg. 2
pca = PCA(n_components=2)
xtr = pca.fit_transform(xtr)
xte = pca.transform(xte) # transform only, so the test set reuses the projection fitted on the training set
# train again and test
m = cla.fit(xtr, ytr)
m.score(xte, yte)
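For the LDA and t-SNE alternatives mentioned above, a minimal sketch could look like this, reusing x_train and y_train from the iris split earlier (the parameter values are only illustrative):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
# LDA is supervised (it uses the labels) and allows at most (number of classes - 1) components
lda = LinearDiscriminantAnalysis(n_components=2)
x_train_lda = lda.fit_transform(x_train, y_train)
# t-SNE is mostly used for visualization, embedding the data into 2 dimensions
x_train_tsne = TSNE(n_components=2).fit_transform(x_train)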
SVM (Support Vector Machine) is an algorithm that finds the boundary with the widest margin separating the classes, based on the distances between vectors; sometimes it adds another dimension so that it could separate the data correctly and get an optimal result. There are some popular kernel functions that could be used to add more dimensions: linear, polynomial, RBF, and sigmoid. This example shows how to do SVM classification using sklearn:
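A minimal sketch, reusing the iris split (x_train, x_test, y_train, y_test) from earlier; the RBF kernel here is just one of the options listed above:
from sklearn.svm import SVC
svm = SVC(kernel='rbf') # other kernels: 'linear', 'poly', 'sigmoid'
svm.fit(x_train, y_train)
print(svm.score(x_test, y_test)) # accuracy on the held-out test set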