Programming Rants: KEDA Kubernetes Event-Driven Autoscaling

Autoscaling is mostly useless if the number of hosts/nodes/hypervisors is limited, e.g. we only have N nodes and we try to autoscale the services inside them, so by default we already waste a lot of unused resources (especially if the billing is not like Jelastic's, i.e. you pay for whatever you allocate, not whatever you utilize). Autoscaling is also quite useless if your problem is I/O-bound rather than CPU-bound, for example when you don't use an autoscaled database (or whatever the I/O bottleneck is). CPU has rarely been the bottleneck in my past experience. But today we're gonna try to use KEDA to autoscale a Kubernetes service. First we need a local Kubernetes cluster, in this case minikube:

# install minikube for local kubernetes cluster
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
minikube start # --driver=kvm2 or --driver=virtualbox
minikube kubectl # downloads a matching kubectl binary on first run
alias k='minikube kubectl --'
k get pods --all-namespaces

# if not using --driver=docker (default)
minikube addons configure registry-creds
Do you want to enable AWS Elastic Container Registry? [y/n]: n
Do you want to enable Google Container Registry? [y/n]: n
Do you want to enable Docker Registry? [y/n]: y
-- Enter docker registry server url: https://hub.docker.com/
-- Enter docker registry username: kokizzu
-- Enter docker registry password:
Do you want to enable Azure Container Registry? [y/n]: n
✅ registry-creds was successfully configured

Next we need to build a dummy container image for the pod that we want to autoscale:

# build example docker image
docker login -u kokizzu # replace with your docker hub username
docker build -t pf1 .
docker image ls pf1
# REPOSITORY TAG IMAGE ID CREATED SIZE
# pf1 latest 204670ee86bd 2 minutes ago 89.3MB

# run locally for testing
docker run -it -p 3000:3000 pf1 # options must come before the image name

# tag and upload
docker image tag pf1 kokizzu/pf1:v0001
docker image push kokizzu/pf1:v0001

Create a Terraform deployment file, something like this:

# main.tf
terraform {
required_version = ">= 1.3.0"
required_providers {
    kubernetes = {
      source = "hashicorp/kubernetes"
      version = "= 2.20.0"
    }
}
backend "local" {
    path = "/tmp/pf1.tfstate"
}
}
provider "kubernetes" {
config_path    = "~/.kube/config"
# from k config view | grep -A 3 minikube | grep server:
host           = "https://240.1.0.2:8443"
config_context = "minikube"
}
provider "helm" {
kubernetes {
    config_path    = "~/.kube/config"
    config_context = "minikube"
}
}
resource "kubernetes_namespace_v1" "pf1ns" {
metadata {
    name        = "pf1ns"
    annotations = {
      name = "deployment namespace"
    }
}
}
resource "kubernetes_deployment_v1" "promfiberdeploy" {
metadata {
    name      = "promfiberdeploy"
    namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
}
spec {
    selector {
      match_labels = {
        app = "promfiber"
      }
    }
    replicas = "1"
    template {
      metadata {
        labels = {
          app = "promfiber"
        }
        annotations = {
          "prometheus.io/path"   = "/metrics"
          "prometheus.io/scrape" = "true"
          "prometheus.io/port"   = 3000
        }
      }
      spec {
        container {
          name = "pf1"
          image = "kokizzu/pf1:v0001" # from promfiber.go
          port {
            container_port = 3000
          }
        }
      }
    }
}
}
resource "kubernetes_service_v1" "pf1svc" {
metadata {
    name      = "pf1svc"
    namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
}
spec {
    selector = {
      app = kubernetes_deployment_v1.promfiberdeploy.spec.0.template.0.metadata.0.labels.app
    }
    port {
      port        = 33000 # no effect in minikube, it will be forwarded to a random port anyway
      target_port = kubernetes_deployment_v1.promfiberdeploy.spec.0.template.0.spec.0.container.0.port.0.container_port
    }
    type = "NodePort"
}
}
resource "kubernetes_ingress_v1" "pf1ingress" {
metadata {
    name        = "pf1ingress"
    namespace   = kubernetes_namespace_v1.pf1ns.metadata.0.name
    annotations = {
      "kubernetes.io/ingress.class" = "nginx"
    }
}
spec {
    rule {
      host = "pf1svc.pf1ns.svc.cluster.local"
      http {
        path {
          path = "/"
          backend {
            service {
              name = kubernetes_service_v1.pf1svc.metadata.0.name
              port {
                number = kubernetes_service_v1.pf1svc.spec.0.port.0.port
              }
            }
          }
        }
      }
    }
}
}
resource "kubernetes_config_map_v1" "prom1conf" {
metadata {
    name      = "prom1conf"
    namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
}
data = {
    # from https://github.com/techiescamp/kubernetes-prometheus/blob/master/config-map.yaml
    "prometheus.yml" : <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
rule_files:
  #- /etc/prometheus/prometheus.rules
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "pf1"
    static_configs:
      - targets: [
          "${kubernetes_ingress_v1.pf1ingress.spec.0.rule.0.host}:${kubernetes_service_v1.pf1svc.spec.0.port.0.port}"
        ]
EOF
    # if this config changes after terraform apply, delete the stateful set
    # or run: kubectl rollout restart statefulset prom1stateful -n pf1ns
    # because statefulset pods are not restarted automatically when a configmap
    # used as env or config file changes
}
}
resource "kubernetes_persistent_volume_v1" "prom1datavol" {
metadata {
    name = "prom1datavol"
}
spec {
    access_modes = ["ReadWriteOnce"]
    capacity     = {
      storage = "1Gi"
    }
    # do not add storage_class_name or it would get stuck
    persistent_volume_source {
      host_path {
        path = "/tmp/prom1data" # mkdir first?
      }
    }
}
}
resource "kubernetes_persistent_volume_claim_v1" "prom1dataclaim" {
metadata {
    name      = "prom1dataclaim"
    namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
}
spec {
    # do not add storage_class_name or it would get stuck
    access_modes = ["ReadWriteOnce"]
    resources {
      requests = {
        storage = "1Gi"
      }
    }
}
}
resource "kubernetes_stateful_set_v1" "prom1stateful" {
metadata {
    name      = "prom1stateful"
    namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
    labels    = {
      app = "prom1"
    }
}
spec {
    selector {
      match_labels = {
        app = "prom1"
      }
    }
    template {
      metadata {
        labels = {
          app = "prom1"
        }
      }
      # example: https://github.com/mateothegreat/terraform-kubernetes-monitoring-prometheus/blob/main/deployment.tf
      spec {
        container {
          name = "prometheus"
          image = "prom/prometheus:latest"
          args = [
            "--config.file=/etc/prometheus/prometheus.yml",
            "--storage.tsdb.path=/prometheus/",
            "--web.console.libraries=/etc/prometheus/console_libraries",
            "--web.console.templates=/etc/prometheus/consoles",
            "--web.enable-lifecycle",
            "--web.enable-admin-api",
            "--web.listen-address=:10902"
          ]
          port {
            name           = "http1"
            container_port = 10902
          }
          volume_mount {
            name       = kubernetes_config_map_v1.prom1conf.metadata.0.name
            mount_path = "/etc/prometheus/"
          }
          volume_mount {
            name       = "prom1datastorage"
            mount_path = "/prometheus/"
          }
          #security_context {
          # run_as_group = "1000" # because /tmp/prom1data is owned by 1000
          #}
        }
        volume {
          name = kubernetes_config_map_v1.prom1conf.metadata.0.name
          config_map {
            default_mode = "0666"
            name         = kubernetes_config_map_v1.prom1conf.metadata.0.name
          }
        }
        volume {
          name = "prom1datastorage"
          persistent_volume_claim {
            claim_name = kubernetes_persistent_volume_claim_v1.prom1dataclaim.metadata.0.name
          }
        }
      }
    }
    service_name = ""
}
}
resource "kubernetes_service_v1" "prom1svc" {
metadata {
    name      = "prom1svc"
    namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
}
spec {
    selector = {
      app = kubernetes_stateful_set_v1.prom1stateful.spec.0.template.0.metadata.0.labels.app
    }
    port {
      port        = 10902 # no effect in minikube, it will be forwarded to a random port anyway
      target_port = kubernetes_stateful_set_v1.prom1stateful.spec.0.template.0.spec.0.container.0.port.0.container_port
    }
    type = "NodePort"
}
}
resource "helm_release" "pf1keda" {
name       = "pf1keda"
repository = "https://kedacore.github.io/charts"
chart      = "keda"
namespace = kubernetes_namespace_v1.pf1ns.metadata.0.name
# uninstall: https://keda.sh/docs/2.11/deploy/#helm
}
# apply once with this block commented out (so the KEDA CRDs get installed), then uncomment it and apply again
## from: https://www.youtube.com/watch?v=1kEKrhYMf_g
#resource "kubernetes_manifest" "scaled_object" {
# manifest = {
#    "apiVersion" = "keda.sh/v1alpha1"
#    "kind"       = "ScaledObject"
#    "metadata"   = {
#      "name"      = "pf1scaledobject"
#      "namespace" = kubernetes_namespace_v1.pf1ns.metadata.0.name
#    }
#    "spec" = {
#      "scaleTargetRef" = {
#        "apiVersion" = "apps/v1"
#        "name"       = kubernetes_deployment_v1.promfiberdeploy.metadata.0.name
#        "kind"       = "Deployment"
#      }
#      "minReplicaCount" = 1
#      "maxReplicaCount" = 5
#      "triggers"        = [
#        {
#          "type"     = "prometheus"
#          "metadata" = {
#            "serverAddress" = "http://prom1svc.pf1ns.svc.cluster.local:10902"
#            "threshold"     = "100"
#            "query"         = "sum(irate(http_requests_total[1m]))"
#            # with or without {service=\"promfiber\"} is the same since 1 service 1 pod in our case
#          }
#        }
#      ]
#    }
# }
#}

terraform init # download dependencies
terraform plan # check changes
terraform apply # deploy
terraform apply # run again after uncommenting the scaled_object part

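The comment-then-uncomment step exists because kubernetes_manifest needs the ScaledObject CRD to already be in the cluster when Terraform plans it. A possible alternative, sketched below, is to leave the scaled_object resource uncommented and use Terraform's -target flag to install the KEDA chart first. The function name apply_in_two_phases is made up for illustration, helm_release.pf1keda is the resource address from main.tf above, and this is untested against this exact setup, so treat it as a sketch.

```shell
# Sketch: apply main.tf in two phases without commenting anything out.
# Assumes the scaled_object resource in main.tf is already uncommented.
apply_in_two_phases() {
  # Phase 1: only the KEDA Helm release (plus what it depends on, e.g. the
  # namespace), so the keda.sh CRDs exist in the cluster.
  terraform apply -target=helm_release.pf1keda || return 1
  # Phase 2: everything else, including the ScaledObject manifest.
  terraform apply
}
```

Call it with apply_in_two_phases from the directory that holds main.tf; each phase still asks for the usual confirmation.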
k get pods --all-namespaces -w # check deployment
NAME                                   READY  STATUS   RESTARTS  AGE
keda-admission-webhooks-xzkp4          1/1    Running  0         2m2s
keda-operator-r6hsh                    1/1    Running  1         2m2s
keda-operator-metrics-apiserver-xjp4d  1/1    Running  0         2m2s
promfiberdeploy-868697d555-8jh6r       1/1    Running  0         3m40s
prom1stateful-0                        1/1    Running  0         22s

k get services --all-namespaces
NAMESPACE  NAME      TYPE      CLUSTER-IP      EXTERNAL-IP  PORT(S)          AGE
pf1ns      pf1svc    NodePort  10.111.141.44   <none>       33000:30308/TCP  2s
pf1ns      prom1svc  NodePort  10.109.131.196  <none>       10902:30423/TCP  6s

minikube service list
|-----------|----------|-------------|------------------------|
| NAMESPACE |   NAME   | TARGET PORT |          URL           |
|-----------|----------|-------------|------------------------|
| pf1ns     | pf1svc   |       33000 | http://240.1.0.2:30308 |
| pf1ns     | prom1svc |       10902 | http://240.1.0.2:30423 |
|-----------|----------|-------------|------------------------|

To debug if something goes wrong, you can use something like this:

# debug inside container, replace with pod name
k exec -it pf1deploy-77bf69d7b6-cqqwq -n pf1ns -- bash

# or use dedicated debug pod
k apply -f debug-pod.yml
k exec -it debug-pod -n pf1ns -- bash
# delete it when done
k delete pod debug-pod -n pf1ns

# install debugging tools (run inside the pod)
apt update
apt install curl iputils-ping dnsutils net-tools # dig and netstat
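The debug-pod.yml applied above is not shown here, so this is only a guess at a minimal version. The ubuntu:22.04 image (picked because the tools are installed with apt), the container name, and the sleep command are all assumptions; only the pod name and namespace come from the commands above.

```shell
# Sketch of a debug-pod.yml; the original file is not shown in this post.
# Assumptions: an Ubuntu image (so apt is available), kept alive with
# "sleep infinity" so we can exec into it.
cat > debug-pod.yml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
  namespace: pf1ns
spec:
  containers:
    - name: debug
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
EOF
```

Putting the pod in the same namespace as the service lets you test both the short name (pf1svc) and the full cluster DNS name from inside it.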

To check the metrics that we want to use for autoscaling, we can look in multiple places:

# check metrics inside pod
curl http://pf1svc.pf1ns.svc.cluster.local:33000/metrics

# check metrics from outside
curl http://240.1.0.2:30308/metrics

# or open from prometheus UI: http://240.1.0.2:30423

# get metrics
k get --raw "/apis/external.metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[{"name":"externalmetrics","singularName":"","namespaced":true,"kind":"ExternalMetricValueList","verbs":["get"]}]}

# get scaled object (the name must match the ScaledObject's metadata.name, pf1scaledobject in the main.tf above)
k get scaledobject pf1keda -n pf1ns
NAME      SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
pf1keda   apps/v1.Deployment   pf1deploy         1     5     prometheus                    True    False    False      Unknown   3d20h

# get metric name
k get scaledobject pf1keda -n pf1ns -o 'jsonpath={.status.externalMetricNames}'
["s0-prometheus-prometheus"]
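When the ScaledObject is READY but never becomes ACTIVE, it can help to run the scaler's query yourself through Prometheus' HTTP API (/api/v1/query) and compare the result with the threshold. A small sketch: prom_query is a made-up helper name, and the default URL is the minikube NodePort address of prom1svc from this post, which will differ on your machine.

```shell
# Sketch: run a PromQL query through Prometheus' HTTP API.
# prom_query is a hypothetical helper; the default base URL is the
# NodePort address from this post and is only an example.
prom_query() {
  query="$1"
  base_url="${2:-http://240.1.0.2:30423}"
  # -G turns the url-encoded data into a GET query string
  curl -sS -G "$base_url/api/v1/query" --data-urlencode "query=$query"
}
```

For example, prom_query 'sum(irate(http_requests_total[1m]))' returns the same number the prometheus trigger sees, as JSON.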

Next we can run a load test while watching the pods:

# do loadtest
hey -c 100 -n 100000 http://240.1.0.2:30308

# check with kubectl get pods -w -n pf1ns, it would spawn:
promfiberdeploy-96qq9 0/1     Pending             0 0s
promfiberdeploy-j5qw9 0/1     Pending             0 0s
promfiberdeploy-96qq9 0/1     Pending             0 0s
promfiberdeploy-76pvt 0/1     Pending             0 0s
promfiberdeploy-76pvt 0/1     Pending             0 0s
promfiberdeploy-j5qw9 0/1     Pending             0 0s
promfiberdeploy-96qq9 0/1     ContainerCreating   0 0s
promfiberdeploy-76pvt 0/1     ContainerCreating   0 0s
promfiberdeploy-j5qw9 0/1     ContainerCreating   0 0s
promfiberdeploy-96qq9 1/1     Running             0 1s
promfiberdeploy-j5qw9 1/1     Running             0 1s
promfiberdeploy-76pvt 1/1     Running             0 1s
...
promfiberdeploy-j5qw9 1/1     Terminating 0 5m45s
promfiberdeploy-96qq9 1/1     Terminating       0 5m45s
promfiberdeploy-gt2h5 1/1     Terminating       0 5m30s
promfiberdeploy-76pvt 1/1     Terminating       0 5m45s

# all events, including the scale-up events
k get events -n pf1ns -w
21m    Normal ScalingReplicaSet deployment/promfiberdeploy Scaled up replica set promfiberdeploy-868697d555 to 1
9m20s Normal ScalingReplicaSet deployment/promfiberdeploy Scaled up replica set promfiberdeploy-868697d555 to 4 from 1
9m5s   Normal ScalingReplicaSet deployment/promfiberdeploy Scaled up replica set promfiberdeploy-868697d555 to 5 from 4
3m35s Normal ScalingReplicaSet deployment/promfiberdeploy Scaled down replica set promfiberdeploy-868697d555 to 1 from 5
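To get a feel for where numbers like "to 4 from 1" come from: as far as I understand, with the default metric type the HPA that KEDA creates aims for roughly ceil(query result / threshold) replicas, clamped between minReplicaCount and maxReplicaCount. The sketch below is a simplified model only; it ignores the HPA's tolerance, scale-up rate limits and stabilization windows, and the request rates in the examples are made up, not measured.

```shell
# Simplified model of the replica count targeted for a prometheus trigger:
# ceil(metric / threshold), clamped to [min, max].
# Ignores HPA tolerance, scale-up rate limits and stabilization windows.
desired_replicas() {
  awk -v m="$1" -v t="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    r = m / t
    n = int(r)
    if (n < r) n++        # round up
    if (n < lo) n = lo    # minReplicaCount
    if (n > hi) n = hi    # maxReplicaCount
    print n
  }'
}

# hypothetical request rates with threshold=100, min=1, max=5
desired_replicas 350 100 1 5    # prints 4
desired_replicas 1200 100 1 5   # prints 5
desired_replicas 0 100 1 5      # prints 1
```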

That's it, that's how you use KEDA and Terraform to autoscale deployments. The key parts of the .tf file are:

terraform - tells Terraform which providers (plugins) to download on terraform init
kubernetes and helm - set which kube config to use and which cluster to contact
kubernetes_namespace_v1 - creates a namespace (e.g. per tenant)
kubernetes_deployment_v1 - sets which pod to run and which Docker image it uses
kubernetes_service_v1 - exposes a port on the node (in this case only NodePort) and load-balances between pods
kubernetes_ingress_v1 - should be used to route requests to the proper services, but since we only have 1 service and minikube has its own forwarding, it is not used in our case
kubernetes_config_map_v1 - binds a config file (volume) to the Prometheus deployment; this sets where to scrape the service. This is NOT the proper way to do it; the proper way is in the latest commit of the terraform1 repository, using PodMonitor from prometheus-operator:

kubernetes_service_v1 - exposes the global Prometheus (the one that monitors the whole Kubernetes cluster, not a single namespace)
kubernetes_service_account_v1 - creates a service account so the namespace's Prometheus can retrieve the pod list
kubernetes_cluster_role_v1 - a role that allows listing pods
kubernetes_cluster_role_binding_v1 - binds the service account to the role above
kubernetes_manifest - creates the PodMonitor manifest, the rule the namespace's Prometheus uses to match specific pods
kubernetes_manifest - creates the Prometheus manifest that deploys Prometheus in a specific namespace

kubernetes_persistent_volume_v1 and kubernetes_persistent_volume_claim_v1 - bind a data directory (volume) to the Prometheus deployment
kubernetes_stateful_set_v1 - deploys Prometheus; since it is not a stateless service, we have to bind a data volume to prevent data loss
kubernetes_service_v1 - exposes the Prometheus port to the outside
helm_release - deploys KEDA
kubernetes_manifest - creates a custom manifest, since ScaledObject is not supported by the Kubernetes Terraform provider; this configures which deployment can be autoscaled

If you need the source code, take a look at the terraform1 repo; the latest commit uses PodMonitor.

2023-07-02
