Multiple Schedulers

We can have different Schedulers in the cluster. What we need to do is specify which Scheduler we want to use in the Pod. This Scheduler will be responsible for reading the Pod configuration and choosing the correct Node.

It's necessary that Schedulers have different names.

Let's take any Pod we have in our cluster.

k get pods busybox -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-03-18T03:06:56Z"
  labels:
    run: busybox
  name: busybox
  namespace: default
  resourceVersion: "2879304"
  uid: b055b81d-96d4-4fe7-8666-635a9d03f4bf
spec:
  containers:
  - command:
    - sleep
    - "1000"
    image: busybox
    imagePullPolicy: Always
    name: busybox
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xxwlk
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: cka-cluster-worker3
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler #<<<
  ...

We can see that even though a Scheduler wasn't defined, we have the default-scheduler. We could pass another Scheduler in the spec.

The simplest way to create a Scheduler is by deploying another one.

The kind cluster creates the Scheduler as a Pod instead of a service, what can we do? Create another one, but passing specific configurations.

Let's analyze the kube-scheduler configuration. This is the Scheduler that kind itself created with kubeadm.

kubectl get pods -n kube-system kube-scheduler-cka-cluster-control-plane -o=jsonpath='{.spec.containers[*].command}' | jq
[
  "kube-scheduler",
  "--authentication-kubeconfig=/etc/kubernetes/scheduler.conf",
  "--authorization-kubeconfig=/etc/kubernetes/scheduler.conf",
  "--bind-address=127.0.0.1",
  "--kubeconfig=/etc/kubernetes/scheduler.conf",
  "--leader-elect=true"
]

We can add another Scheduler and pass --config with the configuration file we'll create.

If we're going to create this Scheduler on all masters, we need to define --leader-elect=true. We could create the Scheduler using Static Pods by going inside the master and placing the manifest in /etc/kubernetes/manifest.

In the documentation, we have a model using the official Scheduler itself.

In this set of manifests we have:

Service account
2 ClusterRoleBindings that will give permission to the ServiceAccount in the same group as kube-scheduler system:kube-scheduler and in system:volume-scheduler
1 RoleBinding that will give permission to the ServiceAccount in extension-apiserver-authentication-reader
The ConfigMap that will be mounted as a volume to provide a configuration file. With this, we can put all parameters inside it and reduce the number of entries above.

Create a file with the content below and apply it to the cluster.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-scheduler-extension-apiserver-authentication-reader
  namespace: kube-system
roleRef:
  kind: Role
  name: extension-apiserver-authentication-reader
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
---
## Here we can define a ConfigMap that will be mounted as a volume to deliver this file
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-scheduler-config
  namespace: kube-system
data:
  my-scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: my-scheduler
    leaderElection:
      leaderElect: true  # Was false, changed to true to test with 3 replicas
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: my-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 3 # Was 1, changed to 3
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: my-scheduler
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/my-scheduler/my-scheduler-config.yaml
        image: registry.k8s.io/kube-scheduler:v1.29.1 # Changed because we don't have the previous image
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts:
          - name: config-volume
            mountPath: /etc/kubernetes/my-scheduler
      hostNetwork: false
      hostPID: false
      volumes:
        - name: config-volume
          configMap:
            name: my-scheduler-config

kubectl apply -f my-scheduler.yaml
serviceaccount/my-scheduler created
clusterrolebinding.rbac.authorization.k8s.io/my-scheduler-as-kube-scheduler created
clusterrolebinding.rbac.authorization.k8s.io/my-scheduler-as-volume-scheduler created
rolebinding.rbac.authorization.k8s.io/my-scheduler-extension-apiserver-authentication-reader created
configmap/my-scheduler-config created
deployment.apps/my-scheduler created

k get pods -n kube-system | grep my-scheduler
my-scheduler-8ffc64976-449zr                        1/1     Running   0              92s
my-scheduler-8ffc64976-6hn6t                        1/1     Running   0              92s
my-scheduler-8ffc64976-nj8xp                        1/1     Running   0              92s

Some things we can analyze:

The master has Taints and we didn't define any Toleration for it to go to the master. It won't go.
We didn't define any nodeAffinity or podAntiAffinity for it to distribute the 3 replicas on different Nodes. If it happens, it's mere coincidence.
The default Scheduler that scheduled these Pods!

Now let's use this Scheduler to create a Pod

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  schedulerName:  my-scheduler
EOF

k describe pod nginx |  grep scheduler
  Normal  Scheduled  56s   my-scheduler  Successfully assigned default/nginx to cka-cluster-worker3

Scheduler Profiles

A Scheduler has a queue to schedule Pods. If they all have the same priority, then they enter at the end of the queue and follow the flow. First come, first served to be scheduled.

It's possible to configure different priorities and link a Pod to this priority. This way, it's possible for high-priority Pods to jump the queue. This is quite useful if a service really can't wait.

To define a priority, we can create as defined below. This is not a namespace-level resource, but cluster-level. A Scheduler looks at all Pods regardless of namespace.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false # Whether it will be the default or not
description: "This priority class should be used for XYZ service pods only."

There are already two defined in the cluster.

k get priorityclasses.scheduling.k8s.io -o wide
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            18d
system-node-critical      2000001000   false            18d

When it's the Pod's turn to be scheduled, it goes to the filtering stage. Nodes that can't receive the Pod, either due to lack of resources or user choice using nodeSelector, affinities, Taints and Tolerations will be eliminated, and only those that passed filtering can schedule the Pod.

Finally, we have the scoring stage. This stage will select which of the remaining Nodes is best to start the Pod. How is this score defined? If a Pod needs 2 CPUs and we have two possible Nodes, the first with 2 available CPUs and the second with 6, obviously the one with 6 already has a higher score. In this case, both would have sufficient memory to run the Pod, otherwise they wouldn't have passed the filtering stage.

Finally, we have the binding process which is when the Node is actually defined and the kubelet of that Node will be responsible for starting the Pod.

The stages are:

Scheduling Queue
Filtering
Scoring
Binding

All these operations are performed with certain plugins.

alt text

A plugin can be associated with more than one stage.

A curiosity is that in the scoring stage, we have a plugin called ImageLocality. A Node that already has the image available may have a higher score.

These stages can be further subdivided using different plugins that analyze in different ways.

alt text

You can write your own Scheduler methods.

A situation that can occur when having several different Schedulers is that each can have completely different rules, reading and interpreting Nodes differently, creating a race condition.

To avoid this, it's interesting to have one Scheduler with multiple profiles. This way, the same binary would be used for everyone.

    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: my-scheduler
        plugins:
          score:
            disabled:
            - name: TaintToleration
            enabled:
            - name: MyCustomPluginA
            - name: MyCustomPluginB
      - schedulerName: my-scheduler2
        plugins:
          preScore:
            disabled:
            - name: '*'
      - schedulerName: my-scheduler3
    leaderElection:
      leaderElect: true  # Was false, changed to true to test with 3 replicas

Scheduler Profiles​

Scheduler Profiles