Ollama on K8s
Once we have a Kubernetes cluster, the nodes that will run the Ollama pods should ideally have a GPU. Using taints, tolerations, and affinity we can steer Ollama onto those specific nodes.
To test the installation we'll create a kind cluster without a GPU, but let's imagine that one of the nodes has one. In this cluster we'll deploy an nginx ingress controller and map ports 80 and 443 of the host; the controller pod will run on the control-plane node.
kind create cluster --config - << EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
name: kind-cluster-ia
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF
kubectl get nodes
NAME                            STATUS   ROLES           AGE   VERSION
kind-cluster-ia-control-plane   Ready    control-plane   13m   v1.29.2
kind-cluster-ia-worker          Ready    <none>          12m   v1.29.2
kind-cluster-ia-worker2         Ready    <none>          12m   v1.29.2
Let's pretend that kind-cluster-ia-worker2 is the node with the GPU and taint it, so that only pods tolerating the taint can be scheduled there.
kubectl taint nodes kind-cluster-ia-worker2 gpu=true:NoSchedule
node/kind-cluster-ia-worker2 tainted
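If worker2 really had a GPU, the chart's scheduling fields (nodeSelector, tolerations, affinity) are where we'd point Ollama at it. Here is a sketch of the overrides we could add to the chart's values; the gpu=true taint and label names are just this example's convention, not something the chart requires, and it assumes the node was also labeled with `kubectl label node kind-cluster-ia-worker2 gpu=true`:

```yaml
# Sketch: schedule Ollama onto the tainted "GPU" node.
ollama:
  gpu:
    enabled: true        # on a real cluster this selects the GPU-enabled runtime
tolerations:
  - key: "gpu"           # matches the taint we applied above
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
nodeSelector:
  gpu: "true"            # requires the matching node label
```

The toleration lets the pod onto the tainted node; the nodeSelector makes sure it goes nowhere else.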
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=90s
We'll also want an Ingress so we can reach Ollama by name; the Helm chart we're about to install can create it for us.
To install Ollama we'll use its helm chart.
kubectl create ns ollama
namespace/ollama created
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm show values ollama-helm/ollama > values.yaml
Here is the default values.yaml, with the changes we'll make already marked.
# Default values for ollama-helm.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Number of replicas
replicaCount: 1

# Docker image
image:
  # -- Docker image registry
  repository: ollama/ollama
  # -- Docker pull policy
  pullPolicy: IfNotPresent
  # -- Docker image tag, overrides the image tag whose default is the chart appVersion.
  tag: ""

# -- Docker registry secret names as an array
imagePullSecrets: []
# -- String to partially override template (will maintain the release name)
nameOverride: ""
# -- String to fully override template
fullnameOverride: ""

# Ollama parameters
ollama:
  gpu:
    # -- Enable GPU integration
    enabled: false ### KEEPING FALSE
    # -- GPU type: 'nvidia' or 'amd'
    # If 'ollama.gpu.enabled', default value is nvidia
    # If set to 'amd', this will add 'rocm' suffix to image tag if 'image.tag' is not override
    # This is due cause AMD and CPU/CUDA are different images
    type: 'nvidia'
    # -- Specify the number of GPU
    number: 1
  # -- List of models to pull at container startup
  # The more you add, the longer the container will take to start if models are not present
  # models:
  #   - llama2
  #   - mistral
  models: ### ADDED SOME MODELS
    - codellama
    - gemma
    - llama2
  # -- Add insecure flag for pulling at container startup
  insecure: false

# Service account
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
serviceAccount:
  # -- Specifies whether a service account should be created
  create: true
  # -- Automatically mount a ServiceAccount's API credentials?
  automount: true
  # -- Annotations to add to the service account
  annotations: {}
  # -- The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

# -- Map of annotations to add to the pods
podAnnotations: {}
# -- Map of labels to add to the pods
podLabels: {}

# -- Pod Security Context
podSecurityContext: {}
  # fsGroup: 2000

# -- Container Security Context
securityContext: {}
  # capabilities:
  #   drop:
  #     - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

# -- Specify runtime class
runtimeClassName: ""

# Configure Service
service:
  # -- Service type
  type: ClusterIP
  # -- Service port
  port: 11434

# Configure the ingress resource that allows you to access the
ingress:
  # -- Enable ingress controller resource
  enabled: true ### CHANGED TO TRUE
  # -- IngressClass that will be used to implement the Ingress (Kubernetes 1.18+)
  className: "nginx" ### CHANGED
  # -- Additional annotations for the Ingress resource.
  annotations: {}
    # kubernetes.io/ingress.class: traefik
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  # The list of hostnames to be covered with this ingress record.
  hosts:
    - host: ollama.local
      paths:
        - path: /
          pathType: Prefix
  # -- The tls configuration for hostnames to be covered with this ingress record.
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

# Configure resource requests and limits
# ref: http://kubernetes.io/docs/user-guide/compute-resources/
resources:
  # Pod requests
  requests: {}
    # -- Memory request
    # memory: 4096Mi
    # -- CPU request
    # cpu: 2000m
  # Pod limit
  limits: {}
    # -- Memory limit
    # memory: 8192Mi
    # -- CPU limit
    # cpu: 4000m

# Configure extra options for liveness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
livenessProbe:
  # -- Enable livenessProbe
  enabled: true
  # -- Request path for livenessProbe
  path: /
  # -- Initial delay seconds for livenessProbe
  initialDelaySeconds: 60
  # -- Period seconds for livenessProbe
  periodSeconds: 10
  # -- Timeout seconds for livenessProbe
  timeoutSeconds: 5
  # -- Failure threshold for livenessProbe
  failureThreshold: 6
  # -- Success threshold for livenessProbe
  successThreshold: 1

# Configure extra options for readiness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
readinessProbe:
  # -- Enable readinessProbe
  enabled: true
  # -- Request path for readinessProbe
  path: /
  # -- Initial delay seconds for readinessProbe
  initialDelaySeconds: 30
  # -- Period seconds for readinessProbe
  periodSeconds: 5
  # -- Timeout seconds for readinessProbe
  timeoutSeconds: 3
  # -- Failure threshold for readinessProbe
  failureThreshold: 6
  # -- Success threshold for readinessProbe
  successThreshold: 1

# Configure autoscaling
autoscaling:
  # -- Enable autoscaling
  enabled: false
  # -- Number of minimum replicas
  minReplicas: 1
  # -- Number of maximum replicas
  maxReplicas: 100
  # -- CPU usage to target replica
  targetCPUUtilizationPercentage: 80
  # -- targetMemoryUtilizationPercentage: 80

# -- Additional volumes on the output Deployment definition.
volumes: []
# -- - name: foo
#   secret:
#     secretName: mysecret
#     optional: false

# -- Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# -- - name: foo
#   mountPath: "/etc/foo"
#   readOnly: true

# -- Additional arguments on the output Deployment definition.
extraArgs: []

# -- Additional environments variables on the output Deployment definition.
extraEnv: []
#  - name: OLLAMA_DEBUG
#    value: "1"

# Enable persistence using Persistent Volume Claims
# ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
persistentVolume:
  # -- Enable persistence using PVC
  enabled: false
  # -- Ollama server data Persistent Volume access modes
  # Must match those of existing PV or dynamic provisioner
  # Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  accessModes:
    - ReadWriteOnce
  # -- Ollama server data Persistent Volume annotations
  annotations: {}
  # -- If you'd like to bring your own PVC for persisting Ollama state, pass the name of the
  # created + ready PVC here. If set, this Chart will not create the default PVC.
  # Requires server.persistentVolume.enabled: true
  existingClaim: ""
  # -- Ollama server data Persistent Volume size
  size: 30Gi
  # -- Ollama server data Persistent Volume Storage Class
  # If defined, storageClassName: <storageClass>
  # If set to "-", storageClassName: "", which disables dynamic provisioning
  # If undefined (the default) or set to null, no storageClassName spec is
  # set, choosing the default provisioner. (gp2 on AWS, standard on
  # GKE, AWS & OpenStack)
  storageClass: ""
  # -- Ollama server data Persistent Volume Binding Mode
  # If defined, volumeMode: <volumeMode>
  # If empty (the default) or set to null, no volumeBindingMode spec is
  # set, choosing the default mode.
  volumeMode: ""
  # -- Subdirectory of Ollama server data Persistent Volume to mount
  # Useful if the volume's root directory is not empty
  subPath: ""

# -- Node labels for pod assignment.
nodeSelector: {}

# -- Tolerations for pod assignment
tolerations: []

# -- Affinity for pod assignment
affinity: {}
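One caveat before installing: with persistentVolume.enabled left at false, the models pulled at startup live on ephemeral storage, so every pod restart downloads them again. If the cluster has a default StorageClass, a sketch of the override that would keep them between restarts (sizes here are illustrative):

```yaml
persistentVolume:
  enabled: true
  size: 30Gi   # must be large enough for the models listed above
```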
To install, we'll use the values.yaml we edited above:
helm install ollama ollama-helm/ollama --namespace ollama --values values.yaml
The ingress will match the ollama.local host and route traffic to the ollama service. To make that name resolve, let's add an entry to /etc/hosts pointing ollama.local at our own machine.
echo "127.0.0.1 ollama.local" | sudo tee -a /etc/hosts
Ollama is now running in the cluster:
kubectl get all -n ollama
NAME                          READY   STATUS    RESTARTS   AGE
pod/ollama-84f88484dc-4m8v2   1/1     Running   0          14m

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
service/ollama   ClusterIP   10.96.122.73   <none>        11434/TCP   14m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ollama   1/1     1            1           14m

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/ollama-84f88484dc   1         1         1       14m
kubectl get ingress -n ollama
NAME     CLASS   HOSTS          ADDRESS     PORTS   AGE
ollama   nginx   ollama.local   localhost   80      15m
curl http://ollama.local
Ollama is running
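Beyond this health check, the same hostname exposes Ollama's REST API. As a sketch (it assumes the cluster above is up and that llama2 has finished pulling), a non-streaming generation request looks like this:

```shell
# Ask llama2 a question through the ingress; "stream": false returns
# a single JSON object instead of one JSON line per token.
curl http://ollama.local/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

The response is a JSON object whose "response" field holds the generated text.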
It's also possible to use the ollama CLI against this instance by pointing OLLAMA_HOST at it.
❯ export OLLAMA_HOST=http://ollama.local
❯ ollama list
NAME               ID             SIZE     MODIFIED
codellama:latest   8fdf8f752f6e   3.8 GB   25 minutes ago
gemma:latest       a72c7f4d0a15   5.0 GB   23 minutes ago
llama2:latest      78e26419b446   3.8 GB   21 minutes ago
❯ ollama run codellama
>>> Do you speak Portuguese?
I can speak a little Portuguese, but my proficiency varies. It may be that you have some difficulties understanding me or asking me
things in Portuguese, as I am a translation tool and don't have access to the Internet or more updated information about the language. However,
I am available to translate messages in Portuguese and help as possible.
>>>
❯ ollama run llama2
>>> do you speak Portuguese?
Yes, I speak Portuguese! Can I help you with any question or information in Portuguese?
>>> Send a message (/? for help)
Now let's install open-webui:
kubectl create namespace open-webui
namespace/open-webui created