Ollama on K8s
Once we have a Kubernetes cluster, the nodes that will run the Ollama pods should ideally have a GPU. Using taints, tolerations, and affinity we can steer Ollama onto those specific nodes.
To test the installation we'll create a kind cluster without a GPU, but let's imagine that one of the nodes has one. In this cluster we'll deploy an nginx ingress controller and map ports 80 and 443 of the host; the controller pod will run on the control-plane node.
kind create cluster --config - << EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
name: kind-cluster-ia
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF
kubectl get nodes
NAME                            STATUS   ROLES           AGE   VERSION
kind-cluster-ia-control-plane   Ready    control-plane   13m   v1.29.2
kind-cluster-ia-worker          Ready    <none>          12m   v1.29.2
kind-cluster-ia-worker2         Ready    <none>          12m   v1.29.2
Let's pretend that kind-cluster-ia-worker2 is the node with the GPU and taint it, so that only pods tolerating the taint can be scheduled there.
kubectl taint nodes kind-cluster-ia-worker2 gpu=true:NoSchedule
node/kind-cluster-ia-worker2 tainted
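If worker2 really had a GPU, the chart's scheduling fields (nodeSelector, tolerations, affinity) are where we'd point Ollama at it. Here is a sketch of the overrides we could add to the chart's values; the gpu=true taint and label names are just this example's convention, not something the chart requires, and it assumes the node was also labeled with `kubectl label node kind-cluster-ia-worker2 gpu=true`:

```yaml
# Sketch: schedule Ollama onto the tainted "GPU" node.
ollama:
  gpu:
    enabled: true        # on a real cluster this selects the GPU-enabled runtime
tolerations:
  - key: "gpu"           # matches the taint we applied above
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
nodeSelector:
  gpu: "true"            # requires the matching node label
```

The toleration lets the pod onto the tainted node; the nodeSelector makes sure it goes nowhere else.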
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=90s
We'll also want an Ingress so we can reach Ollama by name; the Helm chart we're about to install can create it for us.
To install Ollama we'll use its helm chart.
kubectl create ns ollama
namespace/ollama created
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm show values ollama-helm/ollama > values.yaml
Here is the default values.yaml, with the changes we'll make already marked.
# Default values for ollama-helm.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Number of replicas
replicaCount: 1

# Docker image
image:
  # -- Docker image registry
  repository: ollama/ollama
  # -- Docker pull policy
  pullPolicy: IfNotPresent
  # -- Docker image tag, overrides the image tag whose default is the chart appVersion.
  tag: ""

# -- Docker registry secret names as an array
imagePullSecrets: []
# -- String to partially override template (will maintain the release name)
nameOverride: ""
# -- String to fully override template
fullnameOverride: ""

# Ollama parameters
ollama:
  gpu:
    # -- Enable GPU integration
    enabled: false ### KEEPING FALSE
    # -- GPU type: 'nvidia' or 'amd'
    # If 'ollama.gpu.enabled', default value is nvidia
    # If set to 'amd', this will add 'rocm' suffix to image tag if 'image.tag' is not override
    # This is due cause AMD and CPU/CUDA are different images
    type: 'nvidia'
    # -- Specify the number of GPU
    number: 1
  # -- List of models to pull at container startup
  # The more you add, the longer the container will take to start if models are not present
  # models:
  #   - llama2
  #   - mistral
  models: ### ADDED SOME MODELS
    - codellama
    - gemma
    - llama2
  # -- Add insecure flag for pulling at container startup
  insecure: false

# Service account
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
serviceAccount:
  # -- Specifies whether a service account should be created
  create: true
  # -- Automatically mount a ServiceAccount's API credentials?
  automount: true
  # -- Annotations to add to the service account
  annotations: {}
  # -- The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

# -- Map of annotations to add to the pods
podAnnotations: {}
# -- Map of labels to add to the pods
podLabels: {}

# -- Pod Security Context
podSecurityContext: {}
  # fsGroup: 2000

# -- Container Security Context
securityContext: {}
  # capabilities:
  #   drop:
  #     - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

# -- Specify runtime class
runtimeClassName: ""

# Configure Service
service:
  # -- Service type
  type: ClusterIP
  # -- Service port
  port: 11434

# Configure the ingress resource that allows you to access the
ingress:
  # -- Enable ingress controller resource
  enabled: true ### CHANGED TO TRUE
  # -- IngressClass that will be used to implement the Ingress (Kubernetes 1.18+)
  className: "nginx" ### CHANGED
  # -- Additional annotations for the Ingress resource.
  annotations: {}
    # kubernetes.io/ingress.class: traefik
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  # The list of hostnames to be covered with this ingress record.
  hosts:
    - host: ollama.local
      paths:
        - path: /
          pathType: Prefix
  # -- The tls configuration for hostnames to be covered with this ingress record.
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

# Configure resource requests and limits
# ref: http://kubernetes.io/docs/user-guide/compute-resources/
resources:
  # Pod requests
  requests: {}
    # -- Memory request
    # memory: 4096Mi
    # -- CPU request
    # cpu: 2000m
  # Pod limit
  limits: {}
    # -- Memory limit
    # memory: 8192Mi
    # -- CPU limit
    # cpu: 4000m

# Configure extra options for liveness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
livenessProbe:
  # -- Enable livenessProbe
  enabled: true
  # -- Request path for livenessProbe
  path: /
  # -- Initial delay seconds for livenessProbe
  initialDelaySeconds: 60
  # -- Period seconds for livenessProbe
  periodSeconds: 10
  # -- Timeout seconds for livenessProbe
  timeoutSeconds: 5
  # -- Failure threshold for livenessProbe
  failureThreshold: 6
  # -- Success threshold for livenessProbe
  successThreshold: 1

# Configure extra options for readiness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
readinessProbe:
  # -- Enable readinessProbe
  enabled: true
  # -- Request path for readinessProbe
  path: /
  # -- Initial delay seconds for readinessProbe
  initialDelaySeconds: 30
  # -- Period seconds for readinessProbe
  periodSeconds: 5
  # -- Timeout seconds for readinessProbe
  timeoutSeconds: 3
  # -- Failure threshold for readinessProbe
  failureThreshold: 6
  # -- Success threshold for readinessProbe
  successThreshold: 1

# Configure autoscaling
autoscaling:
  # -- Enable autoscaling
  enabled: false
  # -- Number of minimum replicas
  minReplicas: 1
  # -- Number of maximum replicas
  maxReplicas: 100
  # -- CPU usage to target replica
  targetCPUUtilizationPercentage: 80
  # -- targetMemoryUtilizationPercentage: 80

# -- Additional volumes on the output Deployment definition.
volumes: []
# -- - name: foo
#   secret:
#     secretName: mysecret
#     optional: false

# -- Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# -- - name: foo
#   mountPath: "/etc/foo"
#   readOnly: true

# -- Additional arguments on the output Deployment definition.
extraArgs: []

# -- Additional environments variables on the output Deployment definition.
extraEnv: []
#  - name: OLLAMA_DEBUG
#    value: "1"

# Enable persistence using Persistent Volume Claims
# ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
persistentVolume:
  # -- Enable persistence using PVC
  enabled: false
  # -- Ollama server data Persistent Volume access modes
  # Must match those of existing PV or dynamic provisioner
  # Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  accessModes:
    - ReadWriteOnce
  # -- Ollama server data Persistent Volume annotations
  annotations: {}
  # -- If you'd like to bring your own PVC for persisting Ollama state, pass the name of the
  # created + ready PVC here. If set, this Chart will not create the default PVC.
  # Requires server.persistentVolume.enabled: true
  existingClaim: ""
  # -- Ollama server data Persistent Volume size
  size: 30Gi
  # -- Ollama server data Persistent Volume Storage Class
  # If defined, storageClassName: <storageClass>
  # If set to "-", storageClassName: "", which disables dynamic provisioning
  # If undefined (the default) or set to null, no storageClassName spec is
  # set, choosing the default provisioner. (gp2 on AWS, standard on
  # GKE, AWS & OpenStack)
  storageClass: ""
  # -- Ollama server data Persistent Volume Binding Mode
  # If defined, volumeMode: <volumeMode>
  # If empty (the default) or set to null, no volumeBindingMode spec is
  # set, choosing the default mode.
  volumeMode: ""
  # -- Subdirectory of Ollama server data Persistent Volume to mount
  # Useful if the volume's root directory is not empty
  subPath: ""

# -- Node labels for pod assignment.
nodeSelector: {}

# -- Tolerations for pod assignment
tolerations: []

# -- Affinity for pod assignment
affinity: {}
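One caveat before installing: with persistentVolume.enabled left at false, the models pulled at startup live on ephemeral storage, so every pod restart downloads them again. If the cluster has a default StorageClass, a sketch of the override that would keep them between restarts (sizes here are illustrative):

```yaml
persistentVolume:
  enabled: true
  size: 30Gi   # must be large enough for the models listed above
```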
To install, we'll use the values.yaml we edited above:
helm install ollama ollama-helm/ollama --namespace ollama --values values.yaml
The ingress will match the ollama.local host and route traffic to the ollama service. To make that name resolve, let's add an entry to /etc/hosts pointing ollama.local at our own machine.
echo "127.0.0.1 ollama.local" | sudo tee -a /etc/hosts
Ollama is now running in the cluster:
kubectl get all -n ollama
NAME                          READY   STATUS    RESTARTS   AGE
pod/ollama-84f88484dc-4m8v2   1/1     Running   0          14m

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
service/ollama   ClusterIP   10.96.122.73   <none>        11434/TCP   14m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ollama   1/1     1            1           14m

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/ollama-84f88484dc   1         1         1       14m
kubectl get ingress -n ollama
NAME     CLASS   HOSTS          ADDRESS     PORTS   AGE
ollama   nginx   ollama.local   localhost   80      15m
curl http://ollama.local
Ollama is running
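Beyond this health check, the same hostname exposes Ollama's REST API. As a sketch (it assumes the cluster above is up and that llama2 has finished pulling), a non-streaming generation request looks like this:

```shell
# Ask llama2 a question through the ingress; "stream": false returns
# a single JSON object instead of one JSON line per token.
curl http://ollama.local/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

The response is a JSON object whose "response" field holds the generated text.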
It's also possible to use the ollama CLI against this instance by pointing OLLAMA_HOST at it.
❯ export OLLAMA_HOST=http://ollama.local
❯ ollama list
NAME               ID             SIZE     MODIFIED
codellama:latest   8fdf8f752f6e   3.8 GB   25 minutes ago
gemma:latest       a72c7f4d0a15   5.0 GB   23 minutes ago
llama2:latest      78e26419b446   3.8 GB   21 minutes ago
❯ ollama run codellama
>>> Do you speak Portuguese?
I can speak a little Portuguese, but my proficiency varies. It may be that you have some difficulties understanding me or asking me
things in Portuguese, as I am a translation tool and don't have access to the Internet or more updated information about the language. However,
I am available to translate messages in Portuguese and help as possible.
>>>
❯ ollama run llama2
>>> do you speak Portuguese?
Yes, I speak Portuguese! Can I help you with any question or information in Portuguese?
>>> Send a message (/? for help)
Now let's install open-webui:
kubectl create namespace open-webui
namespace/open-webui created