Ollama on K8s
Once we have a Kubernetes cluster, the nodes that will run the Ollama pods should ideally have a GPU. Using Taints, Tolerations and Affinity we can steer Ollama onto those specific nodes.
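To illustrate the idea, this is roughly what those scheduling constraints look like in a pod spec. It is only a generic sketch: the gpu key matches the taint we will apply below and is not yet the Helm chart configuration.
tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu
              operator: In
              values:
                - "true"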
To test this installation we will create a kind cluster with no GPU, but we will pretend that one of the worker nodes has one. In this cluster we will deploy an nginx ingress controller and map ports 80 and 443 of the host; that pod will run on the control-plane node.
kind create cluster --config - << EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
name: kind-cluster-ia
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF
kubectl get nodes
NAME                            STATUS   ROLES           AGE   VERSION
kind-cluster-ia-control-plane   Ready    control-plane   13m   v1.29.2
kind-cluster-ia-worker          Ready    <none>          12m   v1.29.2
kind-cluster-ia-worker2         Ready    <none>          12m   v1.29.2
kubectl taint nodes kind-cluster-ia-worker2 gpu=true:NoSchedule
node/kind-cluster-ia-worker2 tainted
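The taint only repels pods that do not tolerate it. To also attract the Ollama pod to that node you would normally add a matching label (the gpu=true name is just the convention used in this post, not something the chart requires):
kubectl label nodes kind-cluster-ia-worker2 gpu=true
Later, in the chart values, a toleration plus a nodeSelector (or affinity) referencing this label would pin Ollama to the GPU worker; see the sketch after the values.yaml below.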
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=90s
We will also create an Ingress so that Ollama can be reached by a hostname.
To install Ollama we will use its Helm chart.
kubectl create ns ollama
namespace/ollama created
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm show values ollama-helm/ollama > values.yaml
Below is the default values.yaml; I have already marked the places we are going to change.
# Default values for ollama-helm.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Number of replicas
replicaCount: 1

# Docker image
image:
  # -- Docker image registry
  repository: ollama/ollama

  # -- Docker pull policy
  pullPolicy: IfNotPresent

  # -- Docker image tag, overrides the image tag whose default is the chart appVersion.
  tag: ""

# -- Docker registry secret names as an array
imagePullSecrets: []

# -- String to partially override template (will maintain the release name)
nameOverride: ""

# -- String to fully override template
fullnameOverride: ""

# Ollama parameters
ollama:
  gpu:
    # -- Enable GPU integration
    enabled: false ### WE KEEP THIS AS false

    # -- GPU type: 'nvidia' or 'amd'
    # If 'ollama.gpu.enabled', default value is nvidia
    # If set to 'amd', this will add the 'rocm' suffix to the image tag if 'image.tag' is not overridden
    # This is because AMD and CPU/CUDA are different images
    type: 'nvidia'

    # -- Specify the number of GPU
    number: 1

  # -- List of models to pull at container startup
  # The more you add, the longer the container will take to start if models are not present
  # models:
  #   - llama2
  #   - mistral
  models: ### ADDED SOME MODELS
    - codellama
    - gemma
    - llama2

  # -- Add insecure flag for pulling at container startup
  insecure: false

# Service account
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
serviceAccount:
  # -- Specifies whether a service account should be created
  create: true
  # -- Automatically mount a ServiceAccount's API credentials?
  automount: true
  # -- Annotations to add to the service account
  annotations: {}
  # -- The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

# -- Map of annotations to add to the pods
podAnnotations: {}

# -- Map of labels to add to the pods
podLabels: {}

# -- Pod Security Context
podSecurityContext: {}
  # fsGroup: 2000

# -- Container Security Context
securityContext: {}
  # capabilities:
  #   drop:
  #     - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

# -- Specify runtime class
runtimeClassName: ""

# Configure Service
service:
  # -- Service type
  type: ClusterIP
  # -- Service port
  port: 11434

# Configure the ingress resource that allows you to access the Ollama installation.
ingress:
  # -- Enable ingress controller resource
  enabled: true ### CHANGED TO true

  # -- IngressClass that will be used to implement the Ingress (Kubernetes 1.18+)
  className: "nginx" ### CHANGED

  # -- Additional annotations for the Ingress resource.
  annotations: {}
    # kubernetes.io/ingress.class: traefik
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"

  # The list of hostnames to be covered with this ingress record.
  hosts:
    - host: ollama.local
      paths:
        - path: /
          pathType: Prefix

  # -- The tls configuration for hostnames to be covered with this ingress record.
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

# Configure resource requests and limits
# ref: http://kubernetes.io/docs/user-guide/compute-resources/
resources:
  # Pod requests
  requests: {}
    # -- Memory request
    # memory: 4096Mi
    # -- CPU request
    # cpu: 2000m
  # Pod limit
  limits: {}
    # -- Memory limit
    # memory: 8192Mi
    # -- CPU limit
    # cpu: 4000m

# Configure extra options for liveness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
livenessProbe:
  # -- Enable livenessProbe
  enabled: true
  # -- Request path for livenessProbe
  path: /
  # -- Initial delay seconds for livenessProbe
  initialDelaySeconds: 60
  # -- Period seconds for livenessProbe
  periodSeconds: 10
  # -- Timeout seconds for livenessProbe
  timeoutSeconds: 5
  # -- Failure threshold for livenessProbe
  failureThreshold: 6
  # -- Success threshold for livenessProbe
  successThreshold: 1

# Configure extra options for readiness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
readinessProbe:
  # -- Enable readinessProbe
  enabled: true
  # -- Request path for readinessProbe
  path: /
  # -- Initial delay seconds for readinessProbe
  initialDelaySeconds: 30
  # -- Period seconds for readinessProbe
  periodSeconds: 5
  # -- Timeout seconds for readinessProbe
  timeoutSeconds: 3
  # -- Failure threshold for readinessProbe
  failureThreshold: 6
  # -- Success threshold for readinessProbe
  successThreshold: 1

# Configure autoscaling
autoscaling:
  # -- Enable autoscaling
  enabled: false
  # -- Number of minimum replicas
  minReplicas: 1
  # -- Number of maximum replicas
  maxReplicas: 100
  # -- CPU usage to target replica
  targetCPUUtilizationPercentage: 80
  # -- targetMemoryUtilizationPercentage: 80

# -- Additional volumes on the output Deployment definition.
volumes: []
# -- - name: foo
#      secret:
#        secretName: mysecret
#        optional: false

# -- Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# -- - name: foo
#      mountPath: "/etc/foo"
#      readOnly: true

# -- Additional arguments on the output Deployment definition.
extraArgs: []

# -- Additional environments variables on the output Deployment definition.
extraEnv: []
#  - name: OLLAMA_DEBUG
#    value: "1"

# Enable persistence using Persistent Volume Claims
# ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
persistentVolume:
  # -- Enable persistence using PVC
  enabled: false

  # -- Ollama server data Persistent Volume access modes
  # Must match those of existing PV or dynamic provisioner
  # Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  accessModes:
    - ReadWriteOnce

  # -- Ollama server data Persistent Volume annotations
  annotations: {}

  # -- If you'd like to bring your own PVC for persisting Ollama state, pass the name of the
  # created + ready PVC here. If set, this Chart will not create the default PVC.
  # Requires server.persistentVolume.enabled: true
  existingClaim: ""

  # -- Ollama server data Persistent Volume size
  size: 30Gi

  # -- Ollama server data Persistent Volume Storage Class
  # If defined, storageClassName: <storageClass>
  # If set to "-", storageClassName: "", which disables dynamic provisioning
  # If undefined (the default) or set to null, no storageClassName spec is
  # set, choosing the default provisioner. (gp2 on AWS, standard on
  # GKE, AWS & OpenStack)
  storageClass: ""

  # -- Ollama server data Persistent Volume Binding Mode
  # If defined, volumeMode: <volumeMode>
  # If empty (the default) or set to null, no volumeBindingMode spec is
  # set, choosing the default mode.
  volumeMode: ""

  # -- Subdirectory of Ollama server data Persistent Volume to mount
  # Useful if the volume's root directory is not empty
  subPath: ""

# -- Node labels for pod assignment.
nodeSelector: {}

# -- Tolerations for pod assignment
tolerations: []

# -- Affinity for pod assignment
affinity: {}
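On a cluster that really has GPU nodes, the last three keys above (nodeSelector, tolerations, affinity) are where you would steer Ollama onto those nodes, matching the gpu=true:NoSchedule taint applied earlier (and the optional gpu=true label suggested with it). A possible sketch, not applied in this GPU-less test cluster:
nodeSelector:
  gpu: "true"

tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
An affinity block would achieve the same targeting; nodeSelector is simply the shortest form.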
To install, we use the values.yaml we defined above:
helm install ollama ollama-helm/ollama --namespace ollama --values values.yaml
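Because we asked the chart to pull three models at startup, the first start can take a while. You can follow it with plain kubectl (whether the pull output shows up in the container logs depends on how the chart invokes ollama pull, so treat the second command as a best-effort check):
kubectl get pods -n ollama -w
kubectl logs -n ollama deploy/ollama -f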
The ingress matches the ollama.local host and routes traffic to the Ollama service. To make this easy to test, let's add ollama.local to our /etc/hosts pointing at our own machine (the append has to run as root, hence sudo tee):
echo "127.0.0.1 ollama.local" | sudo tee -a /etc/hosts
Ollama is now running in the cluster:
kubectl get all -n ollama
NAME                          READY   STATUS    RESTARTS   AGE
pod/ollama-84f88484dc-4m8v2   1/1     Running   0          14m

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
service/ollama   ClusterIP   10.96.122.73   <none>        11434/TCP   14m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ollama   1/1     1            1           14m

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/ollama-84f88484dc   1         1         1       14m
kubectl get ingress -n ollama
NAME     CLASS   HOSTS          ADDRESS     PORTS   AGE
ollama   nginx   ollama.local   localhost   80      15m
curl http://ollama.local
Ollama is running
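Besides the health check on the root path, Ollama's HTTP API is reachable through the same ingress. For example, once a model has been pulled you can send a one-off prompt to the standard /api/generate endpoint (any of the models listed in values.yaml works as the model name):
curl http://ollama.local/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'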
It is also possible to use the ollama CLI pointing at this instance:
❯ export OLLAMA_HOST=http://ollama.local
❯ ollama list
NAME               ID             SIZE     MODIFIED
codellama:latest   8fdf8f752f6e   3.8 GB   25 minutes ago
gemma:latest       a72c7f4d0a15   5.0 GB   23 minutes ago
llama2:latest      78e26419b446   3.8 GB   21 minutes ago
❯ ollama run codellama
>>> Voce fala português?
Eu posso falar um pouco de português, mas minha proficiência varia. Pode ser que você tenha algumas dificuldades para entender ou me perguntar
coisas em português, pois sou uma ferramenta de tradução e não tenho acesso à Internet ou a informações mais atualizadas sobre o idioma. No
entanto, estou disponível para traduzir mensagens em português e ajudar como possível.
>>>
❯ ollama run llama2
>>> voce fala portugues?
Sim, eu falo português! Posso ajudá-lo com qualquer pergunta ou informação em português?
>>> Send a message (/? for help)
Now let's install open-webui.
kubectl create namespace open-webui
namespace/open-webui created