Ollama on K8s
Once we have a Kubernetes cluster, the nodes that will run the Ollama pods should ideally have a GPU. Using Taints, Tolerations and Affinity we can steer Ollama onto those specific nodes.
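To illustrate the idea, this is roughly what those scheduling constraints look like in a pod spec. It is only a generic sketch: the gpu key matches the taint we will apply below and is not yet the Helm chart configuration.
tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu
              operator: In
              values:
                - "true"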
To test this installation we will create a kind cluster with no GPU, but we will pretend that one of the worker nodes has one. In this cluster we will deploy an nginx ingress controller and map ports 80 and 443 of the host; that pod will run on the control-plane node.
kind create cluster --config - << EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
name: kind-cluster-ia
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
EOF
kubectl get nodes
NAME                            STATUS   ROLES           AGE   VERSION
kind-cluster-ia-control-plane   Ready    control-plane   13m   v1.29.2
kind-cluster-ia-worker          Ready    <none>          12m   v1.29.2
kind-cluster-ia-worker2         Ready    <none>          12m   v1.29.2
kubectl taint nodes kind-cluster-ia-worker2 gpu=true:NoSchedule
node/kind-cluster-ia-worker2 tainted
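The taint only repels pods that do not tolerate it. To also attract the Ollama pod to that node you would normally add a matching label (the gpu=true name is just the convention used in this post, not something the chart requires):
kubectl label nodes kind-cluster-ia-worker2 gpu=true
Later, in the chart values, a toleration plus a nodeSelector (or affinity) referencing this label would pin Ollama to the GPU worker; see the sketch after the values.yaml below.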
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=90s
We will also create an Ingress so that Ollama can be reached by a hostname.
To install Ollama we will use its Helm chart.
kubectl create ns ollama
namespace/ollama created
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm show values ollama-helm/ollama > values.yaml
Below is the default values.yaml; I have already marked the places we are going to change.
# Default values for ollama-helm.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Number of replicas
replicaCount: 1

# Docker image
image:
  # -- Docker image registry
  repository: ollama/ollama

  # -- Docker pull policy
  pullPolicy: IfNotPresent

  # -- Docker image tag, overrides the image tag whose default is the chart appVersion.
  tag: ""

# -- Docker registry secret names as an array
imagePullSecrets: []

# -- String to partially override template (will maintain the release name)
nameOverride: ""

# -- String to fully override template
fullnameOverride: ""

# Ollama parameters
ollama:
  gpu:
    # -- Enable GPU integration
    enabled: false ### WE KEEP THIS AS false

    # -- GPU type: 'nvidia' or 'amd'
    # If 'ollama.gpu.enabled', default value is nvidia
    # If set to 'amd', this will add the 'rocm' suffix to the image tag if 'image.tag' is not overridden
    # This is because AMD and CPU/CUDA are different images
    type: 'nvidia'

    # -- Specify the number of GPU
    number: 1

  # -- List of models to pull at container startup
  # The more you add, the longer the container will take to start if models are not present
  # models:
  #   - llama2
  #   - mistral
  models: ### ADDED SOME MODELS
    - codellama
    - gemma
    - llama2

  # -- Add insecure flag for pulling at container startup
  insecure: false

# Service account
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
serviceAccount:
  # -- Specifies whether a service account should be created
  create: true
  # -- Automatically mount a ServiceAccount's API credentials?
  automount: true
  # -- Annotations to add to the service account
  annotations: {}
  # -- The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

# -- Map of annotations to add to the pods
podAnnotations: {}

# -- Map of labels to add to the pods
podLabels: {}

# -- Pod Security Context
podSecurityContext: {}
  # fsGroup: 2000

# -- Container Security Context
securityContext: {}
  # capabilities:
  #   drop:
  #     - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

# -- Specify runtime class
runtimeClassName: ""

# Configure Service
service:
  # -- Service type
  type: ClusterIP
  # -- Service port
  port: 11434

# Configure the ingress resource that allows you to access the Ollama installation.
ingress:
  # -- Enable ingress controller resource
  enabled: true ### CHANGED TO true

  # -- IngressClass that will be used to implement the Ingress (Kubernetes 1.18+)
  className: "nginx" ### CHANGED

  # -- Additional annotations for the Ingress resource.
  annotations: {}
    # kubernetes.io/ingress.class: traefik
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"

  # The list of hostnames to be covered with this ingress record.
  hosts:
    - host: ollama.local
      paths:
        - path: /
          pathType: Prefix

  # -- The tls configuration for hostnames to be covered with this ingress record.
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

# Configure resource requests and limits
# ref: http://kubernetes.io/docs/user-guide/compute-resources/
resources:
  # Pod requests
  requests: {}
    # -- Memory request
    # memory: 4096Mi
    # -- CPU request
    # cpu: 2000m
  # Pod limit
  limits: {}
    # -- Memory limit
    # memory: 8192Mi
    # -- CPU limit
    # cpu: 4000m

# Configure extra options for liveness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
livenessProbe:
  # -- Enable livenessProbe
  enabled: true
  # -- Request path for livenessProbe
  path: /
  # -- Initial delay seconds for livenessProbe
  initialDelaySeconds: 60
  # -- Period seconds for livenessProbe
  periodSeconds: 10
  # -- Timeout seconds for livenessProbe
  timeoutSeconds: 5
  # -- Failure threshold for livenessProbe
  failureThreshold: 6
  # -- Success threshold for livenessProbe
  successThreshold: 1

# Configure extra options for readiness probe
# ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
readinessProbe:
  # -- Enable readinessProbe
  enabled: true
  # -- Request path for readinessProbe
  path: /
  # -- Initial delay seconds for readinessProbe
  initialDelaySeconds: 30
  # -- Period seconds for readinessProbe
  periodSeconds: 5
  # -- Timeout seconds for readinessProbe
  timeoutSeconds: 3
  # -- Failure threshold for readinessProbe
  failureThreshold: 6
  # -- Success threshold for readinessProbe
  successThreshold: 1

# Configure autoscaling
autoscaling:
  # -- Enable autoscaling
  enabled: false
  # -- Number of minimum replicas
  minReplicas: 1
  # -- Number of maximum replicas
  maxReplicas: 100
  # -- CPU usage to target replica
  targetCPUUtilizationPercentage: 80
  # -- targetMemoryUtilizationPercentage: 80

# -- Additional volumes on the output Deployment definition.
volumes: []
# -- - name: foo
#      secret:
#        secretName: mysecret
#        optional: false

# -- Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# -- - name: foo
#      mountPath: "/etc/foo"
#      readOnly: true

# -- Additional arguments on the output Deployment definition.
extraArgs: []

# -- Additional environments variables on the output Deployment definition.
extraEnv: []
#  - name: OLLAMA_DEBUG
#    value: "1"

# Enable persistence using Persistent Volume Claims
# ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
persistentVolume:
  # -- Enable persistence using PVC
  enabled: false

  # -- Ollama server data Persistent Volume access modes
  # Must match those of existing PV or dynamic provisioner
  # Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  accessModes:
    - ReadWriteOnce

  # -- Ollama server data Persistent Volume annotations
  annotations: {}

  # -- If you'd like to bring your own PVC for persisting Ollama state, pass the name of the
  # created + ready PVC here. If set, this Chart will not create the default PVC.
  # Requires server.persistentVolume.enabled: true
  existingClaim: ""

  # -- Ollama server data Persistent Volume size
  size: 30Gi

  # -- Ollama server data Persistent Volume Storage Class
  # If defined, storageClassName: <storageClass>
  # If set to "-", storageClassName: "", which disables dynamic provisioning
  # If undefined (the default) or set to null, no storageClassName spec is
  # set, choosing the default provisioner. (gp2 on AWS, standard on
  # GKE, AWS & OpenStack)
  storageClass: ""

  # -- Ollama server data Persistent Volume Binding Mode
  # If defined, volumeMode: <volumeMode>
  # If empty (the default) or set to null, no volumeBindingMode spec is
  # set, choosing the default mode.
  volumeMode: ""

  # -- Subdirectory of Ollama server data Persistent Volume to mount
  # Useful if the volume's root directory is not empty
  subPath: ""

# -- Node labels for pod assignment.
nodeSelector: {}

# -- Tolerations for pod assignment
tolerations: []

# -- Affinity for pod assignment
affinity: {}
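On a cluster that really has GPU nodes, the last three keys above (nodeSelector, tolerations, affinity) are where you would steer Ollama onto those nodes, matching the gpu=true:NoSchedule taint applied earlier (and the optional gpu=true label suggested with it). A possible sketch, not applied in this GPU-less test cluster:
nodeSelector:
  gpu: "true"

tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
An affinity block would achieve the same targeting; nodeSelector is simply the shortest form.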
To install, we use the values.yaml we defined above:
helm install ollama ollama-helm/ollama --namespace ollama --values values.yaml
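Because we asked the chart to pull three models at startup, the first start can take a while. You can follow it with plain kubectl (whether the pull output shows up in the container logs depends on how the chart invokes ollama pull, so treat the second command as a best-effort check):
kubectl get pods -n ollama -w
kubectl logs -n ollama deploy/ollama -f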
The ingress matches the ollama.local host and routes traffic to the Ollama service. To make this easy to test, let's add ollama.local to our /etc/hosts pointing at our own machine (the append has to run as root, hence sudo tee):
echo "127.0.0.1 ollama.local" | sudo tee -a /etc/hosts
Ollama is now running in the cluster:
kubectl get all -n ollama
NAME                          READY   STATUS    RESTARTS   AGE
pod/ollama-84f88484dc-4m8v2   1/1     Running   0          14m

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
service/ollama   ClusterIP   10.96.122.73   <none>        11434/TCP   14m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ollama   1/1     1            1           14m

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/ollama-84f88484dc   1         1         1       14m
kubectl get ingress -n ollama
NAME     CLASS   HOSTS          ADDRESS     PORTS   AGE
ollama   nginx   ollama.local   localhost   80      15m
curl http://ollama.local
Ollama is running
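Besides the health check on the root path, Ollama's HTTP API is reachable through the same ingress. For example, once a model has been pulled you can send a one-off prompt to the standard /api/generate endpoint (any of the models listed in values.yaml works as the model name):
curl http://ollama.local/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'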
It is also possible to use the ollama CLI pointing at this instance:
❯ export OLLAMA_HOST=http://ollama.local
❯ ollama list
NAME               ID             SIZE     MODIFIED
codellama:latest   8fdf8f752f6e   3.8 GB   25 minutes ago
gemma:latest       a72c7f4d0a15   5.0 GB   23 minutes ago
llama2:latest      78e26419b446   3.8 GB   21 minutes ago
❯ ollama run codellama
>>> Voce fala português?
Eu posso falar um pouco de português, mas minha proficiência varia. Pode ser que você tenha algumas dificuldades para entender ou me perguntar
coisas em português, pois sou uma ferramenta de tradução e não tenho acesso à Internet ou a informações mais atualizadas sobre o idioma. No
entanto, estou disponível para traduzir mensagens em português e ajudar como possível.
>>>
❯ ollama run llama2
>>> voce fala portugues?
Sim, eu falo português! Posso ajudá-lo com qualquer pergunta ou informação em português?
>>> Send a message (/? for help)
Now let's install open-webui.
kubectl create namespace open-webui
namespace/open-webui created