
Node Failure Verification Sequence

When a node fails, there are a few situations to consider. Work through the following checks in order.

  1. Check the status of the nodes with the kubectl get nodes command.

  2. Look for events on nodes that are NotReady by running kubectl describe against the node, and inspect the Conditions section.

    kubectl describe nodes kind-cluster-worker
    Conditions:
    Type Status LastHeartbeatTime LastTransitionTime Reason Message
    ---- ------ ----------------- ------------------ ------ -------
    # If MemoryPressure is True, the node lacks memory to run its pods; pods are likely crashing or being evicted.
    MemoryPressure False Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:46 -0300 KubeletHasSufficientMemory kubelet has sufficient memory available
    # If DiskPressure is True, the node is running out of disk capacity.
    DiskPressure False Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:46 -0300 KubeletHasNoDiskPressure kubelet has no disk pressure
    # PIDPressure is set to True when too many processes are running on the node and PIDs are running out.
    PIDPressure False Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:46 -0300 KubeletHasSufficientPID kubelet has sufficient PID available
    Ready True Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:49 -0300 KubeletReady kubelet is posting ready status

    If any of these pressure conditions is True, you know some resource is exhausted. If a condition shows Unknown, the node most likely stopped reporting (for example, the kubelet crashed or the node lost connectivity) and the control plane lost track of its state.
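    Steps 1 and 2 can be combined into a single filter. The sketch below assumes kubectl access to the cluster; the jsonpath expression prints each node name next to its Ready condition status, and awk keeps only the nodes whose status is not True:

```shell
# Print "<node>\t<Ready status>" per node, then keep nodes whose Ready
# condition is anything other than True (i.e. False or Unknown).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk -F'\t' '$2 != "True" { print $1, "Ready="$2 }'
```

    Any node printed by this pipeline is the one to target with kubectl describe in the next step.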

  3. Check the processes and resource consumption on the node with the top and df -h commands.

    top - 12:11:32 up 22:41,  0 user,  load average: 3.25, 2.79, 2.56
    Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 8.0 us, 0.5 sy, 0.0 ni, 91.2 id, 0.1 wa, 0.0 hi, 0.2 si, 0.0 st
    MiB Mem : 64001.3 total, 41471.9 free, 11956.6 used, 13210.8 buff/cache
    MiB Swap: 1952.0 total, 1952.0 free, 0.0 used. 52044.7 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    116 root 20 0 2470452 66056 36480 S 0.7 0.1 13:23.44 containerd
    223 root 20 0 2999176 86524 53376 S 0.3 0.1 8:36.64 kubelet
    1 root 20 0 20392 11648 8704 S 0.0 0.0 0:01.13 systemd
    97 root 20 0 24792 11008 10240 S 0.0 0.0 0:00.08 systemd-journal
    271 root 20 0 722648 13824 9856 S 0.0 0.0 0:07.79 containerd-shim
    287 root 20 0 722648 13852 9856 S 0.0 0.0 0:07.95 containerd-shim
    317 65535 20 0 996 512 512 S 0.0 0.0 0:00.00 pause
    324 65535 20 0 996 512 512 S 0.0 0.0 0:00.01 pause
    358 root 20 0 1284848 49360 36608 S 0.0 0.1 0:07.92 kube-proxy
    446 root 20 0 743928 27448 19072 S 0.0 0.0 0:15.96 kindnetd
    14316 root 20 0 722392 13184 9600 S 0.0 0.0 0:00.01 containerd-shim
    14336 65535 20 0 996 512 512 S 0.0 0.0 0:00.00 pause
    14373 root 20 0 2484 1280 1280 S 0.0 0.0 0:00.01 sleep
    14400 root 20 0 2576 1408 1408 S 0.0 0.0 0:00.00 sh
    14406 root 20 0 2576 128 128 S 0.0 0.0 0:00.00 sh
    14407 root 20 0 4192 3328 2816 S 0.0 0.0 0:00.00 bash
    14412 root 20 0 8568 4736 2688 R 0.0 0.0 0:00.00 top

    root@kind-cluster-worker:/# df -h
    Filesystem Size Used Avail Use% Mounted on
    overlay 1.8T 571G 1.2T 33% /
    tmpfs 64M 0 64M 0% /dev
    shm 64M 0 64M 0% /dev/shm
    /dev/mapper/vgubuntu-root 1.8T 571G 1.2T 33% /var
    tmpfs 32G 8.6M 32G 1% /run
    tmpfs 32G 0 32G 0% /tmp
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 63G 12K 63G 1% /var/lib/kubelet/pods/5a2bf15d-36fa-4c73-94a3-b491f4774e72/volumes/kubernetes.io~projected/kube-api-access-tpjjt
    tmpfs 50M 12K 50M 1% /var/lib/kubelet/pods/92c3fe67-ccb9-437c-8d18-c16008dfa93b/volumes/kubernetes.io~projected/kube-api-access-cxt56
    shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/b6087483482422390d1fad0ec6726dfd98aba0d990b3f7f5a6d8224c15c4a4a3/shm
    shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/c13daa616a7ee7a7144b2acf39476a6e36fd454c1ebf345c26d3834703d11756/shm
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/b6087483482422390d1fad0ec6726dfd98aba0d990b3f7f5a6d8224c15c4a4a3/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/c13daa616a7ee7a7144b2acf39476a6e36fd454c1ebf345c26d3834703d11756/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/0e6c2021f2b349bb0a16e5e5ecedb44a364566413ddfac25d09dd0538bf1de3b/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/970340bd3152b21a503b9e8fbc0b6af4948bed0bc9581f03f7140cbad18b8015/rootfs
    tmpfs 63G 12K 63G 1% /var/lib/kubelet/pods/1af287b4-b519-4956-995a-5cf7403e0699/volumes/kubernetes.io~projected/kube-api-access-h9vz9
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/36a2786c5693be823e1cd178341a794744583fe1d67548132f4a364933d54967/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/27c42f370c029cf965538366fd6f310cb9408ef6b305442ce21f32ec8947e2a6/rootfs
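    To spot disk pressure quickly in output like the above, filter df for filesystems above a usage threshold. A minimal sketch; the 85% cutoff here is illustrative, not the kubelet's actual configured eviction threshold:

```shell
# List mount points whose usage exceeds 85% (strip the % sign before comparing).
df -h | awk 'NR > 1 { use = $5; gsub("%", "", use); if (use + 0 > 85) print $6, $5 }'
```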
  4. Check the kubelet's status and logs with systemctl status kubelet.service and journalctl -xeu kubelet.

    root@kind-cluster-worker:/# systemctl status kubelet
    ● kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
    └─10-kubeadm.conf, 11-kind.conf
    Active: active (running) since Sun 2024-02-25 13:30:16 UTC; 22h ago
    Docs: http://kubernetes.io/docs/
    Process: 214 ExecStartPre=/bin/sh -euc if [ -f /sys/fs/cgroup/cgroup.controllers ]; then /kind/bin/create-kubelet-cgroup-v2.sh; fi (code=exited, status=0/SUCCESS)
    Process: 222 ExecStartPre=/bin/sh -euc if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi (code=exited, status=0/SUCCESS)
    Main PID: 223 (kubelet)
    Tasks: 24 (limit: 11496)
    Memory: 35.9M
    CPU: 8min 38.699s
    CGroup: /kubelet.slice/kubelet.service
    └─223 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.4 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.9 --provider-id=kind://docker/kind-cluster/kind-cluster-worker --runtime-cgroups=/system.slice/containerd.service

    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.211689 223 topology_manager.go:215] "Topology Admit Handler" podUID="5a2bf15d-36fa-4c73-94a3-b491f4774e72" podNamespace="kube-system" podName="kube-proxy-9zhh2"
    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.307285 223 desired_state_of_world_populator.go:159] "Finished populating initial desired state of world"
    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.340272 223 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"xtables-lock\" (UniqueName: \"kubernetes.io/host-path/5a2bf15d-36fa-4c73-94a3-b491f4774e72-xtables-lock\") pod \"kube-proxy-9zhh2\" (UID: \"5a2bf15d-36fa-4c73-94a3-b491f4774e72\") " pod="kube-system/kube-proxy-9zhh2"
    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.340288 223 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"lib-modules\" (UniqueName: \"kubernetes.io/host-path/5a2bf15d-36fa-4c73-94a3-b491f4774e72-lib-modules\") pod \"kube-proxy-9zhh2\" (UID: \"5a2bf15d-36fa-4c73-94a3-b491f4774e72\") " pod="kube-system/kube-proxy-9zhh2"
    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.340304 223 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"cni-cfg\" (UniqueName: \"kubernetes.io/host-path/92c3fe67-ccb9-437c-8d18-c16008dfa93b-cni-cfg\") pod \"kindnet-wnzds\" (UID: \"92c3fe67-ccb9-437c-8d18-c16008dfa93b\") " pod="kube-system/kindnet-wnzds"
    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.340316 223 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"xtables-lock\" (UniqueName: \"kubernetes.io/host-path/92c3fe67-ccb9-437c-8d18-c16008dfa93b-xtables-lock\") pod \"kindnet-wnzds\" (UID: \"92c3fe67-ccb9-437c-8d18-c16008dfa93b\") " pod="kube-system/kindnet-wnzds"
    Feb 25 13:30:19 kind-cluster-worker kubelet[223]: I0225 13:30:19.340476 223 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"lib-modules\" (UniqueName: \"kubernetes.io/host-path/92c3fe67-ccb9-437c-8d18-c16008dfa93b-lib-modules\") pod \"kindnet-wnzds\" (UID: \"92c3fe67-ccb9-437c-8d18-c16008dfa93b\") " pod="kube-system/kindnet-wnzds"
    Feb 26 12:11:12 kind-cluster-worker kubelet[223]: I0226 12:11:12.480201 223 topology_manager.go:215] "Topology Admit Handler" podUID="1af287b4-b519-4956-995a-5cf7403e0699" podNamespace="kube-system" podName="node-shell-2a728d18-d3d7-4c59-ad22-3a763f34b1c9"
    Feb 26 12:11:12 kind-cluster-worker kubelet[223]: I0226 12:11:12.567123 223 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-h9vz9\" (UniqueName: \"kubernetes.io/projected/1af287b4-b519-4956-995a-5cf7403e0699-kube-api-access-h9vz9\") pod \"node-shell-2a728d18-d3d7-4c59-ad22-3a763f34b1c9\" (UID: \"1af287b4-b519-4956-995a-5cf7403e0699\") " pod="kube-system/node-shell-2a728d18-d3d7-4c59-ad22-3a763f34b1c9"
    Feb 26 12:11:16 kind-cluster-worker kubelet[223]: I0226 12:11:16.957965 223 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/node-shell-2a728d18-d3d7-4c59-ad22-3a763f34b1c9" podStartSLOduration=1.652541497 podStartE2EDuration="4.957928607s" podCreationTimestamp="2024-02-26 12:11:12 +0000 UTC" firstStartedPulling="2024-02-26 12:11:12.853842258 +0000 UTC m=+81656.677152456" lastFinishedPulling="2024-02-26 12:11:16.159229367 +0000 UTC m=+81659.982539566" observedRunningTime="2024-02-26 12:11:16.957817629 +0000 UTC m=+81660.781127839" watchObservedRunningTime="2024-02-26 12:11:16.957928607 +0000 UTC m=+81660.781238814"

    # If the output above is not enough to diagnose the problem, inspect the full logs in detail
    root@kind-cluster-worker:/# journalctl -u kubelet
    # (similar journalctl output omitted for brevity)
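    When the full journal is too noisy, it helps to isolate warning and error lines. A sketch, relying on the kubelet's klog format, in which each message starts with a severity letter followed by the date:

```shell
# klog severity prefixes: I=info, W=warning, E=error, F=fatal.
# Keep only warning/error/fatal kubelet lines, showing the 20 most recent.
journalctl -u kubelet --no-pager | grep -E 'kubelet\[[0-9]+\]: [EWF][0-9]{4}' | tail -n 20
```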
  5. Also check the certificates and verify that they have not expired.

    root@kind-cluster-worker:/# openssl x509 -in /var/lib/kubelet/pki/kubelet.crt
    -----BEGIN CERTIFICATE-----
    MIIDTTCCAjWgAwIBAgIIdyIAO9Z5gVAwDQYJKoZIhvcNAQELBQAwLDEqMCgGA1UE
    Awwha2luZC1jbHVzdGVyLXdvcmtlci1jYUAxNzA3NDMzMzY1MB4XDTI0MDIwODIy
    MDI0NVoXDTI1MDIwNzIyMDI0NVowKTEnMCUGA1UEAwwea2luZC1jbHVzdGVyLXdv
    cmtlckAxNzA3NDMzMzY1MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
    nXOMHMXoQiRePWMKnI6NN0VI7lhy6Te2Ia2y+QZ+qeDfMM9mi62kwbHcnCnFsptJ
    8CBqv1mYpzNJaCDDiOrtB9Fv6gs6k0xARF+Tdw+CC2Mo7UJEVh4S5A1BnYTJUctm
    tWA9jzUqbh3cxaubmN2AmzlmTk2+A6FZX+fR/bdNs9Gh+zrrkhF2irfs8Sxbp68f
    KMB6HsgZOSdt014Dz9J5xB37Hh0R3KS0FYLcJ4TVaPGJrCypL26GezfZWjCRFm7q
    wB/t7vbSNV/gFNt533Vdr6AxF8IZEVzdB2fxJ6/ofNDbsioFQ1iDhv4wQECu6jCH
    6NkbzCZrPDF4KJrLXGjkNwIDAQABo3YwdDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0l
    BAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAfBgNVHSMEGDAWgBT53Jnk+X3i
    R7lLud6Q3HnbydB0azAeBgNVHREEFzAVghNraW5kLWNsdXN0ZXItd29ya2VyMA0G
    CSqGSIb3DQEBCwUAA4IBAQAnNioBu6agqKH/kDgjGfut865x8ufWw2wlmyunx5CS
    njAdP/csErsSrVXlzlYhdNaXHvCYZcwXCjUpL8wNYHJqT5aRhuMr4w6ZYACWY50o
    jyepzZFA8BNxA7FH5SnQbr+JZP1y+bXlF3JbfYPNAEHZBRSuayw3WdU9iSuGghnG
    pQA0OjOjZ7MwYXF3NKPuS/rPi6NERjykT8VYW6G2kIJDgPf4EaJ5lEKM3ifxjW+n
    vu7XpnjG+Ff48Gq47BBwxhE9p/YTFLzyGZnbArx+u6V2yui3Q3agi7f0oJT1fqkp
    RfbxkFBrCCuiVbswcaf4eBFwyMNqyg9mhn8r4Wo4N2z8
    -----END CERTIFICATE-----
    root@kind-cluster-worker:/# openssl x509 -in /var/lib/kubelet/pki/kubelet.crt --text
    Certificate:
    Data:
    Version: 3 (0x2)
    Serial Number: 8584424096722944336 (0x7722003bd6798150)
    Signature Algorithm: sha256WithRSAEncryption
    Issuer: CN = kind-cluster-worker-ca@1707433365
    Validity
    Not Before: Feb 8 22:02:45 2024 GMT
    Not After : Feb 7 22:02:45 2025 GMT # OK, not yet expired
    Subject: CN = kind-cluster-worker@1707433365
    Subject Public Key Info:
    Public Key Algorithm: rsaEncryption
    Public-Key: (2048 bit)
    Modulus:
    00:9d:73:8c:1c:c5:e8:42:24:5e:3d:63:0a:9c:8e:
    8d:37:45:48:ee:58:72:e9:37:b6:21:ad:b2:f9:06:
    7e:a9:e0:df:30:cf:66:8b:ad:a4:c1:b1:dc:9c:29:
    c5:b2:9b:49:f0:20:6a:bf:59:98:a7:33:49:68:20:
    c3:88:ea:ed:07:d1:6f:ea:0b:3a:93:4c:40:44:5f:
    93:77:0f:82:0b:63:28:ed:42:44:56:1e:12:e4:0d:
    41:9d:84:c9:51:cb:66:b5:60:3d:8f:35:2a:6e:1d:
    dc:c5:ab:9b:98:dd:80:9b:39:66:4e:4d:be:03:a1:
    59:5f:e7:d1:fd:b7:4d:b3:d1:a1:fb:3a:eb:92:11:
    76:8a:b7:ec:f1:2c:5b:a7:af:1f:28:c0:7a:1e:c8:
    19:39:27:6d:d3:5e:03:cf:d2:79:c4:1d:fb:1e:1d:
    11:dc:a4:b4:15:82:dc:27:84:d5:68:f1:89:ac:2c:
    a9:2f:6e:86:7b:37:d9:5a:30:91:16:6e:ea:c0:1f:
    ed:ee:f6:d2:35:5f:e0:14:db:79:df:75:5d:af:a0:
    31:17:c2:19:11:5c:dd:07:67:f1:27:af:e8:7c:d0:
    db:b2:2a:05:43:58:83:86:fe:30:40:40:ae:ea:30:
    87:e8:d9:1b:cc:26:6b:3c:31:78:28:9a:cb:5c:68:
    e4:37
    Exponent: 65537 (0x10001)
    X509v3 extensions:
    X509v3 Key Usage: critical
    Digital Signature, Key Encipherment
    X509v3 Extended Key Usage:
    TLS Web Server Authentication
    X509v3 Basic Constraints: critical
    CA:FALSE
    X509v3 Authority Key Identifier:
    F9:DC:99:E4:F9:7D:E2:47:B9:4B:B9:DE:90:DC:79:DB:C9:D0:74:6B
    X509v3 Subject Alternative Name:
    DNS:kind-cluster-worker
    Signature Algorithm: sha256WithRSAEncryption
    Signature Value:
    27:36:2a:01:bb:a6:a0:a8:a1:ff:90:38:23:19:fb:ad:f3:ae:
    71:f2:e7:d6:c3:6c:25:9b:2b:a7:c7:90:92:9e:30:1d:3f:f7:
    2c:12:bb:12:ad:55:e5:ce:56:21:74:d6:97:1e:f0:98:65:cc:
    17:0a:35:29:2f:cc:0d:60:72:6a:4f:96:91:86:e3:2b:e3:0e:
    99:60:00:96:63:9d:28:8f:27:a9:cd:91:40:f0:13:71:03:b1:
    47:e5:29:d0:6e:bf:89:64:fd:72:f9:b5:e5:17:72:5b:7d:83:
    cd:00:41:d9:05:14:ae:6b:2c:37:59:d5:3d:89:2b:86:82:19:
    c6:a5:00:34:3a:33:a3:67:b3:30:61:71:77:34:a3:ee:4b:fa:
    cf:8b:a3:44:46:3c:a4:4f:c5:58:5b:a1:b6:90:82:43:80:f7:
    f8:11:a2:79:94:42:8c:de:27:f1:8d:6f:a7:be:ee:d7:a6:78:
    c6:f8:57:f8:f0:6a:b8:ec:10:70:c6:11:3d:a7:f6:13:14:bc:
    f2:19:99:db:02:bc:7e:bb:a5:76:ca:e8:b7:43:76:a0:8b:b7:
    f4:a0:94:f5:7e:a9:29:45:f6:f1:90:50:6b:08:2b:a2:55:bb:
    30:71:a7:f8:78:11:70:c8:c3:6a:ca:0f:66:86:7f:2b:e1:6a:
    38:37:6c:fc
    -----BEGIN CERTIFICATE-----
    MIIDTTCCAjWgAwIBAgIIdyIAO9Z5gVAwDQYJKoZIhvcNAQELBQAwLDEqMCgGA1UE
    Awwha2luZC1jbHVzdGVyLXdvcmtlci1jYUAxNzA3NDMzMzY1MB4XDTI0MDIwODIy
    MDI0NVoXDTI1MDIwNzIyMDI0NVowKTEnMCUGA1UEAwwea2luZC1jbHVzdGVyLXdv
    cmtlckAxNzA3NDMzMzY1MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
    nXOMHMXoQiRePWMKnI6NN0VI7lhy6Te2Ia2y+QZ+qeDfMM9mi62kwbHcnCnFsptJ
    8CBqv1mYpzNJaCDDiOrtB9Fv6gs6k0xARF+Tdw+CC2Mo7UJEVh4S5A1BnYTJUctm
    tWA9jzUqbh3cxaubmN2AmzlmTk2+A6FZX+fR/bdNs9Gh+zrrkhF2irfs8Sxbp68f
    KMB6HsgZOSdt014Dz9J5xB37Hh0R3KS0FYLcJ4TVaPGJrCypL26GezfZWjCRFm7q
    wB/t7vbSNV/gFNt533Vdr6AxF8IZEVzdB2fxJ6/ofNDbsioFQ1iDhv4wQECu6jCH
    6NkbzCZrPDF4KJrLXGjkNwIDAQABo3YwdDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0l
    BAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAfBgNVHSMEGDAWgBT53Jnk+X3i
    R7lLud6Q3HnbydB0azAeBgNVHREEFzAVghNraW5kLWNsdXN0ZXItd29ya2VyMA0G
    CSqGSIb3DQEBCwUAA4IBAQAnNioBu6agqKH/kDgjGfut865x8ufWw2wlmyunx5CS
    njAdP/csErsSrVXlzlYhdNaXHvCYZcwXCjUpL8wNYHJqT5aRhuMr4w6ZYACWY50o
    jyepzZFA8BNxA7FH5SnQbr+JZP1y+bXlF3JbfYPNAEHZBRSuayw3WdU9iSuGghnG
    pQA0OjOjZ7MwYXF3NKPuS/rPi6NERjykT8VYW6G2kIJDgPf4EaJ5lEKM3ifxjW+n
    vu7XpnjG+Ff48Gq47BBwxhE9p/YTFLzyGZnbArx+u6V2yui3Q3agi7f0oJT1fqkp
    RfbxkFBrCCuiVbswcaf4eBFwyMNqyg9mhn8r4Wo4N2z8
    -----END CERTIFICATE-----
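    Rather than reading through the raw PEM and the full text dump, openssl can report just the expiry date, and -checkend turns the check into an exit status usable in scripts. A sketch, assuming the kubelet certificate path shown above:

```shell
CERT=/var/lib/kubelet/pki/kubelet.crt
# Print only the expiry date.
openssl x509 -in "$CERT" -noout -enddate
# Exit 0 if the certificate is still valid 7 days (604800 s) from now, 1 otherwise.
if openssl x509 -in "$CERT" -noout -checkend 604800; then
  echo "certificate OK"
else
  echo "certificate expires within 7 days (or has already expired)"
fi
```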

    Also check the endpoints the kubelet is pointing at to reach the kube-apiserver.
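    One quick way to do that is to read the server address out of the kubelet's kubeconfig and probe the API server's health endpoint. A sketch; /etc/kubernetes/kubelet.conf is the path visible in the systemctl output above, and /healthz typically answers even unauthenticated requests:

```shell
# Extract the API server URL the kubelet is configured to talk to.
APISERVER=$(awk '/server:/ { print $2 }' /etc/kubernetes/kubelet.conf)
echo "kubelet talks to: $APISERVER"
# -k skips TLS verification; acceptable for a quick health probe from the node.
curl -sk "$APISERVER/healthz"
```

    If the probe does not return ok, the problem is likely between the node and the control plane (network, firewall, or the API server itself) rather than on the node.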