
Node Failure Verification Sequence

When a node fails, work through the following checks in order.

  1. Check the status of the nodes with kubectl get nodes. A healthy node reports Ready in the STATUS column; a failed node usually shows NotReady.

  2. For each node that is NotReady, run kubectl describe on it and check the Conditions section, as well as the Events at the end of the output.

    kubectl describe nodes kind-cluster-worker
    Conditions:
    Type Status LastHeartbeatTime LastTransitionTime Reason Message
    ---- ------ ----------------- ------------------ ------ -------
    # If MemoryPressure is True, the node is low on memory and the kubelet may start evicting pods.
    MemoryPressure False Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:46 -0300 KubeletHasSufficientMemory kubelet has sufficient memory available
    # If DiskPressure is True, the node is low on disk space.
    DiskPressure False Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:46 -0300 KubeletHasNoDiskPressure kubelet has no disk pressure
    # If PIDPressure is True, the node is running out of process IDs (too many processes are running).
    PIDPressure False Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:46 -0300 KubeletHasSufficientPID kubelet has sufficient PID available
    # Ready should be True; False means the kubelet reports the node as unhealthy, Unknown means the kubelet stopped reporting.
    Ready True Mon, 26 Feb 2024 08:57:01 -0300 Thu, 08 Feb 2024 20:02:49 -0300 KubeletReady kubelet is posting ready status

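    If any of the pressure conditions is True, the node is short on that resource. If a condition is Unknown, the control plane has stopped receiving heartbeats from the kubelet (for example, the kubelet crashed, the node went down, or it lost connectivity to the API server), so the last reported status is stale.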

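  3. On the node itself, check running processes and resource usage with top (CPU and memory) and df -h (disk space). In a kind cluster each node is a Docker container, so docker exec -it kind-cluster-worker bash opens a shell on it.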

    top - 12:11:32 up 22:41,  0 user,  load average: 3.25, 2.79, 2.56
    Tasks: 17 total, 1 running, 16 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 8.0 us, 0.5 sy, 0.0 ni, 91.2 id, 0.1 wa, 0.0 hi, 0.2 si, 0.0 st
    MiB Mem : 64001.3 total, 41471.9 free, 11956.6 used, 13210.8 buff/cache
    MiB Swap: 1952.0 total, 1952.0 free, 0.0 used. 52044.7 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    116 root 20 0 2470452 66056 36480 S 0.7 0.1 13:23.44 containerd
    223 root 20 0 2999176 86524 53376 S 0.3 0.1 8:36.64 kubelet
    1 root 20 0 20392 11648 8704 S 0.0 0.0 0:01.13 systemd
    97 root 20 0 24792 11008 10240 S 0.0 0.0 0:00.08 systemd-journal
    271 root 20 0 722648 13824 9856 S 0.0 0.0 0:07.79 containerd-shim
    287 root 20 0 722648 13852 9856 S 0.0 0.0 0:07.95 containerd-shim
    317 65535 20 0 996 512 512 S 0.0 0.0 0:00.00 pause
    324 65535 20 0 996 512 512 S 0.0 0.0 0:00.01 pause
    358 root 20 0 1284848 49360 36608 S 0.0 0.1 0:07.92 kube-proxy
    446 root 20 0 743928 27448 19072 S 0.0 0.0 0:15.96 kindnetd
    14316 root 20 0 722392 13184 9600 S 0.0 0.0 0:00.01 containerd-shim
    14336 65535 20 0 996 512 512 S 0.0 0.0 0:00.00 pause
    14373 root 20 0 2484 1280 1280 S 0.0 0.0 0:00.01 sleep
    14400 root 20 0 2576 1408 1408 S 0.0 0.0 0:00.00 sh
    14406 root 20 0 2576 128 128 S 0.0 0.0 0:00.00 sh
    14407 root 20 0 4192 3328 2816 S 0.0 0.0 0:00.00 bash
    14412 root 20 0 8568 4736 2688 R 0.0 0.0 0:00.00 top

    root@kind-cluster-worker:/# df -h
    Filesystem Size Used Avail Use% Mounted on
    overlay 1.8T 571G 1.2T 33% /
    tmpfs 64M 0 64M 0% /dev
    shm 64M 0 64M 0% /dev/shm
    /dev/mapper/vgubuntu-root 1.8T 571G 1.2T 33% /var
    tmpfs 32G 8.6M 32G 1% /run
    tmpfs 32G 0 32G 0% /tmp
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 63G 12K 63G 1% /var/lib/kubelet/pods/5a2bf15d-36fa-4c73-94a3-b491f4774e72/volumes/kubernetes.io~projected/kube-api-access-tpjjt
    tmpfs 50M 12K 50M 1% /var/lib/kubelet/pods/92c3fe67-ccb9-437c-8d18-c16008dfa93b/volumes/kubernetes.io~projected/kube-api-access-cxt56
    shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/b6087483482422390d1fad0ec6726dfd98aba0d990b3f7f5a6d8224c15c4a4a3/shm
    shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/c13daa616a7ee7a7144b2acf39476a6e36fd454c1ebf345c26d3834703d11756/shm
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/b6087483482422390d1fad0ec6726dfd98aba0d990b3f7f5a6d8224c15c4a4a3/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/c13daa616a7ee7a7144b2acf39476a6e36fd454c1ebf345c26d3834703d11756/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/0e6c2021f2b349bb0a16e5e5ecedb44a364566413ddfac25d09dd0538bf1de3b/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/970340bd3152b21a503b9e8fbc0b6af4948bed0bc9581f03f7140cbad18b8015/rootfs
    tmpfs 63G 12K 63G 1% /var/lib/kubelet/pods/1af287b4-b519-4956-995a-5cf7403e0699/volumes/kubernetes.io~projected/kube-api-access-h9vz9
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/36a2786c5693be823e1cd178341a794744583fe1d67548132f4a364933d54967/rootfs
    overlay 1.8T 571G 1.2T 33% /run/containerd/io.containerd.runtime.v2.task/k8s.io/27c42f370c029cf965538366fd6f310cb9408ef6b305442ce21f32ec8947e2a6/rootfs

  4. Check the kubelet service status and logs with systemctl status kubelet.service and journalctl -xeu kubelet.

  5. Check that the kubelet certificates have not expired.

    root@kind-cluster-worker:/# openssl x509 -in /var/lib/kubelet/pki/kubelet.crt
    -----BEGIN CERTIFICATE-----
    MIIDTTCCAjWgAwIBAgIIdyIAO9Z5gVAwDQYJKoZIhvcNAQELBQAwLDEqMCgGA1UE
    Awwha2luZC1jbHVzdGVyLXdvcmtlci1jYUAxNzA3NDMzMzY1MB4XDTI0MDIwODIy
    MDI0NVoXDTI1MDIwNzIyMDI0NVowKTEnMCUGA1UEAwwea2luZC1jbHVzdGVyLXdv
    cmtlckAxNzA3NDMzMzY1MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
    nXOMHMXoQiRePWMKnI6NN0VI7lhy6Te2Ia2y+QZ+qeDfMM9mi62kwbHcnCnFsptJ
    8CBqv1mYpzNJaCDDiOrtB9Fv6gs6k0xARF+Tdw+CC2Mo7UJEVh4S5A1BnYTJUctm
    tWA9jzUqbh3cxaubmN2AmzlmTk2+A6FZX+fR/bdNs9Gh+zrrkhF2irfs8Sxbp68f
    KMB6HsgZOSdt014Dz9J5xB37Hh0R3KS0FYLcJ4TVaPGJrCypL26GezfZWjCRFm7q
    wB/t7vbSNV/gFNt533Vdr6AxF8IZEVzdB2fxJ6/ofNDbsioFQ1iDhv4wQECu6jCH
    6NkbzCZrPDF4KJrLXGjkNwIDAQABo3YwdDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0l
    BAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAfBgNVHSMEGDAWgBT53Jnk+X3i
    R7lLud6Q3HnbydB0azAeBgNVHREEFzAVghNraW5kLWNsdXN0ZXItd29ya2VyMA0G
    CSqGSIb3DQEBCwUAA4IBAQAnNioBu6agqKH/kDgjGfut865x8ufWw2wlmyunx5CS
    njAdP/csErsSrVXlzlYhdNaXHvCYZcwXCjUpL8wNYHJqT5aRhuMr4w6ZYACWY50o
    jyepzZFA8BNxA7FH5SnQbr+JZP1y+bXlF3JbfYPNAEHZBRSuayw3WdU9iSuGghnG
    pQA0OjOjZ7MwYXF3NKPuS/rPi6NERjykT8VYW6G2kIJDgPf4EaJ5lEKM3ifxjW+n
    vu7XpnjG+Ff48Gq47BBwxhE9p/YTFLzyGZnbArx+u6V2yui3Q3agi7f0oJT1fqkp
    RfbxkFBrCCuiVbswcaf4eBFwyMNqyg9mhn8r4Wo4N2z8
    -----END CERTIFICATE-----
    root@kind-cluster-worker:/# openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -text
    Certificate:
    Data:
    Version: 3 (0x2)
    Serial Number: 8584424096722944336 (0x7722003bd6798150)
    Signature Algorithm: sha256WithRSAEncryption
    Issuer: CN = kind-cluster-worker-ca@1707433365
    Validity
    Not Before: Feb 8 22:02:45 2024 GMT
    Not After : Feb 7 22:02:45 2025 GMT # OK: not yet expired when this output was captured (February 2024)
    Subject: CN = kind-cluster-worker@1707433365
    Subject Public Key Info:
    Public Key Algorithm: rsaEncryption
    Public-Key: (2048 bit)
    Modulus:
    00:9d:73:8c:1c:c5:e8:42:24:5e:3d:63:0a:9c:8e:
    8d:37:45:48:ee:58:72:e9:37:b6:21:ad:b2:f9:06:
    7e:a9:e0:df:30:cf:66:8b:ad:a4:c1:b1:dc:9c:29:
    c5:b2:9b:49:f0:20:6a:bf:59:98:a7:33:49:68:20:
    c3:88:ea:ed:07:d1:6f:ea:0b:3a:93:4c:40:44:5f:
    93:77:0f:82:0b:63:28:ed:42:44:56:1e:12:e4:0d:
    41:9d:84:c9:51:cb:66:b5:60:3d:8f:35:2a:6e:1d:
    dc:c5:ab:9b:98:dd:80:9b:39:66:4e:4d:be:03:a1:
    59:5f:e7:d1:fd:b7:4d:b3:d1:a1:fb:3a:eb:92:11:
    76:8a:b7:ec:f1:2c:5b:a7:af:1f:28:c0:7a:1e:c8:
    19:39:27:6d:d3:5e:03:cf:d2:79:c4:1d:fb:1e:1d:
    11:dc:a4:b4:15:82:dc:27:84:d5:68:f1:89:ac:2c:
    a9:2f:6e:86:7b:37:d9:5a:30:91:16:6e:ea:c0:1f:
    ed:ee:f6:d2:35:5f:e0:14:db:79:df:75:5d:af:a0:
    31:17:c2:19:11:5c:dd:07:67:f1:27:af:e8:7c:d0:
    db:b2:2a:05:43:58:83:86:fe:30:40:40:ae:ea:30:
    87:e8:d9:1b:cc:26:6b:3c:31:78:28:9a:cb:5c:68:
    e4:37
    Exponent: 65537 (0x10001)
    X509v3 extensions:
    X509v3 Key Usage: critical
    Digital Signature, Key Encipherment
    X509v3 Extended Key Usage:
    TLS Web Server Authentication
    X509v3 Basic Constraints: critical
    CA:FALSE
    X509v3 Authority Key Identifier:
    F9:DC:99:E4:F9:7D:E2:47:B9:4B:B9:DE:90:DC:79:DB:C9:D0:74:6B
    X509v3 Subject Alternative Name:
    DNS:kind-cluster-worker
    Signature Algorithm: sha256WithRSAEncryption
    Signature Value:
    27:36:2a:01:bb:a6:a0:a8:a1:ff:90:38:23:19:fb:ad:f3:ae:
    71:f2:e7:d6:c3:6c:25:9b:2b:a7:c7:90:92:9e:30:1d:3f:f7:
    2c:12:bb:12:ad:55:e5:ce:56:21:74:d6:97:1e:f0:98:65:cc:
    17:0a:35:29:2f:cc:0d:60:72:6a:4f:96:91:86:e3:2b:e3:0e:
    99:60:00:96:63:9d:28:8f:27:a9:cd:91:40:f0:13:71:03:b1:
    47:e5:29:d0:6e:bf:89:64:fd:72:f9:b5:e5:17:72:5b:7d:83:
    cd:00:41:d9:05:14:ae:6b:2c:37:59:d5:3d:89:2b:86:82:19:
    c6:a5:00:34:3a:33:a3:67:b3:30:61:71:77:34:a3:ee:4b:fa:
    cf:8b:a3:44:46:3c:a4:4f:c5:58:5b:a1:b6:90:82:43:80:f7:
    f8:11:a2:79:94:42:8c:de:27:f1:8d:6f:a7:be:ee:d7:a6:78:
    c6:f8:57:f8:f0:6a:b8:ec:10:70:c6:11:3d:a7:f6:13:14:bc:
    f2:19:99:db:02:bc:7e:bb:a5:76:ca:e8:b7:43:76:a0:8b:b7:
    f4:a0:94:f5:7e:a9:29:45:f6:f1:90:50:6b:08:2b:a2:55:bb:
    30:71:a7:f8:78:11:70:c8:c3:6a:ca:0f:66:86:7f:2b:e1:6a:
    38:37:6c:fc
    -----BEGIN CERTIFICATE-----
    MIIDTTCCAjWgAwIBAgIIdyIAO9Z5gVAwDQYJKoZIhvcNAQELBQAwLDEqMCgGA1UE
    Awwha2luZC1jbHVzdGVyLXdvcmtlci1jYUAxNzA3NDMzMzY1MB4XDTI0MDIwODIy
    MDI0NVoXDTI1MDIwNzIyMDI0NVowKTEnMCUGA1UEAwwea2luZC1jbHVzdGVyLXdv
    cmtlckAxNzA3NDMzMzY1MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
    nXOMHMXoQiRePWMKnI6NN0VI7lhy6Te2Ia2y+QZ+qeDfMM9mi62kwbHcnCnFsptJ
    8CBqv1mYpzNJaCDDiOrtB9Fv6gs6k0xARF+Tdw+CC2Mo7UJEVh4S5A1BnYTJUctm
    tWA9jzUqbh3cxaubmN2AmzlmTk2+A6FZX+fR/bdNs9Gh+zrrkhF2irfs8Sxbp68f
    KMB6HsgZOSdt014Dz9J5xB37Hh0R3KS0FYLcJ4TVaPGJrCypL26GezfZWjCRFm7q
    wB/t7vbSNV/gFNt533Vdr6AxF8IZEVzdB2fxJ6/ofNDbsioFQ1iDhv4wQECu6jCH
    6NkbzCZrPDF4KJrLXGjkNwIDAQABo3YwdDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0l
    BAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAfBgNVHSMEGDAWgBT53Jnk+X3i
    R7lLud6Q3HnbydB0azAeBgNVHREEFzAVghNraW5kLWNsdXN0ZXItd29ya2VyMA0G
    CSqGSIb3DQEBCwUAA4IBAQAnNioBu6agqKH/kDgjGfut865x8ufWw2wlmyunx5CS
    njAdP/csErsSrVXlzlYhdNaXHvCYZcwXCjUpL8wNYHJqT5aRhuMr4w6ZYACWY50o
    jyepzZFA8BNxA7FH5SnQbr+JZP1y+bXlF3JbfYPNAEHZBRSuayw3WdU9iSuGghnG
    pQA0OjOjZ7MwYXF3NKPuS/rPi6NERjykT8VYW6G2kIJDgPf4EaJ5lEKM3ifxjW+n
    vu7XpnjG+Ff48Gq47BBwxhE9p/YTFLzyGZnbArx+u6V2yui3Q3agi7f0oJT1fqkp
    RfbxkFBrCCuiVbswcaf4eBFwyMNqyg9mhn8r4Wo4N2z8
    -----END CERTIFICATE-----

    Also check which kube-apiserver endpoint the kubelet is pointing to (the server field in the kubelet's kubeconfig) and confirm that the node can reach it.
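The condition checks in steps 1 and 2 can be run across all nodes at once. The sketch below assumes kubectl is configured for the cluster; the function names list_node_conditions and check_node_conditions are hypothetical. It uses a kubectl JSONPath template to print every condition per node, then flags nodes whose Ready condition is not True or whose *Pressure conditions are not False (so Unknown is flagged too).

```shell
# Hypothetical helper: emits one line per node, e.g.
# kind-cluster-worker<TAB>MemoryPressure=False DiskPressure=False PIDPressure=False Ready=True
list_node_conditions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}{end}'
}

# Hypothetical helper: reads "NAME<TAB>Type=Status ..." lines on stdin and prints
# only nodes with a problem. Ready must be True; any *Pressure condition must be False.
check_node_conditions() {
  awk -F'\t' '{
    problems = ""
    n = split($2, conds, " ")
    for (i = 1; i <= n; i++) {
      split(conds[i], kv, "=")
      if (kv[1] == "Ready" && kv[2] != "True") {
        problems = problems " " conds[i]
      } else if (kv[1] ~ /Pressure$/ && kv[2] != "False") {
        problems = problems " " conds[i]
      }
    }
    if (problems != "") print $1 ":" problems
  }'
}

# Usage (no output means every node is Ready with no pressure):
#   list_node_conditions | check_node_conditions
```

Splitting listing from checking keeps the filter testable against saved output without a live cluster.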
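For step 3, instead of reading df -h by eye, a small filter can flag filesystems above a usage threshold. This is a sketch; the function name high_disk_usage and the 85 percent default are assumptions. The kubelet's default hard eviction thresholds are nodefs.available<10% and imagefs.available<15%, so DiskPressure is expected once usage crosses roughly 85 to 90 percent. Inode exhaustion (nodefs.inodesFree<5% by default) also triggers DiskPressure, and on GNU coreutils df -Pi reports inode usage in the same column layout.

```shell
# Hypothetical helper: reads POSIX `df -P` (or `df -Pi`) output on stdin, header first,
# and prints "MOUNTPOINT USE%" for each filesystem at or above the threshold (default 85).
# In POSIX df output, column 5 is the usage percentage and column 6 the mount point.
high_disk_usage() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t) print $6, $5
  }'
}

# Usage on the node:
#   df -P  | high_disk_usage 85   # block usage
#   df -Pi | high_disk_usage 85   # inode usage
```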
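For step 4, kubelet log lines use the klog format, where each message starts with a severity letter (I, W, E or F) followed by the month and day, for example E0226 08:57:01.123456. A minimal sketch of a filter that keeps only error and fatal lines from journalctl output; the function name kubelet_klog_errors is hypothetical.

```shell
# Hypothetical helper: keeps only klog error (E) and fatal (F) lines from stdin.
# klog header: <severity><MMDD> <hh:mm:ss.uuuuuu> <thread id> <file:line>] <message>
kubelet_klog_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Usage on the node:
#   systemctl is-active kubelet
#   journalctl -u kubelet --since "1 hour ago" --no-pager | kubelet_klog_errors | tail -n 20
```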
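For step 5, openssl x509 can test expiry directly instead of dumping the whole certificate: -noout -enddate prints only the notAfter date, and -checkend N exits with status 0 if the certificate is still valid N seconds from now and 1 otherwise. A sketch, assuming a hypothetical function name check_cert_expiry and a 30-day default window. Besides the serving certificate shown above, the kubelet's client certificate matters for talking to the API server; on kubeadm-based nodes such as kind it is usually /var/lib/kubelet/pki/kubelet-client-current.pem (verify the path on your distribution).

```shell
# Hypothetical helper: prints the certificate's notAfter date and warns if it
# expires within the given number of days (default 30).
# Exit status: 0 = valid beyond the window, 1 = expiring or expired, 2 = unreadable.
check_cert_expiry() {
  local cert="$1" days="${2:-30}"
  if [ ! -r "$cert" ]; then
    echo "cannot read $cert" >&2
    return 2
  fi
  openssl x509 -in "$cert" -noout -enddate
  # -checkend exits 0 if the certificate is still valid $days days from now.
  if openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null; then
    echo "OK: valid for at least $days more days"
  else
    echo "WARNING: expires within $days days or has already expired"
    return 1
  fi
}

# Usage on the node:
#   check_cert_expiry /var/lib/kubelet/pki/kubelet.crt
#   check_cert_expiry /var/lib/kubelet/pki/kubelet-client-current.pem
```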
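For the API server endpoint check, the URL the kubelet uses is the server field in its kubeconfig; on kubeadm-based nodes, including kind, that file is usually /etc/kubernetes/kubelet.conf (an assumption to verify on your distribution). A sketch that extracts the URL so it can be probed from the node; the function name kubeconfig_server is hypothetical. By default the API server allows unauthenticated requests to /healthz, so curl -k is enough to test reachability, although some managed clusters disable anonymous access.

```shell
# Hypothetical helper: prints the first "server:" URL found in a kubeconfig file
# (defaults to the kubeadm kubelet kubeconfig path).
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "${1:-/etc/kubernetes/kubelet.conf}"
}

# Usage on the node (a reachable, healthy API server answers "ok"):
#   server="$(kubeconfig_server)"
#   echo "kubelet -> $server"
#   curl -sk --max-time 5 "$server/healthz"; echo
```

This simple awk match assumes a single-cluster kubeconfig, which is the usual case for the kubelet.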