I spent a happy few hours over the weekend trying to work out why my Kubernetes 1.21 cluster wasn't behaving as expected.
I was seeing a bunch o' weirdness whereby certain pods couldn't reach certain services; it showed up specifically when I tried, and failed, to create a DataVolume using the KubeVirt Containerized Data Importer (CDI).
I was seeing errors such as: -
Error from server (InternalError): error when creating "create_volume.yaml": Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io": Post "https://cdi-api.cdi.svc:443/datavolume-mutate?timeout=30s": dial tcp 10.102.58.243:443: i/o timeout
from: -
kubectl apply -f create_volume.yaml
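A timeout like this names the address that couldn't be reached, so a reasonable first step is to pull that address out and compare it with the ClusterIP of the Service. The following is only a minimal sketch: `webhook_dial_target` is a hypothetical helper of my own naming (it isn't part of kubectl or CDI), and it assumes the error text has the `dial tcp IP:PORT` shape shown above: -

```shell
# Hypothetical helper (not part of kubectl or CDI): read an error message on
# stdin and print the first "IP:PORT" that follows "dial tcp".
webhook_dial_target() {
  sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\).*/\1/p' | head -n 1
}

# Example usage against a live cluster (left commented out here):
#   kubectl apply -f create_volume.yaml 2>&1 | webhook_dial_target
#   kubectl get service cdi-api -n cdi -o jsonpath='{.spec.clusterIP}'
```

If the dialled IP matches the ClusterIP, the name resolved correctly and the problem is more likely to be in reaching the pods behind the Service.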
After much digging and DNS debugging, including using BusyBox to resolve various K8s services: -
kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nslookup cdi-api.cdi
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: cdi-api.cdi
Address 1: 10.102.58.243 cdi-api.cdi.svc.cluster.local
pod "busybox" deleted
kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
which all looked OK, something inspired me to look at Calico Node, which provides the networking layer for my cluster: -
kubectl get pods -A|grep calico
kube-system   calico-kube-controllers-bf965bfd8-hg82b   1/1   Running   0   58m
kube-system   calico-node-8zkvt                         0/1   Running   0   7m24s
kube-system   calico-node-srmj6                         0/1   Running   0   7m47s
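On a busier cluster, the same check can be done without eyeballing the list. This is a minimal sketch rather than anything official: `unready_pods` is a hypothetical helper name, and it assumes the column order that `kubectl get pods -A` printed for me (NAMESPACE, NAME, READY, ...): -

```shell
# Hypothetical helper: list pods whose READY count is below their container
# count. Assumes the column order of `kubectl get pods -A`.
unready_pods() {
  kubectl get pods -A --no-headers |
    awk '{
      split($3, ready, "/")                 # READY looks like "0/1"
      if (ready[1] + 0 < ready[2] + 0)      # force a numeric comparison
        print $1 "/" $2, $3
    }'
}
```

Calling `unready_pods` prints one `namespace/name READY` line per pod that isn't fully ready, and nothing at all when everything is healthy.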
Noticing that both calico-node pods were showing 0/1 rather than 1/1, meaning that their containers were running but failing their readiness probes, I dug further: -
kubectl describe pod `kubectl get pods -A|grep calico-node|awk '{print $2}'` --namespace kube-system
which, in part, showed: -
Warning Unhealthy 9m44s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
Warning Unhealthy 9m42s kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning Unhealthy 9m32s kubelet Readiness probe failed: 2021-05-17 11:33:08.203 [INFO][197] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 9m22s kubelet Readiness probe failed: 2021-05-17 11:33:18.195 [INFO][231] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 9m12s kubelet Readiness probe failed: 2021-05-17 11:33:28.278 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 9m2s kubelet Readiness probe failed: 2021-05-17 11:33:38.334 [INFO][301] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 8m52s kubelet Readiness probe failed: 2021-05-17 11:33:48.182 [INFO][319] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 8m42s kubelet Readiness probe failed: 2021-05-17 11:33:58.266 [INFO][356] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 8m32s kubelet Readiness probe failed: 2021-05-17 11:34:08.185 [INFO][377] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
Warning Unhealthy 4m42s (x23 over 8m22s) kubelet (combined from similar events): Readiness probe failed: 2021-05-17 11:37:58.234 [INFO][1014] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
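Those events repeat a lot, so it can help to boil them down to the distinct peers that BIRD is complaining about. A minimal sketch, assuming the message format shown above; `bgp_unestablished_peers` is a hypothetical helper of my own naming, and the `k8s-app=calico-node` label in the usage comment is what I believe the stock Calico manifest applies, so check it against your own pods: -

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and print
# each distinct peer reported as "BGP not established with <address>".
bgp_unestablished_peers() {
  grep -o 'BGP not established with [0-9.]*' | awk '{ print $NF }' | sort -u
}

# Example usage against a live cluster (left commented out here):
#   kubectl describe pod -n kube-system -l k8s-app=calico-node | bgp_unestablished_peers
```

A single address coming back, as in my case, suggests that one specific peering is failing rather than BGP being broken across the board.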
Knowing that my firewall configuration - iptables - was clean n' green, in that I'd opened up the Border Gateway Protocol (BGP) port 179 on both the Control Plane and Compute nodes: -
iptables -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
I looked back through my notes, and remembered the issue with IP_AUTODETECTION_METHOD and the Calico Node daemonset.
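For anyone less sure of their firewall than I was, the rule can be checked rather than assumed. This is a minimal sketch: `bgp_port_open` is a hypothetical helper name, it needs root to query the live firewall, and it only looks for an ACCEPT rule in the `iptables -S INPUT` listing; it doesn't evaluate rule order, so an earlier REJECT or DROP rule could still win: -

```shell
# Hypothetical helper: succeed if the INPUT chain lists an ACCEPT rule for
# TCP port 179 (BGP). Only checks that the rule exists, not where it sits.
bgp_port_open() {
  iptables -S INPUT | grep -q -- '-p tcp .*--dport 179 -j ACCEPT'
}

# Example usage on each node (left commented out here):
#   bgp_port_open && echo "BGP port rule present" || echo "BGP port rule missing"
```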
I checked the daemonset: -
kubectl get daemonset -A
NAMESPACE     NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node    2         2         0       2            0           kubernetes.io/os=linux   64m
kube-system   kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   112m
kubevirt      virt-handler   1         1         1       1            1           kubernetes.io/os=linux   30m
and noticed that the calico-node daemonset was, like the pods, showing as unready (0 instead of 2 in the READY column).
I inspected the offending daemonset: -
kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'
{
"name": "IP",
"value": "autodetect"
}
noting that the IP_AUTODETECTION_METHOD environment variable wasn't specifically set.
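With autodetection left to its own devices, it's worth seeing which address Calico actually picked on each node. The sketch below compares each node's InternalIP with the address in the `projectcalico.org/IPv4Address` node annotation. That annotation is an assumption on my part about where calico/node records its detected address, and `calico_ip_mismatches` is a hypothetical helper name. A mismatch is only a hint, not proof, that the wrong interface was chosen: -

```shell
# Hypothetical helper: flag nodes where the address Calico recorded differs
# from the node's InternalIP. Assumes calico/node stores its detected address
# (with a prefix length, e.g. "10.0.0.5/24") in the
# projectcalico.org/IPv4Address annotation.
calico_ip_mismatches() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{" "}{.metadata.annotations.projectcalico\.org/IPv4Address}{"\n"}{end}' |
    awk 'NF == 3 {
      split($3, calico, "/")               # drop the prefix length
      if ($2 != calico[1])
        print $1 ": InternalIP " $2 ", Calico detected " calico[1]
    }'
}
```

Nodes without the annotation produce fewer than three fields and are skipped, so an empty result means either "all matching" or "nothing to compare".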
Given that the VMs that host my K8s nodes have TWO network adapters, eth0 and eth1, and that I want Calico Node to use eth0, which carries the private IP address, I explicitly set that: -
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth0
daemonset.apps/calico-node env updated
and then validated the change: -
kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'
{
"name": "IP",
"value": "autodetect"
}
{
"name": "IP_AUTODETECTION_METHOD",
"value": "interface=eth0"
}
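Changing the environment on the daemonset alters its pod template, so the calico-node pods get replaced, which is why their names differ in the listing that follows. Rather than polling by hand, the rollout can be waited on. A minimal sketch: `wait_for_calico_node` is a hypothetical wrapper name, and the 120s default timeout is an arbitrary choice of mine: -

```shell
# Hypothetical wrapper: block until the calico-node DaemonSet has finished
# rolling out, or give up after the given timeout (default 120s).
wait_for_calico_node() {
  kubectl rollout status daemonset/calico-node -n kube-system --timeout="${1:-120s}"
}

# Example usage against a live cluster (left commented out here):
#   wait_for_calico_node 5m
```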
More importantly, the Calico Node pods are happy: -
kubectl get pods -A
NAMESPACE     NAME                                             READY   STATUS    RESTARTS   AGE
cdi           cdi-apiserver-6b87945b8d-dww25                   1/1     Running   0          120m
cdi           cdi-deployment-86c6d76d98-7cxlv                  1/1     Running   0          120m
cdi           cdi-operator-5757c84894-xhw6r                    1/1     Running   0          120m
cdi           cdi-uploadproxy-79dd97b4d5-lvd72                 1/1     Running   0          120m
kube-system   calico-kube-controllers-bf965bfd8-hg82b          1/1     Running   0          154m
kube-system   calico-node-8llqm                                1/1     Running   0          75m
kube-system   calico-node-j9rdb                                1/1     Running   0          75m
kube-system   coredns-558bd4d5db-fml9w                         1/1     Running   0          3h21m
kube-system   coredns-558bd4d5db-gg8dm                         1/1     Running   0          3h21m
kube-system   etcd-grouched1.fyre.ibm.com                      1/1     Running   0          3h22m
kube-system   kube-apiserver-grouched1.fyre.ibm.com            1/1     Running   0          3h22m
kube-system   kube-controller-manager-grouched1.fyre.ibm.com   1/1     Running   1          3h22m
kube-system   kube-proxy-47txj                                 1/1     Running   0          3h19m
kube-system   kube-proxy-hg7f8                                 1/1     Running   0          3h21m
kube-system   kube-scheduler-grouched1.fyre.ibm.com            1/1     Running   0          3h22m
kubevirt      virt-api-58999dff54-c8mch                        1/1     Running   0          120m
kubevirt      virt-api-58999dff54-gs8pm                        1/1     Running   0          120m
kubevirt      virt-controller-5c68c56896-l2rp7                 1/1     Running   0          120m
kubevirt      virt-controller-5c68c56896-phrt9                 1/1     Running   0          120m
kubevirt      virt-handler-85dhc                               1/1     Running   0          120m
kubevirt      virt-operator-78f65c88d4-ldtgj                   1/1     Running   0          123m
kubevirt      virt-operator-78f65c88d4-tmxhs                   1/1     Running   0          123m
as is the daemonset: -
kubectl get daemonset -A
NAMESPACE     NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node    2         2         2       2            2           kubernetes.io/os=linux   154m
kube-system   kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   3h23m
kubevirt      virt-handler   1         1         1       1            1           kubernetes.io/os=linux   120m
and I can now create my DataVolume: -
kubectl apply -f create_volume.yaml
datavolume.cdi.kubevirt.io/registry-image-datavolume created