On slower setups it can take more time for the existing cluster
to reach a healthy state, so the existing backoff of ~50 seconds
is apparently not sufficient.
The client dial can also fail for similar reasons.
Improve kubeadm's join toleration of adding new etcd members.
Wrap both the client dial and member add in a longer backoff
(up to ~200 seconds).
This particular change should be backported to the support skew.
In a future change for master, all etcd client operations should be
made consistent so that the etcd logic is in a sane state.
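As a rough illustration, the sketch below wraps both the client dial
and the member add in a single exponential backoff of roughly 200
seconds; the constants, timeouts and the joinMember helper are
assumptions, not kubeadm's exact implementation.
```
package etcdjoin

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"k8s.io/apimachinery/pkg/util/wait"
)

// joinMember dials etcd and adds the new member's peer URL, retrying
// both steps under one backoff that sums to roughly 200 seconds.
func joinMember(endpoints []string, peerURL string) error {
	backoff := wait.Backoff{
		Duration: 5 * time.Second,
		Factor:   1.35,
		Steps:    9, // the steps add up to roughly 200 seconds
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   endpoints,
			DialTimeout: 20 * time.Second,
		})
		if err != nil {
			return false, nil // dial failed on a slow setup: retry
		}
		defer cli.Close()

		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
		defer cancel()
		if _, err := cli.MemberAdd(ctx, []string{peerURL}); err != nil {
			return false, nil // member add failed: retry
		}
		return true, nil
	})
}
```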
kubeadm uses image tags (such as `v3.4.3-0`) to specify the version of
etcd. However, the upgrade code in kubeadm uses the etcd client API to
fetch the currently deployed version. The result contains only the etcd
version without the additional information (such as image revision) that
is normally found in the tag. As a result it would refuse an upgrade
where the etcd versions match and the only difference is the image
revision number (`v3.4.3-0` to `v3.4.3-1`).
To fix the above issue, the following changes are done:
- Replace the existing etcd version querying code, which uses the etcd
client library, with code that returns the etcd image tag from the
local static pod manifest file (see the sketch after this list).
- If an etcd `imageTag` is specified in the ClusterConfiguration during
upgrade, use that tag instead. This is done regardless of whether the
tag was specified in the configuration stored in the cluster or in a
new configuration supplied via the `--config` command line parameter.
If no custom tag is specified, kubeadm will select one depending on
the desired Kubernetes version.
- `kubeadm upgrade plan` no longer prints upgrade information about
external etcd. It's the user's responsibility to manage it in that
case.
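As a rough sketch of the first bullet, reading the tag from the local
static pod manifest could look like the snippet below; the path
handling and error reporting are simplified relative to kubeadm's
actual implementation.
```
package etcdversion

import (
	"fmt"
	"io/ioutil"
	"strings"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

// etcdImageTagFromManifest returns the full image tag (e.g. "v3.4.3-0")
// from the etcd static pod manifest, keeping the image revision so that
// "v3.4.3-0" -> "v3.4.3-1" is recognized as an upgrade.
func etcdImageTagFromManifest(manifestPath string) (string, error) {
	data, err := ioutil.ReadFile(manifestPath)
	if err != nil {
		return "", err
	}
	pod := &v1.Pod{}
	if err := yaml.Unmarshal(data, pod); err != nil {
		return "", err
	}
	if len(pod.Spec.Containers) == 0 {
		return "", fmt.Errorf("no containers found in %q", manifestPath)
	}
	image := pod.Spec.Containers[0].Image
	idx := strings.LastIndex(image, ":")
	if idx == -1 {
		return "", fmt.Errorf("image %q has no tag", image)
	}
	return image[idx+1:], nil
}
```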
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
Most of these could have been refactored automatically, but it would
have been uglier: the unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
The selected key type is defined by kubeadm's --feature-gates option:
if it contains PublicKeysECDSA=true then ECDSA keys will be generated
and used.
By default, RSA keys are still used.
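A simplified illustration of that choice, assuming a boolean derived
from the PublicKeysECDSA feature gate (the helper itself is
hypothetical):
```
package pkiutil

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
)

// newPrivateKey returns an ECDSA key when the PublicKeysECDSA feature
// gate is enabled, and falls back to the default RSA key otherwise.
func newPrivateKey(publicKeysECDSA bool) (crypto.Signer, error) {
	if publicKeysECDSA {
		return ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	}
	return rsa.GenerateKey(rand.Reader, 2048)
}
```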
Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>
When doing the very first upgrade from a cluster that contains the
source of truth in the ClusterStatus struct, the new kubeadm logic
will try to retrieve this information from annotations.
This changeset adds a special case to both the etcd and API server
endpoint retrieval so that they do not retry in those situations. The
logic will still retry on any unknown error, but will not retry in
the following cases:
- etcd annotations do not contain etcd endpoints, but the list of etcd
pods is not empty. This means that we listed at least one etcd pod,
but they are missing the annotation.
- API server annotation is not found on the api server pod for a given
node name, but no errors aside from that one were found. This means
that the API server pod is present, but is missing the annotation.
In both cases there is no point in retrying, so this speeds up the
upgrade path when coming from a pre-existing cluster.
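A condensed sketch of the non-retry behavior for the API server case
is below; the annotation key, helper name and polling parameters are
illustrative assumptions, not the exact kubeadm code.
```
package apiendpoint

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// Assumed annotation key, for illustration only.
const endpointAnnotation = "kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint"

// getAPIEndpoint polls getPod until the endpoint annotation is found.
// Unknown errors keep the poll going; a pod that exists but lacks the
// annotation aborts immediately, which is the "pre-existing cluster
// without annotations" case described above.
func getAPIEndpoint(getPod func() (*v1.Pod, error)) (string, error) {
	var endpoint string
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		pod, err := getPod()
		if err != nil {
			return false, nil // unknown error: retry
		}
		ep, ok := pod.Annotations[endpointAnnotation]
		if !ok {
			// Pod found but annotation missing: no point in retrying.
			return false, fmt.Errorf("pod %q has no %q annotation", pod.Name, endpointAnnotation)
		}
		endpoint = ep
		return true, nil
	})
	return endpoint, err
}
```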
While `ClusterStatus` will be maintained and uploaded, it won't be
used by the internal `kubeadm` logic in order to determine the etcd
endpoints anymore.
The only exception is during the first upgrade cycle (`kubeadm upgrade
apply`, `kubeadm upgrade node`), in which we will fall back to the
ClusterStatus to let the upgrade path add the required annotations to
the newly created static pods.
- Extend the exponential backoff for the add/remove/... retry to
11 steps ~=106 seconds. From experiments with 3 or more members
the race can take more than ~=26 seconds.
- Increase the dialTimeout for client creation to 40 seconds.
20 seconds seems racy for 3 or more members.
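For reference, constants in this spirit could look like the sketch
below; the exact values may differ from what kubeadm settled on, they
simply add up to roughly the figures mentioned above.
```
package etcdclient

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

var (
	// 11 steps, doubling from 50ms, sum to ~102s of sleep
	// (roughly ~106s once jitter is included).
	etcdBackoff = wait.Backoff{
		Steps:    11,
		Duration: 50 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
	}

	// 20 seconds proved racy with 3 or more members, so allow 40.
	etcdDialTimeout = 40 * time.Second
)
```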
For the etcd client, amend AddMember() to handle a very
rare bug where multiple members can end up with the same
name. Match the member peer address and assign it the name of
the member we are adding. For the rest of the members with missing
names, use their member IDs as the name. The etcd node is not
disrupted by the unknown names.
The important aspects are:
- The number of members in the initial cluster list must match
the number of members in the cluster.
- The member we are currently adding is present in the initial cluster.
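A simplified sketch of that matching logic is shown below; the types
follow go.etcd.io/etcd's etcdserverpb, but the helper itself is
illustrative rather than kubeadm's actual AddMember() code.
```
package etcdclient

import (
	"fmt"
	"strconv"

	"go.etcd.io/etcd/etcdserver/etcdserverpb"
)

// initialCluster builds the name=peerURL list for --initial-cluster,
// naming the member being added by matching its peer address and
// falling back to the member ID for any other member without a name.
func initialCluster(members []*etcdserverpb.Member, newName, newPeerURL string) []string {
	var entries []string
	for _, m := range members {
		if len(m.PeerURLs) == 0 {
			continue
		}
		name := m.Name
		if name == "" {
			if m.PeerURLs[0] == newPeerURL {
				// This is the member we are adding: give it the intended name.
				name = newName
			} else {
				// Another unnamed member (the rare duplicate-name race):
				// its ID still gives etcd a unique, stable entry.
				name = strconv.FormatUint(m.ID, 16)
			}
		}
		entries = append(entries, fmt.Sprintf("%s=%s", name, m.PeerURLs[0]))
	}
	return entries
}
```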
The warning message
```
[config] WARNING: Ignored YAML document with GroupVersionKind ...
```
is printed for all GVKs that are not part of the kubeadm core types.
This is wrong, as the component config types are supported and are
successfully parsed and used even though the warning is printed for
them too. Hence, this simple fix first checks whether the GVK's group
is a supported component config group and prints the warning only if
it is not.
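A minimal sketch of that check; the set of component config groups
here is illustrative, kubeadm derives the supported groups from its
component config support rather than a hard-coded map.
```
package config

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Illustrative set of supported component config groups.
var componentConfigGroups = map[string]bool{
	"kubelet.config.k8s.io":   true,
	"kubeproxy.config.k8s.io": true,
}

// warnIfUnsupported prints the warning only for groups that are neither
// the kubeadm core group nor a supported component config group.
func warnIfUnsupported(gvk schema.GroupVersionKind, kubeadmGroup string) {
	if gvk.Group == kubeadmGroup || componentConfigGroups[gvk.Group] {
		return // supported: parsed and used, so no warning
	}
	fmt.Printf("[config] WARNING: Ignored YAML document with GroupVersionKind %v\n", gvk)
}
```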
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
kubeadm deploys the apiserver, controller-manager and scheduler
with liveness probes.
The bind-address option is used to configure the probe address; if it
is set to an unspecified address (0.0.0.0 or ::), the probe would
fail. Therefore, when an unspecified address is used, the probe host
field is left empty, otherwise the bind-address is used.
kubeadm removed the deprecated "--address" flag for the
controller-manager and scheduler in favor of "--bind-address", so we
should use bind-address to configure the manifest probe addresses.
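A minimal sketch of that rule (the helper name is illustrative):
```
package probes

import "net"

// probeHost returns the host to place in the liveness probe: empty for
// an unspecified bind address (0.0.0.0 or ::), so the kubelet probes
// the pod IP instead of an unroutable address, and the bind address
// itself otherwise.
func probeHost(bindAddress string) string {
	if ip := net.ParseIP(bindAddress); ip != nil && ip.IsUnspecified() {
		return ""
	}
	return bindAddress
}
```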
If the user has modified the kubelet.conf after TLS bootstrap in a way
that makes it invalid, the function getNodeNameFromKubeletConfig() can
panic. This was observed to trigger in "kubeadm reset" use cases.
Add basic validation and unit tests around parsing the kubelet.conf
with the aforementioned function.
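The sort of validation involved could look like the sketch below,
which guards every map lookup on the parsed kubeconfig before
dereferencing it; this is a simplified stand-in, not the actual
kubeadm change.
```
package kubeconfig

import (
	"errors"

	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// currentAuthInfo returns the AuthInfo of the current context, or an
// error if the kubelet.conf is missing the context or user entries
// that getNodeNameFromKubeletConfig would otherwise dereference blindly.
func currentAuthInfo(config *clientcmdapi.Config) (*clientcmdapi.AuthInfo, error) {
	ctx, ok := config.Contexts[config.CurrentContext]
	if !ok || ctx == nil {
		return nil, errors.New("kubelet.conf has no valid current context")
	}
	authInfo, ok := config.AuthInfos[ctx.AuthInfo]
	if !ok || authInfo == nil {
		return nil, errors.New("kubelet.conf has no AuthInfo for the current context")
	}
	return authInfo, nil
}
```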
CreateOrMutateConfigMap was not resilient when trying to Create the
ConfigMap. If this operation returned an unknown error, the whole
operation would fail, because the code was strict about the error it
expected right afterwards: if the error returned by the Create call
was an IsAlreadyExists error, it would work fine. However, if an
unexpected error (such as an EOF) happened, this call would fail.
We are seeing this error especially when running control plane node
joins in an automated fashion, where things happen at a relatively
high pace.
It was especially easy to reproduce with kind, with several control
plane instances. E.g.:
```
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
I1130 11:43:42.788952 887 round_trippers.go:443] POST https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps?timeout=10s in 1013 milliseconds
Post https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps?timeout=10s: unexpected EOF
unable to create ConfigMap
k8s.io/kubernetes/cmd/kubeadm/app/util/apiclient.CreateOrMutateConfigMap
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/util/apiclient/idempotency.go:65
```
This change makes this logic more resilient to unknown errors. It will
retry in the face of unknown errors until one of the expected outcomes
happens: either an `IsAlreadyExists` error, in which case we mutate the
ConfigMap, or no error, in which case the ConfigMap has been created.
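A condensed sketch of the resilient flow, with the Create and mutate
steps abstracted as callbacks; the polling interval and timeout are
illustrative.
```
package apiclient

import (
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// createOrMutateConfigMap retries the Create on unknown errors (such as
// an unexpected EOF) until it either succeeds or gets IsAlreadyExists,
// in which case it mutates the existing ConfigMap instead.
func createOrMutateConfigMap(cm *v1.ConfigMap, create func(*v1.ConfigMap) error, mutate func() error) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		if err := create(cm); err != nil {
			if apierrors.IsAlreadyExists(err) {
				// The ConfigMap already exists: mutate it and stop retrying.
				return true, mutate()
			}
			// Unknown error: retry the Create.
			return false, nil
		}
		return true, nil // created successfully
	})
}
```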