In slower setups it can take more time for the existing cluster
to be in a healthy state, so the existing backoff of ~50 seconds
is apparently not sufficient.
The client dial can also fail for similar reasons.
Improve kubeadm's join toleration of adding new etcd members.
Wrap both the client dial and member add in a longer backoff
(up to ~200 seconds).
This particular change should be backported to the support skew.
In a future change for master, all etcd client operations should be
make consistent so that the etcd logic is in a sane state.
specifically:
- cmd/kubeadm/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
- rm pkg/controller/.import-restrictions
- pkg/client/unversioned was removed in 59042
- pkg/kubectl/.import-restrictions
- pkg/printers is no longer used
- pkg/api was masking all of the pkg/apis prefixes
- rm staging/src/k8s.io/code-generator/cmd/lister-gen/.import-restrictions
- noop / empty file
- test/e2e/framework/.import-restrictions
- we don't need to explicitly allow k8s.io repos (external or published)
yaml has comments, so we can explain why we have certain rules or
certain prefixes
for those files that weren't already commented yaml, I converted them to
yaml and took a best guess at comments based on the PRs that introduced
or updated them
Use an init container that performs the pre-pull of a component
and then start an instance of "pause" as a regular container to
get the DaemonSet Pod in a Running state.
More details on this change in the code comments.
pkg/util/rlimit/rlimit_linux.go:25:1: exported function RlimitNumFiles should have comment or be unexported
pkg/util/rlimit/rlimit_linux.go:25:6: func name will be used as rlimit.RlimitNumFiles by other packages, and that stutters; consider calling this NumFiles
pkg/util/rlimit/rlimit_unsupported.go:25:1: exported function RlimitNumFiles should have comment or be unexported
pkg/util/rlimit/rlimit_unsupported.go:25:6: func name will be used as rlimit.RlimitNumFiles by other packages, and that stutters; consider calling this NumFiles
Ref: https://github.com/kubernetes/kubernetes/issues/68026
The flag "--use-api" for "alpha certs renew" was deprecated in 1.18.
Remove the flag and related logic that executes certificate renewal
using "api/certificates/v1beta1". kubeadm continues to be able
to create CSR files and renew using the local CA on disk.
kubeadm init prints:
W0410 23:02:10.119723 13040 manifests.go:225] the default kube-apiserver
authorization-mode is "Node,RBAC"; using "Node,RBAC"
Add a new function compareAuthzModes() and a unit test for it.
Make sure the warning is printed only if the user modes don't match
the defaults.
Allow overriding the dry-run temporary directory with
an env. variable (KUBEADM_INIT_DRYRUN_DIR).
Use the same variable in test/cmd/init_test.go.
This allows running integration tests as non-root.
Make getKubeadmPath() fetch the KUBEADM_PATH env. variable.
Panic if it's missing. Don't handle the "--kubeadm-path"
flag. Remove the same flag from the BUILD bazel test rule.
Don't handle "--kubeadm-cmd-skip" usage of this flag is missing
from the code base.
Remove usage of "kubeadmCmdSkip" as the flag "--kubeadm-cmd-skip"
is never passed.
If the kube-proxy/dns ConfigMap are missing, show warnings and assume
that these addons were skipped during "kubeadm init",
and that their redeployment on upgrade is not desired.
TODO: remove this once "kubeadm upgrade apply" phases are supported:
https://github.com/kubernetes/kubeadm/issues/1318
The TLS bootstrapping timeout is increased to 5 minutes with a retry
once every 5 seconds. Failing fast if the kubelet is not healthy is also
preserved.
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
kubeadm uses image tags (such as `v3.4.3-0`) to specify the version of
etcd. However, the upgrade code in kubeadm uses the etcd client API to
fetch the currently deployed version. The result contains only the etcd
version without the additional information (such as image revision) that
is normally found in the tag. As a result it would refuse an upgrade
where the etcd versions match and the only difference is the image
revision number (`v3.4.3-0` to `v3.4.3-1`).
To fix the above issue, the following changes are done:
- Replace the existing etcd version querying code, that uses the etcd
client library, with code that returns the etcd image tag from the
local static pod manifest file.
- If an etcd `imageTag` is specified in the ClusterConfiguration during
upgrade, use that tag instead. This is done regardless if the tag was
specified in the configuration stored in the cluster or with a new
configuration supplied by the `--config` command line parameter.
If no custom tag is specified, kubeadm will select one depending on
the desired Kubernetes version.
- `kubeadm upgrade plan` no longer prints upgrade information about
external etcd. It's the user's responsibility to manage it in that
case.
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
If the user does not provide --config or --control-plane
but provides some other flags such as --certificate-key
kubeadm is supposed to print a warning.
The logic around printing the warning is bogus. Implement
proper checks of when to print the warning.
b117a928 added a new check during "join" whether a Node with
the same name exists in the cluster.
When upgrading from 1.17 to 1.18 make sure the required RBAC
by this check is added. Otherwise "kubeadm join" will complain that
it lacks permissions to GET a Node.
A narrow assumption of what is contained in the `imageID` fields for the
CoreDNS pods causes a panic upon upgrade.
Fix this by using a proper regex to match a trailing SHA256 image digest
in `imageID` or return an error if it cannot find it.
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
So multiple instances of kube-apiserver can bind on the same address and
port, to provide seamless upgrades.
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This builds on previous work but only sets the sysctlConnReuse value
if the kernel is known to be above 4.19. To avoid calling GetKernelVersion
twice, I store the value from the CanUseIPVS method and then check the version
constraint at time of expected sysctl call.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
This change removes support for basic authn in v1.19 via the
--basic-auth-file flag. This functionality was deprecated in v1.16
in response to ATR-K8S-002: Non-constant time password comparison.
Similar functionality is available via the --token-auth-file flag
for development purposes.
Signed-off-by: Monis Khan <mok@vmware.com>
Most of these could have been refactored automatically but it wouldn't
have been uglier. The unsophisticated tooling left lots of unnecessary
struct -> pointer -> struct transitions.
The KCM is moving to means of only singing apiserver (kubelet) client
and kubelet serving certificates. See:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/20190607-certificates-api.md#signers
Up until now the experimental kubeadm functionality '--use-api'
under "kubeadm alpha certs renew" was using the KCM to sign *any*
certficate as long as the KCM has the root CA cert/key.
Post discussions with the kubeadm maintainers, it was decided that
this functionality should be removed from kubeadm due to the
requirement to have external signers for renewing the common
control-plane certificates that kubeadm manages.
- Uses correct pause image for Windows
- Omits systemd specific flags
- Common build flags function to be used by Linux and Windows
- Uses user configured image repository for Windows pause image
The old flag name doesn't make sense with the renamed API Priority and
Fairness feature, and it's still safe to change the flag since it hasn't done
anything useful in a released k8s version yet.
After the shift for init phases, GetStaticPodSpecs() from
app/phases/controlplane/manifests.go gets called on each control-plane
component sub-phase. This ends up calling the Printf from
AddExtraHostPathMounts() in app/phases/controlplane/volumes.go
multiple times printing the same volumes for different components.
- Remove the Printf call from AddExtraHostPathMounts().
- Print all volumes for a component in CreateStaticPodFiles() using klog
V(2).
Perhaps in the future a bigger refactor is needed here were a
single control-plane component spec can be requested instead of a
map[string]v1.Pod.
The selected key type is defined by kubeadm's --feature-gates option:
if it contains PublicKeysECDSA=true then ECDSA keys will be generated
and used.
By default RSA keys are used still.
Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>
When doing the very first upgrade from a cluster that contains the
source of truth in the ClusterStatus struct, the new kubeadm logic
will try to retrieve this information from annotations.
This changeset adds to both etcd and apiserver endpoint retrieval the
special case in which they won't retry if we are in such cases. The
logic will retry if we find any unknown error, but will not retry in
the following cases:
- etcd annotations do not contain etcd endpoints, but the overall list
of etcd pods is greater than 0. This means that we listed at least
one etcd pod, but they are missing the annotation.
- API server annotation is not found on the api server pod for a given
node name, but no errors aside from that one were found. This means
that the API server pod is present, but is missing the annotation.
In both cases there is no point in retrying, and so, this speeds up the
upgrade path when coming from a previous existing cluster.
While `ClusterStatus` will be maintained and uploaded, it won't be
used by the internal `kubeadm` logic in order to determine the etcd
endpoints anymore.
The only exception is during the first upgrade cycle (`kubeadm upgrade
apply`, `kubeadm upgrade node`), in which we will fallback to the
ClusterStatus to let the upgrade path add the required annotations to
the newly created static pods.
While kubeadm does not support CA rotation,
the users might still attempt to perform this manually.
For kubeconfig files, updating to a new CA is not reflected
and users need to embed new CA PEM manually.
On kubeconfig cert renewal, always keep the embedded CA
in sync with the one on disk.
Includes a couple of typo fixes.
- Add handlers for service account issuer metadata.
- Add option to manually override JWKS URI.
- Add unit and integration tests.
- Add a separate ServiceAccountIssuerDiscovery feature gate.
Additional notes:
- If not explicitly overridden, the JWKS URI will be based on
the API server's external address and port.
- The metadata server is configured with the validating key set rather
than the signing key set. This allows for key rotation because tokens
can still be validated by the keys exposed in the JWKs URL, even if the
signing key has been rotated (note this may still be a short window if
tokens have short lifetimes).
- The trust model of OIDC discovery requires that the relying party
fetch the issuer metadata via HTTPS; the trust of the issuer metadata
comes from the server presenting a TLS certificate with a trust chain
back to the from the relying party's root(s) of trust. For tests, we use
a local issuer (https://kubernetes.default.svc) for the certificate
so that workloads within the cluster can authenticate it when fetching
OIDC metadata. An API server cannot validly claim https://kubernetes.io,
but within the cluster, it is the authority for kubernetes.default.svc,
according to the in-cluster config.
Co-authored-by: Michael Taufen <mtaufen@google.com>
It turns out that the dual-stack feature enabled doesn't mean that
the cluster MUST be dual-stack, it only indicates that it MAY be
dual-stack but CAN be single-stack.
We should relax the validation to allow single-stack clusters
with dual-stack enabled.
If a Node name in the cluster is already taken and this Node is Ready,
prevent TLS bootsrap on "kubeadm join" and exit early.
This change requires that a new ClusterRole is granted to the
"system:bootstrappers:kubeadm:default-node-token" group to be
able get Nodes in the cluster. The same group already has access
to obtain objects such as the KubeletConfiguration and kubeadm's
ClusterConfiguration.
The motivation of this change is to prevent undefined behavior
and the potential control-plane breakdown if such a cluster
is racing to have two nodes with the same name for long periods
of time.
The following values are validated in the following precedence
from lower to higher:
- actual hostname
- NodeRegistration.Name (or "--node-name") from JoinConfiguration
- "--hostname-override" passed via kubeletExtraArgs
If the user decides to not let kubeadm know about a custom node name
and to instead override the hostname from a kubelet systemd unit file,
kubeadm will not be able to detect the problem.
- Extend the exponential backoff for add/remove/... retry to
11 steps ~=106 seconds. From experiments for 3 and more members
the race can take more that ~=26 seconds.
- Increase the dialTimeout for client creation to 40 seconds.
20 seconds seems racy for 3 and more members.
For the etcd client, amend AddMember() to handle a very
rare bug when multiple members can end up with the same
name. Match the member peer address and assign it the name of
the member we are adding. For the rest of the members with missing
names use their member IDs as name. The etcd node is not disrupted
by the unknown names.
The important aspects are:
- The number of members of the initial cluster must match
the members in the cluster.
- The member we are current adding is present in the initial cluster.
As part of #68522, Switching off the cAdvisor v1 Json API that we expose
directly. These include /stats/, /stats/container, /stats/{podName}/{containerName},
and /stats/{namespace}/{podName}/{uid}/{containerName}
The CoreDNS GA feature-gate in kubeadm was deprecated since 1.13.
The k8s policy is to remove the gate 2 releases after it transitions
to GA:
https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecation
We kept it around for longer to prevent existing setups from breaking
as it caused minimal maintenance overhead.
This creates a new EndpointSliceProxying feature gate to cover EndpointSlice
consumption (kube-proxy) and allow the existing EndpointSlice feature gate to
focus on EndpointSlice production only. Along with that addition, this enables
the EndpointSlice feature gate by default, now only affecting the controller.
The rationale here is that it's really difficult to guarantee all EndpointSlices
are created in a cluster upgrade process before kube-proxy attempts to consume
them. Although masters are generally upgraded before nodes, and in most cases,
the controller would have enough time to create EndpointSlices before a new node
with kube-proxy spun up, there are plenty of edge cases where that might not be
the case. The primary limitation on EndpointSlice creation is the API rate limit
of 20QPS. In clusters with a lot of endpoints and/or with a lot of other API
requests, it could be difficult to create all the EndpointSlices before a new
node with kube-proxy targeting EndpointSlices spun up.
Separating this into 2 feature gates allows for a more gradual rollout with the
EndpointSlice controller being enabled by default in 1.18, and EndpointSlices
for kube-proxy being enabled by default in the next release.
The warning message
```
[config] WARNING: Ignored YAML document with GroupVersionKind ...
```
is printed for all GVKs that are not part of the kubeadm core types.
This is wrong as the component config types are supported and successfully
parsed and used despite the fact that the warning is printed for them too.
Hence this simple fix first checks if the group of the GVK is a supported
component config group and the warning is printed only if it's not.
Signed-off-by: Rostislav M. Georgiev <rostislavg@vmware.com>
kubeadm deploys the apiserver, controller-manager and the scheduler
using liveness probes.
The bind-address option is used to configure the probe address, in
case this is configured with an unspecified address, the probe
will fail. When using an unspecified address the probe host field is
left empty, otherwise the bind-address is used.
The function validateKubeConfig() can end up comparing
a user generated kubeconfig to a kubeconfig generated by kubeadm.
If a user kubeconfig has a CA that is base64 encoded with whitespace,
if said kubeconfig is loaded using clientcmd.LoadFromFile()
the CertificateAuthorityData bytes will be decoded from base64
and placed in the v1.Config raw. On the other hand a kubeconfig
generated by kubeadm will have the ca.crt parsed to a Certificate
object with whitespace ignored in the PEM input.
Make sure that validateKubeConfig() tolerates whitespace differences
when comparing CertificateAuthorityData.
kubeadm removed the deprecated "--address" flag for controller-manager
and scheduler in favor of "--bind-address"
We should use bind-address to configure the manifest probe addresses.