- Name the function GetConfigMapWithShortRetry to make it
easier to understand that the function uses a very short timeout.
Add a note that this function should be used in cases where there is
a fallback to local config.
- Apply a custom hardcoded interval of 50ms and timeout of 350ms to it.
Previously the function used exponential backoff with 5 steps up to ~340ms.
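A minimal sketch of what such a helper could look like, assuming a
client-go clientset; the package and surrounding details are
illustrative, not kubeadm's exact code:

  package apiclient

  import (
      "context"
      "time"

      v1 "k8s.io/api/core/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/util/wait"
      "k8s.io/client-go/kubernetes"
  )

  // GetConfigMapWithShortRetry retries for only ~350ms, so callers
  // must have a fallback to local config if the fetch fails.
  func GetConfigMapWithShortRetry(client kubernetes.Interface, namespace, name string) (*v1.ConfigMap, error) {
      var cm *v1.ConfigMap
      err := wait.PollUntilContextTimeout(context.Background(),
          50*time.Millisecond, 350*time.Millisecond, true,
          func(ctx context.Context) (bool, error) {
              var getErr error
              cm, getErr = client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
              return getErr == nil, nil // retry until the short timeout expires
          })
      return cm, err
  }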
Switch to PollUntilContextTimeout() everywhere to allow
usage of the timeouts exposed in the kubeadm API. Exponential backoff
options are more difficult to expose in this regard and a bit too
detailed for the common user - i.e. they have "steps", "factor" and so on.
Replace the usage of the deprecated wait.Poll() and
wait.PollImmediate() functions with wait.PollUntilContextTimeout().
Since we don't have piping of context around kubeadm,
use context.Background() everywhere.
Some wait.Poll() functions were converted to "immediate" as there
is no reason for them not to be. This is done for consistency.
Replace the only instance of wait.JitterUntil with
wait.PollUntilContextTimeout. JitterUntil is not deprecated,
but this is also done for consistency.
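An illustrative before/after of the replacement; the condition
function and durations here are placeholders for the various call
sites:

  package example

  import (
      "context"
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
  )

  // someCheck is a hypothetical condition standing in for the real
  // kubeadm poll conditions.
  func someCheck(ctx context.Context) (bool, error) { return true, nil }

  func poll() error {
      // Before (deprecated):
      //   wait.PollImmediate(500*time.Millisecond, 30*time.Second,
      //       func() (bool, error) { ... })
      // After: context-aware, with "immediate" as an explicit argument.
      return wait.PollUntilContextTimeout(
          context.Background(), // no context piping in kubeadm
          500*time.Millisecond, 30*time.Second,
          true, // immediate: run the condition before the first tick
          someCheck)
  }

Making "immediate" an explicit argument is also what made converting
the non-immediate wait.Poll() call sites a conscious choice.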
Don't create admin rolebindings when --kubeconfig is set to a
non-default value.
Fixes: https://github.com/kubernetes/kubeadm/issues/2992
Signed-off-by: Mario Valderrama <mario.valderrama@ionos.com>
If the user deletes /var/lib/kubelet manually, "reset" will throw
an error that the dir is missing. Instead of handling this error,
print it as a warning and skip unmounting of directories inside it.
This allows "reset" to remain reentrant; it can be called
even if "init/join" have not been called yet and some of the
k8s directories on a node do not exist.
Continue to error out on individual unmount errors.
Remove the function absoluteKubeletRunDirectory() and
call filepath.EvalSymlinks() directly.
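A minimal sketch of the tolerant behavior described above; the
function name and message text are illustrative:

  package reset

  import (
      "fmt"
      "os"
      "path/filepath"
  )

  // unmountKubeletDirectory is an illustrative stand-in for the real
  // reset code; only the missing-directory handling is shown.
  func unmountKubeletDirectory(kubeletDir string) error {
      // Resolve symlinks inline; absoluteKubeletRunDirectory() is gone.
      absDir, err := filepath.EvalSymlinks(kubeletDir)
      if err != nil {
          if os.IsNotExist(err) {
              // Warn and skip so "reset" stays reentrant even if
              // "init/join" never ran on this node.
              fmt.Printf("[reset] WARNING: %q does not exist; skipping unmount\n", kubeletDir)
              return nil
          }
          return err
      }
      _ = absDir // ... unmount directories inside absDir, erroring on failures
      return nil
  }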
Currently, timeouts are only accessible if a kubeadm runtime.Object{}
like InitConfiguration is passed around.
Any time a config is loaded or defaulted, store the Timeouts
structure in a thread-safe way in the main kubeadm API package
with SetActiveTimeouts(). Optionally, a deep-copy can be
performed before calling SetActiveTimeouts(). Make this struct
accessible with GetActiveTimeouts(). Ensure these functions
are thread safe.
On init(), make sure the struct is defaulted so that unit
tests can work with these values.
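A minimal sketch of the accessor pattern, assuming a simplified
Timeouts type; the field and default value are illustrative:

  package kubeadm

  import (
      "sync"
      "time"
  )

  // Timeouts is a simplified stand-in for the internal kubeadm struct.
  type Timeouts struct {
      ControlPlaneComponentHealthCheck time.Duration // illustrative field
  }

  var (
      activeTimeoutsMu sync.RWMutex
      activeTimeouts   *Timeouts
  )

  // SetActiveTimeouts stores the struct; callers may pass a deep copy
  // first if they intend to keep mutating the original.
  func SetActiveTimeouts(t *Timeouts) {
      activeTimeoutsMu.Lock()
      defer activeTimeoutsMu.Unlock()
      activeTimeouts = t
  }

  // GetActiveTimeouts returns the currently active timeouts.
  func GetActiveTimeouts() *Timeouts {
      activeTimeoutsMu.RLock()
      defer activeTimeoutsMu.RUnlock()
      return activeTimeouts
  }

  func init() {
      // Default on package init so unit tests see populated values.
      SetActiveTimeouts(&Timeouts{ControlPlaneComponentHealthCheck: 4 * time.Minute})
  }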
When upconverting from v1beta3 to v1beta4, it appears there is no
easy way to migrate some of the timeout values such as:
ClusterConfiguration.APIServer.TimeoutForControlPlane
to a new location:
InitConfiguration.Timeouts.<some-timeout-field>
Yes, the internal InitConfiguration does embed a ClusterConfiguration,
but during conversion the ClusterConfiguration is converted from an
empty source.
K8s' API machinery has ways to register custom conversion functions,
such as v1beta3.ClusterConfiguration -> internal.InitConfiguration,
but these must be triggered explicitly with a decoder.
The overall migration of fields seems very awkward.
There might be hacks around that, such as storing intermediate state
while trying to make the fuzzer roundtrip happy, but instead
mutation functions can be implemented for the internal types when
calling kubeadm's migrate code. This seems much cleaner.
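A minimal sketch of such a migrate-time mutation function, with
heavily simplified stand-in types (the real fields nest under
APIServer and Timeouts, and SomeTimeoutField is hypothetical,
standing in for <some-timeout-field> above):

  package migrate

  import "time"

  // initConfiguration flattens the relevant fields for illustration.
  type initConfiguration struct {
      TimeoutForControlPlane *time.Duration // old v1beta3 location
      SomeTimeoutField       *time.Duration // hypothetical new location
  }

  // mutateTimeouts moves the old value to its new location after the
  // regular conversion has run from an empty source.
  func mutateTimeouts(cfg *initConfiguration) {
      if cfg.TimeoutForControlPlane != nil {
          cfg.SomeTimeoutField = cfg.TimeoutForControlPlane
          cfg.TimeoutForControlPlane = nil
      }
  }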
The struct is included in InitConfiguration, JoinConfiguration
and ResetConfiguration.
Add conversion and update defaulters and fuzzers.
Include a timeoututils.go that contains a function
to default the internal Timeouts struct.
Add a new v1beta4.ResetConfiguration.UnmountFlags field that
can be used to pass in Linux umount2() flags such as MNT_FORCE.
The default value continues to be 0 - i.e. no flags.
Instead of printing warnings when syscall.Unmount() returns errors,
store all the errors in an aggregate. Abort the reset operation if
at least one unmount error was encountered.
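A minimal sketch of the flag pass-through and error aggregation;
mount-point discovery is elided and names are illustrative:

  package reset

  import (
      "fmt"
      "syscall"

      utilerrors "k8s.io/apimachinery/pkg/util/errors"
  )

  // unmountAll collects every unmount failure instead of warning.
  func unmountAll(mountPoints []string, flags int) error {
      var errs []error
      for _, target := range mountPoints {
          // flags comes from ResetConfiguration.UnmountFlags; the
          // default 0 means no flags, MNT_FORCE maps to syscall.MNT_FORCE.
          if err := syscall.Unmount(target, flags); err != nil {
              errs = append(errs, fmt.Errorf("failed to unmount %q: %v", target, err))
          }
      }
      // Non-nil if at least one unmount failed; this aborts the reset.
      return utilerrors.NewAggregate(errs)
  }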
The function TryRunCommand() uses an exponential backoff,
which is good, but it's inconsistent and only used in a couple
of places.
Remove its usage in token.go#UpdateOrCreateTokens()
and switch to using the standard function used in other places -
PollUntilContextTimeout().
Remove wait.go#TryRunCommand(), as there are no other usages.
The function wait.go#WaitForKubeletAndFunc() has been used in
a number of places in kubeadm. It starts a go routine to wait for
the kubelet /healthz and in parallel starts another go routine
to wait for a custom function.
This logic is problematic. If kubeadm is waiting for the kubelet
in parallel with something that requires the kubelet, the right
solution would be to first wait for the kubelet serially and only
then proceed with the other action. The parallelism here, particularly
during "init", required an unwanted "initial timeout" of 40s before
the kubelet waiting even starts. In most cases, this makes the kubelet
waiter not even start, while the main point of waiting becomes
the "other action".
- Remove the function WaitForKubeletAndFunc() from the Waiter interface.
- Rename the function WaitForHealthyKubelet() to just WaitForKubelet()
to be consistent with the naming WaitForAPI().
- Update WaitForKubelet() to not use TryRunCommand() and instead
use PollUntilContextTimeout(), as sketched after this list.
- Remove the "initial timeout" of 40s in WaitForKubelet().
- Make both WaitForKubelet() and WaitForAPI() use similar error
handling and output.
- Update all usage of WaitForKubelet() to be a serial call before
any other action, such as another wait* call.
- Make the default wait timeout for the kubelet
/healthz to be 1 minute (kubeadmconstants.DefaultKubeletTimeout).
- Apply updates to all implementations of the Waiter interface.
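A minimal sketch of the updated WaitForKubelet(); the healthz URL,
the one-second interval, and error handling are assumptions for this
sketch:

  package apiclient

  import (
      "context"
      "net/http"
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
  )

  // WaitForKubelet polls the kubelet /healthz with no initial delay.
  func WaitForKubelet(healthzURL string, timeout time.Duration) error {
      return wait.PollUntilContextTimeout(
          context.Background(), time.Second, timeout, true,
          func(ctx context.Context) (bool, error) {
              req, err := http.NewRequestWithContext(ctx, http.MethodGet, healthzURL, nil)
              if err != nil {
                  return false, err
              }
              resp, err := http.DefaultClient.Do(req)
              if err != nil {
                  return false, nil // kubelet not up yet; keep polling
              }
              resp.Body.Close()
              return resp.StatusCode == http.StatusOK, nil
          })
  }

Callers then wait serially: first WaitForKubelet(), only then
WaitForAPI() or any other wait* call.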
Explicitly show in the help message that `kubeadm upgrade plan` can
only run on a node where "admin.conf" exists; normally, this is a
control plane node.
Signed-off-by: Dave Chen <dave.chen@arm.com>
The notes printed to the user from common.go when
loadConfig fails are outdated and incorrect.
If the config cannot be loaded, the user should not be instructed
to re-upload the config with kubeadm commands. Instead, they
should do it manually with kubectl.
On loadConfig() error, just wrap the error in a simple message
and show it to the user.
The current setup swallows IsNotFound errors for Node objects.
The underlying fetching of the init configuration uses
the Node object to construct an InitConfiguration for this
upgrade process, so if the Node is missing, the kube-config ConfigMap
will be reported as missing, which is incorrect.
The component connection between kube-apiserver and kubelet does not
require the "O" field on the Subject to be set to the
"system:masters" privileged group. It can be a less
privileged group like "kubeadm:cluster-admins".
Change the group in the apiserver-kubelet-client
certificate specification. This cert is passed to
--kubelet-client-certificate.
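A minimal sketch of the changed spec using client-go's certificate
Config helper; the CommonName shown is an assumption for
illustration:

  package certs

  import (
      "crypto/x509"

      certutil "k8s.io/client-go/util/cert"
  )

  // apiServerKubeletClientCertConfig sketches the new spec.
  func apiServerKubeletClientCertConfig() certutil.Config {
      return certutil.Config{
          CommonName: "kube-apiserver-kubelet-client", // assumed CN
          // Was []string{"system:masters"}; a less privileged group
          // is enough for the apiserver->kubelet connection.
          Organization: []string{"kubeadm:cluster-admins"},
          Usages:       []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
      }
  }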
The addition of the "super-admin.conf" functionality required
init.go's Client() to create RBAC rules on its first creation.
However, this created a problem with the "wait-control-plane" phase
of "kubeadm init", where a client is needed to connect to the
API server Discovery API's "/healthz" endpoint. The logic that ensures
the RBAC became the step where the API server was polled.
To avoid this, introduce a new InitData function, ClientWithoutBootstrap.
In "wait-control-plane" use this client, which has no permissions
(anonymous) but is sufficient to connect to "/healthz".
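A minimal sketch of a credential-less client probing "/healthz"; the
endpoint and TLS handling are illustrative (a real client should
verify the CA):

  package phases

  import (
      "context"

      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/rest"
  )

  // healthzOK probes /healthz without any credentials.
  func healthzOK(endpoint string) (bool, error) {
      cfg := &rest.Config{
          Host:            endpoint,
          TLSClientConfig: rest.TLSClientConfig{Insecure: true}, // sketch only
      }
      client, err := kubernetes.NewForConfig(cfg)
      if err != nil {
          return false, err
      }
      // The anonymous user has no RBAC permissions, but none are
      // needed for "/healthz".
      body, err := client.Discovery().RESTClient().
          Get().AbsPath("/healthz").Do(context.Background()).Raw()
      if err != nil {
          return false, nil // not healthy yet
      }
      return string(body) == "ok", nil
  }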
Pending changes here would be:
- Stop using the "/healthz", instead a regular REST client from
the kubelet cert/key can be constructed.
- Make the wait for kubelet / API server linear (not in go routines).
In EnsureAdminClusterRoleBindingImpl() there are a couple of
polls around CRB create calls. When testing the function
a short retry and a timeout are used. These introduce around
2x20 fake client "connections" / poll iterations under a couple
of test cases, adding about 2 seconds to the overall test time.
Given the polls in EnsureAdminClusterRoleBindingImpl()
are of type PollUntilContextTimeout() with "immediate" set to "true",
the short retry / timeout can be removed when testing,
because one poll iteration is guaranteed and the tested function
is at 100% coverage with reactors and test cases.
Poll CRB create calls for kubeadm:cluster-admins when using the
super-admin.conf credential. The prior create call that uses the
credential admin.conf was already polled. Polling this subsequent
call seems advisable to ensure that momentary errors in between
cannot trip EnsureAdminClusterRoleBindingImpl().
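A minimal sketch of wrapping the super-admin.conf create call in a
poll, mirroring the existing poll around the admin.conf create;
durations are illustrative:

  package phases

  import (
      "context"
      "time"

      rbacv1 "k8s.io/api/rbac/v1"
      apierrors "k8s.io/apimachinery/pkg/api/errors"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/util/wait"
      "k8s.io/client-go/kubernetes"
  )

  // createCRBWithPoll retries the create so that momentary errors
  // cannot trip the caller.
  func createCRBWithPoll(client kubernetes.Interface, crb *rbacv1.ClusterRoleBinding) error {
      return wait.PollUntilContextTimeout(context.Background(),
          500*time.Millisecond, 10*time.Second, true,
          func(ctx context.Context) (bool, error) {
              _, err := client.RbacV1().ClusterRoleBindings().Create(ctx, crb, metav1.CreateOptions{})
              if err == nil || apierrors.IsAlreadyExists(err) {
                  return true, nil
              }
              return false, nil // momentary error; keep polling
          })
  }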
- Update unit tests in certs_test.go related to the "renew" CLI command.
- In /init, (d *initData) Client(), make sure that the new logic
for bootstrapping an "admin.conf" user is performed by calling
EnsureAdminClusterRoleBinding() from the phases backend. Add an
"adminKubeConfigBootstrapped" flag that helps call this logic only
once per "kubeadm init" binary execution.
- In /phases/init include a new subphase for generating
the "super-admin.conf" file.
- In /phases/reset make sure the file "super-admin.conf" is
cleaned if present. Update unit tests.
- Register the new file in /certs/renewal, so that the
file is renewed if present. If not present, the common message "MISSING"
is shown. Same for other certs/kubeconfig files.
- In /kubeconfig, update the spec for admin.conf to use
the "kubeadm:cluster-admins" Group. A new spec is added for
the "super-admin.conf" file that uses the "system:masters" Group.
- Add a new function EnsureAdminClusterRoleBinding() that includes
logic to ensure that admin.conf contains a User that is properly
bound to the "cluster-admin" built-in ClusterRole; a sketch follows
after this list. This requires bootstrapping using the
"system:masters"-containing "super-admin.conf".
Add detailed unit tests for this new logic.
- In /upgrade#PerformPostUpgradeTasks() add logic to create the
"admin.conf" and "super-admin.conf" with the new, updated specs.
Add detailed unit tests for this new logic.
- In /upgrade#StaticPodControlPlane() ensure that renewal of
"super-admin.conf" is performed if the file exists.
Update unit tests.
- Add the new file name super-admin.conf and a function
to return its default path, GetSuperAdminKubeConfigPath().
- Add the ClusterAdminsGroupAndClusterRoleBinding object name.
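A minimal sketch of the ensure flow described above: try the create
with the admin.conf client first; only if that is forbidden
(admin.conf not yet bound to "cluster-admin") bootstrap with the
"system:masters"-backed super-admin.conf client. Names are
illustrative, not kubeadm's exact API, and the polling shown earlier
is omitted for brevity:

  package kubeconfig

  import (
      "context"

      rbacv1 "k8s.io/api/rbac/v1"
      apierrors "k8s.io/apimachinery/pkg/api/errors"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
  )

  func ensureAdminClusterRoleBinding(adminClient, superAdminClient kubernetes.Interface) error {
      crb := &rbacv1.ClusterRoleBinding{
          ObjectMeta: metav1.ObjectMeta{Name: "kubeadm:cluster-admins"},
          RoleRef: rbacv1.RoleRef{
              APIGroup: rbacv1.GroupName,
              Kind:     "ClusterRole",
              Name:     "cluster-admin",
          },
          Subjects: []rbacv1.Subject{{
              APIGroup: rbacv1.GroupName,
              Kind:     rbacv1.GroupKind,
              Name:     "kubeadm:cluster-admins",
          }},
      }
      ctx := context.Background()
      _, err := adminClient.RbacV1().ClusterRoleBindings().Create(ctx, crb, metav1.CreateOptions{})
      if err == nil || apierrors.IsAlreadyExists(err) {
          return nil // admin.conf is already bound to cluster-admin
      }
      if !apierrors.IsForbidden(err) {
          return err
      }
      // admin.conf cannot self-bind; bootstrap with super-admin.conf,
      // which carries the "system:masters" group.
      _, err = superAdminClient.RbacV1().ClusterRoleBindings().Create(ctx, crb, metav1.CreateOptions{})
      if err != nil && !apierrors.IsAlreadyExists(err) {
          return err
      }
      return nil
  }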
These were found with a modified klog that enables "go vet" to check klog call
parameters:
cmd/kubeadm/app/features/features.go:149:4: printf: k8s.io/klog/v2.Warningf format %t has arg v of wrong type string (govet)
  klog.Warningf("Setting deprecated feature gate %s=%t. It will be removed in a future release.", k, v)
test/images/sample-device-plugin/sampledeviceplugin.go:147:5: printf: k8s.io/klog/v2.Errorf does not support error-wrapping directive %w (govet)
  klog.Errorf("error: %w", err)
test/images/sample-device-plugin/sampledeviceplugin.go:155:3: printf: k8s.io/klog/v2.Errorf does not support error-wrapping directive %w (govet)
  klog.Errorf("Failed to add watch to %q: %w", triggerPath, err)
staging/src/k8s.io/code-generator/cmd/prerelease-lifecycle-gen/prerelease-lifecycle-generators/status.go:207:5: printf: k8s.io/klog/v2.Fatalf does not support error-wrapping directive %w (govet)
  klog.Fatalf("Package %v: unsupported %s value: %q :%w", i, tagEnabledName, ptag.value, err)
staging/src/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:286:3: printf: (k8s.io/klog/v2.Verbose).Infof format %s reads arg #1, but call has 0 args (govet)
  klog.V(4).Infof("Node %s missing in vSphere cloud provider cache, trying node informer")
staging/src/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:302:3: printf: (k8s.io/klog/v2.Verbose).Infof format %s reads arg #1, but call has 0 args (govet)
  klog.V(4).Infof("Node %s missing in vSphere cloud provider caches, trying the API server")
Turn on the FeatureGate MergeCLIArgumentsWithConfig to keep the legacy way of
managing ignorePreflightErrors, which means the value defined by the flag
`ignore-preflight-errors` will be merged with the value of `ignorePreflightErrors`
defined in the config file.
Otherwise, the value defined by the flag will replace the value from the config
file, if set.
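A minimal sketch of the two behaviors, assuming simple string slices
for the flag and config-file values; the real gate check goes through
kubeadm's features package:

  package preflight

  // resolveIgnorePreflightErrors sketches the merge-vs-replace logic.
  func resolveIgnorePreflightErrors(mergeGateEnabled bool, flagValue, configValue []string) []string {
      if mergeGateEnabled {
          // Legacy behavior: union of flag and config values.
          seen := map[string]bool{}
          merged := []string{}
          for _, v := range append(append([]string{}, configValue...), flagValue...) {
              if !seen[v] {
                  seen[v] = true
                  merged = append(merged, v)
              }
          }
          return merged
      }
      // New behavior: the flag, if set, replaces the config value.
      if len(flagValue) > 0 {
          return flagValue
      }
      return configValue
  }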
Signed-off-by: Dave Chen <dave.chen@arm.com>