Old stored services will not have the `clusterIPs` field when read back
without this.
This includes some renaming for clarity and expanded comments, and a new
test for default on read.
When using a HTTP probe, the request will now have a "Accept" header by default with the "*/*" (accept all)
Most tools do add this header (see curl) so it's a reasonable expectation that http probe has it as well
With the graduation of seccomp to GA we automatically convert the
deprecated seccomp profile annotation `docker/default` to
`runtime/default`. This means that we now have to automatically allow
`runtime/default` if a user specifies `docker/default` and vice versa in
an allowed PSP seccomp profile.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
Service has had a problem since forever:
- User creates a service type=LoadBalancer
- We silently allocate them a NodePort
- User changes type to ClusterIP
- We fail the operation because they did not clear NodePort
They never asked for or used the NodePort!
Dual-stack introduced some dependent fields that get auto-wiped on
updates. This carries it further.
If you squint, you can see Service as a big, messy discriminated union,
with type as the discriminator. Ignoring fields for non-selected
union-modes seems right.
This introduces the potential for an apply loop. Specifically, we will
accept YAML that we did not previously accept. Apply could see the
field in local YAML and not in the server and repeatedly try to patch it
in. But since that YAML is currently an error, it seems like a very low
risk. Almost nobody actually specifies their own NodePort values.
To mitigate this somewhat, we only auto-wipe on updates. The same YAML
would fail to create. This is a little inconsistent. We could
auto-wipe on create, too, at the risk of more potential impact.
To do this properly, we need to know the old and new values, which means
we can not do it in defaulting or conversion. So we do it in strategy.
This change also adds unit tests and updates e2e tests to rely on and
verify this behavior.
The main goal was to cover retrieval of a PVC from the apiserver when
it isn't known yet. This is achieved by adding PVCs and (for the sake
of completeness) PVs to the reactor, but not the controller, when a
special annotation is set. The approach with a special annotation was
chosen because it doesn't affect other tests.
The other test cases were added while checking the existing tests
because (at least at first glance) the situations seemed to be not
covered.
Generally try to waive away folks who see a particular event stream
and feel tempted to extrapolate and build tooling that expects the
same underlying resource transition chain to continue to produce a
similar event stream as the underlying components evolve and are
updated. New controllers should not be constrained to be
backwards-compatible with previous versions with regard to Event
emission. This is distinct from the Event type itself, which has the
usual Kubernetes-API compatibility commitments for versioned types.
The EventTTL default has been 1h since 7e258b85bd (Reduce TTL for
events in etcd from 48hrs to 1hr, 2015-03-11, #5315), and remains so
today:
$ git --no-pager log -1 --format='%h %s' origin/master
8e5c02255c Merge pull request #90942 from ii/ii-create-pod%2Bpodstatus-resource-lifecycle-test
$ git --no-pager grep EventTTL: 8e5c02255c cmd/kube-apiserver/app/options/options.go
8e5c02255cc:cmd/kube-apiserver/app/options/options.go: EventTTL: 1 * time.Hour,
In this space [1,2]:
To avoid filling up master's disk, a retention policy is enforced:
events are removed one hour after the last occurrence. To provide
longer history and aggregation capabilities, a third party solution
should be installed to capture events.
...
Note: It is not guaranteed that all events happening in a cluster
will be exported to Stackdriver. One possible scenario when events
will not be exported is when event exporter is not running
(e.g. during restart or upgrade). In most cases it's fine to use
events for purposes like setting up metrics and alerts, but you
should be aware of the potential inaccuracy.
...
To prevent disturbing your workloads, event exporter does not have
resources set and is in the best effort QOS class, which means that
it will be the first to be killed in the case of resource
starvation.
Although that's talking more about export from etcd -> external
storage, and not about cluster components submitting events to etcd.
[1]: https://kubernetes.io/docs/tasks/debug-application-cluster/events-stackdriver/
[2]: https://github.com/kubernetes/website/pull/4155/files#diff-d8eb69c5436aa38b396d4f3ed75e4792R10
As part of externalizing this function to the k8s.io/component-helpers repo,
this commit simplifies the function signature and makes its 2 helpers private
(nodeSelectorRequirementsAsSelector and nodeSelectorRequirementsAsFieldSelector).
Normally, the PV controller knows about the PVC that triggers the
creation of a PV before it sees the PV, because the PV controller must
set the volume.beta.kubernetes.io/storage-provisioner annotation that
tells an external provisioner to create the PV.
When restarting, the PV controller first syncs its caches, so that
case is also covered.
However, the creator of a PVC might decided to set that annotation
itself to speed up volume creation. While unusual, it's not forbidden
and thus part of the external Kubernetes API. Whether it makes sense
depends on the intentions of the user.
When that is done and there is heavy load, an external provisioner
might see the PVC and create a PV before the PV controller sees the
PVC. If the PV controller then encounters the PV before the PVC, it
incorrectly concludes that the PV needs to be deleted instead of being
bound.
The same issue occurred earlier for external binding and the existing
code for looking up a PVC in the cache or in the apiserver solves the
issue also for volume provisioning, it just needs to be enabled also
for PVs without the pv.kubernetes.io/bound-by-controller annotation.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlicecontrollers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxier at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depends on IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
the third phase of dual stack is a very complex change in the API,
basically it introduces Dual Stack services. Main changes are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field IPFamilyPolicyType that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack
- It pluralizes ClusterIP to ClusterIPs.
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IP4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
Currently, if you have a PDB with 0 disruptions
available and you attempt to evict a non-healthy
pod, the eviction request will always fail. This
is because the eviction API does not currently
take in to account that the pod you are removing
is the unhealthy one.
This commit accounts for trying to evict an
unhealthy pod as long as there are enough healthy
pods to satisfy the PDB's requirements. To
protect against race conditions, a ResourceVersion
constraint is enforced. This will ensure that
the target pod does not go healthy and allow
any race condition to occur which might disrupt
too many pods at once.
This commit also eliminates superfluous class to
DeepCopy for the deleteOptions struct.
In #56164, we had split the reject rules for non-ep existing services
into KUBE-EXTERNAL-SERVICES chain in order to avoid calling KUBE-SERVICES
from INPUT. However in #74394 KUBE-SERVICES was re-added into INPUT.
As noted in #56164, kernel is sensitive to the size of INPUT chain. This
patch refrains from calling the KUBE-SERVICES chain from INPUT and FORWARD,
instead adds the lb reject rule to the KUBE-EXTERNAL-SERVICES chain which will be
called from INPUT and FORWARD.
when DefaultPodTopologySpread feature is enabled
If SelectorSpreadPriority is in use, PodTopologySpread gets inevitably enabled.
When only EvenPodsSpreadPriority is in use, PodTopologySpread is configured without system defaults.
Change-Id: I2389a585cd8ad0bd35b0d2acae1665cd46908b3e
When a pod is deleted, it is given a deletion timestamp. However the
pod might still run for some time during graceful shutdown. During
this time it might still produce CPU utilization metrics and be in a
Running phase.
Currently the HPA replica calculator attempts to ignore deleted pods
by skipping over them. However by not adding them to the ignoredPods
set, their metrics are not removed from the average utilization
calculation. This allows pods in the process of shutting down to drag
down the recommmended number of replicas by producing near 0%
utilization metrics.
In fact the ignoredPods set is misnomer. Those pods are not fully
ignored. When the replica calculator recommends to scale up, 0%
utilization metrics are filled in for those pods to limit the scale
up. This prevents overscaling when pods take some time to startup. In
fact, there should be 4 sets considered (readyPods, unreadyPods,
missingPods, ignoredPods) not just 3.
This change renames ignoredPods as unreadyPods and leaves the scaleup
limiting semantics. Another set (actually) ignoredPods is added to
which delete pods are added instead of being skipped during
grouping. Both ignoredPods and unreadyPods have their metrics removed
from consideration. But only unreadyPods have 0% utilization metrics
filled in upon scaleup.
When trying to fix a dockershim issue, there were not any unit tests
for dockershim/exec.go and it was difficult to add the corresponding
unit test for the bug.
This adds the unit tests for avoiding such situation in the future.
Fix stupid golang loop variable closure thing.
Also, if we fail to initially set up the rules for one family, don't
try to set up a canary. eg, on the CI hosts, the kernel ip6tables
modules are not loaded, so any attempt to call ip6tables will fail.
Just log those errors once at startup rather than once a minute.