When using the eviction API, if a pod already has
a non-zero DeletionTimestamp, we don't need to check
PDBs as it has already been marked for deletion.
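A minimal sketch of that ordering (deletePod and checkPDBs are placeholder hooks, not the real registry functions):

package eviction

import (
    "context"

    v1 "k8s.io/api/core/v1"
)

// evict sketches the short-circuit described above: a pod that already
// carries a deletion timestamp is deleted without consulting PDBs.
func evict(ctx context.Context, pod *v1.Pod,
    deletePod func(context.Context, *v1.Pod) error,
    checkPDBs func(context.Context, *v1.Pod) error) error {
    if pod.DeletionTimestamp != nil {
        // Already marked for deletion; skip the PDB check entirely.
        return deletePod(ctx, pod)
    }
    if err := checkPDBs(ctx, pod); err != nil {
        return err
    }
    return deletePod(ctx, pod)
}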
Several of the functions in pkg/registry/core/service/ipallocator were
moved to k8s.io/utils/net, but the original code was never updated to use
the vendored versions.
(utilnet's version of RangeSize does not have the IPv6 special case
that the original code did, so we need to move that to
NewAllocatorCIDRRange now.)
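A sketch of keeping that special case with the allocator (the helper name is ours, and the 65536 cap is assumed from the old code):

package ipallocator

import (
    "net"

    utilnet "k8s.io/utils/net"
)

// rangeSizeWithIPv6Cap applies the IPv6 special case that utilnet.RangeSize
// does not carry; a caller such as NewAllocatorCIDRRange would use this
// instead of calling utilnet.RangeSize directly.
func rangeSizeWithIPv6Cap(cidr *net.IPNet) int64 {
    size := utilnet.RangeSize(cidr)
    if utilnet.IsIPv6CIDR(cidr) && size > 65536 {
        size = 65536 // assumed cap, mirroring the old special case
    }
    return size
}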
If the dual-stack flag is enabled and the cluster is single-stack IPv6,
the allocator logic for service clusterIP does not properly handle rejecting
a request for an IPv4 family. Return a 422 Invalid on the ipFamily field
when the dual-stack flag is on (as it would be when it hits beta) and the
cluster is configured for single-stack IPv6.
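Roughly, the rejection can be expressed as a field.Invalid error on the ipFamily field, which the API machinery reports as a 422. This is a sketch with simplified gating, not the upstream validation:

package validation

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// rejectIPv4OnSingleStackIPv6 assumes the server already knows it is
// single-stack IPv6 and rejects any request for the IPv4 family.
func rejectIPv4OnSingleStackIPv6(requested *v1.IPFamily, fldPath *field.Path) field.ErrorList {
    allErrs := field.ErrorList{}
    if requested != nil && *requested == v1.IPv4Protocol {
        allErrs = append(allErrs, field.Invalid(fldPath, *requested,
            "this cluster is configured for single-stack IPv6"))
    }
    return allErrs
}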
The family is now defaulted or cleared in BeforeCreate/BeforeUpdate,
and is either inherited from the previous object (if nil or unchanged),
or set to the default strategy's family as necessary. The existing
family defaulting when a clusterIP is provided remains in the API
section. We add additional family defaulting at the time we allocate
the IP to ensure that IPFamily is a consequence of the ClusterIP
and to prevent accidental reversion. This defaulting also ensures that
old clients that submit a nil IPFamily for non-ClusterIP services
receive a default.
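A simplified sketch of that defaulting rule, using stand-in types rather than the real API structs:

package allocator

import "net"

// ipFamily and serviceSpec stand in for the relevant parts of the Service
// API (spec.ipFamily, spec.clusterIP); this illustrates the rule, it is not
// the upstream code.
type ipFamily string

const (
    ipv4 ipFamily = "IPv4"
    ipv6 ipFamily = "IPv6"
)

type serviceSpec struct {
    ClusterIP string
    IPFamily  *ipFamily
}

// defaultFamily: an existing family wins, otherwise the family follows the
// allocated ClusterIP, otherwise the cluster's default family is used, so
// old clients that send nil still receive a default.
func defaultFamily(spec *serviceSpec, clusterDefault ipFamily) {
    if spec.IPFamily != nil {
        return
    }
    if ip := net.ParseIP(spec.ClusterIP); ip != nil {
        f := ipv4
        if ip.To4() == nil {
            f = ipv6
        }
        spec.IPFamily = &f
        return
    }
    spec.IPFamily = &clusterDefault
}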
To properly handle validation, make the strategy and the validation code
path conditional on which configuration options are passed to service
storage. Move validation and preparation logic inside the strategy, where
it belongs. Service validation is now dependent on the configuration of
the server, and as such ValidateConditionalService needs to know what the
allowed families are.
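The shape of that dependency, sketched with illustrative names (not the actual ValidateConditionalService signature): storage passes in whichever families it was configured with, and validation rejects the rest.

package validation

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// validateFamilyAllowed rejects a requested family that is not among the
// families the server was configured to support.
func validateFamilyAllowed(requested *v1.IPFamily, allowed []v1.IPFamily, fldPath *field.Path) field.ErrorList {
    allErrs := field.ErrorList{}
    if requested == nil {
        return allErrs
    }
    valid := make([]string, 0, len(allowed))
    for _, f := range allowed {
        if f == *requested {
            return allErrs
        }
        valid = append(valid, string(f))
    }
    return append(allErrs, field.NotSupported(fldPath, *requested, valid))
}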
Currently, if you have a PDB set, it is possible for
a pod stuck in pending state to be prevented from
deletion even though there are in fact enough healthy
replicas.
This commit allows pods in Pending state to be removed.
This commit also adds associated unit tests.
related-bug: #80389
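A minimal sketch of the rule this change adds (the helper name is illustrative, not the upstream code):

package eviction

import (
    v1 "k8s.io/api/core/v1"
)

// canBypassPDB: a Pending pod has never become ready, so removing it cannot
// reduce the number of healthy replicas the budget is protecting.
func canBypassPDB(pod *v1.Pod) bool {
    return pod.Status.Phase == v1.PodPending
}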
The service allocator is used to allocate IP addresses for the
Service IP allocator and NodePorts for the Service NodePort
allocator. It uses a bitmap backed by etcd to store the allocations
and tries to allocate the resources directly from local memory
instead of from etcd, which can cause issues in environments with
high concurrency.
In deployments with multiple apiservers, the resource allocation
information can get out of sync; this is most noticeable with
NodePorts, for example:
1. apiserver A creates a service with NodePort X
2. apiserver B deletes the service
3. apiserver A creates the service again
If the allocation data of apiserver A wasn't refreshed after the
deletion performed by apiserver B, apiserver A fails the allocation because
its data is out of sync. The Repair loops solve the problem later,
but some use cases require improving the concurrency of the
allocation logic.
Instead of performing the Allocation and Release operations purely against
local data, we can check whether the local data is up to date with etcd
and operate on the most recent version of the data.
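A sketch of that idea, with stand-in types for the etcd-backed storage (the interface and names are illustrative):

package allocator

import "errors"

// errDataOutOfDate signals that another apiserver changed the stored bitmap
// first and the caller should retry against the refreshed data.
var errDataOutOfDate = errors.New("allocation data out of date")

// store stands in for the etcd-backed bitmap: Refresh returns the latest
// stored snapshot plus its resource version, and CAS writes a new snapshot
// only if that resource version is still current.
type store interface {
    Refresh() (bitmap []byte, resourceVersion string, err error)
    CAS(bitmap []byte, resourceVersion string) error
}

// allocate re-reads the latest snapshot, applies the allocation to it, and
// writes it back with a compare-and-swap instead of trusting local memory.
func allocate(s store, apply func(bitmap []byte) ([]byte, error)) error {
    latest, rv, err := s.Refresh()
    if err != nil {
        return err
    }
    updated, err := apply(latest)
    if err != nil {
        return err
    }
    if err := s.CAS(updated, rv); err != nil {
        return errDataOutOfDate
    }
    return nil
}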
This implementation allows a Pod to request multiple hugepage resources
of different sizes and mount hugepage volumes using the storage medium
HugePages-<size>, e.g.
spec:
  containers:
  - resources:
      requests:
        hugepages-2Mi: 2Mi
        hugepages-1Gi: 2Gi
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    - mountPath: /hugepages-1Gi
      name: hugepage-1gi
    ...
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi
NOTE: This is an alpha feature.
Feature gate HugePageStorageMediumSize must be enabled for it to work.
This combines the container names into a single list because spreading
them across a long, variable-length string isn't particularly useful in the
context of a streaming error message.
Remove the validation for pre-allocated hugepages on node level.
Validation is currently the only thing making it impossible to use
pre-allocated huge pages in more than one size.
We now have quite a few reports from real users that this feature is
welcome.
Such declarations make it impossible to use the package's exported identifiers after the declaration, or create confusion when reading the code.
Signed-off-by: clarklee92 <clarklee1992@hotmail.com>
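For illustration, a contrived example of the kind of declaration being removed:

package example

import "fmt"

func report(messages []string) {
    fmt.Println("reporting", len(messages), "messages") // package fmt is still visible here

    // This declaration shadows the imported fmt package: from here to the end
    // of the function, fmt.Println and every other exported identifier of the
    // package can no longer be referenced.
    fmt := "done"
    _ = fmt
    // fmt.Println(fmt) // would not compile: fmt is a string in this scope
}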
All objects with graceful deletion allow multiple DELETE calls in the pending
state. Namespace is the one outlier, and the error here predates graceful
deletion and finalizers. We should make this behavior consistent with other
calls and merely indicate success and return the state of the object, the
same as if there were pending metadata finalizers.
Clients that previously checked for a conflict error during delete to know
that the server is already deleting will now no longer receive an error
(as if the object were rapidly deleted). There is a small chance that some
clients have error checking for this state, but a much larger chance that
clients that want to trigger a delete of the namespace do not handle this
error case.
This was discovered in an e2e test that used the framework namespace and
triggered deletion of that namespace itself; the AfterEach step in e2e then
failed because the namespace was already pending deletion.
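A simplified sketch of the new behavior (not the real namespace registry code; startGracefulDeletion stands in for the normal deletion path):

package registry

import (
    v1 "k8s.io/api/core/v1"
)

// deleteNamespace returns success and the current object when the namespace
// is already pending deletion, matching how other objects with pending
// finalizers behave, instead of surfacing the old conflict-style error.
func deleteNamespace(ns *v1.Namespace, startGracefulDeletion func(*v1.Namespace) error) (*v1.Namespace, error) {
    if ns.DeletionTimestamp != nil {
        // Already being deleted: no error, just the current state.
        return ns, nil
    }
    if err := startGracefulDeletion(ns); err != nil {
        return nil, err
    }
    return ns, nil
}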