Fixes instances of #98213 (to ultimately complete #98213, linting is
required).
This commit fixes a few instances of a common mistake made when writing
parallel subtests or Ginkgo tests (basically any test in which the test
closure is created dynamically in a loop and the loop doesn't wait for
the test closure to complete).
I'm developing a very specific linter that detects this kind of mistake,
and these are the only violations it found in this repo (it's not
airtight, so there may be more).
In the case of Ginkgo tests, without this fix only the last entry the
loop iterates over is actually tested. In the case of parallel subtests I
think it's essentially the same problem, though possibly a bit different;
if I understand correctly it depends on the execution speed.
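A minimal sketch of the mistake and its fix using testing subtests (the same capture problem applies to Ginkgo It closures created in a loop), assuming the pre-Go 1.22 loop-variable semantics in use at the time:

```go
package example

import "testing"

// Broken: every parallel subtest closure captures the same loop variable,
// so by the time the parallel closures resume, tc refers to the last entry.
func TestBroken(t *testing.T) {
	testCases := []struct{ name, input string }{
		{"first", "a"},
		{"second", "b"},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			_ = tc.input // always sees the last test case
		})
	}
}

// Fixed: shadow the loop variable so each closure gets its own copy.
func TestFixed(t *testing.T) {
	testCases := []struct{ name, input string }{
		{"first", "a"},
		{"second", "b"},
	}
	for _, tc := range testCases {
		tc := tc // capture a per-iteration copy
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			_ = tc.input // sees the intended test case
		})
	}
}
```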
Waiting for the CI to confirm the tests still pass after this fix - since
this is likely the first time those test cases are actually executed,
they may be buggy or may be testing buggy code.
Another instance of this is in `test/e2e/storage/csi_mock_volume.go`; it
is still failing, so it has been left out of this commit and will be
addressed in a separate one.
We now partially drop support for seccomp annotations, which is planned
for v1.25 as part of the KEP:
https://github.com/kubernetes/enhancements/issues/135
Pod security policies are not touched by this change, and therefore we
have to keep the annotation key constants.
This means we only allow the annotations for backwards-compatibility
reasons, while synchronizing the field to the annotation is no longer
supported. Using the annotations for static pods is no longer supported
either.
Making the annotations fully non-functional will be deferred to a
future release.
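A minimal sketch, assuming the field-based API: the seccomp profile is set via the pod's securityContext, while the legacy annotation key (seccomp.security.alpha.kubernetes.io/pod) remains accepted for backwards compatibility but is no longer synchronized with the field. The pod name and image below are illustrative.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newRuntimeDefaultPod sets the seccomp profile through the securityContext
// field, the supported path going forward.
func newRuntimeDefaultPod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			SecurityContext: &corev1.PodSecurityContext{
				SeccompProfile: &corev1.SeccompProfile{
					Type: corev1.SeccompProfileTypeRuntimeDefault,
				},
			},
			Containers: []corev1.Container{
				{Name: "app", Image: "busybox"}, // illustrative image
			},
		},
	}
}
```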
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
NPD can run in "standalone" or "daemonset" mode in the GCE cluster tests.
In standalone mode NPD runs as a systemd unit; in daemonset mode it runs
as a pod.
If NPD is running in standalone mode and we detect via `systemctl status`
that the npd service does not exist, we should skip the rest of the
checks, such as looking at the journalctl logs or the pid.
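A rough sketch of that skip logic, where runOnNode is a hypothetical helper that executes a command on the node (for example via the e2e SSH utilities) and the "could not be found" match mirrors typical systemctl output:

```go
package example

import (
	"fmt"
	"strings"
)

// checkStandaloneNPD runs the standalone-mode checks only when the systemd
// unit actually exists; otherwise it returns early so the journalctl and pid
// checks are skipped.
func checkStandaloneNPD(runOnNode func(cmd string) (string, error)) error {
	out, err := runOnNode("systemctl status node-problem-detector.service")
	if err != nil {
		if strings.Contains(out, "could not be found") {
			// NPD is running in daemonset mode (as a pod); nothing to check here.
			return nil
		}
		return fmt.Errorf("unexpected systemctl failure: %v, output: %s", err, out)
	}
	// Safe to continue with the journalctl log and pid checks here.
	return nil
}
```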
Signed-off-by: David Porter <david@porter.me>
Besides, using the method might lead to a `concurrent map writes`
issue, per the discussion here: https://github.com/onsi/ginkgo/issues/970
Signed-off-by: Dave Chen <dave.chen@arm.com>
- update all the import statements (see the sketch after this list)
- run hack/pin-dependency.sh to change pinned dependency versions
- run hack/update-vendor.sh to update go.mod files and the vendor directory
- update the method signatures for custom reporters
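For the import updates, a minimal sketch of a migrated test file, assuming the bump is to Ginkgo v2 and its /v2 module path; the Describe/It content is illustrative:

```go
package example

// Before: import "github.com/onsi/ginkgo" (package name "ginkgo").
// After (assuming the bump is to Ginkgo v2): the module path gains a /v2
// suffix, so every test file's import block changes accordingly.
import (
	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

var _ = ginkgo.Describe("example", func() {
	ginkgo.It("still compiles after the import update", func() {
		gomega.Expect(true).To(gomega.BeTrue())
	})
})
```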
Signed-off-by: Dave Chen <dave.chen@arm.com>
Exploring termination revealed we have race conditions in certain
parts of pod initialization and termination. To better catch these
issues, refactor the existing test so it can be reused, and then test
a number of alternate scenarios.
Some storage tests deploy DaemonSets which hard-code /var/lib/kubelet as the
root directory for kubelet registration and the pod directory. There was
already a parameter which allowed specifying the root directory, just with a
very confusing name ("--volume-dir") and matching field name. A
--kubelet-root-dir parameter gets added because this may make it easier to
find the parameter, with the old name preserved as an alias for the same
field for backwards compatibility.
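A minimal sketch of how both spellings can bind to the same field, using hypothetical local names rather than the actual test driver's flag wiring:

```go
package example

import "flag"

type driverConfig struct {
	KubeletRootDir string
}

// registerFlags keeps the old --volume-dir spelling working while adding the
// clearer --kubelet-root-dir name; both flags bind to the same field, so
// either one configures the kubelet root directory used by the test driver.
func registerFlags(fs *flag.FlagSet, cfg *driverConfig) {
	fs.StringVar(&cfg.KubeletRootDir, "volume-dir", "/var/lib/kubelet",
		"alias for --kubelet-root-dir, kept for backwards compatibility")
	fs.StringVar(&cfg.KubeletRootDir, "kubelet-root-dir", "/var/lib/kubelet",
		"kubelet root directory used for plugin registration and pod volumes")
}
```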
This test has been part of the Conformance suite since at least
Kubernetes 1.2 (2015-10-xx). Some years later, around 2018-10-xx, we
drafted a rigorous set of rules for tests to follow in order to be
eligible for promotion to Conformance. We explicitly disallowed any
tests that check for specific Events, since they are not an API, and we
make no guarantees about their contents nor their delivery.
Unfortunately, we neglected to go through the existing corpus of
Conformance tests with a fine-toothed comb after drafting these rules.
The very nature of what this test is attempting to exercise and verify
is specific Events, and their delivery, thus making it ineligible for
Conformance. We should have caught and demoted this test back then.
Better late than never?
* De-share the Handler struct in core API
An upcoming PR adds a handler that only applies on one of these paths.
Having fields that don't work seems bad.
This never should have been shared. Lifecycle hooks are like a "write"
while probes are more like a "read". HTTPGet and TCPSocket don't really
make sense as lifecycle hooks (but I can't take that back). When we add
gRPC, it is EXPLICITLY a health check (defined by gRPC) not an arbitrary
RPC - so a probe makes sense but a hook does not.
In the future I can also see adding lifecycle hooks that don't make
sense as probes. E.g. 'sleep' is a common lifecycle request. The only
option is `exec`, which requires having a sleep binary in your image.
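A rough sketch of the resulting split, using local stand-in types rather than the exact core API definitions: the probe side can grow a gRPC health-check action without that field appearing (and doing nothing) on lifecycle hooks.

```go
package example

// Stand-in action types, mirroring the shape of the core API actions.
type ExecAction struct{ Command []string }

type HTTPGetAction struct {
	Path string
	Port int
}

type TCPSocketAction struct{ Port int }

type GRPCAction struct {
	Port    int32
	Service *string
}

// ProbeHandler: actions a kubelet probe can take, including gRPC health checks.
type ProbeHandler struct {
	Exec      *ExecAction
	HTTPGet   *HTTPGetAction
	TCPSocket *TCPSocketAction
	GRPC      *GRPCAction
}

// LifecycleHandler: actions for postStart/preStop hooks; no GRPC field, since
// gRPC is explicitly a health check and not an arbitrary RPC hook.
type LifecycleHandler struct {
	Exec      *ExecAction
	HTTPGet   *HTTPGetAction
	TCPSocket *TCPSocketAction
}
```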
* Run update scripts
This test case requires a special test-handler setup which is only done
for GCE clusters created by the kube-up scripts. Let's skip the test when
run under other providers.
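A minimal sketch using the existing e2e skipper helper; the spec description and body are placeholders:

```go
package example

import (
	"github.com/onsi/ginkgo/v2"

	e2eskipper "k8s.io/kubernetes/test/e2e/framework/skipper"
)

// The test-handler is only installed on clusters brought up by the GCE
// kube-up scripts, so the spec bails out early on any other provider.
var _ = ginkgo.It("should reboot the node via the preinstalled test-handler", func() {
	e2eskipper.SkipUnlessProviderIs("gce")
	// ... the actual test body runs only on supported providers ...
})
```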
We now use a host local exec instead of SSH commands to simplify the
test and make the result more robust.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
This assumes that SSH via bastion works if the `KUBE_SSH_BASTION`
environment variable is set, which is the case for
`pull-kubernetes-e2e-gce-correctness`.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
the podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
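A sketch of that call path with hypothetical stand-ins for the kubelet's internal types (the real UpdatePodOptions, SyncPodKill, and PodWorkers live in pkg/kubelet and pkg/kubelet/types):

```go
package example

// Hypothetical mirrors of the pod worker types, just to show the shape of the call.
type SyncPodType int

const SyncPodKill SyncPodType = iota

type Pod struct{ Name string }

type UpdatePodOptions struct {
	UpdateType SyncPodType
	Pod        *Pod
}

type PodWorkers interface {
	UpdatePod(options UpdatePodOptions)
}

// killPod shows the new single entry point: anything that used to call the
// podKiller now routes the request through the pod worker's UpdatePod loop.
func killPod(workers PodWorkers, pod *Pod) {
	workers.UpdatePod(UpdatePodOptions{
		UpdateType: SyncPodKill,
		Pod:        pod,
	})
}
```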
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state, while other components
simply check whether they are in the window by UID.
Removing containers no longer blocks final pod deletion in the
API server and is handled as background cleanup. Node shutdown
no longer marks pods as failed, as they can be restarted in the
next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
The e2e test that creates/deletes pods rapidly and verifies their status
was reporting a very long timing:
timings total=12.211347385s t=491ms run=2s execute=450402h8m25s
in a few scenarios. Add error checks that clarify when this happens
and why. Report p50/75/90/99 latencies on teardown, as observed from
the test, as a baseline for future changes.
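As a rough illustration of the latency reporting, a nearest-rank percentile helper over recorded durations; the actual aggregation in the test may differ:

```go
package example

import (
	"sort"
	"time"
)

// percentile returns the q-th percentile (0-100) of the observed latencies
// using a simple nearest-rank method, e.g. for the p50/75/90/99 numbers
// printed on teardown.
func percentile(latencies []time.Duration, q float64) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * q / 100.0)
	return sorted[idx]
}
```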
Before assuming that a certain host runs an SSH server, we now test its
`SSHPort` for connectivity. This means that the test `should be able to
run crictl on the node` can now be more failure-proof by checking only
hosts where SSH actually runs. Besides that, we can also test all hosts
and not only the first one.
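A minimal sketch of such a reachability probe, assuming a plain TCP dial is enough to decide whether to include a host; the real check in the e2e SSH helpers may differ:

```go
package example

import (
	"net"
	"time"
)

// sshReachable reports whether a host accepts TCP connections on its SSH
// port, so the test only runs commands on hosts where SSH actually listens.
func sshReachable(host, port string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, port), timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
```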
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
This replaces the check of mount propagation to/from the host OS mount
namespace with a similar check against the mount namespace where the
kubelet is running (which may or may not be the same mount namespace as
the host OS).
This addresses issue #100259