The device plugin test in https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd
has been flaky for a while now when it runs on the test infrastructure.
Locally running this test resulted in test passing without issues.
Based on the existing logs, it is not clear why podresource
API endpoint is returning 3 pods rather than the expected
two pods (device plugin pod and the test pod requesting
devices). For more clarity and debugaability on why an
addtional pod seems to be appearing we expose the output
from podresource API endpoint.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The change made in https://github.com/kubernetes/kubernetes/pull/112644
resulted in an update to the rejection message. In the memory manager
node e2e test, we still checked against the old expected error message
giving the impression that the pod succeeded to run even though it failed
as expected mainly because the check wasn't performed correctly.
In this patch, we update to the correct rejection message to make sure
that the memory manager is no longer failing.
NOTE: This test is supposed to run on multi NUMA systems and if the
underlying node does not have multi NUMA nodes, the test is skipped
which is what happens in upstream test infrastructure as it is mainly
composed of single NUMA nodes. Because of this, this test failure
wasn't evident via testgrid.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
add tests to ensure the podresources metrics are exposed,
and basic sanity tests for their values.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The "todo" packages were necessary while moving code around to avoid hitting
cyclic dependencies. Now that any sub package can depend on the framework, they
are no longer needed and the code can be moved into the normal sub packages.
This reduces the size of the test/e2e/framework itself. Because it does not
gather metrics data anymore by default, E2E test suites must set their
callbacks function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/todo/metrics/init".
This reduces the size of the test/e2e/framework itself. Because it does not
check nodes anymore by default, E2E test suites must set their own check
function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/todo/node/init".
This reduces the size of the test/e2e/framework itself. Because it does not
dump anything anymore by default, E2E test suites must set their own dump
function or set the original one by importing
"k8s.io/kubernetes/test/e2e/framework/debug/init".
Some scripts and tools still relied on the deprecated flags, the ones
which are about to be removed.
This is intentionally not a complete removal of all those flags in the entire
repo. This would lead to much more code churn also in places where commands
still accept the flags because they use klog directly.
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Ginkgo is now writing the JUnit file itself. The -report-dir parameter is used
as fallback for enabling JUnit output in case that users haven't migrated to
the new -junit-report parameter.
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: Dave Chen <dave.chen@arm.com>
Besides, the using of method might lead to a `concurrent map writes`
issue per the discussion here: https://github.com/onsi/ginkgo/issues/970
Signed-off-by: Dave Chen <dave.chen@arm.com>
- update all the import statements
- run hack/pin-dependency.sh to change pinned dependency versions
- run hack/update-vendor.sh to update go.mod files and the vendor directory
- update the method signatures for custom reporters
Signed-off-by: Dave Chen <dave.chen@arm.com>
As outlined in the KEP, we now graduate the Kubelet feature to beta
which means that it is enabled by default. The corresponding Kubelet
flag still defaults to `false`, but we now have the chance to e2e test
the feature by using a new serial test case.
KEP: https://github.com/kubernetes/enhancements/issues/2413
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Use a watch to detect invalid pod status updates in graceful node
shutdown node e2e test. By using a watch, all pod updates will be
captured while the previous logic required polling the api-server which
could miss some intermediate updates.
Signed-off-by: David Porter <david@porter.me>
Previously, the e2e test was overriding the plugins socket directory to
"/var/lib/kubelet/plugins_registry". This seems wrong, and with that
setting the e2e test was already failing, because the registration
process was timing out, in turn because the kubelet was trying to call
back the device plugin in the wrong place (see below for details).
I can't explain why it worked before - or it if worked at all - but
it really seems that `pluginapi.DevicePluginPath` is the right
setting here.
+++
In a nutshell, the device plugin registration process works like this:
1. The kubelet runs and creates the device plugin socket registration
endpoint:
KubeletSocket = DevicePluginPath + "kubelet.sock"
DevicePluginPath = "/var/lib/kubelet/device-plugins/"
2. Each device plugin will listen to an ENDPOINT the kubelet will connect
backk to. IOW the kubelet will act like a client to each device plugin,
to perform allocation requests (and more)
Each device plugin will serve from a endpoint.
The endpoint name is plugin-specific, but they all must be inside a
well-known directory: pluginapi.DevicePluginPath
3. The kubelet creates the device plugin pod, like any other pod
4. During the startup, each device plugin wants to register itself in the
kubelet. So it sends a request through
the registration endpoint. Key details:
grpc.Dial(kubelet registration socket)
registration request
reqt := &pluginapi.RegisterRequest{
Version: pluginapi.Version,
Endpoint: endpointSocket, <- socket relative to pluginapi.DevicePluginPath
ResourceName: resourceName, <- resource name to be exposed
}
5. While handling the registration request, kubelet dial back the
device plugin on socketDir + req.Endpoint.
But socketDir is hardcoded in the device manager code to
pluginapi.KubeletSocket
Signed-off-by: Francesco Romani <fromani@redhat.com>
In the AfterEach check of the e2e node device plugin tests,
the tests want really bad to clean up after themselves:
- delete the sample device plugin
- restart again the kubelet
- ensure that after the restart, no stale sample devices
(provided by the sample device plugin) are reported anymore.
We observed that in the AfterEach block of these e2e tests
we have quite reliably a flip/flop of the kubelet readiness
state, possibly related to a race with/ a slow runtime/PLEG check.
What happens is that the kubelet readiness state is true,
but goes false for a quick interval and then goes true again
and it's pretty stable after that (observed adding more logs
to the check loop).
The key factor here is the function `getLocalNode` aborts the
test (as in `framework.ExpectNoError`) if the node state is
not ready. So any occurrence of this scenario, even if it
is transient, will cause a test failure. I believe this will
make the e2e test unnecessarily fragile without making it more
correct.
For the purpose of the test we can tolerate this kind of glitches,
with kubelet flip/flopping the ready state, granted that we meet
eventually the final desired condition on which the node reports
ready AND reports no sample devices present - which was the condition
the code was trying to check.
So, we add a variant of `getLocalNode`, which just fetches the
node object the e2e_node framework created, alongside to a flag
reporting the node readiness. The new helper does not make
implicitly the test abort if the node is not ready, just bubbles
up this information.
Signed-off-by: Francesco Romani <fromani@redhat.com>