This replaces embedding of JavaScript code into the mock driver that
runs inside the cluster with Go callbacks which run inside the
e2e.test suite itself. In contrast to the JavaScript hooks, they have
direct access to all parameters and can fabricate arbitrary responses,
not just error codes.
Because the callbacks run in the same process as the test itself, it
is possible to set up two-way communication via shared variables or
channels. This opens the door for writing better tests. Some of the
existing tests that poll mock driver output could be simplified, but
that can be addressed later.
For now, only tests using hooks use embedding. How gRPC calls are
retrieved is abstracted behind the CSIMockTestDriver interface, so
tests don't need to be modified when switching between embedding
and remote mock driver.
Extract TestSuite, TestDriver, TestPattern, TestConfig
and VolumeResource, SnapshotVolumeResource from testsuite
package and put them into a new package called api.
The ultimate goal here is to make the testsuites as clean
as possible. And only testsuites in the package.
As discussed in https://github.com/kubernetes/kubernetes/pull/93658,
relying on a watch to deliver all events is not acceptable, not even
for a test, because it can and did (at least for OpenShift testing) to
test flakes.
The solution in #93658 was to replace log flooding with a clear test
failure, but that didn't solve the flakiness.
A better solution is to use a RetryWatcher which "under normal
circumstances" (https://github.com/kubernetes/kubernetes/pull/93777#discussion_r467932080)
should always deliver all changes.
In our current mock CSI driver e2e test, we are not waiting
for the CSI driver register successfully to perform test
including provision PVC. This can lead to timeout when the
csi driver takes longer to register the socket.
This change adds the waiting part so that the system will
wait for up to 10 minutes for the driver to be ready. This
normally won't take this long. However, under a resource
constraint environment it can take longer than expected time.
https://github.com/kubernetes/kubernetes/issues/93358
By creating CSIStorageCapacity objects in advance, we get the
FailedScheduling pod event if (and only if!) the test is expected to
fail because of insufficient or missing capacity. We can use that as
indicator that waiting for pod start can be stopped early. However,
because we might not get to see the event under load, we still need
the timeout.
Setting testParameters.scName had no effect because
StorageClassTest.StorageClassName isn't used anywhere. Instead, the
storage class name is generated dynamically.
kubelet sometimes calls NodeStageVolume an NodePublishVolume too
often, which breaks this test and leads to flakiness. The test isn't
about that, so we can relax the checking and it still covers what it
was meant to cover.
The timeout for the two loops inside the test itself are now bounded
by an upper limit for the duration of the entire test instead of
having their own, rather arbitrary timeouts.
The "error waiting for expected CSI calls" is redundant because it's
immediately followed by checking that error with:
framework.ExpectNoError(err, "while waiting for all CSI calls")
The mock driver gets instructed to return a ResourceExhausted error
for the first CreateVolume invocation via the storage class
parameters.
How this should be handled depends on the situation: for normal
volumes, we just want external-scheduler to retry. For late binding,
we want to reschedule the pod. It also depends on topology support.
The code became obsolete with the introduction of parseMockLogs
because that will retrieve the log itself. For debugging of a running
test the normal pod output logging is sufficient.
parseMockLogs is called potentially multiple times while waiting for
output. Dumping all CSI calls each time is quite verbose and
repetitive. To verify what the driver has done already, the normal
capturing of the container log can be used instead:
csi-mockplugin-0/mock@127.0.0.1: gRPCCall: {"Method":"/csi.v1.Node/NodePublishVolume","Request"...
As seen in some test
runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041),
retrieving output can fail with "the server rejected our request for
an unknown reason (get pods csi-mockplugin-0)".
If this truly an intermittent error, then the existing retry logic in
the callers can deal with this.
Especially related to "uncertain" global mounts. A large refactoring of CSI
mock tests were necessary:
- to be able to script the driver to return errors as required by the test
- to parse the CSI driver logs to check kubelet called the right CSI calls