As its name implies, DeprecatedMightBeMasterNode() is deprecated.
In e2e metrics, the function was used to find the master node name in
order to get metrics from the kube-scheduler and kube-controller-manager
pods. This makes e2e metrics get those metrics directly by looking up
the pod names, without calling DeprecatedMightBeMasterNode().
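A minimal sketch of the direct lookup, assuming a client-go clientset
and the usual component= label on the control plane pods:

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        clientset "k8s.io/client-go/kubernetes"
    )

    // componentPodNames lists the pods of a control plane component
    // (kube-scheduler or kube-controller-manager) in kube-system by
    // label, so that no master node name is needed.
    func componentPodNames(c clientset.Interface, component string) ([]string, error) {
        pods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(context.TODO(),
            metav1.ListOptions{LabelSelector: "component=" + component})
        if err != nil {
            return nil, err
        }
        names := make([]string, 0, len(pods.Items))
        for _, pod := range pods.Items {
            names = append(names, pod.Name)
        }
        return names, nil
    }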
There were nits in invokeStaleDummyVMTestWithStoragePolicy():
- The error message was missing a necessary space.
- IsVMPresent() can return an error, but the error was not handled.
- IsVMPresent() returns true/false, but ExpectEqual() was not used,
  which hurt readability.
This fixes those things.
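The fixed pattern looks roughly like this sketch; the receiver and
arguments of IsVMPresent() stand in for the real test's variables:

    // Handle the error instead of dropping it, and make the boolean
    // check explicit with ExpectEqual().
    isVMPresent, err := nodeVM.IsVMPresent(vmName, dataCenter)
    framework.ExpectNoError(err, "error checking whether the stale VM is present")
    framework.ExpectEqual(isVMPresent, false, "VM should be deleted by the stale-VM cleanup")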
WaitForStableCluster() checks that all pods run on worker nodes, and the
function used to refer to master nodes to skip checking control plane
pods.
GetMasterAndWorkerNodes() was used for getting the master nodes, but its
implementation is not good because it uses DeprecatedMightBeMasterNode().
This makes WaitForStableCluster() refer to worker nodes directly to
avoid using GetMasterAndWorkerNodes().
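A sketch of referring to the worker nodes directly, assuming the masters
carry the node-role.kubernetes.io/master label:

    // A "!" selector matches nodes where the label is absent, i.e. the
    // worker nodes, without calling GetMasterAndWorkerNodes().
    workerNodes, err := c.CoreV1().Nodes().List(context.TODO(),
        metav1.ListOptions{LabelSelector: "!node-role.kubernetes.io/master"})
    framework.ExpectNoError(err)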
The behavior of exec'ing a backgrounded command is not specified
by CRI, so modify the test to run the command directly instead
of using exec.
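For illustration (pod and image names are placeholders, assuming the
usual v1/metav1 imports), the command moves from an exec into the
container spec itself:

    pod := &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "bg-cmd-test"},
        Spec: v1.PodSpec{
            Containers: []v1.Container{{
                Name:  "test",
                Image: "busybox",
                // Run the command directly instead of exec'ing
                // "sleep 100 &" into a running container, which CRI
                // leaves unspecified.
                Command: []string{"sh", "-c", "sleep 100"},
            }},
        },
    }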
Signed-off-by: Mrunal Patel <mpatel@redhat.com>
The way ginkgo handles interrupts is:
- It starts running AfterSuite hooks in a separate goroutine (this
  includes cleanupAction hooks).
- Once the AfterSuite hook is done executing, it calls os.Exit(1) on
  the test suite.
So a cleanupFunc() that runs via defer in a test can be interrupted as
follows (see the sketch after this list):
- cleanupFunc starts running via defer (or an AfterEach hook), but the
  first thing that function does is remove its cleanupHandle via
  framework.RemoveCleanupAction().
- The test suite receives an interrupt from the user and the AfterSuite
  block starts executing. Remember that while cleanupFunc is running in
  goroutine #1, AfterSuite is running concurrently in goroutine #2.
- The AfterSuite hook has a bunch of CleanupActions it needs to run,
  which were registered via framework.AddCleanupAction(cleanupFunc); but
  once cleanupFunc starts executing via defer in the test, it removes
  its cleanupHandle from the framework's AfterSuite hooks.
- So if AfterSuite has nothing left to run (because those actions were
  removed via framework.RemoveCleanupAction()), it simply goes to the
  last framework.AfterEach action and calls os.Exit(1).
- Hence, if os.Exit(1) is called before the deferred cleanupFunc has a
  chance to finish, the cleanup does not complete.
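In code, the pattern described above is roughly the following sketch
(the actual cleanup body is elided):

    var cleanupHandle framework.CleanupActionHandle
    cleanupFunc := func() {
        // First thing: deregister, so the AfterSuite goroutine will
        // not run this action again.
        framework.RemoveCleanupAction(cleanupHandle)
        // ... actual cleanup of the driver and test pods; if the
        // AfterSuite goroutine reaches os.Exit(1) before this returns,
        // the cleanup is cut short.
    }
    cleanupHandle = framework.AddCleanupAction(cleanupFunc)
    defer cleanupFunc()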
ginkgo has a weird bug: AfterEach does not get called when the test
suite exits on certain kinds of interrupt (Ctrl-C, for example).
More info: https://github.com/onsi/ginkgo/issues/222
We work around this issue in Kubernetes by adding a special hook into
the AfterSuite call, but AfterSuite cannot be used to perform certain
kinds of cleanup, because it can race with the AfterEach hook, and the
framework.AfterEach hook sets framework.ClientSet to nil.
This presents a problem for cleaning up the CSI driver and test pods.
This PR removes the cleanup of the driver manifest via CleanupAction,
because that is racy and unsafe (f.ClientSet may disappear!), and makes
the AfterSuite hooks run in an ordered fashion.
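A sketch of the safer arrangement, with hypothetical cleanup helpers:

    var _ = ginkgo.AfterSuite(func() {
        // Run suite-level cleanups explicitly and in a fixed order,
        // instead of as CleanupActions that can race with
        // framework.AfterEach (which nils out framework.ClientSet).
        cleanupTestPods()
        cleanupCSIDriver()
    })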
kubelet sometimes calls NodeStageVolume and NodePublishVolume too
often, which breaks this test and leads to flakiness. The test isn't
about that, so we can relax the checking and it still covers what it
was meant to cover.
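The relaxed check could look like this (the counter names are assumed):

    // Kubelet may legitimately retry NodeStageVolume/NodePublishVolume,
    // so require "called at least once" instead of an exact call count.
    gomega.Expect(numNodeStageCalls).To(gomega.BeNumerically(">=", 1), "NodeStageVolume calls")
    gomega.Expect(numNodePublishCalls).To(gomega.BeNumerically(">=", 1), "NodePublishVolume calls")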
When a test pattern or storage class uses late binding, the cleanup
code didn't know about the PV that may have been created for the PVC
since setting it up, and thus also didn't wait for PV deletion.
This is problematic for test isolation, because the next test was
allowed to start before cleanup had fully finished. Worse, if the
driver gets removed after the test, the volume might never get deleted.
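Conceptually, the cleanup now has to re-read the PVC to learn the name
of the late-bound PV before deleting, roughly like this sketch (helper
names are from the e2e framework, exact signatures may differ):

    // With late binding, pvc.Spec.VolumeName is only set once the pod
    // is scheduled, so it must be read at cleanup time.
    pvc, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), pvcName, metav1.GetOptions{})
    framework.ExpectNoError(err)
    pvName := pvc.Spec.VolumeName
    // Delete the PVC, then also wait until the PV is gone.
    framework.ExpectNoError(cs.CoreV1().PersistentVolumeClaims(ns).Delete(context.TODO(), pvcName, metav1.DeleteOptions{}))
    framework.ExpectNoError(e2epv.WaitForPersistentVolumeDeleted(cs, pvName, 5*time.Second, 5*time.Minute))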
As seen in one case (https://github.com/intel/pmem-csi/issues/587), a
pod can reach the "not running" state although its ephemeral volumes
are still being torn down by kubelet and the CSI driver. What happens
then is that the test returns too early, and even deleting the
namespace (and thus the pod) succeeds before NodeUnpublishVolume
really finishes.
To avoid this, StopPod now waits for the pod to really disappear.
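A sketch of the strengthened StopPod, assuming a clientset cs and the
e2e pod helpers:

    // Deleting the pod is not enough: wait until it is really gone, so
    // that NodeUnpublishVolume has finished before the test returns and
    // the namespace (and the driver) can be torn down.
    framework.ExpectNoError(cs.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}))
    framework.ExpectNoError(e2epod.WaitForPodNotFoundInNamespace(cs, pod.Name, pod.Namespace, 5*time.Minute))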
The timeouts for the two loops inside the test itself are now bounded
by an upper limit on the duration of the entire test, instead of each
loop having its own, rather arbitrary timeout.
The "error waiting for expected CSI calls" is redundant because it's
immediately followed by checking that error with:
framework.ExpectNoError(err, "while waiting for all CSI calls")