The status error was embedded inside the new error constructed by
WaitForPodsResponding's get function, but not wrapped. Therefore
`apierrors.IsServiceUnavailable(err)` didn't find it and returned false -> no
retries.
Wrapping fixes this and Gomega formatting of the error remains useful:
err := &errors.StatusError{}
err.ErrStatus.Code = 503
err.ErrStatus.Message = "temporary failure"
err2 := fmt.Errorf("Controller %s: failed to Get from replica pod %s:\n%w\nPod status:\n%s",
"foo", "bar",
err, "some status")
fmt.Println(format.Object(err2, 1))
fmt.Println(errors.IsServiceUnavailable(err2))
=>
<*fmt.wrapError | 0xc000139340>:
Controller foo: failed to Get from replica pod bar:
temporary failure
Pod status:
some status
{
msg: "Controller foo: failed to Get from replica pod bar:\ntemporary failure\nPod status:\nsome status",
err: <*errors.StatusError | 0xc0001a01e0>{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "",
Message: "temporary failure",
Reason: "",
Details: nil,
Code: 503,
},
},
}
true
This uses the generic ptr.To in k8s.io/utils to replace functions and
code constructs which only serve to return pointers to intstr
values. Other uses of the deprecated pointer package are updated in
modified files.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
Add and use more facilities to the *internal* podresources client.
Checking e2e test runs, we have quite some
```
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused"
```
This is likely caused by kubelet restarts, which we do plenty in e2e tests,
combined with the fact gRPC does lazy connection AND we don't really
check the errors in client code - we just bubble them up.
While it's arguably bad we don't check properly error codes, it's also
true that in the main case, e2e tests, the functions should just never
fail besides few well known cases, we're connecting over a
super-reliable unix domain socket after all.
So, we centralize the fix adding a function (alongside with minor
cleanups) which wants to trigger and ensure the connection happens,
localizing the changes just here. The main advantage is this approach
is opt-in, composable, and doesn't leak gRPC details into the client
code.
Signed-off-by: Francesco Romani <fromani@redhat.com>
This deflakes the "Containers Lifecycle should not launch second
container before PostStart of the first container completed" test by
assigning enough time to finish the postStart hook.
When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.