In Linux 5.14, and hopefully in some backports, core scheduling allows processes to
be co-scheduled within the same domain on SMT-enabled systems.
The containerd implementation sets the core scheduling domain when launching a
shim. This gives each shim (container/pod) a clean way to run in its own domain;
any additional containers (v2 pods) are launched in the same domain, as is
any exec'd process added to the container.
kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
Signed-off-by: Michael Crosby <michael@thepasture.io>
On Linux, the shim server always listens on the socket before
the containerd task manager dials it. The containerd task manager is
unlikely to need to handle reconnects, because the shim cannot restart. In
this case, the containerd task manager should fail fast on an ENOENT or
ECONNREFUSED error.
If the socket file is deleted while cleaning up an exited task, the
containerd task manager may take a long time to reload the
dead shim. For the task.v2 manager, the race looks like this:
```
TaskService.Delete
TaskManager.Delete(runtime/v2/manager.go)
shim.delete(runtime/v2/shim.go)
shimv2api.Shutdown(runtime/v2/task/shim.pb.go)
<- containerd has been killed or restarted somehow
bundle.Delete
```
shimv2api.Shutdown causes the shim to delete its socket file
(containerd-shim-runc-v2 does this), but the bundle is still there. During
reload, containerd waits up to 100 seconds for the socket file to
reappear, which is not reasonable. Reconnect should prevent this
case by failing fast.
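A minimal sketch of the fail-fast check described above, assuming a Unix
platform. The names `shouldFailFast` and `probe` are hypothetical and are not
containerd's API; the sketch only shows how ENOENT and ECONNREFUSED can be
recognised on a dial error with `errors.Is`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// shouldFailFast reports whether a dial error means the shim socket is gone
// (ENOENT) or nobody is listening on it (ECONNREFUSED). Retrying cannot help
// in either case because the shim does not restart.
func shouldFailFast(err error) bool {
	return errors.Is(err, syscall.ENOENT) || errors.Is(err, syscall.ECONNREFUSED)
}

// probe dials a unix socket once and returns the dial error, if any.
func probe(path string) error {
	conn, err := net.DialTimeout("unix", path, time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// The path is made up for the example and is expected not to exist.
	err := probe("/nonexistent-dir/shim.sock")
	fmt.Println("dial error:", err)
	fmt.Println("fail fast:", shouldFailFast(err))
}
```

A caller would check `shouldFailFast` before entering any wait-and-retry loop,
so a dead shim is reported immediately instead of after the full timeout.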
Closes: #5648.
Fixes: #5597.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
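As an illustration of the kind of replacement involved (a sketch, not code
taken from this commit), the Go 1.16 equivalents of the most common io/ioutil
helpers look like this. The helper names `roundTrip` and `readAll` are
hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip writes s to a file in a fresh temporary directory and reads it
// back, using only the Go 1.16+ replacements for io/ioutil helpers.
func roundTrip(s string) (string, error) {
	dir, err := os.MkdirTemp("", "example-") // was ioutil.TempDir
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	name := filepath.Join(dir, "data.txt")
	if err := os.WriteFile(name, []byte(s), 0o600); err != nil { // was ioutil.WriteFile
		return "", err
	}
	b, err := os.ReadFile(name) // was ioutil.ReadFile
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// readAll drains a reader with io.ReadAll (was ioutil.ReadAll).
func readAll(s string) (string, error) {
	b, err := io.ReadAll(strings.NewReader(s))
	return string(b), err
}

func main() {
	got, err := roundTrip("hello")
	fmt.Println(got, err)
	all, err := readAll("world")
	fmt.Println(all, err)
}
```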
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Since the /run directory on macOS is read-only, darwin containerd should
use a different directory. Use the pre-defined default values instead
to avoid this issue.
Fixes: bd908acab ("Use path based unix socket for shims")
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `exec.Command`, `exec.LookPath`, etc.
From the related blogpost (https://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
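To make the difference concrete, here is a rough sketch of the idea behind the
restricted lookup, written against the standard library only. It is an
approximation for illustration, not the `golang.org/x/sys/execabs`
implementation, and `lookPathAbs` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// lookPathAbs resolves a bare command name via PATH and rejects results that
// are not absolute, i.e. binaries that were found relative to the current
// directory.
func lookPathAbs(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		// Includes "not found" and, on newer Go versions, the error that
		// LookPath itself returns for results relative to the current directory.
		return "", err
	}
	if filepath.Base(name) == name && !filepath.IsAbs(path) {
		return "", fmt.Errorf("%s resolves to %q relative to the current directory", name, path)
	}
	return path, nil
}

func main() {
	// The command name is made up and is expected not to exist.
	path, err := lookPathAbs("definitely-not-a-real-binary-xyz")
	fmt.Printf("path=%q err=%v\n", path, err)
}
```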
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Refactor shim v2 to load and register plugins.
Update the shim init interface so that the returned service is not required to
implement the task service; if it does, it is registered as a plugin.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently.
Signed-off-by: Derek McGowan <derek@mcg.dev>
For an abstract socket address there is no need to chmod
the address's file, because no such file actually exists.
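A small sketch of the check, under the assumption that an abstract address is
written with a leading NUL byte (the Linux convention). How containerd itself
represents abstract addresses may differ; `isAbstract` and `needsChmod` are
hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// isAbstract reports whether address lives in the Linux abstract socket
// namespace. Assumption of this sketch: such addresses start with a NUL byte.
func isAbstract(address string) bool {
	return strings.HasPrefix(address, "\x00")
}

// needsChmod reports whether there is a socket file on disk whose mode can
// be changed. Abstract sockets have no file, so there is nothing to chmod.
func needsChmod(address string) bool {
	return !isAbstract(address)
}

func main() {
	fmt.Println(needsChmod("/run/example/shim.sock")) // path-based socket
	fmt.Println(needsChmod("\x00example-shim"))       // abstract socket
}
```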
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
The shim.SetScore() utility has not been used since 7dfc605fc6.
Checking for uses outside of this repository, I found only one external use of
this in gVisor; a9441aea27/pkg/shim/service.go (L262-L264)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When the runC shimv2 starts, the StartShim interface re-execs itself as a
long-running process, which reads the `address` during initialization.
```happycase
Process
containerd-shim-runc-v1/v2 start containerd-shim-runc-v1/v2
initializing socket
reexec containerd-shim-runc-v1/v2
write address into file
initializing
read address
write back to containerd daemon
serving
...
remove address in Shutdown call
```
However, there is no synchronization after the re-exec, so the following data
race can occur:
```leaking-case
Process
containerd-shim-runc-v1/v2 start containerd-shim-runc-v1/v2
initializing socket
reexec containerd-shim-runc-v1/v2
initializing
read address
write address into file
write back to containerd daemon
serving
...
fail to remove address
because of empty address
```
The `address` should be written to the file before the re-exec.
Also, if the whole service is shut down before temporary resources
(such as the socket file) are cleaned up, the Shutdown caller will sometimes
receive `ttrpc: closed`, depending on the Go runtime scheduler. This also leaks
socket files.
Since the shimV2-Delete binary API must be called to clean up the shim's temporary
resources, and shimV2-runC-v1 doesn't support grouping multiple containers in one shim,
it is safe to remove the socket file in the binary call for shimV2-runC-v1.
For the shimV2-runC-v2 shim, we still clean up the socket in Shutdown.
Hopefully we can find a way to clean up the socket in the shimV2-Delete binary
call.
Fix: #5173
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Previously a typo was introduced that caused the wrong error to be
checked against when calling exec.LookPath. This had the effect that
containerd would never locate the shim binary if it was in the same
directory as containerd's binary, but not in PATH.
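The intended lookup order can be sketched as follows. This is an illustration
under assumptions, not the patched containerd code: `resolveBinary` and
`makeFakeBinary` are hypothetical helpers, and the fallback directory is passed
in explicitly rather than derived from the daemon's own path.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// resolveBinary looks name up in PATH and, only when it is not found there,
// falls back to fallbackDir (for example the directory of the daemon binary).
func resolveBinary(name, fallbackDir string) (string, error) {
	path, err := exec.LookPath(name)
	if err == nil {
		return path, nil
	}
	// Inspect the error returned by LookPath itself, not some other variable.
	if !errors.Is(err, exec.ErrNotFound) {
		return "", err
	}
	candidate := filepath.Join(fallbackDir, name)
	if _, statErr := os.Stat(candidate); statErr != nil {
		return "", fmt.Errorf("%s not found in PATH or %s: %w", name, fallbackDir, statErr)
	}
	return candidate, nil
}

// makeFakeBinary creates an executable placeholder file in a new temporary
// directory and returns that directory. It exists only for the demo.
func makeFakeBinary(name string) (string, error) {
	dir, err := os.MkdirTemp("", "fake-bin-")
	if err != nil {
		return "", err
	}
	return dir, os.WriteFile(filepath.Join(dir, name), []byte("#!/bin/sh\n"), 0o755)
}

func main() {
	dir, err := makeFakeBinary("example-shim-xyz")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	path, err := resolveBinary("example-shim-xyz", dir)
	fmt.Println(path, err)
}
```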
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
The current code simply ignores the full binary path when starting the
shimv2 process and instead falls back to a binary in the PATH. This
is problematic (and confusing) for those using CRI-O, which has these
bits vendored.
It is problematic with CRI-O because the user can set the full binary
path and, instead of executing that binary, CRI-O
will fail to create the container unless the binary is in
the PATH, which may not be the case in a few different scenarios (testing
being the most common one).
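A sketch of the behaviour this fix aims for: an absolute path is used
verbatim, and only a runtime name is turned into a binary name to be looked up
in the PATH. The function name `shimCommand` is hypothetical, and the mapping
from a runtime name such as `io.containerd.runc.v2` to
`containerd-shim-runc-v2` is an assumption made for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shimCommand returns the command to execute for a configured runtime.
// An absolute path is returned unchanged; otherwise a binary name is derived
// from the last two dot-separated components of the runtime name.
func shimCommand(runtime string) string {
	if filepath.IsAbs(runtime) {
		return runtime
	}
	parts := strings.Split(runtime, ".")
	if len(parts) < 2 {
		return runtime
	}
	return fmt.Sprintf("containerd-shim-%s-%s", parts[len(parts)-2], parts[len(parts)-1])
}

func main() {
	fmt.Println(shimCommand("/opt/test/containerd-shim-custom-v2"))
	fmt.Println(shimCommand("io.containerd.runc.v2"))
}
```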
Fixes: #5006
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
oom_score_adj must be in the range -1000 to 1000. In AdjustOOMScore, if containerd's score is already at the maximum value, we should set that value for the shim instead of trying to set 1001, which is invalid.
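The clamping rule amounts to the following sketch. `shimOOMScore` is a
hypothetical helper that only computes the value; it does not read or write
`/proc`.

```go
package main

import "fmt"

// oomScoreAdjMax is the largest value the kernel accepts for oom_score_adj.
const oomScoreAdjMax = 1000

// shimOOMScore returns the score for the shim: one above the parent's score,
// capped at the maximum so that 1001 is never written.
func shimOOMScore(parent int) int {
	score := parent + 1
	if score > oomScoreAdjMax {
		return oomScoreAdjMax
	}
	return score
}

func main() {
	for _, parent := range []int{-999, 0, 999, 1000} {
		fmt.Printf("parent=%d shim=%d\n", parent, shimOOMScore(parent))
	}
}
```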
Signed-off-by: Simon Kaegi <simon_kaegi@ca.ibm.com>
This allows filesystem based ACLs for configuring access to the socket of a
shim.
Co-authored-by: Samuel Karp <skarp@amazon.com>
Signed-off-by: Samuel Karp <skarp@amazon.com>
Signed-off-by: Michael Crosby <michael@thepasture.io>
Signed-off-by: Michael Crosby <michael.crosby@apple.com>
Before this change, if an event fails to send on the first attempt,
subsequent attempts will fail with context.Canceled because the
caller of publish passes a cancellable timeout context, which the publisher
uses to send the event.
The publisher returns immediately if the send fails, but adds the event
to an async queue to try again.
Meanwhile, the caller returns, cancelling the context.
Additionally, subsequent attempts may fail to send because the timeout
was expected to be for a single request but the queue sleeps for
`attempt*time.Second`.
In the shim service, the timeout was set to 5s, which means the send
will fail with context.DeadlineExceeded before it reaches `maxRequeue`
(which is currently 5).
This change moves the timeout to the publisher so each send attempt gets
its own timeout.
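A simplified, synchronous sketch of the per-attempt timeout idea. The real
publisher requeues asynchronously; here a plain loop stands in for the queue,
the backoff unit is shortened so the example runs quickly, and all names
(`publishWithRetry`, `flakySender`, `attemptTimeout`, `backoffUnit`) are
hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const maxRequeue = 5

var (
	attemptTimeout = 100 * time.Millisecond // stand-in for the real per-send timeout
	backoffUnit    = time.Millisecond       // the text describes attempt*time.Second
)

type sendFunc func(ctx context.Context) error

// publishWithRetry gives every attempt a fresh timeout derived from
// context.Background, so neither a caller cancelling its own context nor the
// time spent sleeping between attempts can expire a later attempt.
func publishWithRetry(send sendFunc) (attempts int, err error) {
	for attempt := 1; attempt <= maxRequeue; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), attemptTimeout)
		err = send(ctx)
		cancel()
		if err == nil {
			return attempt, nil
		}
		time.Sleep(time.Duration(attempt) * backoffUnit)
	}
	return maxRequeue, err
}

// flakySender fails the first n calls and then succeeds, provided the
// context it is handed is still live.
func flakySender(n int) sendFunc {
	calls := 0
	return func(ctx context.Context) error {
		if err := ctx.Err(); err != nil {
			return err
		}
		calls++
		if calls <= n {
			return errors.New("temporary send failure")
		}
		return nil
	}
}

func main() {
	attempts, err := publishWithRetry(flakySender(2))
	fmt.Println(attempts, err)
}
```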
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
TestRuntimeWithEmptyMaxEnvProcs should restore GOMAXPROCS after the
test so that the temporary change does not affect other test
cases, such as TestRuntimeWithNonEmptyMaxEnvProcs.
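The save-and-restore pattern can be sketched like this. `withGOMAXPROCS` and
`currentProcs` are hypothetical helpers; inside a real test the same restore
would typically be registered with `defer` or `t.Cleanup`.

```go
package main

import (
	"fmt"
	"runtime"
)

// withGOMAXPROCS runs fn with GOMAXPROCS set to n and restores the previous
// value afterwards, so the change cannot leak into other code.
func withGOMAXPROCS(n int, fn func()) {
	prev := runtime.GOMAXPROCS(n) // returns the previous setting
	defer runtime.GOMAXPROCS(prev)
	fn()
}

// currentProcs queries GOMAXPROCS without changing it.
func currentProcs() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	before := currentProcs()
	withGOMAXPROCS(1, func() {
		fmt.Println("inside:", currentProcs())
	})
	fmt.Println("restored:", currentProcs() == before)
}
```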
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The request came from a Slack message: shims do not output their versions, making
it hard for users and operators to know what version of a shim they have on the
system. This adds a `-v` flag to the shims so that users can see whether a shim is
in sync with containerd and which versions of shims they are running.
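A minimal sketch of such a flag using the standard `flag` package. The program
name, the `version` value and the helper `wantsVersion` are placeholders for
illustration, not the shim's actual code.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// version is a placeholder; a real build would usually inject it at link time.
var version = "v0.0.0-example"

// wantsVersion parses args and reports whether -v was given.
func wantsVersion(args []string) (bool, error) {
	fs := flag.NewFlagSet("example-shim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	v := fs.Bool("v", false, "show the shim version and exit")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *v, nil
}

func main() {
	show, err := wantsVersion([]string{"-v"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if show {
		fmt.Println("example-shim", version)
	}
}
```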
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Instead of having several dialer implementations, leave only one in
`pkg/dialer` and call it from `pkg/ttrpcutil` and `runtime/v(1|2)/shim`,
which each had their own.
Closes #3471.
Signed-off-by: Kiril Vladimiroff <kiril@vladimiroff.org>
- Our out of tree shim would like to publish events with ttrpc. These
functions should be exposed so our shim doesn't need to reimplement
publisher logic.
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
Because of the way Go handles flags, passing a flag that is not defined
causes an error. In our case, if we kept this as a flag, then
third-party shims would break when they see this new flag. To fix this,
I moved this new configuration option to an env var. We should use env
vars from here on out to avoid breaking shim compat.
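The difference can be demonstrated with the standard `flag` package. In this
sketch the flag set stands for a shim built before the new option existed; the
`-namespace` flag, the `-new-option` flag and the environment variable name
are illustrative assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// exampleOptionEnv is a hypothetical variable name used only in this sketch.
const exampleOptionEnv = "EXAMPLE_SHIM_OPTION"

// parseOldShimFlags parses args with the flag set of a shim that predates
// the new option. Any flag it does not define makes Parse return an error.
func parseOldShimFlags(args []string) error {
	fs := flag.NewFlagSet("old-shim", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	fs.String("namespace", "", "namespace of the container")
	return fs.Parse(args)
}

// optionFromEnv reads the new option from the environment. A shim that does
// not know the variable never looks it up, so nothing breaks.
func optionFromEnv() string {
	return os.Getenv(exampleOptionEnv)
}

func main() {
	fmt.Println("known flag:  ", parseOldShimFlags([]string{"-namespace", "default"}))
	fmt.Println("unknown flag:", parseOldShimFlags([]string{"-new-option", "x"}))

	os.Setenv(exampleOptionEnv, "x")
	fmt.Println("from env:    ", optionFromEnv())
}
```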
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Previously the TTRPC address was generated as "<GRPC address>.ttrpc".
This change now allows explicit configuration of the TTRPC address, with
the default still being the old format if no value is specified.
As part of this change, a new configuration section is added for TTRPC
listener options.
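The defaulting rule can be sketched as below. `ttrpcAddress` is a hypothetical
helper and the socket paths are example values.

```go
package main

import "fmt"

// ttrpcAddress returns the explicitly configured TTRPC address, or the old
// derived form "<GRPC address>.ttrpc" when none is configured.
func ttrpcAddress(grpcAddress, configured string) string {
	if configured != "" {
		return configured
	}
	return grpcAddress + ".ttrpc"
}

func main() {
	fmt.Println(ttrpcAddress("/run/example/daemon.sock", ""))
	fmt.Println(ttrpcAddress("/run/example/daemon.sock", "/run/example/custom-ttrpc.sock"))
}
```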
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
AnonDialer will now return a "not found" error if the pipe is not found
before the timeout is reached. If the pipe exists but the timeout is
reached while attempting to connect, the timeout error will still be
returned.
This will allow the error handling logic to work properly when
connecting to the shim log pipe. An error message is only logged if the
error is not "not found", so now log noise from log pipes that were
never intended to be created by the shim will be hidden.
This change also cleans up the control flow for AnonDialer on Windows.
The new code should be more easily readable, but the only semantic
change is the error return value change.
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
This changes the shim's OOM score from a static max killable of -999 to
be +1 of the containerd daemon's score. This should allow the shims to
be killed first in an OOM condition but leave the daemon alone for a bit
to help clean up and manage the containers during this situation.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>