This commit upgrades github.com/containerd/typeurl to use typeurl.Any.
The interface hides gogo/protobuf/types.Any from containerd's Go client.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
All occurrences only passed a PID, so we can use this utility to make
the code more symmetrical with their cgroups v2 counterparts.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
runc option --criu is now ignored (with a warning), and the option will be
removed entirely in a future release. Users who need a non- standard criu
binary should rely on the standard way of looking up binaries in $PATH.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This commit removes gogoproto.enumvalue_customname,
gogoproto.goproto_enum_prefix and gogoproto.enum_customname.
All of them make proto-generated Go code more idiomatic, but we already
don't use these enums in our external-surfacing types and they are anyway
not supported by Google's official toolchain (see #6564).
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
According to https://github.com/protocolbuffers/protobuf/issues/9184
> Weak fields are an old and deprecated internal-only feature that we never
> open sourced.
This blocks us to upgrade protoc.
Fixes#6232.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
If containerd-shim-runc-v1 process dead abnormally, such as received
kill -s 9 signal, panic or other unkown reasons, the containerd-shim-runc-v1
server can not reap runc container and forward init process exit event.
This will lead the container leaked in dockerd. When shim dead, containerd
will clean dead shim, here read init process pid and forward exit event
with pid at the same time.
Related to: #6402
Signed-off-by: Jeff Zvier <zvier20@gmail.com>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
If containerd-shim-runc-v2 process dead abnormally, such as received
kill 9 signal, panic or other unkown reasons, the containerd-shim-runc-v2
server can not reap runc container and forward init process exit event.
This will lead the container leaked in dockerd. When shim dead, containerd
will clean dead shim, here read init process pid and forward exit event
with pid at the same time.
Signed-off-by: Jeff Zvier <zvier20@gmail.com>
In linux 5.14 and hopefully some backports, core scheduling allows processes to
be co scheduled within the same domain on SMT enabled systems.
The containerd impl sets the core sched domain when launching a shim. This
allows a clean way for each shim(container/pod) to be in its own domain and any
additional containers, (v2 pods) be be launched with the same domain as well as
any exec'd process added to the container.
kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
Signed-off-by: Michael Crosby <michael@thepasture.io>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
cgroupsv2.LoadManager() already performs VerifyGroupPath(), and returns
an error if the path is invalid, so this check is redundant.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- process.Init#io could be nil
- Make sure CreateTaskRequest#Options is not empty before unmarshaling
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
When runC shimv2 starts, the StartShim interface will re-exec itself as
long-running process, which will read the `address` during initializing.
```happycase
Process
containerd-shim-runc-v1/v2 start containerd-shim-runc-v1/v2
initializing socket
reexec containerd-shim-runc-v1/v2
write address into file
initializing
read address
write back to containerd daemon
serving
...
remove address in Shutdown call
```
However, there is no synchronization after reexec. Then the data race is
like:
```leaking-case
Process
containerd-shim-runc-v1/v2 start containerd-shim-runc-v1/v2
initializing socket
reexec containerd-shim-runc-v1/v2
initializing
read address
write address into file
write back to containerd daemon
serving
...
fail to remove address
because of empty address
```
The `address` should be writen into file first before reexec.
And if shutdown the whole service before cleanup temporary
resource (like socket file), the Shutdown caller will receive `ttrpc: closed`
sometime, which depends on go runtime scheduler. Then it also causes leaking
socket files.
Since the shimV2-Delete binary API must be called to cleanup shim temporary
resource and shimV2-runC-v1 doesn't support grouping multi containers in one,
it is safe to remove the socket file in the binary call for shimV2-runC-v1.
But for the shimV2-runC-v2 shim, we still cleanup socket in Shutdown.
Hopefully we can find a way to cleanup socket in shimV2-Delete binary
call.
Fix: #5173
Signed-off-by: Wei Fu <fuweid89@gmail.com>
bump version 1.3.2 for gogo/protobuf due to CVE-2021-3121 discovered
in gogo/protobuf version 1.3.1, CVE has been fixed in 1.3.2
Signed-off-by: Aditi Sharma <adi.sky17@gmail.com>
When the shim returns a plain error when a process does not exist,
the server is unable to recognise its GRPC status code and assumes
UnknownError. This is awkward for containerd client users as they are
unable to recognise the actual reason for the error.
When the shim returns a NotFound GRPC error, it is properly translated
by the server and clients receive a proper NotFound error instead of
Unknown
Please note that we (CF Garden) would like to have the eventual fix backported to 1.4 as well.
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
This allows filesystem based ACLs for configuring access to the socket of a
shim.
Co-authored-by: Samuel Karp <skarp@amazon.com>
Signed-off-by: Samuel Karp <skarp@amazon.com>
Signed-off-by: Michael Crosby <michael@thepasture.io>
Signed-off-by: Michael Crosby <michael.crosby@apple.com>
Currently the shims only support starting the logging binary process if the
io.Creator Config does not specify Terminal: true. This means that the program
using containerd will only be able to specify FIFO io when Terminal: true,
rather than allowing the shim to fork the logging binary process. Hence,
containerd consumers face an inconsistent behavior regarding logging binary
management depending on the Terminal option.
Allowing the shim to fork the logging binary process will introduce consistency
between the running container and the logging process. Otherwise, the logging
process may die if its parent process dies whereas the container will keep
running, resulting in the loss of container logs.
Signed-off-by: Akshat Kumar <kshtku@amazon.com>
Before this change, if an event fails to send on the first attempt,
subsequent attempts will fail with context.Cancelled because the the
caller of publish passes a cancellable timeout, which the publisher uses
to send the event.
The publisher returns immediately if the send fails, but adds the event
to an async queue to try again.
Meanwhile the caller will return cancelling the context.
Additionally, subsequent attempts may fail to send because the timeout
was expected to be for a single request but the queue sleeps for
`attempt*time.Second`.
In the shim service, the timeout was set to 5s, which means the send
will fail with context.DeadlineExceeded before it reaches `maxRequeue`
(which is currently 5).
This change moves the timeout to the publisher so each send attempt gets
its own timeout.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Previously shim v2 (`io.containerd.runc.{v1,v2}`) always used `/run/containerd/runc` as the runc root.
Fix#4326
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
How to test (from https://github.com/opencontainers/runc/pull/2352#issuecomment-620834524):
(host)$ sudo swapoff -a
(host)$ sudo ctr run -t --rm --memory-limit $((1024*1024*32)) docker.io/library/alpine:latest foo
(container)$ sh -c 'VAR=$(seq 1 100000000)'
An event `/tasks/oom {"container_id":"foo"}` will be displayed in `ctr events`.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
There's no OOM monitoring for the v2 cgroups yet, so it seems unlikely
that there was a leak in that case.
Signed-off-by: Seth Pellegrino <spellegrino@newrelic.com>
Only start watching the cgroup for OOMs when the first process starts
instead of on every process.
Signed-off-by: Seth Pellegrino <spellegrino@newrelic.com>
Previously, the platform was closed as part of the Delete method when the
process was an init for a task and there were no more tasks after its deletion.
This can create problems if another task is created within the shim right after
the delete runs, which results in the platform being closed but the shim
continuing to run.
This change moves closing the platform to the Shutdown method after the shim's
context is canceled, which ensures the platform is only closed once the shim
is sure its done servicing containers.
Signed-off-by: Erik Sipsma <sipsma@amazon.com>
* only shim v2 runc v2 ("io.containerd.runc.v2") is supported
* only PID metrics is implemented. Others should be implemented in separate PRs.
* lots of code duplication in v1 metrics and v2 metrics. Dedupe should be separate PR.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Reized the I/O buffers to align with the size of the kernel buffers with fifos
and move the close aspect of the console to key off of the stdin closing.
Fixes#3738
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Because of the way go handles flags, passing a flag that is not defined
will cause an error. In our case, if we kept this as a flag, then
third-party shims would break when they see this new flag. To fix this,
I moved this new configuration option to an env var. We should use env
vars from here on out to avoid breaking shim compat.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Previously the TTRPC address was generated as "<GRPC address>.ttrpc".
This change now allows explicit configuration of the TTRPC address, with
the default still being the old format if no value is specified.
As part of this change, a new configuration section is added for TTRPC
listener options.
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
When containerd-shim does reaper, the most processes are not init
process. Since json.Decode consumes more CPU resource, we should check
killall option for init process only.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This changes the shim's OOM score from a static max killable of -999 to
be +1 of the containerd daemon's score. This should allow the shim's to
be killed first in an OOM condition but leave the daemon alone for a bit
to help cleanup and manage the containers during this situation.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Shims no longer call `os.Exit` but close the context on shutdown so that
events and other resources have hit the `defer`s.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
* Rootfs dir is created during container creation not during bundle
creation
* Add support for v2
* UnmountAll is a no-op when the path to unmount (i.e. the rootfs dir)
does not exist or is invalid
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
Closes#603
This adds logging facilities at the shim level to provide minimal I/O
overhead and pluggable logging options. Log handling is done within the
shim so that all I/O, cpu, and memory can be charged to the container.
A sample logging driver setting up logging for a container the systemd
journal looks like this:
```go
package main
import (
"bufio"
"context"
"fmt"
"io"
"sync"
"github.com/containerd/containerd/runtime/v2/logging"
"github.com/coreos/go-systemd/journal"
)
func main() {
logging.Run(log)
}
func log(ctx context.Context, config *logging.Config, ready func() error) error {
// construct any log metadata for the container
vars := map[string]string{
"SYSLOG_IDENTIFIER": fmt.Sprintf("%s:%s", config.Namespace, config.ID),
}
var wg sync.WaitGroup
wg.Add(2)
// forward both stdout and stderr to the journal
go copy(&wg, config.Stdout, journal.PriInfo, vars)
go copy(&wg, config.Stderr, journal.PriErr, vars)
// signal that we are ready and setup for the container to be started
if err := ready(); err != nil {
return err
}
wg.Wait()
return nil
}
func copy(wg *sync.WaitGroup, r io.Reader, pri journal.Priority, vars map[string]string) {
defer wg.Done()
s := bufio.NewScanner(r)
for s.Scan() {
if s.Err() != nil {
return
}
journal.Send(s.Text(), pri, vars)
}
}
```
A `logging` package has been created to assist log developers create
logging plugins for containerd.
This uses a URI based approach for logging drivers that can be expanded
in the future.
Supported URI scheme's are:
* binary
* fifo
* file
You can pass the log url via ctr on the command line:
```bash
> ctr run --rm --runtime io.containerd.runc.v2 --log-uri binary://shim-journald docker.io/library/redis:alpine redis
```
```bash
> journalctl -f -t default:redis
-- Logs begin at Tue 2018-12-11 16:29:51 EST. --
Mar 08 16:08:22 deathstar default:redis[120760]: 1:C 08 Mar 2019 21:08:22.703 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.704 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.704 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.704 # Current maximum open files is 1024. maxclients has been reduced to 992 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.705 * Running mode=standalone, port=6379.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.705 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.705 # Server initialized
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.705 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.705 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
Mar 08 16:08:22 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:22.705 * Ready to accept connections
Mar 08 16:08:50 deathstar default:redis[120760]: 1:signal-handler (1552079330) Received SIGINT scheduling shutdown...
Mar 08 16:08:50 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:50.405 # User requested shutdown...
Mar 08 16:08:50 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:50.406 * Saving the final RDB snapshot before exiting.
Mar 08 16:08:50 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:50.452 * DB saved on disk
Mar 08 16:08:50 deathstar default:redis[120760]: 1:M 08 Mar 2019 21:08:50.453 # Redis is now ready to exit, bye bye...
```
The following client side Opts are added:
```go
// LogURI provides the raw logging URI
func LogURI(uri *url.URL) Creator { }
// BinaryIO forwards contianer STDOUT|STDERR directly to a logging binary
func BinaryIO(binary string, args map[string]string) Creator {}
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
support checkpoint without committing a checkpoint dir into a
checkpoint image and restore without untar image into checkpoint
directory. support for both v1 and v2 runtime
Signed-off-by: Ace-Tang <aceapril@126.com>
add ImagePath and WorkPath for checkpoint process, add CriuImagePath
and CriuWorkPath for create process in runtime v2 protobuf
Signed-off-by: Ace-Tang <aceapril@126.com>
add '-id' flag when start container with io.containerd.runc.v1 shim, or user
can not get container-shim relation from 'ps -ef',like
```
/usr/bin/containerd-shim-runc-v1 -namespace default -address
/run/containerd/containerd.sock -publish-binary /usr/bin/containerd
```
Signed-off-by: Ace-Tang <aceapril@126.com>
1. avoid dead lock during kill, fetch allProcesses before handle events
2. use argu's ctx instead of context.Backgroud() in openlog
Signed-off-by: Wei Fu <fuweid89@gmail.com>
- Still KillAll if the task uses the hosts pid namespace
- Test for both host pid namespace and normal cases
Co-authored-by: Oliver Stenbom <ostenbom@pivotal.io>
Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
This was found testing other runtime shims that are faster than runc(no
containerization). This is a race that can cause the shim to block
forever. It's not an issue for out/err because we open both sides of
the pipe, but for stdin, it expects the client to have it opened.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>