ref: #1464
This tries to fix races around process state. First, it
adds the process mutex around the state call so that any state changes,
deletions, etc. are handled in order.
Second, for IsNotExist errors from the runtime, return a stopped state if
a process has been removed from the underlying OCI runtime but not from
the shim yet. This shouldn't happen with the lock from above, but it's
hard to verify this issue.
Third, handle shim disconnections and return an ErrNotFound.
Fourth, don't abort returning all tasks if one task is unable to return
its state.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This ensures that, when using the host pid, we don't leave processes alive,
which would prevent Wait() from returning until they all die.
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
This converts the oom metric to be a const metric so that deleted tasks
do not fill up the metric labels.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This also fixes the type used for RuncOptions.SystemCgroup, hence introducing
an API break.
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
Depends on https://github.com/containerd/go-runc/pull/24
There is currently a race with the reaper where you could miss some exit
events from processes.
The reaper was previously so complex because processes could fork,
getting a pid, and then fail on an execve before we had time to
register the process with the reaper. Reducing that race meant keeping
pids in a map, which could fill up.
This change makes the reaper handle multiple subscribers so that the
caller can handle locking, for when they want to wait for a specific
pid, without affecting other callers using the reaper code.
Exit events are broadcast to multiple subscribers, in this case the runc
commands and container pids that we get from a pid-file. Locking while
the entire container starts no longer affects runc commands where you want
to call `runc create` and wait until that has been completed.
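A rough sketch of the subscriber idea, assuming a simple channel-per-subscriber design. `Monitor`, `Exit`, and `waitPid` are hypothetical names and not the reaper's real API. Each caller subscribes before starting its process, so it cannot miss the exit, and then filters for the pid it cares about.

```go
package main

import (
	"fmt"
	"sync"
)

// Exit is a hypothetical exit event: the pid that exited and its status.
type Exit struct {
	Pid    int
	Status int
}

// Monitor broadcasts every exit event to all current subscribers.
type Monitor struct {
	mu   sync.Mutex
	subs map[chan Exit]struct{}
}

func NewMonitor() *Monitor {
	return &Monitor{subs: make(map[chan Exit]struct{})}
}

// Subscribe registers a buffered channel that receives all later exits.
func (m *Monitor) Subscribe() chan Exit {
	c := make(chan Exit, 32)
	m.mu.Lock()
	m.subs[c] = struct{}{}
	m.mu.Unlock()
	return c
}

// Unsubscribe removes the channel under the lock and then closes it, so
// Notify never sends on a closed channel.
func (m *Monitor) Unsubscribe(c chan Exit) {
	m.mu.Lock()
	delete(m.subs, c)
	m.mu.Unlock()
	close(c)
}

// Notify delivers one exit event to every subscriber. It blocks if a
// subscriber's buffer is full, which a real implementation must handle.
func (m *Monitor) Notify(e Exit) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for c := range m.subs {
		c <- e
	}
}

// waitPid reads events until it sees an exit for the given pid. It returns
// -1 if the channel is closed first.
func waitPid(c chan Exit, pid int) int {
	for e := range c {
		if e.Pid == pid {
			return e.Status
		}
	}
	return -1
}

func main() {
	m := NewMonitor()
	runcCmd := m.Subscribe()   // subscribed before the command is started
	container := m.Subscribe() // independent subscriber for a container pid

	m.Notify(Exit{Pid: 10, Status: 0})
	m.Notify(Exit{Pid: 42, Status: 137})

	fmt.Println("runc command exited with", waitPid(runcCmd, 10))
	fmt.Println("container exited with", waitPid(container, 42))
}
```

Because every subscriber gets its own channel, a caller that holds a lock while waiting for one pid does not delay delivery to the others, as long as their buffers have room.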
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Use the state pattern to handle process transitions from one state to
another and what actions can be performed on a process in a specific
state.
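As an illustration only, the state pattern could look roughly like the sketch below. The `state` interface, the three concrete states, and the `proc` wrapper are hypothetical and cover just two actions; the real set of states and actions is not described in this message.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalid = errors.New("operation not allowed in this state")

// state lists the actions a process supports. Each concrete state decides
// which actions are valid and which state follows.
type state interface {
	Name() string
	Start() (state, error)
	Exit() (state, error)
}

type created struct{}
type running struct{}
type stopped struct{}

func (created) Name() string          { return "created" }
func (created) Start() (state, error) { return running{}, nil }
func (created) Exit() (state, error)  { return stopped{}, nil }

func (running) Name() string          { return "running" }
func (running) Start() (state, error) { return running{}, errInvalid }
func (running) Exit() (state, error)  { return stopped{}, nil }

func (stopped) Name() string          { return "stopped" }
func (stopped) Start() (state, error) { return stopped{}, errInvalid }
func (stopped) Exit() (state, error)  { return stopped{}, errInvalid }

// proc delegates every action to its current state and only moves to the
// next state when the action succeeded.
type proc struct{ s state }

func (p *proc) Start() error {
	next, err := p.s.Start()
	if err != nil {
		return err
	}
	p.s = next
	return nil
}

func (p *proc) Exit() error {
	next, err := p.s.Exit()
	if err != nil {
		return err
	}
	p.s = next
	return nil
}

func main() {
	p := &proc{s: created{}}
	err := p.Start()
	fmt.Println(p.s.Name(), err)
	err = p.Start() // rejected: already running
	fmt.Println(p.s.Name(), err)
	err = p.Exit()
	fmt.Println(p.s.Name(), err)
}
```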
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This reverts commit 06dc87ae59.
Revert "Change oom metric to const"
This reverts commit e800f08f9f.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This removes the metric vec that was holding onto all task id and
namespace combinations forever, until containerd was restarted. This
was causing a memory leak with many tasks.
This also removes the shim cmd, whose `Args` is quite large, from the
reaper after the shim has been started, cutting down on another leak.
This is the first pass through the reaper but more code is required to
fix all the issues when commands are added.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This adds a null IO option for efficient handling of IO.
It provides a container directly with `/dev/null` and does not require
any io.Copy within the shim whenever a user does not want the IO of the
container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Because runc will delete a container after a successful checkpoint we
need to handle a NotFound error from runc on delete.
There is also a race between SIGKILL'ing the shim and it actually
exiting to unmount the task's rootfs, so we need to loop and wait for the
task to actually be reaped before trying to delete the rootfs+bundle.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When we generate protobufs, descriptors outlining all messages and
services are merged into a single file that can be used to identify
unexpected changes to the API that may affect stability. We follow a
similar process to Go's stability guarantees using the protobuf
descriptors to identify changes before they become a problem.
Please see README.md for details.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
This changes Wait() from returning an error whenever you call wait on a
stopped process/task to returning the exit status from the process.
This also adds the exit status to the Status() call on a process/task so
that a user can Wait(), check status, then cancel the wait to avoid
races in event handling.
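The behavior described above might be modeled as follows. This is a sketch with a hypothetical `task` type, not the client API: the exit status is recorded once, `Wait()` on an already stopped task returns it immediately, and `Status()` exposes the same value.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical process/task record.
type task struct {
	mu     sync.Mutex
	exited bool
	status uint32
	done   chan struct{}
}

func newTask() *task { return &task{done: make(chan struct{})} }

// setExited records the exit status once and wakes all waiters.
func (t *task) setExited(status uint32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.exited {
		return
	}
	t.exited = true
	t.status = status
	close(t.done)
}

// Wait blocks until the task exits. For a task that has already stopped it
// returns the recorded exit status right away instead of an error.
func (t *task) Wait() uint32 {
	<-t.done
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.status
}

// Status reports whether the task has exited and, if so, its exit status.
func (t *task) Status() (exited bool, status uint32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.exited, t.status
}

func main() {
	tk := newTask()
	tk.setExited(137)

	// Waiting after the exit still yields the status.
	fmt.Println(tk.Wait())
	fmt.Println(tk.Status())
}
```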
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This greatly reduces the risk that we will hit the unix socket maximum path
length.
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
This splits up the create and start of an exec process in the shim to
have two separate steps like the initial process. This will allow
better state reporting for individual processes along with a more robust
wait for execs.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This change further plumbs the components required for implementing
event filters. Specifically, we now have the ability to filter on the
`topic` and `namespace`.
In the course of implementing this functionality, it was found that
there were mismatches in the events API that created extra serialization
round trips. A modification to `typeurl.MarshalAny` and a clear
separation between publishing and forwarding allow us to avoid these
serialization issues.
Unfortunately, this has required a few tweaks to the GRPC API, so this
is a breaking change. `Publish` and `Forward` have been clearly separated in
the GRPC API. `Publish` honors the contextual namespace and performs
timestamping while `Forward` simply validates and forwards. The behavior
of `Subscribe` is to propagate events for all namespaces unless
specifically filtered (and hence the relation to this particular change).
The following is an example of using filters to monitor the task events
generated while running the [bucketbench tool](https://github.com/estesp/bucketbench):
```
$ ctr events 'topic~=/tasks/.+,namespace==bb'
...
2017-07-28 22:19:51.78944874 +0000 UTC bb /tasks/start {"container_id":"bb-ctr-6-8","pid":25889}
2017-07-28 22:19:51.791893688 +0000 UTC bb /tasks/start {"container_id":"bb-ctr-4-8","pid":25882}
2017-07-28 22:19:51.792608389 +0000 UTC bb /tasks/start {"container_id":"bb-ctr-2-9","pid":25860}
2017-07-28 22:19:51.793035217 +0000 UTC bb /tasks/start {"container_id":"bb-ctr-5-6","pid":25869}
2017-07-28 22:19:51.802659622 +0000 UTC bb /tasks/start {"container_id":"bb-ctr-0-7","pid":25877}
2017-07-28 22:19:51.805192898 +0000 UTC bb /tasks/start {"container_id":"bb-ctr-3-6","pid":25856}
2017-07-28 22:19:51.832374931 +0000 UTC bb /tasks/exit {"container_id":"bb-ctr-8-6","id":"bb-ctr-8-6","pid":25864,"exited_at":"2017-07-28T22:19:51.832013043Z"}
2017-07-28 22:19:51.84001249 +0000 UTC bb /tasks/exit {"container_id":"bb-ctr-2-9","id":"bb-ctr-2-9","pid":25860,"exited_at":"2017-07-28T22:19:51.839717714Z"}
2017-07-28 22:19:51.840272635 +0000 UTC bb /tasks/exit {"container_id":"bb-ctr-7-6","id":"bb-ctr-7-6","pid":25855,"exited_at":"2017-07-28T22:19:51.839796335Z"}
...
```
In addition to the events changes, we now display the namespace origin
of the event in the cli tool.
This will be followed by a PR to add individual field filtering for the
events API for each event type.
Signed-off-by: Stephen J Day <stephen.day@docker.com>
This adds a `platform` interface for the shim service to manage platform-specific
behaviors such as I/O (which uses epoll on Linux to work around bugs with applications
that close all consoles, see https://github.com/opencontainers/runc/pull/1434
and https://github.com/moby/moby/issues/27202).
It's expected that we only have 1 epollfd per containerd_shim to manage all processes.
Since all the work is done outside of the container runtime, upgrading runc
is not required and should be done separately.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>