cgroup2: monitor OOMKill instead of OOM to prevent missing container OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and the container cgroup (scope) both have the same memory limit applied. In
that situation, the kernel attributes the OOM event to the parent cgroup
(slice) and increments its 'oom' counter in memory.events; the child cgroup
(scope) only sees an 'oom_kill' increment. Since we monitor child cgroups for
OOM events, check the OOMKill field so that we don't miss events.
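
To make the asymmetry concrete, here is a minimal standalone sketch
(hypothetical helper, not containerd code) that reads both counters from a
cgroup's memory.events file; run it against the pod slice and the container
scope to see 'oom' land on the former and 'oom_kill' on the latter:

    package main

    import (
    	"bufio"
    	"fmt"
    	"os"
    	"strconv"
    	"strings"
    )

    // memoryEvents returns the 'oom' and 'oom_kill' counters from the
    // memory.events file in the given cgroup directory.
    func memoryEvents(cgroupDir string) (oom, oomKill uint64, err error) {
    	f, err := os.Open(cgroupDir + "/memory.events")
    	if err != nil {
    		return 0, 0, err
    	}
    	defer f.Close()
    	s := bufio.NewScanner(f)
    	for s.Scan() {
    		fields := strings.Fields(s.Text())
    		if len(fields) != 2 {
    			continue
    		}
    		n, perr := strconv.ParseUint(fields[1], 10, 64)
    		if perr != nil {
    			continue
    		}
    		switch fields[0] {
    		case "oom":
    			oom = n
    		case "oom_kill":
    			oomKill = n
    		}
    	}
    	return oom, oomKill, s.Err()
    }

    func main() {
    	if len(os.Args) != 2 {
    		fmt.Fprintln(os.Stderr, "usage: memevents <cgroup-dir>")
    		os.Exit(1)
    	}
    	oom, kill, err := memoryEvents(os.Args[1])
    	if err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	fmt.Printf("oom=%d oom_kill=%d\n", oom, kill)
    }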

This is not visible when running containers through docker or ctr, because
they set the limits differently (only at the container level). An alternative
fix would be to not configure limits at the pod level: that way the container
limit is hit and the OOM is generated and attributed correctly. An interesting
consequence of the Kubernetes configuration is that OOM events also work
correctly for pods with multiple containers (the arithmetic is sketched after
the list), because:

a) if one of the containers has no limit, the pod has no limit either, so an
   OOM in another container is still reported against that container.
b) if all of the containers have limits, then the pod limit is the sum of the
   container limits, so a container will always hit its own limit first.
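
A minimal sketch of the limit arithmetic behind (a) and (b) (hypothetical
helper, not kubelet code; assumes a missing limit is encoded as <= 0):

    package main

    import "fmt"

    // podMemoryLimit mirrors the rule described above: if any container is
    // unlimited the pod is unlimited too, otherwise the pod limit is the
    // sum of the container limits.
    func podMemoryLimit(containerLimits []int64) (limit int64, limited bool) {
    	var sum int64
    	for _, l := range containerLimits {
    		if l <= 0 { // a container without a memory limit
    			return 0, false // the pod cgroup gets no limit either
    		}
    		sum += l
    	}
    	return sum, true
    }

    func main() {
    	// Containers limited to 100Mi and 200Mi: the pod limit is 300Mi,
    	// so each container hits its own limit first (case b).
    	fmt.Println(podMemoryLimit([]int64{100 << 20, 200 << 20}))
    	// One container has no limit: the pod has no limit either (case a).
    	fmt.Println(podMemoryLimit([]int64{100 << 20, 0}))
    }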

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>

@@ -71,15 +71,15 @@ func (w *watcher) Run(ctx context.Context) {
 				continue
 			}
 			lastOOM := lastOOMMap[i.id]
-			if i.ev.OOM > lastOOM {
+			if i.ev.OOMKill > lastOOM {
 				if err := w.publisher.Publish(ctx, runtime.TaskOOMEventTopic, &eventstypes.TaskOOM{
 					ContainerID: i.id,
 				}); err != nil {
 					logrus.WithError(err).Error("publish OOM event")
 				}
 			}
-			if i.ev.OOM > 0 {
-				lastOOMMap[i.id] = i.ev.OOM
+			if i.ev.OOMKill > 0 {
+				lastOOMMap[i.id] = i.ev.OOMKill
 			}
 		}
 	}
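
For reference, the OOM and OOMKill counters compared above are fields of the
event struct delivered by the containerd/cgroups v2 watcher. Its shape is
recalled here from that package (check the vendored source for the
authoritative definition); each field mirrors one line of memory.events:

    // Recalled shape of the Event struct in github.com/containerd/cgroups/v2.
    type Event struct {
    	Low     uint64 // "low": reclaimed despite usage under the low boundary
    	High    uint64 // "high": throttled after exceeding the high boundary
    	Max     uint64 // "max": usage ran into memory.max
    	OOM     uint64 // "oom": counted against the cgroup whose limit was hit
    	OOMKill uint64 // "oom_kill": counted where the killed process lived
    }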