nohang/README.md
Alexey Avramov b9d2fed004 fix readme
2019-02-12 01:41:32 +09:00


# Nohang
Nohang is a highly configurable daemon for Linux that prevents [out of memory](https://en.wikipedia.org/wiki/Out_of_memory) (OOM) conditions and keeps the system responsive when memory is low.
## What is the problem?
OOM conditions may cause [freezes](https://en.wikipedia.org/wiki/Hang_(computing)), [livelocks](https://en.wikipedia.org/wiki/Deadlock#Livelock), dropped [caches](https://en.wikipedia.org/wiki/Page_cache), and processes being killed (by sending [SIGKILL](https://en.wikipedia.org/wiki/Signal_(IPC)#SIGKILL)) instead of being terminated correctly (by sending [SIGTERM](https://en.wikipedia.org/wiki/Signal_(IPC)#SIGTERM) or taking another corrective action). Some applications may crash if memory cannot be allocated.
![pic](https://i.imgur.com/9yuZOOf.png)
Here are the statements of some users:
> "How do I prevent Linux from freezing when out of memory?
Today I (accidentally) ran some program on my Linux box that quickly used a lot of memory. My system froze, became unresponsive and thus I was unable to kill the offender.
How can I prevent this in the future? Can't it at least keep a responsive core or something running?"
— [serverfault](https://serverfault.com/questions/390623/how-do-i-prevent-linux-from-freezing-when-out-of-memory)
> "With or without swap it still freezes before the OOM killer gets run automatically. This is really a kernel bug that should be fixed (i.e. run OOM killer earlier, before dropping all disk cache). Unfortunately kernel developers and a lot of other folk fail to see the problem. Common suggestions such as disable/enable swap, buy more RAM, run less processes, set limits etc. do not address the underlying problem that the kernel's low memory handling sucks camel's balls."
— [serverfault](https://serverfault.com/questions/390623/how-do-i-prevent-linux-from-freezing-when-out-of-memory#comment417508_390625)
Also look at [Why are low memory conditions handled so badly?](https://www.reddit.com/r/linux/comments/56r4xj/why_are_low_memory_conditions_handled_so_badly/) (discussion with 480+ posts on r/linux).
## Solution
- Use of [earlyoom](https://github.com/rfjakob/earlyoom). This is a simple and very lightweight OOM preventer written in C (the best choice for embedded systems and old servers). It has minimal dependencies and can work with the oldest kernels.
- Use of [oomd](https://github.com/facebookincubator/oomd). This is a userspace OOM killer for Linux systems written in C++ and developed by Facebook. Needs Linux 4.20+.
- Use of `nohang` (maybe this is a good choice for modern desktops and servers if you need fine tuning).
![pic](https://i.imgur.com/Hq1JPMs.png)
The tools listed above may work at the same time on one computer.
## Some features
- `SIGKILL` and `SIGTERM` as signals that can be sent to the victim
- the ability to send any signal instead of SIGTERM for processes with certain names ([screenshot](https://i.imgur.com/cs1PRC5.png))
- impact on the badness of processes via matching their names, cmdlines and UIDs with regular expressions
- possibility of restarting processes via a command like `systemctl restart something` if the process is selected as a victim (or running any other command)
- GUI notifications:
    - OOM prevention results (displays the signal sent and the PID and name of the victim)
    - Low memory warnings (displays available memory and the name of the fattest process)
- `zram` support (`mem_used_total` as a trigger)
- [PSI](https://lwn.net/Articles/759658/) support (requires Linux 4.20+, using `/proc/pressure/memory` and `some avg10` as a trigger)
- customizable intensity of monitoring
- convenient configuration with a ~~well~~ commented [config file](https://github.com/hakavlad/nohang/blob/master/nohang.conf)
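
The badness adjustment listed above can be pictured with a minimal sketch. It assumes a hypothetical rule list of `(field, pattern, adjustment)` tuples and a hypothetical clamp at zero; neither is nohang's real config syntax or code. It uses `re.search()`, which the changelog mentions as the matching function.

```python
import re

# Hypothetical rules for illustration only: (field, regular expression, adjustment).
RULES = [
    ("name", r"^chromium$", 200),
    ("cmdline", r"--type=renderer", 100),
    ("uid", r"^0$", -300),
]


def adjusted_badness(base, proc, rules=RULES):
    """Return the base badness plus the adjustment of every matching rule."""
    total = base
    for field, pattern, adjustment in rules:
        # re.search() matches anywhere in the string, unlike re.fullmatch().
        if re.search(pattern, str(proc[field])):
            total += adjustment
    # Assumption: badness is never allowed to drop below zero.
    return max(total, 0)


proc = {
    "name": "chromium",
    "cmdline": "/usr/lib/chromium/chromium --type=renderer",
    "uid": 1000,
}
print(adjusted_badness(14, proc))  # 314
```

Here the name rule and the cmdline rule both match, so 200 and 100 are added to the base value of 14.
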
## Requirements
For basic usage:
- `Linux` 3.14+ (since `MemAvailable` appeared in `/proc/meminfo`)
- `Python` 3.3+ (not tested with earlier versions)
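
To check whether a kernel provides `MemAvailable`, the field can be looked up in `/proc/meminfo`. The following is a minimal sketch, not nohang's own code; the function name `mem_available_kib` and the sample values are made up for illustration.

```python
def mem_available_kib(meminfo_text):
    """Return MemAvailable in KiB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    2048000 kB"
            return int(line.split()[1])
    return None


# Sample text with made-up values; on a real system, read /proc/meminfo instead.
SAMPLE = """MemTotal:        8052540 kB
MemFree:          512340 kB
MemAvailable:    2048000 kB
"""

print(mem_available_kib(SAMPLE))  # 2048000
```

On a live system, pass `open('/proc/meminfo').read()` to the function. A result of `None` suggests the kernel is older than 3.14.
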
To show GUI notifications:
- [notification server](https://wiki.archlinux.org/index.php/Desktop_notifications#Notification_servers) (most desktop environments use their own implementations)
- `libnotify` (Fedora, Arch Linux) or `libnotify-bin` (Debian GNU/Linux, Ubuntu)
- `sudo` if nohang is started with UID=0
To use `PSI` (pressure stall information):
- `Linux` 4.20+
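
The `some avg10` value lives on the `some` line of `/proc/pressure/memory`. Below is a minimal sketch of extracting it, assuming the usual `key=value` layout of PSI files. The function name `psi_some_avg10` and the sample numbers are illustrative, and this is not nohang's own parser.

```python
def psi_some_avg10(psi_text):
    """Return the avg10 value of the 'some' line, or None if it is not found."""
    for line in psi_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    return None


# Sample text with made-up values; on Linux 4.20+ read /proc/pressure/memory.
SAMPLE = """some avg10=1.25 avg60=0.40 avg300=0.10 total=123456
full avg10=0.50 avg60=0.10 avg300=0.02 total=65432
"""

print(psi_some_avg10(SAMPLE))  # 1.25
```

If `/proc/pressure/memory` does not exist, PSI is unavailable on that kernel.
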
## Memory and CPU usage
- VmRSS is about 10 MiB
- CPU usage depends on the level of available memory and monitoring intensity
## Download, install, uninstall
Please use the latest [release version](https://github.com/hakavlad/nohang/releases). The current development version may be unstable.
To download the latest stable version (v0.1):
```bash
$ wget -ct0 https://github.com/hakavlad/nohang/archive/v0.1.tar.gz
$ tar xvzf v0.1.tar.gz
$ cd nohang-0.1
```
or to clone the latest unstable:
```bash
$ git clone https://github.com/hakavlad/nohang.git
$ cd nohang
```
To install:
```bash
$ sudo make install
```
To enable and start on systems with systemd:
```bash
$ sudo make systemd
```
To uninstall:
```bash
$ sudo make uninstall
```
For Arch Linux, there's an [AUR package](https://aur.archlinux.org/packages/nohang-git/). Use your favorite [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers). For example,
```bash
$ yay -S nohang-git
$ sudo systemctl start nohang
$ sudo systemctl enable nohang
```
## How to configure nohang
The program can be configured by editing the [config file](https://github.com/hakavlad/nohang/blob/master/nohang.conf). The configuration includes the following sections:
1. Memory levels to respond to as an OOM threat
2. The frequency of checking the level of available memory (and CPU usage)
3. The prevention of killing innocent victims
4. Impact on the badness of processes via matching their names, cmdlines and UIDs with regular expressions
5. The execution of a specific command or sending any signal instead of sending the SIGTERM signal
6. GUI notifications:
    - results of preventing OOM
    - low memory warnings
7. Output verbosity
Just read the description of the parameters and edit the values. Please restart nohang to apply changes. The default path to the config file after installation is `/etc/nohang/nohang.conf`.
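
For readers curious how a `key = value` config of this kind can be read, here is a minimal sketch. It assumes `#` comment lines and uses `line.partition('=')`, which the changelog names as the splitting method. The key names in the sample are hypothetical and are not real nohang parameters.

```python
def parse_config(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # partition() splits at the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line; a real parser might report an error
        config[key.strip()] = value.strip()
    return config


# Hypothetical keys for illustration only.
SAMPLE = """# a comment
example_threshold = 10 %
example_regex = ^foo=bar$
"""

config = parse_config(SAMPLE)
print(config["example_threshold"])  # 10 %
print(config["example_regex"])      # ^foo=bar$
```

Splitting at the first `=` only is what keeps a value such as a regular expression containing `=` intact.
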
## oom-sort
`oom-sort` is an additional diagnostic tool that is installed with the `nohang` package. It sorts the processes in descending order of their `oom_score` and also displays `oom_score_adj`, `Uid`, `Pid`, `Name`, `VmRSS`, `VmSwap` and optionally `cmdline`. Run `oom-sort --help` for more info.
Usage:
```
$ oom-sort
```
The output looks like this:
```
oom_score oom_score_adj   Uid   Pid Name            VmRSS    VmSwap   cmdline
--------- ------------- ----- ----- --------------- -------- -------- -------
      314           300  1000   991 chromium            84 M      0 M /usr/lib/chromium/chromium --type=renderer --field-trial-handle=868244496792098610,5765419126773948943,131072 --service-pipe-token=14782672631740123203 --lang=ru --user-data-dir=/tmp/tmp.TJ91B6F0zB --disable-client-side-phishing-detection --enable-offline-auto-reload --enable-offline-auto-reload-visible-only --num-raster-threads=1 --service-request-channel-token=14782672631740123203 --renderer-client-id=4 --no-v8-untrusted-code-mitigations --shared-files=v8_context_snapshot_data:100,v8_natives_data:101
      307           300  1000  1124 chromium            44 M      0 M /usr/lib/chromium/chromium --type=renderer --field-trial-handle=868244496792098610,5765419126773948943,131072 --service-pipe-token=10276223625123198448 --lang=ru --user-data-dir=/tmp/tmp.TJ91B6F0zB --disable-client-side-phishing-detection --enable-offline-auto-reload --enable-offline-auto-reload-visible-only --num-raster-threads=1 --service-request-channel-token=10276223625123198448 --renderer-client-id=6 --no-v8-untrusted-code-mitigations --shared-files=v8_context_snapshot_data:100,v8_natives_data:101
      217           200  1000   962 chromium            99 M      0 M /usr/lib/chromium/chromium --type=gpu-process --field-trial-handle=868244496792098610,5765419126773948943,131072 --user-data-dir=/tmp/tmp.TJ91B6F0zB --disable-breakpad --gpu-preferences=KAAAAAAAAACAAABAAQAAAAAAAAAAAGAAAAAAAAEAAAAIAAAAAAAAAAgAAAAAAAAA --user-data-dir=/tmp/tmp.TJ91B6F0zB --service-request-channel-token=2848128951654484113
      202           200  1000  1032 chromium            16 M      0 M /usr/lib/chromium/chromium --type=-broker
       43             0  1000   736 firefox-esr        251 M      0 M /usr/lib/firefox-esr/firefox-esr
       21             0  1000   914 chromium           124 M      0 M /usr/lib/chromium/chromium --show-component-extension-options --ignore-gpu-blacklist --no-default-browser-check --disable-pings --media-router=0 --enable-remote-extensions --user-data-dir=/tmp/tmp.TJ91B6F0zB
       17             0  1000   844 Web Content        103 M      0 M /usr/lib/firefox-esr/plugin-container -greomni /usr/lib/firefox-esr/omni.ja -appomni /usr/lib/firefox-esr/browser/omni.ja -appdir /usr/lib/firefox-esr/browser 736 true tab
       16             0  1000 31555 dolphin             95 M      0 M dolphin
       15             0     0   863 Xorg                92 M      0 M /usr/lib/xorg/Xorg :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
        8             0   110   860 tor                 50 M      0 M /usr/bin/tor --defaults-torrc /usr/share/tor/tor-service-defaults-torrc -f /etc/tor/torrc --RunAsDaemon 0
        8             0  1000   918 chromium            48 M      0 M /usr/lib/chromium/chromium --type=zygote --user-data-dir=/tmp/tmp.TJ91B6F0zB
        7             0  1000  1106 mate-panel          43 M      0 M mate-panel
        6             0  1000  1157 wnck-applet         35 M      0 M /usr/lib/mate-panel/wnck-applet
```
Kthreads, zombies and Pid 1 will not be displayed.
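
The core idea, sorting processes by descending `oom_score`, can be sketched in a few lines. This is only an illustration and not the actual `oom-sort` implementation: the function name `read_oom_scores` and its `proc_root` parameter are assumptions, it reads only `oom_score`, and it does not filter out kernel threads, zombies or PID 1.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return (oom_score, pid) pairs sorted by descending oom_score."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                rows.append((int(f.read().strip()), int(entry)))
        except (OSError, ValueError):
            continue  # the process exited or its file is unreadable
    # Highest score first; ties are ordered by ascending PID.
    return sorted(rows, key=lambda row: (-row[0], row[1]))


if __name__ == "__main__" and os.path.isdir("/proc"):
    for score, pid in read_oom_scores()[:5]:
        print(score, pid)
```

The `proc_root` parameter exists only so the function can be pointed at a test directory instead of the real `/proc`.
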
## Logging
To view the latest entries in the log (for systemd users):
```bash
$ sudo journalctl -eu nohang
```
See also `man journalctl`.
## Known problems
- Awful documentation (the problem will be solved gradually in the next releases)
- It is written in Python and is actually a prototype (although the algorithm may be good)
- No tests (by itself this does not make the algorithm bad)
## Todo
- Rewrite all code in Golang with tests and good documentation.
## Nohang won't help you
if you run
```bash
$ while true; do setsid tail /dev/zero; done
```
(although with some settings, nohang can even handle it)
## Contribution
Please create [issues](https://github.com/hakavlad/nohang/issues). Use cases, feature requests and any questions are welcome.
## Changelog
- In progress
    - [x] Improve output:
        - [x] Display `oom_score`, `oom_score_adj`, `PPID`, `EUID`, `State`, `VmSize`, `RssAnon`, `RssFile`, `RssShmem`, `realpath` and `cmdline` of the victim in corrective action reports
        - [x] Print in terminal with colors
        - [x] Print statistics on corrective actions after each corrective action
    - [x] Improve poll rate algorithm
    - [x] Improve victim search algorithm (do it ~30% faster)
    - [x] Improve limiting `oom_score_adj`: now it works with UID != 0
    - [x] Improve GUI warnings:
        - [x] Find env without running `ps`
        - [x] Handle all timeouts when notify-send starts
    - [x] Fix conf parsing: use of `line.partition('=')` instead of `line.split('=')`
    - [x] Add `oom-sort`
    - [x] Reduce memory usage (remove `import argparse`)
    - [x] Remove CLI options (need to add it again via `sys.argv`)
    - [x] Remove self-defense options from config, use systemd unit scheduling instead
    - [x] Add the ability to send any signal instead of SIGTERM for processes with certain names
    - [x] Handle `UnicodeDecodeError` if victim name consists of many unicode characters
    - [x] Fix `mlockall()` using `MCL_ONFAULT` and lock all memory by default
    - [ ] Add `PSI` support (using `/proc/pressure/memory`, need Linux 4.20+)
    - [ ] Redesign of the config
    - [ ] Improve user input validation
    - [ ] Redesign of the GUI notifications
    - [ ] Improve modifying badness via matching with regular expressions:
        - [x] Add the ability to set many different `badness_adj` for processes depending on the matching `name`, `cmdline` and `euid` with the specified regular expressions
        - [x] Fix: replace `re.fullmatch()` by `re.search()`
        - [ ] Validate RE patterns at startup
- [v0.1](https://github.com/hakavlad/nohang/releases/tag/v0.1), 2018-11-23
    - 1st release