Adding meeting keyword due to some initial skepticism from Anaconda developers. We might need to have a formal vote here.

I started doing that, but there were too many unknowns for that page to be useful and the deadline is today, and I won't have the answers by that time.

IMO it would actually be perfectly fine to do this as an F32-timeframe change proposal instead of for F31. That way users have plenty of time to report possible fallout. This is a very longstanding problem, so fixing it ASAP doesn't seem urgent.

OK so all of this work is fine by me, but at the same time it totally steps on all the work and coordination I just wrapped up in issue 56. I was expecting that to happen at some point, just not this quickly.

There's no error handling in the anaconda code. I don't know for sure what happens when PR 87 lands, but I suspect on LiveOS we'll see two swap on zram devices, the one with higher priority will get used and the other will be a benign extra.

I'll close this as I'm not interested in working on this for Fedora 32. It was made abundantly clear that I should have coordinated with an effort I knew nothing about and that my work was "[not] even done".

There are bugs already opened for the individual changes though, so coordination should be straightforward.

I very much appreciate @hadess's contribution to a generic swap on zram solution so far. A developer once reminded me that Fedora is supposed to be fun. The way I expressed surprise at all of his work actually triggering changes is contrary to having fun. And for that I apologize.

There are actually quite a few things that get touched by this feature, and I think it should go through the feature process for a future Fedora release. I'm happy to negotiate the bureaucracy aspect of this and make sure there's proper coordination and testing. But I can barely bash my way out of a hat, so there is no way I'm going to magically get the systemd zram-generator working on my own.

I think the first step is to establish whether the systemd zram-generator project is going to be the accepted generic swap on zram implementation by Anaconda, Workstation, and IoT folks. And concurrently sort out who, or at least that someone, will respond when it breaks. When it breaks, we'll need a pretty fast fix. Right now it's broken. I don't think there is a viable feature if there isn't agreement on, a) a generic solution; b) someone to maintain it; c) for it to actually be working. And right now none of those three things is for sure true or known.

Anaconda folks aren't in favor of switching to merely a different swap on zram solution than what they have. I think it's reasonable to want something robust that is sufficiently upstream that it can and will be used by other distributions. Along those lines, something I need to follow up on: Arch has recently discussed moving to swap on zram by default too.

@hadess or anyone else experiencing this problem; I'd like as clear and simple a set of reproduction steps as possible for this statement:

Unfortunately, especially on interactive systems such as the Workstation variants, hitting the disk-based swap under low-memory conditions renders the machine completely unusable. The disk-based swap is not fast enough to free up physical memory to keep the machine's interactivity.
https://bugzilla.redhat.com/show_bug.cgi?id=1731978

I have experienced cases like that myself of course, but I do not have a consistent reproducer and want to make certain I'm seeing the same thing everyone else is talking about with a relevant Workstation specific use case example. Post it here or in a BZ, whichever is appropriate. Thanks.

BTW I noticed that Anaconda's automatic partitioning defaults to creating a swap partition equal to the amount of RAM, which can result in ludicrous swap sizes on systems with large amounts of RAM. So that's something else to keep in mind.

I have experienced cases like that myself of course, but I do not have a consistent reproducer and want to make certain I'm seeing the same thing everyone else is talking about with a relevant Workstation specific use case example. Post it here or in a BZ, whichever is appropriate. Thanks.

BTW I noticed that Anaconda's automatic partitioning defaults to creating a swap partition equal to the amount of RAM, which can result in ludicrous swap sizes...

I just mentioned that in bug 1731978. In ancient times, swap at 1x RAM was 4MB. And compared to drive performance, 16-32GB is just goofycakes. But it has to be at least 1x RAM if your view is that hibernation is a plausible use case to support; really, hibernation requires space for 100% of the total used memory, which is RAM+swap. So plausibly hibernation requires 2x RAM. There's a reason swap and hibernation files are decoupled on macOS and Windows, and why Microsoft has effectively abandoned this style of hibernation. But as far as I know, there's no support on Linux for modernizing it and dealing with all the firmware bugs.

something absurd like ninja -j64 might do the trick

Isn't that pathological? I mean, should I really be considering things that are not realistically a good idea in normal usage? Anyway the test system I have has 4 real cores, 8 with hyperthreading, and only 8GB RAM, so I think this should be very straightforward to trigger with 8GB of swap on SSD.

Isn't that pathological? I mean, should I really be considering things that are not realistically a good idea in normal usage? Anyway the test system I have has 4 real cores, 8 with hyperthreading, and only 8GB RAM, so I think this should be very straightforward to trigger with 8GB of swap on SSD.

I think 'ninja' with no -j args should default to -j8 or -j10, it's either nproc or nproc+2, something like that. I'm fairly confident 8GB is not enough, so it should hang without any -j passed. But if not, you can play with -j and see what it takes.
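To make the "play with -j" suggestion concrete, here is a small sketch of picking a job count by available memory instead of taking ninja's CPU-based default. The roughly-1-GiB-per-compile-job figure is my assumption, not anything ninja documents (WebKit jobs can need considerably more):

```shell
#!/bin/sh
# Sketch: cap ninja's parallelism by available memory rather than CPU
# count, assuming ~1 GiB of RAM per C++ compile job (an assumption).
avail_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
mem_jobs=$(( avail_kib / (1024 * 1024) ))   # jobs that fit in RAM
cpu_jobs=$(nproc)
if [ "$mem_jobs" -lt "$cpu_jobs" ]; then jobs=$mem_jobs; else jobs=$cpu_jobs; fi
if [ "$jobs" -lt 1 ]; then jobs=1; fi
echo "ninja -j $jobs"
```

On the 8GB test system above this would land well below ninja's default of 10, which is the point.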

https://lkml.org/lkml/2019/8/4/15 is relevant, although it shows the limits of what we hope to achieve here: we're hoping that disabling swap will be our solution, but in this example swap is already disabled and everything goes wrong anyway.

The kernel developers have posted a patch. Who knows, maybe they will finally magically solve this for us after all these years....

https://lkml.org/lkml/2019/8/4/15 is relevant, although it shows the limits of what we hope to achieve here: we're hoping that disabling swap will be our solution, but in this example swap is already disabled and everything goes wrong anyway.
The kernel developers have posted a patch. Who knows, maybe they will finally magically solve this for us after all these years....

Even if it works around the worst behavioural problems, it won't fix the fact that hitting the disk swap is bad, and that we should prefer RAM compression to disk based swap.

Summary:
Whether I use 8GiB swap on SSD plain partition, or 8GiB swap on ZRAM (a 1:1 ratio), the system eventually hangs, not even the mouse pointer will move. I gave up after 30 minutes, and forced power off.

The central problem in this example build and test system is ninja's default number of jobs, which is autocalculated somehow. Whether N jobs is derived from package configuration, cmake, or ninja itself, the default used for this package on this system is guaranteed to fail the build, and always results in an unresponsive system that most users would take for a totally lost system.

Does swap on ZRAM help? Performance wise, no. The problem happens sooner, and is more abrupt than swap on SSD plain partition. It does help reduce wear on the SSD: a successful build will write 15GiB to disk, and nearly another 15GiB in swap writes. But in both cases the default build fails, so which one fails better or worse is irrelevant for our purposes.

What did work? ninja -j 4 resulted in a responsive system the entire time, except for a few brief periods lasting less than 15s near the end. I was able to use Firefox for browsing with 8-12 tabs, and concurrently running youtube video without any stuttering. And the build did finish. The configuration for this was 8GiB swap on SSD plain partition.

Based on this limited testing, I can't recommend only moving to swap on ZRAM. First, we need better build application defaults. It's not reasonable for developers to have to know these things; defaults should not cause the system to blow up and the build to fail 100% of the time on a reasonable, even if limited, configuration where manual intervention allows it to succeed and gives exactly the user experience we want.

Extended cut

What's going on with swap on ZRAM? With a 1:1 ratio on the test system, /dev/zram0 is 8GiB, the same as available RAM. But ZRAM device usage is allocated dynamically. If all of the swap gets consumed, and I have screenshots showing it was, at a best-case 2:1 compression ratio this uses 4GiB of RAM. Pilfering that much memory away from the build basically wedged the entire build process, comprised of at minimum 20 processes (more on that in a bit). It's just untenable; the system flat out needs more memory to use ninja's defaults. Whereas with SSD, the system actually wasn't as memory starved, even though swap on SSD is less efficient than swap in memory.

The memory starvation is why the swap on ZRAM case failed sooner and more abruptly, with no responsiveness even via a remote ssh connection. In the swap on SSD case, while the GUI became totally unresponsive in the same way, there was partial responsiveness via a remote ssh connection, but it wasn't good enough to regain control. The oom killer was never invoked in either case.
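Spelling out that arithmetic (the 8GiB device size and 2:1 ratio are the figures measured above):

```shell
# A fully-consumed 8 GiB zram swap at a 2:1 compression ratio still
# occupies 8/2 = 4 GiB of physical RAM: half of this 8 GiB machine.
swap_gib=8
ratio=2
echo "RAM pinned by zram: $(( swap_gib / ratio )) GiB"
# prints: RAM pinned by zram: 4 GiB
```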

Interestingly, repeating the test with a 1/4-sized swap on ZRAM, the oom killer was invoked just before the midway point of the build: the build fails, all build processes quit, and complete system recovery happens relatively quickly (1-2 minutes). But the build fails. This data point suggests it's possible to overcommit /dev/zram and cause worse memory starvation. And it just underscores that the real problem is the application asking for too many resources.

What is ninja doing by default? When I run ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=10 on this system]

If I reboot with nr_cpus=4, and rerun ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=6 on this system]

I'm gonna guess its metric is to set N jobs to nproc + 2. Each job actually means a minimum of two processes, so -j 10 translates to ten c++ and ten gcc/cc1plus processes running concurrently. These defaults strike me as intended for a dedicated headless build system with a ton of resources. They're wildly inappropriate for a developer's desktop or laptop running Fedora Workstation while doing other work at the same time as the build. So I think our true task is how to get better defaults: either convince upstreams that their build defaults need to target individual user machines, and burden build systems with custom build settings, or do the work of containerizing build applications so they can't (basically) sabotage users' systems with inappropriate defaults.
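One possible shape for the "containerize the build" idea is a transient systemd unit with resource caps. MemoryMax and TasksMax are real systemd resource-control properties; the specific values, and the idea of wrapping ninja this way at all, are just a sketch:

```shell
# Sketch: wrap a build in a transient scope so MemoryMax/TasksMax apply
# to the whole process tree, and a runaway build gets killed as a unit.
# Printed rather than executed here, since it needs a running systemd
# user session.
build_cmd="ninja -j 4"
wrapper="systemd-run --user --scope -p MemoryMax=6G -p TasksMax=64"
echo "$wrapper $build_cmd"
```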

A secondary task, which can be done concurrently with the above: more testing is necessary to figure out the optimal ZRAM device size. I think 1:1 is too aggressive. Perhaps it can be 1/3 or 1/4 the size of RAM. But at this point in my testing I can't recommend only moving to swap on ZRAM; it mostly just means rearranging the deck chairs.
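For reference, the sizing knob in the systemd zram-generator mentioned earlier is a small config file; a 1/4-of-RAM device along the lines suggested above might look like this (key name per the zram-generator project at the time; treat as a sketch, not a recommendation):

```ini
# /etc/systemd/zram-generator.conf
[zram0]
zram-fraction = 0.25
```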

Is this worth broader discussion on devel@? Maybe bring in more subject matter experts about what the proper relative behaviors should be for build defaults, the kernel, swap behavior, etc? And maybe encourage more testing/experimentation in this area?

What is ninja doing by default? When I run ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=10 on this system]
If I reboot with nr_cpus=4, and rerun ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=6 on this system]
I'm gonna guess its metric is to set N jobs to nproc + 2. Each job actually means a minimum of two processes, so -j 10 translates to ten c++ and ten gcc/cc1plus processes running concurrently. These defaults strike me as intended for a dedicated headless build system with a ton of resources. They're wildly inappropriate for a developer's desktop or laptop running Fedora Workstation while doing other work at the same time as the build.

nproc + 2 is a little aggressive, but not unreasonable for C projects or small C++ projects. But clearly it's way too much for WebKit. (It also works extremely well for my high-end system, with ridiculous specs that we should not optimize for. Using a lower value would make my builds drastically slower.) Ideally make and ninja would learn to look at system memory pressure and decide whether to launch a new build process based on that, but it's probably too much to expect. :/ Trying to trigger OOM earlier seems like a better bet.


While monitoring top and iotop as this happens, it looked like a case of Ouroboros, the snake eating itself. It's not the fault of ZRAM per se, it's just that it has a different and faster failure mode once the system was set up to fail from the outset. Had the resource demand not reached the critical level, ZRAM would have relieved the pressure better than swap on SSD.

As other people have pointed out, swapping anonymous pages out is only one out of two tools that the kernel has to reclaim memory - the other tool is reclaiming pages from the page cache - including things like mapped-in program code.

So adjusting the way we swap by itself is unlikely to make a huge difference. We really need the kernel to be making a decision in some fashion "this process / these pages are less important" - "this process / these pages are more important".

The main tool that seems to be available in this area is the cgroups memory controller. (cgroups v2 has more abilities in this area, like per-cgroup pressure stall information - https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html) I don't think there's any ability to "prioritize" things, but there are various controls over minimum/maximum amounts of memory used. You can even set "swappiness" per-cgroup.

So, a vague idea would be to try to arrange things so that gnome-shell, gnome-terminal, dbus-broker, and other tasks critical to interactive performance are in one part of the cgroup hierarchy, applications and terminal-spawned processes are in another, and then try to protect a minimum amount of memory for the critical processes.
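Sketched as systemd configuration, that protection could be a MemoryLow setting on the slice holding the critical processes (MemoryLow is systemd's knob for the cgroup v2 memory.low protection; the slice name and the 2G figure are illustrative assumptions, not a tested recommendation):

```ini
# /etc/systemd/system/session.slice.d/50-protect-shell.conf (sketch)
[Slice]
# Reclaim memory from this slice only after unprotected slices have
# been squeezed, keeping the shell/terminal responsive under pressure.
MemoryLow=2G
```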

You'd really like it so that in a low memory situation Firefox staying interactive was prioritized over ninja spawning 10 g++'s - but that seems harder - you don't want to say that on an 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

Definitely bringing this to a wider audience would be a good idea - there may be more tools that we aren't aware of.

I think 1:1 is too aggressive. Perhaps it can be 1/3 or 1/4 the size of RAM. But at this point in my testing I can't recommend only moving to swap on ZRAM, it mostly just means rearranging the deck chairs.

Based on this limited testing, I can't recommend only moving to swap on ZRAM. First, we need better build application defaults. It's not reasonable for developers to have to know these things, defaults should not cause the system to blow up and the build to fail 100% of the time on a reasonable, even if limited configuration, where manual intervention allows it to succeed and have exactly the user experience we want.

WebKit has always been a pain to get compiled on any system, I'm not sure we need to base ourselves off of this one workload test.

The gist is, perhaps multiple swaps, some of which are not normally active, and are activated upon hibernation. I will ask Lennart some questions about this. And also what happens if e.g. there are two swaps: ZRAM and disk, and the user tries to hibernate.

Also, there is zswap, which is a totally different thing. It uses both a definable memory pool for swap and spills over to a conventional disk-based swap, all of which is always compressed. It really fits this use case nicely; the gotcha is, it's flagged experimental in the kernel documentation. I just pinged someone upstream about it to see if that's really the current state. (I'm assuming something marked experimental is not something we want to use in default installations.)
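For context, zswap is toggled through kernel module parameters rather than a service. The parameter names below are the documented ones; the values are just an illustration, not a recommendation:

```
# On the kernel command line (or at runtime via
# /sys/module/zswap/parameters/):
zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20
```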

@hadess I agree this is a complicated issue (the general and specific problems, as well as this ticket). It's sorta like a dam with holes spouting water and wanting to fill them all while deciding which ones we can work on and when and in what order, etc.

The webkit example is badass because of its simplicity and clarity as an unprivileged task fork bombing the system. I don't mean to indicate that specific case must be solved in this ticket. But I'm also skeptical of desktop specific bandaids that paper over lower level deficiencies.

And I like the low-memory-monitor concept insofar as it can provide feedback to the user and give them some control over a runaway train situation. But I also strongly take the position a non-sysadmin user can never be held responsible for an unprivileged task taking down the system. Perhaps I want my cake and to eat it too, and so be it.

There's this bug which has been bugging many people for many years already and which is reproducible in less than a few minutes under the latest and greatest kernel, 5.2.6. All the kernel parameters are set to defaults.

You'd really like it so that in a low memory situation Firefox staying interactive was prioritized over ninja spawning 10 g++'s - but that seems harder - you don't want to say that on a 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

@hakavlad In the web browser reproduce case, what happens with earlyoom or nohang? Does it kill off just a child process of the browser, i.e. a single tab? Is it randomly chosen? Would it kill off the whole browser? Could it kill off some other process entirely, or does it try to kill the parent process whose combined resource consumption is the greatest?

In the ninja+webkitgtk example, killing off ninja itself would be good, whereas the kernel's existing oom-killer either plays with it like a cat does a mouse and maybe kills it tomorrow, or abruptly kills just one child process, leaving the others to continue on for quite some time doing unnecessary work now that the build has failed. Conversely, in the browser example I would be OK with a tab getting killed off, but not my entire browser session.

And then what if I try to reproduce the browser and ninja cases at the same time? I'd like the browser tab (child process) consuming the most CPU+memory to be treated as the most expendable. But which goes next really is a gray area: a secondary browser tab, or the entire compile?

I definitely agree that unprivileged processes need resource limitations put on them. It would be very cool if these limitations can be retroactively applied, i.e. to change TasksMax and IO/CPU weighting functions.

In fact I've run into that same bt gdb debugging firefox io hell @hadess has. I suppose in some ideal world, my (interactive) actions inform my system "i'm working, give me priority" and if that means gdb is effectively suspended, fine. And then when some trigger happens that indicates I've walked away (like the display goes to powersave), it then lets gdb hog the system of all resources so long as sshd doesn't face plant.

In the web browser reproduce case, what happens with earlyoom or nohang? Does it kill off just a child process of the browser, i.e. a single tab? Is it randomly chosen? Would it kill off the whole browser? Could it kill off some other process entirely, or does it try to kill the parent process whose combined resource consumption is the greatest?

earlyoom and nohang by default select the victim in the same way as the default OOM killer does: the victim is the process with the highest oom_score. The default behavior can be modified.
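That heuristic can be inspected directly, since every process exposes its current score under /proc. A quick sketch listing the likeliest default victims:

```shell
# List the three processes the default heuristic would pick first:
# highest oom_score wins, the same metric earlyoom reads by default.
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "${p##*/}"
done | sort -rn | head -3
```

Each output line is "score pid", highest score first.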

In the ninja+webkitgtk example, killing off ninja itself would be good

The command-line flag --prefer specifies processes to prefer killing; likewise, --avoid specifies processes to avoid killing. The list of processes is specified by a regular expression. For instance, to avoid having foo and bar be killed:

earlyoom --avoid '^(foo|bar)$'

For earlyoom you can set:

earlyoom --prefer '^(ninja)$'

It adds +300 to ninja's badness; the +300 value is hardcoded in earlyoom.

For nohang it may be more flexible. For example,

@BADNESS_ADJ_RE_NAME 500 /// ^ninja$

or

@BADNESS_ADJ_RE_REALPATH 900 /// ^/usr/bin/ninja$

Prefer chromium tabs (they already have oom_score_adj=300 by default):

leaving the others to continue on for quite some time doing unnecessary work now that the build has failed

Killing a cgroup as a single unit can help you. I plan to implement it in nohang.

By the way, if you always run ninja via systemd-run --user ninja, you can enjoy cgroup-killing right now:

You can customize corrective action in nohang:

@SOFT_ACTION_RE_NAME ^ninja$ /// systemctl kill -s SIGKILL $SERVICE

If the victim's name is ninja, the following command will be the corrective action: systemctl kill -s SIGKILL $SERVICE, where $SERVICE will be replaced by ninja's service unit, and ninja's cgroup will be killed as a single unit. This currently works only with the legacy/mixed cgroup hierarchy; I will fix it to support the unified cgroup hierarchy.

P.S. Seems like systemctl kill doesn't work with services in user.slice. It should work with follow::

In 2019, we do not have a good zram manager: one that can handle errors, use the many zram features such as backing_dev, offer new compression algorithms such as zstd, and offer many options for fine-tuning. I am surprised by this fact.

How do you feel about low memory GUI notifications? For example, if the levels of MemAvailable & SwapFree fall below 20%, the user begins to receive periodic notifications, as they do when the battery is low. This will allow the user to stop in time, and stop opening new browser tabs, to avoid data loss. Should this behavior be enabled on the desktop by default? Argue your position.
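A minimal sketch of the proposed check (the 20% threshold is the figure from the comment above; a desktop implementation would raise a notification, e.g. via notify-send, where this just prints):

```shell
# Warn when MemAvailable drops below 20% of MemTotal.
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
pct=$(( avail * 100 / total ))
if [ "$pct" -lt 20 ]; then
  echo "LOW MEMORY: only ${pct}% of RAM available"
else
  echo "memory OK: ${pct}% of RAM available"
fi
```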

This is why I'm sceptical of swap on ZRAM. There are ways to make sure there's a low water mark for RAM, but none of the service implementations or the upstream generator do this. And it's why I'm slightly more in favor of zswap as a swap thrashing moderator. But in the runaway high memory pressure examples, that is inadequate. It's just a moderator.

How do you feel about low memory GUI notifications?

With my QA tester hat on, I like it. As a user, the first question that comes to mind is "What year is this?" I don't really see memory management as user domain. Sounds like the operating system is confused and falling over, and what am I supposed to do about that?

And if I merge the two perspectives together, I come up with: an unprivileged process just preempted the GUI, that is a fail whale.

I think it would be badass if we had a way to get Fedora users to opt-in to experiments, and then randomly give them things like nohang and earlyoom and oomd and low-memory-monitor. No documentation, no warning, nothing. They just get one of them. As if it were their default installation. And see what blows up, or not, what complaints they have, or not. If they explicitly install something, instead of random, they end up with bias that actually pollutes the data. Just a thought.

kills just one child process, leaving the others to continue on for quite some time doing unnecessary work now that the build has failed

I can add an option to nohang: kill bash (or any other name) sessions (by SID or NSsid) as a single unit, if the victim is part of that group. That is, for example, if the name of the session leader is bash, then the entire session will be killed.

This is why I'm sceptical of swap on ZRAM. There are ways to make sure there's a low water mark for RAM, but none of the service implementations or the upstream generator do this. And it's why I'm slightly more in favor of zswap as a swap thrashing moderator. But in the runaway high memory pressure examples, that is inadequate. It's just a moderator.

You've done performance tests. According to my benchmarks, zswap is much slower than plain swap.

My testing does not indicate a performance difference whether swap is on NVMe or SSD, if the memory pool is the same size. I have no HDD systems to test. The URL you provide involves a hyperspecific workload in a VM, no details of that setup are provided, and responses to the proposal mention that the results don't adequately take the general case into account. If you're experiencing zswap enabled being slower than swap alone, that's unquestionably a bug, and it requires a bug report showing the system details and reproduction steps.

In the worst-case scenario, I was consistently able to get the test system to totally wedge itself with swap on ZRAM; essentially "swap thrashing" becomes CPU bound rather than IO bound, but the system was lost (omitting any of earlyoom, nohang, oomd). In the incidental swap usage cases, whatever differences there are between zswap and swap on ZRAM, I can't say they're noticeable. Time frame is roughly 18 months for zswap and 3 months for swap on ZRAM.