thomashabets2 2 days ago [-]
Hey, that's me! (suggesting an OOM pardon feature)
It's a funny reply. But what was not funny was the OOM killer killing my screen locker.
Joke all you want, but 22 years later I still stand by that I'd rather get a kernel panic than kill the screen lock.
These days you can do oom score adjusting, which is not as strong as a pardon. I may be taking too much credit, and may misremember the timeline, but I feel like someone took my crappy kernel patch and went "fine, I'll do it the right way", merged that oom score adjusting maybe a year or so later.
> These days you can do oom score adjusting, which is not as strong as a pardon.
Writing -1000 to /proc/<pid>/oom_score_adj will cause the OOM killer not to consider the process at all :)
From the man page proc_pid_oom_score_adj(5)
> The value of oom_score_adj is added to the badness score before it is used to determine which task to kill. Acceptable values range from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). [...]. The lowest possible value, -1000, is equivalent to disabling OOM-killing entirely for that task, since it will always report a badness score of 0.
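A minimal shell sketch of that adjustment, for illustration only. The helper name `set_oom_adj` and the `PROC_ROOT` override are assumptions (the override exists just so the logic can be tried against a scratch directory). On a real system, lowering the value below its current setting normally needs root or CAP_SYS_RESOURCE.

```shell
#!/bin/sh
# Hypothetical helper: range-check a value, then write it to
# <proc>/<pid>/oom_score_adj. PROC_ROOT defaults to the real /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

set_oom_adj() {
    pid="$1"
    adj="$2"
    # Accept only an integer with an optional leading minus sign.
    case "$adj" in
        ''|-|*[!0-9-]*|?*-*)
            echo "not an integer: $adj" >&2
            return 2
            ;;
    esac
    # The kernel accepts -1000 (OOM_SCORE_ADJ_MIN) to 1000 (OOM_SCORE_ADJ_MAX).
    if [ "$adj" -lt -1000 ] || [ "$adj" -gt 1000 ]; then
        echo "out of range (-1000..1000): $adj" >&2
        return 2
    fi
    printf '%s\n' "$adj" > "$PROC_ROOT/$pid/oom_score_adj"
}
```

Something like `set_oom_adj "$(pidof xlock)" -1000`, run as root, would then exempt the locker, assuming a single xlock process.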
creatonez 17 hours ago [-]
The modern desktops seem to have some way to jam themselves if the lock screen fails.
I think this only works because there is top-down integration between the different parts. The compositor knows when it's supposed to be locked. Whereas the old screen lockers were just very aggressive Xorg apps that suffer from "What if two programs did this?" problems (https://devblogs.microsoft.com/oldnewthing/20110310-00/?p=11...)
Muromec 1 days ago [-]
>Joke all you want, but 22 years later I still stand by that I'd rather get a kernel panic than kill the screen lock.
An argument can be made that the kernel should not cover for architectural missteps of the X server and that the X server should be the one to crash when its security-critical component is killed for whatever reason.
thomashabets2 1 days ago [-]
Sure. But that's not where we are.
Also there are other safety and security critical reasons why you'd want to exempt some processes.
Arguably (and it definitely has been argued) the real architectural misstep is the Linux kernel overcommitting by default in the first place.
jkrejcha 1 days ago [-]
It has also created this unfortunate assumption a lot of the time that malloc and friends are (infallible OR crash) and, separately, can sometimes have potentially weird tendencies to force undefined behaviors on otherwise well-defined programs (I think primarily around mmap, although I'm not remembering the details super well).
Agreed though, overcommit is the culprit here. I get why it happened (unfortunate consequences of fork and friends existing as the way to spawn tasks and wanting those to be both performant and not fail in frustrating conditions), but I don't think it was a design that aged particularly well.
I actually like somewhat the notion of how Windows handles these two things
1. For address space reservations, you can reserve address space but in order to touch it you have to commit it. Commits have to be backed by something (RAM, a file, pagefiles if they exist) and if a commit fails, they'll get NULL back from malloc. It allows code to be more correct in the face of low-memory conditions or to try again later (Firefox for example, does this[1] on Windows).
2. Process creation is done with a specific API to create processes. The only problem with this I think is that you have to specify everything at creation time, but you could augment this by creating processes in a stopped state (iirc Linux has to do this anyway to set up some stuff before it can hand over control back to userland) and having the parent send FDs to the child or whatnot. Windows... doesn't do this, it has a couple of kitchen sink APIs for creating processes and setting up stuff like the standard streams... in any case I'm getting off topic.
Don't think there's much about that design that can be changed now though
I still remember following Andries’s “Linux kernel hacker’s hut” course he taught at the Eindhoven University of Technology (TU/e) back in 2010. Every week we’d get an assignment where we had to write exploits for commonly occurring security vulnerabilities (e.g., buffer overflows, bad printf format). It was one of the most enjoyable courses I ever followed. Thanks for that, Andries!
blux 2 days ago [-]
Hey fellow TU/e'er :) I followed his course as well, somewhere around 2004/5. Executing man in the middle attacks, writing buffer overflow exploits. Good memories!
AbbeFaria 1 days ago [-]
Is this course still available? What about the course materials? I know it will be dated but if so can someone pls share the links.
Tried searching for it on google but couldn’t find it.
EdSchouten 1 days ago [-]
It looks like the code of the course was 2WC16. Unfortunately the course material no longer seems to be available online.
hyperpape 2 days ago [-]
I confess, this is very funny and the underlying situation is a bit absurd, but it's unclear what point Brouwer is making by pointing out the absurdity.
There surely is something absurd about having to register specific processes as exempt from the OOM killer. But given that the OOM killer exists, and could kill xlock...how should that be fixed?
kelnos 2 days ago [-]
I think part of it is that the design of screen lockers on X11 is just broken. If the locker crashes (or is killed), then the screen unlocks. Security-wise, it fails open. On Windows and macOS (and Wayland, using the ext-session-lock protocol, coupled with sane compositor policy), that can't happen.
The right way for this to work is for the X server to have an extension that lets a screen locker say "hey, I'm locking the screen now", and the X server should respond to that by pretending that the screen locker client is the only client that exists: no other client gets input or gets to draw. And if the screen locker crashes (or is killed), the X server should just put itself into a permanently-locked state where it will never again send any input to anything, and won't ever draw anything except a blank screen. That's not a desirable situation, of course, but it's better than unlocking the screen.
hyperpape 1 days ago [-]
Admittedly, that's right, and makes sense for that use case. But as others have pointed out, killing the user's web browser while they're using it is equally painful.
ameliaquining 2 days ago [-]
I read him as arguing that overcommit was a mistake. Of course, he doesn't answer any of the obvious follow-up questions, such as, does fork–exec copy all the process's memory and then immediately throw it away, or what. (One could argue that fork–exec was also a mistake, but it long predates Linux, so this doesn't answer the question of how Torvalds should have designed it.)
zinekeller 2 days ago [-]
> does fork–exec copy all the process's memory
NT: Yes? Why not?
(note that this refers to the Windows NT kernel's operation because it had historically a POSIX emulation layer (NT Personalities), not the modern WSL which is just Linux in a Hyper-V)
adgjlsfhk1 2 days ago [-]
because this is what causes Windows to use ~80% more memory than unixes
magicalhippo 2 days ago [-]
Well, in that case it's a good thing I guess. Windows is orders of magnitude better when it comes to memory management on the desktop compared to Linux. Like why would I even want a single process killed by OOM killer? On Windows things just work, or get slow. On Linux it works and then mayhem ensues.
Last year I was writing a reply on a forum in Firefox on Linux when the OOM killer decided to nuke Firefox. Poof gone, mid keystroke. How does anyone think that's acceptable?
This was on a stock Linux distro, nothing special.
ChocolateGod 2 days ago [-]
> Windows is orders of magnitude better when it comes to memory management on the desktop compared to Linux.
The bar is pretty low, but the windows scheduler is aware what the currently focussed app is so it can prioritise not killing it.
On Linux? Not so much.
zinekeller 1 days ago [-]
Actually, it depends on the Windows scheduler settings. On Windows Server, the default is to kill the foreground process (on the assumption that it is just a management app rather than a critical server component).
magicalhippo 1 days ago [-]
In either case, Windows tries a lot of things to avoid killing processes. Which at least in a desktop setting is an infinitely better approach than random beheadings without warning.
adgjlsfhk1 1 days ago [-]
yeah. a lot of the issue with Linux's approach is that until recently, the kernel was the one making the choice, and it doesn't know which processes matter. The part Linux does a lot better is not getting to OOM in the first place (and with the newish compressed RAM stuff it is getting even better)
jkrejcha 1 days ago [-]
Windows doesn't use fork/exec for process creation in any relevant way today
There are Native APIs for implementing fork (needed for the obsolete POSIX subsystem, primarily), but even on the Native API side, processes are usually spawned through NtCreateProcess or RtlCreateUserProcess (though there is a bunch of setup with regards to the Csr APIs for the Win32 CreateProcess[1]).
Processes are usually spawned with CreateProcess. There's no fork in win32.
wahern 1 days ago [-]
> does fork–exec copy all the process's memory and then immediately throw it away, or what
No, you just account for it (commit the charge) in the bookkeeping. If a 1GB process forks, you decrement the amount of free memory by 1GB to ensure other processes don't overcommit such that you won't have 1GB of free memory if and when you actually needed to allocate that memory. If the forked process immediately exits, you just bump the free memory counter back up. This is what Solaris and Windows do.
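The bookkeeping described above (charge the commit on fork, release it on exit, refuse when the limit would be exceeded) can be sketched as a toy ledger. This is not how any kernel implements it; the names `charge` and `uncharge` and the 4096 MB limit are made up for illustration.

```shell
#!/bin/sh
# Toy commit ledger: nothing is copied, only promised memory is counted.
COMMIT_LIMIT_MB=4096   # hypothetical RAM + swap available for commits
COMMITTED_MB=0         # memory promised to processes so far

# Reserve $1 MB, or fail and leave the ledger untouched.
charge() {
    if [ $((COMMITTED_MB + $1)) -gt "$COMMIT_LIMIT_MB" ]; then
        return 1
    fi
    COMMITTED_MB=$((COMMITTED_MB + $1))
}

# Give back $1 MB, e.g. when a forked child exits or execs.
uncharge() {
    COMMITTED_MB=$((COMMITTED_MB - $1))
}

charge 1024 && echo "parent running, committed: ${COMMITTED_MB} MB"
charge 1024 && echo "fork charged, committed: ${COMMITTED_MB} MB"
charge 3000 || echo "3000 MB request refused, committed: ${COMMITTED_MB} MB"
uncharge 1024
echo "child exited, committed: ${COMMITTED_MB} MB"
```

The refused request is the point: the failure surfaces at allocation time, where the caller can handle it, instead of later as an OOM kill.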
But precise accounting of memory is difficult if you didn't design for it in the first place. For example, you have to figure in the memory needed for page structures. (Though I think Linux can do that in particular, bugs notwithstanding.) Last time I checked (5+ years ago) Linux was incapable of such precise accounting across the board, so even if you disabled overcommit the kernel could still find itself in an OOM situation when the time comes to allocate memory it already promised or perform an operation it implicitly or explicitly guaranteed it could complete.
The expectation that Linux overcommits meant many Linux kernel developers didn't design subsystems in a way that the kernel as a whole could provide reliable, guaranteed, precise memory accounting. For example, some filesystems rely on being able to use the OOM killer to free up memory needed for an operation that it can't back out of once it starts because it wasn't written in a way that it could either predetermine or bound its memory requirements, or cleanly back out of an operation it started.
To be fair I'm not sure any of the BSDs can do it either, at least when it comes to fork and CoW. IIRC, nor can macOS, though it will dynamically add swap so you won't get an OOM kill until you run out of disk space.
ameliaquining 1 days ago [-]
Well, Windows doesn't have fork–exec so there's no problem with a 15 GB process spawning a 15 MB subprocess. Whereas doing that on Linux without overcommit requires there to be 15 GB free. vfork and posix_spawn work around this, but lots of existing code doesn't use them, vfork is notoriously hard to use correctly, and posix_spawn doesn't (and doesn't try to) cover all fork–exec use cases.
Precise memory accounting and CoW fork aren't intrinsically antagonistic, and the general ability to clone CoW mappings or similar kernel structures is useful beyond fork, which is why NT had all the necessary facilities in the kernel (it's the userspace CRT state that can be tricky, especially in the presence of threads, which is true on Unix systems as well).
The example of forking a process with a giant VM space just to exec some other program is, IMO, a straw man. Processes with such huge RW mappings typically don't fork and exec like that. Nobody architecting an app like PostgreSQL was relying on the ability to easily fork processes for minor tasks or exec utilities from processes already forked for resource intensive tasks. And when such a thing is desirable, it's easy enough to use the alternatives, like vfork, or architect a controller for spawning subprocesses, or just use threads. Heck, fork existed long before CoW. Expectations around fork, that you can and should be able to call it without any forethought about resource management was a consequence of Linux' popularity.
Linux embraced overcommit because people wanted to run existing big iron applications like networked databases on tiny PCs with fractions of the memory those applications were written to expect to be able to use. Overcommit was a hack that let you play around with those applications without them immediately falling over, partly because back then such applications often preallocated memory for cache, etc, but would never use all of it when running in an environment like early Linux, which would never see the same high loads and utilization as big iron servers.
Linux could have pivoted in the other direction and pursued strict memory accounting with the ability to expressly overcommit in, e.g., some process subtrees or dynamically allocate swap (which in the expected scenario it normally wouldn't have to actually do). But like most userspace developers they found it easier to write kernel code when they could pretend memory was infinite, and when the system hit the wall just blow up and blame the user. That choice can be defensible for userspace, but it's simply not defensible for a kernel.
jkrejcha 22 hours ago [-]
To be 100% fair, it's rare that processes are cloned on Windows, if only because it's part of the Native API that applications generally don't use directly, and CreateProcess is easier and does all the housekeeping stuff, etc, that people writing Windows applications generally come to expect (or don't even know happens)
I do think overcommit was a poor design choice, but I think it probably mostly does logically follow from the fact that fork and friends are the only ways available to create a process that's available to userspace. It's quite unfortunate though.
Part of the problem is that some applications wanted to reserve lots of address space but didn't necessarily want to touch it right away (such as when they were using it sparsely). Something that VirtualAlloc(x, MEM_RESERVE) (or mmap(..., MAP_NORESERVE)) would be suited for. But while malloc exists, mreserve doesn't in libc, and I think it was pretty uncommon to use it.
silon42 2 days ago [-]
Fork should be replaced by vfork (or something better) in almost all situations.
dooglius 2 days ago [-]
The point is that the OOM killer shouldn't exist and arguing about how to tweak it is addressing the wrong problem
hackyhacky 2 days ago [-]
I agree that that's the point he's making, but I don't see how that would work practically. His attitude is that malloc(1<<63) should immediately crash the system, every time? How is that better?
cpgxiii 2 days ago [-]
No, if a process allocates an infeasible amount, malloc fails and the process needs to deal with the failure (which is what already happens, "malloc doesn't fail on Linux" is only really true for smaller-than-page-size allocations). The point being made is that the system should account conservatively for all memory that can be used, not just the optimistic underestimate that overcommit enables (i.e. the plane should always carry enough fuel for contingencies, and landing with extra fuel is a good outcome).
StilesCrisis 1 days ago [-]
You never need to crash the system if you remove overcommit. You just crash the one process. Practically speaking, you don't even need to crash here; you just return null (which malloc is always free to do) and let the consequences speak for themselves.
jkrejcha 22 hours ago [-]
malloc can just return NULL (in specific, mmap returns -ENOMEM and your libc translates that). Applications need to check for success anyway
hyperpape 2 days ago [-]
But the second clause doesn't follow from the first!
I don't think Linux was plausibly going to remove the OOM killer in 2004 or later. So the right solution for Linux is very much to tweak it to be less painful.
sankhao 2 days ago [-]
I also think the analogy doesn't work. In the plane situation it seems obvious that the luggage should be ejected before passengers, which is what the guy was asking ?
fragmede 2 days ago [-]
The analogy doesn't work because you can't call fork() on the plane and then it duplicates just the seat for the passenger or pilot that did something different. Also, killing them is rather ghastly.
rwmj 2 days ago [-]
It's 2026 and I still can't configure the OOM killer to kill firefox before anything else.
3r7j6qzi9jvnve 2 days ago [-]
If it helps, I run ff in systemd-run with memory limits set -- that's usually enough to avoid the problem in the first place (ff does freeze when loading google spreadsheets or whatever heavy UI, so I also have a script to adjust /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/ff-*.scope/memory.max and memory.high at runtime... I should publish my $bindir someday)
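A rough sketch of what such a runtime adjuster could look like, not the poster's actual script. It assumes the scope was started with something like `systemd-run --user --scope --unit=ff-$$ -p MemoryHigh=3G -p MemoryMax=4G firefox`, so that it shows up as `ff-*.scope` under the path quoted above. The function name `ff_set_limits` and the `CGROUP_ROOT` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite memory.high and memory.max (cgroup v2) for
# every ff-*.scope under the user's app.slice.
# CGROUP_ROOT defaults to the real mount point; override it for dry runs.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

ff_set_limits() {
    high="$1"              # bytes; above this the cgroup is throttled and reclaimed
    max="$2"               # bytes; hard limit, exceeding it OOM-kills inside the cgroup
    uid="${3:-$(id -u)}"   # defaults to the calling user
    base="$CGROUP_ROOT/user.slice/user-$uid.slice/user@$uid.service/app.slice"
    status=1               # stays 1 if no matching scope was found
    for scope in "$base"/ff-*.scope; do
        [ -d "$scope" ] || continue   # unmatched glob stays literal; skip it
        printf '%s\n' "$high" > "$scope/memory.high"
        printf '%s\n' "$max" > "$scope/memory.max"
        status=0
    done
    return "$status"
}
```

For example, `ff_set_limits 3221225472 4294967296` would set a 3 GiB soft limit and a 4 GiB hard limit on every matching scope owned by the caller.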
So, in actuality, I think your assertion just taught us all something, because despite knowing that the OOM killer and that the Magic SysRq key[1] exists, I didn't know you could configure this as an input!
I'm aware of it, but it's awkward to use in practice. You have to track down all the FF processes, each time you run it, and adjust all their scores.
nick__m 2 days ago [-]
You could launch it as a systemd user target with OOMScoreAdjust=500 in the service section; weird and unconventional but wrapped in .desktop file it doesn't appear to be unwieldy.
bellowsgulch 2 days ago [-]
Ah. Yes, that is awkward. Well, nonetheless, you taught me a new feature. Thanks!
loeg 2 days ago [-]
Maybe firefox could self-adjust, as a policy?
pavon 1 days ago [-]
It looks like it does, which depending on your goal is either helpful or part of the problem. By default processes should inherit their parent's oom_score_adj. If I exit out of firefox completely, then start it up (with no saved tabs), this is the behavior I see:
$ firefox-esr& PID=$!; choom -p $PID -n 42
[1] 105360
pid 105360's OOM score adjust value changed from 0 to 42
$ for p in $(ps --ppid $PID -opid --no-headers $PID); do printf "%3d" $(</proc/$p/oom_score_adj); ps -opid,comm --no-headers $p; done
0 105360 firefox-esr
0 105425 Socket Process
167 105451 Privileged Cont
0 105456 RDD Process
100 105495 WebExtensions
0 105524 Utility Process
233 105534 Web Content
233 105542 Web Content
233 105549 Web Content
See how each firefox process has a different oom_score_adj with Web Content being more likely to be killed than other processes (233), and none of them have the value that the process was started with (42). This is Firefox 140.11 ESR running on Debian 13.
loeg 1 days ago [-]
Nice!
rwmj 2 days ago [-]
Yes this would be nice. Or maybe the OOM system would have two other files, /sys/oom/kill_first and /sys/oom/kill_never which would solve the problem more directly for the majority of cases.
I should really send a patch rather than complaining ...
jitl 2 days ago [-]
sounds like a job for a program
po1nt 2 days ago [-]
It would be nice to have a signal as a warning to a process to reduce its memory footprint or else the OOM killer will kill it.
Joker_vD 2 days ago [-]
You still need some way to make the kernel to send those signals to the processes of your choosing. If the kernel decides to send SIGLOWMEM to xlock instead of firefox, the xlock will get killed because it really doesn't have any memory it can give up.
FreeBSD has a "protect" command which does something similar to what this asks for – the man page [1] describes it:
"The protect command is used to mark processes as protected. The kernel does not kill protected processes when swap space is exhausted. [...] If you protect a runaway process that allocates all memory the system will deadlock."
I never pay for the OOF insurance, it seems like a waste of money and I've never met anyone that's had it happen.
keyle 2 days ago [-]
It can only happen once anyway, and I fly weekly!
nemothekid 2 days ago [-]
While I have had my time fighting the OOM killer, I believe overcommit would have always won. To torture the metaphor a bit more, airlines have OOF mechanism - they just eject the overcommitted passengers before the plane takes off.
A passenger buying a ticket is malloc(), but passengers don't always utilize the seat (use the memory). Normally this works out fine, but occasionally, there are too many passengers. Thankfully though instead of executing a couple passengers they give you a voucher.
jkrejcha 22 hours ago [-]
I've mentioned this elsewhere in the thread, but I think it's a difference of view on what malloc represents. Operating systems do have "reserve this part of the address space" APIs and these reservations don't get charged against your commit because you're simply reserving the space, not committing to using it, and so the operating system doesn't need to back it with anything.
In this worldview, malloc is like me buying a plane ticket at the counter for a specific flight that's going to leave soon. I'd be really annoyed if I were bumped off a flight I just paid for (and would've rather been told "that flight is full, try again later" (malloc returns NULL)). This is, for example what Windows does. Under memory pressure, it'll say to applications, "hey no I'm not in a giving mood for memory right now" (and will sometimes bump the size of the pagefile if configured to do this, but only up to a point).
The thought behind this is that well... applications have to handle malloc returning NULL anyway. Whether that's calling abort and giving up is one matter, another might be to retry the allocation at a later time (maybe after Windows has bumped the pagefile size), another might be to handle an error using some preallocated buffer or whatever.
sedatk 2 days ago [-]
I’d say, let the one who tried to allocate memory crash, and if you’re a critical process like xlock, use statically allocated memory and don’t alloc again.
LoganDark 2 days ago [-]
This is only a viable answer when overcommit is disabled. The problem comes when overcommit is enabled and you find yourself in a position where many programs think they already have memory and yet there is none to give them. If you simply kill the first piece of code that encounters the end of available memory you might take down anything including the kernel itself.
Nothing like statically allocating memory can work when overcommit is enabled because the kernel is free to compress memory, page it out and etc. and then murder you the next time you try to perform any operation that it doesn't have the space for, no matter how safe and static your initialization was.
Note that overcommit is very useful in many cases including the ones where swap saves the stability of the system under conditions that would otherwise completely lock up or panic, so it's also not viable to just prevent it from being used.
SoftTalker 2 days ago [-]
OOM killer always felt like a band-aid on a severed artery to me. I've rarely seen a machine that got into OOM state really recover without a full reboot.
sph 2 days ago [-]
Why would a system break if you SIGKILL a process?
I’ve seen plenty of server logs with the OOM killer killing mariadb processes, which were then restarted automatically by systemd, often with no one noticing until days later.
The thing that bogs down systems and often makes them unrecoverable is when a memory hungry process starts swapping. Good luck trying to SSH in. Swap is such a silly idea on servers - good to deal with pages no one accesses, catastrophic when you’re out of RAM and memory latencies suddenly become 4 or 5 orders of magnitude slower.
sedatk 2 days ago [-]
I’m not against taking down the kernel if the situation is that catastrophic. Better than killing the lock screen for sure.
LoganDark 2 days ago [-]
IMO if the security of a system depends on the lock screen not crashing then the system is not very secure. Security protocols should never fail open like that; a lock screen should never simply be a layer on top of the authenticated desktop. Windows and macOS get this right. I believe Wayland display managers are also able to get this right (but I haven't checked).
eqvinox 2 days ago [-]
I don't know why X11 didn't just add an extension that a client can enable saying "if this client exits unexpectedly/uncleanly [without disabling the extension], just kill the X11 session".
yjftsjthsd-h 2 days ago [-]
Yes, Wayland should fix this. Granted, then you have a locked screen that the user may or may not be able to unlock, which is awkward if better.
account42 1 days ago [-]
Depending on the implementation and exactly which component crashed, you may still unlock the session from the console in a different VT.
LoganDark 2 days ago [-]
Wayland the protocol already fixes this -- there's nothing that exactly requires a display manager to not have a completely separate desktop for the unauthenticated state, where a trusted application (or the display manager itself) can accept credentials in order to authorize a transition to the authenticated state, and where a crash of the trusted application or lock screen does not result in access to the authenticated state. I just dunno if anyone does that yet. I'm sure somebody must have...
> Granted, then you have a locked screen that the user may or may not be able to unlock, which is awkward if better.
The most secure system is one that cannot be accessed, technically. In some cases it's better not to let anybody in than to let an attacker in (technically). Of course, this is frustrating for the user.
yjftsjthsd-h 1 days ago [-]
> The most secure system is one that cannot be accessed, technically.
No, security includes Confidentiality, Integrity, and Availability; a lockscreen DoS is a problem
LoganDark 1 days ago [-]
Yes, a DoS is a problem, but it doesn't let an attacker in. Like, if an employee of a company can't get through their lock screen to access a confidential shared server, that is far less bad than an attacker downloading the entire server and leaking it online. But yes, of course, if suddenly no employees could get through their lock screens, that would still be quite bad -- but it only takes one attacker getting in to cause damage.
josefx 2 days ago [-]
Shouldn't desktop environments detect if a lock screen terminated abnormally anyway? The OOM killer is just one of many possible causes.
Retr0id 2 days ago [-]
Statically allocated memory can still OOM on access, due to overcommit and lazy page table population. What you really want is mlockall(2) (probably with MCL_CURRENT|MCL_ONFAULT followed by madvise with MADV_POPULATE_*)
Retr0id 2 days ago [-]
oops MCL_ONFAULT kinda does the opposite of what I wanted - I think if you omit that you can skip the madvise, and mlockall will populate everything for you.
feelamee 2 days ago [-]
> if you’re a critical process like xlock, use statically allocated memory and don’t alloc again.
This doesn't save you if someone other allocates and OOM killer chooses you as victim
hkolk 2 days ago [-]
What is proposed is to not have an OOM killer with a selection process, meaning that the "someone other allocates" would be the one dying.
tux3 2 days ago [-]
The problem is that Linux has memory overcommit and it will OOM when a process faults a page in, not just when someone allocates memory.
So the OOM condition can hit any random process, not necessarily one that just tried to allocate. If you don't have some sort of selection, then you would still have an OOM killer, only it will be killing completely at random.
muvlon 2 days ago [-]
That's true, but critical processes could mlockall() after setup, so their stuff never needs paging in.
sedatk 2 days ago [-]
Yes, don’t have OOM roulette.
silon42 2 days ago [-]
At least for processes that don't overcommit...
amluto 2 days ago [-]
The fact that xlock crashing unlocks an X11 session is, IMO, pathetic.
gjvc 2 days ago [-]
[flagged]
silon42 2 days ago [-]
At least the session manager should kill everything if xlock dies.
lokar 2 days ago [-]
I know this is not a popular / mainstream position, but I managed a very large fleet of systems this way:
- no system swap
- enough memory for core system services set aside in a cgroup for them to use
- by default, all prod service binaries load all code pages into ram at start, and lock them in (no paging out code pages at runtime)
- if needed (rare) services can mount some swap in their own cgroup, but very much discouraged
You need to know how much ram you are going to use, and actually stick to that. Very little is wasted in practice, and you don't have to deal with OOMs all the time. Everything is much more predictable.
tosti 2 days ago [-]
Have you disabled swap in the kconfig entirely?
If not, is your vm.swappiness 0? How do you deal with overcommit? Did you replace malloc with a more strict implementation?
lloeki 2 days ago [-]
> How do you deal with overcommit
echo 2 > /proc/sys/vm/overcommit_memory
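For context, a small sketch that decodes the current mode and reports how much commit headroom is left. `overcommit_memory`, `CommitLimit` and `Committed_AS` are real kernel interfaces; the function names and the `PROC_ROOT` override are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting overcommit state.
# PROC_ROOT defaults to the real /proc; override it for dry runs.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print the current vm.overcommit_memory mode with a short description.
overcommit_mode() {
    mode=$(cat "$PROC_ROOT/sys/vm/overcommit_memory") || return 1
    case "$mode" in
        0) echo "0: heuristic overcommit (kernel default)" ;;
        1) echo "1: always overcommit, never refuse" ;;
        2) echo "2: strict accounting, allocations fail past CommitLimit" ;;
        *) echo "$mode: unknown mode"; return 1 ;;
    esac
}

# Print CommitLimit minus Committed_AS in kB, read from meminfo.
# The limit is only enforced in mode 2.
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END              { print limit - used }' "$PROC_ROOT/meminfo"
}
```

In mode 2 the limit is derived from swap plus a share of RAM set by `vm.overcommit_ratio` (or `vm.overcommit_kbytes`), so those usually need tuning along with the mode itself.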
lokar 1 days ago [-]
No swap device
xyzzy_plugh 2 days ago [-]
I agree with your perspective. I certainly agree that swap can be invaluable at times, and is generally a mistake for your run-of-the-mill production services.
It's a nice approach particularly because all OOMs become actionable: there's a bug in a service or a limit is wrong or traffic is changing in an unexpected way.
Systems built this way end up being extremely reliable in my experience.
It's an uphill battle both ways though and not everyone is up for that experience.
mad_vill 2 days ago [-]
Happy to see this trending, I probably share this in my company's slack once a month.
cwillu 2 days ago [-]
(2004)
jml7c5 2 days ago [-]
Thanks. I was confused for a bit, given these days you can do
There's also /proc/sys/vm/panic_on_oom and /proc/sys/vm/oom_kill_allocating_task for other behaviours suggested in the comments.
bastawhiz 2 days ago [-]
Especially in an era where RAM is so expensive, the obvious answer is to simply never use memory. If your data can't fit in the plethora of CPU registers at your disposal, your software is probably too complicated. /s
Here's an LWN article about it, too: https://lwn.net/Articles/104179/
For example, KDE: https://preview.redd.it/plasma-lock-screen-messed-up-v0-zx7h...
GNOME: https://forums.freebsd.org/attachments/index-jpeg.8571/
I think this only works because there is top-down integration between the different parts. The compositor knows when it's supposed to be locked. Whereas the old screen lockers were just very aggressive Xorg apps that suffer from "What if two programs did this?" problems (https://devblogs.microsoft.com/oldnewthing/20110310-00/?p=11...)
An argument can be made that the kernel should not cover for architectural missteps of the X server and that the X server should be the one to crash when its security-critical component was killed for whatever reason.
Also there are other safety and security critical reasons why you'd want to exempt some processes.
Arguably (and it definitely has been argued) the real architectural misstep is the Linux kernel overcommitting by default in the first place.
Agreed though, overcommit is the culprit here. I get why it happened (unfortunate consequences of fork and friends existing as the way to spawn tasks and wanting those to be both performant and not fail in frustrating conditions), but I don't think it was a design that aged particularly well.
I actually somewhat like how Windows handles these two things:
1. For address space reservations, you can reserve address space but in order to touch it you have to commit it. Commits have to be backed by something (RAM, a file, pagefiles if they exist) and if a commit fails, they'll get NULL back from malloc. It allows code to be more correct in the face of low-memory conditions or to try again later (Firefox for example, does this[1] on Windows).
2. Process creation is done with a specific API to create processes. The only problem with this I think is that you have to specify everything at creation time, but you could augment this by creating processes in a stopped state (iirc Linux has to do this anyway to set up some stuff before it can hand over control back to userland) and having the parent send FDs to the child or whatnot. Windows... doesn't do this, it has a couple of kitchen sink APIs for creating processes and setting up stuff like the standard streams... in any case I'm getting off topic.
Don't think there's much about that design that can be changed now though
[1]: https://hacks.mozilla.org/2022/11/improving-firefox-stabilit...
There surely is something absurd about having to register specific processes as exempt from the OOM killer. But given that the OOM killer exists, and could kill xlock...how should that be fixed?
The right way for this to work is for the X server to have an extension that lets a screen locker say "hey, I'm locking the screen now", and the X server should respond to that by pretending that the screen locker client is the only client that exists: no other client gets input or gets to draw. And if the screen locker crashes (or is killed), the X server should just put itself into a permanently-locked state where it will never again send any input to anything, and won't ever draw anything except a blank screen. That's not a desirable situation, of course, but it's better than unlocking the screen.
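The fail-closed behaviour described above can be modelled as a tiny state machine. This is only a toy sketch: `DisplayServer`, `ScreenState` and the method names are invented for illustration and do not correspond to any real X extension.

```python
from enum import Enum


class ScreenState(Enum):
    UNLOCKED = "unlocked"
    LOCKED = "locked"            # the locker is the only client that "exists"
    DEAD_LOCKED = "dead-locked"  # locker died while locked: fail closed


class DisplayServer:
    """Toy model of a server that never unlocks because a locker died."""

    def __init__(self):
        self.state = ScreenState.UNLOCKED
        self.locker = None

    def lock(self, client):
        if self.state is ScreenState.UNLOCKED:
            self.state = ScreenState.LOCKED
            self.locker = client

    def unlock(self, client):
        # Only the live locker that took the lock may release it.
        if self.state is ScreenState.LOCKED and client == self.locker:
            self.state = ScreenState.UNLOCKED
            self.locker = None

    def client_died(self, client):
        # Losing the locker (crash, OOM kill) moves to a terminal state.
        if self.state is ScreenState.LOCKED and client == self.locker:
            self.state = ScreenState.DEAD_LOCKED
            self.locker = None

    def may_receive_input(self, client):
        if self.state is ScreenState.UNLOCKED:
            return True
        return self.state is ScreenState.LOCKED and client == self.locker
```

The point of the design is that there is no transition out of `DEAD_LOCKED`: killing the locker can make the session unusable, but it can never expose it.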
NT: Yes? Why not?
(note that this refers to the Windows NT kernel's operation, because it historically had a POSIX emulation layer (NT Personalities), not to the modern WSL, which is just Linux running under Hyper-V)
Last year I was writing a reply on a forum in Firefox on Linux when the OOM killer decided to nuke Firefox. Poof gone, mid keystroke. How does anyone think that's acceptable?
This was on a stock Linux distro, nothing special.
The bar is pretty low, but the Windows scheduler is aware of which app is currently focussed, so it can prioritise not killing it.
On Linux? Not so much.
There are Native APIs for implementing fork (needed primarily for the obsolete POSIX subsystem), but even on the Native API side, processes are usually spawned through NtCreateProcess or RtlCreateUserProcess, though there is a bunch of setup involving the Csr APIs for the Win32 CreateProcess[1].
[1]: https://stackoverflow.com/a/69605729/2805120
No, you just account for it (commit the charge) in the bookkeeping. If a 1GB process forks, you decrement the amount of free memory by 1GB to ensure other processes don't overcommit such that you won't have 1GB of free memory if and when you actually needed to allocate that memory. If the forked process immediately exits, you just bump the free memory counter back up. This is what Solaris and Windows do.
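To make the bookkeeping concrete, here is a toy ledger, assuming a single global pool and ignoring page structures, swap and sharing. `CommitLedger` and its methods are made up for illustration; this is not how any particular kernel implements it.

```python
class CommitLedger:
    """Toy strict-commit accounting: a promise is only made if it can be kept."""

    def __init__(self, total_bytes):
        self.total = total_bytes
        self.committed = 0
        self.charges = {}  # pid -> bytes charged to that process

    @property
    def free(self):
        return self.total - self.committed

    def charge(self, pid, nbytes):
        """Reserve nbytes for pid; refuse up front instead of overcommitting."""
        if nbytes > self.free:
            return False
        self.committed += nbytes
        self.charges[pid] = self.charges.get(pid, 0) + nbytes
        return True

    def fork(self, parent, child):
        """Charge the child for the parent's writable footprint, CoW or not."""
        return self.charge(child, self.charges.get(parent, 0))

    def exit(self, pid):
        """Give the process's whole charge back to the free pool."""
        self.committed -= self.charges.pop(pid, 0)
```

With a 3 GiB pool, a 1 GiB process can fork twice, the third fork is refused immediately, and a child that exits right away simply bumps the free counter back up. Nothing was ever copied; only the counter moved.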
But precise accounting of memory is difficult if you didn't design for it in the first place. For example, you have to figure in the memory needed for page structures. (Though I think Linux can do that in particular, bugs notwithstanding.) Last time I checked (5+ years ago) Linux was incapable of such precise accounting across the board, so even if you disabled overcommit the kernel could still find itself in an OOM situation when the time comes to allocate memory it already promised or perform an operation it implicitly or explicitly guaranteed it could complete.
The expectation that Linux overcommits meant many Linux kernel developers didn't design subsystems in a way that the kernel as a whole could provide reliable, guaranteed, precise memory accounting. For example, some filesystems rely on being able to use the OOM killer to free up memory needed for an operation that it can't back out of once it starts because it wasn't written in a way that it could either predetermine or bound its memory requirements, or cleanly back out of an operation it started.
To be fair I'm not sure any of the BSDs can do it either, at least when it comes to fork and CoW. IIRC, nor can macOS, though it will dynamically add swap so you won't get an OOM kill until you run out of disk space.
Precise memory accounting and CoW fork aren't intrinsically antagonistic, and the general ability to clone CoW mappings or similar kernel structures is useful beyond fork, which is why NT had all the necessary facilities in the kernel (it's the userspace CRT state that can be tricky, especially in the presence of threads, which is true on Unix systems as well).
The example of forking a process with a giant VM space just to exec some other program is, IMO, a straw man. Processes with such huge RW mappings typically don't fork and exec like that. Nobody architecting an app like PostgreSQL was relying on the ability to easily fork processes for minor tasks or exec utilities from processes already forked for resource intensive tasks. And when such a thing is desirable, it's easy enough to use the alternatives, like vfork, or architect a controller for spawning subprocesses, or just use threads. Heck, fork existed long before CoW. The expectation that you can and should be able to call fork without any forethought about resource management was a consequence of Linux's popularity.
Linux embraced overcommit because people wanted to run existing big iron applications like networked databases on tiny PCs with fractions of the memory those applications were written to expect to be able to use. Overcommit was a hack that let you play around with those applications without them immediately falling over, partly because back then such applications often preallocated memory for cache, etc, but would never use all of it when running in an environment like early Linux, which would never see the same high loads and utilization as big iron servers.
Linux could have pivoted in the other direction and pursued strict memory accounting with the ability to expressly overcommit in, e.g., some process subtrees or dynamically allocate swap (which in the expected scenario it normally wouldn't have to actually do). But like most userspace developers they found it easier to write kernel code when they could pretend memory was infinite, and when the system hit the wall just blow up and blame the user. That choice can be defensible for userspace, but it's simply not defensible for a kernel.
I do think overcommit was a poor design choice, but I think it probably mostly does logically follow from the fact that fork and friends are the only ways of creating a process that are available to userspace. It's quite unfortunate though.
Part of the problem is that some applications wanted to reserve lots of address space but didn't necessarily want to touch it right away (such as when they were using it sparsely). Something that VirtualAlloc(x, MEM_RESERVE) (or mmap(..., MAP_NORESERVE)) would be suited for. But while malloc exists, mreserve doesn't in libc, and I think it was pretty uncommon to use it.
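A small sketch of the "reserve a lot, touch a little" pattern using Python's `mmap` module on Linux. Big assumption: the fallback value `0x4000` for `MAP_NORESERVE` matches common Linux architectures such as x86 and arm but is not universal, and `reserve_sparse` is a made-up helper name.

```python
import mmap
import sys

# Assumption: use the module's constant when Python exposes it; otherwise
# fall back to 0x4000, which is MAP_NORESERVE on common Linux architectures.
# On other platforms the flag is simply left out.
_FALLBACK = 0x4000 if sys.platform.startswith("linux") else 0
MAP_NORESERVE = getattr(mmap, "MAP_NORESERVE", _FALLBACK)


def reserve_sparse(size):
    """Map `size` bytes of anonymous memory without reserving swap for it.

    Passing -1 as the file descriptor asks for an anonymous mapping; pages
    only get backed by real memory once they are touched.
    """
    return mmap.mmap(
        -1,
        size,
        flags=mmap.MAP_PRIVATE | MAP_NORESERVE,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )


if __name__ == "__main__":
    region = reserve_sparse(64 << 20)  # 64 MiB of address space
    region[4096 * 100] = 7             # touch a single page
    print(len(region), region[4096 * 100])
    region.close()
```

Unlike the Windows reserve/commit split, touching a page here can still fail late (via the OOM killer) rather than at a well-defined commit call, which is the complaint in this subthread.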
I don't think Linux was plausibly going to remove the OOM killer in 2004 or later. So the right solution for Linux is very much to tweak it to be less painful.
So, in actuality, I think your assertion just taught us all something, because despite knowing that the OOM killer and the Magic SysRq key[1] exist, I didn't know you could configure this as an input!
[1]: https://en.wikipedia.org/wiki/Magic_SysRq_key
I should really send a patch rather than complaining ...
cgroups v1 has a pretty nice API but it requires root. V2 does not require root but it’s a lot coarser and not as simple or reliable: https://unix.stackexchange.com/questions/753929/receive-a-me...
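For the v2 side, one low-tech option is to poll the cgroup's `memory.events` file and watch the `oom_kill` counter. A minimal sketch follows; the helper names are invented, the cgroup directory is whatever your system uses, and this is not necessarily the approach the linked question describes.

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse cgroup v2 memory.events text ("key value" per line) into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_kills(cgroup_dir):
    """Return the oom_kill counter for the cgroup at cgroup_dir (0 if absent)."""
    text = (Path(cgroup_dir) / "memory.events").read_text()
    return parse_memory_events(text).get("oom_kill", 0)
```

Comparing two successive readings of `oom_kills(...)` tells you whether something in that cgroup was killed in between, which is coarser than a per-event notification but needs nothing beyond read access to the file.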
https://github.com/facebookincubator/oomd/
https://github.com/rfjakob/earlyoom
"The protect command is used to mark processes as protected. The kernel does not kill protected processes when swap space is exhausted. [...] If you protect a runaway process that allocates all memory the system will deadlock."
[1] https://man.freebsd.org/cgi/man.cgi?query=protect&apropos=0&...
A passenger buying a ticket is malloc(), but passengers don't always utilize the seat (use the memory). Normally this works out fine, but occasionally, there are too many passengers. Thankfully though instead of executing a couple passengers they give you a voucher.
In this worldview, malloc is like me buying a plane ticket at the counter for a specific flight that's going to leave soon. I'd be really annoyed if I were bumped off a flight I just paid for (and would've rather been told "that flight is full, try again later" (malloc returns NULL)). This is, for example, what Windows does. Under memory pressure, it'll say to applications, "hey no I'm not in a giving mood for memory right now" (and will sometimes bump the size of the pagefile if configured to do this, but only up to a point).
The thought behind this is that well... applications have to handle malloc returning NULL anyway. Whether that's calling abort and giving up is one matter, another might be to retry the allocation at a later time (maybe after Windows has bumped the pagefile size), another might be to handle an error using some preallocated buffer or whatever.
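Two of those strategies (retry later, then fall back to something preallocated) can be sketched like this. Python's `MemoryError` stands in for malloc returning NULL, and `allocate_with_fallback` with all its parameters is a hypothetical helper, not an existing API.

```python
import time


def allocate_with_fallback(allocate, retries=3, delay=0.0, reserve=None):
    """Call allocate(); on MemoryError wait and retry, then use `reserve`.

    allocate: zero-argument callable that returns a buffer or raises
              MemoryError (the stand-in for a NULL return).
    reserve:  optional preallocated buffer used once every retry has failed.
    """
    for attempt in range(retries):
        try:
            return allocate()
        except MemoryError:
            if attempt + 1 < retries:
                time.sleep(delay)  # give the system a chance to free memory
    if reserve is not None:
        return reserve
    raise MemoryError(f"allocation failed after {retries} attempts")
```

This only helps when the failure is reported at allocation time. With overcommit enabled the allocation tends to succeed and the failure shows up later as an OOM kill, which no amount of careful error handling can catch.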
Nothing like statically allocating memory can work when overcommit is enabled, because the kernel is free to compress memory, page it out, etc., and then murder you the next time you try to perform any operation that it doesn't have the space for, no matter how safe and static your initialization was.
Note that overcommit is very useful in many cases including the ones where swap saves the stability of the system under conditions that would otherwise completely lock up or panic, so it's also not viable to just prevent it from being used.
I’ve seen plenty of server logs with the OOM killer killing mariadb processes, which were then restarted automatically by systemd, often with no one noticing until days later.
The thing that bogs down systems and often makes them unrecoverable is when a memory hungry process starts swapping. Good luck trying to SSH in. Swap is such a silly idea on servers - good to deal with pages no one accesses, catastrophic when you’re out of RAM and memory latencies suddenly become 4 or 5 orders of magnitude slower.
> Granted, then you have a locked screen that the user may or may not be able to unlock, which is awkward if better.
The most secure system is one that cannot be accessed, technically. In some cases it's better not to let anybody in than to let an attacker in (technically). Of course, this is frustrating for the user.
No, security includes Confidentiality, Integrity, and Availability; a lockscreen DoS is a problem
This doesn't save you if someone else allocates and the OOM killer chooses you as the victim
So the OOM condition can hit any random process, not necessarily one that just tried to allocate. If you don't have some sort of selection, then you would still have an OOM killer, only it will be killing completely at random.
- no system swap
- enough memory for core system services set aside in a cgroup for them to use
- by default, all prod service binaries load all code pages into ram at start, and lock them in (no paging out code pages at runtime)
- if needed (rare) services can mount some swap in their own cgroup, but very much discouraged
You need to know how much ram you are going to use, and actually stick to that. Very little is wasted in practice, and you don't have to deal with OOMs all the time. Everything is much more predictable.
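As a small sanity check for the "no system swap" part of a setup like this, one could parse `/proc/meminfo` and look at `SwapTotal`. This is just a sketch with made-up helper names, and it says nothing about the cgroup reservations or page locking.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: first numeric value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def swap_is_disabled(meminfo_text):
    """True when SwapTotal is 0 (or the field is missing altogether)."""
    return parse_meminfo(meminfo_text).get("SwapTotal", 0) == 0


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print("swap disabled:", swap_is_disabled(f.read()))
    except OSError:
        print("no /proc/meminfo on this system")
```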
If not, is your vm.swappiness 0? How do you deal with overcommit? Did you replace malloc with a stricter implementation?
It's a nice approach particularly because all OOMs become actionable: there's a bug in a service or a limit is wrong or traffic is changing in an unexpected way.
Systems built this way end up being extremely reliable in my experience.
It's an uphill battle both ways though and not everyone is up for that experience.
https://github.com/torvalds/linux/blob/master/include/uapi/l...