
After installing a 2.6.38 kernel, I've seen the machine unexpectedly pausing when I'm not in front of it (typical scenario: while aptitude installs or upgrades things). It then recovers immediately as soon as I move the mouse.

Tracking /proc/sys/kernel/random/entropy_avail reveals it is constantly below 1000, quickly going down to ~100 in just a few seconds of inactivity.
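For anyone who wants to reproduce this kind of tracking, here is a minimal Python sketch that samples the counter at a fixed interval. The function names (read_entropy, watch) and the default interval are my own choices for illustration, not anything from the original setup.

```python
import time
from pathlib import Path

# Kernel-maintained estimate of the entropy pool, in bits.
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy(path=ENTROPY_PATH):
    """Return the current entropy estimate (in bits) read from `path`."""
    return int(Path(path).read_text().strip())


def watch(interval=1.0, samples=10, path=ENTROPY_PATH):
    """Yield (elapsed_seconds, entropy_bits, delta) for each sample.

    `delta` is the change since the previous sample (0 for the first one).
    """
    start = time.monotonic()
    previous = None
    for _ in range(samples):
        bits = read_entropy(path)
        delta = 0 if previous is None else bits - previous
        yield time.monotonic() - start, bits, delta
        previous = bits
        time.sleep(interval)
```

Looping over watch() and printing each tuple gives a simple log of how fast the pool drains while the machine is idle. Keep in mind that every process you spawn to do the monitoring may itself consume some entropy, so a single long-running script is preferable to calling cat in a loop.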

If you've got a known good and known bad kernel version, you could try to use "git bisect" to track down the change in the kernel source that introduced the behavior. (requires kernel compiling and git skills)

other ideas:

Some systems support hardware random number generation. Could it be
that support for that was compiled into the old kernel but not into the new one?

It could be worth comparing the kernel .config files (and dmesg output) from both versions.
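A plain diff of two .config files tends to be noisy, so here is a rough Python sketch that only reports options whose value differs. It assumes the usual .config layout ("CONFIG_X=value" and "# CONFIG_X is not set"); the helper names parse_config and diff_configs are made up for this example.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value; explicitly unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that differs.

    Options missing from one file are reported as 'absent', since they may
    simply not exist in that kernel version.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changes[name] = (before, after)
    return changes
```

Feed it the contents of the two config files (wherever they live on your system) and look through the result for anything related to random number generation or to the drivers you actually use.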

In fact, I switched from a self-compiled 2.6.36 to Debian-supplied 2.6.38, so a lot of config options changed. HW_RANDOM is explicitly one of the options that did change, although in the opposite direction from what you would have imagined: it was not set in my self-compiled kernel, it is now configured as "M". However, I have no hardware RNG, and the module is not loaded, so I guess this has no influence at all.

I tried deleting /dev/random and replacing it with a link to /dev/urandom that's supposed to provide the same kind of data without ever blocking, but nothing changed. I infer that the entropy-hungry process has other means for getting random numbers than accessing the device file, probably directly through get_random_bytes().

I still haven't found who's using up all this entropy. Of course ssh-ing drains the pool much quicker, which is expected as ssh is supposed to use random data for its crypto stuff. Next thing to try out is to boot in single user mode, and monitor the entropy pool while starting the various services one by one.

In any case, the actual issue is not who uses the random numbers, nor how to refill the pool, but rather why the whole system comes to a stop when the entropy is too low. I would expect a single process to freeze while trying to read /dev/random, not the whole system to become unresponsive. I even tried with a background script that refills the pool every 10 seconds, taking data from pre-built chunks of 2046 random bits:

The single-user-mode approach sounds good; maybe it will lead you to the suspect
user-space process.

If you suspect ssh - how much ssh traffic is there?

(maybe add some iptables LOG rules and check)

If there are lots of suspicious ssh connections, you might consider installing
something like denyhosts, fail2ban, ... (no easy decision on a public server, though)

Does disconnecting the network change the situation?
(maybe some kind of DoS attack?)

If you can rule out a user-space issue (that is, the problem persists in single-user mode with the network disconnected) and you really suspect some buggy kernel driver abusing get_random_bytes, I'd suggest giving the printk approach mentioned above a try.

That means grep for all occurrences of it in the kernel that are relevant for your setup and add some printk's there.

Or play the same game as in userspace: find the modules or compiled-in parts
that make use of get_random_bytes and prevent each module from being loaded
(or the driver from being compiled in) until the problem disappears.
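As a starting point for the grep step, here is a small Python sketch that walks a kernel source tree and lists the files and line numbers where get_random_bytes appears followed by an opening parenthesis. The function name find_callers is hypothetical, and the match is purely textual, so it also picks up the definition and prototypes, not only real call sites.

```python
import os
import re


def find_callers(tree, symbol="get_random_bytes"):
    """Return {relative_path: [line numbers]} for .c/.h files mentioning `symbol(`."""
    # \b keeps longer identifiers that merely end in the symbol from matching.
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\s*\(")
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as source:
                for lineno, line in enumerate(source, start=1):
                    if pattern.search(line):
                        key = os.path.relpath(path, tree)
                        hits.setdefault(key, []).append(lineno)
    return hits
```

Restricting `tree` to the subdirectories of the drivers you actually build (rather than the whole source tree) keeps the list short enough to go through by hand.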

Be careful and apply common sense when excluding drivers, or your system may cease to boot at all ;)

How about a self-compiled 2.6.38 with "make oldconfig" based on the old kernel's config?

Could it be a rootkit?
(no idea how to check or rule that out - it could always be something that isn't
yet detected by rkhunter and friends)

First of all, my approach of writing to /dev/random was pretty silly: data written to the device file does get mixed into the random pool, but no entropy bits are credited for it. The only way for userland to increase the entropy count is to access the device file through ioctl().
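To make the ioctl() route concrete, here is a hedged Python sketch of the RNDADDENTROPY request from <linux/random.h>. The request number 0x40085203 is what _IOW('R', 0x03, int[2]) works out to with the generic ioctl encoding (x86, ARM); treat it as an assumption to verify on other architectures. The helper names pool_info and add_entropy are mine. The call needs root, and crediting entropy for data that is not genuinely unpredictable defeats the purpose of the pool.

```python
import fcntl
import struct

# _IOW('R', 0x03, int[2]) with the generic ioctl encoding (assumption:
# architectures with a different _IOW layout use another value).
RNDADDENTROPY = 0x40085203

# Python's fcntl.ioctl copies a bytes argument into a 1024-byte buffer,
# so the 8-byte header plus the payload must fit in that.
_MAX_PAYLOAD = 1024 - 8


def pool_info(data, entropy_bits):
    """Pack a struct rand_pool_info: entropy_count (bits), buf_size (bytes), buf."""
    if len(data) > _MAX_PAYLOAD:
        raise ValueError("payload too large for a single ioctl call")
    if not 0 <= entropy_bits <= 8 * len(data):
        raise ValueError("cannot credit more bits than the data contains")
    return struct.pack("ii", entropy_bits, len(data)) + data


def add_entropy(data, entropy_bits, device="/dev/random"):
    """Mix `data` into the pool and credit `entropy_bits` bits (requires root)."""
    with open(device, "wb", buffering=0) as dev:
        fcntl.ioctl(dev, RNDADDENTROPY, pool_info(data, entropy_bits))
```

This is essentially what a daemon such as rngd does when it feeds a hardware RNG into the kernel pool; a plain write() to the device skips the crediting step, which is why the refill script had no visible effect on entropy_avail.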

Then, the introduction of ASLR in recent years has boosted the kernel's need for random bits: this means that simply spawning processes is a way to drain the entropy pool.

Finally, closer scrutiny of the kernel sources revealed that, apart from the standard input devices (mouse and keyboard), no other driver used on my system is designed to contribute entropy. For example, the driver for my network card, "via-rhine", does not set the IRQF_SAMPLE_RANDOM flag when registering its interrupt handler (unlike some other network drivers, which do). The same holds for the ATA subsystem. That came as a real surprise to me: isn't disk activity supposed to be a source of entropy?