@campisano: you don't happen to be using New Relic, are you? These freezes (3 in the last month) all started after we began using New Relic on this machine; not sure if that's just coincidence.

Your log does not contain the message "Pushing vcls failed:#012CLI". The message "Child (XXXX) not responding to CLI, killing it." could be common; I read somewhere that it can be a timeout issue with your backend service (e.g. Apache): if the Apache response time is greater than the maximum timeout of the Varnish worker, the Varnish controller can kill the worker and abort the request.

You can increase the default timeout in your Varnish configuration, for example:
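A sketch of how that can look when starting varnishd (the `-p` flag and both parameter names are standard varnishd run-time parameters; the listen address, VCL path, and values here are illustrative, not recommendations):

```
# Start varnishd with a larger management-CLI timeout and a larger
# backend connect timeout (values in seconds, examples only)
varnishd -a :80 -f /etc/varnish/default.vcl \
    -p cli_timeout=20 \
    -p connect_timeout=5
```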

> The main process was still active, but it didn't accept connections on the incoming port (they simply timed out, they weren't refused).
>
> I only have the 'strange' varnishstat -1 log in attachment; how can I increase the 'verbosity' of varnish?
>
> Thanks in advance,
> Riccardo

I increased connect_timeout to 5s a few days ago. What is the default value for it?

BTW, I don't consider the message 'Child (17381) not responding to CLI, killing it.' a problem in my case; 'Pushing vcls failed:#012CLI communication error (hdr)' looks more interesting, but the real problem is that Varnish was frozen completely.

- https://www.varnish-cache.org/docs/3.0/reference/varnishd.html?highlight=cli_timeout

> If your varnish is heavily loaded, it might not answer the management thread in a timely fashion, which in turn will kill it off. To avoid that, set cli_timeout to 20 seconds or more.
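The same parameter can also be changed at runtime through the management CLI, without restarting the daemon; a sketch (assumes varnishadm can reach the management port of this instance):

```
# Inspect the current value, then raise it on the fly
varnishadm param.show cli_timeout
varnishadm param.set cli_timeout 20
```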

Replying to [comment:7 phk]:
> The connect_timeout has nothing to do with starting the child process.
>
> As I said, you can try to increase cli_timeout, if the problem is disk-i/o pileups.

While I understand the root cause of the timeouts in the master/child communication, and fixed it by moving the shm log to tmpfs, I still think there's a bug or at least unexpected behaviour here.

When the master fails to push the initial VCL to the child, it kills the child but does not try to restart it. The master process is left hanging, useless without a child, and requires a stop/start cycle.

I can reproduce the problem with the shm log on an HDD and heavy I/O, e.g.:
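The I/O half of that reproduction can be sketched with dd: write a file large enough to pile up dirty pages, then force write-back with sync so the disk stays busy while Varnish starts. The path and size are illustrative; scale the count up to match your RAM and run several of these in parallel against the spinning disk that holds the shm log (remove the file afterwards):

```shell
#!/bin/sh
# Create sustained disk write-back: dd fills the page cache with
# dirty pages, sync blocks until they are flushed to the physical disk.
OUT=/var/tmp/ioload.bin
dd if=/dev/zero of="$OUT" bs=1M count=64 2>/dev/null
sync
```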

> is the root process the master process that handles the 'varnish child workers'?

Not quite. As I understand it (I haven't actually checked against the code yet), the child blocks on a write to the shared memory file when cached disk writes exceed a certain amount and the disk is busy. With large RAM, the disk sync can take several seconds. If the child blocks long enough, the VCL upload from the master process times out and the child is terminated but not restarted.

> Can reducing the shm log size on disk resolve the problem?

I don't think so. Putting the shm log file on tmpfs resolved it for me: no physical disk, thus no waiting on the I/O scheduler. The Varnish Book explicitly recommends that the log must not cause physical disk I/O.
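For reference, an fstab sketch for doing that (the mount point and size are assumptions; use whatever directory holds your instance's shmlog, and size the mount a bit above the shmlog itself):

```
# /etc/fstab -- keep the Varnish shared memory log off the physical disk
tmpfs   /var/lib/varnish   tmpfs   defaults,noatime,size=170m   0 0
```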

Replying to [comment:12 lampe]:
> the child blocks on a write to the shared memory file when cached disk writes exceed a certain amount and the disk is busy. With large RAM, the disk sync can take several seconds. If the child blocks long enough, the VCL upload from the master process times out and the child is terminated but not restarted.

> Putting the shm log file on tmpfs resolved it for me. No physical disk, thus no waiting on the I/O scheduler. The Varnish Book explicitly recommends that the log must not cause physical disk I/O.

I understand, so the best approach appears to be using RAM directly or via tmpfs. But I don't have enough RAM to give to the Varnish log, so what happens when Varnish wants to store something and has no more space to write? I suppose it discards something old, makes the call to Apache, and everything keeps working fine. Am I right?

However, why use tmpfs when you could use RAM directly? With tmpfs, Varnish could have a little overhead that isn't necessary, I suppose, or am I wrong?

Replying to [comment:7 phk]:
> The connect_timeout has nothing to do with starting the child process.
>
> As I said, you can try to increase cli_timeout, if the problem is disk-i/o pileups.

@phk - I believe, in light of [comment:10 #10], that this should be reopened. I experienced the same yesterday: after many months of smooth operation, Varnish killed its child, failed to restart one, then gave up on it entirely. I don't think it's right that *one* failure in months results in a useless server that no longer accepts connections. If it failed repeatedly I could maybe see some value in halting the process, but not for one odd failure.

Apologies if reopening this is out of line, but if it is the intended behaviour I would like to hear a short explanation of why.

See this excerpt from my syslog in case it helps, though it's nothing new (and yes, I have now increased my cli_timeout for safety, but I still think this issue should be addressed):

Replying to [comment:15 martin]:
> We have added this as a future feature item, to have some parameter to control restart handling.

Being able to force it to retry N times is an improvement, but I still think that it should simply exit at the point where it has no child process alive and cannot start one anymore. Otherwise Varnish keeps running, and as far as the OS and any process-based monitoring are concerned all is well, while the system is actually unusable. This is quite a serious issue IMO.
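Until that changes, a monitor has to ask the management CLI whether a child is actually alive instead of only checking the master PID. A minimal sketch (the status line "Child in state running" is what my Varnish 3.x prints via `varnishadm status`; treat the exact wording as an assumption and match it against your version):

```shell
#!/bin/sh
# Health-check sketch: a live master PID is not proof of a working
# cache; check that the management CLI reports a running child.
# Pass the output of `varnishadm status` as the first argument.
child_running() {
    case "$1" in
        *"Child in state running"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Canned examples (real use: child_running "$(varnishadm status)")
child_running "Child in state running" && echo "OK"
child_running "Child in state stopped" || echo "CRITICAL: no running child"
```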

Replying to [comment:16 seldaek]:
> Being able to force it to retry N times is an improvement, but I still think that it should just exit at the point where it has no child process alive and can not start them anymore. Otherwise Varnish keeps running and as far as the OS and any process-based monitoring is concerned all is well while the system is actually unusable. This is quite a serious issue IMO.