One more quick update before the apocalypse. Or holiday week off. Or whatever.

We're still having minor headaches from the fallout of the power failures a couple of weeks ago. The various back-end queues aren't draining as fast as we'd like; we mostly see that in the assimilator queue size. We recently realized that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're clearing this particular queue at only 25% of capacity. We're letting it clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem.

I did cause a couple more headaches this morning moving archives from a full partition on one server to a less full partition on another. This caused all the queues to expand and all network traffic to slow down, which is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring, switching, or configuration...?

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Okay. See you on the other side...

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Thanks for the update, Matt. A Merry Christmas and a Happy New Year to you, all the other personnel at the Lab, and to all my fellow crunchers. It's been a good year for science and I was lucky enough to play a small part in it. Now we've found a potentially habitable planet just 12 light-years away -- can anyone invent instantaneous teleportation over that distance? (Not without proving Einstein wrong... :-( )
____________

Thanks for the news, Matt. To you and everyone in the lab: have a very Merry Christmas and a Happy New Year!
____________
In an alternate universe, it was a ZX81 that asked for clothes, boots and motorcycle.

If moving data from one machine to another via the network is causing a global issue like that, you are right to suspect equipment or wiring. However, it could just be some limitation in the drivers for the NICs themselves.

Do you have jumbo frame support? Maybe some Rx/Tx buffer sizes need to be adjusted, or checksum offloading needs to be enabled.

Jumbo frames on gigabit are definitely nice. I have two machines on my network that, on gigabit with the default MTU of 1500, can only manage about 270 Mbit/s, with the slower machine's CPU maxed out. After switching to a 9K MTU I get 890 Mbit/s at about 75% CPU load. This is moving data across NFS, between Windows and Linux, I might add.
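A quick back-of-envelope sketch of why the larger MTU helps so much: every packet costs the CPU roughly the same fixed overhead (interrupt, header processing), so fewer, bigger frames mean less per-byte work. The 40-byte header figure below assumes plain IPv4 + TCP with no options; it's an illustration, not a measurement.

```python
# Back-of-envelope: how many frames does 1 GB of data take at each MTU?
# Assumes plain IPv4 + TCP (40 bytes of headers per packet, an illustrative
# simplification -- real traffic varies).

def frames_needed(total_bytes: int, mtu: int, headers: int = 40) -> int:
    """Number of packets needed to carry total_bytes of payload."""
    payload_per_frame = mtu - headers
    return -(-total_bytes // payload_per_frame)  # ceiling division

GB = 10**9
std = frames_needed(GB, 1500)    # standard Ethernet MTU
jumbo = frames_needed(GB, 9000)  # 9K jumbo frames

print(f"MTU 1500: {std:,} frames")
print(f"MTU 9000: {jumbo:,} frames")
print(f"Roughly {std / jumbo:.1f}x fewer packets (and interrupts) to process")
```

That ~6x reduction in packets processed lines up with the kind of CPU relief described above.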

Good news, Matt! :)
Good luck fixing the rest.
THX for the update.
Merry Christmas and Happy New Year to you and all your loved ones!
____________

As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment.

I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning .....

In the meantime, may I wish you and the other guys in the lab a very happy Christmas and a peaceful New Year. You've earned it!


Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns.
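The "specific order" requirement is essentially a dependency problem: each service must go down before the services it depends on. A minimal sketch of that ordering logic, with entirely made-up service names and dependencies (the real SETI@home topology is not public here):

```python
# Sketch of the "specific order" problem: shut down each service only after
# everything that depends on it is already down. The service names and the
# dependency map below are hypothetical, for illustration only.
from graphlib import TopologicalSorter

# key depends on its values (they must still be up while the key runs)
depends_on = {
    "web_frontend":  ["scheduler", "mysql_master"],
    "scheduler":     ["mysql_master"],
    "assimilators":  ["mysql_master", "file_server"],
    "mysql_replica": ["mysql_master"],
    "mysql_master":  ["file_server"],
    "file_server":   [],
}

# A topological order gives a safe *startup* order; reversing it gives a
# safe shutdown order (dependents go down before their dependencies).
startup_order = list(TopologicalSorter(depends_on).static_order())
shutdown_order = list(reversed(startup_order))
print("shutdown order:", shutdown_order)
```

The hard part in practice isn't the ordering itself but coordinating it across machines and verifying each service has actually stopped before moving on.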

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Tad more than that. As was explained it used to all shut down when the UPS(s) said they were on battery. The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there.

As to a script, I think that is something that needs investigation. As many of the machines pull double duty, perhaps they can find a charge number that isn't on the SETI@home budget to pay for writing the script. If the script waited to begin the shutdown until, say, one minute of continuous mains failure, then you can be rather sure something is really up. Hopefully that isn't so long that a UPS would run dry before an orderly shutdown completes. But you test!
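The one-minute wait suggested above is a debounce: only act once the "on battery" condition has held continuously for a grace period, resetting whenever mains power returns. A minimal sketch of that logic (the 60-second threshold is just the value proposed above):

```python
# Debounce sketch: trigger shutdown only after the UPS has reported
# "on battery" continuously for a full grace period. A momentary brownout
# flips the state back to "on mains" and resets the clock.

GRACE_SECONDS = 60  # the one-minute threshold suggested above

class ShutdownDebouncer:
    def __init__(self, grace: float = GRACE_SECONDS):
        self.grace = grace
        self.on_battery_since = None  # None means mains power is OK

    def update(self, on_battery: bool, now: float) -> bool:
        """Feed the current UPS state; returns True when shutdown should start."""
        if not on_battery:
            self.on_battery_since = None   # power restored: reset the clock
            return False
        if self.on_battery_since is None:
            self.on_battery_since = now    # outage just began
        return now - self.on_battery_since >= self.grace
```

A polling loop would call `update()` every few seconds with the UPS status and the current time, and begin the ordered shutdown the first time it returns True.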

The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there.

Ah, that is a different ball game. I would have been totally amazed if the kit wasn't on UPS; it just wouldn't have been logical. But I thought UPSes could detect brownouts and, knowing they were transitory, not shut down. Anyway, isn't that the function of the UPS control software, e.g. PowerChute, which can be configured on how to react to various scenarios?
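For what it's worth, the common open-source equivalent of PowerChute, apcupsd, exposes exactly these knobs in its config file. A fragment like the following (values illustrative, not recommendations) would ride out brownouts rather than shutting down on every blip:

```
# /etc/apcupsd/apcupsd.conf fragment -- illustrative values only

ONBATTERYDELAY 60   # ignore outages shorter than 60 s (rides out brownouts)
TIMEOUT 0           # don't shut down after a fixed time on battery...
BATTERYLEVEL 20     # ...instead shut down when charge drops below 20%
MINUTES 5           # or when estimated runtime drops below 5 minutes
```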

If I told UCB what I really thought of their power supplies, they would not like it one little bit, it's almost a public scandal, and it is high time something was done about it. Although the politics will probably preclude making too much fuss. We will plod on despite UCB, not because of them.

A couple of requests for your website. Both to improve our understanding of what your systems have to deal with on a continuing basis.

1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data.....

2. Again, in relation to the 'Server Status' page, you have some very precise definitions in your 'Glossary' section. Could someone put together a data/systems flowchart so we can better understand how the data flows through your systems?

Just some thoughts to assist those of us not as technically aware of the processes involved...
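For suggestion 1, the raw material already exists on any router or host: interface byte counters. Turning two samples of those counters into the proposed Kb/s figures is a one-liner; where the counters come from (SNMP, `/proc/net/dev`, etc.) is left abstract here, and the sample numbers are made up.

```python
# Sketch: turn two samples of an interface's byte counters into Kb/s figures
# like those suggested for the status page. Counter source (SNMP,
# /proc/net/dev, ...) is left abstract; the sample values are invented.

def rate_kbps(bytes_then: int, bytes_now: int, seconds: float) -> float:
    """Average rate between two counter samples, in kilobits per second."""
    return (bytes_now - bytes_then) * 8 / 1000 / seconds

# e.g. two samples taken 300 s apart, for traffic in and out of the Lab
samples = {"in":  (1_200_000_000, 1_950_000_000),
           "out": (3_400_000_000, 3_475_000_000)}

for direction, (then, now) in samples.items():
    print(f"{direction:>3}: {rate_kbps(then, now, 300):,.0f} Kb/s")
```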

Something like this, perhaps. Green is data out from the Lab, blue is incoming. We commonly call this the "cricket graph" for reasons that may be obvious...
____________

So, why not include the summary data, not the graph, on the Status page?

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries?

Because the router is facing the other way: green is downloads to us, blue is uploads to the servers.

In and out are from the router's point of view: green is into the router from the Lab (and thus out to the world), while blue is in from outside and out to inside.
____________