The Arscoin rollout, through the eyes of the server administrators

What went right, what went wrong, and how we eventually beat it into submission.

Despair

"Does anyone know where the love of God goes," sang Gordon Lightfoot, "when the waves turn the minutes to hours?" We asked ourselves similar questions as the seconds stretched to minutes and poor coins.arstechnica.com didn't return from its reboot. It took nearly fifteen minutes for the EC2 console to report that the server had come back online.

However, after the reboot, things appeared to be mostly working—the RDS server took up the database load, and the EC2 server didn't appear to be slamming into CPU limitations, either. The coins.arstechnica.com Web interface appeared to be mostly functional, and the storefront (hosted on a server in the Ars Chicago datacenter rather than on EC2) also appeared to work.

We heaved a collective sigh of relief, and then checked on the status of the scripts that kept money moving around in our small glass bubble economy... and that's when the bottom fell out.

No payouts from mining were happening.

Pool mining works by having the central server parcel work out to mining clients. In our case, this is done using the Stratum protocol, and the mining clients are you guys and your computers. The pool's central wallet gets rewarded with all the blocks mined by all the miners; the pool software then has to take those rewards, divide them up, and send them out to all the miners who participated in earning them. The MPOS pool application we're using does this with a series of PHP scripts, which are run in order, with error trapping, by another set of shell scripts executed through cron jobs (cron, for the Windows users following along, is a daemon that fires scripts on user-defined schedules—sort of like Windows' Scheduled Tasks). We followed the MPOS guidelines on scheduling our jobs: once every minute or two, cron kicked off the MPOS run-statistics and run-maintenance scripts, and every 30 minutes it fired the run-payout scripts.
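For reference, the schedule looked something like the crontab below. The run-statistics, run-maintenance, and run-payout names come from MPOS's own wrapper scripts; the paths here are illustrative rather than copied from our server.

    # MPOS cron schedule (illustrative paths)
    # Statistics and maintenance run every two minutes...
    */2 * * * *  /var/www/mpos/cronjobs/run_statistics.sh
    */2 * * * *  /var/www/mpos/cronjobs/run_maintenance.sh
    # ...and payouts fire every half hour.
    */30 * * * * /var/www/mpos/cronjobs/run_payout.sh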

The MPOS developers have built these scripts with some amount of logic in them. Run-payouts in particular needs some care in its execution—this is the script that checks for completed work, figures out which miner did the work, and then triggers the payouts for that work. On rare occasions the script can fail with errors; when this happens, it logs the failure and disables itself to prevent itself from re-running and re-failing.
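That disable-on-failure behavior is worth seeing in the abstract, because it's what bit us later. Here's a minimal sketch of a wrapper in that style; the file names and paths are our own illustration, not MPOS's actual code.

    #!/bin/bash
    # Sketch of a disable-on-failure cron wrapper, in the spirit of
    # MPOS's run-payout script. If an earlier run failed, refuse to
    # start; if this run fails, log it and drop a flag file so that
    # cron can't re-run (and re-fail) the job.
    LOGDIR=/var/www/mpos/logs
    DISABLE_FLAG="$LOGDIR/payout.disabled"

    if [ -f "$DISABLE_FLAG" ]; then
        echo "$(date): payouts disabled by earlier failure; skipping" >> "$LOGDIR/payout.log"
        exit 1
    fi

    if ! php /var/www/mpos/cronjobs/payouts.php >> "$LOGDIR/payout.log" 2>&1; then
        echo "$(date): payout run failed; disabling future runs" >> "$LOGDIR/payout.log"
        touch "$DISABLE_FLAG"
        exit 1
    fi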

With the Web interface starting to be functional again, we could finally check the job status. With dawning horror, we realized that run-payouts had died with errors hours ago, and we'd been too busy working on the Web interface and making sure stratum miners could connect to notice.

No wonder everyone was complaining about mining and not getting paid: no payouts had gone out in hours.

These are the scripts that make Arscoin pool mining work. These are also the scripts that weren't working.

Aylward and I pondered what to do next. Forcing the payout job to run caused it to quickly terminate with the same error that had killed it in the first place: "Upstream share already assigned to previous block." Apparently, the initial burst of enthusiasm for mining arscoins had overwhelmed the pool server's ability to assign IDs to mined blocks, and at least two blocks had been assigned the same ID. (I'm not a smart enough developer to guess at why this process isn't appropriately atomic.) The simple fix for this error is to delete the second of the two blocks, causing that block's 50-arscoin reward to vanish into the ether.
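If you ever hit the same error, the cleanup amounts to finding the blocks that share an upstream share ID and removing the newer block of each colliding pair. A bulk version of what we did by hand might look like the following; the table and column names reflect MPOS's schema as we understood it, so verify against your own database before deleting anything.

    # Inspect the damage, then delete the newer block of each pair.
    # (Assumes MPOS's blocks table with id and share_id columns.)
    mysql -u mpos -p mpos_db <<'SQL'
    -- Which upstream share IDs got assigned to more than one block?
    SELECT share_id, COUNT(*) AS dupes
      FROM blocks
     GROUP BY share_id
    HAVING COUNT(*) > 1;

    -- Remove the later block of each colliding pair; its 50-coin
    -- reward is forfeit.
    DELETE b2
      FROM blocks b1
      JOIN blocks b2
        ON b1.share_id = b2.share_id
       AND b1.id < b2.id;
    SQL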

We deleted the offending block and re-ran the payout script, but within seconds, it died with the same error. We deleted dozens of non-uniquely-ID'd blocks and finally got the run-payouts script to push through a few hundred of the 1500+ uncredited blocks. We congratulated ourselves as the run-payouts script then began executing the tens of thousands of pending transactions. We kept on congratulating ourselves right up to the point where coins.arstechnica.com's Web interface simply... died.

Hope

We were as baffled at this point as we were frustrated. Htop, iftop, and iotop didn't show excessive load on any one component, though I wondered if EC2's storage was causing issues that weren't immediately obvious. As we talked through what else we could look at, the Web interface remained completely absent—and, best of all, none of the various error logs showed anything even remotely useful.

Aylward had an idea, though. He had noticed earlier that the repeating AJAX requests MPOS makes to display statistics in its dashboard (those nice live charts and graphs) purposefully disabled caching in order to always show the latest data. The feature was working as intended, but every person sitting on the dashboard added another stream of uncached requests, and coupled with the extra load from the backlogged payout script running, it was enough to effectively disable the Web interface.

Aylward altered the MPOS files so that the AJAX queries would lean on memcache instead of bypassing it... and as if someone had flipped a magic switch, the Web interface blinked back into existence.
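Conceptually, the change is small: instead of letting every AJAX poll fall straight through to the database, the statistics endpoint checks memcache first and only regenerates the data when the cached copy expires. Here's a minimal PHP sketch of the idea, not MPOS's actual code (the helper function and cache key are ours):

    <?php
    // Serve dashboard statistics from memcache, falling back to the
    // expensive database work only when the cache is cold.
    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);

    $stats = $cache->get('dashboard_stats');
    if ($stats === false) {
        // Cache miss: do the expensive query once...
        $stats = fetch_pool_statistics();  // hypothetical helper
        // ...and let every other dashboard viewer reuse the result
        // for the next 30 seconds.
        $cache->set('dashboard_stats', $stats, 30);
    }

    header('Content-Type: application/json');
    echo json_encode($stats);

With hundreds of people polling at once, even a 30-second window turns a flood of identical database queries into one.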

This pretty interface and its crazy AJAX proved to be somewhat troublesome under load—at least, until we figured out how to make it cache properly.

Recovery

We poked gently at the interface, holding our breath, but it stayed up. Readers noticed immediately, as a surge of "Well, now I can get into the interface, at least!" posts began to show up in the article threads.

The road to resolution appeared set. By this point it was around 15:00 CST, and I settled in to begin the long process of nursemaiding the run-payouts script through its multi-hour backlog. The process was boring but necessary: run the script until the first (findblocks) portion of it failed, and make a note of the block where it died. The first part's failure would then kick off the proportional payout script, which would run until it encountered the same misidentified block and then also fail. I'd then delete the misidentified block from the blocks table in MPOS' MySQL database and re-run the script.
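In hindsight, the cycle was regular enough to script. A rough sketch of the loop we were running by hand is below; the paths, the error-string match, and the disable-flag location are assumptions for illustration, not our production tooling.

    #!/bin/bash
    # Automate the delete-and-rerun cycle: run the payout wrapper, and
    # whenever it trips over a duplicated block, delete the newer block
    # of the colliding pair and try again.
    set -o pipefail
    while true; do
        rm -f /var/www/mpos/logs/payout.disabled   # clear the disable flag
        if /var/www/mpos/cronjobs/run_payout.sh 2>&1 | tee /tmp/payout.out; then
            echo "Backlog cleared."
            break
        fi
        if grep -q "Upstream share already assigned" /tmp/payout.out; then
            mysql -u mpos -p mpos_db -e \
              "DELETE b2 FROM blocks b1 JOIN blocks b2
                 ON b1.share_id = b2.share_id AND b1.id < b2.id;"
        else
            echo "Unfamiliar failure; stopping for a human."
            break
        fi
    done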

We plowed through the backlog slowly. It took another six hours for all of the discovered blocks awaiting attribution to be identified, all the tens of thousands of waiting transactions to be executed, and all of the workers to be paid.

Once we'd finally cleared the backlog, we cautiously re-enabled the cron schedule and watched as the system resumed its own automated maintenance and payouts. The scripts ran to completion with no errors.

Victory.

We finally slumped back in our chairs, watching Arscoin work. I gave away my entire ~150k stash of arscoins in huge increments, throwing thousands of coins at anyone who posted their address. Somewhat hilariously, because Aylward controls the address to which coins are sent when you guys buy your hats, he sent me another 200k arscoins after my own stash dwindled; as of yesterday afternoon, I'd given all of those out, too.

Prosperity

One final performance hiccup remained, though. On Thursday morning, we noticed that the store interface was becoming unresponsive. Aylward quickly migrated the wallet server's blockchain off of local disk and onto our datacenter's shared pile of NetApp disk, and the improvement was immediate. To celebrate, I gave out more coins, and Aylward added more hats to the store.
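The move itself is the standard recipe for relocating a wallet daemon's data directory onto faster storage: stop the daemon, copy the data, leave a symlink behind, and start it back up. Here's a sketch of the sort of thing Aylward did, with an assumed arscoind daemon and illustrative paths:

    # Stop the wallet so the blockchain files are quiescent.
    sudo service arscoind stop

    # Copy the data directory onto the NetApp-backed mount, then
    # point the old location at the new one.
    rsync -a /home/arscoin/.arscoin/ /mnt/netapp/arscoin/
    mv /home/arscoin/.arscoin /home/arscoin/.arscoin.bak
    ln -s /mnt/netapp/arscoin /home/arscoin/.arscoin

    # Bring the wallet back up on the shared disk.
    sudo service arscoind start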

At its peak, the Arscoin pool was churning out about 200 million hashes per second, with about 1500 worker processes running. As of this writing, the rate has settled to between 90 and 100 MH/s, with about 770 workers. That isn't anywhere near the size of the big pools, but it's not a trivial amount of computing power, either.

In the end, we ran into load-related issues that we hadn't anticipated, and not all of those issues were hardware-related. The caching issue in particular was a huge problem, but fortunately it was one that could be easily worked around in any number of ways once identified. The final configuration of Arscoin as it stands right now is one EC2 medium instance, one large EC2 RDS instance, and one Ars datacenter server with high-speed shared disk (whether the NetApp disks should properly be called NAS or SAN in this instance is one of those angels-on-the-head-of-a-pin debates).

And, above all else, what really matters here is that everyone can happily buy hats. Because hats, as we all know, are awesome.
