The kernel of my home office

Monday, November 14, 2016

Four and a half years ago, I wrote a blog post about Varnish, a 'front-end proxy' for web servers. My best description of it then was as a protective bubble, analogous to how its namesake is used to protect furniture. I've been using it happily ever since.

But last week, I got to really put Varnish through a test when the picture here, posted by Fair Vote Canada (one of my clients), went viral on Facebook. And Varnish saved the server and the client in ways I didn't even expect.

1. Throughput

Varnish prides itself on serving HTTP requests efficiently. As the picture went viral, the number of requests climbed to about 1,000 per minute, which Varnish had no trouble delivering - the load was still below 1, and I saw only a small increase in memory and disk usage. Of course, delivering a single file is exactly what Varnish does best.

2. Emergency!

Unfortunately, Varnish could not solve a more fundamental limitation: the 100Mb/s network connection. Because the poster was big (760KB), the network usage, which is usually somewhere in the 2-5Mb/s range, went up to 100Mb/s and even a bit beyond - the arithmetic checks out, since 1,000 requests per minute at 760KB each works out to just over 100 megabits per second. That meant the site (and others sharing that network connection) started suffering slow connections, and I got a few inquiries about whether the server had 'crashed'.

At that stage, I had no idea what was actually going on, just that requests for this one file were about to cause the site as a whole to stop responding. I could see that the referrer was almost exclusively Facebook. I also noticed that the poster on its own wasn't really helping their cause, and the client had no idea it was happening either - they had uploaded the poster to Facebook, so it shouldn't have been requesting it from their site.

Fortunately, because the limitation was in the outgoing network, there was a simple solution - stop sending the poster out. With a few lines in my varnish VCL, the server was now responding with a simple 'permission denied', and within a few seconds, everything settled down.
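The lines themselves were just a small guard in vcl_recv. A reconstruction rather than the exact lines, in Varnish 4 syntax, with a made-up file path standing in for the real one:

```vcl
# Deny requests for the poster before they touch the backend or the
# network pipe. The path here is a placeholder, not the real file name.
sub vcl_recv {
    if (req.url ~ "^/sites/default/files/poster\.jpg") {
        return (synth(403, "Permission denied"));
    }
}
```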

In fact, the requests kept coming in, at ever higher numbers, for the rest of the day, but Varnish was able to deflect them without any serious blip in the performance of the server.

3. And Better

The next day, after some more diagnostics, we discovered that the viral effect had actually come from someone else's Facebook post, which shared the poster as it had gone out in an email. Although the poster on its own wasn't going to help the cause of PR directly, we didn't really want to stem whatever people were getting out of it, so I uploaded the poster to an Amazon S3 bucket (an industrial file service) and modified my Varnish VCL to give a redirect to the Amazon copy instead of a permission denied.
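In Varnish 4, a redirect is a small two-step dance through vcl_synth. A sketch, with both urls as placeholders:

```vcl
# Redirect poster requests to the S3 copy instead of serving (or
# denying) them locally. The poster path and bucket url are placeholders.
sub vcl_recv {
    if (req.url ~ "^/sites/default/files/poster\.jpg") {
        return (synth(750, "Moved"));
    }
}

sub vcl_synth {
    if (resp.status == 750) {
        set resp.status = 302;
        set resp.http.Location = "https://s3.amazonaws.com/example-bucket/poster.jpg";
        return (deliver);
    }
}
```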

Now the poster could go safely viral.

4. And Best

After some more discussion, Fair Vote noted it would be better if people ended up on the Facebook campaign url here rather than just the poster. So I updated the Varnish VCL so that if the poster request comes from a Facebook referrer, it redirects them to the above url instead.
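The rule itself is just a referrer check layered on the same redirect mechanism. Sketched in Varnish 4 syntax, with both the poster path and the campaign url as placeholders:

```vcl
# Send Facebook-referred poster requests to the campaign page instead.
# The poster path and campaign url below are placeholders.
sub vcl_recv {
    if (req.url ~ "^/sites/default/files/poster\.jpg" &&
        req.http.Referer ~ "facebook\.com") {
        return (synth(751, "Moved"));
    }
}

sub vcl_synth {
    if (resp.status == 751) {
        set resp.status = 302;
        set resp.http.Location = "https://www.facebook.com/example-campaign";
        return (deliver);
    }
}
```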

Four days later, it seems to have worked - the poster is still pretty viral, and even the requests for the original url are still going strong (3.4 million requests in the 48 hours ending at 3am this morning).

Without Varnish, my server would have crashed and been unable to get back up, even now. Instead, the poster is still being shared, the rest of the site is still working, and the facebook share is even more effective than it would have been.

Thursday, November 06, 2014

My normal configuration of a public site on my servers involves using varnish for the page cache and setting expire page to 1 day. This mostly works quite well (the varnish module in Drupal takes care of clearing the varnish cache when you're creating/editing content).

We recently launched a new Drupal version of the Calgary French & International School (okay, I was just along for the tail end to help with the launch, Karin and Rob get the credit for all the work), which includes an ical feed for parents (generated from views of course).

That's an excellent thing - parents can subscribe to the feed and have all the upcoming events on their mobile device (or google calendar, or both). But we discovered that although it works great on the Mac desktop, it wasn't working well for iOS (i.e. the iPhone). It would poll frequently enough, but only actually update once a day.

It turned out that these two devices interpret the http header 'cache-control' differently - the iPhone appeared to read it as saying don't bother looking for fresh data more than once a day. The header is unfortunately not very well defined, but it is used by Drupal/Varnish to control the maximum expiry date, so we didn't want to crank it too low (or risk a badly performing site, since most access is anonymous).

The solution was actually simple: a little help in the varnish vcl file, in my vcl_deliver function, below. The piece I added was the second if, and it's just modifying the cache-control header on output if it's delivering a file with extension 'ics'.
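Reconstructed, the added if is along these lines (the 5-minute max-age is illustrative, not necessarily the value we settled on):

```vcl
sub vcl_deliver {
    # ... first if omitted ...

    # Shorten the client-side cache lifetime for ical feeds so iOS
    # actually fetches fresh data when it polls.
    if (req.url ~ "\.ics$") {
        set resp.http.Cache-Control = "max-age=300";
    }
}
```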

Tuesday, April 15, 2014

The 4.3 version of CiviCRM that first came out in April 2013 addresses a key problem with CiviCRM for large organizations: namely, accounting integration.

So what exactly does that mean, and how does it work? Since I'm working on a big migration to CiviCRM, and the client has "accounting integration" needs, I've been diving in and trying to understand the nitty gritty. Since I started, 4.4 has come out and 4.5 is almost out, and I understand they've made some improvements, so this is now a bit dated, but it still might be helpful.

First off, "accounting integration" doesn't mean that CiviCRM will now replace your accounting package; the goal is to make it play nicer. The key issue for my client is that reports coming out of their current system are used as input for their accounting system, so it needs to speak the same language - i.e. things like GL account codes, and double-entry accounting for things like contributions where the donor is promising to send a cheque. I like to describe it as: making CiviCRM a trusted entity within the accounting system, instead of its current status, where reports generally involve some wishful thinking, caveats, and a checklist of adjustments.

Initially I've just been going along assuming everything will take care of itself under the hood as I use Eileen's brilliant civimigrate module in combination with the Drupal migration module. But after a few failed attempts at getting what I want, I've been forced to try and understand what's really happening, and so that's what I'm sharing here.

The aha moment came after looking at the new civicrm_financial_* tables and reading that page.

My key insight was:

In the past, a row in the contribution table was playing too many roles: it was the transaction record as well as the accounting record. In this new, more sophisticated world of accounting, you still get one row in the contribution table per "contribution", but that row really serves as a simplification of two collections of things: the collection of financial transactions that go into paying for it, and the collection of items that describe how that contribution is accounted for - where it went. In the simplest case, you get one of each (e.g. a cheque for a simple donation). But at its most complicated, you might have a series of partial payments for an event ticket, some of which is receiptable as a tax donation and some of which goes towards the cost of the event. Yes, that's why they call it 'double entry'.

Implementing a system like this over top of an existing system is a bit tricky, and the developers seem to have been a little bit worried about the complexity implication for users that didn't want to know about this. So you have to dig a bit to see what's going on, and I'm probably missing some details. Patches welcome ...

One way of describing what we're doing is 'splitting' the contribution. The contribution becomes a container for a collection of inputs (financial transactions) and a collection of outputs (attribution of the contribution to accounting codes). The original contribution table still contains a single entry per contribution, but the possibly multiple transactions that pay for it, and the possibly multiple attributions of that income, need to live in related tables.

One trick the developers used was to create something they call a 'financial type'. Using generic labels for specific purposes is a bad idea, and they really should have called it something more specific. The point of this entity is to allow administrators to delegate the accounting to a set of rules for each 'financial type' - meaning, the way a contribution gets allocated to the account codes is determined by the nature of the transaction (i.e. income, expense, cost of sales, etc.), which is then looked up for the 'financial type', and that determines the accounting code. Fortunately, this is just a mechanism for calculating the actual accounting - the data gets stored fairly directly.

Now let's check out the new civicrm_financial_* tables.

civicrm_financial_item - this looks like the accounting for each entry in the contributions table. It includes entity_table and entity_id fields that link it to more information, e.g. an entry in civicrm_line_item. It doesn't provide you the accounting directly, but it gives you the financial_account_id, and you can look up the accounting code directly from the civicrm_financial_account table.

civicrm_financial_trxn - these are the financial transactions that make up the contributions. You'll see it has things like the from and to account fields (to allow for both external and internal transactions, like when a check is received and the transaction is transferred from accounts payable to your bank account), as well as transaction id's, currency, amounts, dates, payment instruments, etc., i.e. everything you need to follow the money.

civicrm_entity_financial_trxn - this is the table that joins the above two tables to the contributions table. A simple typical contribution will have two entries, one pointing to the financial item, and the other to the financial transaction.
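To make that concrete, here's roughly how you'd follow the money for one contribution (id 42 is a placeholder, and the joins are my reading of the 4.4-era schema, so treat this as a sketch):

```sql
-- The transaction(s) that paid for contribution 42:
SELECT ft.trxn_date, ft.total_amount, ft.trxn_id
FROM civicrm_entity_financial_trxn eft
JOIN civicrm_financial_trxn ft ON ft.id = eft.financial_trxn_id
WHERE eft.entity_table = 'civicrm_contribution'
  AND eft.entity_id = 42;

-- The accounting item(s) behind those same transactions:
SELECT fi.amount, fi.description, fa.accounting_code
FROM civicrm_entity_financial_trxn eft
JOIN civicrm_entity_financial_trxn efi
  ON efi.financial_trxn_id = eft.financial_trxn_id
 AND efi.entity_table = 'civicrm_financial_item'
JOIN civicrm_financial_item fi ON fi.id = efi.entity_id
JOIN civicrm_financial_account fa ON fa.id = fi.financial_account_id
WHERE eft.entity_table = 'civicrm_contribution'
  AND eft.entity_id = 42;
```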

Okay, now let's dig a little deeper:

In the financial_item table, which holds the accounting, there is also a reference to a 'more information' record, via entity_table and entity_id fields. In my install, it's pointing at the civicrm_line_item table most of the time, except for old entries imported from 4.2 that point at the civicrm_financial_trxn table.

civicrm_line_item - I'm not sure why you'd need a reference to this, but I guess it does help track back to how the financial_item got created. Specifically, it has a 'financial type id' field, which in combination with the transaction, could be used to calculate the financial account id that ends up in the financial item.

civicrm_financial_trxn - I'm guessing that the only time a financial_item references this table is when there's a direct correspondence between the transaction and the accounting. For an install that was migrated from 4.2, for example, that's the case for all the old transactions that assumed this and for which there is no intervening line item to split them up. Maybe "backend" administrative entries and adjustments end up here as well?

And now back to the other financial tables:

civicrm_financial_type - a list of these 'financial type' abstractions. There are no accounting codes in here; you have to find the connection to the account id using something like:
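As far as I can tell, the mapping lives in a join table, civicrm_entity_financial_account, whose account_relationship field says what role each account plays (income account, accounts receivable, and so on). 'Donation' here is just an example financial type name:

```sql
-- Look up the account(s) behind a given financial type.
SELECT fa.name, fa.accounting_code, efa.account_relationship
FROM civicrm_financial_type ft
JOIN civicrm_entity_financial_account efa
  ON efa.entity_table = 'civicrm_financial_type'
 AND efa.entity_id = ft.id
JOIN civicrm_financial_account fa
  ON fa.id = efa.financial_account_id
WHERE ft.name = 'Donation';
```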

civicrm_financial_account - the list of the accounting codes for each 'account', i.e. what you want to get from your bookkeeper when you set things up.

Conclusion: it's pretty complicated, and obviously you don't want to be manually mucking with these tables. In theory, the structure allows for some nice splitting up of income into different accounting categories, but at this stage, the interface is trying to hide most of the details to keep things simple for users.

Thursday, March 13, 2014

and sent the link off to some friends and family. They had some good things to say, and some of that helped me clean it up a bit. But the feedback and discussions I had also helped me to step back a bit from the specifics of that proposal and think more generally about the problem.

The problem I'm talking about is a mash-up of technical detail, privacy concerns, security concerns and good old-fashioned apocalypse, with a dash of anti-government conspiracy thrown in. So there's definitely more than one way to look at it. I like to think of it as "the collapse of trust on the Internet as we know it".

Here's the scenario: at some point in the next 5 years, a method is discovered that allows people with enough computing power to decrypt 'secure' https connections. Once this is generally known to the public (e.g. via a leak like Mr. Snowden's), no one will 'trust' that any communication on the Internet is safe. Banks and credit card companies will stop accepting any transactions from the Internet, and e-commerce will collapse. How that will impact the world, I'll leave to your imagination, but I don't think it will be pretty.

The anti-establishment rogue in me gets some satisfaction from that scenario, but I also know that in a crisis, it's the people at the bottom of the ladder that get crushed, and mass human suffering isn't something I'd like to encourage.

So here are some follow-up notes to my post:

What problem are we trying to solve?

Avoiding a disaster is a nice big picture goal, but not one that lends itself to a specific solution. One way of framing the problem is narrowly, which is what I suggested in my post - i.e. focus on the mathematics behind the encryption problem.

On the other hand, perhaps that's not the right problem to solve? It's not a new problem - it's been around for about 20 years, and there hasn't been a whole lot of progress or change.

The mathematical piece of the problem as it is currently framed is about how to provide a "Public Key Infrastructure" (PKI) using mathematics. A PKI is a way of solving the more abstract problem of 'how do you establish trust between two parties on the Internet', where the only communication between them is a stream of bytes that appears to be coming from a source reliably identifiable only as a number? What if that doesn't have a reliable solution?

The short version of what suddenly got quite complicated is this: this part of internet security was designed for e-commerce, in a bit of a hurry, back in the early days of the Internet when machines were less powerful and e-commerce was a dream. Then the dream actually came true (after the Internet bubble and collapse) but those emperor's clothes are pretty skimpy.

So "who do you trust and why" is the bigger, more abstract problem, and treads on some scary ground. Is there a different solvable technical problem somewhere in here, bigger than the mathematical problem of a PKI but smaller than the completely abstract one?

A smaller, more tractable problem is 'symmetric encryption' (which isn't a mathematical solution to a PKI on its own), and this solution has been adopted as a new standard. In other words, if you have a prior relationship with someone and a way of sharing secrets outside of the Internet, then a secure private channel is not all that difficult.

This appears to be a solution to negotiating a shared random secret key, which solves part of the PKI problem (it helps provide a secure channel with your correspondent, it doesn't help prove who they are).

c. Human nature

Yeah, just kidding. Just to be clear though - none of this solves the general problem of fraud, or of how humans have built a glorious, terrible, fragile thing out of machines and social interaction. Perhaps that part of the problem (who do you trust) is not going to have a technical solution.

Tuesday, December 17, 2013

I run a web development business, and am always engaged in a question about how many of my supporting services should be contracted out or done myself. And for what I don't do myself, who I can trust to deliver that service reliably to my clients. And what to do when that service fails.

This is not an academic debate this week for me.

On Sunday, my server-hardware supplier failed me miserably. On Friday, I had notified them of errors showing up in my log related to one of my disks (the one that held the database data and backup files). They diagnosed it as a controller issue and scheduled a replacement for Sunday early morning. So far so good. It took longer than they had expected, but the server came back and seemed to check out on first report, so I thought we were done. It was Sunday morning and I wasn't going to dig too deep into what I thought was a responsible service provider's area of responsibility.

On Sunday evening, Karin (my business associate at Blackfly) called me at home (which she normally never does) to alert me that the server was failing. By that point, the disk was unreadable, so we scheduled a disk replacement and I resigned myself to using my offsite backups, which were now a day older than normal because the hardware replacement had run over the hour when the backup normally runs (why didn't I run it manually after the hardware "upgrade"? yes, I know).

That server has been much too successful of late, and loading all the data from my offsite server was much slower than I'd anticipated (i.e. 2 hours), and then running all the database restores took a while. To make it worse, I decided it was a good opportunity to update my MariaDB (MySQL) version from 5.2 to 5.5. That added unexpected extra stress and complications (beware the character set configuration changes!), which I can mostly only blame myself for, but at least I suffered for it correspondingly with lack of sleep.

But then on Monday, after sweeping up a bit, I discovered that the hardware swap done on Sunday morning had not only failed to address the problem (and made it much harder, by postponing what could have been a simple backup to the other disk) - they had actually swapped good hardware for older hardware of lesser capacity. In other words, the response to the problem had been to make it considerably worse. I had a few words with them; I'll give them an opportunity to come up with something before I shame them publicly.

Now it's Tuesday morning and the one other major piece of infrastructure that I outsource (DNS/registration, to hover.com) is down, and has been for the last hour.

In cases like this, my instinct is to circle the wagons and start hosting out of my basement (just kidding!) and run my own dns service (also kidding, though less so). On the other hand, the advantage of not being responsible is that it gives me time to write on my blog when they're messed up.

Conclusion: there are no easy answers to the outsourcing question. By nature, I take my responsibilities a little bit too close to heart, and have a corresponding outlook on what healthy 'growth' looks like. Finding a reliable partner is tough. It's what I try to be.

Update: here's an exchange with my server host, after they asked when they could schedule time to put the right CPU back in, and whether I wanted to keep the same ticket or open a different one:

Me:

Thanks for this. I don't care if it's this ticket or another one. Having a senior technician to help sounds good, and I wonder if you could also tell me what you plan to do - are you going to put back my original chassis + cpu or try to swap in my old cpus into this chassis? Or are you just going to see what's available at the time?

The cavalier swapping of mislabelled parts after a misdiagnosis of the original problem points to more than a one-off glitch, particularly in light of previous errors I've had with this server - it sounds to me like you've got a bigger problem, and having a few extra hands around doesn't convince me that you've addressed it.

What I have experienced is that you are claiming and charging for a premium service and delivering it like a bargain basement shop.
Them:

We will check available options prior to starting work during the maintenance window.

We are currently thinking we would like to avoid the old chassis in case there are any SCSI issues and move the disks to another, tested chassis. As an option, we could add a CPU to the current server.

If you have any preference on these options, we will make it the priority.

I apologize again for the mistakes made, and the resulting downtime you have experienced.

Saturday, October 19, 2013

This past month one of my servers experienced her first DDOS - a distributed denial of service attack. A denial of service attack (or DOS) is just an attempt to shut down an internet-based service by overwhelming it with requests. A simple DOS attack is usually relatively easy to deal with using the standard linux firewall, iptables, which filters traffic based on the incoming request's source (i.e. the IP of the attacking machine). The attacking machine's IP can be added to a custom iptables 'blacklist' to block all traffic from it, and that's quite scalable, so the only thing that can be overwhelmed is your actual internet connection, which is hard to do.

The reason a distributed DOS is harder is that the attack is distributed across multiple machines. I first noticed an increase in my traffic about a day after it had started - it wasn't slowing down my machine, but it did show up as a spike in traffic. I quickly saw that a big chunk of the traffic was all of the same form - a POST to a domain that wasn't actually in use except as a redirect. There were several requests per second, and each attacking machine would repeat the same request about 8 times. It was coming from so many different machines that it wasn't feasible to keep adding their IPs to my blacklist by hand.

It certainly could have been a lot worse. Because it was attacking a domain that was being redirected, it was using up an apache process, but no php, so it was getting handled very easily without making a noticeable dent in regular services. But it was worrisome, in case the traffic picked up. It was also a curious attack - why make an attack on an old domain that wasn't even in use? My best guess is that it was either a mistake, or a way of keeping someone's botnet busy. I have heard that there are a number of these networks of "zombie" machines, presumably a kind of mercenary force for hire, and maybe if there are no contracts, they get sent out on scurrilous missions to keep them busy.

In any case, I also thought a bit about why Varnish wasn't being useful here. Varnish is my reverse-proxy protective bubble for my servers (yes, kind of like how a layer of varnish protects your furniture). The requests weren't getting cached by Varnish because, in general, it's not possible to responsibly cache POST requests (which is presumably why a DDOS would favour this kind of traffic). To see why, just imagine a login request, which is a POST - each request will have a unique user/pass, and the results will need to be handled by the underlying CMS (Drupal in my case).

But, in this case, I wasn't getting any valid POST requests to that domain anyway, so that made it relatively easy to add the following stanza to my varnish configuration:
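Something along these lines (a reconstruction, in Varnish 3-era syntax; the domain is a placeholder for the old redirect-only domain):

```vcl
# Reject all POSTs to the attacked domain at the Varnish layer, so
# they never tie up an Apache process.
sub vcl_recv {
    if (req.http.Host ~ "(^|\.)old-domain\.example$" &&
        req.request == "POST") {
        error 403 "Forbidden";
    }
}
```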

And indeed, now all the traffic is bouncing off my Varnish and I'm not worrying. If it had been a domain that was actively in use, I could have added an extra path condition (since no one should be POST'ing to the front page of most of my domains anyway), but it would have started getting trickier - which is why you won't find Varnish too helpful for DDOS POST attacks in general. As usual, the details matter, and in this case, since I was being attacked by a collection of mindless machines, the good guys won.

Wednesday, August 07, 2013

In the old days, during "polite" conversation it was considered rude to talk about sex, politics, religion and money. You might think we're done with taboos; we're not (and I'll leave Steven Pinker to make the general argument about that, as he does so well in The Better Angels of Our Nature).

The taboo I'm wrestling with is about money - not how much you make, but about online payment processing, how it works, and what it costs. In this case, I think the taboo exists mainly because of the stakes at hand (i.e. lots of money) and the fact that most of those who are involved don't get much out of explaining how it really works - i.e. the more nuanced communications are overwhelmed by sales-driven messaging, and the nuanced stuff is either proprietary secrets or likely to get slapped down by the sales department.

In other words, if you want to really understand about online payment processing because you want to decide between one system and another, you're in for a rough ride.

Several years ago I wrote a couple of payment processors for CiviCRM, and more recently I've been working on a new version of one of them. At the same time, two clients have recently been trying to understand their existing payment processor services in order to integrate those processes into their CiviCRM installations. So this is my "Payment Processor primer for CiviCRM administrators" blog post.

What You Need To Know

Here's a simplified but useful model of what's happening. A typical online payment has three phases, and each phase may be the responsibility of a different (or the same) service provider. I'm talking about a typical real-time transaction via credit card - other flavours will introduce new complications.

Phase 1: The Form

The web form is the public interface where the visitor inputs things like a name and credit card number. Sometimes, it's a two part form. Depending on the transaction, you'll want this form customized so that your visitor doesn't get confused and leave. The "depending" bit is really about your visitor's relationship to you. If they already know and love you, it probably matters less. If they're new and not yet convinced they want to give you money, it's big.

CiviCRM can support the form, but also supports payment processors that insist on doing the form themselves (e.g. paypal standard). The big advantage to CiviCRM doing the form is customization and not-alarming-or-confusing-the-visitor (e.g. the paypal form allows credit cards, but many people get to that form and bail because they think they need to sign up for a paypal account). The big disadvantage is that you need to worry about your server and something called PCI compliance, which is another topic.

Phase 2: The Transaction Processing or The Payment Gateway

This phase starts after the visitor clicks the submit or confirm button and may happen entirely in the background, or may involve the visitor in supplementary forms and clicking. This phase is the responsibility of a "Payment Gateway", a service that you have to buy unless you're a large corporation that builds their own. This payment gateway service has business contracts and software relationships with the phase 3 section. The key services they provide are to abstract away the individual details of the different card company interfaces and to take responsibility for financial compliance stuff (e.g. they need to keep those credit card numbers very, very safe ...).

CiviCRM does not try to do this, but provides interfaces to many payment gateways, and in theory allows you to write an interface to any payment gateway that publishes some kind of "API" or programmer interface. It can be confusing because many payment gateways also try to be in the business of providing Phase 1 services (e.g. 'hosted forms'), and it may not be obvious whether there is such a thing as an API; sometimes they call it something else.

Phase 3: Transaction Completion

This is the murkiest phase, where the payment gateway, the institution behind the card (issuing bank or card association) and the "merchant account" all exchange information, and some kind of electronic trail gets laid that eventually results in money being transferred from the card holder to the "merchant account". The "merchant account" is a special kind of bank account that is enabled for credit card payments. What makes it special is that the credit card companies have a noose around its neck - i.e. they take a chunk of money before it gets to the account, and reserve the right to take the money back if there's a problem. The "merchant account" might be directly owned by you the site owner, or it might be owned by someone else who then passes the money to you.

It's not unreasonable to confuse this phase with phase 2 since they happen together, and since phase 2 without phase 3 is kind of pointless, but it's important to separate it out in terms of costs and responsibilities. Phase 2 is really a technical and business relationship service that is handling the details of the transaction (kind of like an electronic teller, or maybe the hired gun in a drug deal). Phase 3 is where the money ends up and is accounted for (the backroom settling of accounts ...).

It's also important to separate them out because you can have phase 3 stuff going on without phase 2. For example: a 'recurring transaction' where a donor says they'll give you money every month. Once the initial deal is sealed, the subsequent transactions don't need to go through phase 1 or 2 (but might anyway).

What You Actually Get and How Much it Costs

So the challenge of comparing various payment processor "solutions" is to figure out the apples and oranges. With CiviCRM you have to buy services from at least one company in order to get online payments from credit cards, but any company you find may offer a mix of services covering these three phases. Paypal standard will only give you a soup-to-nuts end-to-end solution. A merchant account will only get you the last phase and you still need a payment gateway service. If you don't have an ssl certificate for your site, you will need phase 1 services, but if you do have ssl, you probably don't want phase 1 services. Most payment gateway services will offer to bundle in a merchant account and/or a hosted payment form service. And each of these offerings will be better than the competition for reasons x, y and z. And each one will use a different vocabulary to describe what they are giving you.

So, here's what you should expect and look at:

1. Phase 1 services. You probably shouldn't pay anything ongoing for this unless they are providing really good customization - it should be a one-time fee for customizing the form, plus a fee for changes. Getting CiviCRM to host this form is usually a better idea unless you're doing cheap hosting and/or can't get an ssl certificate.

2. Phase 2 services. These will typically cost a monthly or yearly fixed fee, sometimes a setup fee, and probably always a per-transaction fee. There's no particular reason there should be a % fee, since the costs of providing these services are basically per-transaction plus setup and account maintenance, unless the company is trying to do some kind of gambling, which is stupid.

3. Phase 3 services. The merchant account part of the service is really about paying off the card mafioso, plus an extra handling charge to the bank. Each major card has its own rates that it charges the bank; MasterCard and Visa are similar, and Amex costs about an extra 1%. If you're a small-to-medium organization, you'll probably pay a pretty standard amount, but if you're a large organization, you can usually negotiate a better rate, which just reduces the extra amount the bank charges you. It'll never go below the industrial rate, which is complicated (i.e. it fluctuates and depends on lots of things), but I'd hazard it lives at around 1.5% (why not? for a start, consider all those points reward systems out there and put that together with the certainty that card companies aren't losing money ...).

One thing this model helps you do is compare the bundled services, which will typically be the monthly or yearly + per-transaction costs of the payment gateway plus the % costs of the merchant account. You can sometimes see how they gamble on the % costs and give you a single 'blended rate', and sometimes gamble on the size by shifting costs back and forth between per-transaction and % rates.