In October, the night before Halloween, we co-hosted a special Test in Production session with Honeycomb.io. We asked speakers to bring their best horror stories of operational outages and other scary events.

Marc Devens, Director of Product at BrandVerge, shared two stories. First, he spoke about what happened when Appeagle, a company that provides price management software for third-party sellers on Amazon, eBay, and Walmart, was caught up in a major pricing error. (Items that typically cost thousands of dollars were somehow listed as one cent.) Marc also shared a short story of how his current team at BrandVerge avoided potential disaster when launching 5+ major features with an offshore contract team after many months of only shipping fixes.

"When you really want to simulate the full lifecycle of the kind of events that you have, send a feed over to them, you wait for the response. You can only do that with real customer data." - Marc Devens, Director of Product at BrandVerge

Watch the video below to learn how Marc ensures these kinds of disasters never happen again. Learn more about how and why he tests in production. If you're interested in joining us at a future Meetup, you can sign up here.

Transcript

Marc Devens: I'm Marc Devens, Director of Product with BrandVerge. Similar to Paul's story, I'm speaking about experiences at a company that I was previously with, a little bit more than that. Not only is this a company that I'm no longer representing, I'm no longer with, I actually wasn't with the company at the time of this. That's a pretty big caveat. I did have this cleared by the CTO of the company to make sure that I had his blessing to kind of represent the story as a lesson for everyone else, one of these really scary stories that happens. It was really kind of foundational to the way we did things there, because it was such a scary thing that happened at such an early stage within the company. This happened in July of 2012. I joined the customer success team at that company. The company is now called Informed.co. At the time, it was called Appeagle. I joined the team in 2014, so just shy of two years later, before I eventually moved into product. What we did, what I did with them, was price management software for third party sellers on Amazon, eBay, and Walmart.

Amazon was really the vast majority of this business, so it's really what we're going to focus on; this experience affected Amazon. That landscape has changed a lot since this first happened in 2012, but we have some stats here. You're an Amazon seller, surprise. I think this is really important to kind of understand who these people are, who the people are that are sellers, because it's always really important to understand your users, and who you're making a product for. We can go into some background about what that sort of landscape is like. These are recent stats. This was different back in 2012. Right now there are more than one million third party sellers on Amazon. When you go on there and buy something, a lot of the time you don't even realize that you're buying it not from Amazon, you're buying from one of these third party sellers. There's over 30 billion dollars in annual revenue, just from these third party sellers. There are also two options for doing fulfillment.

There's Amazon FBA, which is when you as a seller will send your inventory into Amazon and they'll send it off to customers when they buy it. Or you can do seller fulfilled, you're fulfilling it on your own. We'll go into that a little bit later, because that actually plays a really important aspect into this. What happened? We know that Amazon is kind of the everything store, and so we can talk about what you can buy, and what you can sell on there. Obviously there's books, that's Amazon's kind of foundation, that's what they got started in. Third party sellers can sell books on there too. What else can you buy? You can buy tools, you can buy toiletries, groceries, anything you want. Going kind of higher up on the tickets, there you can buy TVs, you can buy cameras. You can buy anything on Amazon, and you could sell anything on Amazon pretty much. How does price management software work? Basically, if we go back to 2012, at the time it was pretty basic.

You could set minimum and maximum thresholds for your pricing, and there were basic rules that you could have, to say that, "If a competitor on the listing raises or lowers their price, I want mine to kind of follow along with them." Now there are more advanced algorithms to kind of take a lot of this, you don't have to come up with your own strategies, the software can do it for you. At the time, that's how it worked, "My customer, my competitor would lower their price from $100 to $99.99, and I want to match them and be competitive with them so I don't lose out on sales." If you don't want to have the soft price for your inventory, that's fine. You can just not set a minimum price. Those items without that minimum price, they get excluded from the repricing. Except, except, when you modify that single line of code, in a way that makes these items, all without a price, go to a penny. That is a pretty scary sort of event, and that's what caused that to happen.
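The matching rule described above can be sketched roughly as follows. This is a minimal illustration, not Appeagle's actual code: the function name `reprice` and its parameters are hypothetical, and it assumes an unset minimum is represented as `None`.

```python
from decimal import Decimal
from typing import Optional


def reprice(competitor_price: Decimal,
            min_price: Optional[Decimal],
            max_price: Optional[Decimal]) -> Optional[Decimal]:
    """Return the price to send, or None to leave the listing untouched."""
    # No minimum set: the item is excluded from repricing entirely.
    if min_price is None:
        return None
    # Follow the competitor, but never drop below the seller's floor.
    target = max(competitor_price, min_price)
    # Respect the ceiling when the seller has set one.
    if max_price is not None:
        target = min(target, max_price)
    return target
```

The important design choice is that "no minimum" produces no price update at all, rather than a numeric default that could be sent to the marketplace.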

We had all these items that didn't have that min price set. In this system that was kind of a null, or a zero, and something changed in the code that allowed that to get sent to Amazon as a zero. Amazon, unfortunately, rather than rejecting that, they did something that I guess they must have thought was very convenient, they rounded up from zero to a penny. Don't know why they would do that, it seems like it would be a lot easier to just reject that and toss that out, but a penny, that's what it did. All these things that we had here, TVs, cameras, items that could sell for thousands of dollars, they all of a sudden went to that penny. As you can imagine, customers took advantage of that. I feel like if you were on there buying something, and you saw a $3,000 camera, some SLR, for a penny, you probably knew that something wasn't going right, that wasn't some crazy sale that someone was having, but people pounced on that. I can't really blame anyone for trying.

You know how I was talking about the different kinds of fulfillment a little bit ago, where you could either do that seller fulfilled, or that FBA, that Amazon fulfilled. That's where things really started to kind of compound, because with this Amazon fulfillment, these orders kind of started just flying out the door already, sometimes even before a seller knew what had happened. They had gotten this order within an hour or two. These orders have already shipped to customers. This sort of thing is basically the worst case scenario for a company in this industry. Didn't mention that this was really early in this company's history. This was the first day of the fourth employee. Not really a company equipped to handle something like this. How do you even begin to clean something like this up? It was only four people at the time, kind of circle the wagons, trying to figure out how to do it. They were able to thankfully identify what had happened and correct that, get that changed, undone pretty quickly, but then you had to spring into action and really start working with people to get those fixes in.

One of the first steps was to notify Amazon. They needed to be made aware that these hundreds, or thousands, of orders had gone in kind of mistakenly. Also, wanted to work with them to make sure that the sellers, that they could have the best chance of cleaning this stuff up, maybe not fulfilling them, or if they had to cancel orders, that the seller metrics and things that they're compared on wouldn't get dinged, it wouldn't hurt the seller. The company acknowledged full responsibility for the incident, it wasn't the, tried to make that clear to Amazon that this was not the sellers fault. This was the company's fault for the mistake that they had put into place. Then it's working with sellers, trying to get them, all those people that bought these 100, or a $1,000 ticket item for a penny, trying to say, "Hey, I hope you realize that this was a mistake. You, we weren't actually offering this to you as a penny. Can you please go ahead and cancel that order." You of course had obstinate customers that said, "I paid a penny. I'm going to get the product that I paid for."

Then it was kind of at the sellers discretion, maybe they just went ahead and fulfilled that, if it was a low ticket item, it wasn't worth taking the hit on that, on their metrics to cancel that order. Or they just canceled it themselves and took that hit. Now on the company's side, how do you deal with these dozens, hundreds of customers out there, that had something pretty traumatic happen to them? Because this is, it's not only our business that was affected like this, these are all individual businesses that were affected in this way. There was a lot of service credit issued, as a small company that's kind of one of the best ways that you can kind of make good with someone. There were some reimbursements done to kind of cover any losses for people that did go ahead and fulfill those orders at a loss. As a reminder to anyone running a business, don't ever forget to pay your business insurance, because that comes in handy when you have to do some sort of payouts like this.

Surviving and thriving, this was really early stages in the company, this was 2012, the company's still around. It has 10 times the number of employees that it had then. Probably a similar, or even bigger multiple of customers. How do you get through an incident like this and not have it kill off the company? One of the first things is obviously to ensure that it never happens again. You need tests and safety checks. Some of the things that we implemented there afterwards were to specifically check for certain kinds of changes, that if there was anything going out with a zero, or a penny, to really reject those, sound all sorts of alarms to make sure that those weren't hitting any of the marketplaces that we worked with. Even though we went through that issue, and it was pretty visible within that sphere, a lot of people knew that that happened, that didn't prevent other companies from having the same exact mistake repeated on their side.
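A last-line safety check of the kind described here might look like the sketch below. It is an assumption-laden illustration: `check_outbound_price`, `UnsafePriceError`, and the one-cent threshold are hypothetical names and values, not the company's actual implementation.

```python
from decimal import Decimal
from typing import Optional

# Hypothetical threshold: anything at or below a penny is treated as a bug.
PENNY = Decimal("0.01")


class UnsafePriceError(ValueError):
    """Raised instead of letting a suspicious price reach a marketplace."""


def check_outbound_price(sku: str, price: Optional[Decimal]) -> Decimal:
    """Validate a price immediately before it is put into a feed."""
    # A missing price must never be coerced to zero on the way out.
    if price is None:
        raise UnsafePriceError(f"{sku}: no price set, refusing to send")
    # Zero or one cent is almost certainly an error, so reject and alarm.
    if price <= PENNY:
        raise UnsafePriceError(f"{sku}: price {price} is at or below a penny")
    return price
```

Raising an exception rather than silently skipping the item means the failure is loud, which is the "sound all sorts of alarms" part of the check.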

I was working on the customer success side, and about six months into my role, got a call from a customer saying, "I had all these products just go to a penny," and I was sitting there on the phone, [inaudible] was, "Oh, my God. This happened again. What did we do?" Eventually came to realize that that customer had used another competing piece of software in the past, that actually they were no longer using, but they never revoked its authorization to their Amazon account. This company still had access to their Amazon account, even though they weren't even using it anymore. The same sort of thing as what had happened to us, that these things that weren't actively managed, they didn't have that min price, they got sent to zero. All the inventory that wasn't being managed by that company got sent to zero. Pay attention to the mistakes that other companies make so that you do not repeat them yourself. Taking care of your customers is again one of the most important ways that the company got through it. This was very present within the company.

A lot of people knew that this had happened to the company. If you Googled the name of the company because you were interested in purchasing their services, you would find articles that had been written about this. When I first worked in that customer facing role, and I'd speak to customers that were interested in trying out the product, they'd say, "What have you done to make sure that this never happens again?" I can go into the things above, all the tests and safety checks, but at the end of the day, a customer doesn't really care about that. They say, "Well, it's happened before, how do you ensure that it doesn't happen again?" What I would say to them is that, "We had customers that were impacted, that really bore the brunt of this, that are still our customers." I know that even today, six plus years after the fact, there are customers that are still with the company. That's the way that this company made it through that event. Going to the theme of this meetup, Test in Production, one of the specific peculiarities with Amazon is that there is no sandbox for their API.

When you really want to simulate the full lifecycle of the kind of events that you have, send a feed over to them, you wait for the response. You can only do that with real customer data. You can try and have some test marketplaces of your own, but it's not going to scale the same way that real live products do. There's some element to just being able to control your rollout and make sure that you have a way to turn it back off. A gate for everything. You get a gate, and you get a gate. That was one of the things that we really learned after this, is that you really need to make sure that you can roll back and adjust quickly. You don't have to spend that time trying to undo whatever problem that you have. A flick of a switch gets you back to where you were. That was definitely one of the major lessons for the company, the way that we did things going forward, and in my takeaways, the way I do things now, even though I'm at a new company. Just, that was kind of this horror story.
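The "gate for everything" idea can be sketched as a simple switch in front of a new code path. This is a minimal in-memory illustration with hypothetical names (`Gates`, `reprice_with_gate`, the `new_repricer` gate); a real system would typically back the gate with a feature flag service or configuration store so it can be flipped without a deploy.

```python
from decimal import Decimal
from typing import Callable, Dict

Repricer = Callable[[Decimal], Decimal]


class Gates:
    """Tiny in-memory gate registry; every gate defaults to off."""

    def __init__(self) -> None:
        self._enabled: Dict[str, bool] = {}

    def enable(self, name: str) -> None:
        self._enabled[name] = True

    def disable(self, name: str) -> None:
        self._enabled[name] = False

    def is_enabled(self, name: str) -> bool:
        return self._enabled.get(name, False)


def reprice_with_gate(gates: Gates, legacy: Repricer, candidate: Repricer,
                      competitor_price: Decimal) -> Decimal:
    """Route to the new logic only while its gate is on."""
    if gates.is_enabled("new_repricer"):
        return candidate(competitor_price)
    # Gate off (or never set): fall back to the known-good path.
    return legacy(competitor_price)
```

Because the legacy path stays in place behind the gate, rolling back is a call to `disable` rather than a revert and redeploy.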

I just want to speak a little bit more to my current experience, where gating saves the day again. After four years with Informed and Appeagle, I left earlier this summer. I joined a company called BrandVerge. We are a platform for the discovery and planning of premium marketing programs. When I first came in, we were working with a contract development team based in India. The initial launch of the product was in April of this year. I came on in June. There was a planned deployment for this big V2 of the product for September of 2018. The spec and requirements that were in that kind of, and also, there was not a deployment for anything but bug fixes between April and September. All of the features that were being developed in that time were set to launch in September. That's five, six, pretty banner features that were all set to launch, not only set to launch at one time, but the team was based in India, so it was going to launch at like 3:00 AM New York time.

These were all built to spec and requirements that had been developed six to 12 months prior, before any customers were on the platform, before we really had a good understanding of the way that customers were going to be using the platform. One of the first things that I did when I got in there, unfortunately because the team was a contract team, they had their specs run out, they wanted to build everything exactly as they said that they were going to do it. You can't blame them, that's how that relationship was, but was able to make some modifications, shift some things around, to at least get gates around these things. Because some of these features were just not ready for primetime, and so really great benefit for us to be able to hold these things back to do that additional work that was needed to be done before actually getting them in front of customers. I don't know. I think the way around, and I can say that within the third party marketplaces, eBay is surprisingly like a developer's dream.

Their whole, the development documentation, the environments that they have for developers, these sandbox things, are totally fleshed out. It's just nobody wants to use eBay, so you don't get to take advantage of these things. I think what's happened is that, we always used to joke that within Amazon, their, this division is called Marketplace Web Services, MWS, that they're like in some deep dank basement in Seattle kind of scuttled away from everyone else. Because like they just, I think one of the things is that they know that sellers are going to be selling, that these marketplaces are going to exist around sellers to support them, that they don't really care that much about the APIs. I think unfortunately you kind of have to bear the consequences when things like this happen, because you don't have those sandboxes, but they just didn't really seem to give this much of a, much attention within the organization. Thankfully, I can say that, within the last year or two, I can, we noticed that that was starting to change.

We were getting calls from product managers over at MWS, reaching out to us, as kind of one of the biggest users of their APIs and saying, "What were we looking for out of it?" Didn't see too much kind of done in response to those things, but at least they were reaching out, so hopefully that's a step in the right direction. That I am not sure of. I think we're all too nervous to try and send anything over and see what happens, so I would hope so. At least, as of a couple of years ago, when we saw someone have the same exact thing happen to them, and I mentioned that it happened to a competitor, it actually happened to at least two other competitors in the same space that I know of, there may have been more. It definitely existed as a problem for at least a couple years after this. I don't know if it's since been corrected. I can't speak to, specifically to the actual error in the code, but basically within the software, you would define a minimum price for your item.

You'd say, "I don't ever want this specific piece of inventory to sell for less than $50." You could choose not to assign a minimum price to something, and those things just would be exempted from the repricing workflow. It wouldn't send prices for anything that you didn't set a threshold for. Except when something comes in and it fails the safety check, and all the things that basically had no minimum price, and I'm not clear on the specifics, if it was either a null or a zero on our system, but what ended up happening is, because it didn't have a value set, it got sent to Amazon as a zero, which was then interpreted by Amazon as a penny. In terms of what we could do differently, one of the things that we ended up running into as more of an issue was kind of the scale and load testing. I wish that there was some better way to do that.

Because the price changes that we were sending were on the order of tens of millions a day, and there was a lot of metadata that existed around each individual price change. When we were doing any sort of work around this, again that slow rollout, being able to kind of to see that as you're rolling stuff out, it was holding up-

Speaker 2: Were you able to do like load tests beforehand?

Marc Devens: That's the thing that when I was leaving the company, was they were really trying to focus on being able to do those simulated load tests. That was something that without having a sandbox environment, it's actually really, really difficult to do.

Speaker 2: Really hard, without having a snapshot of real query traffic and everything.

Marc Devens: Yeah. Unfortunately, it's still a small company. The kind of resources that you need to throw at something like that, to be able to do really that full lifecycle of simulated testing, it's, those are multiple teams that would be responsible for something like that. Great. Thanks, everyone.