Node.js on the Road: Ben Acker

Node.js on the Road is an event series aimed at sharing Node.js production user stories with the broader community. Watch for key learnings, benefits, and patterns around deploying Node.js.

SPEAKER:
Ben Acker, Senior Software Engineer
Walmart

I have so many disclaimers that I probably should give, but one of them is, there's a promise that I had made to a couple of people to use my NPR voice in one of my talks. So that's what I'm going to do to introduce myself I guess. I'm Ben Acker. Today's talk is going to be a short narrative with slight technical overtones.

Discussing the rise of Node in production at Walmart. And I'm Ben, and that's about all I can do. I apologize. I thank you for, thank you for bearing with me on it. Alright, so like I said, I work at Walmart, but I'm basically going to detail Node for the past two years over there. I'm not going to do too much technical stuff.

There are plenty of talks that go into deep dives on our technical stack, on our open source framework, hapi, the other modules that we provide. Lots, and lots, and lots. The only technical thing that I'm going to do is something that we haven't done yet. I haven't really seen too many people doing it. I cut and pasted a whole bunch of how we do our builds, and how we do our deploys, so I could show those, because people ask me a lot.

So I'm going to start. Also, I kind of drew everything for all the slides because slides are really boring. So in the long, long ago, Walmart, the small online retailer, decided that they wanted to make a mobile app, and they made one. They made one. It was huge, and it was built on Java technology that was old and dated even for the time that it was built. The services they were exposing, supply chain management stuff, were all old, and you end up getting this mobile app that you'd expect to get from Walmart six years ago.

Fast forward a little bit. Walmart sees that it's made a giant mistake, hires in some people, starts acquiring companies, creates some pretty rad native apps on iOS and on Android, creates an all-new mobile web thing, starts providing open source libraries, and doing other stuff, but all of this is still based on the services tier that was from the old thing, which is once again on SOAP, old SOAP services which... don't, y'all, don't, don't.

So here. OK yes, so what you have is you have this team that's made up of a whole bunch of these guys right? These folks, they're happy folks they've been doing development. When I was experimenting with doing color, most of the drawings are black and white because I like pencil, but this one's colored. You've got these guys, they're doing OK, they're now working on a services team for pseudo-popular apps that are seeing millions of people, so they're kind of excited, but now they're starting to get a little bit, they're starting to get a little upset because instead of just reporting to one, like having one team consuming their services, they've got lots of different teams consuming their services. iPhone, iPad, all different kinds of Android teams.

So they're starting to get anxious. And what's even worse is that they're still operating on all of these old, dated services that are really, y'all, they're bad. So it's difficult, and then you've got groups internally that are not responding to things, that are treating mobile as second class citizens, so you're not getting the support that you need to get actual real services in there. I know that some folks are experiencing that currently. And you have nights where there are support calls that happen basically all night.

Java team is completely strung out, and this is where Node comes in a couple of years ago. The first hire that they made for this services team, the old services team is still there. We're all together, but they started to bring in a Node team to create a services tier that would actually be able to service mobile, and provide the scale that we would need at Walmart and some nicer looking APIs, but they brought in Eran Hammer. They brought him over from Yahoo, and then like he brought in a couple of more people which were me, a guy named Wyatt, a guy named Ben, and we just started building.

We got to use Node, hooray! It was rad, Node's really fun to develop in. We got to work on open source software, also rad. And TJ was talking a lot about Eran, he's quite outspoken. So we would go to conferences, and then Eran would yell a lot about stuff, and then people would come up and talk to us because I think we might be a bit more approachable, I don't know.

So we've now got this small Node team that's at Walmart, and we're writing stuff, and we're creating our framework, and going forward with that. So we came up with a plan. I've got all my notes written on graph paper. It's like my little setlist. We came up with this plan on how to deploy, and this specific slide is a direct copy of one of my slides previously but I really, really, really love it.

So this is one of the technical portions. We've got the old services, right? So the old services layer is there, and those are all these SOAP services that are exposing the supply chain management, all that junk. And then you've got the Java services, have you all ever heard of a framework called Wicket?

So Wicket is sitting on top of that exposing all of these. And then the plan was just so that we could get Node into production, we're going to throw a full on proxy in front of everything. Get Node into production, so we can like getting stuff—getting hardware moving at Walmart is difficult. Getting stuff requisitioned is difficult.

One of the companies they bought was a cloud provider. Can we use those cloud provisioned machines to do deployment or anything you sold? No, no we cannot. Not for this. We can for other stuff, but like it takes a long time to get stuff through all the corporate security audits and all that crap. So get it in production as a reverse proxy.

The other thing that this provides is it gave us a chance to... like, there's millions of users for the apps, and for the old mobile web, for people that don't want to update, the proxy is still going to have the old services exposed, so no big deal. Any new development can be done in Node. What we also found out later is that, with sub-10-millisecond latency, we were able to tap in and get loads of different analytics stuff, so we basically created a backlog of things to say these services are now slow, and we can rewrite them in Node, and we can take our time with it because we can.
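
The proxy idea above can be sketched as a routing decision: anything already rewritten goes to Node, everything else passes through to the old services untouched. This is only an illustration under assumptions. The function name pick_backend and the example path prefixes are hypothetical and do not come from the talk.

```python
def pick_backend(path, node_prefixes):
    """Return "node" if the path has been migrated, otherwise "legacy".

    node_prefixes is a hypothetical list of URL prefixes that have
    already been rewritten in Node; everything else is proxied through
    to the old services unchanged.
    """
    for prefix in node_prefixes:
        # Match the prefix exactly, or as a full leading path segment,
        # so "/cart" covers "/cart/items" but not "/cartoon".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return "node"
    return "legacy"
```

With an empty prefix list every request falls through to the old services, which matches the starting point described in the talk: the proxy goes into production first, and services are moved over one at a time.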

Yeah, so that was the plan, and it continued to go forward here, so that got us to our build. We went through a few different operating systems trying to get here. I hope you all don't ever have to deploy on Solaris. That was one of the most painful things. At first we thought we were going to have to do that, and it wasn't good Solaris, it was a Solaris where... also, make sure you install packages as opposed to just copying directories over, because anytime touch would happen on Solaris it would bomb and we'd get a [xx]. Anyway, we went to Red Hat, and now everything there is running on SmartOS, but this is our build. This is it. We use Jenkins to manage stuff. TJ recommended it as a CI server for us.

[TJ] No, oh no.

That's a lie. So Jenkins is basically just managing a bunch of shell scripts for us. All of our deploy stuff is on shell scripts. We do an install, we test it.

Eran mandates that we have 100% test coverage on all of our stuff. So then once it's gone through test, we remove node_modules, do an npm install --production, shrinkwrap it, and tar it up, and that's all. I did remove a couple of things in there that are just part of the bash scripts for maintaining version numbers and all that crap, but this is seriously it, and then deploys, that's our deploy.
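
The build steps described above can be sketched roughly as follows. The actual scripts shown on the slide are bash and are not reproduced in this transcript, so the exact commands, the function name build_commands, and the tarball path are assumptions made for illustration.

```python
def build_commands(tarball="../app.tar.gz"):
    """Return the build steps, in the order described, as shell commands.

    The tarball path is a hypothetical placeholder. It defaults to the
    parent directory so the archive does not try to include itself.
    """
    return [
        "npm install",               # full install, including dev dependencies
        "npm test",                  # run the test suite
        "rm -rf node_modules",       # throw away the dev install
        "npm install --production",  # reinstall runtime dependencies only
        "npm shrinkwrap",            # lock the exact dependency versions
        f"tar -czf {tarball} .",     # package the result for deployment
    ]
```

A CI job could run these in order and stop at the first failure, so a failing test run never produces a tarball.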

We have a list of hosts that we end up SCPing this tarball to, and that's it. So going from a time where these old Java mobile deploys could sometimes take 12 to 16 hours to coordinate everything, deploy them to loads, and loads, and loads of different servers, and making sure that all of the versions and everything were correct, it dropped down. We have it in phases: we do one host, then check it to make sure it's cool, then do a big string of hosts, and then a third string of hosts. If we remove the human checkpoint from that, build to full deploy takes less than five minutes. Usually like two and a half, right?
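
The phased rollout described above (one host, a check, then two larger strings of hosts) might look something like this sketch. The talk does not say how the remaining hosts are divided, so the even split, the function names, and the destination path are all assumptions.

```python
def deploy_phases(hosts):
    """Split hosts into up to three phases: one canary, then two batches.

    The even split of the remaining hosts is an assumption; the talk
    only says there are three phases.
    """
    if not hosts:
        return []
    canary, rest = hosts[:1], hosts[1:]
    middle = (len(rest) + 1) // 2
    phases = [canary, rest[:middle], rest[middle:]]
    # Drop empty phases when there are too few hosts to fill all three.
    return [phase for phase in phases if phase]


def scp_commands(phase, tarball="app.tar.gz", dest="/tmp/"):
    """Build one scp command per host; tarball and dest are placeholders."""
    return [f"scp {tarball} {host}:{dest}" for host in phase]
```

Pausing for a human check after the first phase is what keeps a bad build on a single host. Running all three phases back to back gives the fully automated path mentioned in the talk.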

If it's done programmatically, the tarballs are pretty small, so SCPing those to N hosts is really, really easy to do. And there's more in that story in a second. Alright, so doing Node at Walmart has been rad, and with all the community involvement that we've had, there have been some pretty high profile instances of help that we received from Joyent and members of the community. One of them is this giant memory leak that we had, and I love that I got to draw this picture of TJ. He helped to solve this giant... well, it wasn't a giant memory leak. It was a problem that was initially giant because we switched from Node 0.8 to 0.10, from Red Hat to SmartOS, and a couple of other things all at the same time, and then eventually found out that there was a very, very small leak that TJ found. Some of the links at the beginning link to talks talking about this.

And this is also why Eran wrote a children's book and read it at NodeSummit. So other things. The community is so approachable, and easy to talk to, but also like folks are interested in what you're doing, so I was at one conference and Trevor Norris came up to me and he was talking to me about the changes that, he was like man I've got some ways that you can really speed up stuff at Walmart. Come on, you gotta check these out. It's all talking about the slab allocation and everything, so that's my picture of Trevor Norris as the Fonz like coming up, and helping me out.

All of this stuff... whoa, easy Tiger. How it's been going is pretty good. Our first major test of everything that was pretty public was last year, when we live tweeted everything that was going on for Black Friday. Black Friday is obviously a big day for Walmart. And the whole weekend, the whole holiday season is pretty crazy, but Black Friday specifically is absolutely nuts.

So, for Black Friday, by this time what we had was our mobile web, which had moved from being rendered in Java and served to being completely from Node, so all mobile web stuff was served from a hapi server. We had an analytics system written in Node. Glenn Block, where you at? Anyway, that's all sending stuff into Splunk, and then we've got all other native mobile stuff. Basically anything that goes to mobile.walmart.com, including all services for native devices, is going to go through Node, at least the Node proxy, and mostly Node servers. So on this day specifically, there were over 500 million visitors to Walmart.com. Over 50% of those went to mobile, and for each one of those, each individual service call is going to be multiple calls to the analytics system. So we live tweeted all of this stuff. We created a schedule... I kind of like having this microphone, by the way, usually it's like a little thing. It's just funner to talk into. Sorry.

We created a rotation schedule for holiday coverage, and the goal was to have somebody available at all times, because we were fairly confident that everything was going to go pretty well, but we wanted to be prepared. What ended up happening was, we were having so much fun with it that basically from Wednesday night before Thanksgiving, until we passed out at different times Friday morning, our whole team was on a Google Hangout that entire time. I took like a two hour break to go eat lunch or something on Thursday, but other than that, we were on the whole time and it was rad. It was really rad.

Did anybody look at the Node Black Friday stuff from last year? Can you all read that at all? OK. Like nothing happened. It was the best thing ever. The biggest CPU use was like 2% usage, that's kind of rad. But the biggest thing that was awesome was... and OK, spoilers, if people were following it and know the answers to these questions that I'm about to put up from Eran's Twitter feed…

What he was doing was he was showing all of the graphs for all of the usage, like RSS and CPU usage, he was showing all of them pretty much the whole time. And so on Black Friday he posted this. This is the RSS feeds, and so you can see they go pretty stable and then towards the end they start to drop off.

So he posted this asking if folks knew what was going on. Anybody have any guesses? Anyone? No, not dinner. Here it's continuing, here. Anyone? Does anybody remember what this was from watching it? OK I'll show the last one and then, what was it Bryce?

[Bryce] You deployed.

Yeah we deployed, Black Friday. On Black Friday, the highest traffic day at Walmart.com, and we deployed in the middle of it because we could. We had zero downtime, we lost no traffic or revenue, it was flipping magic. So this is one of my favorite Node.js reactions. So next, what happened was everything went really good. Everything went really good. So instead of just doing mobile, now what they've done is they've expanded us all the way across GEC, hopefully with the ultimate end goal being all Walmart traffic, all dot-com traffic going through Node.js. And so they combined all the services teams from mobile services, and I kept coming up with these awesome names, so I was like, 'We could be called The Away Team', and I could wear a red shirt and this could be so rad, and they're like, no, that's not going to happen. So I was like, all right, so we've got to do Hammer team, we have to do Hammer team because that's the coolest, and they were like, no. And I knew they'd done that because they'd come up with a rad, rad name that was going to blow all these away, and Eran's pretty good at coming up with names, so that's what they came up with. So that's what happened.

Anyway, I just put this—I really like this drawing—but I guess what I wanted to illustrate with this is that, like Node, Node has arrived. It's got—there are plenty of examples of its use for anything from a start up all the way up to heavy enterprise use. It has proven itself to be stable, it has proven itself to be fun to develop in, it's fast to develop in.

If there were any question about whether or not it's ready, then it's—I would definitely say that there really shouldn't be one. If at any point you're on the fence about it, you can go and ask folks. The Node.js community is one of the funnest things about programming in Node. Like the people are fun, they're easy to approach, and quite often like everybody is interested in what you're doing, and it's easy to get questions asked.

So if participation in the community, if all we can do is ask questions about it, that helps to grow the community, because it gives folks the chance to answer, so join us. All right, that's all I got, folks.