Saturday, December 11, 2010

We turned a sharp pivot this year at OtherInbox, shifting away from the product I initially built (now called Defender 1.0) to a newer, easier-to-adopt product called Organizer. I had a lot invested in that product; I cut my teeth in the startup world while building it. We worked hard to write the code and I even dressed up as a superhero at SXSW to promote it. I gave a lot of technical talks about the technologies we used, and I really loved the whole process, but the fact is we could not build a sustainable business with Defender.

I'm really proud of us for pivoting to Organizer and don't feel the least bit bad about it (though I am sorry for anyone inconvenienced by the need to upgrade to our Google-hosted alternative, called Defender 2.0; it really is a lot better for everyone in the long term).

Josh Baer gave a great explanation of how we reached this conclusion, which I think really captures the decision-making and guesswork of entrepreneurship:

We had to pivot and focus our attention on Organizer, even though Defender was our first love.

It took too long

Nobody wanted to hear this. I certainly didn't want to admit that the "great idea" we started with was not going to be successful. We probably should have begun this process six months earlier than we did but it was hard to swallow.

That put us in a situation where we had more than 500,000 users signed up for Organizer and less than 20,000 for Defender. Only about 1000 were paying $20/year. Yet Defender was about half of our code base and 80% of our customer support issues. It was hard to put many resources into improving Defender because 95% of our users were using Organizer. Yet everyone on the team believes in Defender and uses it ourselves. It hurt the morale of the entire team to see a product stagnate from lack of attention. We needed a way to combine our resources so that all of the effort we put into the product can benefit both Defender and Organizer users and so that we could deliver a more reliable service.

The good news is, Organizer is doing really well and we're making it better every day! Check out the full post; it's a great case study!

Tuesday, November 23, 2010

More details later, but just a quick note to thank all of the sponsors and volunteers and participants who helped make the Hackathon a huge success. Here are the winners and here are some cool photos. Stay tuned for video, too!

Tuesday, November 9, 2010

Recently friends have been offering to help us with Ignite Baltimore. It's a true labor of love that's completely powered by volunteers, donations, sponsors, and goodwill and I'm super psyched that people want to help.

However, it's hard for me to know how serious people are. Heather, Neal, and I have various responsibilities divided up between us but there are plenty of things that we don't have time to do or can't do as well as a dedicated person might. So, as an experiment, I'm putting this list out into the ether to see who might step forward to take up one or more of these projects!

Organizing an after party: It would be nice to have a really well-planned event immediately after each Ignite that we could funnel the audience towards. Ideally it would feature live music by a Baltimore performer. This event would probably need separate sponsorship.

Creating content for our website: I'd like us to post things like follow-up articles with Ignition Grant winners and past speakers, interviews with upcoming speakers, etc.

T-shirt printing and sales: we have a few custom-printed shirts that the organizers wear, and people often ask where they can buy the shirts. We could really use someone to step up and take charge of this, including selling the shirts at events. Ideally they could be sold online as well. All proceeds would benefit the Ignition Grants.

Manage Facebook presence: I'm in charge of this right now and just don't have the enthusiasm for Facebook that I do for Twitter. I'd love to see someone become the guru of our Facebook page and turbocharge it.

Manage LinkedIn presence: Ditto.

Manage media partnerships: We have great relationships with The Urbanite and MD Daily Record. I'd love someone to take this over and start recruiting a few new partnerships (like a radio station).

Print marketing: We don't do anything with postcards or flyers right now. I feel like we could diversify and renew the audience further by trying this for the next Ignite.

BmoreSmart: We're part of this group of social entrepreneurs. They have all kinds of interesting projects and opportunities. Helping them would be a great way of helping us!

Thursday, October 28, 2010

We are having a hell of a good time organizing the first Baltimore Hackathon. The idea is to assemble teams and individuals to build software and hardware projects from idea to prototype in one weekend: 11/19-11/21. Our motto is "Build. Meet People. Have fun."

We're awarding cash prizes to the winners and door prizes to all participants. We're gonna have opening and closing ceremonies, free food from Puffs N Pastries, and other surprises.

We're organizing it using the TechInBaltimore Google Group. I'm really amazed at how people are coming out of the woodwork to sponsor, volunteer, coordinate, etc. This is the most decentralized can-do project I've been involved with!

If you're in the MD/DC/VA/PA/DE area or beyond, I hope you'll consider coming to be a part of it! We're really hoping to draw in people to our tech community who don't participate in other more purely social events. This is all about making!

Friday, October 22, 2010

This week I had the chance to give my "Social Media for Everybody" talk to two different groups of people. I promised everyone who attended that I would post the links and slides to the talk here on my blog. Follow this link and you'll find everything.

Thanks to everyone for attending! I really enjoy helping people navigate this world as it's been so personally enriching to me and helpful to my projects.

Monday, October 4, 2010

I'm very grateful to have been nominated for a few nice awards this past year because of my work with Ignite Baltimore and other ventures. But there are others in Baltimore who are doing as much or more than I am to encourage Baltimore's development as a technology hub. Since these are annual awards, I'm hoping the organizers will check them out when it comes time for the next round of nominations. They've all got my vote!

John Trupiano and Yair Flicker: These guys own SmartLogic and have donated a considerable amount of time, money, and resources to the tech community. They have led the growth of the Bmore on Rails group from a mere monthly meetup for Ruby enthusiasts to a full-fledged networking and teaching society with weekly events for web programmers. They spearheaded the welcoming committee for June's RailsConf convention, and they funded the organization of a highly regarded free unconference called BohConf. On top of that, they were also principal organizers of TechCrawl East.

Mike Brenner: He's just getting started but by the time 2011 rolls around he will be a serious contender. Baltimore was lacking a specific group that focused on technology entrepreneurship and he has taken up the charge with Startup Baltimore. They've had several well-attended startup breakfast meetings followed by workshops. This is probably the most energetic long-range community organizing effort we've seen to date.

Nate Mook: He was the co-organizer of TEDxOilSpill and one of the main organizers of TEDxMidAtlantic. These events are some of the highest-profile, highest-quality happenings in town and have brought a lot of outside attention to the city.

Baltimore Node Organizers: These people have built a mecca of Maker culture in Station North. The key players that I know of were Adam Bachman, Jon Lesser, Mark Huson, Kelly Egan, and Matt Forr, but let me know in the comments if I left someone out.

Who did I miss? Let me know and I'll update the post!

Side note: Baltimore is an extremely friendly town that embraces new projects and new leadership. There are lots of other cool things that could be done to keep our momentum going, and you can expect that even a modest amount of effort on a meetup, hackathon, cofounder dating event, blog, podcast, or whatever will be recognized and celebrated! That's been my experience with Ignite and the Baltimore Improv Group.

Sunday, September 5, 2010

I'm fascinated by Peter Thiel's essay "The Optimistic Thought Experiment". It's about globalization's past and future and has many awesome things to ponder, but as a technologist and entrepreneur this paragraph grabbed me, emphasis mine:

Technology entrepreneurs and investors would do well to return to hard and important problems. As globalization proceeds apace, the decisive unsolved problem concerns the issue of security. There remains a tremendous need for real defense against the proliferation of destructive technologies — reaching well beyond the Orwellian “defense” industry, with its proclivity for constructing new contraptions that kill large numbers of people. Along with the New Economy and New Media, there should exist a valuable sector that could be described as New Defense — at least in any twenty-first century in which humanity does not blow itself up. The absence of such a sector serves as a subtle reminder of the complacent myopia of Silicon Valley venture capitalists investing in “technology.”

I don't know what this means for me personally yet but I'm pretty excited to think about it. I can't find the links anymore, but it reminds me of criticism that emerged from last year's TechCrunch 50 event, about how most of the 50 companies were not working on anything that would fundamentally improve the human condition.

Wednesday, August 11, 2010

Update: I receive occasional inquiries for cybersecurity career advice because of this post. I haven't worked in this field in years, so I recommend you read this advice if you're trying to get a cybersecurity job.

Cybersecurity, while offering lucrative job opportunities, might not be an ultimately rewarding career for Maryland technologists. I worked in this sector for about eight years as a military officer, government civilian, and government contractor in a variety of different roles, and here's what I want to say about it.

Maryland's business press, government officials, and various tech organizations have lately been enthusiastically banging the gong for cybersecurity. I can appreciate why - there's a lot of money at stake, and a lot of it comes from Maryland's foremost benefactor, the federal government. This is a recession-proof, guaranteed-to-grow industry, and Maryland is already home to many successful cybersecurity companies like Sourcefire. The government and private companies employ many thousands of people and contribute many millions of dollars to our tax base.

So it makes sense for our government to be pursuing these opportunities, but does it make sense for you, Maryland hacker? Here are some things to consider; these are obviously generalizations extrapolated from my experience. Feel free to leave comments if you feel this is a gross distortion.

Cyber defense is often the opposite of a creative activity; in many of these jobs you're going to find yourself acting as an enforcer, a mere gatekeeper. You'll be telling the creative people in your organization all the things they can't do or aren't allowed to have. Often you'll be restricting them not because of policy reasons but because it's too hard to figure out how to allow them to do what they want within the regime you are enforcing, or because it's just easier to say "no". (Naturally, this does not apply if you work for a company that builds the tools the enforcers use.)

In classified settings, you are severely restricted in the sources and kinds of technologies you use. You'll be leaving your smartphone and your iPad in your car or in a locker outside the SCIF. You won't have admin permissions on the machine you're working on. Forget installing Chrome with the latest extensions; you'll be lucky to get version 2 of Firefox! Or you might not have access to the Internet at all! Also, forget about telecommuting or riding your bike to work; your job will be in a well-defended federal facility or an anonymous office park in the suburbs.

Because cybersecurity is so tied to "the enterprise", you'll almost certainly be living in Microsoft land, which may or may not be a problem for you.

Many of the government organizations in this field are gigantic, top-down, and super-hierarchical. You will be made to turn as a soulless cog in a giant machine. There are plenty of smaller, more enlightened companies out there, of course, but the highest paying jobs will probably be offered by big contractors.

The federal government has crazy monopsony power over this sector. Besides the usual and expected bureaucratic games you'll endure, if you work for a private company that does much business with the government you are going to see some brutally depressing market distortions that arise from this monopsony. You may find yourself working on a product or a program that nobody in your client agency cares about, or wants to succeed, except that they need to spend up their budget dollars so Congress doesn't take the money away next year. Or you might find your job in limbo because the sales cycle for getting government contracts is so long, and it can take forever for the company to actually have money in hand. There's some truth to the myths about the Pentagon spending $10K on toilet seats - it probably does cost about $9950 in sales salaries to sell a $50 toilet seat to the Department of Defense!

I was well-paid as a cybersecurity analyst, and often I did enjoy the work, and parts of it involved amazingly cool, James-Bond-like exploits. But those are the reasons I ultimately chose to leave. Now I am working on my own startup. My job is less glamorous (I'm not "saving the world" every day) but because my individual contribution counts thousands of times more in a small company which I own a piece of, and because every second and every dollar counts, it's an infinitely more satisfying way to spend my time. My labors are simply more meaningful. So that's what I wanted you to know.

UPDATE 8/16/10: Please check out @NetSecGuy's post where he further elaborates on these issues.

POSTSCRIPT FOR MARYLAND GOVERNMENT AND BUSINESS LEADERS

I applaud you for positioning the state to take advantage of the "cyber doom boom". I'm sure it will help many of my fellow citizens in the short term. But I wonder how much wealth you think cybersecurity is ultimately going to create in Maryland, especially if it accrues to big consulting companies like Booz-Allen that aren't even based here. Also, what's going to happen when this sector matures, when Internet security gets better, and spending declines? Who's going to fill up those office parks and abandoned SCIFs?

I implore you not to neglect other parts of Maryland's Internet tech economy, because it's product companies like Advertising.com, BillMeLater, Millennial Media, Localist, Ipiqi, Common Curriculum, Figure53, Replyz, Deconstruct Media, and a bunch of others I can't think of right now that are building a new, sustainable post-industrial base in our state.

Thursday, June 3, 2010

I love Baltimore and I love Ruby, so like my fellow Bmoreonrails colleagues, I am super-psyched for RailsConf 2010 being held at the Baltimore Convention Center!

If you're new to our city, here's a short list of things to see and do, optimized for people staying near the convention center (i.e. this is not necessarily the list I would give you if you were staying with me, had ready access to a car, etc.).

SOUTH

Your best bet in this direction is the Federal Hill neighborhood which has a lot of activity, bars, and nice restaurants. During the day you can climb to the top of Federal Hill itself, and then visit the incredibly awesome American Visionary Art Museum: this is the ideal art museum for hackers, because it celebrates self-taught artists.

On Wednesday or Thursday evening I recommend checking out Illusions Magic Bar & Lounge. If you're around on Friday or Saturday night, there's an awesome magic show featuring an upside-down straitjacket escape. I guarantee there's nothing like this place back where you come from!

Paul Barry has organized a RailsConf group attendance package for the Yankees vs. Orioles game on Wednesday night. Camden Yards is a very nice ballpark, so get out there and enjoy it before we get conned into building another one a few years from now! You'll get free beer and hotdogs!

NORTH

On Monday, Bmoreonrails is having its monthly Pub Night at Pratt Street Ale House, right across from the convention center, a very fun place to hang out.

Maryland's signature dish is the crab cake, and one of the best versions is made a walk farther north at the Faidley's stand in Lexington Market, which is worth visiting to soak up all the vibrant activity of a city market.

If you have a car or can spring for a cab ride, definitely visit Mt. Vernon: a beautiful neighborhood with brownstone homes and great restaurants. The Brewer's Art bar was named "Best Bar in America" by Esquire, and they brew an excellent ale called Resurrection.

If you have a car, you may also want to check out the Hampden neighborhood, which tends to get a lot of attention from people writing articles like this one - John Waters recently described it as a mix of "hipster culture and redneck culture".

EAST

The Inner Harbor is our ubertouristy area, but it's very nice if you've never been there. Besides our great National Aquarium, you can catch a water taxi from there to the Fells Point neighborhood farther east, which has a ton of bars and restaurants and cool shops and coffee shops.

If you have access to a car or taxi, after stopping in Fells Point you might want to visit Canton, one of the city's technology hubs. The Beehive coworking facility is well worth a visit if you have time to kill before or after the conference.

Did you hear about this gritty, realistic, little-known but super amazing cop show on HBO a few years back? It was so much more than a cop show. It was called The Wire and it was a tremendous work of art, but also very entertaining. If you have heard of it you may want to visit some of the shooting locations which I have catalogued previously in "The Wire tour".

Monday, May 31, 2010

Introduction

At OtherInbox I recently built a QA system using the Cassandra datastore. I really like this technology and so far I would recommend it, but the learning curve for Rubyists is still pretty high. There are some good examples online (especially the canonical article by Evan Weaver) but nothing showing more intermediate, real-world usage. Hence, this article.

The system requires us to log millions of events per day. I could have built it using a traditional relational database like MySQL (which we use for the main application), but these factors led me to consider a NoSQL database:

We're only interested in large patterns in data, so we don't need 100% ACID assurance that every single write will succeed. The system would be useful to us even if it only caught 80% of the events.

Since we perform these actions millions of times per day, write speed is the prime consideration.

The QA reports are generated offline, once per day. We don't mind if reads happen more slowly, or if we need to do some extra programming to build reports because we can't use SQL.

The sheer volume of events made me less excited about punishing a MySQL table. We already do a lot of extra work, including sharding, to keep MySQL healthy while it performs OtherInbox's main functions.

I was curious to see how a schema-less datastore would change the way I solved programming problems.

I've been playing with the technology for only a few months so I'm sure I'll need to correct some parts of this article as I learn more - please comment if something is unclear or incorrect.

It's Sorta Like an Ordered Multi-Dimensional Hash

Rubyists can think of Cassandra like a hash of ordered hashes, or a hash of ordered hashes of ordered hashes, requiring up-front planning to use. You don't have to specify your schema, but you do need to tell Cassandra how your keys and columns will be organized. That affects how the data is stored on disk and how you'll read the data later.

Since the columns are stored in sorted order, Cassandra can answer queries very quickly (which is why it's in use at sites like Facebook and Digg). I had to change the way I built keys and column names several times before I got it right. Anytime you change how the data is stored on disk you need to restart Cassandra.

Columns and ColumnFamilies

ColumnFamilies store a set of columns (which you can think of as key-value pairs) partitioned by a row key. The column names can be arbitrary strings, long integers, or UUIDs; at start time you have to tell Cassandra how to sort the column names but beyond that you have complete freedom to create column names that will be useful to you.

If each row has the same data, you might think of it like this:

{ user_id => {'email'=>'sarah@example.com', 'last_name'=>'Jones' }}

Where user_id is the row key (which Cassandra hashes and uses to determine which nodes should store the columns for this piece of data), and 'email' and 'last_name' are column names. Using the gem your code would look like:
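A minimal sketch of that shape, using a plain in-memory Ruby class as a stand-in for a live cluster so that it runs anywhere. The class name FakeColumnFamily and the row key 'user-5' are made up for illustration and are not part of the gem; the gem's own insert and get calls follow the same idea but also take the column family name.

```ruby
# In-memory stand-in for one ColumnFamily (an assumption for illustration,
# not the gem's real client). Rows are keyed by row key; each row is a hash
# of column name => value.
class FakeColumnFamily
  def initialize
    @rows = {}
  end

  # Merge new columns into the row, like writing several columns at once.
  def insert(row_key, columns)
    (@rows[row_key] ||= {}).merge!(columns)
  end

  # Return the row's columns sorted by column name, which is the order
  # Cassandra keeps them in.
  def get(row_key)
    @rows.fetch(row_key, {}).sort.to_h
  end
end

USERS = FakeColumnFamily.new
USERS.insert('user-5', 'last_name' => 'Jones', 'email' => 'sarah@example.com')
```

Reading the row back with USERS.get('user-5') returns the columns ordered by name, regardless of the order they were written in.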

But you can also store useful data in the column names. This is useful when there are many columns and you want to be able to select a particular range of columns. For the QA system, we page through a large range of columns within each key, and assigning smart column names helps this go faster. The data might look like this:

In this case we are storing messages for a particular user, and we're using unique identifiers for columns that we can query later in ranges. If your data has a temporal component you might use time-based UUIDs (where the most significant bits are a timestamp and the less significant bits are entropy) so that you query only columns that fall within a particular range of times.
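To make the range idea concrete, here is a rough sketch in plain Ruby (no cluster needed). The column names are a sortable timestamp plus a random suffix, which only approximates a real time-based UUID, and the helper names are hypothetical.

```ruby
require 'securerandom'

# Build a column name that sorts by time: a UTC timestamp followed by random
# hex. This only imitates a time-based UUID; the format is an assumption
# chosen so that plain string comparison matches time order.
def time_column_name(time)
  "#{time.utc.strftime('%Y-%m-%dT%H:%M:%S')}-#{SecureRandom.hex(4)}"
end

# Select the columns whose names fall in [start, finish), in sorted order.
def column_slice(row, start, finish)
  row.select { |name, _| name >= start && name < finish }.sort.to_h
end

# One message every 20 minutes, from 09:00 through 10:40 UTC.
def sample_messages
  base = Time.utc(2010, 5, 31, 9, 0, 0)
  row = {}
  6.times { |i| row[time_column_name(base + i * 1200)] = "message #{i}" }
  row
end
```

Calling column_slice(sample_messages, '2010-05-31T10:00:00', '2010-05-31T11:00:00') picks out only the messages stored during the 10 o'clock hour, without touching the rest of the row.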

You do need to tell Cassandra how your column names should be sorted on disk, which happens in the configuration file for each ColumnFamily:

'details' and 'preferences' are super columns containing columns 'email', 'last_name', and 'expert_controls'. Just as with regular column families, you can encode arbitrary data in the column names, or just set them to UUIDs. When you define a SuperColumnFamily, you tell Cassandra how to sort and store the column names and the subcolumn names:

One key consideration: as of this writing the current version of Cassandra (0.6.2) does not do any indexing of the subcolumns, which means when you load a supercolumn, all of its subcolumns are loaded into memory. If you expect to have more than a few thousand subcolumns, you would be better off using a regular column family, and overloading the row keys and column names with your nested data. In our example, your column names could be something like "sarah@example.com/Jones/true", and it would be up to you to split the data on retrieval.
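Packing and splitting such a composite name is only a few lines of Ruby. This is a sketch with made-up helper names; it assumes that no field can itself contain the "/" separator.

```ruby
# Pack several values into one column name and unpack them on read.
# The '/' separator follows the example in the text; it only works if no
# field can itself contain '/'.
COMPOSITE_SEPARATOR = '/'

def pack_column_name(email, last_name, expert_controls)
  [email, last_name, expert_controls].join(COMPOSITE_SEPARATOR)
end

def unpack_column_name(name)
  email, last_name, expert_controls = name.split(COMPOSITE_SEPARATOR, 3)
  {
    'email' => email,
    'last_name' => last_name,
    'expert_controls' => expert_controls == 'true' # stored as text, read back as a boolean
  }
end
```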

Key Names vs. Column Names

For the QA system, everything we keep track of is associated with a timestamp. The most natural partitioning of the data seems to be by day, hour, and whether we synced the message or not. The reporting system runs once per day, iterating over each hour and each synced state for the previous day. This gives us rows that are small enough for Cassandra to easily distribute across nodes without loading up too many columns in any one row. Our keys look like this:

key = "#{time.strftime("%Y-%m-%d-%H")}*#{is_synced}"

Since I only need to track 4 or 5 properties about each sync/nosync decision, I decided to use supercolumns. Actually, I first used the composite column approach described above, but I found supercolumns made for better-looking, slightly more efficient code. The columns look like this:

Each supercolumn name is a composite of the domain name of the message we examined and an MD5 hash of the message header. This ensures we don't store duplicate records if the same message gets processed twice. It also means I can drill down on specific senders in the future if needed by using range queries with partial column names, as shown next.
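The naming scheme can be sketched like this. The "*" separator, the helper names, and the property hash are assumptions for illustration, and a plain Ruby hash stands in for the row.

```ruby
require 'digest/md5'

# Composite supercolumn name: sender domain, a separator, then the MD5 hex
# digest of the raw message header. The '*' separator is an assumption.
def supercolumn_name(domain, raw_header)
  "#{domain}*#{Digest::MD5.hexdigest(raw_header)}"
end

# Record one decision in a row (a plain hash here). Processing the same
# message again overwrites the same supercolumn instead of adding a duplicate.
def record_decision(row, domain, raw_header, properties)
  row[supercolumn_name(domain, raw_header)] = properties
  row
end
```

Because the name is derived entirely from the message itself, writing the same message twice is idempotent, and every record for one sender shares a common name prefix.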

Range Queries

I don't know the optimal number of columns that Cassandra can serve up in one request, but in our system one row (meaning one hour's worth of sync and nonsync events) could comprise tens of thousands of columns, more than we would want to request at once. But since the columns are stored in sorted order, it's easy to fetch them with a range query. Here is a super simplified version of what we do:

I divide up the previous day into 48 keys (synced events for each hour and nonsynced events for each hour).

I then thread these requests, 24 at a time. According to the docs, "a good rule of thumb is 2 concurrent reads per processor core", so 3 machines times 4 cores times 2 reads per core = 24 concurrent reads. Each node has its ConcurrentReads property set to 8. I may not be doing the math correctly so feel free to chime in with a correcting comment.
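A rough sketch of those two steps plus the paging, with an in-memory hash standing in for the cluster. This is not our production code: STORE, the method names, and the default page size are all assumptions for illustration.

```ruby
# STORE plays the role of the cluster: row key => { column name => value }.
STORE = {}

# 48 row keys for one day: 24 hours x synced/not synced, in the same
# "YYYY-MM-DD-HH*is_synced" format as the key shown earlier.
def row_keys_for_day(year, month, day)
  (0..23).flat_map do |hour|
    stamp = Time.utc(year, month, day, hour).strftime('%Y-%m-%d-%H')
    [true, false].map { |is_synced| "#{stamp}*#{is_synced}" }
  end
end

# Up to `count` columns of a row whose names are >= start, in sorted order.
def fetch_page(row_key, start, count)
  STORE.fetch(row_key, {}).sort.select { |name, _| name >= start }.first(count)
end

# Page through one row. Each later page starts at the last column already
# seen, so the first column of that page is a repeat and gets dropped.
def read_row(row_key, page_size = 100)
  raise ArgumentError, 'page_size must be at least 2' if page_size < 2
  columns = []
  start = ''
  loop do
    page = fetch_page(row_key, start, page_size)
    page.shift unless start.empty?
    break if page.empty?
    columns.concat(page)
    start = page.last.first
  end
  columns
end

# Read every row, at most `concurrency` threads at a time, and return
# row key => array of [column name, value] pairs.
def read_day(keys, concurrency = 24, page_size = 100)
  keys.each_slice(concurrency).flat_map do |slice|
    threads = slice.map { |key| Thread.new { [key, read_row(key, page_size)] } }
    threads.map(&:value) # waits for the whole slice before starting the next
  end.to_h
end
```

With 48 keys and a concurrency of 24, read_day runs two waves of threads. Waiting for a whole slice to finish is the simplest way to cap concurrency; a worker pool would keep all 24 slots busy but takes more code.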

This code uses a range query to page through all the columns within one row. Since the columns are sorted as UTF8Type, I can just increment the key and know that I'll get the next chunk of columns. You can also query with partial range keys, so that if I wanted to see all of the data for a domain, I could range query with the column start as "example.com*".
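The partial-name lookup works because sorted names that share a prefix sit next to each other. Here is a sketch in plain Ruby, with a hash standing in for one row and a made-up method name; it shows the bounds of the range, not the client call itself.

```ruby
# Columns of one row whose names start with `prefix`, in sorted order.
# With sorted names this is a range: from the prefix itself up to (but not
# including) the prefix with its last character bumped by one.
def columns_with_prefix(row, prefix)
  raise ArgumentError, 'prefix must not be empty' if prefix.empty?
  finish = prefix[0...-1] + (prefix[-1].ord + 1).chr(Encoding::UTF_8)
  row.select { |name, _| name >= prefix && name < finish }.sort.to_h
end
```

For the prefix "example.com*" the upper bound becomes "example.com+", so every column for that domain falls inside the range and columns for other domains fall outside it.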

I also have some pretty complicated code that collates the results of those 48 threads, which are themselves composites of all the range queries I ran within each row. I realized after writing it that I had essentially re-implemented my own half-assed map-reduce.
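Stripped down to its skeleton, that kind of collation is a per-row tally followed by a merge. This sketch is far simpler than the real thing; it assumes column names of the form "domain*digest" and uses hypothetical method names.

```ruby
# "Map" step for one row: count columns per sender domain, taking the domain
# from the part of the column name before the '*' separator (an assumption).
def tally_row(column_names)
  column_names.each_with_object(Hash.new(0)) do |name, counts|
    counts[name.split('*', 2).first] += 1
  end
end

# "Reduce" step: merge the per-row tallies into one report by summing the
# counts for each domain.
def collate(tallies)
  tallies.reduce({}) do |report, tally|
    report.merge(tally) { |_domain, a, b| a + b }
  end
end
```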

Happily, while I was implementing this code, Cassandra 0.6 came out, which includes built-in support for Hadoop. Cassandra has a Pig load function, so it should eventually be possible to replace the above code and my half-assed map-reduce with something much more elegant, maybe just a few lines of Pig. For now, this works great. Of course you don't need any of this if you aren't using your datastore for reporting.

Notes on EC2

I don't have enough experience yet to recommend whether you should use EC2 or not. I originally built this to use one xlarge instance, but I found that it could not keep up with the network load. There were a lot of timeouts from the nodes reporting to Cassandra. As soon as I split it into three smaller Cassandra nodes, the timeouts went way down. It might even make more sense to split into six small nodes.

Each node has two EBS volumes, a smaller one for the commit log and a large one for the data. The commit log is append-only and is used to replay writes in case Cassandra crashes before the data in memory can be written to the data disk. Keeping them separate improves throughput so one operation doesn't block the other. It might make more sense to use an ephemeral store for the commitlog; I haven't had time to explore.

I definitely recommend following the recommendation in the documentation: use at least three nodes in production.

Notes on Adding Nodes

Adding each node was easy, and that's one of Cassandra's key features. All you have to do is tell the new node the address of at least one other node and set its AutoBootstrap value to true.

The only problem I had was that the first server was getting hammered so hard by all these requests, many of which were timing out, that it took a while for the second node to complete the gossiping with the first node to start bootstrapping.

CassandraObject

Michael Koziarski wrote a cool ActiveModel interface to Cassandra called CassandraObject. I haven't played with it yet, but it offers a higher-level abstraction for accessing data beyond just "a hash of hashes". He's presenting it at RailsConf this year and I'll definitely be in attendance for that talk.

Incidentally, Stack Overflow is starting to become the site I go to first when I'm searching for the solution to a technical question. Google searches return at least 50% garbage or duplicate mailing list content for a lot of technical topics. That makes me more interested in the future of niche/vertical search engines. It is definitely possible to out-Google Google within a niche.

Wednesday, May 26, 2010

Last night I gave the Social Media for Everybody talk at SKYLOFTS, an amazingly nice space in Highlandtown. As promised, below are the slides and the links I discussed. It was definitely a firehose of information! Thanks to all who attended.

Thursday, May 13, 2010

I'll be giving a super personal, super passionate talk about social media at the SKYLOFT artistic community in Highlandtown on May 25th. I was invited by two interesting new veterans' groups that have formed in Baltimore: the Veteran Artist Program, of which I am a board member, and The Sixth Branch.

These guys have big, creative plans, so my talk will be about how to get the word out while being a responsible community member (i.e. this is not about marketing per se). What I plan to talk about should be useful to people from all walks of life (entrepreneurs, artists, employees, nonprofit types, etc).

Monday, April 5, 2010

I've been experimenting with the Cassandra datastore lately and came across an interesting interview with Ryan King discussing Twitter's use of Cassandra. I thought this tidbit was worth sharing even if you don't care about Cassandra:

A philosophical note here — our process for rolling out new major infrastructure can be summed up as “integrate first, then iterate”. We try to get new systems integrated into the application code base as early in their development as possible (but likely only activated for a small number of people). This allows us to iterate on many fronts in parallel: design, engineering, operations, etc.

He also describes in detail how they used that philosophy to migrate Twitter status updates (the toughest part of their app to scale) to Cassandra. This also reminded me of Flickr's Feature Flags ... all of which can be used to support continuous deployment, something I've become very interested in after encountering it in the lean startup literature.

I'm planning to write up my findings on using Cassandra with Ruby sometime soon if there's interest.

She's the CEO of an important mobile computing technology company here in town, mp3Car.com, which Gus has written about before. The company combines several different facets in a compelling way: it's a popular forum, a crowd-sourced design company, a consultancy with Fortune 500 customers, and a niche ecommerce store. Their office in the Emerging Technology Center includes a small warehouse of electronic parts, making it one of those rare web businesses that has actual inventory and real-world relevance. They do it all with a small staff led by Heather.

Thursday, February 25, 2010

As a Baltimore-based Rails hacker, someone who loves Baltimore and who loves Ruby on Rails, I was pretty excited to hear about RailsConf coming to town on June 7th. The Rails community here in town is determined to make a great impression on our fellow programmers (especially given the negative reactions by some people to the choice of our city). Along those lines I'm pleased to announce Ignite RailsConf taking place on 6/6/10 at 6 pm at the Sheraton Inner Harbor.

Wednesday, February 17, 2010

Bob Potter just showed me an awesome way to use Ruby's Queue class to communicate between two threads. I try to avoid multi-threaded programming as much as possible since I don't feel super confident with concurrency (at least with a language like Ruby). It can add a lot of complexity and headaches even when you know what you're doing, but there are several cases in OtherInbox where it makes a lot of sense.

In this case, I needed to log a very frequently-occurring event to SimpleDB. We don't need these logs to be 100% accurate, and the data gets processed offline for reports, so there's no need to keep the data in our main relational database. My first naive version of this code pushed a new item to SimpleDB each time the event occurred, but this ended up chewing up a lot of memory. To avoid slowing down the main process, I had been handing these one-off SimpleDB calls to an EventMachine deferred block, which caused references to all the objects I was using to generate the SimpleDB items to stick around for too long.

That's when Bob suggested I use a Queue to batch the work into chunks of 25 (the SimpleDB BatchPutAttributes limit) on a separate standard Ruby thread. The resulting code looks like this (simplified for readability):

Once more than 24 items are in the queue, the thread calls code that we wrote to handle batch puts, then clears the array and waits for more.
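A minimal sketch of that pattern using Ruby's built-in Queue. The names start_batcher and :stop are hypothetical, and the block stands in for our own batch-put code, which is not shown here.

```ruby
require 'thread' # Queue ships with Ruby; on recent versions this require is a no-op

BATCH_SIZE = 25 # the SimpleDB batch limit mentioned above

# Start a worker thread that drains `queue` into batches of BATCH_SIZE and
# hands each full batch to the block. Pushing :stop flushes whatever is left
# and ends the thread.
def start_batcher(queue, &put_batch)
  Thread.new do
    batch = []
    loop do
      item = queue.pop # blocks until the main thread pushes something
      break if item == :stop
      batch << item
      if batch.size >= BATCH_SIZE
        put_batch.call(batch.dup) # hand over a copy so clearing is safe
        batch.clear
      end
    end
    put_batch.call(batch.dup) unless batch.empty?
  end
end
```

The main process only ever does a cheap queue push (queue << item), so it never waits on the web service. The :stop sentinel is an addition for clean shutdown; without it, up to 24 items could be left unsent when the process exits.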

We use an inter-process/inter-machine version of this paradigm all the time at OtherInbox: we make heavy use of SQS queues to hand off work from one box to another, but that wouldn't have helped here because then we'd just be introducing yet another web service dependency. So if you're in a situation where you need to keep a high volume process running quickly while handing off less important work in chunks, check out Queues!

Monday, January 18, 2010

I've refined the talk a few times since then but it's still a good representation of my current state of mind about what I've learned about Ruby and about software design while building OtherInbox. Check it out if you'd like to see more! The slides are posted here.