New York is pushing hard in the tech start-up world right now. Bloomberg’s WeAreMadeInNY campaign received a lot of press attention last week, and hopefully this continuing support from the city, alongside the success stories published with chiming regularity, will encourage the best grads and experienced pros in our industry to flock here. But it doesn’t appear to be happening just yet. Not for us, anyway.

So, who are we — a small-scale start-up — competing against for talent?

One idea is that we are pitted against big corporate. We see a lot of resumes filled with banking development experience. Multiple jobs lasting 18 months to 2 years. Boredom seems to set in and candidates go searching for the next chapter in their career, invariably another bank. The trend goes on…

I get the impression that these candidates are asking themselves the wrong questions. It has been mentioned to me several times that going to a large company aids career progression in a way that can’t be achieved by going to a start-up. Well, what kind of career progression? Better salary? Maybe. Better title? Possibly. More responsibility? I don’t think so! I understand different types thrive in different environments, but speaking as a computer scientist working in software, I’d encourage anyone to read James Currier’s open letter to his computer science alma mater at Princeton. And just in case you don’t, below is a short quote that captures the essence of Currier’s message.

“Big companies teach you how to work through layers of bureaucracy and how to solve problems in very risk-averse ways — in short, how to make something happen in their organization. A big company is not the safe career choice. It’s the risky choice. It risks your mind and your life.”

Here at AetherWorks, we have a very talented group who are all involved in the work and decisions that push the company forward. As a result, we all have a collective responsibility to learn, to achieve, and to progress. We are a team. There are opportunities that present themselves here that I believe would represent valuable experience to anyone at any stage of life, let alone new grads. Sit in on design meetings, take part in product planning, generate new sales strategies for our latest products, pick a topic for our weekly research meetings. We have an environment in which anyone can come forward. We all have our areas of specialty, but we also have the opportunity to learn and assist across the breadth of our business. The best bit? If we do something fantastic, then we’re all going to reap the benefits.

Great, right?

We do need people to take the initial leap of faith though. Our latest recruitment drive has been very challenging and it’s made me aware of a few areas we need to work on to change perceptions.

The resumes pile up.

One notable disadvantage we have, versus large corps and more established start-ups, is that we are less embedded in the systems of recruitment that can help attract talent. AetherWorks has had little time to integrate into the NYC community, we do not have a live product just yet (but we will very soon), we have had limited (but very positive) press, we have had little opportunity to create useful internships that may lead to permanent roles, and maybe more than anything, our alumni network connects us to the UK rather than the US.

We continue, however, to strive to address all of these issues. We are using multiple sources to advertise and post our openings, and we continue to raise our profile in the city by attending meet-ups and conferences, writing our blog, and distributing press releases. On top of that, Fay (Director of Business Development) has been sharing her experiences in schools around the northeast, and Gus (Director of Research) is planning on opening up our research sessions to the public once a month (for more info email us).

It’s a long and competitive game, but hopefully we will see some good results soon. Of course if you happen to be searching for a software engineering role, you know what to do!

Modern networked applications are generally developed under the assumption of ubiquitous, high-availability networks to afford communication between computing devices. This assumption is based on the tenets that all nodes in the network are addressable at all times, and all nodes in the network are contactable at all times. But what if we consider a network environment where not all present devices are contactable? What can we learn from building software that operates in a low-availability environment?

The word “networking” can refer both to a set of communicating computing devices (a TCP/IP computer network) and to a set of people meeting to build inter-personal connections (a NYC start-up networking event, for example).

If we frame these two concepts together, we can gain understanding of how to build applications tolerant of low-availability environments.

We should consider the properties of each of these networking concepts. What makes them similar? What makes them different? In the typical office environment, desktop machines are always plugged in to the office backbone network, which is always present and always has access to the internet using the office’s connection. Regular computing networks are designed to afford communication in such an environment. IP/Ethernet addresses of network members are assumed to resolve to a machine, and we assume that machines will not change Ethernet addresses (generally they do not change IP addresses either). All functional machines, therefore, are assumed to be addressable and contactable, at all times.

This is clearly quite a simplistic example. We can, however, contrast it with the human networking example. In an NYC start-up event, we consider the communication network to be human discussion. When the attendees are mingling over drinks, they are moving in and out of audible range of one another. Not everyone present at the event is immediately contactable by every other attendee, despite being in the same room (network), because not all attendees are part of the same group conversation. All attendees are addressable, however, because everyone is wearing their name badge!

I like to think conceptually that these types of networking lie on opposite ends of a spectrum. On one end we have a “solid” networking state, where computers do not change location, routing paths are assumed to be static (or effectively static) and all connected machines are assumed to be addressable and contactable by all other machines on the network.

At the other end of the spectrum we have a sort of “gaseous” network where meet-up attendees are in small, disparate networks. Members are available to communicate locally, but are aware of all other attendees who, while they cannot be communicated with at present, are addressable and assumed to be contactable at some point in the future (as attendees mingle).[i]

Most of my work in academia focused on routing protocols designed for these low connectivity environments similar to the human tech meet-up. In these types of networks, a node may be aware of the address of the device that it is attempting to contact, but there may be no routing path for communication. In this case, a full path will never be available, and therefore, the node will never successfully communicate with the intended destination. Nodes must therefore pass messages through intermediate nodes that store and forward messages on each other’s behalf. These types of networks are called opportunistic networks, as nodes may pass messages opportunistically whenever the opportunity to do so arises.[ii]

A good example of an opportunistic network would be an SMS network of everyone in NYC, where messages could only be passed between phones using Bluetooth, with messages forwarded on each other’s behalf as soon as they came into range. By exploiting six degrees of separation, messages could travel throughout the city, producing an approximation of a free SMS delivery network, albeit a rather slow one!

Manhattan provides a great location for an opportunistic network due to its high population density and small geographic area.
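As a rough illustration of the store-and-forward behavior described above, here is a toy Python sketch of epidemic-style opportunistic forwarding. The class and method names are hypothetical, and real opportunistic routing protocols add buffer limits, duplicate suppression, and smarter forwarding decisions:

```python
class Node:
    """A device that stores messages and forwards them opportunistically."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.buffer = {}     # message_id -> (destination_id, payload)
        self.delivered = {}  # messages that reached this node as destination

    def send(self, message_id, destination_id, payload):
        # Store the message locally until we meet another node.
        self.buffer[message_id] = (destination_id, payload)

    def meet(self, other):
        """Called when two nodes come into range: each copies every
        buffered message to the other (epidemic routing)."""
        for message_id, (dest, payload) in list(self.buffer.items()):
            other._receive(message_id, dest, payload)
        for message_id, (dest, payload) in list(other.buffer.items()):
            self._receive(message_id, dest, payload)

    def _receive(self, message_id, dest, payload):
        if dest == self.node_id:
            self.delivered[message_id] = payload  # reached its destination
        elif message_id not in self.buffer:
            self.buffer[message_id] = (dest, payload)  # carry it onward

# A message from node A reaches node C only via intermediary B:
a, b, c = Node("A"), Node("B"), Node("C")
a.send("m1", "C", "hello")
a.meet(b)  # A and B come into range; B now carries the message
b.meet(c)  # later, B meets C; the message is delivered
```

Note that A and C never communicate directly: B stores the message while out of range of C and forwards it when the opportunity arises.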

It’s not hard for me to see the link between this and my new job at AetherWorks, where we develop distributed application software. Consider AetherStore, our distributed peer-to-peer storage platform that presents the free disk space of multiple machines as a single address space. Data can be written to any subset of machines, and the data and backups are automatically synced and managed by the nodes themselves. Like most modern software, it is designed to operate in a heterogeneous network environment.

AetherStore uses a decentralized network where nodes may join and leave at any time, so it operates in a problem space at the convergence of well-connected TCP/IP networks and a disconnection-prone environment. Consider a customer using AetherStore to sync files between two desktop machines and a laptop at their office. They may wish to take their laptop out of the office and work on the stored files in a park. If someone else in the office decides to modify “the big presentation.ppt” simultaneously with our user in the park, when the two devices sync there will undoubtedly be conflicts.

This synchronization may seem like a trivial problem, but it is not. Time stamps cannot be trusted. How do you know which machine has the correct time? Furthermore, how do you know how to construct the differences between the files? One way to quickly determine whether conflicts are present is to construct a tree of changes to files, similar to the approach of modern distributed version control software (e.g. Git, Mercurial). These change-trees are then compared when machines can once again communicate. In our example of the laptop taken to the park, we can immediately draw some parallels to our disconnected network environment. The two nodes in the office and the laptop in the park are part of the same AetherStore network, and yet not all nodes are contactable or addressable by all other nodes.
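This DVCS-style comparison can be sketched in Python. The sketch below is a deliberate simplification — it reduces each machine's change-tree to a linear history of change identifiers, whereas a real system tracks branching trees — but it shows how two nodes can detect divergence from a common ancestor without trusting timestamps:

```python
def detect_conflict(history_a, history_b):
    """Compare two change histories (oldest first) once machines reconnect.

    Returns one of:
      'equal'          - identical histories
      'fast-forward-a' - a is strictly ahead; b can simply adopt a's changes
      'fast-forward-b' - b is strictly ahead
      'conflict'       - both diverged from the common ancestor
    """
    # Find the length of the shared prefix (the common ancestor).
    common = 0
    for x, y in zip(history_a, history_b):
        if x != y:
            break
        common += 1

    a_ahead = len(history_a) > common
    b_ahead = len(history_b) > common
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "fast-forward-a"
    if b_ahead:
        return "fast-forward-b"
    return "equal"

# Both machines edited the file after change "v2": a true conflict.
office = ["v1", "v2", "office-edit"]
park = ["v1", "v2", "park-edit"]
print(detect_conflict(office, park))  # -> conflict
```

If only one side has new changes, the other can fast-forward without any conflict-resolution logic at all — the hard case is the genuine divergence.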

By building a system that can handle the difficulty of the disconnected environment, a state that may not occur often but must be accounted for, we can necessarily cope with the well-connected, high availability network environment.

I present no answers here. I will, however, leave you with a few questions:

Should nodes queue change-trees to be exchanged between devices when they meet? How big should we set this limit? Can we discard old change sets?

Should certain machines always defer to other machines’ versions of files when conflicts occur?

Can we develop a system in which timings of changes can be trusted?

Should we take human factors into account? Should we consider employee hierarchy? Is it a good idea to always accept your boss’s changes?

Is there a “plasma” state for networks somewhere past “gaseous” on my spectrum? I may not know who is going to turn up (non-addressable), or for how long they will be part of the network (address lifetime).

Should we allow users and addresses to be separated?

Perhaps we could predict addresses that might be usable in future on such a network?

[i] Somewhere in between solid and gaseous networks we have ‘liquid’ state Mobile Ad hoc NETworks (MANETs). In a MANET the routing paths between nodes may change frequently, but all nodes are addressable and contactable at any given point. http://en.wikipedia.org/wiki/Mobile_ad_hoc_network

[ii] Note the distinction between opportunistic networks and Delay Tolerant Networks. A DTN may assume some periodicity to connections, which gives rise to a different set of routing algorithms. http://www.dtnrg.org/

People don’t like slow systems. Users don’t want to wait to read or write to their files, but they still need guarantees that their data is backed up and available. This is a fundamental trade-off in any distributed system.

For a storage system like AetherStore it is especially true, so we attach particular importance to speed. More formally, we have the following goals:

1. Perform writes as fast as your local disk will allow.

2. Perform reads as fast as your local disk will allow.

3. Ensure there are at least N backups of every file.

With these goals the most obvious solution is to back up all data to every machine so that reads and writes are always local. This, however, effectively limits our storage capacity to the size of the smallest machine (which is unacceptable), so we’ll add another goal.

4. N can be less than the total number of machines.

Here we have a problem. If we don’t have data stored locally, how can we achieve reads and writes as fast as a local drive? If we store data on every machine how do we allow N to be less than that? With AetherStore, goal 4 is a requirement, so we have to achieve this while attempting to meet goals 1 & 2 as much as possible.

This proves to be relatively trivial for writes, but harder for reads.

Write Fast

When you write to the AetherStore drive your file is first written to the local disk, at which point we indicate successful completion of the write — the operating system returns control to the writing application. This allows for the write speed of the local disk.

AetherStore then asynchronously backs up the file to a number of other machines in the network, providing the redundancy of networked storage.

It is, therefore, possible for users to write conflicting updates to files, a topic touched on in a previous post.
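The write path above — return as soon as the local write completes, replicate in the background — can be sketched in Python. This is a simplified illustration with invented names (`StoreWriter`, plain directory copies standing in for networked replicas), not AetherStore's actual implementation:

```python
import os
import queue
import shutil
import threading

class StoreWriter:
    """Write locally first, then replicate asynchronously (a sketch)."""

    def __init__(self, local_dir, replica_dirs):
        self.local_dir = local_dir
        self.replica_dirs = replica_dirs  # stand-ins for remote machines
        self.pending = queue.Queue()
        # Background thread drains the replication queue.
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, name, data):
        # The synchronous part: a plain local-disk write.
        path = os.path.join(self.local_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        # Queue the file for background replication, then return
        # immediately -- the caller sees local-disk write speed.
        self.pending.put(name)
        return path

    def _replicate(self):
        while True:
            name = self.pending.get()
            src = os.path.join(self.local_dir, name)
            for replica in self.replica_dirs:
                shutil.copy(src, os.path.join(replica, name))
            self.pending.task_done()
```

The caller never waits on the network; the price, as noted above, is that two machines can both "successfully" write conflicting updates before replication catches up.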

Read Fast

Ensuring that reads are fast is much more challenging, because the data you need may not be stored locally. To ensure that this data is available on the local machine as often as possible, we pre-allocate space for AetherStore and use it to aggressively cache data beyond our N copies.

Pre-Allocating Space

When you start AetherStore on your PC you must pre-allocate a certain amount of space for AetherStore to use, both for primary storage and caching. As an example, a 1 GB AetherStore drive on a single workstation machine can be visualized like this when storing 400 MB:

The 400MB of data indicated in orange is so-called replicated data, which counts towards the required N copies of the data we are storing across the whole system. Each of these files is also stored on N-1 other machines, though not necessarily on the same N-1 machines[1].

Caching Data

The remaining 600 MB on the local drive is unused, but allocated for AetherStore. We use this space to cache as much data as we can, ensuring that we get the read speed of the local drive as often as possible. Cached data is treated differently than replicated data (as illustrated in our example below), as it must be evicted from the local store when there is insufficient space available for newly incoming replicated data. At that point some read speeds will be closer to those of a networked server.

The advantage of this approach is that it allows us to cache data in as many places as possible while we have the space, then prioritize keeping as much replicated data as possible as the drive fills up. In the worst case we’re as good as a networked server; in the best case we’re as good as your local file system.
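The distinction between replicated and cached data can be sketched as follows. This is an illustrative model with made-up names and a least-recently-used eviction policy (one reasonable choice; the post doesn't specify AetherStore's actual policy). Replicated blocks count toward the system's N copies and are never evicted; cached blocks are extra copies held only while space allows:

```python
from collections import OrderedDict

class LocalStore:
    """Sketch of one machine's pre-allocated AetherStore space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.replicated = {}         # block_id -> size; never evicted
        self.cached = OrderedDict()  # block_id -> size; LRU order

    def used(self):
        return sum(self.replicated.values()) + sum(self.cached.values())

    def _make_room(self, size):
        # Evict least-recently-used cached blocks to fit new data;
        # replicated data must stay to preserve the N-copy guarantee.
        while self.used() + size > self.capacity and self.cached:
            self.cached.popitem(last=False)
        if self.used() + size > self.capacity:
            raise IOError("store full: cannot evict replicated data")

    def store_replicated(self, block_id, size):
        self._make_room(size)
        self.replicated[block_id] = size

    def store_cached(self, block_id, size):
        # Cache opportunistically: only when free space already exists.
        if self.used() + size <= self.capacity:
            self.cached[block_id] = size

# A 1000-unit store: a 600-unit cached extra copy survives until
# incoming replicated data needs the space.
s = LocalStore(1000)
s.store_cached("extra-copy", 600)
s.store_replicated("data", 400)  # fits alongside the cache
s.store_replicated("more", 300)  # cache evicted to make room
```

Reads of `"extra-copy"` were local-speed while the cache held it; after eviction they would fall back to fetching from another machine, matching the networked-server worst case described above.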

Of greater importance is that we are able to store data across only a subset of machines. This allows the capacity of the AetherStore drive to outgrow the capacity of any individual machine, while ensuring that users of every machine can still access all of the data on the drive.

[1] Replica locations are determined on a file-by-file basis, rather than on an entire set of files, so it is unlikely that large groups of files will be backed up to the same set of machines. This allows us to make use of low capacity machines by replicating fewer files in comparison to larger machines.

For the first of my posts on the operations of AetherWorks, where better to begin than our first real challenge: finding the perfect office for our company.

When we started our search in 2010, we had three options:

1) Setting Up Within an Incubator. The tech scene in NYC is booming and there are plenty of options here. If we had applied and been successful, an incubator would have provided managed office space alongside a range of other services. These vary from organization to organization but often include access to mentors/advisors and strategic help with partnering and access to loans and grants. Being in an environment focused on the development and longevity of start-ups and having bundled expertise on hand was a very attractive proposition. See Dogpatch Labs as an example.

2) Fully Managed Facilities. These facilities are fully furnished and equipped offices that are already in move-in condition. They provide dedicated support staff and are available on short or long-term leases. In addition to reducing your upfront costs, work can begin immediately in Fully Managed Facilities. This is where we were based when we started our search. See Regus as an example.

3) Traditional/Custom Office Space. With traditional office space you must first decide on the grade of building you are willing to settle in. Gradings are based on attributes ranging from location strengths (public amenities, transportation, etc.) through to building efficiencies (energy management, exit routing, parking, etc.) and can give a quick overall assessment of the quality of the building. The grading gives a clear idea of how much you will spend per square foot. If you can find a suite, room or floor that has been occupied in the past, and suits your needs, your calculations can stop there.

Alternatively you may find a raw space within a building you like that requires a build-out. The landlord may offer to do this for you, with some degree of flexibility over the space, or you can opt to take care of it yourself. If building out your space, there are significant costs for contractors as well as the delay in entering the space while work is ongoing. There is also a great deal of setup involved for services that can be taken for granted in the other options. Leases are typically 5+ years, and as a young company with limited credit there can be large down payments, returned at the conclusion of the lease. But this does allow any company to completely tailor the space for their specific needs. See Abramson Brothers as an example.

We found the perfect raw custom office space at 501 Fifth Avenue. It would involve a comprehensive build-out, requiring a significant amount of input from a number of our employees. But, having long-term investment, we felt that long-term planning was the way forward. With a clear vision of where we wanted to be, this would ensure we had the perfect space to operate from and expand into. Little did we know everything else that this would entail. In April’s blog I’ll write in detail about the build out and everything you’ll need if you ever consider going through with one!

Requirements

The only way we could successfully tackle this project was to focus on who we were, what our goals were and, what we needed to achieve those goals. As an R&D firm, we felt we had a lot of default requirements for an office space that would not be as highly prioritized within a more standard software development company. Yes, we still had to ensure that each individual felt comfortable in their space, had the room they needed and the tools to assist them, but we also needed to consider the research in R&D. Our longevity as a company hinges on not only designing and delivering software of the highest quality, but also continuing to create intellectual property and ensuring our research cycles are efficient and rewarding.

Our office before the build-out.

There are many ideas that claim to contribute to random stimulation and creativity within your office or work space: visual stimulation through artwork and colors, having magazines and journals lying around, providing games and toys for breaks. Here, though, we had the chance to go a step further: the opportunity to create the perfect environment where all these extras would add to the creativity within our space. To this end we knew the pressure was on to build a great space for creative, cooperative working, and within this requirement we had to cater for research groups that would often vary in size. We had to ensure that both large and small groups would feel motivated and comfortable enough to spend prolonged periods of creative time (occasional overnights if necessary…) dealing with hard problems – all within our space restrictions in NYC.

With the walls of the office up, all that’s left to do is make the space our own.

Being rather inexperienced in the world of build-outs, we had yet to discover how important it was to have designers and contractors that bought into our ethos and fully understood what we needed. We got lucky… the second time.

The finished article. Lots of natural light and open spaces.

As I mentioned before, I’ll save the specifics of the design and construction, which inevitably went on far longer than planned and caused an incredible amount of stress, for another post.

Was it the right option?

Building out our space was a lengthy process; the duration from our initial viewing through absolute completion was nearly 18 months. But for all the phone calls, emails, drama, irritation and stress, we now have a fantastic, light, open, creative office that suits us perfectly. We do have a long-term lease, and we do have to manage our own services, but having the freedom and flexibility to create our own environment was far more important to us. Everyone now has their own personal space, we have ample room for our research activities, and we also have a great environment for welcoming customers and visitors. Most importantly, if you speak to any of our employees, one of the first things they will say about working here is that we have a phenomenal space to do our thing.

We knew what AetherWorks was, we built what it needed, and now we have an office that shows people who we are. That’s a result.

This is the first post of our new series answering some of your most frequently asked questions. We’ll start with the most common query.

Where is the server?

This is easy to answer: there isn’t one.

Other systems that make use of the spare capacity you have on your servers or workstation machines either require that you have a constant connection to the cloud, or that your machines are connected to a dedicated centralized server.

In AetherStore, the software that runs on each of your workstation machines manages and co-ordinates access to, and backup of, your data. These machines co-ordinate among each other, so there is no need for a central server.

I love the challenge of creating peer-to-peer systems and the flexibility that they give us.

A well-constructed peer-to-peer system allows us to create applications that work just as well with one hundred machines as they do with two, all without predetermined co-ordination or configuration; applications that don’t rely on a single machine or a specific network topology to run correctly.

With AetherStore, this is precisely what we need. We are creating a software system that eliminates the need for a storage server, instead allowing you to make use of the capacity you already have. If you have ten machines each with 1TB of free storage, AetherStore allows you to combine this capacity to create 10TB [1] of networked, shared storage, without any additional hardware.

With no shared server, we want to avoid making any one machine more important than the others, because we don’t want a single point of failure. We can’t delegate a machine to manage locks for file updates or to determine where data should be stored. Instead we need a system that is able to run without any central co-ordination, and that dynamically up-scales or down-scales as machines start up or fail.

This post discusses one of the ways in which AetherStore achieves this as a peer-to-peer system.

Conflict Resolution

As we have no central server and no guarantee that any one machine will always be active, we have no way of locking out files for update — two users can update the same file at the same time and we have no way of stopping them. Instead we need to resolve the resulting conflict.

Consider the following example. When two users decide to concurrently update the same file, we have a conflict. These updates are gossiped to the other machines in the network [2], which must independently decide how to resolve the conflict and make the same decision regardless of the order in which the updates were received.

This independent evaluation of conflicts is critical to the scalability of the system and to peer-to-peer architectures in general. If each node makes the ‘correct’ decision without having to contact any other nodes, the system is able to scale without introducing any bottlenecks [3]. This is the advantage of the peer-to-peer architecture, but it is also the challenge.

In the case of AetherStore, to deterministically resolve file conflicts we can only use one of the two pieces of information available to us: the time of the file update and the identity of the machine making the update. Time is an imperfect comparison, however, because the system clocks of each machine in the network are unlikely to be synchronized. Using machine ID for comparison is even less suitable because it results in an ordering of updates entirely determined by a user’s choice of machine [4].

Both options are imperfect, but they are the only choices we have without resorting to some form of central co-ordination. Consequently, we use the time of the update — the lesser of two evils — to determine which update takes precedence, with the other, conflicting update being added into a renamed copy of the file. If each update occurred at precisely the same time, we use the machine ID as a tiebreaker [5].
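The resolution rule described above — timestamp first, machine ID as the tiebreaker — can be expressed in a few lines of Python. This is a sketch of the scheme as described in this post, with invented names; the key property is that every node, given the same two updates in any order, picks the same winner:

```python
def resolve(update_a, update_b):
    """Deterministically pick a winner between two conflicting updates.

    Each update is a (timestamp, machine_id, content) tuple. Later
    timestamp wins; on an exact tie, the machine ID breaks it (here,
    arbitrarily, the lexicographically greater ID -- any fixed rule
    works, as long as every node applies the same one).
    """
    winner, loser = sorted(
        (update_a, update_b),
        key=lambda u: (u[0], u[1]),
        reverse=True,
    )
    # The loser is not discarded: it becomes a renamed conflict copy.
    return winner, loser

office = (1361900000.0, "machine-7", b"office edit")
park = (1361900005.0, "machine-2", b"park edit")
winner, loser = resolve(office, park)  # park's later edit wins
```

Because the comparison depends only on the two updates themselves, the result is the same whichever update arrives first and on whichever node the comparison runs — the independence that lets every machine resolve conflicts without contacting its peers.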

Truly Peer-to-Peer

The advantage of this approach is that every machine is an equal peer to every other machine. The failure of one machine doesn’t disproportionately affect the operation of the system, and we haven’t had to add a special ‘server’ machine to our architecture. Also, because each node resolves updates independently, we can easily scale out the system without fear of overloading a single machine.

Machines can be temporarily disconnected, users can take laptops home, a lab can be shut down at night, and the system remains operational [6].

Contrast this with a more traditional setup, where users are reliant on continued connectivity to a single server to have any chance of access to their data.

The key point here is that the removal of any central co-ordination greatly increases the flexibility of the system and its tolerance of failures. In AetherStore we have a system that is resilient to the failure of individual machines and one that seamlessly scales, allowing you to add or reintegrate machines into your network without configuration or downtime.

There is no central point of failure, no bottleneck, and no server maintenance.

And, for this, I love peer-to-peer systems.

[1] You probably want to keep multiple copies of this data, so the total space available may be slightly less.

[2] Rather than sending updates to all machines immediately, they are sent to random subsets of machines, eventually reaching them all. This allows us to scale.

[3] This is beautifully illustrated in Chord, which can scale to thousands of nodes with each node only having to know about a handful of other nodes to participate in the ring.

Welcome to the new blog. I’ll use this first post to give you a brief idea of who we are and what we do.

At our heart we are a software R&D firm with a particular expertise in distributed systems. We were founded in 2010 with the goal of developing intellectual property assets for the enterprise, and have since focused our efforts on bringing our first product, AetherStore, to market.

We are now a seven person team, built from a core of experts with the knowledge and experience needed to make AetherStore a reality.

Over time everyone will be contributing their perspective to the blog, sharing their experiences at AetherWorks. Personally, I’m looking forward to contributing my own experiences as COO in the coming weeks and months.