March 16, 2005

Amazon: Interplanetary e-commerce

What kind of problems will Amazon face in delivering retail services to Mars? Or to put it another way, why is it that we don't think global e-commerce is possible?

We already do some things at massive scale – the internet, mobile phones, chips (billions of transistors that all work). There are allegedly 1 quadrillion ants on the planet.

What do we need to solve the problem of massive scalability? Not just technology, though that may be a necessary precursor. There are only a few systems that can scale up to millions of parallel nodes.

Scale ought to be seen as an advantage – the more you scale, the more you can sell.

Can we use the same engineering techniques to build really large systems that we use for current big systems? Management becomes a big deal, as does coping with unreliability.

Real Life scales well - systems need to learn from biology for high fault-tolerance. Biological systems go through continuous refresh - cells are designed to die and be born without affecting the organism as a whole.

Outside monitors are not a good indicator of 'health'. The system should be designed for continuous change, not stability.

Turing's three categories of systems:

organised (current apps)

unorganised (networks)

self-organising (biological)

– need to move to self-organisation for massive scalability

Can't expect complete top-down control, since applications won't be deterministic. Real life is not a state machine.

Functional units need to be self-organising feedback-centric machines

Comparison point: why are epidemics so robust with respect to message loss and node failure? They can be mathematically modelled in a rigorous way. It works because each node can operate independently if it needs to. As the number of nodes becomes really large, each node only needs to know a subset of the system in order to succeed.
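
The epidemic argument can be sketched as a toy simulation (the parameters – `fanout`, `loss_rate` – are invented for illustration, not taken from the talk): each informed node pushes the update to a few random peers per round, roughly a third of all messages are dropped, and the rumour still reaches essentially every node within a handful of rounds.

```python
import random

def gossip(n_nodes=1000, fanout=3, loss_rate=0.3, seed=0):
    """Push-style epidemic dissemination with lossy messages.

    Each round, every informed node forwards the update to `fanout`
    peers chosen uniformly at random; each message is independently
    lost with probability `loss_rate`. No node needs a global view.
    """
    random.seed(seed)
    informed = {0}                  # node 0 originates the update
    rounds = 0
    while len(informed) < n_nodes and rounds < 100:
        newly = set()
        for _ in informed:
            for _ in range(fanout):
                peer = random.randrange(n_nodes)
                if random.random() > loss_rate:   # message survived
                    newly.add(peer)
        informed |= newly
        rounds += 1
    return len(informed), rounds

covered, rounds = gossip()
print(f"{covered}/1000 nodes informed after {rounds} rounds")
```

Even with 30% message loss the effective fanout stays above 2, so coverage grows roughly exponentially – which is why losing individual messages or nodes barely dents the protocol.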

Fault detection protocols – monitor, on a particular node A, how long it has been since another node B updated its state. B does not need to contact A directly, because the state will eventually replicate around the whole system. You need clear partitioning of data, but then the system becomes highly reliable.
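
That idea – A suspects B when B's replicated state stops advancing, with no direct A-to-B contact – can be sketched as a gossip-style heartbeat table. This is a minimal sketch; the class, parameters, and timings are invented for illustration.

```python
import random

class Node:
    """Keeps a heartbeat counter per peer, plus the local round at which
    each counter last advanced. Counters spread by merging tables with
    random peers, so liveness info reaches A without B contacting A."""

    def __init__(self, node_id, all_ids):
        self.id = node_id
        self.heartbeats = {i: 0 for i in all_ids}
        self.last_update = {i: 0 for i in all_ids}

    def tick(self, now):                      # a live node bumps itself
        self.heartbeats[self.id] += 1
        self.last_update[self.id] = now

    def merge(self, other, now):              # gossip exchange of tables
        for i, hb in other.heartbeats.items():
            if hb > self.heartbeats[i]:
                self.heartbeats[i] = hb
                self.last_update[i] = now

    def suspected(self, now, timeout=15):     # no progress => suspect dead
        return {i for i, t in self.last_update.items() if now - t > timeout}

def simulate(n=20, failed_id=5, fail_at=10, rounds=60, seed=1):
    random.seed(seed)
    ids = list(range(n))
    nodes = {i: Node(i, ids) for i in ids}
    for now in range(1, rounds + 1):
        for i in ids:
            if i != failed_id or now <= fail_at:
                nodes[i].tick(now)
        for i in ids:
            if i == failed_id and now > fail_at:
                continue                      # dead node stops gossiping
            peer = random.choice([j for j in ids if j != i])
            nodes[i].merge(nodes[peer], now)  # pull...
            nodes[peer].merge(nodes[i], now)  # ...and push
    return nodes[0].suspected(rounds)

print(simulate())   # node 0's view: which peers look dead?
```

Node 5 stops ticking after round 10, so its counter stops advancing everywhere it has replicated to, and every live node eventually suspects it – purely from replicated state, never by probing node 5 directly.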

2 comments

Real Life scales well – systems need to learn from biology for high fault-tolerance. Biological systems go through continuous refresh – cells are designed to die and be born without affecting the organism as a whole.

I read an article a little while ago (can't find it now – this was before I started furling everything) about the systems behind Google. I forget the details but their systems are based around small, cheap PCs (off-the-shelf & easy to replace and keep spares of) and are built on the assumption that several will fail per day. When that happens, they just plug in a new one, boot it up, and the system is designed to populate it with whatever part of the distributed database was on the failed one, and within a very short time it is able to serve requests again.

It's true. What's even cooler is that Google are expanding so fast that when an individual node dies, they don't even bother to remove it – they just leave it to rot in its rack. In many cases they don't even know whether a particular node is running or not – just that a certain percentage are available.

16 Mar 2005, 16:01
