> parallel jobs on massive datasets when you have a simple interface like
> MapReduce at your disposal. Forget about complex shared-memory or message
> passing architectures: that stuff doesn't scale, and is so incredibly brittle
> anyway (think about what happens to an MPI program if one core goes offline).
this is a bit unfair - the more honest comment would be that for
data-parallel workloads, it's relatively easy to replicate the work a bit,
and gain substantially in robustness. you _could_ replicate the work in
a traditional HPC application (CFD, chem/md, etc), but it would take a lot
of extra bookkeeping because the dataflow patterns are complex and iterative.
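a minimal sketch of what "replication almost for free" means for a data-parallel workload: because each map task is stateless and side-effect-free, a lost worker just means re-running one cheap task. the names here (`word_count`, `run_with_retry`) are invented for illustration, not from any real MapReduce implementation:

```python
import concurrent.futures

def word_count(chunk):
    """Map task: count words in one chunk of the input."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def run_with_retry(chunks, attempts=3):
    """Run each map task independently; re-run any task that fails.

    Tasks share no state, so a failed task can simply be re-executed
    (here, on the same pool; in a cluster, on another worker) with no
    extra bookkeeping -- the robustness data-parallel workloads get
    almost for free.
    """
    results = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for chunk in chunks:
            for _ in range(attempts):
                try:
                    results.append(pool.submit(word_count, chunk).result())
                    break
                except Exception:
                    continue  # a lost worker costs one cheap retry
    return results
```

contrast this with an iterative CFD or MD solver, where every timestep's state depends on its neighbours' previous state: you can't re-run one piece in isolation without checkpointing and replaying the dataflow around it.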
> The other Google technologies, like GFS and BigTable, make large-scale
> storage essentially a non-issue for the developer. Yes, there are tradeoffs:
well, I think storage is the pivot here: it's because disk storage is so
embarrassingly cheap that Google can replicate everything (3x?). once you've
replicated your data, replicating work almost comes along for free.
> So, printf() is your friend. Log everything your program does, and if
> something seems to go wrong, scour the logs to figure it out. Disk is cheap,
> so better to just log everything and sort it out later if something seems to
this is OK for data-parallel, low-logic kinds of workflows (like Google's).
it's a long way from being viable for any sort of traditional HPC, where
there's far too much communication and everything runs too long to log
everything. interestingly, logging might work if the norm for HPC clusters
were something like gigabit-connected uni-core nodes, each with 4x 3TB disks.
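rough arithmetic for that hypothetical configuration (gigabit link, 4x 3TB of local disk per node), just to see whether "log everything" is even plausible there:

```python
# back-of-envelope: how long could one node log its entire link traffic?
link_bytes_per_s = 1e9 / 8     # gigabit ethernet ~ 125 MB/s
disk_bytes = 4 * 3e12          # 4x 3 TB disks per node
hours = disk_bytes / link_bytes_per_s / 3600
print(round(hours, 1))         # -> 26.7: full-rate logging fits for ~a day
```

so on such a node, even worst-case full-rate communication logging fits on local disk for over a day, which is why the disk/data-oriented culture can afford the "printf everything" habit.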
so in a sense we're talking across a cultural gulf:
disk/data-oriented vs compute/communication-oriented.