Transcript

2.
brief introduction to hover.in
choose words from your blog, & decide what content / ad
you want when you hover* over it
* or other events like click, right click, etc.
or...
the world's first publisher-driven
in-text content & ad delivery platform...
or
lets web publishers push client-side event handling to the
cloud, to run various rich applications called hoverlets
demo at http://start.hover.in/ and http://hover.in/demo
more at http://hover.in , http://developers.hover.in/blog/
http://developers.hover.in

7.
➔ hover.in founded late 2007
➔ the web ~ 10-20 years old
➔ humans – hundreds of thousands of years
➔ but bacteria.... around for millions of years
... so this talk is going to be about what we can
learn from bacteria, the brain, and memory in
a concurrent world, followed by hover.in's erlang
setup and lessons learnt
http://developers.hover.in

10.
some traits of bacteria
● each bacterial cell spawns its own proteins
● all bacteria have some form of presence signalling
and replies associated with it (asynchronous comm.)
● group dynamics exhibit a 'list fold'-ish operation
● only when the Accumulator exceeds some guard
clause will the group dynamics of making light
(bioluminescence) work (eg: in the deep sea) – sketched below
http://developers.hover.in
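A minimal Erlang sketch of that 'list fold plus guard' idea (the module name, signal values and threshold are illustrative, not from the talk): each cell contributes a signal, the colony folds over all of them, and the light only switches on once the accumulator passes the quorum threshold.

-module(quorum).
-export([glow/2]).

%% fold over each cell's signal strength, then gate the group
%% behaviour behind a guard on the accumulated total
glow(Signals, Threshold) ->
    Acc = lists:foldl(fun(Signal, Sum) -> Sum + Signal end, 0, Signals),
    if
        Acc > Threshold -> bioluminescence_on;
        true            -> dark
    end.

%% quorum:glow([1,1,1,1], 3).  %% -> bioluminescence_on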

11.
spawning, in practice
● for a single google search result, the same
request is sent to multiple machines (~1000
as of '09); whichever replies the quickest wins
(a first-reply-wins sketch follows below)
● amazon's dynamo architecture, which powers
S3, uses a (3,2,2) rule, ie maintain 3 copies of
the same data, and reads/writes are successful only
when 2 concurrent requests succeed. this ratio
varies based on SLA and internal vs public service.
(more on conflict resolution...)
http://developers.hover.in
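A minimal 'first reply wins' sketch in Erlang (the module name and the 5-second timeout are illustrative): the same query is spawned against several worker funs and the caller keeps whichever answer comes back first, ignoring the rest.

-module(scatter).
-export([first_reply/2]).

%% spawn the same query on every worker fun; the first message back
%% tagged with our ref is the answer, the rest are simply ignored
first_reply(Query, WorkerFuns) ->
    Parent = self(),
    Ref = make_ref(),
    [spawn(fun() -> Parent ! {Ref, F(Query)} end) || F <- WorkerFuns],
    receive
        {Ref, Answer} -> Answer
    after 5000 ->
        {error, timeout}
    end.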

12.
pattern matching behaviour
● each molecule connects to its specific receptor
protein to complete the missing piece, triggering
group behaviours that are only successful
when all of the cells participate in unison.
● Type = case UserType of
             user  -> true;
             admin -> true;
             _Else -> false
         end
http://developers.hover.in

13.
supervisors, workers
● as bacteria grow, they split into two. when
muscle tears, the body knows exactly what to replace.
● erlang supervisors can decide restart policies: if
one worker fails, restart all.... or if one worker
fails, restart just that worker, plus more tweaks
(see the supervisor sketch below)
● can spawn multiple workers on the fly, much
like the need for launching a new ec2 instance
http://developers.hover.in
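A minimal supervisor sketch of those restart policies (the module and child names are illustrative, not hover.in's actual tree): one_for_one restarts only the failed worker, while one_for_all would restart every sibling when one dies.

-module(hoverlet_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% {Strategy, MaxRestarts, MaxSeconds}: swap one_for_one for
    %% one_for_all to restart every worker when a single one fails
    RestartStrategy = {one_for_one, 5, 10},
    Worker = {hoverlet_worker,
              {hoverlet_worker, start_link, []},
              permanent, 5000, worker, [hoverlet_worker]},
    {ok, {RestartStrategy, [Worker]}}.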

15.
inter-species communication
● if you look at your skin – it consists of very many
different species, but all bacteria have been found to
communicate using one common chemical
language.
hmmmmmmmmmmmmmmmmmmm..............
....serialization?!
....a common protein interpreter?!
....or perhaps just-in-time protein compilation?!
http://developers.hover.in
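In Erlang terms, the nearest analogue to one common chemical language is probably the external term format; a tiny shell sketch (the message itself is made up):

1> Msg = {signal, autoinducer, 42}.
2> Wire = term_to_binary(Msg).      %% serialize to one common wire format
3> binary_to_term(Wire).            %% any receiving node decodes the same term
{signal,autoinducer,42}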

21.
in-memory is the new embedded
● keeping your entire data in-memory by having N
nodes (where N = total data in GB /
max RAM per node) is like...
– building a billion dollar company with 999 million
dollars of funding!
or
– having only a right brain!
● surely we can do better than that!
(a back-of-the-envelope sketch of that N follows below)
http://developers.hover.in
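A back-of-the-envelope sketch of that N (the module name and the 480 GB / 16 GB figures are illustrative):

-module(sizing).
-export([nodes_needed/2]).

%% nodes needed just to keep TotalGb of data resident in RAM
%% (integer ceiling division)
nodes_needed(TotalGb, RamPerNodeGb) ->
    (TotalGb + RamPerNodeGb - 1) div RamPerNodeGb.

%% sizing:nodes_needed(480, 16).  %% -> 30 nodes before any real work starts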

22.
in-memory capacity planning
● no matter how many machines you have, and
how many cores, at production level your
product could be defined by how well you
design your in-memory / RAM strategies.
● alternatives to avoid swapping could be – just
leaving results partitioned on different nodes, or
running additional tasks to reduce the data load
further until it can fit in memory
http://developers.hover.in

23.
in-memory capacity planning
● parallelizing jobs in-memory is a lot of fun...
● but...
● more often the bottleneck will not be how well you
can parallelize, but how much you need to
parallelize so that memory doesn't swap
(eg: parallel db reads – see the chunked sketch below)
http://developers.hover.in
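A minimal bounded-parallelism sketch (the module name is illustrative): work through the items one chunk at a time, so only a chunk's worth of in-flight results ever sits in memory at once.

-module(bounded).
-export([pmap_chunked/3]).

%% run F over Items, but only ChunkSize jobs at a time, so memory use
%% is bounded by the chunk rather than by the whole list
pmap_chunked(_F, [], _ChunkSize) ->
    [];
pmap_chunked(F, Items, ChunkSize) ->
    {Chunk, Rest} = split_at(ChunkSize, Items),
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(Item)} end),
                Ref
            end || Item <- Chunk],
    Results = [receive {Ref, R} -> R end || Ref <- Refs],
    Results ++ pmap_chunked(F, Rest, ChunkSize).

split_at(N, List) when length(List) =< N -> {List, []};
split_at(N, List) -> lists:split(N, List).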

25.
(1)#2 implementing flow control
● great for handling both bursty and silent traffic & for
determining bottlenecks (eg: your own queue, rabbitmq, etc.)
● eg1: when we add jobs to the queue, if it consistently takes
longer than X we move it to the high-traffic bracket,
do things differently, and possibly add workers or
ignore the job, depending on the task (sketched below)
● eg2: amazon shopping carts are known to be
extra resilient to write failures (don't mind
multiple versions of them over time)
http://developers.hover.in
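A minimal flow-control sketch (the threshold and the two brackets are illustrative, not hover.in's actual values): time each enqueue, and if it runs over budget report 'high traffic' so the caller can add workers or shed load; a real version would track a moving average rather than a single sample.

-module(flow).
-export([enqueue/3]).

%% time the enqueue itself; a slow enqueue is the signal that the
%% system is in its high-traffic bracket
enqueue(Job, Queue, MaxMicros) ->
    {Micros, NewQueue} = timer:tc(fun() -> queue:in(Job, Queue) end),
    case Micros > MaxMicros of
        true  -> {high_traffic, NewQueue};   %% add workers / shed load here
        false -> {normal, NewQueue}
    end.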

26.
(1)#3 all data is important, but some is less important
● priority queue used to build a heat-seeking algo
(priority to crawl webpages that get more hits,
rather than depth-first or breadth-first)
● can configure max number of buckets
● can configure max number of urls per bucket
● can configure a pyramid-like queue (moving
from lower buckets to higher is easier than
moving from high to higher)
http://developers.hover.in

31.
erlang in a crawler architecture?
● each time a hit repeats for a URL, it moves from
bucket N to bucket N+1 (a bucket-promotion sketch follows below)
● crawls happen from the top down (priority queue)
● so the bucket being crawled is locked, so that locked urls
don't keep moving up anymore
● each user/site has their own priority queues, which
keep shifting round-robin after every X urls crawled
per user/site
● the python crawler leaves text files which are dirty-loaded
into fragmented mnesia
http://developers.hover.in
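A minimal sketch of that hit-driven bucket promotion (the ets layout, cap and lock handling are illustrative, not hover.in's actual crawler): each repeated hit bumps a url one bucket up, capped at the top, and urls are never promoted into the bucket currently locked for crawling.

-module(heatseek).
-export([init/0, hit/3]).

init() ->
    ets:new(url_buckets, [named_table, set, public]).

%% a hit promotes Url from bucket N to N+1, but never past MaxBucket
%% and never into the bucket currently locked for crawling
hit(Url, MaxBucket, LockedBucket) ->
    Bucket = case ets:lookup(url_buckets, Url) of
                 [{Url, B}] -> B;
                 []         -> 0
             end,
    Next = min(Bucket + 1, MaxBucket),
    case Next >= LockedBucket of
        true  -> Bucket;                               %% locked: stay put
        false -> ets:insert(url_buckets, {Url, Next}),
                 Next
    end.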

33.
(1)#5 before every successful persistent write & after
every successful persistent read is an in-memory one
● you listen to a phone number in batches of 3 or 4
digits – that is the part that absorbs it just before writing
(temporal), until you write it into your contact book
or memorize it (persistent)
● eg: if an LRU cache exists in-memory, like the 100
most recent urls or tags, then there's no need to parse
server logs for computation – do it during the writes
themselves. no logs, no files. live buzz analytics! (sketched below)
http://developers.hover.in
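A minimal sketch of updating the cache during the write itself (the ets layout is illustrative; the 100-entry cap comes from the slide): every persistent write also pushes the url onto an in-memory most-recent list, so the buzz is readable live with no log parsing.

-module(buzz).
-export([init/0, record_write/1, recent/0]).

-define(MAX_RECENT, 100).

init() ->
    ets:new(recent_urls, [named_table, public]).

%% called as part of the persistent write path: keep the url at the
%% head of the most-recent list, trimmed to the last 100 entries
record_write(Url) ->
    Updated = lists:sublist([Url | lists:delete(Url, recent())], ?MAX_RECENT),
    ets:insert(recent_urls, {recent, Updated}),
    ok.

recent() ->
    case ets:lookup(recent_urls, recent) of
        [{recent, Urls}] -> Urls;
        []               -> []
    end.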

35.
(1)#7 what cannot be measured cannot be improved
● you can't improve what you can't measure. an
investment in debugging utilities is a good
investment
● looking forward to debugging with dtrace, gproc,
etc., but until then – just a set/get away! (sketched below)
● using tsung (written in erlang, again) – a load /
performance testing tool for simulating hundreds of
concurrent users/requests, and great for
analysing bottlenecks of your system (CDNs, etc.)
http://developers.hover.in
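A minimal 'just a set/get away' counter sketch (module and counter names are illustrative): bump a named ets counter wherever something interesting happens, and read it back whenever you want to see how the system is behaving.

-module(metrics).
-export([init/0, bump/1, read/1]).

init() ->
    ets:new(counters, [named_table, public]).

%% increment the named counter, creating it at 0 on first use
bump(Name) ->
    ets:update_counter(counters, Name, 1, {Name, 0}).

read(Name) ->
    case ets:lookup(counters, Name) of
        [{Name, Value}] -> Value;
        []              -> 0
    end.

%% metrics:bump(cache_hits), metrics:read(cache_hits).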

38.
7 rules of in-memory capacity planning
(1) shard thy data to make it sufficiently unrelated
(2) implementing flow control
(3) all data is important, but some is less important
(4) time spent x RAM utilization = a constant
(5) before every successful persistent write & after
every successful persistent read is an in-memory one
(6) know thy RAM, trial/error to find the ideal data load
(7) what cannot be measured cannot be improved
http://developers.hover.in