Tech Details

Monday, August 14, 2017

The question was raised a week ago on Reddit. Recently I implemented the elaborated solution in nginx-haskell-module. In this article I will talk about why I wanted to pass ByteString contents directly to the C side, what obstacles lay in the way, and what turned out to be the best way to implement this.

The answer to the question why is quite simple and straightforward. Many Haskell handlers from ngx-export, including all asynchronous and content handlers, pass data from lazy ByteStrings to the C side to be further used in Nginx variable or content handlers. Before version 1.3 of nginx-haskell-module the data was merely copied into a single buffer (for variable handlers) or multiple buffers (for content handlers). Below are both historical functions.

toSingleBuffer :: L.ByteString -> IO (Maybe CStringLen)
toSingleBuffer EmptyLBS =
    return $ Just (nullPtr, 0)
toSingleBuffer s = do
    let l = fromIntegral $ L.length s
    t <- safeMallocBytes l
    if t /= nullPtr
        then do
            void $ L.foldlChunks
                (\a s -> do
                    off <- a
                    let l = B.length s
                    B.unsafeUseAsCString s $
                        flip (copyBytes $ plusPtr t off) l
                    return $ off + l
                ) (return 0) s
            return $ Just (t, l)
        else return Nothing

toBuffers :: L.ByteString -> IO (Maybe (Ptr NgxStrType, Int))
toBuffers EmptyLBS =
    return $ Just (nullPtr, 0)
toBuffers s = do
    t <- safeMallocBytes $
        L.foldlChunks (const . succ) 0 s * sizeOf (undefined :: NgxStrType)
    l <- L.foldlChunks
        (\a s -> do
            off <- a
            maybe (return Nothing)
                (\off -> do
                    let l = B.length s
                    -- l cannot be zero at this point because intermediate
                    -- chunks of a lazy ByteString cannot be empty which is
                    -- the consequence of Monoid laws applied when it grows
                    dst <- safeMallocBytes l
                    if dst /= nullPtr
                        then do
                            B.unsafeUseAsCString s $ flip (copyBytes dst) l
                            pokeElemOff t off $
                                NgxStrType (fromIntegral l) dst
                            return $ Just $ off + 1
                        else do
                            mapM_ (peekElemOff t >=>
                                       \(NgxStrType _ x) -> free x)
                                [0 .. off - 1]  -- [0 .. -1] makes [], so wise!
                            free t
                            return Nothing
                ) off
        ) (return $ if t /= nullPtr then Just 0 else Nothing) s
    return $ l >>= Just . (,) t

What imports are needed and what NgxStrType is can be found in the original module. These functions first allocate a storage for a single buffer holding the whole content of the original ByteString (toSingleBuffer), or a storage of storages corresponding to every chunk of the ByteString (toBuffers). In case of an allocation error the functions return Nothing. The caller (which is still a Haskell function that represents a specific Haskell handler on the C side) interprets this as an allocation error and returns a tuple (nullPtr, -1) of kind (bufs, n_bufs) via the pattern PtrLenFromMaybe. In both functions, data from the original ByteString gets copied with copyBytes. The user of the functions must free the data at some point. Imagine the following simple version of toBuffers that directly passes the ByteString buffers to the C side.
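The code of this simple version did not survive in this copy of the article; below is a minimal sketch of what it might look like, reusing EmptyLBS, safeMallocBytes and NgxStrType from the functions above (the exact shape in the module may differ):

```haskell
-- Hypothetical zero-copy variant of toBuffers: instead of copying chunk
-- data, poke pointers to the chunks' internal buffers into the storage
toBuffers :: L.ByteString -> IO (Ptr NgxStrType, Int)
toBuffers EmptyLBS =
    return (nullPtr, 0)
toBuffers s = do
    t <- safeMallocBytes $
        L.foldlChunks (const . succ) 0 s * sizeOf (undefined :: NgxStrType)
    if t /= nullPtr
        then do
            n <- L.foldlChunks
                (\a c -> do
                    off <- a
                    -- store a reference to the chunk's buffer, no copying
                    B.unsafeUseAsCStringLen c $ \(src, l) ->
                        pokeElemOff t off $ NgxStrType (fromIntegral l) src
                    return $ off + 1
                ) (return 0) s
            return (t, n)
        else return (nullPtr, -1)
```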

Now we return the tuple directly and put in its first element (the storage of storages that now becomes a storage of references) the references to the original ByteString buffers. Later, the data gets passed to the C side. Very simple: no extra allocations, no copying, no obligation to free the data on the C side.

Is this really possible? How does the lifetime of the internal ByteString data correspond to the desired lifetime of the data on the C side? The answer is simple: there is no correspondence between them! After returning from a Haskell handler on the C side, the passed (or better to say, poked) contents of the ByteString can easily be freed by the Haskell garbage collector, because nothing refers to the ByteString on the Haskell side anymore. This is bad news! Nginx uses epoll() (or a similar mechanism for feeding tasks when epoll() is not available), which means that we may need the references to stay valid for an unpredictable period of time. How could we ensure validity of the references?

The StablePtr to the rescue!

A stable pointer is a reference to a Haskell expression that is guaranteed not to be affected by garbage collection, i.e., it will neither be deallocated nor will the value of the stable pointer itself change during garbage collection (ordinary references may be relocated during garbage collection). Consequently, stable pointers can be passed to foreign code, which can treat it as an opaque reference to a Haskell value.

So, a stable pointer guarantees that the object it points to stays alive until the pointer gets freed (read the docs further for details). We could pass a StablePtr to the ByteString along with the references to the ByteString buffers to the C code, and free the pointer by calling a special imported Haskell handler like
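The handler's code is missing from this copy; in essence it only needs to call freeStablePtr. A hedged sketch follows (the actual export in the module may differ; note also that the GHC RTS already exports hs_free_stable_ptr() in HsFFI.h, which the C side could call directly):

```haskell
import Foreign.StablePtr (StablePtr, freeStablePtr)
import qualified Data.ByteString.Lazy as L

-- Hypothetical handler: the C side calls it when the request that used
-- the ByteString's buffers is finalized, releasing the stable pointer
-- and thereby letting the GC collect the ByteString
freeLBSStablePtr :: StablePtr L.ByteString -> IO ()
freeLBSStablePtr = freeStablePtr

foreign export ccall freeLBSStablePtr :: StablePtr L.ByteString -> IO ()
```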

when the references are no longer needed (e.g. at the end of the HTTP request).

But the documentation on the StablePtr says nothing about possible relocations of the object it points to. Normally, the garbage collector moves live objects to a new heap when the old heap gets removed (see a brilliant article here).

Oh, no! Is the StablePtr not a cure? Relax… We do not actually need the ByteString itself. Remember? We need its buffers only! Digging into the ByteString implementation reveals that the buffers are allocated in special pinned byte arrays using function newPinnedByteArray#. The docs say about it (more precisely, about newPinnedByteArray, but it merely calls the former):

Create a pinned byte array of the specified size. The garbage collector is guaranteed not to move it.

Thus, the Haskell garbage collector will leave the internal ByteString buffers in their original places. On the other hand, the StablePtr guarantees aliveness of the ByteString until it is freed (hence, aliveness of its non-relocatable buffers too). This is all that we really need, and this seems to be the best solution.
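Putting the pieces together, the final zero-copy scheme can be sketched as follows. This is an illustration under the assumptions stated earlier about NgxStrType, not the module's actual code: references to the pinned chunk buffers are poked into the storage, and a StablePtr to the whole ByteString is returned alongside them so the buffers stay alive until the C side releases it.

```haskell
toBuffersWithStablePtr
    :: L.ByteString -> IO (Ptr NgxStrType, Int, StablePtr L.ByteString)
toBuffersWithStablePtr s = do
    -- keep the ByteString (and hence its pinned buffers) alive
    -- until the C side frees the stable pointer
    sp <- newStablePtr s
    t <- mallocBytes $
        L.foldlChunks (const . succ) 0 s * sizeOf (undefined :: NgxStrType)
    n <- L.foldlChunks
        (\a c -> do
            off <- a
            -- letting the pointer escape unsafeUseAsCStringLen is safe
            -- here exactly because the buffer is pinned and kept alive
            -- by the stable pointer
            B.unsafeUseAsCStringLen c $ \(src, l) ->
                pokeElemOff t off $ NgxStrType (fromIntegral l) src
            return $ off + 1
        ) (return 0) s
    return (t, n, sp)
```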

Wednesday, January 4, 2017

In this article I will show how to set up HTTP routing with nginx and the Nginx Haskell module to mutable arrays of geographically distributed labeled media storages, front-ended by REST API media managers. Let's assume that the managers accept only 3 types of requests: /conf, /read and /write. On a /conf request a manager must return a JSON object with a list of all media storages it manages along with their parameters: labels, hints, mode (RW or RO) and a timestamp referring to the last update of the data. On /read and /write requests a manager must try to read from or write to a media storage according to the values of the request parameters label or hint respectively. On a successful /write request the label of the affected media storage must be returned: thus the client will be able to provide the label value in subsequent /read requests. If a /read or a /write request was unsuccessful then HTTP status 404 must be returned. I won't discuss implementation of the media managers: these requirements, and a clear understanding that the managers must serve as transparent transformers of the REST requests proxied via media routers to the media storage layer, are enough for the purpose of the article.

The media routers are another story. They must serve user's /read and /write requests, properly proxying them to those media managers which know about the supplied label or hint. To collect information about available labels and hints and keep it valid, the routers must regularly poll their own media managers and other routers, or they can subscribe to a message queue to obtain updates asynchronously. For the sake of simplicity I will use the polling model: in this case, on any unexpected response a client may assume that the media storage layout was altered but the router had not received the update yet, and simply retry the request in a minute or whatever the router polling interval is. Active polling requires asynchronous services from the nginx-haskell-module.
Let's poll the other media routers with a /conf request, just as with the media managers, expecting that they return a JSON object with a list of configurations collected from their own media managers. The model is depicted in the following image.

There are 3 geographically (or by other criteria) separated areas in the LAN which are colored differently. In front of every area stands a single Labeled Media Router (LMR 1, LMR 2 and LMR 3): this manifests the layer called Routers, the one closest to the client's area (i.e. the WAN). To be precise, all the routers are depicted as double towers: it means that they may stand behind a load balancer. The second layer, Managers, consists of media managers bound to media routers (LMM 1 through LMM 7): these bindings are statically defined in the nginx configuration. Some media managers are doubled, just like the media routers: the situation when many routers are bound to many managers via a load balancer affects the collected data update algorithm in a media router. The lowest layer, Media, consists of media storages that may change dynamically (including addition and removal of storages and altering of their parameters). The media storages must have labels (dir_1, dir_2 etc.), a mode (RW or RO) and, optionally, a list of hints. In the image, some storages have a single hint fast. If the list of hints is empty then a storage is supposed to have the hint default. Considering that storages with the same label are replicated somehow implies that reading from a storage in a different area is safe: normally data on such storages is expected to be equal, however when data is not found, the media manager must return status 404 and the router must skip to another area. Presence of replication is depicted by blue dashed arrows for storages with label dir_1; other storages with equal labels are also replicated, but I did not show this in order not to overload the image. Mutual polling between all media routers with /conf requests is shown with bold red arrows.

An example of a user's /read request is shown in the image. The user requested reading a file from label dir_1.
Router LMR 1 found that the request could be proxied to media manager LMM 1, but it responded with status 404, and the router passed the request to media manager LMM 4 from another area, having skipped another media manager LMM 3 from its own area. An interesting scenario with two proxy_pass actions in a single request! I'll show how to achieve this later, but now I want to specify the requirements for /read and /write requests and the media router behavior more accurately.

/read

parameter label must be provided; it is supposed to have been given in the /write response from the media manager where the data was actually written,

media storages with RW or RO mode can be chosen,

if a media manager returns status 404 then the router must skip the other media managers in the current area that hold this label,

if a media manager is not accessible or returns 502, 503 or 504 then the router must try to pass the request to another media manager in the current area that holds the label, or to a manager from another area if there are no suitable managers in the current area,

if there is more than one manager that holds the label then they must be chosen in a round-robin manner on every new request to the given nginx worker process,

if a media manager is not accessible or returns 502, 503 or 504 then it must be blacklisted for a time period specified in the nginx configuration,

if the list of suitable media managers gets exhausted while no successful response has been received then the media router must return status 503.

/write

parameter hint can be provided; if it is not then it is supposed to be default,

only media storages with RW mode can be chosen,

other rules from the /read clause that regard statuses 404, 502, 503 and 504, inaccessibility, round-robin cycling and media managers exhaustion apply here too.

Now let's turn to the nginx configuration file. Below is a configuration for a media router.

I put the Haskell code in a separate source file lmr.hs because I could not have put it into the configuration even if I had wanted to: the length of an nginx variable's value cannot exceed 4096 bytes when it is read from a configuration file, and the content of lmr.hs does exceed that. Type Conf is defined in lmr.hs as
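The definition itself is missing from this copy of the article; judging by the description below, it is a plain Read-able record along these lines (the field names and the use of the TimeInterval type from ngx-export are my assumptions):

```haskell
data Conf = Conf { updateInterval    :: TimeInterval  -- poll period, e.g. 20 sec
                 , blacklistInterval :: TimeInterval  -- blacklisting period
                 , backends          :: [String]      -- own managers' /conf URLs
                 , partners          :: [String]      -- other routers' /conf URLs
                 } deriving (Read, Show)
```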

Function queryEndpoints is a service that will run every updateInterval, i.e. 20 sec: this is the negotiation process between media routers (the partners) depicted in the image with bold red arrows; blacklistInterval is a global reference that defines for how long inaccessible media routers must be blacklisted; backends are the router's own media managers (i.e. LMM 1, LMM 2 and LMM 3 for LMR 1 in terms of the image) with the URLs of their /conf locations, and partners are the other media routers (i.e. LMR 2 and LMR 3 for LMR 1) with the URLs of their /conf locations. Below is the definition of queryEndpoints.

The function returns an encoded JSON object (written into variable $hs_all_backends on the nginx side) wrapped inside the IO monad: this means that queryEndpoints performs several actions with side effects sequentially. The first action is reading data Conf passed from the nginx configuration. Then it updates the global reference blInterval if the service runs for the first time, or waits 20 sec otherwise. The next action is querying its own media managers (backends) and then the other media routers (partners), collecting results in allbd with respect to the timestamps received from the backends, which are compared with the data collected on the previous iteration (oldbd, which is read from the global reference allBackends). Then allbd gets written into allBackends. So far queryEndpoints updates data of type CollectedData that closely represents the original layout of partners and backends.

The next actions in the function queryEndpoints transform allbd into newRoutes of type (Routes, Routes), adapted for being used in the search algorithm both for labels (/read) and hints (/write). The first element of the tuple contains elements laid out for /read requests, the second element is for /write requests.

Thus, the type Routes is a Map of a Map of a Map. The innermost Map's key is Possession, which is just a distance to backends (Own or Remote addr), and the value is [Destination]: a list of addresses of all backends that correspond to a specific hint and label, which are respectively the keys of the outer and the middle Maps. Since Possession keys are always sorted with Own as the first element (as the derived Ord instance guarantees), the router's own media managers will always come first in the search hierarchy. Hints are irrelevant for /read requests whereas labels are irrelevant for /write requests; that's why a special taggedAnyHint is used in the read part of the returned Routes tuple, and the value "any" is used as the label in the write part of the tuple. Below is the definition of TaggedHint.
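The definitions of Routes and TaggedHint did not survive in this copy; here is a hedged reconstruction from the description above (the names of the auxiliary types are assumptions):

```haskell
type Destination = String            -- address of a media manager

data Possession = Own                -- own area sorts first thanks to
                | Remote Destination -- the derived Ord instance
                deriving (Eq, Ord, Show)

-- hint -> label -> distance -> list of suitable backends
type Routes = Map String (Map String (Map Possession [Destination]))

-- a hint wrapper that always renders as "any": used as the single
-- (irrelevant) hint key in the read part of the Routes tuple
newtype TaggedHint = TaggedHint String deriving (Eq, Ord)

instance Show TaggedHint where
    show _ = "any"
```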

So, the hint in the search hierarchy of Routes in the read part will also look as "any".

After making newRoutes in queryEndpoints we first check if it is equal to the stripped value held in the second element of a tuple from the global reference routes, defined as
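The definition is also missing here; from the way routes is used further in the article, it should look roughly like this (RRRoutes enriches Routes by pairing each backend list with an optional round-robin selector; all names except RoundRobin are my guesses):

```haskell
type SeqNumber = Int

-- like Routes, but each group of backends carries a round-robin
-- selector when it contains more than one element
type RRRoutes =
    Map String (Map String
        (Map Possession ([Destination], Maybe (RoundRobin Int))))

-- (old, current) pair of sequentially numbered search hierarchies,
-- each holding a read part and a write part
routes :: IORef ((SeqNumber, (RRRoutes, RRRoutes)),
                 (SeqNumber, (RRRoutes, RRRoutes)))
routes = unsafePerformIO $ newIORef ((0, e), (0, e))
    where e = (M.empty, M.empty)
{-# NOINLINE routes #-}
```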

(type RoundRobin is imported from Data.RoundRobin of package roundRobin). The routes tuple holds two elements with search hierarchies enriched with round-robin elements: the old and the current. The old value is held to avoid loss of the search path in a request currently being processed when the search hierarchy gets altered after an asynchronous update in queryEndpoints: each request gets tagged with the SeqNumber from the current element of the tuple and is supposed to survive two sequential updates of the search hierarchy (if for some reason a request fails to trace down the whole search path to a successful response and the element with its SeqNumber in the search hierarchy has already been rewritten, then it gets finalized with writeFinalMsg in getMsg, see below).

Let's get back to the action in queryEndpoints. If newRoutes and the stripped (i.e. without round-robin elements) second element of routes differ, then the second element of routes (i.e. the current routes) gets moved to the first place (it becomes old) and the enriched copy of newRoutes gets put in the second place. Stripping enriched routes is performed by function fromRRRoutes.

So, if the innermost Map's value of the Routes, i.e. the list of backends, contains N elements and N is more than one, then it gets bound to a RoundRobin element with a reference to N indexes starting from 1.

The last thing that I want to say about queryEndpoints is that the HTTP engine httpLbs used in it was taken from package http-client. It means that all requests are asynchronous and with adjustable timeouts. All exceptions that function query may produce are properly caught with functions catchBadResponseOwn and catchBadResponseRemote: they never return broken data, replacing it with the existing good data instead.

Now let's get back to the mysterious and frightening nginx directives haskell_var_nocacheable and haskell_var_compensate_uri_changes from file nginx-lmr-http-rules.conf, and show the content of file nginx-lmr-server-rules.conf.

Both /read and /write requests get rewritten to a regexp location that starts with /Read or /Write; the rewritten request gets supplied with the hint and the label, with four zeros, and with the mysterious tail /START/Ok. What is going on here is the start of the loop in which suitable media managers will be found or not found and the request's operation (read or write) will be successfully executed or will fail! The four zeros are the starting values for the loop: sequential number, key — the index of the current search position in the innermost Map of type Routes, i.e. Possession or distance, start — the index that will be assigned by the round-robin element from the RRRoutes on the first try if the number of suitable media managers is more than one, and index — the current offset from the start value that will also be assigned from within the haskell module. The START value is rather arbitrary: this is an initial value of the backend to proxy_pass to and will be assigned from the haskell part; Ok corresponds to the internal state of the loop.

The loop itself is encoded in the regexp location. The ability to iterate over the loop is the result of internal redirections provided by error_page directives which redirect to the same location when backends respond with bad statuses. But internal redirections in nginx have shortcomings: values of variables can either be cached forever or the variable's handler will be accessed on every access to the variable. We need something in between: caching variables within a single internal redirection and updating them on every new internal redirection. This is exactly what directive haskell_var_nocacheable does! Another shortcoming of the nginx internal redirections is their finite number (10 or so): there is an internal counter uri_changes bound to the request's context; it gets decremented on every new internal redirection, and when it reaches the value 0, nginx finalizes the request.
We cannot say how many internal redirections we need: apparently 10 may not be enough, and a static limit is not good for us at all. Directive haskell_var_compensate_uri_changes makes the variable's handler increment (if possible) the request's uri_changes value. As variable $hs_msg is also nocacheable, its handler will be accessed once per internal redirection, thus compensating the uri_changes decrements. Thus, the two directives haskell_var_nocacheable and haskell_var_compensate_uri_changes can make error_page internal redirections Turing-complete!

New parameters of the loop, including the most important one — the backend to proxy_pass to, are received via a synchronous call to the haskell handler getMsg. Function getMsg is wrapped in the IO monad, however it does not produce unsafe or unpredictable side effects like reading and writing from network sockets or files: it only accesses round-robin elements in the routes data and may also alter the global reference blacklist. This means that we can safely regard getMsg as safe in the sense of synchronicity. Here is how getMsg is defined.

readMsg = readDef (Msg Read "" "" 0 0 0 0 "" NotReadable) . C8.unpack

writeMsg = return . C8L.pack . show

writeFinalMsg m = writeMsg m { backend = "STOP", status = NonExistent }

getMsg (readMsg -> m@Msg { status = NotReadable }) =
    writeFinalMsg m
getMsg (readMsg -> m@(Msg op hnt label seqn key start idx b st)) = do
    when (st == NotAccessible) $ do
        bl <- readIORef blacklist
        when (b `M.notMember` bl) $
            getCurrentTime >>= modifyIORef' blacklist . M.insert b
    (getRoutes seqn >=> return . rSelect op >=>
        return . second (M.lookup hnt >=> M.lookup label) -> r) <-
            readIORef routes
    case r >>= \x@(_, d) -> (x, ) <$> (d >>= elemAt key) of
        Nothing -> writeFinalMsg m
        Just ((n, fromJust -> d), (_, gr)) -> do
            (s, i, b) <- if st == NotFound
                             then return (0, 0, Nothing)
                             else getNextInGroup start
                                      (advanceIdx start st idx) gr
            case b of
                Nothing -> do
                    ((s, i, b), k) <- getNext (key + 1) d
                    case b of
                        Nothing -> do
                            unblacklistAll d
                            writeFinalMsg m { seqn = n }
                        Just v -> writeMsg $ Msg op hnt label n k s i v Ok
                Just v -> writeMsg $ Msg op hnt label n key s i v Ok
    where rSelect Read = second fst
          rSelect Write = second snd
          getRoutes v (a@(x, _), b@(y, _))
              | v == 0 = Just b
              | v == y = Just b
              | v == x = Just a
              | otherwise = Nothing
          elemAt i m
              | i < M.size m = Just $ M.elemAt i m
              | otherwise = Nothing
          getNextInGroup _ 1 (_, Nothing) = return (0, 0, Nothing)
          getNextInGroup _ _ (dst, Nothing) =
              (0, 0, ) <$> ckBl (head dst)  -- using head is safe here
                                            -- because dst cannot be []
          getNextInGroup s i (dst, Just rr) = do
              ns <- if s == 0 then select rr else return s
              ((i +) . length -> ni, headDef Nothing -> d) <-
                  span isNothing <$>
                      mapM ckBl (take (length dst - i) $
                                     drop (ns - 1 + i) $ cycle dst)
              return (ns, ni, d)
          advanceIdx 0 Ok = const 0
          advanceIdx 0 _ = const 1
          advanceIdx _ _ = succ
          ckBl d = do
              bl <- readIORef blacklist
              case M.lookup d bl of
                  Nothing -> return $ Just d
                  Just t -> do
                      now <- getCurrentTime
                      (fromIntegral -> bli) <- readIORef blInterval
                      if diffUTCTime now t > bli
                          then do
                              modifyIORef' blacklist $ M.delete d
                              return $ Just d
                          else return Nothing
          getNext k d = do
              (length -> nk, headDef (0, 0, Nothing) -> d) <-
                  span (\(_, _, b) -> isNothing b) <$>
                      mapM (getNextInGroup 0 0) (M.elems $ M.drop k d)
              return (d, k + nk)
          unblacklistAll =
              mapM_ $ mapM_ (modifyIORef' blacklist . M.delete) . fst

ngxExportIOYY 'getMsg

The function tries to find a backend in the routes data according to the received data: hint, label, sequential number etc. In case of any errors it calls writeFinalMsg and the nginx loop finishes. The data is sent from the nginx part as a message of type Msg and returned back from getMsg in the same type.

I won't explain how getMsg works in detail: it should be simple enough. Instead, I'll move on to the module's compilation requirements and then make some tests in an environment that emulates what was depicted in the image from the beginning of the article.

The requirements of lmr.hs are: ghc-8.0.1 or higher, containers-0.5.8.1 or higher, and modules bytestring, async, aeson, http-client, roundRobin, safe and ngx-export. They all can be installed with cabal, but make sure that all dependent modules were (re)installed against the appropriate version of the containers module. The lmr.hs can be compiled with command
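The exact command is missing from this copy of the article; following the compilation recipe from the nginx-haskell-module documentation, it was presumably along these lines (linking the eventlog flavor of the threaded GHC RTS; the library suffix must match your GHC version):

```
ghc -O2 -dynamic -shared -fPIC -eventlog -lHSrts_thr_l-ghc8.0.1 lmr.hs -o lmr.so
```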

if you want to collect haskell events to further analyze performance and GC (in this case you must uncomment the line with directive haskell rts_options -l in file nginx-lmr-http-rules.conf and make sure that the user of an nginx worker process, i.e. nginx, may write into the current directory).

My nginx build configuration is

Media managers listening on ports 8011, 8012 and 8013 are LMM 1, LMM 2 and LMM 3 in terms of the image, 8021 and 8022 are LMM 4 and LMM 5, and 8031 and 8032 are LMM 6 and LMM 7. Read and write requests to them return HTTP status 200 and a short message about their location, except for 8011 and 8012 which return status 404. Let the first nginx configuration shown in the article (with data Conf) be the configuration for LMR 1 — nginx-lmr-8010.conf. We also need configurations for LMR 2 and LMR 3 — nginx-lmr-8020.conf and nginx-lmr-8030.conf, which are clones of nginx-lmr-8010.conf with correctly adjusted values of partners and backends. As a superuser, move lmr.so to directory /var/lib/nginx and start nginx.

(Command jq is an excellent command-line JSON parser that can pretty-print and colorize JSON objects.) Here we see that both parts of the routes are empty and all 3 media managers have zero configuration. Anyway, let's try to read from the yet nonexistent dir_1.

Good, No routes found as expected; there is also a debug message with the worker's PID and the data Msg returned by function getMsg. Let's start the backends and wait at most 20 sec until the media router finds them.

(I no longer print HTTP headers to make the output more compact.) Label dir_10 behaves as expected, but label dir_3 gets found and then not found in a cycle. Is this what we expected? Yes! Because our backend is actually misconfigured. Remember that when a backend returns status 404, we expect that any other backend from this area (i.e. with the specific Possession key) will also return 404, and so we skip to another area. In this situation we have a single area and a directory dir_3 with 2 backends: it means that they are chosen in a round-robin manner (cycling of the start value in the debug message witnesses that). When backend 8011 gets chosen, it returns 404 and getMsg skips to the next (nonexistent) area without checking the other backends in this area, finally emitting the final message with status NonExistent. When the other backend 8013 gets chosen, it returns the good message In 8013. In the next request the round-robin mechanism chooses backend 8011, and so on. Let 8011 return 503: this must fix the issue, and 8013 must be chosen every time, with blacklisting of 8011.

Replace the line with return 404 in the read/write location of server 8011 in file nginx-lmr-backends.conf with return 503 and restart the backends only.

pkill -HUP -f '/usr/local/nginx/sbin/nginx.*nginx-lmr-backends.conf'

The behavior of dir_3 must alter immediately, because the change of the returned status did not change the structure of the routes and hence there is no need to wait.

Let's make 8011 return a good message with status 200. For that, uncomment the line with In 8011 in file nginx-lmr-backends.conf and comment out the next line with return 503 (which was return 404 in the beginning). Restart the backends only (with the specially crafted signal -HUP as shown above), wait 1 min (or not at all if the request comes to an unaffected nginx worker) and test dir_3 again.

Now they'll start to negotiate with each other every 20 sec per router. You will see that when tailing file /var/log/nginx/lmr-access.log with tail -f, or in a sniffer output. Let's look at how they see dir_1 for read requests.

By the way, you may notice that requests to the routers 8010 and 8020 receive sequential number 2 from getMsg whereas requests to 8030 receive 1. This is not a surprise: we started 8020 and 8030 after 8010, and it replaced its configuration only once. The router 8020 was quick enough to initialize its configuration before it was changed again by the start of 8030. The 8030 started after the others, and its configuration has not changed since then.

We still did not test several important scenarios here, for example addition of new labels and replacing timestamps in /conf responses from media managers, but I won't do that: the article is already too big and the tests shown are enough to give a taste of the proposed routing approach.

The source code and nginx configuration files for the tests can be found here.