sutch has asked for the
wisdom of the Perl Monks concerning the following question:

I'm about to design a web application in Perl, to run on an Apache server (Linux), which will require a large, possibly hundred megabyte, in-memory data structure. I would like to initialize the data structure once from a database or file, then allow updates to affect the structure, as well as the persistent datastore, so that all future HTTP processes can access and update the most recent data.

I've tried using mod_perl for this type of persistent, in memory data, but any data structures that are initially shared get copied once per process and become unshared as soon as the data is changed by a process.

Do any other Monks have experience with using shared Perl data structures among multiple Perl (Apache) processes? What Perl, Apache, and/or Linux options are available for handling this?

To directly answer your question, I should explain that
you are encountering a feature of Unix called copy-on-write
memory.

Apache 1.x on Unix uses a pre-forking model, whereby a
parent httpd process forks off a number of child httpds,
each of which handles one request at a time. When a Unix
process forks, another identical process is created. To
save memory, this new process shares all of its
data with the parent process. However, as soon as either
process changes a certain area of memory, separate copies
of that memory are created for each process. Hence the
name copy-on-write. This is what you are observing as data
becoming unshared.
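A tiny sketch makes the unsharing visible: after a fork, the child's write to a variable lands in its own private copy, and the parent never sees it.

```perl
# Minimal demonstration of copy-on-write turning into a private
# copy: the child's write to $number does not affect the parent.
use strict;
use warnings;

my $number = 392;

defined(my $pid = fork) or die "fork failed: $!";
if ($pid == 0) {        # child: shares pages with the parent...
    $number = 398;      # ...until this write forces a private copy
    exit 0;
}
waitpid $pid, 0;
print "parent still sees: $number\n";   # 392
```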

Aside from memory requirements, the technique you are using
suffers from another problem. Imagine you have 3 httpd
processes (called A, B and C) serving requests, each with
a name and phone number data structure. Initially, each
process holds the following information:

NAME PHONE
---- -----
Fingermouse 528
Bergen 392
Lobster 771

If process A receives a request to change Bergen's number to
398, the data structures in processes B and C will not be
affected. So, if a request to retrieve Bergen's number
reaches process C, it will report that Bergen's number is
still 392. Thus, it is important to share data sets between
processes if the data may be written to.

Thanks tomhukins, this looks to be what is needed. And your explanation of copy on write memory sheds light on why I was experiencing the unsharing of memory.

I do have another related question: are there any methods for sharing a process (or a Perl object) among processes? For example, I want one shared object to update the data structure, to ensure integrity and to write the changes back to the persistent storage. Or is there a better method for handling this than one shared object attempting to service many requests?

There are many ways to share data between processes. You
can use a local dbm file. You can use IPC::Shareable.
And so on. But all of the efficient ones have the rather
significant problem that all of the requests in your
series have to come back to the same physical machine.
This does not play well with load balancing.

However, one crazy way of doing it is like this: one
machine gets the request and forks off a local server.
Other CGI requests are passed the necessary information to
be able to access this temporary server, which is run in a
persistent process, and then when this server decides the
time is right, it de-instantiates itself. This would be a
lot of work though.

Personally I would just see if you can keep the temporary
state in the database, and just have each individual
request deal with the bit of the state that they need to
handle. But I cannot, of course, offer any guesses on
whether this would work without knowing more details than
you have given us.

Having tried myself to implement a semaphore-based locking mechanism for
SysV IPC shared memory, I'd recommend IPC::ShareLite for general purposes.
It comes with a powerful locking mechanism,
which is remarkably similar to flock()! Apache::SharedMem also depends on this module.

The handler prepares the data in the hash %my_data and calls the pnotes() method to store the data internally for other handlers to re-use. All subsequently called handlers can retrieve the stored data in this way:
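The original code sample did not survive, so here is a hedged sketch of what a pnotes()-based handler pair might look like under mod_perl 1 (the package names and data are invented for illustration). Note that pnotes() data lives only for the duration of a single request, so this shares data between handler phases, not between processes.

```perl
package My::DataLoader;
use strict;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;
    my %my_data = ( Fingermouse => 528, Bergen => 392, Lobster => 771 );
    $r->pnotes( my_data => \%my_data );   # stash for later phases
    return OK;
}

package My::DataUser;
use strict;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;
    my $data = $r->pnotes('my_data');     # retrieve the stored hashref
    $r->print("Bergen: $data->{Bergen}\n");
    return OK;
}

1;
```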

Other answers have described the copy-on-write of memory in Unix and the fact that you need to use shared memory (SysV SHM stuff or mmap'd files), and a good approach would be to use a wrapper around these things like Apache::SharedMem.

Shared memory is tricky stuff in much the same way that threads are tricky, since you open yourself up to race conditions where two processes alter the shared memory and violate each other's assumptions.

As a simple example, a process might increment a variable held in shared memory by 1 and assume that it has that value later on in the same routine, whereas another process might have incremented it in the meantime. Hard-to-find bugs which are cured by adding locks (semaphores, mutexes, whatever) to define critical sections of code which only one process at a time may execute. Ugh.
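For instance, the increment race just described disappears once the read-modify-write is wrapped in a lock. A minimal sketch using flock() on a scratch file (the file path and counts here are arbitrary):

```perl
# A minimal sketch: serializing a shared counter with flock() so
# that concurrent read-modify-write cycles cannot interleave.
use strict;
use warnings;
use Fcntl qw(:flock);

my $file = "/tmp/counter.$$";
open my $init, '>', $file or die $!;
print $init "0";
close $init;

sub bump {
    open my $fh, '+<', $file or die $!;
    flock $fh, LOCK_EX or die $!;   # critical section begins here
    my $n = <$fh>;
    seek $fh, 0, 0;
    truncate $fh, 0;
    print $fh $n + 1;
    close $fh;                      # closing releases the lock
}

# Two child processes each increment the counter 100 times.
for (1 .. 2) {
    defined(my $pid = fork) or die $!;
    if ($pid == 0) { bump() for 1 .. 100; exit 0 }
}
wait for 1 .. 2;

open my $fh, '<', $file or die $!;
my $final = <$fh>;
close $fh;
unlink $file;
print "final count: $final\n";   # 200 every time, thanks to the lock
```

Without the flock() call, two processes can read the same value, each add one, and write back, silently losing an update.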

There might be a simpler solution for you though. You mention that you want changes to your data to go directly to persistent store (i.e. on disk) but you also want your data to live in memory.

I'll assume that you want the data in memory for performance reasons - i.e. you don't want to suffer a disk access per request. But... operating systems are smart, and if you have sufficient RAM on your box (say, enough to hold the data structure you were talking of) and you are repeatedly accessing this data, then the OS should keep it all nicely in cache for you. So whilst you might be accessing hash values in a GDBM tie'd hash, the OS doesn't bother to touch disk. When you change data, the OS has the job of getting it to disk. If your data store is a relational database, similar arguments apply.
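A hedged sketch of the tied-hash idea, using SDBM_File since it ships with core Perl (GDBM_File works the same way if installed; the file path is arbitrary). Hot keys are served from the OS file cache; writes go through to the files on disk:

```perl
# A dbm-backed hash: lookups on frequently used keys are typically
# answered from the OS page cache, while updates land in the
# underlying files so other processes tying the same files see them.
use strict;
use warnings;
use Fcntl;
use SDBM_File;

my $base = "/tmp/phone.$$";
tie my %phone, 'SDBM_File', $base, O_RDWR | O_CREAT, 0644
    or die "tie failed: $!";

$phone{Fingermouse} = 528;
$phone{Bergen}      = 392;
$phone{Bergen}      = 398;   # the update goes straight to the store

my $bergen = $phone{Bergen}; # another process tying $base sees 398
untie %phone;
unlink "$base.pag", "$base.dir";
print "Bergen: $bergen\n";
```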

The nice thing about this is that you get it for free. You still need to be careful in that different processes may change the underlying data store at a time which might be inconvenient for the other processes - this is where atomic transactions on databases come into play...

There might be other reasons why you want the memory structure, but I thought it was worth a thought.

You are correct, the shared memory structure is for performance reasons. It is for an application that I expect to be accessed often. The queries against the database are complex and will probably overload the database server so much that the required performance will not be met with a database alone.

Your GDBM idea sounds good enough, as long as the OS can be made to share the cache among all of the processes. Will the GDBM tied hash be automatically shared (through the OS), or does that need to be shared using shared memory? Or does this method require that each process have a separate tied hash?

The OS-level cacheing I mentioned was simply good old file-level cacheing. If your data store is held in files accessed through the file system (as is the case for simple databases like GDBM, flat file, etc) then often-used data is kept around in RAM - shared between processes.

You still need to spend some CPU cycles in doing lookups, etc but you don't spend any I/O - which is a win.

OK - so your back end data store is in a database which you wish to protect from the load which your hits are likely to generate. Do you know for certain this is going to be a problem? If not can you simulate some load to see?

Presumably you don't want to cache writes - you want them to go straight to the DB.

So you want a cacheing layer in front of your data which is shared amongst processes and invalidated correctly as data is written.

I don't know which DB you are using but I would imagine most/many would have such a cacheing layer. If this isn't possible or it doesn't stand up to your load then the route I would probably choose is to funnel all the DB access through one daemon process which can implement your cacheing strategy and hold one copy of the cached info.

But I wouldn't do that until it was clear that I couldn't scale my DB to meet my load or do something else...say regularly build read-only summaries of the complex queries.

I guess it all kind of depends on the specifics of your data structures...sorry to be vague. There is a good article describing a scaling-up operation at
webtechniques which seems informative to me.

One technique that comes to mind is to stamp each record
in the disk database as to when it was most recently changed, with the timestamp an indexed field.

When the server responds to a page request, it would first query the database for records changed since this process's last page request, and weave them into the in-memory data structure before running the requested query in memory.

This approach would task the database server with maintaining
a consistent state among processes, with each process synchronizing its in-memory state against that consistent state at the start of each page request.
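The resync logic above can be sketched in plain Perl, with a hash standing in for the timestamped database table (all names and timestamps here are invented for illustration):

```perl
# Sketch of timestamp-based resync: %db stands in for the disk
# database (record => [value, mtime]); %cache is one process's
# in-memory structure, refreshed from records stamped after the
# last sync.
use strict;
use warnings;

my %db = (
    Fingermouse => [528, 10],
    Bergen      => [392, 10],
    Lobster     => [771, 10],
);
my %cache     = map { $_ => $db{$_}[0] } keys %db;
my $last_sync = 10;

# Some other process updates Bergen's number at time 42.
$db{Bergen} = [398, 42];

# At the start of each page request: pull in anything newer than
# $last_sync (an indexed timestamp column makes this query cheap).
sub refresh {
    my $now = shift;
    for my $rec (keys %db) {
        $cache{$rec} = $db{$rec}[0] if $db{$rec}[1] > $last_sync;
    }
    $last_sync = $now;
}

refresh(42);
print "Bergen: $cache{Bergen}\n";   # 398 after the sync
```

Against a real database this loop would be a single indexed query such as "select records where mtime > last_sync", so each request pays only for the rows that actually changed.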