Railski

Friday, June 10, 2011

Speeding Up Rails Load Time

The rake and mongrel start-up times increased quite a bit in my production environment after upgrading a large app from Rails 2.3.4 to Rails 3.0.7. I've read a lot about Rails 3 generally starting up slower, but what I was seeing seemed like more than that.

In production, my gems and code are stored on a shared disk mounted via NFS such that the exact same code can be shared between multiple machines. I suspected the start-up problem is related to NFS.

I did a quick test with rake db:migrate to see how many file stats it performs. In Rails 2, the rake did 280k. In Rails 3, the rake did 480k. Wow! No wonder NFS becomes overwhelmed, especially when multiple of these are running at the same time. So I started to look for the cause behind this huge increase.

Almost all of these file stats are caused by ruby scanning its load path trying to locate files for require calls. The load path in my application has 100 directories in it, which means require 'foo.rb' will on average look at 50 directories.

In general, you don't see require 'foo.rb' but rather require 'foo' (without the extension). This makes the problem worse, as ruby not only needs to traverse the directories in the load path, it also needs to guess the file extension in each directory. The extensions tried are system specific, but at least on Linux they are .rb and .so. So in fact, require 'foo' will on average look at 100 files.

I don't know about you, but 99.9% of my files are .rb, so looking for .so in the vast majority of cases is completely useless. So I overrode require to force ruby to first scan the load path looking for .rb files and, if nothing is found, fall back to the original behavior.

module RequirePatch
  def require(file)
    unless file.to_s =~ /\.[a-z]+$/
      begin
        super("#{file}.rb")
      rescue LoadError
        super
      end
    else
      super
    end
  end
end
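For the patch's super calls to reach the original Kernel#require, the module has to sit ahead of Kernel in the method lookup chain. On ruby 1.8 (no Module#prepend yet), one way to do that is to include the module into Object. This is a sketch of the idea, repeating the module so the snippet runs standalone; it is not necessarily the exact installation used in the app:

```ruby
module RequirePatch
  def require(file)
    if file.to_s =~ /\.[a-z]+$/
      super  # explicit extension: use ruby's normal lookup
    else
      begin
        super("#{file}.rb")  # try the overwhelmingly common case first
      rescue LoadError
        super  # fall back to ruby's normal extension search (.so, etc.)
      end
    end
  end
end

# Object -> RequirePatch -> Kernel: `require` now hits the patch first,
# and `super` forwards to the original Kernel#require.
Object.class_eval { include RequirePatch }

require 'set'  # resolved via the patched lookup
puts Set.new([1, 2]).include?(2)  # => true
```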
This drops the number of file stats to 160k! Note, this is great but has nothing to do with Rails 2 vs Rails 3, which is really what I wanted to answer.

I realized that in the Rails 2 version of the application, rails lived in vendor/rails, while in the Rails 3 version, rails comes from the Gemfile. This difference causes the paths belonging to rails (active support, active record, etc.) to end up in different positions in the load path. Here is why:

The load path is initially set up by Bundler based on the Gemfile. It does so by building a dependency graph of all the gems and placing the dependents of a gem before that gem in the load path (presumably this is to allow dependents to override functionality by naming files in the same way as the gem they depend on ... although this seems like a terrible idea). This means gems that have lots of dependents tend to end up towards the end of the load path, and gems without any dependents tend to end up towards the beginning. Since the rails gems are popular and have lots of dependents, Bundler puts them towards the end of the load path.

After Bundler is done with the load path, Rails starts modifying the load path (usually by prepending) during its initialization process. One of the directories that is prepended, if it exists, is vendor/rails.

So using vendor/rails causes rails to be towards the front of the load path and using Gemfile causes rails to be towards the end of the load path. Since so many files come from rails, this causes the large difference in file stats.
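The effect of load-path position can be sketched with a tiny self-contained example (the paths are made up for illustration):

```ruby
# Made-up 96-entry load path: the framework directory sits at the very end.
load_path = Array.new(95) { |i| "app/dir#{i}" } + ['gems/activesupport/lib']

# Number of directories ruby probes before finding a file that lives in
# the given directory (one probe per directory).
def probes(load_path, dir)
  load_path.index(dir) + 1
end

puts probes(load_path, 'gems/activesupport/lib')  # => 96

# Moving the directory to the front makes the same lookup nearly free:
load_path.unshift(load_path.delete('gems/activesupport/lib'))
puts probes(load_path, 'gems/activesupport/lib')  # => 1
```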

Knowing this, I wrote some additional code to move the rails paths to the front of the load path. Here is how it looks,

rails_gems  = ['activeresource', 'actionpack', 'activerecord', 'activemodel', 'activesupport']
rails_paths = rails_gems.map { |rails_gem| $LOAD_PATH.detect { |path| path.include?(rails_gem) } }
rails_paths.compact.each { |path| $LOAD_PATH.unshift(path) } # put in front
Putting the two solutions together, the number of file stats dropped to 120k. No more NFS-related problems in production, and a few seconds were shaved off the start-up time.

Thursday, May 19, 2011

Proving Rails memory leak with gdb

I am no gdb expert by any means, but I thought I'd give it a try to see if I could identify the cause behind the lingering controller instances from the previous post. The basic approach is to catch the GC in the act of marking a controller instance and follow the chain to figure out why it is being marked (since marked objects are not swept).

A couple of setup steps. First, I re-compiled ruby with debugging symbols, which turned out to be trivial thanks to rvm. Second, I removed all the memprof stuff from my simple test app.

With that out of the way, I can start the app with rails server -e production and get it in the right state with,

curl http://localhost:3000/authors/new
curl http://localhost:3000/authors/new
curl http://localhost:3000/authors/new
curl http://localhost:3000/final_gc
Finally I can attach to this running process with gdb.

I know there is an instance of AuthorsController somewhere on the heap. So, I can look up the class object for AuthorsController with,

(gdb) call rb_const_get(rb_cObject, rb_intern("AuthorsController"))
$14 = 4315109640
and set a conditional breakpoint in the rb_gc_mark function in gc.c to look for objects of the class I'm interested in,

(gdb) break gc.c:1361
(gdb) cond 1 (rb_type(ptr) == 2 && rb_class_real(rb_class_of(ptr)) == 4315109640)
(gdb) c
Now I hit the server with curl http://localhost:3000/final_gc and wait ... the condition on the breakpoint makes the GC process much slower, but eventually it stopped,

Breakpoint 1, rb_gc_mark (ptr=4377820520) at gc.c:1361
warning: Source file is more recent than executable.
/Users/pkmiec/.rvm/src/ree-1.8.7-2010.02/source/gc.c:1361:30666:beg:0x100043cb4
(gdb) bt
#0 rb_gc_mark (ptr=4377820520) at gc.c:1361
#1 0x000000010003064b in blk_mark (data=0x102a50b00) at eval.c:8596
#2 0x00000001000443e4 in gc_mark_children (ptr=4377672960) at gc.c:1570
#3 0x0000000100043ce7 in rb_gc_mark (ptr=4377673120) at gc.c:1366
#4 0x0000000100043b6f in mark_entry (key=66906, value=4377673120) at gc.c:1293
#5 0x00000001000a959c in st_foreach (table=0x10250df70, func=0x100043b56 <mark_entry>, arg=0) at st.c:487
#6 0x0000000100043b9e in rb_mark_tbl (tbl=0x10250df70) at gc.c:1302
#7 0x00000001000442ce in gc_mark_children (ptr=4315109640) at gc.c:1538
#8 0x0000000100043ce7 in rb_gc_mark (ptr=4315109640) at gc.c:1366
#9 0x000000010004434d in gc_mark_children (ptr=4315134280) at gc.c:1551
...
First, I verified the breakpoint condition,

(gdb) call ((struct RString *)classname(rb_class_real(4315109640)))->ptr
$5 = 0x10250df50 "AuthorsController"
What caught my eye next is frame #8 ... rb_gc_mark of 4315109640 ... but wait, 4315109640 is the class. So luckily the "distance" between something known and the instance is very short! Then I tried to decipher what frame #4 means,

(gdb) call rb_id2name(66906)
$8 = 0x102731700 "@_config"
(gdb) call rb_type(4377673120)
$2 = 11 # as.hash
(gdb) p ((RVALUE *)4377673120)->as.hash
$15 = {
  basic = {
    flags = 16971,
    klass = 4315911160
  },
  tbl = 0x10047aa40,
  iter_lev = 0,
  ifnone = 4377672960
}
So, the class has an instance variable called @_config and its value is a hash. You can see this is in fact true by looking in active_support (3.0.7) at configurable.rb line 12. Frame #3 says this hash is being marked and frame #2 says 4377672960 is marked. Notice 4377672960 is the ifnone of the hash ... so in ruby terms this is the block passed to Hash.new. This is exactly how ActiveSupport::InheritableOptions.new works (which is what is assigned to @_config).

(gdb) call rb_type(4377672960)
$7 = 34
(gdb) p ((RVALUE *)4377672960)->as.data
$11 = {
  basic = {
    flags = 98,
    klass = 4331614640
  },
  dmark = 0x1000305f1 <blk_mark>,
  dfree = 0x1000306ea <blk_free>,
  data = 0x102a94c90
}
I didn't know what data is :(, so I looked at the blk_mark function at eval.c:8596,

static void
blk_mark(data)
    struct BLOCK *data;
{
    while (data) {
        rb_gc_mark_frame(&data->frame);
        rb_gc_mark((VALUE)data->scope);
        rb_gc_mark((VALUE)data->var);
        rb_gc_mark((VALUE)data->body);
        rb_gc_mark((VALUE)data->self);
        rb_gc_mark((VALUE)data->dyna_vars);
        rb_gc_mark((VALUE)data->cref);
        rb_gc_mark(data->wrapper);
        rb_gc_mark(data->block_obj);
        data = data->prev;
    }
}
Bingo! A request comes to /authors/new and is routed to an instance of AuthorsController. That instance renders a view, and that view asks the controller instance for its config. The controller instance asks the controller class for its config. The AuthorsController hasn't initialized its config yet, so it creates an instance of ActiveSupport::InheritableOptions (which extends OrderedOptions, which extends OrderedHash). ActiveSupport::InheritableOptions.new is implemented as super() { |h,k| parent[k] }. That block creates a closure which captures self, which is the controller instance. Here is where I smell a ruby bug, as I don't understand why self at this point is the controller instance.
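The capture can be reproduced outside Rails in a few lines of ruby. The names here are illustrative, not the actual Rails internals: a hash is memoized at the class level from within an instance method, and the Hash.new block quietly closes over the instance that created it.

```ruby
class Widget
  def config
    # The hash is memoized at the class level, but the ifnone block is
    # created inside this instance method -- it closes over `self`,
    # keeping this Widget instance alive as long as the hash is.
    @@config ||= Hash.new { |h, k| h[k] = nil }
  end
end

w = Widget.new
h = w.config

# The block's binding still references the instance that created it:
captured = eval("self", h.default_proc.binding)
puts captured.equal?(w)  # => true
```

Any later Widget instance gets the same hash back, but the first one stays reachable through the hash's default_proc, which is exactly the shape of the leak above.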

This makes sense. The controller instances left in memory are the ones that perform lazy initialization and that lazy initialization creates scopes which are assigned to global or class level variables. This is why eager loading at start-up works.

Tuesday, May 17, 2011

Rails3 leaks (more) memory?

I'm involved in upgrading a rather large Rails app from Rails 2.3.4 to Rails 3.0.7. I say rather large because this app has about 150 active record models and another 200 classes that use the active model functionality. It also uses 90ish gems and plugins. The app is deployed with mongrels and REE 1.8.7 (2010.02 .. 32bit) with GC tuned.

With Rails2, a mongrel initially uses about 210MB of memory and eventually grows to high 200's as requests force additional parts of the app to be loaded. Monit is set to restart anything that goes beyond 300MB, which happens periodically but at a manageable rate (i.e. once per hour).

With Rails3, a mongrel initially uses about 190MB of memory and is restarted by Monit after a few requests. With Monit turned off, the mongrel process quickly grows to the 400s with the initial few requests to a few different controllers, and then to the high 500s without even hitting the more obscure parts of the app.

Something is wrong!

Additional experiments revealed that it's not a leak in the traditional sense, meaning hitting the same url over and over does not grow the memory. Only the first, and sometimes also the second, request causes memory to jump. This points to the memory growth being related to Rails' lazy loading. Rails3 (much more so than Rails2) defers lots of initialization until "later" to help achieve a faster start-up time. That "later" is usually when a request is being handled, at which point Rails initializes whatever is needed to complete the request. For example, even though the User model is required at start time, the columns are not fetched from the db and the attribute methods are not created until they are needed.

I configured the app such that I can run it in production mode on my MacBook. I added memprof to the Gemfile. I created the following dead simple selenium script,

login # /sessions/new, fill out form, redirected to /dashboards
open_and_wait("/owners/new")
open_and_wait("/owners/new")
open_and_wait("/owners/new")
open_and_wait("/owners/new")
open_and_wait("/final_gc")
This script hits the SessionsController, DashboardsController, and OwnersController. The final_gc endpoint is a middleware right after Rack::Lock which just runs the GC and returns 200. Afterwards, the heap is dumped via kill -URG <pid> as provided by the memprof gem.
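A minimal version of such a middleware could look like this (the class name and response body are illustrative, not the exact code from the app):

```ruby
# Sketch of a final_gc middleware: forces a GC when its path is hit,
# otherwise passes the request down the stack.
class FinalGc
  def initialize(app)
    @app = app
  end

  def call(env)
    if env['PATH_INFO'] == '/final_gc'
      GC.start  # collect garbage before the heap is dumped/inspected
      [200, { 'Content-Type' => 'text/plain' }, ['GC done']]
    else
      @app.call(env)
    end
  end
end
```

In a Rails 3 app it would be mounted with something like config.middleware.insert_after Rack::Lock, FinalGc.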

I was surprised to find instances of the controllers on the heap (and therefore everything the controllers reference: env, templates, etc). Let's focus on the OwnersController. That controller was hit 4 times and two instances were hanging around. In fact, they were the instance that handled the first request and the instance that handled the second request. The third and fourth requests did not leave controller instances behind. This behavior is very consistent.

At first, I was convinced the complexity of the app was causing the problem (i.e. some monkey patch or custom functionality). But the more I poked around the less it seemed related to the app itself.

But the question of WHY still remains. I believe it has to do with some of the craziness of lazy-initialization (i.e. eigen / singleton classes created by attribute_methods, scope capturing in resolver.rb with Hash.new { }, etc). But the heap dump does not show it ... which leads me to believe the problem is somehow fundamental to the ruby VM.

I've tried both mysql and sqlite3. I've tried mongrel and webrick. I've tried ruby-1.8.7-p334, ree-1.8.7-2010.02, and ree-1.8.7-2011.03 (memprof does not work with 1.9 yet). All show the same behavior.

Monday, April 21, 2008

Background Jobs

Like probably most web applications out there, we need to run jobs in the background. Some jobs just run periodically (every hour, or on the first of each month) while others are spawned by requests that need to perform a long computation (generating pdfs, or processing credit cards), allowing the client to see a response immediately and check back later for the result of the long computation.
In the Rails world, the long requests are especially painful as each rails process is single threaded (i.e. one request at a time). So, having the ability to offload long requests allows the rails process to handle other requests while the long computation is done in the background.
For a while we tried the completely rewritten backgroundrb 1.x. Kudos to Hemant for the nice work. However, I did not feel backgroundrb was the right solution for us for the following reasons,

We have multiple mongrels spread over multiple machines for redundancy. In order for one mongrel on one machine to start a job and for another mongrel on another machine to get the result of the job, all mongrels need to talk to a single backgroundrb server.

We use monit to watch mongrels. Since the backgroundrb server starts the workers, it is not possible for monit to watch those workers, nor is it clear how to restart a worker if it were to die.

The backgroundrb jobs are not persistent. So when the backgroundrb server goes down the whole state is lost.

Searching around, I found background-fu by Jacek Becela (Polish roots? :)). In a nutshell, background-fu uses a simple active record model to persist jobs: the app creates jobs, a daemon executes those jobs, and the app queries the status of the jobs (essentially a home-grown queue).
Out of the box, background-fu solves the problems we had with backgroundrb. Namely, since all communication is done via the database, any mongrel on any machine has access to jobs. Monit can start and monitor the background-fu daemon. The jobs are persisted in the database. Great!
Having the right underlying model in place, we went ahead and extended background-fu with the following,

Added lock_version to jobs, which allows me to run multiple daemons and ensure that a job is executed by only one daemon.

Added ability to execute a job at a particular time in the future.

Added support for periodic jobs with interval and cron triggers (borrowed from backgroundrb).
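The lock_version trick boils down to optimistic locking. Here is a stripped-down, pure-ruby sketch of the principle; the real thing uses ActiveRecord's built-in lock_version column (raising ActiveRecord::StaleObjectError on conflict), and the class below is only an illustration:

```ruby
class StaleJobError < StandardError; end

# Toy model of optimistic locking: each daemon remembers the version it
# read, and claiming the job only succeeds if nobody bumped it since.
class Job
  attr_reader :lock_version

  def initialize
    @lock_version = 0
  end

  def claim(seen_version)
    raise StaleJobError unless seen_version == @lock_version
    @lock_version += 1  # the winning daemon bumps the version
  end
end

job = Job.new
version = job.lock_version

job.claim(version)  # first daemon wins the job

begin
  job.claim(version)  # second daemon loses: the version has moved on
rescue StaleJobError
  puts "job already claimed by another daemon"
end
```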