I'm surprised to see some of these machines demonstrating high memory usage and some not. In particular, these machines seem fine, with low memory usage and processes all having stayed up since they were last kicked:
sync1.web.scl2.svc.mozilla.com
sync1.web.phx1.svc.mozilla.com
sync2.web.phx1.svc.mozilla.com
sync3.web.phx1.svc.mozilla.com
sync4.web.phx1.svc.mozilla.com
Do these machines differ enough from the others to provide any clues? I vaguely recall :atoll mentioning memory problems on one RHEL platform but not another.

(In reply to Ryan Kelly [:rfkelly] from comment #1)
> sync1.web.scl2.svc.mozilla.com
Oh, pencil suggests that this machine is getting 0 qps, which would explain why it's not showing the same memory use pattern as the others :-)
The others I don't know about.

rfkelly : correct, sync1-4 in PHX1 are configured to only come into play if the cluster is dying, otherwise they get no traffic.
sync1 in SCL2 has been dead for some time, which explains its data (I believe)

rfkelly : here's a summary of what we're seeing. It looks like how long it takes to reach high memory utilization is highly variable (some processes take as little as 16 minutes to get to 1GB).
(03:37:26 PM) atoll: so when sync1..4.phx1 were having swap trouble, many weeks ago but this year since couchbase in may, i found that long-running processes had the 1GB plus ram usage
(03:37:46 PM) atoll: i deferred poking at it further until we deployed the new Sync code that bobm pushed a couple weeks ago
(03:38:03 PM) atoll: since analyzing memory issues in ancient stale code is not a very good use of time vs. analyzing it on new code
(03:38:34 PM) atoll: since ckolos reports we're still seeing issues, i *suspect* it's still "growth then plateau around 1.1GB", since it sounds like the new code appears not to have changed that profile
(03:40:19 PM) ckolos: so sync3.web.scl2
(03:40:23 PM) ckolos: pid 29149
(03:40:50 PM) ckolos: Virt is 2223m, rss is 2.0g, shared is 3116, stack (data) is 2.0gb
(03:41:50 PM) ckolos: other than sync1/5 all scl2 sync web heads have at least 1 gunicorn process taking more than 2gb of memory
(03:42:06 PM) ckolos: oop, damn you syn8
(03:42:28 PM) ckolos: okay sync8 doesn't have one over 2gb, but does have 3 over 1.2gb
(03:44:19 PM) ckolos: so... go fish.
(03:45:31 PM) atoll: any correlation between process age?
(03:46:00 PM) ckolos: likely some, but not definitively
(03:46:40 PM) ckolos: there are procs with 1 day of CPU time taking 1.2gb, while others with 3+ days are taking "only" 2.3gb
(03:47:01 PM) ckolos: so if so, it's not direct linear growth
(03:47:32 PM) ckolos: comparing phx and scl2 is even more frustrating
(03:47:51 PM) ckolos: where a proc with 2+ days of cpu time is only using 1.075 gb
(03:48:05 PM) ckolos: and another with 16 mins is using 1.046
(03:47:43 PM) atoll: yeah, i don't know why they're so variant yet :(
(03:47:54 PM) atoll: comparing sync5..7.phx1 to sync1..8.scl2 may help
(03:48:04 PM) atoll: and just ignore 1..4.phx1 since they're not in use most times
(03:48:14 PM) ckolos: this is on sync5.phx
(03:48:31 PM) atoll: maybe the initial memory burden for a worker is stable at 1GB after startup and a request or two
(03:48:55 PM) ckolos: possibly, but then that means that sync1-4 aren't used at *all*
(03:48:58 PM) atoll: correct
(03:49:03 PM) ckolos: b/c they're all around 90mb per proc
(03:49:26 PM) atoll: sync1..4.phx1 are set as "last resort" servers in the zeus pool, since if they're in active use they cause couchbase to swap out
(03:49:45 PM) atoll: once we have a couchbase hardware solution for scl2, it must also go to phx1
(03:49:54 PM) ckolos: really though, there's not enough running to come up with anything other than slightly-better-than-guesses
(03:50:12 PM) atoll: do the sync load tests show the same worker memory usage?
(03:50:22 PM) ckolos: unknown.
(03:50:26 PM) ckolos: where would that data be?
(03:50:36 PM) atoll: sync*.web.scl2.stage graphs and collection, if any
(03:50:55 PM) atoll: rfkelly is online and may be of further use here, in case he's ever observed memory usage previously
(03:56:54 PM) ckolos: none of the stage syncweb servers have gunicorn processes running that hot.
(03:57:13 PM) ckolos: highest use in stage is 94mb
(03:57:25 PM) ckolos: so I'm guessing no loadtests have been done in a while.
(03:57:37 PM) ckolos: most procs are dated aug 29

I tried about an hour of light load against stage this afternoon, and monitored the memory usage of two gunicorn processes - one which was freshly restarted, and one that had been alive since 29 August. RSS snapshots at 15-minute intervals:
        New Proc (kB)   Old Proc (kB)
t=0     41232           75856
t=15m   53580           75924
t=30m   54244           75912
t=45m   54836           75920
t=60m   55604           75908
So the memory usage does seem to slowly climb to a peak value as requests come in, then stay relatively steady at that level.
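For the record, here's roughly how those snapshots can be collected; this is just a sketch of the approach (reading VmRSS from /proc on the web heads), not the exact commands I ran:

import sys
import time

def read_rss_kb(pid):
    """Return the resident set size of `pid` in kB, from /proc/<pid>/status."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("no VmRSS entry for pid %d" % pid)

if __name__ == "__main__":
    # Usage: python sample_rss.py <pid> [<pid> ...]
    pids = [int(arg) for arg in sys.argv[1:]]
    while True:
        print("%s %s" % (time.strftime("%H:%M:%S"),
              " ".join("%d=%dkB" % (pid, read_rss_kb(pid)) for pid in pids)))
        time.sleep(15 * 60)  # 15-minute intervals, matching the table above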

Sorry, I see what you mean now. In the sync.conf file we have:
In PHX1 we have 610 lines like this:
[host:phx-sync609.services.mozilla.com]
In SCL2 we have 1320 lines like this:
[host:scl2-sync1320.services.mozilla.com]

As a first step, I'd like to make a new release and push it to stage with the following changes:
* memory-usage-dumping support from Bug 799874
* update all our dependencies to latest version
In particular I want to update SQLAlchemy, which is a whole minor version behind the current release (0.6.6 vs 0.7.9); the newer series has some known memory-usage improvements.
We can then throw some load at it and take periodic memory-usage dumps from one of the gunicorn worker processes. I can then analyse these dumps offline to get an idea of where the memory is being spent.
Will we have the Ops bandwidth for a push to stage sometime in the next few days?
If not then I can run my own tests, but I think memory-usage data from stage under full load will be significantly more useful than what I can simulate locally.
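For context, the memory-usage dumps I have in mind are along these lines. This is only an illustrative sketch (the real support is whatever lands from Bug 799874, and the file path and signal choice here are made up), but it shows the kind of data I want to diff offline:

import gc
import os
import signal
import time
from collections import Counter

def dump_memory_usage(signum, frame):
    # Count live objects by type so successive dumps can be diffed offline.
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    path = "/tmp/memdump-%d-%d.txt" % (os.getpid(), int(time.time()))
    with open(path, "w") as f:
        for name, count in counts.most_common():
            f.write("%s %d\n" % (name, count))

# Installed at worker startup.  SIGUSR2 is just for illustration; in practice
# pick a signal the gunicorn worker doesn't already handle, then trigger a
# dump with e.g. `kill -USR2 <worker pid>`.
signal.signal(signal.SIGUSR2, dump_memory_usage)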

Gene, in the config file you grepped in Comment 8 there should be a [storage] section. Can you please post (or email me if sensitive) the contents of that section, minus any passwords etc? I want to check for anything that might explain why memory usage on stage seems to be much better controlled than in production.
Stage has 160 [host:blah] sections vs 1320 in production, but the difference in memory usage between the two doesn't seem to scale with that number. Perhaps they have slightly different configurations in e.g. number of connections per pool.
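The thing I'm specifically interested in is whatever feeds into SQLAlchemy's connection-pool settings. For illustration only (the DSN and values here are placeholders, not taken from the real config), these are the knobs whose values I want to compare between stage and production:

from sqlalchemy import create_engine

# Illustrative only: the pool-related engine settings to compare.
engine = create_engine(
    "mysql://sync_user:***@db-host/sync",   # placeholder DSN
    pool_size=5,        # persistent connections kept per pool
    max_overflow=10,    # extra connections allowed under load
    pool_recycle=3600,  # recycle connections after an hour
)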

Bug 802486 identifies a cache-clearing issue that likely contributes to the high memory usage.
This issue would result in an empty dict being kept in memory for each unique userid ever encountered by the server. That's only on the order of ~300 bytes of memory per user, but we do serve a lot of users...
Probably not the whole story, but it's a solid start.
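To illustrate the pattern, here's a simplified sketch of the problem described in Bug 802486 (not the actual server code; the names are made up):

# Simplified sketch of the leak pattern, not the actual server code.
_per_user_cache = {}

def get_user_cache(userid):
    # Creates an entry the first time a userid is seen.
    return _per_user_cache.setdefault(userid, {})

def clear_user_cache(userid):
    # Buggy: empties the per-user dict but never removes the key, so an
    # empty dict stays resident for every userid the worker has ever seen.
    get_user_cache(userid).clear()
    # A fix would drop the key entirely:  _per_user_cache.pop(userid, None)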

We're currently watching out for this issue on the sync1.5 storage nodes, but I'm hopeful it won't be a problem in the one-box-per-node setup we're using now. So let's keep it open, but not a blocker.

> 4 years ago
>
> We're currently watching out for this issue on the sync1.5 storage nodes, but I'm hopeful it won't be a problem in
> the one-box-per-node setup we're using now. So let's keep it open, but not a blocker.
4 years later, I haven't heard any complaints about this, so I'm going to go ahead and close it out. :bobm please feel free to open a new bug if there are similar concerns on the sync1.5 server boxes.

Status: NEW → RESOLVED

Last Resolved: 6 months ago

Resolution: --- → WONTFIX
