Created an attachment (id=107)
support for new cpuset filenames
Hello,
this is a continuation of the following problem:
http://www.clusterresources.com/pipermail/torqueusers/2012-March/014336.html
I have the very same problem on Gentoo with a vanilla 3.2.14 kernel and
torque-3.0.5, but the solution given there doesn't help.
Any job fails to run because pbs_mom is unable to create a cpuset for
a job, pbs_mom.log:
05/01/2012 04:09:11;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
05/01/2012 04:09:11;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
05/01/2012 04:09:11;0001; pbs_mom;Job;5.master;ALERT: job failed phase 3 start
05/01/2012 04:09:11;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 5.master
05/01/2012 04:09:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
05/01/2012 04:09:11;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/01/2012 04:09:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
05/01/2012 04:09:11;0080; pbs_mom;Job;5.master;obit sent to server
05/01/2012 04:09:12;0080; pbs_mom;Job;5.master;removed job script
And in syslog:
May 01 04:09:11 [pbs_mom] LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 5.master
/sys/fs/cgroup/cpuset and /dev/cpuset are both mounted as cpuset
filesystem type:
$ mount | egrep "cpuset|cgroup"
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
openrc on /sys/fs/cgroup/openrc type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib64/rc/sh/cgroup-release-agent.sh,name=openrc)
none on /dev/cpuset type cpuset (rw)
- on /sys/fs/cgroup/cpuset type cpuset (rw)
Their contents are the same, and the file names in both carry the "cpuset."
prefix.
It looks like this change was made in the 3.0 kernel; at least it works on
2.6.38 and fails on 3.2.14. The kernel's Documentation/cgroups/cpuset.txt
since 3.0.y says that the "cpuset." prefix must be used.
I wrote a patch that accounts for the path changes depending on the Linux
kernel version. I verified that with this patch tasks run and CPU
restrictions are enforced by the scheduler.

Chris,
If you run mount, you'll see your cpuset vfs is mounted with the noprefix
option. The "modern way" is to mount -t cgroup -o cpuset in which case
you'll end up with the "cpuset." prefix on cpuset attributes.
David
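
To make the two mount styles easier to tell apart, here is a minimal sketch that scans mount-table lines for cpuset mounts and reports whether "noprefix" appears among the options. It is written in Python only for brevity and is not part of Torque. The function name find_cpuset_mounts is hypothetical, and it assumes input in the /proc/mounts field layout (source, mount point, type, options), which differs from the "on ... type ..." layout that mount prints.

```python
def find_cpuset_mounts(mount_lines):
    """Return (mount_point, fs_type, has_noprefix) for each cpuset mount.

    Assumes /proc/mounts-style lines: source, mount point, type, options, ...
    """
    found = []
    for line in mount_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        # Either the legacy "cpuset" type or a cgroup mount with the cpuset option.
        is_cpuset = fs_type == "cpuset" or (fs_type == "cgroup" and "cpuset" in options)
        if is_cpuset:
            found.append((mount_point, fs_type, "noprefix" in options))
    return found
```

On a live system it could be fed the open /proc/mounts file, since iterating a file yields its lines. The reported options are only a hint; the file names actually present in the mounted directory are what matter in the end.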

Hi David,
But if you want Torque to work unmodified you shouldn't do that. :-)
Breaking userspace is a bad thing so the noprefix behaviour is unlikely to go
away - here's a rant from Linus back in March on his attitude to breaking user
apps:
https://lkml.org/lkml/2012/3/8/495

I do not use noprefix option, thus ls shows "cpuset." prefixes.
There is no such thing as a stable kernel API and there are good reasons for
this. New applications will eventually use modern way of handling things, so
torque should adapt as well otherwise conflicts will occur sooner or later.
Anyway if you plan to stick to old file names at least for a while, please put
somewhere in the documentation, that people should use -o noprefix.

(In reply to comment #4)
> I do not use noprefix option, thus ls shows "cpuset." prefixes.
Neither do I, and ls does not show "cpuset." prefixes. The reason is that
you already have a cgroup filesystem mounted and I do not.
This change in behaviour is since the Linux kernel commit
f9ab5b5b0f5be506640321d710b0acd3dca6154a "cgroups: forbid noprefix if mounting
more than just cpuset subsystem".
I'll try and find some time to report this as a kernel regression to see what
their attitude to this is - to me it seems like the sort of ABI behaviour
change and consequent user space breakage that Linus hates.
> There is no such thing as a stable kernel API and there are good reasons for
> this.
You are confusing the *internal* kernel APIs (which are indeed unstable for
very good reason) with the external kernel ABIs exposed to user space, which
have different rules applied.
There has been an attempt to document the level of stability of interfaces in
the Documentation/ABI directory (see the README for Greg-KH's reasoning), but as
far as I can tell the cpuset/cgroup stuff has not been added yet.
> New applications will eventually use modern way of handling things, so
> torque should adapt as well otherwise conflicts will occur sooner or later.
Agreed, but Torque will need to know to cope with both cases dynamically.
> Anyway if you plan to stick to old file names at least for a while, please put
> somewhere in the documentation, that people should use -o noprefix.
Sounds like a good idea, I've just tested that on a RHEL5 system and it didn't
complain about not knowing what that meant.
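
As a rough illustration of coping with both cases dynamically, the sketch below probes the cpuset directory for each attribute under both names instead of looking at the kernel version. It is Python for brevity only, not the attached patch and not Torque code. The helper name cpuset_file is hypothetical, and the choice to try the prefixed name first is an arbitrary assumption.

```python
import os

def cpuset_file(cpuset_dir, name):
    """Return the path of cpuset attribute `name` in `cpuset_dir`, or None.

    Tries the prefixed name ("cpuset.cpus") first, then the legacy
    unprefixed name ("cpus"), and returns whichever exists.
    """
    for candidate in ("cpuset." + name, name):
        path = os.path.join(cpuset_dir, candidate)
        if os.path.exists(path):
            return path
    return None  # neither naming scheme is present in this directory
```

For example, cpuset_file("/dev/cpuset", "cpus") would return the "cpus" or "cpuset.cpus" path depending on how the filesystem was mounted, so the same lookup works whether or not a distribution has backported the change.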

(In reply to comment #6)
+1 for being annoyed that they'd break user applications. I don't know why
things like this are done.
> I'm not sure if I'm doing something wrong, or if my kernel just doesn't
> understand 'noprefix'. Either way, I think TORQUE should support both
> syntaxes.
>
> The proposed patch looks for a specific kernel version, but clearly RedHat has
> backported cgroups making that check incorrect.
We may well need to make this patch slightly more sophisticated to work in all
cases but it is a good patch. I wonder if hwloc already handles this or not?
Does anyone know if this is broken for the 4 series? I assume it is but since
we use hwloc they might solve it for us - anyone can wish, right?
From Adaptive's perspective we will want to fix this just to avoid the support
calls we'd have to field for not fixing it.