Comments

On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
>> From: Paul Moore <paul@paul-moore.com>>> If the point of the lost_reset test is to flood the system with audit> records, why are we restricting ourselves with a filter? Let's log> everything.>> Signed-off-by: Paul Moore <paul@paul-moore.com>> ---> tests/lost_reset/test | 4 ++--> 1 file changed, 2 insertions(+), 2 deletions(-)
This resolves a problem I've been seeing with a fairly stubborn
system. I would be very curious to hear if any others were
experiencing problems with the lost_reset that this patch resolves.

On 2018-12-13 18:23, Paul Moore wrote:
> On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:> > If the point of the lost_reset test is to flood the system with audit> > records, why are we restricting ourselves with a filter? Let's log> > everything.> >> > tests/lost_reset/test | 4 ++--> > This resolves a problem I've been seeing with a fairly stubborn> system. I would be very curious to hear if any others were> experiencing problems with the lost_reset that this patch resolves.
This saddistic test would have brought the system to its knees before
the queue rework. This still works fine on my system, so this is
progress. I suspect that something weird was going on with your test
such that it wasn't getting the right PID. I don't know how this is
possible, but I have seen this same thing recently on the FUSE umount2
test where I needed to bump up the PID by one to catch the right task.
> paul moore
--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-12-13 18:23, Paul Moore wrote:> > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:> > > If the point of the lost_reset test is to flood the system with audit> > > records, why are we restricting ourselves with a filter? Let's log> > > everything.> > >> > > tests/lost_reset/test | 4 ++--> >> > This resolves a problem I've been seeing with a fairly stubborn> > system. I would be very curious to hear if any others were> > experiencing problems with the lost_reset that this patch resolves.>> This saddistic test would have brought the system to its knees before> the queue rework. This still works fine on my system, so this is progress.
Yes, I'll admit it is a bit extreme. The good news is that the system
survives multiple runs of this without falling over (and remains
responsive), so that's a win :)
> I suspect that something weird was going on with your test> such that it wasn't getting the right PID. I don't know how this is> possible, but I have seen this same thing recently on the FUSE umount2> test where I needed to bump up the PID by one to catch the right task.
The wrong PID idea struck me as a bit curious, and since I have one
system where I can reproduce this reliably I thought I would play
around a bit more and see what I could find. The results sent me down
a bit of a rathole but the results are interesting ...
On my test system the PID of the flood ping was being recorded
correctly, every time I ran it with some instrumentation added - and I
ran it a quite a few times. While you may have seen this happen in
practice, but that doesn't seem to be the case on this particular
system.
As an aside, have you spent any time trying to debug that wrong PID
problem? While not strictly audit related, that seems like a pretty
serious Bash bug. Or maybe it's a problem with the test? I vaguely
remember a discussion between you and Ondrej and some confusion around
which Bash variable to use to fetch PIDs, but I may be mistaken.
Moving on from the PID idea I decided to dig a bit deeper into when
the kernel registers a lost record, and I think I found the core
issue. The kernel logs lost records, and bumps the counter, via the
audit_log_lost() function. The audit_log_lost() function is called
from a number of places (most uninteresting for this discussion: error
handlers and the like), but I think the callers we are most interested
in are the following:
* audit_log_start()
* kauditd_hold_skb()
Let's start with audit_log_start(). It calls audit_log_lost() in two
places: one place is allocation error handling code (not relevant) and
the other is if we are bumping into the backlog limit (relevant, and
interesting). In the latter case audit_log_lost() is called and
audit_log_start() returns NULL, effectively dropping the record on the
floor, lost forever.
Moving on to kauditd_hold_skb(), things are a bit more complicated.
As a reminder, kauditd_hold_skb() is the "point of last resort" for
audit records; if we are here it means there is no auditd connected.
The first thing kauditd_hold_skb() does it write the record to the
console since this may be the only way the record ever sees userspace.
Beyond that if "audit=1" was not specified on the kernel command line
we drop the record, if "audit=1" was specified on the command line we
try to queue the record but if the queue is full we call
audit_log_lost(). Ultimately I believe it was this logic, and a race
between the kauditd_thread() and the record generation, which caused
my system to never hit the backlog limit in audit_log_start() which
would have caused it to call audit_log_lost(). On my problematic
system, booting with "audit=1" caused the existing lost_reset test to
pass without problem.
Make sense?
This brings up to the next step: how do we want to address this?
Prior to the queue rework that started in v4.10 things were a bit
simpler and it looks like we always registered a lost record
independent of the "audit=?" setting on the kernel command line and
the current queue backlog. While this would have made this test
easier, it could result in some over counting problems in the cases
where an auditd instance came along and read the "lost" records from
the queue. I don't think reverting to this behavior is ideal.
I'm also not certain that recording lost records in the *not*
"audit=1" case is a good solution either. In the case where the
system is not running an audit daemon they are almost guaranteed to
hit the backlog at some point and then their system will start spewing
scary looking audit log lost messages; we would surely win lots of
friends this way.
We could move the "audit=1" check (really it's an audit_default check)
into audit_log_lost() and use it to squelch the printk() and then call
audit_log_lost in kauditd_hold_skb() if we can't queue the record.
This should preserve an accurate lost record count (well, until it
rolls over, but that's not a new concern), prevent unnecessary scary
lost record messages, and ensure a consistent audit_log_lost()
operation (all the other callers I didn't mention in this mail).
Or the simplest option is to just ignore this and require that the
audit-testsuite be run on a system booted with "audit=1" :)
I'm currently leaning towards moving the "audit=1" check into
audit_log_lost(), what do you guys think?

On 2018-12-14 10:53, Paul Moore wrote:
> On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:> > On 2018-12-13 18:23, Paul Moore wrote:> > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:> > > > If the point of the lost_reset test is to flood the system with audit> > > > records, why are we restricting ourselves with a filter? Let's log> > > > everything.> > > >> > > > tests/lost_reset/test | 4 ++--> > >> > > This resolves a problem I've been seeing with a fairly stubborn> > > system. I would be very curious to hear if any others were> > > experiencing problems with the lost_reset that this patch resolves.> >> > This saddistic test would have brought the system to its knees before> > the queue rework. This still works fine on my system, so this is progress.> > Yes, I'll admit it is a bit extreme. The good news is that the system> survives multiple runs of this without falling over (and remains> responsive), so that's a win :)> > > I suspect that something weird was going on with your test> > such that it wasn't getting the right PID. I don't know how this is> > possible, but I have seen this same thing recently on the FUSE umount2> > test where I needed to bump up the PID by one to catch the right task.> > The wrong PID idea struck me as a bit curious, and since I have one> system where I can reproduce this reliably I thought I would play> around a bit more and see what I could find. The results sent me down> a bit of a rathole but the results are interesting ...> > On my test system the PID of the flood ping was being recorded> correctly, every time I ran it with some instrumentation added - and I> ran it a quite a few times. While you may have seen this happen in> practice, but that doesn't seem to be the case on this particular> system.> > As an aside, have you spent any time trying to debug that wrong PID> problem? While not strictly audit related, that seems like a pretty> serious Bash bug. Or maybe it's a problem with the test? I vaguely> remember a discussion between you and Ondrej and some confusion around> which Bash variable to use to fetch PIDs, but I may be mistaken.
I haven't spent much time trying to debug that bash PID increment issue,
but it perplexed me since it was the identical technique used in
multiple tests in the audit-testsuite that has never caused an issue
previously on any of the same set of test machines. This was for the
missing mount umount2 hang bug test
https://github.com/linux-audit/audit-testsuite/pull/76
> Moving on from the PID idea I decided to dig a bit deeper into when> the kernel registers a lost record, and I think I found the core> issue. The kernel logs lost records, and bumps the counter, via the> audit_log_lost() function. The audit_log_lost() function is called> from a number of places (most uninteresting for this discussion: error> handlers and the like), but I think the callers we are most interested> in are the following:> > * audit_log_start()> * kauditd_hold_skb()> > Let's start with audit_log_start(). It calls audit_log_lost() in two> places: one place is allocation error handling code (not relevant) and> the other is if we are bumping into the backlog limit (relevant, and> interesting). In the latter case audit_log_lost() is called and> audit_log_start() returns NULL, effectively dropping the record on the> floor, lost forever.> > Moving on to kauditd_hold_skb(), things are a bit more complicated.> As a reminder, kauditd_hold_skb() is the "point of last resort" for> audit records; if we are here it means there is no auditd connected.> The first thing kauditd_hold_skb() does it write the record to the> console since this may be the only way the record ever sees userspace.> Beyond that if "audit=1" was not specified on the kernel command line> we drop the record, if "audit=1" was specified on the command line we> try to queue the record but if the queue is full we call> audit_log_lost(). Ultimately I believe it was this logic, and a race> between the kauditd_thread() and the record generation, which caused> my system to never hit the backlog limit in audit_log_start() which> would have caused it to call audit_log_lost(). On my problematic> system, booting with "audit=1" caused the existing lost_reset test to> pass without problem.> > Make sense?> > This brings up to the next step: how do we want to address this?> > Prior to the queue rework that started in v4.10 things were a bit> simpler and it looks like we always registered a lost record> independent of the "audit=?" setting on the kernel command line and> the current queue backlog. While this would have made this test> easier, it could result in some over counting problems in the cases> where an auditd instance came along and read the "lost" records from> the queue. I don't think reverting to this behavior is ideal.> > I'm also not certain that recording lost records in the *not*> "audit=1" case is a good solution either. In the case where the> system is not running an audit daemon they are almost guaranteed to> hit the backlog at some point and then their system will start spewing> scary looking audit log lost messages; we would surely win lots of> friends this way.> > We could move the "audit=1" check (really it's an audit_default check)> into audit_log_lost() and use it to squelch the printk() and then call> audit_log_lost in kauditd_hold_skb() if we can't queue the record.> This should preserve an accurate lost record count (well, until it> rolls over, but that's not a new concern), prevent unnecessary scary> lost record messages, and ensure a consistent audit_log_lost()> operation (all the other callers I didn't mention in this mail).> > Or the simplest option is to just ignore this and require that the> audit-testsuite be run on a system booted with "audit=1" :)> > I'm currently leaning towards moving the "audit=1" check into> audit_log_lost(), what do you guys think?
I'll need some time to digest all this. My first reaction is that
requiring "audit=1" for the audit-testsuite is not the right answer and
as we've discussed previously, it makes sense to run the entire
testsuite both with and without "audit=1".
> paul moore
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

On Fri, Dec 14, 2018 at 11:12 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-12-14 10:53, Paul Moore wrote:> > On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:> > > On 2018-12-13 18:23, Paul Moore wrote:> > > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
...
> > As an aside, have you spent any time trying to debug that wrong PID> > problem? While not strictly audit related, that seems like a pretty> > serious Bash bug. Or maybe it's a problem with the test? I vaguely> > remember a discussion between you and Ondrej and some confusion around> > which Bash variable to use to fetch PIDs, but I may be mistaken.>> I haven't spent much time trying to debug that bash PID increment issue,> but it perplexed me since it was the identical technique used in> multiple tests in the audit-testsuite that has never caused an issue> previously on any of the same set of test machines. This was for the> missing mount umount2 hang bug test> https://github.com/linux-audit/audit-testsuite/pull/76
Ah, I think I see the problem. Unless I'm mistaken, you are talking
about the shell/Bash command where the tests print the PID using "echo
$$" or similar, yes? As you likely already know, in Bash (and likely
other shells as well), $$ is the PID of the running Bash process; in
the two places where I saw it used in the existing audit-testsuite $$
is being used to reference the Bash instance itself or something it
exec's (which ends up inheriting the PID). It looks like you are
using $$ as a way to capture the PID of a child process being spawned
by the shell, yes? This may explain why you sometimes get lucky and
$$+1 works for the PID.
> > This brings up to the next step: how do we want to address this?> >> > Prior to the queue rework that started in v4.10 things were a bit> > simpler and it looks like we always registered a lost record> > independent of the "audit=?" setting on the kernel command line and> > the current queue backlog. While this would have made this test> > easier, it could result in some over counting problems in the cases> > where an auditd instance came along and read the "lost" records from> > the queue. I don't think reverting to this behavior is ideal.> >> > I'm also not certain that recording lost records in the *not*> > "audit=1" case is a good solution either. In the case where the> > system is not running an audit daemon they are almost guaranteed to> > hit the backlog at some point and then their system will start spewing> > scary looking audit log lost messages; we would surely win lots of> > friends this way.> >> > We could move the "audit=1" check (really it's an audit_default check)> > into audit_log_lost() and use it to squelch the printk() and then call> > audit_log_lost in kauditd_hold_skb() if we can't queue the record.> > This should preserve an accurate lost record count (well, until it> > rolls over, but that's not a new concern), prevent unnecessary scary> > lost record messages, and ensure a consistent audit_log_lost()> > operation (all the other callers I didn't mention in this mail).> >> > Or the simplest option is to just ignore this and require that the> > audit-testsuite be run on a system booted with "audit=1" :)> >> > I'm currently leaning towards moving the "audit=1" check into> > audit_log_lost(), what do you guys think?>> I'll need some time to digest all this. My first reaction is that> requiring "audit=1" for the audit-testsuite is not the right answer and> as we've discussed previously, it makes sense to run the entire> testsuite both with and without "audit=1".
Yes, requiring "audit=1" is not an answer. I put it there mostly as
an attempt at humor, hence the smiley at the end, but clearly my
abilities as a comic are lacking ;)
Anyway, no rush on looking it over; this isn't a -stable fix in my
eyes so it's not a candidate for merging until after the upcoming
merge window. I think it will make a lot more sense once you go back
and look at the code, reading about the problem in email probably
makes is sound more complicated than it is. I'm fairly certain the
right solution is to move the "audit=1" check into audit_log_lost(),
but I sent this out in case you or anyone else could think of a better
solution. In the meantime I'll throw together a quick patch (it
should be pretty small) and add it to my testing queue.

On 2018-12-14 15:35, Paul Moore wrote:
> On Fri, Dec 14, 2018 at 11:12 AM Richard Guy Briggs <rgb@redhat.com> wrote:> > On 2018-12-14 10:53, Paul Moore wrote:> > > On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:> > > > On 2018-12-13 18:23, Paul Moore wrote:> > > > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:> > ...> > > > As an aside, have you spent any time trying to debug that wrong PID> > > problem? While not strictly audit related, that seems like a pretty> > > serious Bash bug. Or maybe it's a problem with the test? I vaguely> > > remember a discussion between you and Ondrej and some confusion around> > > which Bash variable to use to fetch PIDs, but I may be mistaken.> >> > I haven't spent much time trying to debug that bash PID increment issue,> > but it perplexed me since it was the identical technique used in> > multiple tests in the audit-testsuite that has never caused an issue> > previously on any of the same set of test machines. This was for the> > missing mount umount2 hang bug test> > https://github.com/linux-audit/audit-testsuite/pull/76> > Ah, I think I see the problem. Unless I'm mistaken, you are talking> about the shell/Bash command where the tests print the PID using "echo> $$" or similar, yes? As you likely already know, in Bash (and likely> other shells as well), $$ is the PID of the running Bash process; in> the two places where I saw it used in the existing audit-testsuite $$> is being used to reference the Bash instance itself or something it> exec's (which ends up inheriting the PID). It looks like you are> using $$ as a way to capture the PID of a child process being spawned> by the shell, yes? This may explain why you sometimes get lucky and> $$+1 works for the PID.
I was always getting lucky with $$+1, but understandably uncomfortable
with it since others weren't so fortunate.
In the code that's there, that process is backgrounded, but I'm fairly
certain I tested without that and carefully checked the options (-f and
-s) to ensure it wasn't daemonizing or multithreading. I was pretty
careful to set up exactly the same conditions for running that process
as other tests use, but that looks like the first thing to check next
time I try it. It wouldn't be the first time I've missed something
obvious.
> > > This brings up to the next step: how do we want to address this?> > >> > > Prior to the queue rework that started in v4.10 things were a bit> > > simpler and it looks like we always registered a lost record> > > independent of the "audit=?" setting on the kernel command line and> > > the current queue backlog. While this would have made this test> > > easier, it could result in some over counting problems in the cases> > > where an auditd instance came along and read the "lost" records from> > > the queue. I don't think reverting to this behavior is ideal.> > >> > > I'm also not certain that recording lost records in the *not*> > > "audit=1" case is a good solution either. In the case where the> > > system is not running an audit daemon they are almost guaranteed to> > > hit the backlog at some point and then their system will start spewing> > > scary looking audit log lost messages; we would surely win lots of> > > friends this way.> > >> > > We could move the "audit=1" check (really it's an audit_default check)> > > into audit_log_lost() and use it to squelch the printk() and then call> > > audit_log_lost in kauditd_hold_skb() if we can't queue the record.> > > This should preserve an accurate lost record count (well, until it> > > rolls over, but that's not a new concern), prevent unnecessary scary> > > lost record messages, and ensure a consistent audit_log_lost()> > > operation (all the other callers I didn't mention in this mail).> > >> > > Or the simplest option is to just ignore this and require that the> > > audit-testsuite be run on a system booted with "audit=1" :)> > >> > > I'm currently leaning towards moving the "audit=1" check into> > > audit_log_lost(), what do you guys think?> >> > I'll need some time to digest all this. My first reaction is that> > requiring "audit=1" for the audit-testsuite is not the right answer and> > as we've discussed previously, it makes sense to run the entire> > testsuite both with and without "audit=1".> > Yes, requiring "audit=1" is not an answer. I put it there mostly as> an attempt at humor, hence the smiley at the end, but clearly my> abilities as a comic are lacking ;)
Sorry, email is limited... I read :) and ;) differently.
> Anyway, no rush on looking it over; this isn't a -stable fix in my> eyes so it's not a candidate for merging until after the upcoming> merge window. I think it will make a lot more sense once you go back> and look at the code, reading about the problem in email probably> makes is sound more complicated than it is. I'm fairly certain the> right solution is to move the "audit=1" check into audit_log_lost(),> but I sent this out in case you or anyone else could think of a better> solution. In the meantime I'll throw together a quick patch (it> should be pretty small) and add it to my testing queue.
I'll look at the code and your patch carefully when I can.
> paul moore
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

On Fri, Dec 14, 2018 at 10:00 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-12-14 15:35, Paul Moore wrote:> > On Fri, Dec 14, 2018 at 11:12 AM Richard Guy Briggs <rgb@redhat.com> wrote:> > > On 2018-12-14 10:53, Paul Moore wrote:> > > > On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:> > > > > On 2018-12-13 18:23, Paul Moore wrote:> > > > > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:> >> > ...> >> > > > As an aside, have you spent any time trying to debug that wrong PID> > > > problem? While not strictly audit related, that seems like a pretty> > > > serious Bash bug. Or maybe it's a problem with the test? I vaguely> > > > remember a discussion between you and Ondrej and some confusion around> > > > which Bash variable to use to fetch PIDs, but I may be mistaken.> > >> > > I haven't spent much time trying to debug that bash PID increment issue,> > > but it perplexed me since it was the identical technique used in> > > multiple tests in the audit-testsuite that has never caused an issue> > > previously on any of the same set of test machines. This was for the> > > missing mount umount2 hang bug test> > > https://github.com/linux-audit/audit-testsuite/pull/76> >> > Ah, I think I see the problem. Unless I'm mistaken, you are talking> > about the shell/Bash command where the tests print the PID using "echo> > $$" or similar, yes? As you likely already know, in Bash (and likely> > other shells as well), $$ is the PID of the running Bash process; in> > the two places where I saw it used in the existing audit-testsuite $$> > is being used to reference the Bash instance itself or something it> > exec's (which ends up inheriting the PID). It looks like you are> > using $$ as a way to capture the PID of a child process being spawned> > by the shell, yes? This may explain why you sometimes get lucky and> > $$+1 works for the PID.>> I was always getting lucky with $$+1, but understandably uncomfortable> with it since others weren't so fortunate.>> In the code that's there, that process is backgrounded, but I'm fairly> certain I tested without that and carefully checked the options (-f and> -s) to ensure it wasn't daemonizing or multithreading. I was pretty> careful to set up exactly the same conditions for running that process> as other tests use, but that looks like the first thing to check next> time I try it. It wouldn't be the first time I've missed something> obvious.
Right, let me chime into the PID issue again :) I am pretty sure the
code to get the child PID in Richard's umount2 test is wrong. I didn't
get back to that thread eventually because I tried to simplify the
fixed code to something better (and still 100% reliable), but I
couldn't get it to a state I'd like and eventually I forgot about it
and switched to other things...
So, let me try to explain the problem again. This is the relevant
snippet of code:
system("cd $basedir/$clientdir; echo \$\$ > $stdout; exec ./$client -f
-s $tmpdir &");
Let's rewrite this into pseudocode so it is clear what's actually going on:
1. run "cd $basedir/$clientdir"
2. write the PID of this bash process into "$stdout"
3. fork a child and in that child, execve() this command: "./$client
-f -s $tmpdir"
So the problem is that you log the PID of the parent bash process and
not the child that exec's into the "$client" program. Since the child
gets forked away very quickly after the parent bash process is
created, in 99.9% of the cases it gets a PID exactly one greater -
that's why your +1 trick works. The reason why a similar pattern
("echo $$; exec ...") works elsewhere in the testsuite is that there
is no forking going on in those cases (IIRC).
I just looked at the code in the lost_reset test that actually deals
with the forking scenario, and it uses the "$!" variable, which is the
correct way to get the PID of the last process forked by bash. I admit
I didn't know about this variable when I was pointing out the problem
in Richard's patch, but now I realize this is what I should have
suggested back then...
So, in the umount2 test this:
system("cd $basedir/$clientdir; echo \$\$ > $stdout; exec ./$client -f
-s $tmpdir &");
should be replaced with this (along with dropping the "$pid_fuse += 1;" line):
system("cd $basedir/$clientdir; exec ./$client -f -s $tmpdir & echo
\$! > $stdout;");
That said, I think the code in the lost_reset test is doing the right
thing and I wouldn't expect it to get the ping PID wrong.
Hope that helps,
--
Ondrej Mosnacek <omosnace at redhat dot com>
Associate Software Engineer, Security Technologies
Red Hat, Inc.
--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit