2016/01/30

Monitoring a file to get details about write operations

Today I was answering questions on unix.stackexchange.com when I came across one which at first sight seemed easy to answer, but after rethinking it a few times I realized that I actually didn't know the proper answer (that's the question in question). This made me investigate and led me down a rabbit hole which consumed most of this afternoon, and the next one, and some more time. The question asks how to figure out which process is writing to a certain log file. The person asking can see that something is writing to it by reading the file, but she doesn't know which process is doing the writing. Obviously there are mechanisms in Syslog which could solve that problem if the file is written by Syslog, but out of curiosity I wanted to know if there's a generic mechanism that works for any file, so I'm going to go through a few possible solutions without declaring a single best one.

Inotify

One of the first things that came to my mind when I read the question was Inotify. Inotify is a mechanism provided by the Linux kernel via a set of system calls; it can be used to set watchers on files and directories. If one of the watched files gets modified in a way that the watcher is waiting for, an event gets passed to the user space process. This is great: I can use it to put a watcher on a file and make it wait for anything writing to that file. There are many different wrappers around the Inotify API; my favorites are the CLI utilities inotifywait and inotifywatch which come with the package inotify-tools.

mst@mst-nb1:~$ inotifywait 123
Setting up watches.
Watches established. # now i write to "123" in another terminal
123 MODIFY

Ok, so we got notified that there was a write event on the file, but we still don't know what wrote to it. Unfortunately the Inotify API does not give any hints regarding which process has done a given modification to a file. Considering that the reading of the events in user space happens asynchronously this makes sense, because at the time the event is processed by the user space application the writing process might already have exited. The alternative would be to copy a lot of information about the process into the Inotify event structure, but this would increase its size drastically and hence decrease the performance.

Lsof

My next choice when trying to figure out what's writing to a file is the lsof utility. lsof is great for inspecting all kinds of status information about running processes and the file descriptors they currently hold open.
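A session like the one described below is easy to reproduce; here's a sketch (output omitted, and a redirected sleep stands in for the long-running writer):

```shell
# Start a background process that holds /tmp/abc open for writing on
# file descriptor 1.
sleep 30 > /tmp/abc &
writer=$!

# If lsof is installed, list the processes holding /tmp/abc open; the
# writer should show up with "1w", i.e. fd 1 opened in write mode.
command -v lsof >/dev/null 2>&1 && lsof /tmp/abc

kill "$writer"
```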

The example shows that lsof correctly displays a cat process holding file descriptor number 1 open in write mode and that this file descriptor refers to the file /tmp/abc.

This is great, it shows us what we want to know. But what if the process that modifies the file only runs for 0.01 seconds and then exits or closes the file again? Even when applying the maximum of my typing skills while my finger joints are at room temperature I'd still have a good chance of missing it. Due to the way lsof collects its data there is no way to watch a file for modifications over a time span; it can only capture snapshots. Its repeat mode (the -r option) doesn't help us much because it only captures multiple snapshots at an interval. Looks like we still haven't found a really good solution to our problem.

LoggedFS

Let's get a little more exotic and venture into the land of non-standard utilities. LoggedFS is a fuse based file system, which means it runs in user space, and it greatly increases the verbosity of all operations running on the file system while still letting the "real" file system do its job underneath. Luckily the Ubuntu repositories provide it as a package under the name loggedfs, so it's easy to install. Here's how that looks:
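A minimal session might look roughly like the following sketch; treat the flags as illustrative (-l should select the log file, per loggedfs' own help output), and note that it needs fuse:

```shell
# Mount loggedfs over a directory (it always covers a whole directory,
# not a single file) and write to a file below it.
mkdir -p /tmp/logged
if command -v loggedfs >/dev/null 2>&1; then
    loggedfs -l /tmp/loggedfs.log /tmp/logged

    echo test > /tmp/logged/abc          # trigger a logged write

    # Every log line contains the operation, the path and the pid and
    # name of the calling process.
    grep abc /tmp/loggedfs.log || true

    fusermount -u /tmp/logged            # unmount again
fi
```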

This is great, the output shows us exactly which process has written to our file and we could hardly wish for more details. But there are still some things to consider:

Mounting only works with an entire directory, it does not work on single files, so exclude and include patterns have to be used to specify which file exactly needs to be monitored.

Depending on company policy it might be problematic to install non-default packages on production systems just for troubleshooting.

Some people might dislike the idea of having to run fuse on a production box.

With LoggedFS we've found a really nice and workable solution which tells us what we wanted to know. But actually the Kernel already has all the information that we want, so why would we have to run something else on top of it to collect the same data a second time? Let's see if by digging down into the Kernel we can't find some mechanism that doesn't require external packages or user space tools and still provides us with all the information we are looking for.

SystemTap

When starting to look at the more advanced kernel features it gets interesting, I'll start with SystemTap. SystemTap is a utility that originally comes from RedHat, but it's Open Source and usable on all major distros. It makes it easy to generate simple kernel modules which can be used to inspect what's happening in the kernel by creating event handlers for various types of events.

Besides the ability to attach event handlers to most kernel functions which are exported as symbols, SystemTap comes with a large collection of "tapsets". Tapsets are predefined scripts and events that can be used in custom tap scripts. One of those available events is called vfs.write. vfs stands for Virtual File System and it's basically an abstraction layer on top of the actual file system implementation to handle generic operations like opening, closing, writing and reading files so they don't need to be reimplemented by each file system implementation again and again.

Once SystemTap is installed, the common way to use it is by writing script files with the extension .stp which are then read by the stap utility. This utility translates them into C code, compiles them into a kernel module and then loads that module into the kernel. The structure of the scripts is usually rather simple: they can set probes on events and they can define functions to make code reusable.

Identifying what we're looking for

In our case we want to monitor the write operations on a given file, so first we'll need to figure out how to identify a specific file in a vfs.write event. When querying the stap utility for the available variables provided by the event vfs.write we can see that there is a file variable.

But how do we uniquely identify a file based on the available variables? Each file on the file system is associated with an inode number and each inode number is unique within its file system, so within a file system we can use the inode number to identify a file. To identify the file system we can use the associated device's major and minor numbers. This means that the combination of major/minor numbers plus the inode number should be unique in the whole system and give us a reliable identifier for any file.
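Both numbers are easy to look at from user space, for example with GNU coreutils' stat:

```shell
# Create the file we'll monitor later and print the two numbers that
# identify it: the inode and the device number of the file system it
# lives on (%i = inode, %d = device number in decimal).
touch /tmp/testfile
stat -c 'ino=%i dev=%d' /tmp/testfile
```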

In the kernel the major and minor numbers are merged using a bit shift operation and a logical or, the result is a device identifier which the vfs.write event provides as variable dev. The inode number is available as variable ino. One fact that will be important in a later chapter is that all of that data gets extracted from the file variable by the two functions __file_dev and __file_ino.

In order to be able to look for a certain value in the dev variable we need to reproduce the merge of major and minor numbers that the kernel does internally, luckily SystemTap comes with a function called MKDEV which does exactly that.
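Putting the pieces together, a tap script along these lines should do the job; it's a sketch, and the MKDEV arguments and the inode number are example values that need to be replaced with those of the watched file:

```shell
# A tap script that fires on every vfs.write and filters on the watched
# file's device and inode; MKDEV(252, 1) and the inode 1234 are
# placeholders.
cat > /tmp/watchwrite.stp <<'EOF'
probe vfs.write {
    if (dev == MKDEV(252, 1) && ino == 1234)
        printf("%s (pid %d) wrote to the watched file\n", execname(), pid())
}
EOF

# Running it requires the full SystemTap tool chain plus kernel debug
# symbols:
#   stap /tmp/watchwrite.stp
```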

Great, we're monitoring the file for write operations and for each occurring operation we can get the name and pid of the process that's writing! SystemTap, with all the tapsets that it ships with, really makes investigating kernel internals relatively easy, and this by itself is quite a feature. It also doesn't require you to install its entire tool chain on every system that you want to use it on: after SystemTap has built your kernel module on one machine you can copy the module to every other machine with a compatible kernel and load it there. Unfortunately it can currently still be slightly troublesome to set the tool chain up on Debian based distributions like Ubuntu and it may require a small amount of hacking to make it work; on RedHat it should work out of the box though.

Seems we've found another pretty good solution for our problem, but I wish there was an easy way to achieve our goal of monitoring the writes to a file without having to load a custom kernel module on a production server... Let's try Auditd instead.

Auditd

Just like SystemTap, Auditd originated from RedHat but is Open Source and usable on other distributions too. It consists of several parts: the data collection happens inside the Kernel via a bunch of hooks placed at critical locations; the collected data is first run through a set of filters that live inside Kernel space; the resulting data is sent to a user space daemon (auditd) via a netlink socket (man 7 netlink). auditd then communicates with other user space processes which want to access the collected data or modify filter rules, and it also maintains log files containing the collected data for other tools to read. If user space processes submit commands to modify the filter rules, auditd uses the same netlink socket to configure the in-kernel data filters. The RedHat documentation illustrates that architecture nicely.

The easiest way to setup filter rules is by using the CLI utility auditctl, which comes together with auditd. When reading man 8 auditctl it becomes obvious that this is a quite mature and powerful tool to configure a wide range of settings and filters, but let's first focus on our rather simple task of watching writes on a file. Here are the relevant parts of auditctl's man page:

...
-w path
Insert a watch for the file system object at path. You cannot insert a watch to the
top level directory. This is prohibited by the kernel. Wildcards are not supported
either and will generate a warning. The way that watches work is by tracking the
inode internally. If you place a watch on a file, its the same as using the -F path
option on a syscall rule. If you place a watch on a directory, its the same as using
the -F dir option on a syscall rule. The -w form of writing watches is for backwards
compatibility and the syscall based form is more expressive. Unlike most syscall
auditing rules, watches do not impact performance based on the number of rules sent
to the kernel. The only valid options when using a watch are the -p and -k. If you
need to anything fancy like audit a specific user accessing a file, then use the
syscall auditing form with the path or dir fields. See the EXAMPLES section for an
example of converting one form to another.
...
-k key Set a filter key on an audit rule. The filter key is an arbitrary string of text
that can be up to 31 bytes long. It can uniquely identify the audit records produced
by a rule. Typical use is for when you have several rules that together satisfy a
security requirement. The key value can be searched on with ausearch so that no mat‐
ter which rule triggered the event, you can find its results. The key can also be
used on delete all (-D) and list rules (-l) to select rules with a specific key. You
may have more than one key on a rule if you want to be able to search logged events
in multiple ways or if you have an audispd plugin that uses a key to aid its analy‐
sis.
...
-p [r|w|x|a]
Describe the permission access type that a file system watch will trigger on.
r=read, w=write, x=execute, a=attribute change. These permissions are not the stan‐
dard file permissions, but rather the kind of syscall that would do this kind of
thing. The read & write syscalls are omitted from this set since they would over‐
whelm the logs. But rather for reads or writes, the open flags are looked at to see
what permission was requested.
...

Let's try that with the file /tmp/testfile that we've already used in the SystemTap test. First I'll set up a rule that places a watcher on writes to that file, then I'll write into it, and finally I'll use the ausearch utility to search through the audit log files for the key I've defined in the watcher rule. The ausearch utility gets delivered as part of the auditd package.
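The session looks roughly like this sketch (it needs root and a running auditd; the key name testwrite is just an arbitrary label):

```shell
touch /tmp/testfile

# Adding the rule needs root and a running auditd; skip otherwise.
if command -v auditctl >/dev/null 2>&1 && [ "$(id -u)" = 0 ]; then
    # Watch write accesses on /tmp/testfile, tagged with the key
    # "testwrite".
    auditctl -w /tmp/testfile -p w -k testwrite || true
fi

echo hello > /tmp/testfile               # trigger the watch

if command -v ausearch >/dev/null 2>&1 && [ "$(id -u)" = 0 ]; then
    # The matching records contain pid, comm and exe of the writer.
    ausearch -k testwrite || true
fi
```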

Nice, we got exactly the output that we were looking for again! The auditd way of creating those watcher rules is really amazingly powerful, user friendly and easy to use. Personally I think that the auditd utility should be at the front of every operations guy's utility belt, somewhere between tcpdump and strace.

I could be satisfied with the solutions discovered so far, but I'm still curious whether there isn't a way to get that data without any user space utilities, surgically extracting it directly from the Kernel. That's where Ftrace and Kprobe come in.

Ftrace & Kprobe

Ftrace is a Linux Kernel internal tracer that captures information about the current context when certain events occur and conditions are met. It exports its control interfaces via a virtual file system called tracefs which is usually mounted in the directory tracing of the debugfs file system. By default most distributions mount debugfs on /sys/kernel/debug, so Ftrace's files are in /sys/kernel/debug/tracing. From now on, when I refer to paths of tracefs files, their paths should always be interpreted as relative to the mount point of tracefs. When browsing through that directory it might at first seem overwhelming due to the large number of files and directories, but luckily the kernel comes with really great documentation.

Basic event tracking with Ftrace

The list of available tracers is in the file /available_tracers, but since we only want some basic information and do not need any advanced tracing mechanisms the simple nop tracer is sufficient for us. It can be selected by writing its name into the file /current_tracer.

Before we start capturing anything let's first make sure the tracing is disabled and then clear the current trace output, that way we won't have any other output cluttering our trace output. The tracing can be enabled/disabled by writing 1/0 into the file /tracing_on. The trace output can be cleared by truncating the file /trace, which will later contain our trace output.
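In commands, the setup so far looks like this (everything below needs root, so the sketch skips silently otherwise):

```shell
T=/sys/kernel/debug/tracing

# Writable only as root; skip silently otherwise.
if [ -w "$T/tracing_on" ]; then
    echo nop > "$T/current_tracer"       # the simple nop tracer suffices
    echo 0   > "$T/tracing_on"           # make sure tracing is off
    : > "$T/trace"                       # truncate the trace buffer
fi
```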

Ftrace knows a large number of events and it allows us to choose which events we're interested in. The events are shown to the user in two ways, as a directory hierarchy in the directory /events and as a long list in the file /available_events. Because I like the hierarchic structure that the /events directory uses I'll stick to that one for now.

On the first level of directories in /events there is one sub directory per type of event, since we're looking for write operations I'll first focus on the system calls in /events/syscalls. Doing an ls on /events/syscalls shows that for each system call there is one event when entering the call and one when exiting it, so let's look at the content of /events/syscalls/sys_enter_write.

Each of the event directories contains the same set of files and unsurprisingly enable is used to disable/enable the tracing of this event. Let's first disable the tracing of all events and then only enable that one that we're interested in. For convenience the upper level directories /events/syscalls and /events also contain a file enable which can be used to propagate a change through all the sub directories. The file /set_event always contains a list of the currently enabled events (and it could also be used to control the enabling/disabling of events) so we read that file to verify that now the only selected event is sys_enter_write.
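As commands (again root-only, skipped otherwise):

```shell
T=/sys/kernel/debug/tracing

if [ -w "$T/events/enable" ]; then
    echo 0 > "$T/events/enable"                            # disable all events
    echo 1 > "$T/events/syscalls/sys_enter_write/enable"   # enable just ours
    cat "$T/set_event"     # should now list only syscalls:sys_enter_write
fi
```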

Seems we're ready to start tracing, but we don't want to get flooded with data, so we'll only enable the tracing for about 5 seconds. Then we should be able to see the output of the trace in the file /trace and depending on how busy your system is there might be quite a lot of data.
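The timed capture boils down to this (root-only, skipped otherwise):

```shell
T=/sys/kernel/debug/tracing

if [ -w "$T/tracing_on" ]; then
    echo 1 > "$T/tracing_on"
    sleep 5                              # capture for about 5 seconds
    echo 0 > "$T/tracing_on"
    head -n 20 "$T/trace"                # first lines of the capture
fi
```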

Cool, so we get some data there! Unfortunately we don't know what that is, because the only file identifier given is the fd, the file descriptor. Since file descriptors are only unique in process context we can't use them to filter for the specific file we're interested in: we don't know which process is writing to that file, which is exactly what we're trying to find out! To solve that problem we can use Kprobe to define custom events and variables to filter by.

Custom events with Kprobe

Kprobes work by placing breakpoint instructions at specific places in the kernel and then collecting data associated with the current context; if you want to know all the tiny details about how this works then the kernel documentation shall satisfy your curiosity. Defined Kprobes can be used by Ftrace just like any other event and they're controlled using the same files as described above, but the important feature that we need right now is that Kprobe allows us to dereference function arguments in order to filter by members of structures in those arguments. This will help us to filter out events on a specific file.

Due to the flexibility of Kprobe we can get information about basically anything that happens in the kernel; we just need to know what we're looking for and how to filter it out of all the other stuff happening. Let's go back to the function vfs_write that we've already used with SystemTap to monitor writes. If SystemTap is able to retrieve the criteria we require from the function arguments of vfs_write, then there should be a way for us to do the same with Kprobe if we just figure out where that data resides.

Looking at the kernel code, we can see in the function definition of vfs_write that the file variable is of type struct file.

You might remember from the SystemTap example above that the vfs.write tap extracted all the information about the inode number and the associated device's major and minor numbers from the file argument, so we need to inspect the kernel code that defines the involved structures to figure out where that data is.

What we see in those struct definitions is that the inode number is available from file->f_inode->i_ino as a plain unsigned long and the major and minor numbers are accessible as file->f_inode->i_sb->s_dev as type dev_t. The kernel uses macros to convert dev_t typed variables into major and minor numbers and back, we'll come back to that in a minute.

Defining the filter criteria by using custom variables

The kprobetrace documentation comes with a section describing the syntax used to define kprobe events and their custom variables via tracefs.

Since we are going to have to access nested members of struct file we'll need to know their offsets relative to their containing structs. The C standard library has a really useful macro called offsetof which can tell us exactly that, but since code running in the kernel can't access the standard library we'll need to use the kernel's own implementation of offsetof which is in include/linux/stddef.h. Obviously, if you're less lazy than I am, then it's also possible to just do the offset calculations in your head. But the lazy-people's way to get them is by writing a super minimalist kernel module that prints the offsets into dmesg as debug messages:
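A sketch of such a module's source could look like this (building it additionally needs the usual one-line kbuild Makefile and the kernel headers):

```shell
# Write a minimal module that logs the four offsets we need via printk
# debug messages when it's loaded.
cat > /tmp/offsets.c <<'EOF'
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/stddef.h>

static int __init offsets_init(void)
{
	pr_debug("file.f_inode: %zu\n", offsetof(struct file, f_inode));
	pr_debug("inode.i_ino: %zu\n", offsetof(struct inode, i_ino));
	pr_debug("inode.i_sb: %zu\n", offsetof(struct inode, i_sb));
	pr_debug("super_block.s_dev: %zu\n", offsetof(struct super_block, s_dev));
	return 0;
}

static void __exit offsets_exit(void) {}

module_init(offsets_init);
module_exit(offsets_exit);
MODULE_LICENSE("GPL");
EOF
```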

Since we are going to calculate the offsets relative to the pointer value of the first function argument (struct file *file) the address to add the offsets to should be in the register %di, just as this StackOverflow post nicely describes.

Using the above quoted syntax documentation we can now define custom variables for our Kprobe event that extract i_ino and s_dev from struct file by writing its definition into the file kprobe_events.
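A sketch of such a definition (the event name "mywrite" is arbitrary, and the offsets used here, 32 for file->f_inode, 64 for inode->i_ino, 40 for inode->i_sb and 16 for super_block->s_dev, are example values; substitute the offsets your own kernel reports):

```shell
T=/sys/kernel/debug/tracing

# Writable only as root; skipped otherwise. The fetch-args dereference
# file->f_inode->i_ino and file->f_inode->i_sb->s_dev starting from the
# first function argument in %di.
if [ -w "$T/kprobe_events" ]; then
    echo 'p:mywrite vfs_write ino=+64(+32(%di)):u64 dev=+16(+40(+32(%di))):u32' \
        > "$T/kprobe_events"
fi
```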

Well that bit shift should be simple enough to reimplement, so let's shift some bits!

I'll use Python to make those examples shorter:

# Getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where the file
# to monitor resides
mst@mst-nb1:~$ sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:1
# Reimplementing the kernel MKDEV macro with the input of 252 and 1,
# the result should be what we need to look for as dev_t
mst@mst-nb1:~$ python -c 'print(252 << 20 | 1)'
264241153
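With the dev value computed, setting the filter, triggering a write and reading the trace looks roughly like this (assuming a kprobe event with dev and ino variables was registered as described above, here under the hypothetical name mywrite; the inode 1234 is a placeholder):

```shell
T=/sys/kernel/debug/tracing

if [ -d "$T/events/kprobes/mywrite" ]; then
    # Only report writes matching our dev/ino pair, then enable the event.
    echo 'dev == 264241153 && ino == 1234' > "$T/events/kprobes/mywrite/filter"
    echo 1 > "$T/events/kprobes/mywrite/enable"

    # Trigger a write from a known process...
    python3 -c 'open("/tmp/testfile", "w").write("hi")'

    # ...and look for it in the trace output.
    grep mywrite "$T/trace" || true
fi
```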

There's exactly one line showing our Python processes calling vfs_write and we didn't have to taint the kernel with custom modules or install any non-standard tools to get to that result!

Conclusion

There are many answers to the question above, some better and some worse, but the only way to see the trade-offs is to be familiar with all the concepts. Obviously the described tools can also be used for much more than monitoring write operations; this was only a basic example.

Writing this article was a great exercise for me and I'd be happy about feedback with suggestions for improvements or other solutions.