What the what!

Introduction

Stratis (which includes stratisd as well as stratis-cli) provides ZFS/Btrfs-style features by integrating layers of existing technology: Linux’s device-mapper subsystem and the XFS filesystem. The stratisd daemon manages collections of block devices and exports a D-Bus API. stratis-cli provides a command-line tool that uses the D-Bus API to communicate with stratisd.

Debugging GObject reference leaks can be difficult; very difficult, according to the official documentation. If you search this subject you will find numerous hits. A tool called RefDbg was created to address this specific need, but it appears to have lost its effectiveness because (taken from its docs):

Beginning with glib 2.6 the default build does not allow functions to
be overridden using LD_PRELOAD (for performance reasons). This is the
method that refdbg uses to intercept calls to the GObject reference
count related functions and therefore refdbg will not work. The
solution is to build glib 2.6.x with the '--disable-visibility'
configure flag or use an older version of glib (<= 2.4.x). Its often
helpful to build a debug version of glib anyways to get more useful
stack back traces.

I actually tried the tool and got the error “LD_PRELOAD function override not working. Need to build glib with --disable-visibility?”, so I kept looking. Another post led me to investigate systemtap, which seemed promising, so I looked into it further. This approach eventually got me the information I needed to find and fix reference count problems. I’m outlining what I did in hopes that it will be beneficial to others as well. For the record, I used this approach on storaged, but it should work for any code base that uses GObjects.

The overall approach is:

Identify what is leaking

For each object that is leaking in step 1, collect the needed information to help figure out why

Seems simple…

Step 1, identification

Step 1 comes almost directly from my initial research on this topic and on using systemtap. This is my version of the systemtap script I used to figure out what was leaking.

The end summary shows a list of objects that are still present in memory when the process exits. In this particular case we can see that something looks wrong, as we have 1556 instances of GUdevDevice and UDisksLinuxDevice, which is unexpected. At this point we have an idea of what’s leaking and where the objects were created (and, for some, destroyed), but not much else. What’s really needed is a full record of every place an object reference gets incremented and decremented.

Step 2, getting more detailed information

To debug a reference counting leak you ultimately need to identify:

Where an object was created

Each place a reference count was incremented and decremented

My first approach was to edit the above systemtap script and simply dump all of this, e.g.:

However, even with all this data things still weren’t clear; there was simply too much to look at. I spent time looking at back traces and eventually realized that some leaking objects were created along several different paths. How could I focus my attention on the correct ones? To help with this I wrote a Python script that took the output of the first systemtap script and summarized it:
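A sketch of what such a summarizer can look like, assuming the trace has already been reduced to one (object type, creation back trace) pair per creation event (my real script parsed the raw systemtap output; the input format and counts here are made up for illustration):

```python
from collections import Counter, defaultdict

def summarize(events):
    """Count creation back traces per object type.

    `events` is an iterable of (object_type, backtrace) tuples, where
    backtrace is a hashable representation of the call path that
    created the object.
    """
    paths = defaultdict(Counter)
    for obj_type, backtrace in events:
        paths[obj_type][backtrace] += 1
    return paths

# Hypothetical reduced input: two creation paths for GUdevDevice,
# one responsible for the vast majority of instances.
events = ([("GUdevDevice", "path_a")] * 1500 +
          [("GUdevDevice", "path_b")] * 56 +
          [("UDisksLinuxDevice", "path_a")] * 10)

summary = summarize(events)
for obj_type, counter in summary.items():
    for path, count in counter.most_common():
        print(obj_type, count, path)
```

Sorting each type’s paths by count is what makes the dominant creation path jump out.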

We suspected that the biggest leaking object was GUdevDevice, and from this summary we can see that there are two distinct creation paths which leak, one of them accounting for the vast majority (lucky for me).

It’s interesting how many different code paths exist for some of the leaking objects: 23 of them, oh fun! With this new information we can look at the output from the second, more detailed systemtap script, find an object instance created along the same code path we selected from the summary, and then examine all the places where that object’s reference count was incremented and decremented. An example (back traces omitted for brevity):

At this point I wish I could share some magical way to identify the exact places where a reference count decrement is missing, but unfortunately you just end up reading a lot of code and evaluating where an increment occurred without a matching decrement. Sometimes it’s easy because the reference’s lifespan is short, but in other cases the span is long and crosses thread boundaries. This is where it would help if developers commented on the intended life cycle of the object in question when the code is written.
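One mechanical aid: for a given instance, the ref/unref events must balance out to zero by process exit. A minimal sketch of that bookkeeping, assuming the trace has been reduced to (operation, address) pairs (the addresses and events here are invented):

```python
from collections import Counter

# Hypothetical event stream reduced from the systemtap trace.
events = [("new", "0xa1"), ("ref", "0xa1"), ("unref", "0xa1"),
          ("new", "0xb2"), ("unref", "0xb2")]

balance = Counter()
for op, addr in events:
    # new and ref each add a reference, unref removes one.
    balance[addr] += 1 if op in ("new", "ref") else -1

# Any instance with a positive balance still held references at exit;
# those are the instances whose back traces are worth chasing.
leaked = sorted(addr for addr, count in balance.items() if count > 0)
print(leaked)
```

This doesn’t tell you which unref is missing, but it narrows the trace down to the instances that actually leaked.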

Final thoughts

Fixing GObject reference count leaks is indeed very difficult, even more so when you are working on code that others developed. If you find yourself writing reference counting code with non-trivial lifespans, please do others a favor and document the intended lifespan. If you can’t document it, that’s a pretty good indication that you don’t fully understand it yourself, and the code will likely be plagued with leaks and/or segmentation faults during its lifetime.

Continuous integration (CI) support in GitHub is a very useful addition. Not only can you use existing services like Travis CI, you can use the GitHub API and roll your own, which is exactly what we did for libStorageMgmt. LibStorageMgmt needs to run tests for hardware-specific plugins, so we created our own tooling to hook up GitHub and our hardware, which is geographically spread across the US. However, shortly after getting all this in place and working, it became pretty obvious that we had provided a nice attack vector…

All it takes is a malicious user forking our repository, adding some malicious code, and submitting a pull request, and *boom*! We just executed whatever code they wanted on each of our CI test nodes. The worst part is that some of our tests require the node running them to have root authority, so total box ownership is assured. The test nodes are stripped-down boxes with nothing of value on them data-wise, but just providing hardware and network resources on which to execute arbitrary code is valuable in itself.

Our solution was to limit automated CI testing to only those we trust, using whitelisting. I don’t envy Travis CI and all the security issues they have to consider. They are basically providing a service which allows virtually anyone to execute arbitrary code on their boxes.

None of this is terribly new, just potentially new to those writing their own CI service.

While working on lvm-dubstep, the question was posed whether D-Bus could handle the number of changes that can happen in a short period of time, especially PropertiesChanged signals when a large number of logical volumes or physical volumes are present on the system (e.g. 120K PVs and 10K+ LVs). To test this I put together a simple server and client which try to send an arbitrary number of signals as fast as possible. The number settled on was 10K, because during early testing I ran into time-out exceptions when trying to send more than that in a row. Initial testing was done using the dbus-python library, and even though the numbers seemed sufficient, people asked about sd-bus and sd-bus utilizing kdbus, so the experiment was expanded to include these as well. Source code for the testing is available here.

Test configuration

One client listening for signals.

Python tests were run on a Fedora 22 VM.

Sd-bus was run on a similarly configured VM running Fedora 23 alpha utilizing systemd 222-2. Kdbus was built as a kernel module from the latest systemd/kdbus repo. F23 kernel version: 4.2.0-0.rc8.git0.1.fc23.x86_64.

I tried running all the tests on the F23 VM, but the Python tests failed horribly, with journal entries like: Sep 01 10:53:14 sdbus systemd-bus-proxyd[663]: Dropped messages due to queue overflow of local peer (pid: 2615 uid: 0)

The results:

Python utilizing the dbus-python library was able to send and receive 10K signals in a row, varying payload size from 32 to 128K, without any issues. As mentioned above, if I try to send more than that in a row, especially at larger payload sizes, I do get time-outs on the send.

With sd-bus without kdbus, I was only able to send payloads up to 512 bytes before the server would error out with: Failed to emit signal: No buffer space available.

With sd-bus using kdbus, the test completes, seemingly without error, but the number of signals received is not as expected and appears to vary with payload size.

Messages/second is the total number of messages divided by the total time to receive them.

MiB/Second is the messages per second multiplied by the payload size.

Average time delta is the time difference from when the signal was placed on the bus until it was read by the signal handler. This shows quite a bit of buffering in the Python test case at larger payload sizes.

Percentage of signals received by the signal handler. As you can see, once the payload is > 2048 bytes, kdbus appears to silently discard signals: nothing appeared in kernel output and the return codes in user space were all good.
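The first two metrics reduce to simple arithmetic; a sketch with made-up numbers (these are not results from the runs above):

```python
MIB = 1024 * 1024

def throughput(num_messages, total_seconds, payload_bytes):
    """Messages/second and MiB/second as defined above."""
    msgs_per_sec = num_messages / total_seconds
    mib_per_sec = msgs_per_sec * payload_bytes / MIB
    return msgs_per_sec, mib_per_sec

# Hypothetical run: 10K signals with a 32 KiB payload received in 5 seconds.
msgs, mib = throughput(10_000, 5.0, 32 * 1024)
print(msgs, mib)  # 2000.0 messages/s, 62.5 MiB/s
```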

Conclusions

The C implementation using sd-bus without kdbus performs slightly worse than dbus-python, which surprised me.

kdbus is by far the best performer with payloads < 2048 bytes.

Signals are by definition sent without a response. Having a reliable way for the sender to know that a signal was not delivered seems pretty important, so either kdbus has a bug in that no error is returned to the transmitter, or there is a bug in my test code.

The code likely has bugs and/or does something sub-optimal; please submit a pull request and I will incorporate the changes and re-run the tests.

The following explanation may help to understand the above rules a bit better: consider that there are three possible kinds of reactions from users of your library to changes in a shared library:

Programs using the previous version may use the new version as drop-in replacement, and programs using the new version can also work with the previous one. In other words, no recompiling nor relinking is needed. In this case, bump revision only, don’t touch current nor age.

Programs using the previous version may use the new version as drop-in replacement, but programs using the new version may use APIs not present in the previous one. In other words, a program linking against the new version may fail with “unresolved symbols” if linking against the old version at runtime: set revision to 0, bump current and age.

Programs may need to be changed, recompiled, relinked in order to use the new version. Bump current, set revision and age to 0.
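The three cases can be written down as a small function over the current:revision:age triple. This is my own restatement of the libtool rules, not libtool code; "interfaces added/removed" here means changes to the library's public interface since the last release:

```python
def bump_version_info(current, revision, age,
                      interfaces_added=False, interfaces_removed=False):
    """Apply the libtool -version-info update rules for a new release."""
    if interfaces_added or interfaces_removed:
        # Cases 2 and 3: the interface changed, so bump current and
        # reset revision.
        current, revision = current + 1, 0
        # Case 2: purely additive, old programs still work, so age grows.
        # Case 3: anything removed breaks compatibility, so age resets.
        age = age + 1 if (interfaces_added and not interfaces_removed) else 0
    else:
        # Case 1: implementation changed only; bump revision.
        revision += 1
    return current, revision, age

print(bump_version_info(1, 2, 1))                          # bug fix: 1:3:1
print(bump_version_info(1, 3, 1, interfaces_added=True))   # additive: 2:0:2
print(bump_version_info(2, 0, 2, interfaces_removed=True)) # breaking: 3:0:0
```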

This was confusing to me until I played around with the -version-info value and looked at the library output on my Linux development system.

The current:revision:age is an abstraction which allows libtool to do the correct thing for the platform you are building and deploying on. This wasn’t immediately evident when reading the documentation.

More info: http://code.google.com/p/google-authenticator/wiki/KeyUriFormat

6. Using Google Authenticator, create a new software key from the QR code you created. If you don’t want to create a QR code, create a new entry, set it to time-based, and enter the 64 characters. I’m guessing it may take a few tries to get it correct.

7. Test that Google Authenticator and oathtool --totp `cat ~/.totp-key` match. If they don’t, make sure the date and time on both the phone and the computer match.

8. Using a separate terminal, ssh to the remote system with just the password; this should fail. Then try using OTP/password for your password and you should be authenticated.
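For reference, the value that both oathtool --totp and Google Authenticator compute follows RFC 4226/6238, and can be sketched with just the Python standard library (using the defaults both tools share: 6 digits, HMAC-SHA1, 30-second step). Note the key here is raw bytes; oathtool takes the key as hex, so a hex key file would need bytes.fromhex() first:

```python
import hashlib
import hmac
import struct
import time

def hotp(key, counter, digits=6):
    """RFC 4226 HOTP: HMAC-SHA1 plus dynamic truncation."""
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = int.from_bytes(mac[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(key, for_time=None, step=30):
    """RFC 6238 TOTP: HOTP over the current 30-second time window."""
    t = int(time.time()) if for_time is None else for_time
    return hotp(key, t // step)

# RFC test vector: ASCII key "12345678901234567890" at t=59s -> "287082".
print(totp(b"12345678901234567890", for_time=59))
```

If your phone and computer disagree, it is almost always the `t // step` window that differs, which is why step 7 tells you to check both clocks.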

My home network uses OpenBSD and its built-in packet filter (pf). I use cable for broadband internet and found that if I tried to upload a large file, my internet connection became very unstable, with high amounts of latency. After using tools such as http://netalyzr.icsi.berkeley.edu/blog/ it became obvious that I was suffering from buffer bloat.

After some searching, I came across altq support in pf and tried some configuration changes to reduce the buffer bloat.
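My queue definition looked roughly like the following (a sketch; $ext_if, the 1Mb bandwidth, and the qlimit of 9 come from my setup described below — check pf.conf(5) for the exact altq syntax your pf version supports):

```
altq on $ext_if cbq bandwidth 1Mb queue { bb }
queue bb bandwidth 100% qlimit 9 cbq(default)
```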

and the corresponding rule to tag outgoing traffic on that interface into this queue:

pass out on $ext_if keep state queue( bb )

My home connection is advertised as 5Mb down, 1Mb up. In testing I get about 1.1Mb up, so I set up my outgoing queue to limit outgoing traffic to 1Mb. A typical setting would be 97% of the maximum. One of the most important values in the queue setup is the queue length (the number of buffered packets); mine is currently set at 9. Here is how I determined that value.

The maximum upstream bandwidth in packets per second is the upstream bandwidth in bytes per second divided by the packet size. In my case, 1,000,000 bits/s ÷ 8 = 125,000 bytes/s, and 125,000 ÷ 1460 bytes per packet yields about 85 packets per second. So if I set my queue length to 85, I would have about 1 second of latency under full upstream load. I like my latency low, so I divided by 10 to target roughly 100 ms of latency, giving 8.5, which I rounded up to 9.
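The arithmetic, spelled out (these are my link's numbers; substitute your own measured upstream rate):

```python
import math

upstream_bits_per_sec = 1_000_000   # 1Mb advertised upstream
packet_bytes = 1460                 # typical full-size TCP segment payload

bytes_per_sec = upstream_bits_per_sec / 8        # 125000 bytes/s
packets_per_sec = bytes_per_sec / packet_bytes   # ~85.6 packets/s

target_latency_sec = 0.1            # ~100 ms worst case under full upstream load
qlimit = math.ceil(packets_per_sec * target_latency_sec)
print(qlimit)  # 9
```

In other words, the queue length is just how many packets you are willing to let sit in the buffer, converted from your latency budget.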

NOTE: Updated 4/2/2015 to reflect new project links and updated command line syntax.

Abstract

This paper discusses some of the advanced features that can be used in modern storage subsystems to improve IT work flows. Being able to manage storage, whether it be direct-attached, storage area network (SAN), or networked file system, is vital. The ability to manage different vendor solutions consistently using the same tools opens up a new range of storage-related solutions. LibStorageMgmt meets this need.

Introduction

Many of today’s storage subsystems have a range of features. Some examples for block storage include: create, delete, re-size, copy, and make space-efficient copies and mirrors. Networked file systems can offload copies of files or even entire file systems quickly while using little to no additional storage, or keep numerous read-only point-in-time copies of file system state. This allows users to quickly provision new virtual machines, take instantaneous copies of databases, and back up other files. For example, a user could quiesce a database, call to the array to make a point-in-time copy of the data, and then resume database operations within seconds. The user could then take as much time as necessary to replicate the copy to a remote location or removable media. There are many other valuable features available through the array management interface. In fact, in many cases it’s necessary to use this out-of-band management interface to enable the use of features that are available in-band, across the data interface.

Problem

To use these advanced features, users must install proprietary tools and libraries for each array vendor. This allows users to fully exploit their hardware, but at the cost of learning new command-line and graphical user interfaces and programming to new application programming interfaces (APIs) for each vendor. Open-source solutions frequently cannot use proprietary libraries to manage storage because of incompatible licensing. In other cases, the open-source developer cannot redistribute the vendor libraries, so end users must manually install all of the required pieces themselves. The Storage Network Industry Association (SNIA) and the associated Storage Management Initiative Specification (SMI-S) have an ongoing effort to address this need with a well-defined and established storage standard. The standard is quite large, however, which prevents administrators and developers from leveraging it easily. With the scope and complexity of such a large standard, it is also difficult for vendors to implement it without variations in behavior. The SMI-S members’ focus is on being providers of the API, not consumers of it, so the emphasis is on the array provider’s perspective. And because the SMI-S standard must define an API for each new feature, the specification always trails vendor-defined APIs.

The LibStorageMgmt solution

The libStorageMgmt project’s goal is to provide an open-source, vendor-agnostic library and command-line tool that give administrators and developers the ability to leverage storage subsystem features in a consistent and unified manner. When a developer chooses to use the library, their users benefit by being able to use any of the supported arrays, as well as future arrays when they are added. The library is licensed under the LGPL, which allows its use in open-source and commercial applications. The command-line interface (lsmcli) has been designed with scriptability in mind, with configurable output to ease parsing. The library API has language bindings for C and Python. The library architecture uses plug-ins for easy integration with different arrays. The plug-ins execute in their own address space, allowing a plug-in developer to choose whatever license is most appropriate for their specific requirements. The separate address space also provides fault isolation in the event of a plug-in crash, which is very helpful if the plug-in is provided in binary form only.

Support for additional arrays is in development and will be released as they become available.

Example: Live database backup

An administrator has a MySQL database that they would like to do a live “hot” backup of, to minimize disruption to end users. They also use NetApp filers for their storage, and would like to leverage the hardware features these provide for point-in-time space-efficient copies. The database is located on an iSCSI logical disk provided by the filer. (These are referred to as volumes in libStorageMgmt.)

The overall flow of operations:

Craft a uniform resource identifier (URI) for the array for use with libStorageMgmt

Identify the appropriate disk and obtain its libStorageMgmt ID

Quiesce the database

Use libStorageMgmt to issue a command to the array to replicate the disk

Release the database to continue

Use libStorageMgmt to grant access to the newly created disk to an initiator so that it can be mounted and backed up

Crafting the URI

As the admin is using NetApp, they need to select the ontap plug-in by crafting a URI. The URI looks like “ontap+ssl://root@filer_host_name_or_ip/”. The beginning of the URI specifies the plug-in, with an optional indicator that the user would like to use SSL for communication. The user “root” is used for authentication, and the filer can be addressed by hostname or IP address. This example will use the command-line interface. We can either specify the URI on the command line with ‘-u’ or set the environment variable LSMCLI_URI to avoid typing it for every command. The password can be prompted for with “-P”, or supplied in the environment variable LSMCLI_PASSWORD.
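The URI breaks down as ordinary URL components, so any standard URL parser shows the pieces; a quick sketch (the filer host name is hypothetical):

```python
from urllib.parse import urlparse

uri = urlparse("ontap+ssl://root@filer.example.com/")
print(uri.scheme)    # plug-in name plus the optional "+ssl" transport flag
print(uri.username)  # user to authenticate as
print(uri.hostname)  # the filer, by host name or IP address
```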

Identify the disk to replicate

The administrator queries the array to identify the volume that the database is located on. To correctly identify the disk, the admin first checks where the file system is mounted, looking for the UUID of the file system, and then looks in /dev/disk/by-id to identify the specific disk.

This command displays all the available volumes for the array. It outputs a number of different fields for each volume on the storage array. The fields are separated by a space (using -t" ") and are defined as: ID, Name, vpd83, block size, #blocks, status, size bytes, system ID, and pool ID.

Definitions of each:

ID – Array unique identifier for the Volume (virtual disk)

Name – Human readable name

vpd83 – SCSI inquiry data for page 0x83

block size – Number of bytes in each disk block (512 is common)

#blocks – Number of blocks on disk

status – Current status of disk

size bytes – Current size of disk in bytes

system ID – Unique identifier for this array

pool ID – Unique storage pool that virtual disk resides on
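A script consuming the -t" " output can split each line into those fields; a sketch with a made-up output line (real values come from lsmcli, and this assumes the field values themselves contain no spaces):

```python
FIELDS = ("id", "name", "vpd83", "block_size", "num_blocks",
          "status", "size_bytes", "system_id", "pool_id")

# Hypothetical -t" " output line for a volume.
line = "idWqJ4lOXQ0O db_vol 60a98000646e436f 512 2097152 OK 1073741824 filer1 aggr0"

volume = dict(zip(FIELDS, line.split(" ")))
print(volume["id"])
# Sanity check: block size times block count equals the size in bytes.
assert int(volume["block_size"]) * int(volume["num_blocks"]) == int(volume["size_bytes"])
```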

So, the array ID for the volume we are interested in is idWqJ4lOXQ0O.

Quiesce the database

Before issuing the replicate command, quiesce the database. For MySQL this can be done by establishing a connection, running “FLUSH TABLES WITH READ LOCK”, and leaving the connection open.

Replicate the disk

To replicate the disk the user can issue the command (just outputting result ID for brevity):

This command creates a clone (space-efficient copy) of a disk. The “-r” indicates replicate, with the argument specifying which volume ID to replicate; “--type” is the type of replication to perform, and “--name” is the human-readable name of the copy. For more information about the available options, run “lsmcli --help” or see “man lsmcli”. The command line will return the details of the newly created disk. The output is identical to the information returned when listing the volume, as shown above. In this example we just grabbed the volume ID, as that is all we need to grant access to it in the following steps.

Release the database

Once this is done you can call “UNLOCK TABLES” or close the connection to the database.

Grant access to newly created disk

To access the newly created disk for backup, we need to grant an initiator access to it. There are two different ways to grant access to a volume for an initiator. Some arrays support groups of initiators, referred to as access groups; for other arrays you specify individual mappings from initiator to volume. To determine which mechanism the array supports, we look at the capabilities listed for the array.

To find out what capabilities an array has, we need to find the system ID:

The Ontap plug-in supports access groups. In this example, we know the initiator we want to use has the iSCSI IQN iqn.1994-05.com.domain:01.89bd03. We will look up the access group that contains the iSCSI IQN of interest.

List the access groups, looking for the IQN of interest to back up to.

At this point you need to re-scan for targets on the host; please check the documentation appropriate for your distribution. Once the disk is visible to the host, it can be mounted and backed up as usual.

This sequence of steps would be the same regardless of vendor; only the URI would be different. Other operations currently available for volumes include: delete, re-size, replicate a range of logical blocks, access group creation and modification, and a number of ways to interrogate relationships between initiators and volumes. This, coupled with a stable API, gives developers and administrators a consistent way to leverage these valuable features.

Summary

Having a consistent and reliable way to manage storage allows for the creation of new applications that can benefit from such features. Quickly provisioning a new virtual machine by replicating a disk template with very little additional disk space is one such example. Having an open-source project that can be improved, developed, and molded by a community of users will ensure the best possible solution. LibStorageMgmt is looking for contributors in all areas (e.g. users, developers, reviewers, array documentation, testing).