Kill Bill deployment guide

Overview

Best practices

Kill Bill is fundamentally a backend system, so the following considerations apply:

Don’t make the system visible to the outside world; instead, put a front-end system or a reverse proxy in front of it (e.g. if you need to handle gateway notifications like PayPal IPN, Instant Payment Notification).

Look at all the existing Kill Bill JMX metrics (using VisualVM, jconsole, etc.) and set up alerts on them (for instance via a monitoring script).

Configure your system properly, especially the settings of the bus and notification queues.

Pre- and Post-deployment checklist

Test, test, test: we work hard to make the core of the platform solid. But your deployment, with your own combination of plugins and configuration (catalog, overdue, etc.) is unique. Write business-level regression tests that you can run before each upgrade.

The timezone of your servers and databases must be UTC. Also make sure NTP is properly configured, to avoid any clock drift between instances.

Consider encrypting sensitive configuration properties, like database passwords, with Jasypt. See http://www.jasypt.org/cli.html for details on how to generate encrypted values. Example configuration:
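A minimal sketch of what this can look like (the encrypted value below is a placeholder; org.killbill.dao.password and org.killbill.billing.osgi.dao.password are Kill Bill’s standard datasource properties, and the way the decryption key is supplied at runtime depends on how Jasypt is wired into your deployment):

    # Generate an encrypted value with the Jasypt CLI
    # (PBEWithMD5AndDES is the Jasypt CLI default algorithm)
    encrypt.sh input="my-db-password" password="my-encryption-key" algorithm=PBEWithMD5AndDES

    # killbill.properties: wrap the CLI output in ENC(...)
    org.killbill.dao.password=ENC(<encrypted-value>)
    org.killbill.billing.osgi.dao.password=ENC(<encrypted-value>)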

Make sure the servers have enough entropy: /proc/sys/kernel/random/entropy_avail should be > 3k (otherwise install haveged / rng-tools). Kill Bill should also be started with -Djava.security.egd=file:/dev/./urandom.
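For example, on a Debian-based host (the package name differs on other distributions; CATALINA_OPTS assumes a Tomcat deployment):

    # Check available entropy; the value should be > 3000
    cat /proc/sys/kernel/random/entropy_avail

    # If it is low, install an entropy daemon
    apt-get install haveged

    # Start Kill Bill with a non-blocking entropy source
    export CATALINA_OPTS="$CATALINA_OPTS -Djava.security.egd=file:/dev/./urandom"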

Adjust org.killbill.security.shiroNbHashIterations as needed. This setting configures the number of iterations run to hash API secrets and user passwords. The default value is high for security reasons but can be lowered if required (e.g. for Docker, -e KILLBILL_SECURITY_SHIRO_NB_HASH_ITERATIONS=1), as the hashing can have a significant performance impact. Note that changing the value requires manually re-hashing all tenant secrets and user passwords.

Make sure your database and queue configurations are adequate: the bus_events table should almost always be empty, and the notifications table should never have any AVAILABLE entry with an effective date in the past. In both cases, a backlog means the system is running late (invoices not generated, etc.). These two metrics should always be monitored in production (potentially as a paging event).
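A sketch of the corresponding checks as plain SQL against the standard queue tables (MySQL/MariaDB syntax; both counts should normally be zero):

    -- Unprocessed bus events: should almost always be empty
    SELECT COUNT(*) FROM bus_events;

    -- Late notifications: AVAILABLE entries with an effective date in the past
    SELECT COUNT(*)
    FROM notifications
    WHERE processing_state = 'AVAILABLE'
    AND effective_date < UTC_TIMESTAMP();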

Verify the integration with your payment gateway(s): very few payment transactions (if any) should be in an UNKNOWN state. Make sure to fix these manually via the Payment Admin API, if the plugin is unable to do it automatically.

Because a lot of pieces need to be in place, and because a lot can go wrong, users are strongly encouraged to use our pre-built Docker images and Docker Compose recipes. We also maintain recipes to deploy the open-source Elastic stack and the open-source InfluxData stack integrated with Kill Bill.

Experienced users can also opt for our Ansible playbooks (Ansible is an open-source IT Automation tool), to install and configure Apache Tomcat, KPM and Kill Bill individually.

Note that both of these deployment options rely on KPM, the Kill Bill Package Manager. KPM can fetch existing (signed) artifacts for the main killbill war and for each of the plugins, and deploy them in the right place. Using KPM directly gives you the most flexibility without having to re-invent the wheel, but it significantly increases deployment complexity; direct use is recommended for our most advanced users only.

Docker

Make sure to get familiar with Docker first, before attempting to install Kill Bill. The project has lots of great docs to help you get set up.

Failover

We’ve tested various failover scenarios (Aurora RDS, master/slave MariaDB Docker setups, and master/slave Percona Server on real hardware) and confirmed that Kill Bill behaves as expected, i.e. queries in flight will fail during a failover, but reconnection is automatic.

Specifically for Aurora though, we did notice that:

Reconnection is read-only by default after the failover. jdbc:mysql:aurora: must be specified in the JDBC URL for the reconnection to be read-write (see the example URL after this list).

Triggering a failover in the RDS UI leads to a fairly short Kill Bill downtime (a few seconds). Terminating the master ("delete instance") takes longer (a few minutes); this could be mitigated with more aggressive timeouts in the JDBC pool.
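For reference, a read-write Aurora JDBC URL might look like the following (host and database names are placeholders; org.killbill.dao.url is Kill Bill’s standard datasource URL property):

    org.killbill.dao.url=jdbc:mysql:aurora://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/killbill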

Bus and Notification queues

Bus events

Notifications across Kill Bill core services rely on a proprietary event bus. There are actually two buses: the main bus, used by core services, and an external bus, used by plugins. The main reason for having two buses is that the main bus is critical for internal operations, so we want to prevent plugin code, which may interact with third-party systems, from blocking on long operations and impacting the rest of the system.

Two pairs of tables manage those bus events:

For the main bus, a bus_events and a bus_events_history table.

For the external bus, a bus_ext_events and a bus_ext_events_history table.

Events are moved from bus_events to bus_events_history as they are processed. This keeps a history of what happened in the system while preventing the bus_events table from growing too large. The bus_events_history table exists only for debugging purposes and is never read by the system.

Bus Event Modes

The bus can run in multiple modes (instanceName below is either main or external; see the configuration sketch after the list):

POLLING: the bus will poll the database for new available entries and dispatch them across the nodes.

STICKY_POLLING: the bus will poll the database for new available entries and dispatch them to the same node that created the entry.

STICKY_EVENTS (default mode): the bus behaves as a blocking queue where entries are dispatched as soon as they have been committed to disk. This is a much more efficient mechanism both in terms of latency (entries are picked up right away) and throughput (entries have no time to accumulate).
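The mode is selected through the queue configuration properties; a sketch, assuming the standard org.killbill.persistent.bus.<instanceName>.mode property pattern:

    # killbill.properties: run both buses in STICKY_POLLING mode
    org.killbill.persistent.bus.main.mode=STICKY_POLLING
    org.killbill.persistent.bus.external.mode=STICKY_POLLING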

In a cloud environment, where nodes are more prone to appear and disappear, the following choices are available:

Use the POLLING mode

Use the STICKY_EVENTS (or STICKY_POLLING) mode. In that scenario, you need to be cautious of Kill Bill instances restarting on a different node:

Each instance can be started with a specific system property, org.killbill.queue.creator.name=<MY_VIRTUAL_INSTANCE_NAME>, which overrides the creating_owner value associated with each entry (it defaults to the hostname of the machine). When using that property, an instance that restarts on a different node with the same property value will continue processing the same entries.

Alternatively, if failovers don’t occur too often, run a query to update the rows associated with the instance that failed over, so they get picked up by another node. Note that events are never lost because they are persistent, but in that case they may linger until updated. The query is the following (only shown for the bus_events table; a similar query is needed for bus_events_history):
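A sketch of that query, based on the standard queue schema (OLD_NODE and NEW_NODE are placeholders for the creating_owner values of the failed and live instances):

    -- Reassign unprocessed entries created by the failed node to a live node
    UPDATE bus_events
    SET creating_owner = 'NEW_NODE'
    WHERE creating_owner = 'OLD_NODE'
    AND processing_state = 'AVAILABLE';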

Future Notifications

Overview

In addition to the bus events, which are dispatched immediately, Kill Bill also manages future notifications. The mechanism is very similar to the POLLING mode described earlier, but the main difference is that these notifications are dispatched only once the effective_date of the notification has been reached. There is no STICKY_EVENTS mode for future notifications.

The future notifications also rely on two tables, notifications and notifications_history, and the mechanism for moving processed entries is similar to what we described for bus events.

Logging and GDPR

If you are using Tomcat, CATALINA_BASE/logs/catalina.out does not rotate. Make sure your main appender is ch.qos.logback.core.rolling.RollingFileAppender instead of the default ch.qos.logback.core.ConsoleAppender (STDOUT/STDERR are redirected to CATALINA_BASE/logs/catalina.out).
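A minimal logback.xml sketch with a time-based rolling policy (paths and retention are illustrative):

    <configuration>
      <appender name="MAIN" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/lib/tomcat/logs/killbill.out</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
          <!-- Rotate daily, keep 30 days of compressed logs -->
          <fileNamePattern>/var/lib/tomcat/logs/killbill-%d{yyyy-MM-dd}.out.gz</fileNamePattern>
          <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder>
          <pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
      </appender>
      <root level="INFO">
        <appender-ref ref="MAIN"/>
      </root>
    </configuration>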

Also make sure to install both the Felix Log bundle and the Kill Bill Log bundle in your platform directory (/var/tmp/bundles/platform by default); otherwise, OSGi logs (including those from JRuby plugins) will end up in STDOUT/STDERR (hence in CATALINA_BASE/logs/catalina.out). Both bundles are included in the defaultbundles package.

Mask PANs

Use the converter class org.killbill.billing.server.log.obfuscators.ObfuscatorConverter to mask primary account numbers (PANs) in the logs.
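The converter is registered through a logback conversion rule; a sketch (the conversion word is arbitrary, as long as the pattern uses it instead of %msg):

    <configuration>
      <conversionRule conversionWord="maskedMsg"
                      converterClass="org.killbill.billing.server.log.obfuscators.ObfuscatorConverter"/>
      <appender name="MAIN" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <!-- file and rollingPolicy as in the example above -->
        <encoder>
          <!-- %maskedMsg masks PANs that would otherwise appear in %msg -->
          <pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %maskedMsg%n</pattern>
        </encoder>
      </appender>
    </configuration>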

Service Discovery with Eureka

For easier integration into a microservice architecture, Kill Bill supports client-side service discovery via a Eureka registry. A module (disabled by default) is provided that allows Kill Bill to register with a Eureka server.

To register as a Eureka client, first add the following dependency to your profile:
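A sketch of what that dependency could look like, assuming the standard Netflix Eureka client artifact (check the exact coordinates and pick a version compatible with your Kill Bill release):

    <dependency>
      <groupId>com.netflix.eureka</groupId>
      <artifactId>eureka-client</artifactId>
      <!-- use a version compatible with your Kill Bill release -->
      <version>1.10.17</version>
    </dependency>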

Finally, make sure port 8443 is open (and exposed from the Docker containers).

X-Forwarded headers support

When org.killbill.jaxrs.location.full.url=true, Kill Bill builds Location headers using a full URL. In a typical load-balancer scenario, where the load balancer receives traffic on port 8443 and forwards it to port 8080 on the Kill Bill instances (i.e. SSL is terminated at the load balancer), you probably want the headers to return something like https://killbill-vip.mycompany.net:8443 instead of http://10.1.2.3:8080.

To do so:

Enable the RemoteIpValve in your Tomcat configuration (/var/lib/tomcat/conf/server.xml in our Docker images). Make sure to correctly configure at least the internalProxies and trustedProxies attributes for your environment; see the Tomcat docs.
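A sketch for server.xml (the proxy IP patterns are placeholders to adapt to your network; protocolHeader lets Tomcat honor X-Forwarded-Proto from the load balancer when building URLs):

    <!-- Inside the <Host> element of server.xml -->
    <Valve className="org.apache.catalina.valves.RemoteIpValve"
           internalProxies="10\.\d+\.\d+\.\d+"
           trustedProxies="192\.168\.0\.10|192\.168\.0\.11"
           remoteIpHeader="X-Forwarded-For"
           protocolHeader="X-Forwarded-Proto"
           portHeader="X-Forwarded-Port" />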