
Performance tuning of DMC

We recently developed a real-time charging mediation system (named DMC) for all our mobile data users on the Orange brand, to replace a system from an external vendor that was not performing very well. DMC will handle charging for the majority of our customers, and this blog post is about how we performance-tuned it.

We initially developed the code to meet all the business requirements, without much focus on performance. The idea was to get the functionality correct first, and then measure the performance. The high-level view of the system is shown below.

The basic charging flow consisted of the following messages.

A user profile request from the Packet Gateway (not the LTE P-GW) to determine the subscriber profile, which involves an LDAP call to the subscriber database. Sent once per session.

A service authorisation request from the Packet Gateway to get an allocation of data, called a tranche.

Subsequent service reauthorisation requests to report data usage and to request additional tranches of data.

A service stop request to indicate that the session has ended.
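To make the message flow concrete, here is a rough Python sketch (DMC itself is written in Erlang) of a per-session state tracker. The state names, the message names, the fixed tranche size, and the idea that a stop request also reports usage are all assumptions made for illustration; they are not taken from the DMC code.

```python
# Allowed next messages for each session state (assumed ordering).
TRANSITIONS = {
    "new": {"user_profile": "profiled"},
    "profiled": {"auth": "active"},
    "active": {"reauth": "active", "stop": "closed"},
}


class Session:
    """Tracks one charging session; all names here are hypothetical."""

    def __init__(self, tranche_bytes):
        self.state = "new"
        self.tranche_bytes = tranche_bytes
        self.granted = 0  # total bytes allocated so far
        self.used = 0     # total bytes reported as used

    def handle(self, message, used_bytes=0):
        try:
            self.state = TRANSITIONS[self.state][message]
        except KeyError:
            raise ValueError(f"{message!r} not allowed in state {self.state!r}")
        if message in ("reauth", "stop"):
            self.used += used_bytes             # usage report (assumed for stop)
        if message in ("auth", "reauth"):
            self.granted += self.tranche_bytes  # hand out another tranche
        return self.state
```

A session would then see `handle("user_profile")`, `handle("auth")`, any number of `handle("reauth", used_bytes=...)` calls, and finally `handle("stop", ...)`.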

DMC had to keep a record of session related information for the lifetime of the session. To simplify things, we decided not to use replicated databases for redundancy of session information.

To handle the GTP protocol, we used the excellent open-cgf library with some refactoring to meet our specific needs. For LDAP protocol support, we used the eldap application from OTP with a few enhancements (load balancing to a named pool of servers, windowing to limit number of concurrent requests on each connection, support for LDAP ExtendedRequest). One day when I get the time and figure out how to submit patches to OTP, I will contribute these enhancements.

Performance test environment

DMC test layout

Initial performance tests showed that the system handled 1000 reqs/sec without any explicit performance tuning, but beyond that we started to see timeouts on the client side. This is a testament to Erlang’s amazing performance characteristics, considering that the current vendor-supplied solution uses about 10 servers to handle about 2500 reqs/sec, along with a rack of servers for the Oracle database. I knew Erlang could do better, so the following performance “tweaks” were done.

Incrementing Counters

For every data charging transaction, a CDR has to be produced, and each CDR requires a unique sequence number. We started off using mnesia:dirty_update_counter via a gen_server (access had to be serialised because the counter had a maximum value of 1000000). This didn’t scale very well: even at moderate loads, we started to see mnesia_overload alarms. So we implemented a counter_server which prefetched blocks of counter values. For instance, if the counter started at zero, the counter_server would set the counter to 1000 in the DB, and any process requesting a new unique value in the interim would be given one from the 0-999 block. Once this block was exhausted, it would “fetch” a new block by setting the counter value in mnesia to 2000. This cut the number of mnesia writes by a factor of 1000! It also meant that if this process crashed or the system shut down, we wouldn’t lose more than 1000 counter values.
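The block-prefetching idea can be sketched roughly as follows. This is a Python illustration rather than the Erlang counter_server itself: the class name, the `store` dict standing in for the mnesia counter, and the wrap-around at the maximum value are assumptions.

```python
import threading


class BlockCounter:
    """Hands out unique sequence numbers with one persistent write per block."""

    def __init__(self, store, key="cdr_seq", block_size=1000, max_value=1_000_000):
        self.store = store            # stand-in for the persistent counter (assumed)
        self.key = key
        self.block_size = block_size
        self.max_value = max_value
        self.lock = threading.Lock()  # serialises access, like the gen_server did
        self.next_value = 0
        self.block_end = 0            # exclusive upper bound of the reserved block

    def _reserve_block(self):
        # Wrap to zero once the maximum is reached (assumed behaviour).
        start = self.store.get(self.key, 0) % self.max_value
        end = min(start + self.block_size, self.max_value)
        self.store[self.key] = end    # the only persistent write for this block
        self.next_value, self.block_end = start, end

    def next(self):
        with self.lock:
            if self.next_value >= self.block_end:
                self._reserve_block()
            value = self.next_value
            self.next_value += 1
            return value
```

After a restart, a fresh `BlockCounter` begins at the stored value, so any unused numbers in the previous block are skipped rather than reissued. That is where the "lose at most 1000 values" bound comes from.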

Writing CDRs

Initially, we used an ordered_set mnesia table with now() as the index value, and every 15 minutes a process would walk the table and write CDRs to disk. At the initial load, we started to see mnesia_overload alarms, so we decided to stop caching CDRs in mnesia and get them to disk sooner. We switched to writing CDRs to an ordered_set ets table with the {write_concurrency, true} option, and a process walked the table every 5 seconds and flushed the CDRs to disk. This worked like a charm and got rid of the mnesia_overload alarms.
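The buffer-then-flush pattern looks roughly like this in Python. It is only a sketch of the idea: a lock-protected list stands in for the ets table, and the class name, file path and line-per-CDR format are hypothetical.

```python
import threading


class CdrBuffer:
    """Collects CDR lines in memory and appends them to a file in batches."""

    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        self.pending = []  # arrival order is preserved, like the ordered table

    def add(self, cdr_line):
        with self.lock:
            self.pending.append(cdr_line)

    def flush(self):
        # Swap the list out under the lock so writers are blocked only briefly.
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            with open(self.path, "a", encoding="utf-8") as f:
                f.writelines(line + "\n" for line in batch)
        return len(batch)

    def run(self, stop_event, interval=5.0):
        # Flush every `interval` seconds until asked to stop, then once more.
        while not stop_event.wait(interval):
            self.flush()
        self.flush()
```

Running `run` in a background thread with a `threading.Event` gives the 5-second flush cycle; anything added since the last flush is lost if the process dies, which is the trade-off against persisting every CDR immediately.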

UDP packet handling

We initially had a single UDP socket taking all the traffic from the Packet Gateway. Despite experimenting with various values for the recbuf and read_packets options of gen_udp, we couldn’t push performance beyond a certain point. The only way we could improve performance was to accept packets on multiple ports.
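As an illustration of the multi-port approach, here is a small Python sketch that binds one UDP socket per port and serves each from its own thread. In DMC this was done with gen_udp sockets in Erlang; the function names and the handler signature below are hypothetical.

```python
import socket
import threading


def _serve(sock, local_port, handler):
    # Receive loop for one socket; exits if the socket is closed.
    while True:
        try:
            data, addr = sock.recvfrom(65535)
        except OSError:
            return
        handler(local_port, data, addr)


def start_udp_listeners(host, ports, handler):
    """Bind one UDP socket per port (0 means any free port) and start serving."""
    socks = []
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind((host, port))
        local_port = sock.getsockname()[1]
        threading.Thread(
            target=_serve, args=(sock, local_port, handler), daemon=True
        ).start()
        socks.append(sock)
    return socks
```

This only helps if the sender spreads its traffic across the ports, so the Packet Gateway side has to be configured accordingly.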

Results

We now have a system which consistently handles 2000 reqs/sec at 40% CPU load, running in a Solaris zone with one dedicated 2GHz quad-core processor. Put four of these into the network, two in each site, and we’ve got ourselves covered for a couple of years at least. We haven’t put this system live yet, so fingers crossed that it will perform in production as it does in test!

Tools used

We developed our own load testing tool, which basically spawned processes at a certain rate, and collected performance stats to monitor success/failure/timeout. Each process then simulated a session. We will look into using Tsung in the future. I’ve heard good things about it.
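The shape of such a tool is simple enough to sketch. The version below is a Python approximation using threads, not our Erlang tool: `session_fn` is a hypothetical callable that simulates one session and returns True on success, and a "timeout" is judged after the fact from the elapsed time.

```python
import threading
import time
from collections import Counter


def run_load(session_fn, rate_per_sec, total_sessions, timeout=2.0):
    """Start sessions at a fixed rate and count success/failure/timeout."""
    stats = Counter()
    lock = threading.Lock()

    def worker(session_id):
        started = time.monotonic()
        try:
            ok = session_fn(session_id)
            elapsed = time.monotonic() - started
            if elapsed > timeout:
                outcome = "timeout"
            else:
                outcome = "success" if ok else "failure"
        except Exception:
            outcome = "failure"
        with lock:
            stats[outcome] += 1

    threads = []
    interval = 1.0 / rate_per_sec
    next_start = time.monotonic()
    for session_id in range(total_sessions):
        delay = next_start - time.monotonic()
        if delay > 0:
            time.sleep(delay)  # pace the spawn rate
        t = threading.Thread(target=worker, args=(session_id,))
        t.start()
        threads.append(t)
        next_start += interval
    for t in threads:
        t.join()
    return stats
```

For example, `run_load(my_session, rate_per_sec=500, total_sessions=10000)` would return a `Counter` with the three outcome totals.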

The following configuration file was used to plot the above graph using gnuplot. All log entries for each request type were collated into a file dmc_load_timings.txt and separated by two blank lines to allow gnuplot to distinguish the different sections.

set terminal png size 1200,800
set xdata time
set timefmt "%Y-%m-%d_%H:%M:%S"
set output "dmc_graph.png"
set xlabel "Time"
set ylabel "Reqs/sec"
set title "DMC Traffic"
set datafile separator ","
plot 'dmc_load_timings.txt' using 2:1 index 0 title "UserProf" with lines, \
'dmc_load_timings.txt' using 2:1 index 1 title "Auth" with lines, \
'dmc_load_timings.txt' using 2:1 index 2 title "Reauth" with lines, \
'dmc_load_timings.txt' using 2:1 index 3 title "Stop" with lines, \
'dmc_load_timings.txt' using 2:1 index 4 title "Total" with lines
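For reference, the data file layout this config expects can be produced with something like the Python sketch below. The column order (rate first, timestamp second) follows from `using 2:1` together with `set xdata time`; the function name and the shape of the `samples` argument are assumptions.

```python
from datetime import datetime

# Section order must match the `index 0..4` entries in the plot command.
SECTIONS = ["UserProf", "Auth", "Reauth", "Stop", "Total"]


def format_timings(samples):
    """samples maps a section name to a list of (datetime, reqs_per_sec)."""
    blocks = []
    for name in SECTIONS:
        rows = samples.get(name, [])
        if not rows:
            # An empty block would shift gnuplot's index numbering.
            raise ValueError(f"no samples for section {name!r}")
        lines = [f"{rate},{ts.strftime('%Y-%m-%d_%H:%M:%S')}" for ts, rate in rows]
        blocks.append("\n".join(lines))
    # Two blank lines between blocks is what gnuplot uses to split indexes.
    return "\n\n\n".join(blocks) + "\n"
```

Writing the returned string to dmc_load_timings.txt gives five indexed data sets, one per request type plus the total.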