Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects the numeric time-series data you are interested in graphing and sends it to Graphite's processing backend, carbon, which stores the data in Graphite's specialized database. The data can then be visualized through Graphite's web interfaces.

Who should use Graphite?

Graphite is actually a bit of a niche application. Specifically, it is designed to handle numeric time-series data. For example, Graphite would be good at graphing stock prices because they are numbers that change over time. However, Graphite is a complex system, and if you only have a few hundred distinct things you want to graph (stock prices in the S&P 500) then Graphite is probably overkill. But if you need to graph a lot of different things (like dozens of performance metrics from thousands of servers) and you don't necessarily know the names of those things in advance (who wants to maintain such a huge configuration?) then Graphite is for you.

How scalable is Graphite?

From a CPU perspective, Graphite scales horizontally on both the frontend and the backend, meaning you can simply add more machines to the mix to get more throughput. It is also fault tolerant in the sense that losing a backend machine will cause a minimal amount of data loss (whatever that machine had cached in memory) and will not disrupt the system if you have sufficient capacity remaining to handle the load.

From an I/O perspective, under load Graphite performs lots of tiny I/O operations on lots of different files very rapidly. This is because each distinct metric sent to Graphite is stored in its own database file, similar to how many tools (drraw, Cacti, Centreon, etc) built on top of RRD work. In fact, Graphite originally did use RRD for storage until fundamental limitations arose that required a new storage engine.

High volume (a few thousand distinct metrics updating minutely) pretty much requires a good RAID array. Graphite's backend caches incoming data if the disks cannot keep up with the large number of small write operations that occur (each data point is only a few bytes, but most disks cannot do more than a few thousand I/O operations per second, even if they are tiny). When this occurs, Graphite's database engine, whisper, allows carbon to write multiple data points at once, thus increasing overall throughput only at the cost of keeping excess data cached in memory until it can be written.

^-- from http://graphite.wikidot.com/faq

What I really like about Graphite is the fact that you can push data to it, instead of using a poller, like Cacti for example.

Here is a step-by-step guide on how to install and configure Graphite on Ubuntu Server:

Graphite is composed of two components: a webapp frontend and a backend storage application (Carbon). Data collection agents connect to carbon and send their data, and carbon's job is to make that data available for real-time graphing immediately and to get it stored on disk as fast as possible.

Carbon is made up of three processes: carbon-agent.py, carbon-cache.py, and carbon-persister.py. The primary process is carbon-agent.py, which starts up the other two processes in a pipeline. Carbon-agent accepts connections and receives time-series data in the appropriate format. This data is sent through the pipeline to carbon-cache, which stores the data in a cache where data points are grouped by their associated metric. Carbon-cache constantly attempts to write the largest such group of data points down the pipeline to carbon-persister. Carbon-persister reads these data points and writes them to disk using Whisper.

The reason carbon is split into three processes is Python's threading problems. Originally carbon was a single application where these distinct functions were performed by threads, but alas, Python's GIL prevents multiple threads from actually running concurrently. Since the initial deployment of Graphite was done on a machine with lots of rather slow CPUs, we needed true concurrency for performance reasons. Thus it was split into three processes connected via pipes.
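To make the "largest group first" idea more concrete, here is a minimal sketch of a cache that groups data points by metric and hands off the biggest group. This is not carbon's actual code; the MetricCache class, its method names, and the sample metric names are all made up for illustration.

```python
from collections import defaultdict


class MetricCache:
    """Toy stand-in for carbon-cache (hypothetical, not carbon's real code)."""

    def __init__(self):
        # metric name -> list of (timestamp, value) tuples waiting to be written
        self._points = defaultdict(list)

    def store(self, metric, timestamp, value):
        """Cache one incoming data point under its metric name."""
        self._points[metric].append((timestamp, value))

    def pop_largest(self):
        """Remove and return (metric, points) for the metric with the most
        cached points, or None if the cache is empty."""
        if not self._points:
            return None
        metric = max(self._points, key=lambda name: len(self._points[name]))
        return metric, self._points.pop(metric)


cache = MetricCache()
cache.store("servers.web01.load", 1000, 0.5)
cache.store("servers.web01.load", 1060, 0.7)
cache.store("servers.db01.load", 1000, 1.2)

# The metric with the most pending points is handed off first, so a single
# write to its file can cover several data points at once.
metric, points = cache.pop_largest()
```

Writing the largest group first means each disk operation carries as many data points as possible, which is the batching behaviour described in the FAQ above.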

Graphite is built on fixed-size databases (see Whisper), so we have to configure in advance how much data we intend to store and at what level of precision. For instance, you could store your data with 1-minute precision (meaning you will have one data point for each minute) for, say, 2 hours. Additionally, you could store your data with 10-minute precision for 2 weeks, etc. The idea is that the storage cost is determined by the number of data points you want to store: the coarser your precision, the more time you can cover with fewer points.
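The arithmetic behind that trade-off is easy to check. The snippet below is just a back-of-the-envelope sketch; points_needed is a helper name I made up, not part of Graphite.

```python
def points_needed(seconds_per_point, seconds_covered):
    """Number of data points required to cover a time span at a given precision."""
    return seconds_covered // seconds_per_point


MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR
WEEK = 7 * DAY

# 1-minute precision for 2 hours
fine = points_needed(1 * MINUTE, 2 * HOUR)

# 10-minute precision for 2 weeks
coarse = points_needed(10 * MINUTE, 2 * WEEK)

print(fine, coarse)
```

The two-hour archive needs 120 points and the two-week archive needs 2016, so covering 168 times as much time costs only about 17 times as many points.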

Once you have picked your naming scheme you need to create a schema by creating/editing the /opt/graphite/conf/storage-schemas.conf file.

Let's say we want to store data with minutely precision for 30 days, then at 15-minute precision for 10 years. Here are the entries in the schemas file:

    [server_load]
    priority = 100
    pattern = ^servers\.
    retentions = 60:43200,900:350400

Basically, when carbon receives a metric, it determines where on the filesystem the whisper data file should be for that metric. If the data file does not exist, carbon knows it has to create it, but since whisper is a fixed-size database, some parameters must be determined at the time of file creation (this is the reason we're making a schema). Carbon looks at the schemas file and, in order of priority (highest to lowest), looks for the first schema whose pattern matches the metric name. If no schema matches, the default schema (2 hours of minutely data) is used. Once the appropriate schema is determined, carbon uses the retention configuration for the schema to create the whisper data file appropriately.
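The lookup described above can be sketched in a few lines. This is only an illustration of the priority-then-first-match rule, not carbon's implementation: the db_servers schema, the pick_retentions function, and the use of re.search are my own assumptions.

```python
import re

# Default from the text: 2 hours of minutely data (120 points at 60 seconds each).
DEFAULT_RETENTIONS = "60:120"

# Hypothetical schemas; "db_servers" is invented to show how priority
# lets a more specific pattern win over a broader one.
SCHEMAS = [
    {"name": "server_load", "priority": 100,
     "pattern": r"^servers\.", "retentions": "60:43200,900:350400"},
    {"name": "db_servers", "priority": 200,
     "pattern": r"^servers\.db", "retentions": "60:1440"},
]


def pick_retentions(metric, schemas=SCHEMAS):
    """Return the retentions of the highest-priority schema whose pattern
    matches the metric name, falling back to the default."""
    for schema in sorted(schemas, key=lambda s: s["priority"], reverse=True):
        # re.search is an assumption; with ^-anchored patterns it behaves
        # the same as re.match.
        if re.search(schema["pattern"], metric):
            return schema["retentions"]
    return DEFAULT_RETENTIONS
```

With these schemas, "servers.db01.load" picks up the db_servers retentions even though it also matches server_load, and a name like "apps.checkout.latency" falls through to the default.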

Now back to our schema entry. The server_load stanza is just a name for our schema; it doesn't really matter what you call it. The first parameter below that is priority. This is an integer (I usually just use 100) that tells carbon what order to evaluate the schemas in (highest to lowest). The purpose of priority is two-fold. First, it is faster to test the more commonly used schemas first. Second, priorities provide a way to have different retention for a metric name that would have matched another schema. The pattern parameter is a regular expression that is used to match a new metric name to find what schema applies to it. In our example, the pattern will match any metric that starts with "servers.". The retentions parameter is a little more complicated. Here's how it works:

retentions is a comma separated list of retention configurations. Each retention configuration is of the form seconds_per_data_point:data_points_to_store. So in our example, the first retention configuration is 60 seconds per data point (so minutely data), and we want to store 43,200 of those (43,200 minutes is 30 days). The second retention configuration is 900 seconds per data point (15 minutes), and we want to store 350,400 of those (there are 350,400 15-minute intervals in 10 years).
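If you want to double-check a retentions string before putting it in the config, a small parser does the job. parse_retentions is a throwaway helper of my own, not something Graphite ships, and it only handles the seconds_per_data_point:data_points_to_store form described above.

```python
def parse_retentions(spec):
    """Split a retentions string into (seconds_per_point, points) tuples."""
    archives = []
    for item in spec.split(","):
        seconds_per_point, points = item.strip().split(":")
        archives.append((int(seconds_per_point), int(points)))
    return archives


archives = parse_retentions("60:43200,900:350400")

for seconds_per_point, points in archives:
    # Total time covered by this archive, expressed in days.
    days = seconds_per_point * points / 86400
    print(f"{seconds_per_point}s x {points} points = {days:g} days")
```

For the example above this works out to 30 days for the first archive and 3650 days (10 years) for the second, matching the numbers in the text.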

There are all kinds of tools that gather stats and send them to Graphite - one good example is logstash. Or you can write your own script. All they need to do is send the data in the right format: "container data date" (that is, metric name, value, and timestamp), as I showed in the examples above.
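As a rough idea of what such a script could look like, here is a minimal sketch that pushes one data point over TCP. It assumes carbon is listening for plaintext lines on localhost port 2003 (the usual default, but check your carbon configuration) and that each line has the form "metric value timestamp" followed by a newline. The function names and the example metric are mine.

```python
import socket
import time


def format_line(metric, value, timestamp=None):
    """Build one plaintext line: metric name, value, Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {int(timestamp)}\n"


def send_metric(metric, value, timestamp=None, host="localhost", port=2003):
    """Open a TCP connection to carbon, send a single data point, and close.

    host and port are assumptions; adjust them to match your setup.
    """
    line = format_line(metric, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling send_metric("servers.web01.load", 0.42) from a cron job or a loop is enough to start getting points onto a graph. For anything that sends frequently you would probably keep the connection open rather than reconnecting for every data point.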