The Neo4j Manual v2.2.0-RC0111.5. Aggregation

11.5. Aggregation

Introduction

To calculate aggregated data, Cypher offers aggregation, much like SQL’s GROUP BY.

Aggregate functions take multiple input values and calculate an aggregated value from them.
Examples are avg that calculates the average of multiple numeric values, or min that finds the smallest numeric value in a set of values.

Aggregation can be done over all the matching subgraphs, or it can be further divided by introducing key values.
These are non-aggregate expressions, that are used to group the values going into the aggregate functions.

So, if the return statement looks something like this:

RETURN n, count(*)

We have two return expressions: n, and count(*).
The first, n, is no aggregate function, and so it will be the grouping key.
The latter, count(*) is an aggregate expression.
So the matching subgraphs will be divided into different buckets, depending on the grouping key.
The aggregate function will then run on these buckets, calculating the aggregate values.

If you want to use aggregations to sort your result set, the aggregation must be included in the RETURN to be used in your ORDER BY.

The last piece of the puzzle is the DISTINCT keyword.
It is used to make all values unique before running them through an aggregate function.

An example might be helpful.
In this case, we are running the query against the following data:

In this example we are trying to find all our friends of friends, and count them.
The first aggregate function, count(DISTINCT friend_of_friend), will only see a friend_of_friend once — DISTINCT removes the duplicates.
The latter aggregate function, count(friend_of_friend), might very well see the same friend_of_friend multiple times.
In this case, both B and C know D and thus D will get counted twice, when not using DISTINCT.

percentileDisc

percentileDisc calculates the percentile of a given value over a group, with a percentile from 0.0 to 1.0.
It uses a rounding method, returning the nearest value to the percentile.
For interpolated values, see percentileCont.

Query

MATCH (n:Person)
RETURN percentileDisc(n.property, 0.5)

The 50th percentile of the values in the property property is returned by the example query. In this case, 0.5 is the median, or 50th percentile.

percentileCont

percentileCont calculates the percentile of a given value over a group, with a percentile from 0.0 to 1.0.
It uses a linear interpolation method, calculating a weighted average between two values, if the desired percentile lies between them.
For nearest values using a rounding method, see percentileDisc.

Query

MATCH (n:Person)
RETURN percentileCont(n.property, 0.4)

The 40th percentile of the values in the property property is returned by the example query, calculated with a weighted average.

stdev

stdev calculates the standard deviation for a given value over a group.
It uses a standard two-pass method, with N - 1 as the denominator, and should be used when taking a sample of the population for an unbiased estimate.
When the standard variation of the entire population is being calculated, stdevp should be used.

Query

MATCH (n)
WHERE n.name IN ['A', 'B', 'C']
RETURN stdev(n.property)

The standard deviation of the values in the property property is returned by the example query.

stdevp

stdevp calculates the standard deviation for a given value over a group.
It uses a standard two-pass method, with N as the denominator, and should be used when calculating the standard deviation for an entire population.
When the standard variation of only a sample of the population is being calculated, stdev should be used.

Query

MATCH (n)
WHERE n.name IN ['A', 'B', 'C']
RETURN stdevp(n.property)

The population standard deviation of the values in the property property is returned by the example query.