Search The Web

Today's Headlines

Monday, March 22, 2010

My post on finding percentile values using Microsoft Access received a comment from an interested reader who had an interesting twist on the idea. Instead of finding percentile values for all the data in the table, the commenter wanted to find out percentile values for individual groups in the data. After some experimentation, I found a way to calculate what he/she wanted. But it led me to think about how one can calculate other aggregate statistics of one field like medians, modes, etc. while grouping by other fields in the table. This post is a result of that research.

Obviously, finding aggregate statistics like average, minimum, maximum, etc. while grouping by other fields is quite trivial in Access. These aggregate statistics have built-in aggregate functions, making their computation simply a matter of using the correct aggregate function in the query. Thus, to find the average of myField while grouping by groupField would involve a query like the one below:

select groupField, avg(myField) from myTable
group by groupField

But what happens when you want to find the median, mode or different percentile values? These don't have built-in aggregate functions, so I devoted entire posts to tricking Access to cough up these statistics using SQL constructs and tricks. It turns out that grouping while finding these kinds of aggregate statistics can be tricky at first. Once you figure out the limitations of Microsoft Access SQL, and learn how to live within them, it is not too bad.

I have provided some solutions to calculating aggregates while grouping below. I have kept these individual sections short so that I can accommodate as many of them as possible in this post. Given the links to the posts that deal with how to calculate these statistics without grouping, you should not have any problem seeing how the SQL there has been extended to accommodate grouping. Still, if you have questions or concerns, please don't hesitate to let me know with your comments.

Also note that even though I have grouped by only one field in all the queries below, you can actually group by as many fields as you want. You just have to add the additional fields to the main query in the GROUP BY clause, and add additional conditions to the subquery in the WHERE clause. In all the queries below, we are calculating aggregates on myField while grouping on groupField. Both of them are fields in myTable.

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) AS Median
FROM myTable AS T group by T.groupField

or

select T.groupField, 0.50*(select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
0.50*(select min(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS Median
from myTable AS T group by T.groupField

SELECT groupField, (select max(myField) from myTable where myField in
(SELECT top 1 myField
FROM myTable where myTable.groupField = T.groupField
GROUP BY myField
ORDER BY count(*) DESC)) as Mode
from myTable as T group by T.groupField

SELECT groupField, (select exp(avg(log(myField))) from myTable where myField in
(SELECT myField
FROM myTable where myTable.groupField = T.groupField)) as GeomMean
from myTable as T group by T.groupField

SELECT groupField, (select count(*)/sum(1.0/myField) from myTable where myField in
(SELECT myField
FROM myTable where myTable.groupField = T.groupField)) as GeomMean
from myTable as T group by T.groupField

Medians are just a special case of percentile values, the median being the 50th percentile value. Thus, the SQL for a percentile value while grouping is almost identical to that for the median. In the SQL below, PV represents the percentile value we are trying to find.

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top PV percent myField from myTable
where myTable.groupField = T.groupField order by myField)) AS PVPercentile
FROM myTable AS T group by T.groupField

or

select T.groupField, (1-PV/100)*(select max(myField) from myTable where myTable.myField in
(select top PV percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
(PV/100)*(select min(myField) from myTable where myTable.myField in
(select top (100-PV) percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS PVPercentile
from myTable AS T group by T.groupField

Thus, the 25th percentile would be found as below:

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top 25 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) AS PVPercentile
FROM myTable AS T group by T.groupField

or

select T.groupField, 0.75*(select max(myField) from myTable where myTable.myField in
(select top 25 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
0.25*(select min(myField) from myTable where myTable.myField in
(select top 75 percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS 25Percentile
from myTable AS T group by T.groupField

Notice the pattern in all the queries above: We select groupField first in the main query so that we know which group the aggregate statistic belongs to. The main query also sets an alias for myTable that can be used in the subquery. The main query also has a GROUP BY clause to group by groupField.

The second field selected in the main query is a calculated field. The calculated field is our aggregate statistic for myField. We use a correlated subquery to perform the calculation. The correlated subquery uses another subquery with an IN clause to calculate the value. The query inside the IN clause uses the value of groupField from the main query to subdivide the table into multiple temporary tables that contain only rows that have the same value of groupField as in the outer query.

The aggregate statistic we need is then calculated inside this subquery. Since the subquery has segregated the table to include only rows with the value of groupField equal to the value of groupField in the main query, we end up calculating the aggregate statistic for that group of rows.

Using this pattern, it should now be possible for you to calculate other aggregate statistics while grouping if you have the need to. If there are any that you have problems with, I would love to hear about it so that I can take it on as a challenge while helping you out at the same time. Good luck!

Hope this post has been helpful in solving any problems you might have had with aggregating with grouping in Access. If you have any problems or concerns with the SQL in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

23 comments:

James
said...

Brilliant. I'm a relative novice to SQL, and I've shown your work to a few folks at my job that know more. They say the work is thoughtful and elegant. Thanks for the help. My one question: does T in the above represent table name? You refer to mytable explicitly in other places....

In the above queries, T is an alias for myTable in the outer query. You will find the line "from myTable as T" at the end of each query. That is where T is defined. Essentially, for the grouping, we pick each grouped value from the table in the outer query, and then calculate statistics for it in the inner query using a WHERE clause to limit values chosen by the inner query to be corresponding to the grouped value in the outer query. And you need an alias for myTable in the inner query to do this selection. Hence the definition of T in the outer query, and then its use in the inner queries. Hope that makes sense.

Hello!First of all: Thank you very much for this blog and your posts about access, it helped me a lot in the last days.Now I stuck with above solution of finding percentile in grouped data. When I test it with a regular table it works quite fine but when I use a query/view instead it says something like " " is not part of the aggregate function. " Any ideas on this?

Unfortunately, without looking at the query, it is impossible for me to figure out what is wrong with it. Cut the query down as much as possible to include only the absolute minimum necessary to generate the error. Just the process of doing this might give you a solution to the problem.

This page was very helpful. Could you provide an example using your second option for Median that would add a second grouping field (groupField2). I understand adding it in the GROUP BY clause in the main query but cannot get the WHERE clause to work with the added condition.Thank you.

That error usually tells me that there is some spelling error or something like that where the variables in the group by clause are not the same as the variables in the select clause. Check them thoroughly (copy and paste them from one place to another to eliminate such errors), and see if the problem is still there. If the problem persists, post the query here and I will also take a look at it to see if I can help.

Pasted below is the query that I am working with. It works with grouping on one field (Interval) but I can't seem to make it work once I add the second group field (Timeframe). Thanks for your help. Here is the SQL:

Is [Calc Query 1] a table or another query? If it is another query, have you checked whether that query runs fine by itself?

I created a mock database with a table called "Calc Query 1", and created fields in it called "Interval", "Duration" and "Timeframe" (all doubles). I copied and pasted your query into the SQL view of a new query in this database, and it works perfectly fine (no syntax errors. I don't have any data in the table, so I have no idea what output it would produce, but logically, I think it would produce the median if there were data in the table).

Do you have any recommendations on making this query run quicker (i.e., more efficiently)? I have a rather sizeable table (~165,000 records) that I need to calculate 95th percentile values grouped by a field (for which there are about 1,400 unique values). I tried indexing the table, on both groupField and myField (both with duplicates OK) and the query has been running for 30+ minutes and doesn't show signs of stopping. Eventually, I want to run this statistic on a much larger table (upwards of 25x larger). I've seen other ways of calculating percentile values using a custom function...any thoughts on whether that might be quicker? Any insight that you can give would be most helpful!

1)In your text above for Percentil Values, the "or" section, shouldn't the second "select top" statement (the one associated with "select min(myField)") be "select top (100-PV) percent? Note that the same comment goes to your example of the 25th percentile value.

2)I'm using Access 2010, and trying to get a 95th percentile value. Access is returning the same record for the max of "SELECT TOP 95 PERCENT..." as it does for "SELECT TOP 5 PERCENT...order...DESC". I created a test table that only had one of my unique values, and thus only had 114 records. A "SELECT TOP 95 PERCENT..." query returns 109 records while a "SELECT TOP 5 PERCENT...order...DESC" returns 6 records. The 109th record of the ascending sort is the exact same record as the 6th record in the descending sort. I would have thought that the "SELECT TOP 95 PERCENT" query would only return 108 records (0.95*114 = 108.3).

1) The logic for how to weight the two values when calculating percentile values is explained in the post on percentile values: http://blogannath.blogspot.com/2010/01/microsoft-access-tips-tricks-percentile_25.html

2) Access rounds up the number of records it returns to the next higher integer. If it did not do that, you would never be able to find the median (or any percentile values) when the table contains a single record, for instance.

As far as speed is concerned, I have posted a VBA solution to the problem of finding percentile values in the post on percentile values: http://blogannath.blogspot.com/2010/01/microsoft-access-tips-tricks-percentile_25.html

You have to modify that VBA code to find percentiles with grouping. I think it may be faster than a query, but Access is not the best solution when you have a lot of records in a table. You might be better off using an actual enterprise RDBMS product like SQL server or a free solution like MySQL.

1) Yes, I read the post on percentile values first, and the example that you give there says "select 0.90*(SELECT MAX(myField)FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)) +0.10*(SELECT MIN(myField)FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc))from myTable". I note that the "MAX(myField)" is associated with the "SELECT top 10 PERCENT" and is multiplied by 0.90, while the "MIN(myField)" is associated with the "SELECT top 90 PERCENT" and is multiplied by 0.10.

However, in the code you give on this page, both of the "SELECT TOP" statements say "SELECT TOP PV PERCENT". My point was that, following the logic on the post on percentile values, the second one should read "SELECT TOP (100-PV) PERCENT".

2) If Access is rounding up the number of records, it will always return the same record for the "SELECT TOP PV PERCENT" as the "SELECT TOP (100-PV) PERCENT...ORDER...DESC", so there is no need to weight one by PV% and the other by (100-PV)%.

3) Thanks for pointing me to the VB code. I'll give that a read and post further comments if I have any questions.

1) Ah, got it. I was looking for the mistake in the wrong place, and I misread your original comment. The text has been corrected, thanks for pointing out the error.

2)Not necessarily. In the case of 114 records, it just worked out that way. If there were 4 records in the table, and you were trying to find the median, one part of the query would pull up the second record and the other would pull up the 3rd record, and you have to average them to get the correct median. Similarly, if you have 100 records, and you are trying to find the 95th percentile, you will pull up the 95th and 96th records with the two queries and the results have to be weighted appropriately to get the correct value.

In general it is better to assume that two different records will be pulled up rather than rely on them being identical.

1) Don't forget to also fix the example of the 25th percentile (the second "select top" should be 75 instead of 25).

2) The "TOP PV PERCENT" and "TOP (100-PV) PERCENT" selects only pull up two different records in certain specific instances: when the number of records is a multiple of either (100/PV) or (100/(100-PV)). (In your two examples, 4 is a multiple of 2 (100/50) and 100 is a multiple of 20 (100/5). When you're just looking for medians (PV=50), that's fine, because either the number your looking for is a record, or it's between two records. However, if PV is anything other than 50, the technique doesn't appear to give you an accurate percentile value. For example, I created myTable with two fields, myGroup and myField and created 100 records, with myField being 1 through 100, and myGroup being all "A". I then created a query as:

SELECT T.myGroup, (1-95/100)*(select max([myField]) from myTable where myTable.myField in (select top 95 percent myField from myTable where myTable.myGroup = T.myGroup order by myField))+(95/100)*(select min([myField]) from myTable where myTable.myField in (select top 5 percent myField from myTable where myTable.myGroup = T.myGroup order by myField desc)) AS Expr1 FROM myTable AS T GROUP BY T.myGroup;

For exactly 100 records, this gives a result of 95.95. However, if I add a 101th record (myField of 101), the same query gives the result of 96. Adding 102 give 97, 103 gives 98, and so on until 119 records results in a value of 114. However, adding the 120th record gives a value of 114.95. I'm not really sure this is the behavior that is expected.

3) I was able to successfully modify your DMedian code to return percentiles, but I decided to use the method of "linear interpolation between closest ranks" (see http://en.wikipedia.org/wiki/Percentile). I'll paste the code in the next comment. The query that I created is as follows:SELECT myTable.myGroup, DPercentile(95,"myField","myTable","myGroup = '" & [myGroup] & "'") AS _95PercentileFROM myTableGROUP BY myTable.myGroup;

This works MUCH MUCH faster than using the nested SELECT TOP PV PERCENT technique, even with relatively small datasets.

If numberOfRecords = 0 Then DPercentile = 0'We assume that the percentile is 0 when the number of records is zeroElseIf numberOfRecords = 1 Then DPercentile = rst(expr)'If the number of records is 1, the value of the expression in that record is the percentileElse trueIndex = (PV / 100 * numberOfRecords) + 0.5 lowIndex = Int(trueIndex) 'lowIndex now points to the position below (or at) the correct percentile value. rst.MoveFirst rst.Move (lowIndex - 1) DPercentile = rst(expr) 'if the number of records is odd, we are done If lowIndex <> trueIndex Then 'the percentile value doesn't fall exactly on a record rst.MoveNext DPercentile = DPercentile + (trueIndex - lowIndex) * (rst(expr) - DPercentile) 'Do the linear interpolation between the two values. End IfEnd Ifrst.CloseSet rst = Nothingdbs.CloseSet dbs = Nothing'Cleanup everything before leaving the functionEnd Function

Sorry about the delay in responding. Vacation, then computer problems, you know the drill.

1) Thanks for pointing that out. It is now fixed.

2) I am not disagreeing with you. But as long as at least one case exists where the two queries pull up different records, I need to take that possibility into account. I don't want to write a query that will work 95 or 99% of the time, and give the wrong results the rest of the time. The way things are written now, it will give you the correct answers 100% of the time.

I have a data set with 600k records. My groupField has range 0 to 50 in steps by 0.1 (500 values in total). I have finished analysis using partition function and average on myField like this:SELECT Partition([groupField],0,50,1) AS Range, Avg(myTable.myField) AS AvgOfmyFieldFROM myTable GROUP BY Partition([groupField],0,50,1);Output was 50 rows with two fields, Range and AvgOfmyField, like this: 0:0 nnn, 1:1 nnn, itd.Range 0:0 covers range from 0.0 to 0.5, Range 1:1 covers range from 0.6 to 1.4, 2:2 range 1.5 to 2.5, 3:3 range 2.6 to 3.4 itd. A little bit strange and uneven ranges but let say ok.Similar result I have with pivot table with groupField in RawArea and myField in DataArea with Average. Here I have also rows 0 to 50 automatically grouped, but range for row 0 is from 0.0 to 0.9, row 1 range is from 1.0 to 1.9 itd.These two results differ because of different ranges of bins, but both would be acceptable for me.My main problem is how to calculate median (and percentile) instead of average on the same distribution (histogram). I can apply your solution directly grouping on my groupField but this will produce 500 bins which is too fine resolution. Some of bins at the end of the scale will hold too few samples to be statistically confident, so I prefer to stick with grouping by 1.I’m thinking of adding new calculated field in myTable by which I can group.Databases are not my primary field, but problems like this I often encounter so I’m looking for simplest solution. I believe you will have some advice.

Thank you for your detailed explanation of the problem you are facing. I can think of a few solutions: 1. Have you looked at the article on histograms (http://blogannath.blogspot.com/2009/12/microsoft-access-tips-tricks-histograms.html)? You may be able to replace the count() in that post with a subquery that calculates medians. (2) You can create a calculated field in the table to group by and then use the concepts in this post to calculate medians. (3) You can use your query above as a subquery (it already has a calculated field called RANGE) in a median query and group by the RANGE field in the outer query or (4) you can create a temporary table using the results of your histogram query, and then calculate medians out of this temporary table.

I like all of the solutions better than (2) because including calculated fields in a database is very error-prone (the calculations have to be redone as soon as any data in the table changes, none of the calculations update automatically, etc.). But it may be acceptable if you are going to use the table once and then you have a new data feed the next time you need to do this or something like that.

Search The Web

Made Possible By The Generous Donations Of Readers Like You

If you like the content of this blog, please consider a donation to keep it going. This blog is also supported by my sponsors to whom I am ever so grateful. You can visit my sponsors' websites without leaving this page by holding down the "Ctrl" key while clicking on the sponsors' messages. Thank you!

Read This Blog In Your Language

Double-click on any word in this blog to get a dictionary definition of that word in a new window, courtesy of Free Online Dictionary. You can also look up any word (including words in several languages other than English) using the Dictionary Lookup element below.

Jump To A Random Post

Popular Pages

Recommended Reading

Live Traffic Feed

Live Traffic Map

Visitor Countries

Visitor States

Locate IP Address

Blog Exchange

Website Tracking and Monitoring

Charity Banners

The banners below are not from sponsors or advertisers. These are banners I choose to display on my blog's sidbar to show my support for these charitable organizations and their causes.

If you choose to donate to any of these organizations by clicking on the banners below, such donations DO NOT go through me. I do not handle the donations or get a cut of them. They go in full directly to the organization whose banner you clicked. Thank you for your support of these worthy organizations and causes.

If you have a blog or other website, considering hosting such banners supporting your favorite charities and other causes. These organizations can use all the publicity they can get. Thank you.

Help in the fight against hunger

Support The Red Cross

Wall Against Hunger

Red Cross Safety Classes

Human Rescue Plan

Red Cross Racing

Teaching Hunger?

Make Every Day Earth Day

Red Cross: Change a Life

WFP On The Road

WFP In The Horn Of Africa

Red Cross: Donate Blood

WFP School Meals

Red Cross: Give Life

Disclaimer

This blog features reviews of several products and services. I have not received any compensation for most of these reviews. Some reviews have been paid for by the provider of the products or services. However, all the reviews are completely honest, and reflect my true opinion about the product or service being reviewed.