I need to rank/order a set of data by date and period type (blocktype) for groups of worker records. Where the blocktype is 'Ind(ividual)', I need to increment the rank by 1. Where the type is 'Cont(inuous)', I don't want to increment the rank until the next 'Ind' record. This is so I can then get the min/max start/end date for that rank number.

I'm not sure if that's clear, but it's really obvious what I need to do with the attached SQL below (please see 'Desired' column). I'm guessing rank or dense_rank is the way forward, but I'm not sure what combination of columns to partition/order by?

So you want to keep a count, but only count rows where either:
(a) blocktype = 'ind' or
(b) blocktype on the previous row for the same pid was 'ind'. (I assume "previous" means the previous row in order of startdate.)

Here's one way:

WITH got_prev_blocktype AS
(
    SELECT  m.*
    ,       LAG (blocktype) OVER ( PARTITION BY  pid
                                   ORDER BY      startdate
                                 ) AS prev_blocktype
    FROM    mytable  m
--  WHERE   ...        -- If you need any filtering, this is where it goes
)
SELECT    p.*          -- Or list all columns except prev_blocktype
,         COUNT ( CASE
                      WHEN  'ind' IN (blocktype, prev_blocktype)
                      THEN  1
                  END
                ) OVER ( PARTITION BY  pid
                         ORDER BY      startdate
                       ) AS cnt
FROM      got_prev_blocktype  p
ORDER BY  pid
,         startdate
;

You can't nest analytic functions, so we need a sub-query to compute the analytic LAG value, which we then use in the analytic COUNT function.
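If it helps to see the two layers in isolation, here is a small procedural sketch of the same logic in Python (the rows and values are made up; within each pid, rows are assumed already sorted by startdate):

```python
# Sketch of the two-step logic: first compute prev_blocktype (the LAG step),
# then keep a running count of rows where 'ind' is either the current or the
# previous blocktype (the conditional COUNT step). Data below is made up.
rows = [
    {"pid": 1, "startdate": 1, "blocktype": "ind"},
    {"pid": 1, "startdate": 2, "blocktype": "cont"},
    {"pid": 1, "startdate": 3, "blocktype": "cont"},
    {"pid": 1, "startdate": 4, "blocktype": "ind"},
]

prev = {}       # last blocktype seen per pid (plays the role of LAG)
running = {}    # cumulative count per pid (plays the role of COUNT ... ORDER BY)
for r in rows:
    r["prev_blocktype"] = prev.get(r["pid"])
    if "ind" in (r["blocktype"], r["prev_blocktype"]):
        running[r["pid"]] = running.get(r["pid"], 0) + 1
    r["cnt"] = running.get(r["pid"], 0)
    prev[r["pid"]] = r["blocktype"]

print([r["cnt"] for r in rows])   # → [1, 2, 2, 3]
```

The second 'cont' row keeps the previous count because neither it nor the row before it is 'ind', which is exactly the "don't increment until the next 'Ind'" behaviour asked for.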

Thanks for posting the CREATE TABLE and INSERT statements! That's very helpful.

No, it's me that needs to thank you! I've just checked it out and it looks like it's working well. I didn't think about using lag a second time. I actually already used the lead function to create the existing cont/ind flag to see if the next start date is equal to the current row's end date.

If you don't mind me asking, how does the count() over (partition by) part of the syntax work? I don't quite get how it increments the value over the group (worker pid). I understand it only counts if 'ind' is equal to the row's value in both the block type and prev_blocktype columns but not sure how it keeps the cumulative total from one row to the next?

user9363122 wrote:
... If you don't mind me asking, how does the count() over (partition by) part of the syntax work? I don't quite get how it increments the value over the group (worker pid). I understand it only counts if 'ind' is equal to the row's value in both the block type and prev_blocktype columns

Actually, it counts if 'ind' is equal to either blocktype or prev_blocktype. 'ind' doesn't have to be equal to both.

but not sure how it keeps the cumulative total from one row to the next?

Remember (or look up) how the analytic ORDER BY clause works.
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#i97469

COUNT (x) OVER (PARTITION BY pid) tells how many rows with this pid have a non-NULL x.
COUNT (x) OVER (PARTITION BY pid ORDER BY startdate) tells how many rows with this pid and startdate less than or equal to this row's startdate have a non-NULL x.
Without an ORDER BY clause, the function operates over the whole partition.
With an ORDER BY clause, the function operates over only part of the partition. That's how it gets the cumulative COUNT.
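In procedural terms (a made-up list of x values for a single pid, already in startdate order), the two forms behave like this:

```python
# With no ORDER BY, the window is the whole partition, so every row sees
# one grand total. With ORDER BY, the window is only the rows up to and
# including the current one, which is what makes the count cumulative.
x = [1, None, 1, 1, None]          # a non-NULL x marks a counted row

total = sum(v is not None for v in x)   # like COUNT(x) OVER (PARTITION BY pid)

cumulative = []                          # like COUNT(x) OVER (... ORDER BY startdate)
seen = 0
for v in x:
    if v is not None:
        seen += 1
    cumulative.append(seen)

print(total, cumulative)   # → 3 [1, 1, 2, 3, 3]
```

Every row in the partition gets the same `total`, but each row's `cumulative` value only reflects the rows at or before it, which is how the query keeps a running count from one row to the next.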

Now we have a block of 2 rows (with startdates 26/11/2007 and 25/04/2008) followed immediately by a block of 5 rows (with startdates between 01/04/2009 and 19/11/2010). Does that mean all 7 rows would have blocktype='cont', and therefore they would all have the same desired value, even though there's a gap (from 28/08/2008 to 01/04/2009) in the middle of that section?

You probably don't need blocktype to compute desired. You can probably compute desired from startdate and enddate, using only 1 sub-query.

I'm actually really glad you picked up on that, good call. I was just reviewing my input data and yes, there's a flaw to my logic.

If you have time, can you have a look at the create/insert statement below, please?

Again, the desired column is what I'm trying to arrive at. This is so I can get the min/max start/end dates of the consecutive blocks where the end date of the current row = the start date of the next row.

WITH got_grp_date AS
(
    SELECT  mytable2.*
    ,       enddate - SUM (enddate - startdate) OVER ( PARTITION BY  pid
                                                       ORDER BY      enddate
                                                     ) AS grp_date
    FROM    mytable2
)
SELECT    got_grp_date.*     -- After development, list all columns except grp_date
,         DENSE_RANK () OVER ( PARTITION BY  pid
                               ORDER BY      grp_date
                             ) AS d
FROM      got_grp_date
ORDER BY  pid
,         startdate
;

Whenever you have a problem, it really helps if you explain what you're doing. Keep posting the results you want, like you've always done, but also describe what those results are. Things that are "really obvious" to you might not be quite so obvious to people who aren't familiar with your data or your business requirements.
For example, a description of this problem (as I understand it) might be: "The desired column is the number of continuous blocks of time. When enddate of one row equals startdate of the next row (for the same pid), then those two rows are part of the same block, and desired will be the same. When enddate of one row is less than startdate of the next row (for the same pid), then they are in different blocks, and desired for the earlier row will be 1 less than desired of the later row."
A more mathematical, and equally good, way of describing the same results would be: "For each pid, the row with the earliest startdate has desired=1. On other rows, let N be desired from the previous row (in order by startdate) for the same pid. If startdate on a given row is equal to enddate of the previous row, then desired on the given row is N. If startdate on a given row is later than enddate on the previous row, then desired on the given row is N+1."
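That mathematical description translates almost line for line into a procedural sketch (Python, with made-up integer dates for a single pid; for several pids you would reset the state whenever pid changes):

```python
# A direct rendering of the rule above: the first row gets desired=1;
# a row whose startdate equals the previous enddate keeps N; a gap
# bumps it to N+1. Rows are assumed sorted by startdate.
rows = [
    {"pid": 1, "startdate": 1,  "enddate": 5},
    {"pid": 1, "startdate": 5,  "enddate": 9},    # contiguous: same block
    {"pid": 1, "startdate": 12, "enddate": 15},   # gap: new block
]

prev_end = None
n = 0
for r in rows:
    if prev_end is None:
        n = 1                                 # earliest startdate for this pid
    elif r["startdate"] > prev_end:
        n += 1                                # gap before this row
    r["desired"] = n
    prev_end = r["enddate"]

print([r["desired"] for r in rows])   # → [1, 1, 2]
```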
I'm assuming that the combination (pid, startdate) is unique, and that, within a pid, rows will never overlap. That is, if startdate on row A is less than startdate on row B, then enddate on row A must be less than or equal to startdate on row B.
Results from the query above:

Thanks for that and letting me know how to make things easier for you to interpret. I've just run through a handful (of about 40,000 records) and it looks like it's working like a dream. Yes, your assumptions are correct. The dataset is about children in education who may move class but remain in a continuous period of education. They can't be in 2 classes at once, so the end date of row N must be less than or equal to the start date of row N+1. Therefore, the start date of row N <= the start date of row N+1.

I like your approach to this. I would never have considered using dates to 'floor' the range - is that correct? - to identify contiguous blocks. Sorry to ask this again, but would you mind describing how that works? It's fascinating.

Thanks for that and letting me know how to make things easier for you to interpret.

I don't think it's just me. I think most people would prefer to talk about "when a class begins on the same day the child's last class ended" rather than "when startdate on row N equals enddate on row N-1 for the same pid". Don't you find that easier to say, to hear, and to understand?

I've just run through a handful (of about 40,000 records) and it looks like it's working like a dream. Yes, your assumptions are correct. The dataset is about children in education who may move class but remain in a continuous period of education. They can't be in 2 classes at once, so the end date of row N must be less than or equal to the start date of row N+1. Therefore, the start date of row N <= the start date of row N+1.

I like your approach to this. I would never have considered using dates to 'floor' the range - is that correct? - to identify contiguous blocks.

Exactly! The values in grp_date identify contiguous blocks. Those values don't mean much in absolute terms; they only have meaning relative to other values of grp_date. Equal grp_dates mean the same block. A later grp_date means a later block.

Sorry to ask this again, but would you mind describing how that works?

To give a hint as to how this was working, I posted a query that displayed grp_date, not because you need to show it, but because the query needs to use it. Let's make it even clearer by showing some more expressions, just because they are important in computing grp_date. Grp_date is defined as "enddate - SUM (x)". We're already displaying enddate; let's display SUM (x) (which I called diff_so_far below) as well. In case that's not really obvious, let's show x itself (diff in the query below).

WITH got_grp_date AS
(
    SELECT  mytable2.*
    ,       enddate - SUM (enddate - startdate) OVER ( PARTITION BY  pid
                                                       ORDER BY      enddate
                                                     ) AS grp_date
    ,       SUM (enddate - startdate) OVER ( PARTITION BY  pid
                                             ORDER BY      enddate
                                           ) AS diff_so_far    -- For debugging only
    ,       (enddate - startdate) AS diff                      -- For debugging only
    FROM    mytable2
)
SELECT    got_grp_date.*     -- After development, list all columns except grp_date
,         DENSE_RANK () OVER ( PARTITION BY  pid
                               ORDER BY      grp_date
                             ) AS d
FROM      got_grp_date
ORDER BY  pid
,         startdate
;

What are we trying to do in this problem? We're trying to derive the last column, d, which is a "block number" (1, 2, 3, ...) for each child, where a block is a series of classes without any gaps between them; that is, one class begins exactly when that child's previous class ended. DENSE_RANK is an obvious choice for computing d, since it assigns numbers (1, 2, 3, ...) in a different series for each partition, repeats numbers when there is a tie in the ordering, and never skips a number. The problem is, what can we use for an ORDER BY expression in DENSE_RANK? We need something that uniquely identifies each block: something that will be the same for row A and row B when those two rows are in the same block, and something that is less for row A than for row B if row A belongs in an earlier block than row B. That something in the query above is grp_date.
This is an example of a Fixed Difference problem. When two rows belong in the same block, they do not necessarily have any column of actual data in common (except pid, of course), but what they do have in common is the difference between two expressions, enddate and diff_so_far in this example. Imagine that, when a child begins his first class, we start a stopwatch ticking, and that stopwatch continues to run as long as the child is in a class. If the child stops taking classes, then the stopwatch stops. If the child starts taking classes again, then the stopwatch resumes from the point where it left off. The difference between the real time and the stopwatch time will be fixed (or constant) as long as the child stays in some class. When the child takes a break, the difference will increase. If the child resumes classes, the stopwatch with start again, and the difference between real time and the stopwatch time will stay at the new level.
Let's see how this applies to your sample data. This child started taking classes on December 11, 2008. The first class ended on December 24, 2008, that is, 13 days later. The stopwatch stopped, but immediately started again, because that child started another class on December 24, and that class continued until May 22, 2009, or 149 days later. Again the stopwatch stopped, but immediately started again (you might think of it as staying stopped for 0 days) because the child started another class, which lasted for 187 days. On November 25, 2009, the stopwatch stopped again, and this time it stayed stopped for a while, because the child did not immediately start another class. As of November 25, 2009, the child had been in class a total of 13+149+187 = 349 days. (This is the diff_so_far column.) At any point so far, if you subtract the number of days that child had been in class from the current date, you would get the same result: December 11, 2008. (This is the grp_date column.)
Now let's see what happens when the child returns to school on May 4, 2010. The stopwatch starts ticking again, starting from the point where it left off, which was 349 days. When that class ends, 44 days later, the stopwatch reads 349 + 44 = 393. At that point, the child has had as much schooling as he would have had if he had started classes 393 days earlier and never stopped. 393 days before the end of that class (June 17, 2010) happens to be May 20, 2009. Not that anything special happened on May 20, 2009; it just happens to be a date that reflects how long the child has been in class.
The child's next class begins immediately, and when it ends 32 days later (on July 19, 2010) the child now has 425 days of schooling to his credit, as if he had started school on May 20, 2009. As long as the child keeps taking classes without a break, it will be as if the child had started on May 20, 2009. As long as the child stays in class, the difference between the current date (enddate) and that theoretical point when the child could have started, had he not taken any breaks, will stay fixed. Every day the child stays in class adds 1 day to the current date, and also adds 1 day to diff_so_far, so the difference remains the same as long as the child stays in class.
If the child takes a break, then, when he resumes, the difference will have increased, but then, as long as the child keeps taking classes, the difference won't get any greater.
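For anyone who wants to follow the stopwatch with pencil and paper, here is a small Python sketch of the same arithmetic, using day numbers instead of dates (the values are made up but follow the 13-, 149-, 187- and 44-day story above, for a single pid):

```python
# Fixed Difference in miniature: grp_date = enddate - days_in_class_so_far.
# Equal grp_date values mark one contiguous block; a DENSE_RANK over the
# distinct grp_date values then numbers the blocks 1, 2, 3, ...
rows = [
    {"start": 0,   "end": 13},    # 13 days in class
    {"start": 13,  "end": 162},   # contiguous: 149 more days
    {"start": 162, "end": 349},   # contiguous: 187 more days
    {"start": 509, "end": 553},   # a gap, then 44 more days
]

diff_so_far = 0
for r in rows:
    diff_so_far += r["end"] - r["start"]
    r["grp_date"] = r["end"] - diff_so_far    # fixed within a block

# DENSE_RANK: distinct grp_date values, in order, numbered from 1
ranks = {g: i + 1 for i, g in enumerate(sorted({r["grp_date"] for r in rows}))}
for r in rows:
    r["d"] = ranks[r["grp_date"]]

print([(r["grp_date"], r["d"]) for r in rows])
# → [(0, 1), (0, 1), (0, 1), (160, 2)]
```

The first three rows all land on grp_date 0 (the "theoretical start" never moves while the stopwatch keeps running); the gap before the last row pushes its grp_date up to 160, so DENSE_RANK puts it in block 2.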

It's fascinating.

If you think so, you might be interested in another example of the Fixed Difference technique, with another long-winded explanation. If you feel you understand how the query above solves this problem, then see if you can solve the problem in the thread below without reading the posted solution: Analytic Question with lag and lead
If you don't think you understand the solution to your problem, then read the solution to this other problem for another look at how the Fixed Difference technique works. I think this other example is a little easier because it doesn't use a DATE column in such an un-intuitive way.

Thanks for posting such a comprehensive explanation of your workings - that's really great and makes it really clear how you achieved it. Have you ever thought about writing a book (if you haven't already!) - it would probably be the best-selling SQL book ever! I know I'm only an amateur compared to you guys, but this kind of problem really shows how versatile SQL is - i.e. so much more than just 'select from where'.

Do you mind if I ask you a further question on this? A bit of a variation on a theme....

I have a similar dataset, this time with a start code indicating the start of a child's period of education and an end code indicating the end of a child's period of education. This is still dependent on a contiguous block of dates, with the start code and the end code acting as book ends for the period.

In this case, I need to get the start date and end date for the child's period of education where the start code is in (S, B) and the end code != X.

However, a child may have more than one S,B code (S means start of new period, B means both change of class and start of new period) and therefore more than one E (End) code. Where the End Date/code is null, the period is still ongoing. This could be substituted for sysdate+1/'E' though.

Therefore I need to assign a dense_rank based on the first start code to the first end code, the second start code to the second end code, the third start code to the third end code etc ordered by start date.

It doesn't have to be a DENSE_RANK, does it? As long as you get the right results, you don't care if DENSE_RANK or some other function produces those results, right?

based on the first start code to the first end code, the second start code to the second end code, the third start code to the third end code etc ordered by start date.
... I was thinking it might be possible using row_number()?

Good thought, but there might be even better ways of doing it.
The analytic ROW_NUMBER function counts every row.
The analytic COUNT function counts only certain rows. In the query below, it only counts rows where startcode IN ('B', 'S'):

SELECT    m.*
,         COUNT ( CASE
                      WHEN  startcode IN ('B', 'S')
                      THEN  1
                      ELSE  NULL   -- This is the default, but you can explicitly say it if you want to
                  END
                ) OVER ( PARTITION BY  pid
                         ORDER BY      startdate
                       ) AS period
FROM      mytable3  m
ORDER BY  pid
,         startdate
;

To find the startdate and enddate of the whole period, do this in a sub-query, and then use the analytic MIN and MAX functions, like this:

WITH got_period AS
(
    SELECT  m.*
    ,       COUNT ( CASE
                        WHEN  startcode IN ('B', 'S')
                        THEN  1
                        ELSE  NULL   -- This is the default, but you can explicitly say it if you want to
                    END
                  ) OVER ( PARTITION BY  pid
                           ORDER BY      startdate
                         ) AS period
    FROM    mytable3  m
)
SELECT    p.*
,         MIN (startdate) OVER (PARTITION BY pid, period) AS period_startdate
,         MAX (enddate)   OVER (PARTITION BY pid, period) AS period_enddate
FROM      got_period  p
ORDER BY  pid
,         startdate
;
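As a sanity check on the two-layer idea, here is a small Python sketch (made-up rows for one pid, sorted by startdate, with dates as integers; 'C' is a hypothetical class-change code that is neither a start nor an end):

```python
# Sketch of the two-layer query: a running count of start rows (codes
# 'B' or 'S') defines the period number; then min/max per (pid, period)
# gives the bookend dates for each period of education.
rows = [
    {"startcode": "S", "startdate": 1,  "enddate": 4},
    {"startcode": "C", "startdate": 4,  "enddate": 9},    # same period
    {"startcode": "B", "startdate": 20, "enddate": 25},   # new period
]

period = 0
for r in rows:
    if r["startcode"] in ("B", "S"):
        period += 1                   # the conditional running COUNT
    r["period"] = period

bounds = {}                           # period -> (min startdate, max enddate)
for r in rows:
    lo, hi = bounds.get(r["period"], (r["startdate"], r["enddate"]))
    bounds[r["period"]] = (min(lo, r["startdate"]), max(hi, r["enddate"]))

print(bounds)   # → {1: (1, 9), 2: (20, 25)}
```

The 'C' row inherits period 1 because the count only ticks on 'B' or 'S' rows, so its dates fold into period 1's min/max.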

Thanks for getting back to me so quickly. Sorry, you're right. In the amended table I posted period 2 should have had an end code and not an ongoing code. You've got such a keen eye.

Ah excellent, that looks like it's working a treat. Kind of a bit annoyed I didn't think of it earlier, especially as you posted a similar approach when you first started helping me.

As for why we need to use a sub-query, I'm not sure, but I think it's because grouping functions are executed ahead of the SELECT statement, which would mean it would try to find the min/max of the group before the analytic function had produced the value for the aggregate function to group on? Would love to know if I'm wrong though.

Tiny Penguin wrote:
... As for why we need to use a sub-query, I'm not sure, but I think it's because grouping functions are executed ahead of the SELECT statement, which would mean it would try to find the min/max of the group before the analytic function had produced the value for the aggregate function to group on? Would love to know if I'm wrong though.

You're on the right track. What you said about aggregate functions and analytic functions is true: the GROUP BY clause (if there is one) is applied and aggregate functions are computed before analytic functions are computed, so the only way to GROUP BY the results of an analytic function is to compute the analytic function first, in a sub-query. In the query above, however, there are no aggregate functions, only analytic functions. Analytic functions can't be nested, or depend on one another, so we can't say "MIN (startdate) OVER (PARTITION BY period)" if period is another analytic function.