As previously explained, the devices turn ON and OFF and this time I wish to extract a specific sequence:

show all new ON status records that aren't duplicate (same device twice in a row)

show appropriate OFF status from currently ON device

The closest I could get is this:

SELECT * FROM (
SELECT *
,lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id) last_val
,lag(status, 1, -1) OVER (PARTITION BY val ORDER BY id) last_status
FROM t1
) x
WHERE (last_val <> val OR last_status <> status)
AND (status = 1 OR last_status <> -1)
ORDER BY id

This filters out more bogus data that the sample doesn't include but essentially it's about taking out subsequent duplicates (regardless of status) and the top OFF records which don't match. Records 3, 4, 5 and 6 are returned, but I don't want the fifth, it's an OFF which came after a new ON. So I need to jump that gap and look for the next proper OFF for the currently active device.

#10 turns OFF -- bogus in this context, but messes up the lag()

#11 turns OFF -- bogus in this context, but messes up the lag()

#11 turns ON -- OK, new sequence, include in SELECT

#10 turns ON -- OK, new sequence, include in SELECT

#11 turns OFF -- message came in late, need to ignore gap

#10 turns OFF -- OK, proper OFF to row 4, need to include in SELECT

Once filtered properly, I would like to use lead() on top of it all to get the next row's id (imagine a timestamp) as well as filter out all records that aren't ON statuses. I imagine this will require three embedded SELECT statements. This would get me a clear understanding of how long a device was active, until the condition of either another ON or a proper turn OFF.

1 Answer
1

Query with window functions

SELECT *
FROM (
SELECT *
,lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id) AS last_val
,lag(status, 1, 0) OVER w2 AS last_status
,lag(next_id) OVER w2 AS next_id_of_last_status
FROM (
SELECT *, lead(id) OVER (PARTITION BY status ORDER BY id) AS next_id
FROM t1
) AS t
WINDOW w2 AS (PARTITION BY val ORDER BY id)
) x
WHERE (last_val <> val OR last_status <> status)
AND (status = 1
OR last_status = 1
AND ((next_id_of_last_status > id) OR next_id_of_last_status IS NULL)
)
ORDER BY id

This is... amazing. Where do I send the check? Seriously though, one extra question. How come the EXPLAIN ANALYZE for the query takes longer to execute (3 times as long) than the actual execution? For reference the table has about 8 million records.
–
denpanosekaiSep 19 '12 at 23:52

@denpanosekai: EXPLAIN ANALYZE has to do everything the query does plus some extra work, so it can be slower. I think this has improved since version 8.4, though. Also: when it finally finished, what were the results? I'd expect the function to be about 3 - 4 times faster than the query .. ?
–
Erwin BrandstetterSep 19 '12 at 23:59

Actually, both perform in 13 seconds, but the function returns 381K records, while the Query returns 327K records. Damn, there must be some extra fringe condition I didn't consider... I'm basically sorting through garbage at this point. I think this has to do with duplicate ON/OFF records on the same timestamp (not actually using an ID field as per my example). But your help was invaluable and I thank you muchly.
–
denpanosekaiSep 20 '12 at 0:18

More to the point, when testing clean data sets (same number of records, no duplicate timestamps), the Query is much faster (able to fetch 500 records out of 8 million in 20ms, vs 35ms by function) From what I can see none of them are using indexes, so I can't say where the difference comes from.
–
denpanosekaiSep 20 '12 at 0:24

@denpanosekai: That's unexpected. In a quick test with a small dataset on PostgreSQL 9.1.5 the function was the winner by far. It always depends on many factors. Cool either way, I love both my children equally. ;)
–
Erwin BrandstetterSep 20 '12 at 1:11