about turning data into insightful knowledge – for business and personal curiosity

Main menu

Post navigation

LOCF and Linear Imputation with PostgreSQL

This tutorial will introduce various tools offered by PostgreSQL, and SQL in general – like custom functions, window functions, aggregate functions, WITH clause (or CTE for Common Table Expression) – for the purpose of implementing a program which imputes numeric observations within a column applying linear interpolation where possible and forward and backward padding where necessary. I’m going to progressively add and explain those constructs, step by step, so no problem if you are new to the scene. I am very much interested in input regarding potential downsides of the implementation and possible improvements.

Imputing Observations

A standard situation for somebody working with data is being faced with missing observations. Depending on the type of observation imputation might be possible. The most simple methods of padding those gaps are to forward or backfill observations up and down an ordered and related sequence of records. This is often referred to as LOCF (last observation carried forward) and FOCB (first observation carried backward). Another often reasonable method is to assume a linear transformation between observations.

Setting Up the Toy Data

(GitHub / SQL Fiddle – at the beginning of every section you will find a link to the code on GitHub and an SQL Fiddle session)

We start with a table ( tbl) of four columns t::float, a::int, b::int and v::float. a and b hold the keys for identifiying related records – this is what we will partition over. t represents the time and v the observations that we want to impute. For example, think of b as a town in country a where temperature v was measured on 12pm on date t.

A Very Simple Custom Function

An LOCF is at the end of the day just aCOALESCE(b,a)of two observations with a having been observed before b. If bisnotNULL we take it, if it is we use the value of the observation before, i.e. a. If we perform those steps iteratively starting with an initial observation (initial condition) of NULL then we will inductively fill all gaps after stumbling upon a first non-NULL observation. The following function implements this mechanism, but first we will apply it in a non-aggregating way. The script will just add another column v_or_t which holds v in case vISNOTNULL and t if vISNULL.

1

2

3

4

5

6

7

8

9

10

11

12

CREATEORREPLACEFUNCTIONfun(aFLOAT,bFLOAT)

RETURNSFLOAT

LANGUAGESQL

AS'

SELECT COALESCE(b, a)

';

selecta,b,t,v,

fun(t,v)asv_or_t

fromtbl

orderbya,b,t

;

Here we CREATE ORREPLACE a FUNCTION named fun, which takes two FLOAT typed arguments a and b and RETURNS FLOAT as well. The function is implemented by means of the LANGUAGE SQL (instead of PL/pgSQL or some other language). An SQL programmed function will return the result of the last evaluated SELECT statement. The code itself has to be delimited by single quotes ( ') or a string like f.x. “$$” and follows after the keyword AS.

Implementing LOCF

Now instead of applying the function (which I renamed to locf) horizontally within a record we tell PostgreSQL to apply it iteratively in a specified ORDER and within specified boundaries (the PARTITION) vertically across two records:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

createorreplacefunctionlocf_s(afloat,bfloat)

returnsfloat

languagesql

as'

select coalesce(b, a)

';

dropaggregateifexistslocf(float);

CREATEAGGREGATElocf(FLOAT)(

SFUNC=locf_s,

STYPE=FLOAT

);

selecta,b,t,v,

locf(v)over(PARTITIONbya,bORDERbyt)asv_locf

fromtbl

orderbya,b,t

;

As you can see the process of creating the LOCF function is now split into two steps. First we define the transformation of two values into one ( locf_s) and then we create the actual aggregating function ( locf) by specifying how to apply the transformation. Here we just define what in this context is referred to as the “state transition function” ( SFUNC) which returns a value of the “state type” ( STYPE) FLOAT.

In the final query we apply locf() over the PARTITION defined by a and b and traverse the records by ordering t ascending.

Filling all the Gaps by Applying First LOCF and then FOCB

Now we go one step further and fill all the missing oberservations by padding a missing oberservation preferredly with the last seen value and if that is not possible we use the first available value. Technically we fill one column v_locf with the LOCF values, another column v_focb with the FOCB values and then choose for v_final the value from v_locf if it is not NULL and otherwise the value from v_focb.

1

2

3

4

5

6

7

8

9

10

11

12

13

selecta,b,t,v,

locf(v)overt_ascasv_locf,

locf(v)overt_descasv_focb,

COALESCE(

locf(v)overt_asc,

locf(v)overt_desc

)asv_final

fromtbl

WINDOW

t_ascas(partitionbya,borderbyt),

t_descas(partitionbya,borderbytdesc)

orderbya,b,t

;

First of all, the WINDOW keyword simply allows us to to name an ordered partition and then use it in the corresponding select query. Second of all, the FOCB mechanism is just LOCF applied to the inversely ordered partition ( t_desc). And finally, it might be worth noting that you can use multiple aggregate functions within another function – COALESCE in this case.

Putting the Pieces Together for Linear Interpolation

This query is composed of two main steps. First we unite all the necessary values spread across multiple rows, so those values are available locally to every row and then we calculate row based the imputed value.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

WITH

tbl0as(

SELECTa,b,t,v,

locf(v)overt_ascasv_locf,

locf(v)overt_descasv_focb,

locf(casewhenvisnot nullthentelsenullend)

overt_ascast_locf,

locf(casewhenvisnot nullthentelsenullend)

overt_descast_focb

fromtbl

windowt_ascas(partitionbya,borderbyt),

t_descas(partitionbya,borderbytdesc)

),

tbl1as(

SELECTa,b,t,v,v_locf,v_focb,t_locf,t_focb,

case

when

t_focb!=t_locf

andv_locfisnot null

andv_focbisnot null

then

v_locf+(

(v_focb-v_locf)

*

(t-t_locf)/

(t_focb-t_locf)

)

elsecoalesce(v_locf,v_focb)

endasv_final

fromtbl0

orderbya,b,t

)

SELECT*fromtbl1orderbya,b,t;

The WITH clause is simply a way to chain temporary tables. The result of the first SELECT is named tbl0 and used within the second SELECT whose result set is named tbl1 and referred to by the final (technically not necessary) SELECT query. The statements formulated using a WITH clause are also referred to as CTEs or Common Table Expressions. They can make complex queries more readable and allow the formulation of recursive queries, for example to traverse a tree. CTEs are controversial with respect to their performance (“optimization fence“).

Now with the different parameters assembled we can interpolate with the following simple formula:

Updating the Table Directly

Now if we subsitute above’s last SELECT query with the following UPDATE statement then the imputed set of values for v are directly written to tbl: