Exceedingly Simple Monotone Regression (with Ties)

Jan de Leeuw

Version 01, April 01, 2017

Abstract

A C implementation of Kruskal’s up-and-down-blocks monotone regression algorithm for use with .C() is extended to include the three classic ways of handling ties. It is then compared with other implementations.

Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory gifi.stat.ucla.edu/jbkTies has a pdf version, the bib file, the complete Rmd file with the code chunks, and the R source code.

1 Introduction

In a recent Rpub (De Leeuw (2017a)) we presented a .C() version of Kruskal’s up-and-down-blocks algorithm for monotone (or isotone) regression. In De Leeuw (2016) there is R code that extends monotone regression by adding the three classical ways of handling ties. See Kruskal (1964) for the primary and secondary approaches, and De Leeuw (1977) for the tertiary approach, as well as for proofs that all three methods are indeed optimal. The R code for tie handling in De Leeuw (2016) uses lists for blocks of ties, and includes a number of loops over blocks. Thus there is some room for improvement by implementing the tie handling using .C().

The appendix of this follow-up paper presents .C() code for monotone regression with the three approaches. We have chosen to make it modular, in the sense that each of the basic operations has a separate routine written in C, with a corresponding R wrapper. All C routines follow .C() conventions, i.e. their arguments are passed by reference and they return void. Internally they do not use any R-based functions or include files, so they can easily be used in other contexts as well. Thus any extra storage needed in the routines is allocated and freed internally, without involving R.
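As an illustration of these conventions, consider the following toy routine (the name runningMean() is hypothetical and not one of the routines in the appendix): it returns void, takes all arguments as pointers, and manages its own scratch storage with malloc()/free() rather than R's allocators.

```c
#include <stdlib.h>

/* Hypothetical example of a .C()-style routine: void return, all
   arguments passed by reference, no R headers. Scratch storage is
   allocated and freed internally, without involving R. */
void runningMean(double *x, int *n, double *result) {
    double *cum = (double *)malloc(*n * sizeof(double));
    double s = 0.0;
    for (int i = 0; i < *n; i++) {
        s += x[i];
        cum[i] = s; /* cumulative sum as scratch */
    }
    for (int i = 0; i < *n; i++)
        result[i] = cum[i] / (i + 1); /* running mean */
    free(cum);
}
```

From R, after loading the compiled shared library, such a routine would be called as .C("runningMean", as.double(x), as.integer(length(x)), double(length(x))).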

2 Small Example

2.1 Sorting

We use the function mySortDouble(), a .C() interface to a C routine of the same name, to sort x and to put y and w in the corresponding order. In R parlance, the routine returns order(x) as well as sort(x) = x[order(x)], y[order(x)], and w[order(x)]. The sorting routine is C code taken from De Leeuw (2017b). It uses the system qsort() routine, which implements quicksort. Quicksort is not a stable sorting routine, in the sense that it does not guarantee that elements within tie-blocks keep their original relative order. This, however, is not a problem in the monotone regression context, because that form of instability does not change the outcome of the regression routine.

## [1] 1.9 1.9 1.9 2.1 2.1 2.1 3.5 3.5 3.5

## [1] 8 3 5 2 1 9 6 7 4

## [1] 2 1 2 1 1 1 2 2 2

## [1] 7 9 4 1 2 8 3 6 5
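The actual sorting code is in De Leeuw (2017b); purely as a sketch of how qsort() can produce order(x) alongside sort(x), one can sort an index vector on the x values and then permute x accordingly. The function names below are hypothetical, and indices are 0-based, whereas R's order() is 1-based.

```c
#include <stdlib.h>

/* Sketch (not the appendix code): sort x and record the order that
   achieves it, by sorting an index vector keyed on the x values. */

static const double *gx; /* comparison key for qsort() */

static int cmpIndex(const void *a, const void *b) {
    double d = gx[*(const int *)a] - gx[*(const int *)b];
    return (d > 0) - (d < 0);
}

void sortWithOrder(double *x, int *n, int *order) {
    gx = x;
    for (int i = 0; i < *n; i++) order[i] = i;
    qsort(order, *n, sizeof(int), cmpIndex);
    /* permute x into sorted order: x[i] <- x_old[order[i]] */
    double *tmp = (double *)malloc(*n * sizeof(double));
    for (int i = 0; i < *n; i++) tmp[i] = x[order[i]];
    for (int i = 0; i < *n; i++) x[i] = tmp[i];
    free(tmp);
}
```

The same order vector can then be used to permute y and w. Note that, as discussed above, qsort() may order the indices within a tie-block either way.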

The function tieBlock(), again a .C() interface to a C routine of the same name, uses the sorted x to return the tie-blocks (a vector of the same length as x, using integers to indicate block membership).

## [1] 1 1 1 2 2 2 3 3 3
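The computation behind tieBlock() is straightforward once x is sorted; a minimal sketch (hypothetical name, assuming n ≥ 1 and 1-based block labels as in the output above) is:

```c
/* Sketch of tie-block labeling on a sorted x: equal consecutive
   values get the same integer label, starting from 1. Illustration
   only, not the tieBlock() code from the appendix. */
void tieLabels(double *x, int *n, int *block) {
    int b = 1;
    block[0] = b;
    for (int i = 1; i < *n; i++) {
        if (x[i] != x[i - 1]) b++; /* new value, new block */
        block[i] = b;
    }
}
```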

We emphasize that in a non-metric multidimensional scaling context (or, more generally, an alternating least squares context) the sorting has to be done only once, not in every iteration. Thus in repeated calls of the algorithm this first part can be skipped.

2.2 Primary Approach

The primary approach to ties starts with sorting y within blocks. This is done with sortBlocks(), again a .C() interface. Along with y we sort w and the order vector.

2.2.1 Sort within Blocks

## [1] 3 5 8 1 2 9 4 6 7

## [1] 1 2 2 1 1 1 2 2 2

## [1] 9 4 7 2 1 8 5 3 6
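A sketch of the within-block sort (hypothetical name, not the sortBlocks() code from the appendix): walk over the tie-blocks, and qsort() each segment of y on its values, carrying w along. The order vector can be carried along in exactly the same way.

```c
#include <stdlib.h>

/* Sketch: sort y within each tie-block, permuting w alongside. */

typedef struct { double y, w; } pair;

static int cmpPair(const void *a, const void *b) {
    double d = ((const pair *)a)->y - ((const pair *)b)->y;
    return (d > 0) - (d < 0);
}

void sortWithinBlocks(double *y, double *w, int *block, int *n) {
    int i = 0;
    while (i < *n) {
        int j = i;
        while (j < *n && block[j] == block[i]) j++; /* block is [i, j) */
        pair *p = (pair *)malloc((j - i) * sizeof(pair));
        for (int k = i; k < j; k++) { p[k - i].y = y[k]; p[k - i].w = w[k]; }
        qsort(p, j - i, sizeof(pair), cmpPair);
        for (int k = i; k < j; k++) { y[k] = p[k - i].y; w[k] = p[k - i].w; }
        free(p);
        i = j;
    }
}
```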

The newly sorted y and w are entered into jbkPava(), the monotone regression routine, again a .C() interface to C code.

2.2.2 Monotone Regression

## [1] 3.000 4.833 4.833 4.833 4.833 5.667 5.667 6.000 7.000
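For reference, the core computation can be sketched as a generic weighted pool-adjacent-violators pass (an illustration, not the actual jbkPava() code from the appendix): adjacent blocks that violate monotonicity are merged and replaced by their weighted average, until the fitted values are nondecreasing.

```c
#include <stdlib.h>

/* Generic weighted pool-adjacent-violators sketch: overwrite y with
   the monotone (nondecreasing) weighted least squares fit. */
void pavaSketch(double *y, double *w, int *n) {
    int m = *n;
    double *val = (double *)malloc(m * sizeof(double)); /* block values  */
    double *wt  = (double *)malloc(m * sizeof(double)); /* block weights */
    int *len    = (int *)malloc(m * sizeof(int));       /* block sizes   */
    int nb = 0;
    for (int i = 0; i < m; i++) {
        val[nb] = y[i]; wt[nb] = w[i]; len[nb] = 1; nb++;
        /* merge down while the last two blocks violate monotonicity */
        while (nb > 1 && val[nb - 2] > val[nb - 1]) {
            double tw = wt[nb - 2] + wt[nb - 1];
            val[nb - 2] = (wt[nb - 2] * val[nb - 2] +
                           wt[nb - 1] * val[nb - 1]) / tw;
            wt[nb - 2] = tw;
            len[nb - 2] += len[nb - 1];
            nb--;
        }
    }
    /* expand block values back to a vector of length n */
    for (int b = 0, i = 0; b < nb; b++)
        for (int k = 0; k < len[b]; k++) y[i++] = val[b];
    free(val); free(wt); free(len);
}
```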

2.2.3 Restore Order

Finally mySortInteger(), another .C() interface, is used to find the inverse permutation that puts the monotone regression output back in the original order.

## [1] 5 4 8 2 7 9 3 6 1

## [1] 4.833 4.833 6.000 4.833 5.667 7.000 4.833 5.667 3.000
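Computing the inverse permutation is a one-liner in C; a sketch (hypothetical name, 0-based indices) in the spirit of this step:

```c
/* Sketch: inverse of a permutation. If sorting put original element
   order[i] at position i, then inverse[j] is the sorted position of
   original element j, so fittedOriginal[j] = fittedSorted[inverse[j]]. */
void inversePerm(int *order, int *n, int *inverse) {
    for (int i = 0; i < *n; i++) inverse[order[i]] = i;
}
```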

2.2.4 Residual Sum of Squares

Equal to 46.6666666667.

2.3 Secondary Approach

2.3.1 Make Blocks

In the secondary approach we use makeBlocks(), another .C() interface, to create block averages and block weights.

## [1] 5.800 4.000 5.667

## [1] 5.000 3.000 6.000
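A sketch of the block-averaging step (hypothetical name, not the makeBlocks() code from the appendix; block labels are assumed to be consecutive integers starting at 1, as produced above): accumulate weighted sums and total weights per block, then divide.

```c
/* Sketch: weighted block means and block weights from y, w, and
   tie-block labels. */
void blockMeans(double *y, double *w, int *block, int *n,
                double *bval, double *bwgt, int *nblocks) {
    int nb = block[*n - 1]; /* labels run 1..nb on sorted data */
    for (int b = 0; b < nb; b++) { bval[b] = 0.0; bwgt[b] = 0.0; }
    for (int i = 0; i < *n; i++) {
        int b = block[i] - 1;
        bval[b] += w[i] * y[i];
        bwgt[b] += w[i];
    }
    for (int b = 0; b < nb; b++) bval[b] /= bwgt[b];
    *nblocks = nb;
}
```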

2.3.2 Monotone Regression

Then we perform monotone regression on the block values using block weights.

## [1] 5.125 5.125 5.667

2.3.3 Expand and Restore Order

Finally we use mySortInteger(), another .C() sorting routine, to expand the monotone regression values, in the correct order, to a vector of length(x).

## [1] 5.125 5.125 5.667 5.125 5.667 5.667 5.125 5.125 5.125

2.3.4 Residual Sum of Squares

Equal to 59.2604166667.

2.4 Tertiary Approach

2.4.1 Adjust Means

The tertiary approach forms blocks and does a monotone regression as in the secondary approach. It then adjusts the block means optimally.

## [1] 7.325 2.325 4.325 3.125 2.125 10.125 6.000 7.000 4.000
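Under the tertiary approach of De Leeuw (1977), only the block means are required to be monotone. A sketch of the adjustment (hypothetical name, weighted block means assumed; not the appendix code): each observation receives the monotone fit of its block plus its own deviation from the block mean, so within-block differences are preserved.

```c
#include <stdlib.h>

/* Sketch of the tertiary adjustment: result[i] is the fitted block
   value bfit for observation i's block, plus (y[i] - block mean). */
void tertiaryAdjust(double *y, double *w, int *block, int *n,
                    double *bfit, double *result) {
    int nb = block[*n - 1];
    double *bsum = (double *)calloc(nb, sizeof(double));
    double *bwgt = (double *)calloc(nb, sizeof(double));
    for (int i = 0; i < *n; i++) {
        int b = block[i] - 1;
        bsum[b] += w[i] * y[i];
        bwgt[b] += w[i];
    }
    for (int i = 0; i < *n; i++) {
        int b = block[i] - 1;
        result[i] = bfit[b] + (y[i] - bsum[b] / bwgt[b]);
    }
    free(bsum); free(bwgt);
}
```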

2.4.2 Restore Order

## [1] 3.125 2.125 6.000 4.325 4.000 7.000 7.325 10.125 2.325

2.4.3 Residual Sum of Squares

Equal to 5.16375.

3 Timings

We compare two different implementations of monotone regression with ties. The first is monregR(), which does the data handling and tie handling in R, using lists. It is basically the code from De Leeuw (2016), with two changes. First, the Fortran code for monotone regression from Cran (1980) is replaced by the C code from De Leeuw (2017a). Second, the R code in De Leeuw (2016) does not work correctly in the case of unequal weights, and we have corrected that in the current version of monregR(). The second function is monregRC(), which does data handling, tie handling, and monotone regression in C, using the functions we have illustrated earlier in our small example.

In our comparison we use

x <- sample (1:blks, n, replace = TRUE)
y <- rnorm (n)

where n = 10,000. Since we sample with equal probabilities from 1:blks, the vector x will have (at most) blks different values, so if blks is small there will be many ties, and if blks is large there will be only a few. We let blks take the values c(2, 10, 100, 1000, 10000). Combined with the three options for tie handling this gives 15 different analyses, and each one is repeated 100 times. We report the elapsed time from system.time().

4 Conclusion

Running time for monregR() depends heavily on the number of blocks, because the R code has loops over blocks. For a small number of blocks monregR() is even faster than monregRC(): if n = 10,000 it is up to five times as fast for an expected block size of 5,000 (two blocks). But for more blocks monregR() rapidly loses its advantage. For n = 500 we find that monregRC() is almost always faster, and usually much faster.