Fitting censored log-normal data

Data are censored when samples from one or both ends of a continuous distribution are cut off and assigned one value. Environmental chemical exposure datasets are often left-censored: instruments can detect levels of chemicals down to a limit, underneath which you can’t say for sure how much of the chemical was present. The chemical may not have been present at all, or was present at a low enough level that the instrument couldn’t pick it up.

There are a number of methods of dealing with censored data in statistics. The crudest approach is to substitute a value for every censored value. This may be zero, the detection limit, the detection limit over the square root of two, or some other value. This method, however, biases the statistical results depending on what you choose to substitute.

A more nuanced alternative is to assume the data follows a particular distribution. In the case of environmental data, it often makes sense to assume a log-normal distribution. When we do this, while we may not know what the precise levels of every non-detect, we can at least assume that they follow a log-normal distribution.

In this article, we will walk through fitting a log-normal distribution to a censored dataset using the method of maximum likelihood.

Fitting a censored log-normal distribution

First, we need to generate a contrived dataset. Here we take 10000 samples from a log-normal distribution with a log-mean of 5 and log-standard deviation of 1; call those values \( y \). Any value below \( L = 50 \) will be censored. Let \( y^{\ast} \) be the censored dataset in which every censored point is fixed to the censoring limit. Formally, \( y^{\ast} \) is defined as:

The spike on the left represents the censored values. The detection frequency, or the proportion of detected values, is 0.8566.

First, let’s do a better job of defining the problem we’re trying to solve. We set out to fit a log-normal distribution to some censored data. Our fundamental problem here is that we want to find the right parameters (the mean and standard deviation) for a log-normal distribution that “fits”” the observed data. But how do you tell what the “right” parameters are, or what the best “fit” is? The maximum likelihood approach is to say “the best distribution is the one that has the highest likelihood of generating the data we’ve observed.” To actually find those parameters, you use an optimization algorithm that tries a bunch of different parameters and hones in on the one that defines a distribution with the highest likelihood of generating the observed data.1

Let’s develop some intuition about how we are going to handle this censored data. With censored data, we don’t know the exact value of each datapoint that is below a censoring limit. The true value could be anywhere below the limit. To deal with this, we assume that the entire dataset is log-normally distributed, so even if we don’t know the exact value of the censored data, we know something about how those points are distributed.

If the point is above the censoring limit, then its likelihood is defined by the log-normal PDF. If it below the censoring limit, we don’t know where the real value was exactly; all we know is that it’s less than \( L \). As such, its likelihood is the cumulative probability of a value being below \( L \) in a log-normal distribution with mean \( \mu \) and standard deviation \( \sigma \).

The total likelihood is found by multiplying all the individual likelihoods.2 Formally:

It’s often more convenient, especially for computers, to work with the log-likelihood function. It’s easy to get it, just take the log of both sides:

The R translation uses the built in dlnorm and plnorm functions for the log-normal PDF and CDF, choosing which one to use based on whether the data point is censored. It also checks the parameters to make sure they make sense: a negative standard deviation is invalid, so the function rejects that parameter set.

We use the maxLik function from the maxLik3 package to perform the estimation. We need to supply starting values for the parameters, so let’s have it start with a mean of 0 and standard deviation of 1.

The core idea with this overall approach is that we assume something about the distribution of the censored data points so that we can get more nuanced parameter estimates than if we replaced them with a constant, like zero or the detection limit.

Future posts will go into regression techniques with censored datasets.

In “generative” statistical modelling you assume that the data you have was generated by some definable process. The task of the modeller is to write down how they think the data was generated, this is your “model” of how the world works. Everything that follows (parameter fitting, inference) is based off of this core assumption. In our case, we assume the data was generated from a log-normal distribution; that’s our “model” of how the data came to be. As it happens, this is exactly where the data came from! ↩

This comes from the assumption that each measurement is independent. Recall that the probability of two independent events occurring is the product of the individual probabilities: \( P(A \, \mathrm{and} \, B) = P(A) \cdot P(B) \) ↩