Abstract

We have initially developed a time-independent forecast for southern California by smoothing the locations of magnitude 2 and larger earthquakes. We show that using small m ≥2 earthquakes gives a reasonably good prediction of m ≥5 earthquakes. Our forecast outperforms other time-independent models (Kagan and Jackson, 1994; Frankel et al., 1997), mostly because it has higher spatial resolution. We have then developed a method to estimate daily earthquake probabilities in southern California by using the Epidemic Type Earthquake Sequence model (Kagan and Knopoff, 1987; Ogata, 1988; Kagan and Jackson, 2000). The forecasted seismicity rate is the sum of a constant background seismicity, proportional to our time-independent model, and of the aftershocks of all past earthquakes. Each earthquake triggers aftershocks with a rate that increases exponentially with its magnitude and decreases with time following Omori's law. We use an isotropic kernel to model the spatial distribution of aftershocks for small (m ≤5.5) mainshocks. For larger events, we smooth the density of early aftershocks to model the density of future aftershocks. The model also assumes that all earthquake magnitudes follow the Gutenberg-Richter law with a uniform b-value. We use a maximum likelihood method to estimate the model parameters and test the short-term and time-independent forecasts. A retrospective test using a daily update of the forecasts between 1 January 1985 and 10 March 2004 shows that the short-term model increases the average probability of an earthquake occurrence by a factor of 11.5 compared with the time-independent forecast.

Introduction

Several studies show that many earthquakes are triggered in part by preceding events. Aftershocks are the most obvious examples, but many large earthquakes are preceded by and probably triggered by smaller ones. Recent studies indeed suggest that we can explain the triggering of a large earthquake by a previous smaller one using the same laws as for the triggering of small earthquakes (aftershocks) by a large one (mainshock) (Helmstetter and Sornette, 2003c; Helmstetter et al., 2005). The physics of earthquake triggering, which we use in our forecasts, is probably sufficient to explain the acceleration of seismicity sometimes observed before large earthquakes (e.g., Bufe and Varnes, 1993).

Typically, the seismicity rate just after and close to a large m ≥7 earthquake can increase by a factor of 10^4 and stay above the background level for several decades. Small earthquakes also contribute significantly to earthquake triggering because they are much more numerous than larger ones (Helmstetter, 2003; Helmstetter et al., 2005). As a consequence, many large earthquakes are triggered by previous smaller earthquakes (foreshocks). Also, many events are apparently triggered through a cascade process in which triggered quakes trigger others in turn.

We have devised and implemented another method for issuing daily earthquake forecasts for southern California. Short-term effects may be viewed as temporary perturbations to a long-term earthquake potential. This long-term forecast could be a Poisson process or a time-dependent process including, for example, stress shadows. It can include any geologic information based on fault geometry and slip rate, as well as data from geodesy or paleoseismicity. As a first step, we have measured the time-independent seismic activity using instrumental seismicity (1932–2003) only. We show that this simple model performs better than a more sophisticated model that incorporates geology data and characteristic earthquakes (Frankel et al., 1997).

In contrast to the Gerstenberger et al. (2005) prediction scheme, we use a particular stochastic point process (Daley and Vere-Jones, 2004), the epidemic type earthquake sequence (ETES) model (Kagan and Knopoff, 1987; Ogata, 1988), to obtain short-term earthquake forecasts including foreshocks and aftershocks. This model is usually called “epidemic type aftershock sequence” (ETAS), but in addition to aftershocks this model also describes background seismicity, mainshocks, and foreshocks, using the same laws for all earthquakes. We use a maximum likelihood approach to estimate the parameters by maximizing the forecasting skills of the model (Kagan and Knopoff, 1987; Gerstenberger et al., 2005; D. Schorlemmer et al., unpublished manuscript, 2005).

Time-Independent Forecasts

Definition of the Model

We have developed a method to estimate the probability of an earthquake as a function of space and magnitude from early instrumental seismicity. We estimate the density of seismicity μ(r) by declustering and smoothing past seismicity. We use the composite seismicity catalog from the Advanced National Seismic System (ANSS), available at http://quake.geo.berkeley.edu/anss/catalog-search.html. We selected earthquakes above md = 3 for the period 1932–1979 and above md = 2 since 1980. We computed the background density of earthquakes on a grid that covers southern California with a resolution of 0.05° × 0.05°. The boundary of the grid is 32.45° N to 36.65° N in latitude and 121.55° W to 114.45° W in longitude. Before computing the density of seismicity, we need to decluster the catalog to remove the largest clusters of seismicity, which would give large peaks of seismicity that do not represent the time-independent average. We used Reasenberg's (1985) declustering algorithm with parameters rfact = 20, xmeff = 2.00, p1 = 0.99, τmin = 1.0 day, τmax = 10.0 days, and with a minimum cluster size of five events. The parameters of the declustering procedure were adjusted so that the resulting catalog is close to a Poisson process. In particular, we checked that there is no residual change in seismicity rate after large earthquakes in the declustered catalog.

The declustered catalog is shown in Figure 1. Note that a better method of declustering exists, which does not need to specify a space-time window to define aftershocks and uses an ETES-type model to estimate the probability that each earthquake is an aftershock (Kagan and Knopoff, 1976; Kagan, 1999; Zhuang et al., 2004). This method is more complex to use, however, and time-consuming for a large number of earthquakes. The declustering procedure is done only for the data used to build the time-independent model μ(r), also used as the input of the short-term model. The catalog used to test both models is not declustered.

Figure 1. Declustered catalog, obtained with Reasenberg's (1985) algorithm, including 6,861 m ≥3 earthquakes in the time window 1932–1979 and 46,937 earthquakes with m ≥2 for 1980–2003.

We estimate the density of seismicity in each cell by smoothing the location of each earthquake i with an isotropic adaptive kernel Kdi(r) (Izenman, 1991). The bandwidth di associated with earthquake i decreases if the density of seismicity at the location i of this earthquake increases, so that we have a better resolution (smaller di) where the density is higher.

To estimate di for each earthquake, we need an initial estimation of the density μ*(ri) at the location ri of this earthquake. We estimate μ*(ri) using a smoothing kernel with a fixed bandwidth d = dmin = 0.5 km (current location accuracy), and summing over all earthquakes:

$$\mu^*(\mathbf{r}_i) = \sum_{j=1}^{N} K_d(|\mathbf{r}_i - \mathbf{r}_j|). \tag{1}$$

We use an isotropic kernel Kd(r) given by

$$K_d(r) = \frac{C(d)}{(r^2 + d^2)^{1.5}}, \tag{2}$$

where C(d) is a normalizing factor, so that the integral of Kd(r) over an infinite area equals 1.

The bandwidth di associated with each earthquake is then given by

$$d_i = \frac{d_0}{\sqrt{\mu^*(\mathbf{r}_i)}} \quad \text{and} \quad d_i \ge d_{\min}, \tag{3}$$

so that di is proportional to the average distance between events around this earthquake.

The background density at any point r (given as a number of m ≥ md events per year and per km²) is then estimated by

$$\mu(\mathbf{r}) = \frac{1}{T} \sum_{i=1}^{N} K_{d_i}(|\mathbf{r} - \mathbf{r}_i|), \tag{4}$$

where T is the duration (in years) of the catalog.
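The two-pass smoothing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the power-law kernel form C(d)/(r² + d²)^1.5 with C(d) = d/(2π), a bandwidth rule d_i = max(dmin, d0/√μ*), and flat Cartesian coordinates in kilometres. The function names, the toy catalog, and the value d0 = 1.0 are hypothetical and chosen only so that the example is not floored at dmin.

```python
import math

def kernel(r, d):
    """Power-law kernel C(d) / (r^2 + d^2)^1.5 with C(d) = d / (2*pi).

    This choice of C(d) makes the kernel integrate to 1 over the infinite plane.
    """
    return d / (2.0 * math.pi) / (r * r + d * d) ** 1.5

def adaptive_bandwidths(points, d0, d_min=0.5):
    """Return one bandwidth per event: d_i = max(d_min, d0 / sqrt(pilot density))."""
    bandwidths = []
    for xi, yi in points:
        # Pilot density at event i, using the fixed bandwidth d_min.
        pilot = sum(kernel(math.hypot(xi - xj, yi - yj), d_min) for xj, yj in points)
        bandwidths.append(max(d_min, d0 / math.sqrt(pilot)))
    return bandwidths

def background_density(x, y, points, bandwidths, years):
    """Smoothed rate at (x, y), in events per year per km^2."""
    total = sum(kernel(math.hypot(x - xj, y - yj), dj)
                for (xj, yj), dj in zip(points, bandwidths))
    return total / years

# Hypothetical catalog: a tight cluster near the origin plus one isolated event (km).
events = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (50.0, 0.0)]
widths = adaptive_bandwidths(events, d0=1.0)
rate_near_cluster = background_density(0.05, 0.05, events, widths, years=10.0)
rate_far_away = background_density(25.0, 25.0, events, widths, years=10.0)
```

In this toy catalog the isolated event receives a wider kernel than the clustered events, which is the intended behavior: resolution is finer where the seismicity is denser.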

Our forecasts are given as an average number of events in each cell. The background density defined by (4) has spatial variations at scales smaller than the grid resolution (≈5 km). Therefore, we need to integrate the value of μ(r) defined by (4) over each cell to obtain the background rate in this cell. The advantage of the function (2) is that we can compute analytically the integral of Kd(x, y) over one dimension x or y and then compute numerically the integral in the other dimension.

We estimate the parameter d0 in (3) by optimizing the likelihood of the model. We use the data from 1932 to 1995 to compute the density μ(r) on each cell and the data from 1996 until 2003 to evaluate the model.

The log likelihood of the model is given by the sum over all cells:

$$LL = \sum_{i_x} \sum_{i_y} \log p[\mu(i_x, i_y), n], \tag{5}$$

where n is the number of events that occurred in the cell (ix, iy).

Assuming a Poisson process, the probability p(μ(ix, iy), n) of having n events in the cell (ix, iy) is given by

$$p[\mu(i_x, i_y), n] = \frac{[\mu(i_x, i_y)]^{n} \exp[-\mu(i_x, i_y)]}{n!}, \tag{6}$$

where μ(ix, iy) represents the seismicity density integrated over the cell ix, iy per time unit. The optimization gives d0 = 0.0045. The background density for the declustered catalog is shown in Figure 2.

Figure 2. Time-independent density obtained by declustering and smoothing the ANSS catalog.

Note that using two different data sets to compute μ(r) and LL is important; otherwise, the optimization of LL gives a smoothing distance di = dmin for all earthquakes (d0 = 0), that is, all the weight is at the location of observed earthquakes (Kagan and Jackson, 1994).

Note also that it is not necessary to introduce bins to test the models. We could have computed the LL of the continuous model μ(r), given by the sum over all events:

$$LL = \sum_{i=1}^{N} \log \mu(\mathbf{r}_i) - E(N), \tag{7}$$

where E(N) is the expected number of target earthquakes. The binning was introduced to follow the rules adopted by RELM for the real-time tests of earthquake forecasts in southern California. In these tests, each author must submit a table of the number of events in each cell, not a program. The earthquake numbers cannot be easily tested for a continuum model. The binning is also introduced to avoid problems due to location errors.

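Our time-independent model differs from that of Kagan and Jackson (1994) (hereafter KJ94) in several respects. KJ94 use larger m ≥5.5 earthquakes since 1850, without declustering the catalog.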

KJ94 introduce a weight proportional to the logarithm of the moment of each earthquake, whereas we use the same weight for all earthquakes.

KJ94 use an anisotropic power-law kernel (maximum density in the direction of the earthquake rupture), with a slower decay with distance K(r) ≈ 1/(r + Rmin). Because the integral of K(r) over space does not converge, they truncate the kernel function after a distance Rmax = 200 km. Their kernel has a larger bandwidth Rmin = 5 km than the present model (smoother density).

KJ94 add a constant uniform value to take into account “surprises”: earthquakes that occur where no past neighboring earthquake occurred.

We modified our model to compare it with the KJ94 model, using the same grid (from 31.95° to 37.05° in latitude and from −122.05° to −113.95° in longitude) with the same resolution of 0.1°. Both models were developed by using only data before 1990 to estimate the parameters and the density μ. We estimated the parameter d0 (defined in equation 3) of our time-independent model by using the data until 1 January 1986 to compute μ and the data from 1 January 1986 to 1 January 1990 to estimate the likelihood LL of the model. The optimization gives d0 = 0.004. We then use this value of d0 and the data until 1 January 1990 to estimate the average density μ(r).

We use the log likelihood defined in (5) to compare the KJ94 model with our model. Because we want to test only the spatial distribution of earthquakes, not the predicted total number, we normalized both models by the observed number of earthquakes (N = 56). We obtain LL = −433 for the KJ94 model and LL = −389 for our model. Both models are shown in Figure 3 together with the observed m ≥5 earthquakes since 1990. The present work thus improves the prediction of KJ94 by a factor (ratio of probabilities) exp((433 − 389)/56) = 2.2 per earthquake, despite being much simpler (isotropic and point-source model). This result suggests that including small earthquakes (m ≥2) to predict larger ones (m ≥5) considerably improves the predictions, because large earthquakes, in general, occur at the same location as smaller ones (Kafka and Levin, 2000).

Figure 3. Time-independent seismicity density for the Kagan and Jackson (1994) model (left) and for the present work (right). White circles represent m ≥5 earthquakes that occurred between 1990 and 2004.

For comparison, a purely uniform model, with an expected number of 4.0 m ≥5 events per year, has a likelihood of −461. The prediction gain relative to this uniform model is 3.6 for our model and 1.6 for KJ94.
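The probability gains quoted above follow from the log-likelihood differences divided by the number of target events. A short check of that arithmetic, using only the values stated in the text (the function name is hypothetical):

```python
import math

def gain_per_event(ll_model, ll_reference, n_events):
    """Probability gain per earthquake: exp((LL_model - LL_reference) / N)."""
    return math.exp((ll_model - ll_reference) / n_events)

N_TARGETS = 56                                      # observed m >= 5 events
LL_OURS, LL_KJ94, LL_UNIFORM = -389, -433, -461     # values reported in the text

ours_vs_kj94 = gain_per_event(LL_OURS, LL_KJ94, N_TARGETS)        # reported as 2.2
ours_vs_uniform = gain_per_event(LL_OURS, LL_UNIFORM, N_TARGETS)  # reported as 3.6
kj94_vs_uniform = gain_per_event(LL_KJ94, LL_UNIFORM, N_TARGETS)  # reported as 1.6
```

Because the gain is a ratio of geometric-mean probabilities, the three values are consistent with one another: the gain of our model over the uniform model is the product of the other two.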

The Frankel et al. (1997) (F97) model is a more complex model that includes both a smoothed historical and instrumental seismicity (using m ≥4 earthquakes since 1933 and m ≥6 earthquakes since 1850) and characteristic earthquakes on known faults, with a seismicity rate constrained by the geologic slip rate and a rupture length controlled by the fault length. The magnitude distribution follows the Gutenberg-Richter (GR) law with b = 0.9 for small magnitudes (m ≤6.2) and a bump for m >6.2 due to characteristic events. We adjusted our model to use only data before 1996 to build the model and the same grid as F97 with a resolution of 0.1°. We assumed a GR distribution with b = 1 and with an upper magnitude cutoff at m 8 (Bird and Kagan, 2004). We used the ANSS catalog for the period 1932–1995 to estimate the average rate of m ≥4 earthquakes (without declustering the catalog). We then estimate the average rate of m ≥5 earthquakes from the number of m ≥4 events by using the GR law. We use m ≥5 earthquakes in the ANSS catalog that occurred since 1996 to compare the models. Both models are illustrated in Figure 4.
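The extrapolation from m ≥4 rates to m ≥5 rates can be illustrated with a truncated GR law. The sketch below assumes a sharp truncation of the cumulative distribution at the upper cutoff, which is only one possible way to impose the cutoff and may differ from the form actually used. The function names and the 0.1-unit binning helper are hypothetical.

```python
def gr_fraction_above(m, m_min, b=1.0, m_max=8.0):
    """Fraction of events with magnitude >= m among events >= m_min (truncated GR)."""
    if m >= m_max:
        return 0.0
    m = max(m, m_min)
    top = 10.0 ** (-b * m) - 10.0 ** (-b * m_max)
    bottom = 10.0 ** (-b * m_min) - 10.0 ** (-b * m_max)
    return top / bottom

def gr_bin_probabilities(m_min=5.0, m_max=8.0, width=0.1, b=1.0):
    """Probability of each magnitude bin of the given width between m_min and m_max."""
    n_bins = round((m_max - m_min) / width)
    edges = [m_min + i * width for i in range(n_bins + 1)]
    return [gr_fraction_above(lo, m_min, b, m_max) - gr_fraction_above(hi, m_min, b, m_max)
            for lo, hi in zip(edges, edges[1:])]

# With b = 1, roughly one m >= 4 event in ten is expected to reach m >= 5.
share_m5 = gr_fraction_above(5.0, m_min=4.0)
bins = gr_bin_probabilities()
```

Multiplying the observed rate of m ≥4 events by `share_m5` gives the extrapolated rate of m ≥5 events under these assumptions.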

We test how each model explains the number of observed events, as well as their location and magnitude, by comparing the likelihood of each model. The log likelihood is defined by

$$LL = \sum_{i_x} \sum_{i_y} \sum_{i_m} \log p[\mu(i_x, i_y)\, P(i_m)\, T, n], \tag{8}$$

where n is the number of events that occurred in the cell (ix, iy) and in the magnitude bin im. The magnitude range (5.0–8.0) is divided in bins of 0.1 unit. The expected number in this bin is μ(ix, iy) P(im)T.

The log likelihood is LL = −155 for our model and LL = −161 for the F97 model. For comparison, a time-independent model (with a uniform density, a GR magnitude distribution with b = 1, and the same expected number of events as our model) gives LL = −168. Our model has a probability gain of 1.5 compared with F97 and a gain of 2.4 compared with the uniform model. Our model thus better predicts the observed earthquake occurrence since 1996 than the F97 model. F97, however, better predicts the observed number than our model, because the number of m ≥5 earthquakes in the period 1996–2004 was smaller than the average rate between 1932 and 1995 (predicted number N = 14.6 for F97 and N = 26.6 for our model, compared with the observed number N = 15). The difference in likelihood between the two models is mainly due to the choice of the kernel and of the minimum magnitude used to estimate the seismicity rate. F97 use a smoother kernel, with a fixed characteristic smoothing distance of 10 km and with an approximately 1/r decay, and only m ≥4 earthquakes.

Time-Dependent Forecasts

Definition of the ETES Model

The ETES model is based on two empirical laws of seismicity, which can also be reproduced by a multitude of physical mechanisms: the GR law to model the magnitude distribution and Omori's law to characterize the decay of triggered seismicity with the time since the mainshock (Kagan and Knopoff, 1987; Ogata, 1988; Kagan, 1991; Kagan and Jackson, 2000; Helmstetter and Sornette, 2003a; Rhoades and Evison, 2004). This model assumes that all earthquakes may be simultaneously mainshocks, aftershocks, and possibly foreshocks. Each earthquake triggers direct aftershocks with a rate that increases exponentially, as ∼10^(αm), with the earthquake magnitude m and that decays with time according to Omori's law. We also assume that all earthquakes have the same magnitude distribution, which is independent of the past seismicity. Each earthquake thus has a finite probability of triggering a larger earthquake. An observed “aftershock” sequence in the ETES model is the sum of a cascade of events in which each event can trigger more events.

The global seismicity rate λ(t, r, m) is the sum of a background rate μb(r), usually taken as a spatially nonhomogeneous Poisson process, and the sum of dependent events of all past earthquakes:

$$\lambda(t, \mathbf{r}, m) = P_m(m) \left[ \mu_b(\mathbf{r}) + \sum_{t_i < t} \phi_{m_i}(\mathbf{r} - \mathbf{r}_i, t - t_i) \right], \tag{9}$$

where Pm(m) is a time-independent magnitude distribution (see equation 13). The function φm(r, t) gives the spatiotemporal distribution of triggered events at point r and at time t after an earthquake of magnitude m:

$$\phi_m(\mathbf{r}, t) = \rho(m)\, \psi(t)\, f(\mathbf{r}, m), \tag{10}$$

where ρ(m) is the average number of earthquakes triggered directly by an earthquake of magnitude m ≥ md,

$$\rho(m) = k\, 10^{\alpha (m - m_d)}, \tag{11}$$

the function ψ(t) is Omori's law normalized to 1,

$$\psi(t) = \frac{(p - 1)\, c^{p-1}}{(t + c)^{p}}, \tag{12}$$

and f(r, m) is the normalized aftershock density at a distance r relative to the mainshock of magnitude m. We have tested different choices for f(r, m), which are described in the Spatial Distribution of Aftershocks section. We fix the c-value in Omori's law (12) at 0.0035 day (5 min). This parameter is not important as long as it is much smaller than the time window T = 1 day of the forecast.
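The time and magnitude dependence of direct triggering can be made concrete with a small sketch. It assumes a productivity of the form ρ(m) = k·10^(α(m − md)) and a normalized Omori law ψ(t) = (p − 1)c^(p−1)/(t + c)^p with p > 1. The function names and the parameter values k = 0.01, α = 1.0, and p = 1.2 are hypothetical and are not the fitted values of the model.

```python
def productivity(m, k, alpha, m_d=2.0):
    """Average number of direct aftershocks of a magnitude-m event (assumed form)."""
    return k * 10.0 ** (alpha * (m - m_d))

def omori_pdf(t, p, c=0.0035):
    """Omori law normalized to integrate to 1 over t >= 0 (requires p > 1); t in days."""
    return (p - 1.0) * c ** (p - 1.0) / (t + c) ** p

def omori_fraction(t1, t2, p, c=0.0035):
    """Closed-form integral of omori_pdf between t1 and t2."""
    return c ** (p - 1.0) * ((t1 + c) ** (1.0 - p) - (t2 + c) ** (1.0 - p))

def direct_aftershocks(m, t1, t2, k, alpha, p, c=0.0035, m_d=2.0):
    """Expected number of direct aftershocks between t1 and t2 days after the event."""
    return productivity(m, k, alpha, m_d) * omori_fraction(t1, t2, p, c)

# Direct aftershocks of an m 6 event expected during its second day (toy parameters).
second_day = direct_aftershocks(m=6.0, t1=1.0, t2=2.0, k=0.01, alpha=1.0, p=1.2)
```

Only direct triggering is counted here. The full rate also includes the background term and the cascade of secondary events, which this sketch does not attempt to model.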

The exponent α has been found equal or close to 1.0 for the southern California seismicity (Felzer et al., 2004; Helmstetter et al., 2005), equal to the GR b-value, showing that small earthquakes are collectively as important as larger ones for seismicity triggering. Note that in the sum in (9) we consider only earthquakes above the detection magnitude md. Smaller undetected earthquakes may also have an important contribution to the rate of triggered seismicity. These undetected earthquakes may thus bias the parameters of the model, that is, the parameters estimated by optimizing the likelihood of the models are “effective parameters,” which more or less account for the influence of undetected small earthquakes.

The ETES model assumes that each primary aftershock may trigger its own aftershocks (secondary events). Secondary aftershocks may themselves trigger tertiary aftershocks and so on, creating a cascade process. The exponent p, which describes the time distribution of direct aftershocks, is larger than the observed Omori exponent, which characterizes the whole cascade of direct and secondary aftershocks (Helmstetter and Sornette, 2002).

As a first step, we use a simple GR magnitude distribution

$$P_m(m) \sim 10^{-b (m - m_d)}, \tag{13}$$

with a uniform b-value equal to 1.0, an upper cutoff at mmax = 8 (Bird and Kagan, 2004), and a minimum magnitude md = 2. If we want to predict relatively small m ≤ 4 earthquakes, we must take into account the fact that small earthquakes are missing in the catalog after a large mainshock. The procedure for correcting for undetected small earthquakes is described in the Threshold Magnitude section.

The background rate μb in (9) is given by

$$\mu_b(\mathbf{r}) = \mu_s\, \mu_0(\mathbf{r}), \tag{14}$$

where μ0(r) is equal to our time-independent model μ(r) (4) normalized to 1, so that μs represents the expected number of background events per day with m ≥ md.

We use the QDDS and QDM Java applications (available at http://quake.wr.usgs.gov/research/software) to obtain the data (time, locations, and magnitude) in real time from several regional networks (southern and northern California, Nevada) and to create a composite catalog. We automatically update our forecast each day. The model parameters are estimated by optimizing the prediction (maximizing the likelihood of the model) using retrospective tests. The inversion method and the results are presented in the Definition of the Likelihood and Estimation of the Model Parameters section.

Application of ETES Model for Time-Dependent Forecasts

By definition, the ETES model provides the average instantaneous seismicity rate λ(t) at time t given by (9), if we know all earthquakes that occurred until time t. To forecast the seismicity between the present time tp and a future time tp + T, we cannot use expression (9) directly, because a significant fraction of earthquakes that will occur between time tp and time tp + T will be triggered by earthquakes that will occur between time tp and time tp + T (see Fig. 5). Therefore, the use of expression (9) to provide short-term seismicity forecasts, with a time window T of 1 day, may significantly underestimate the number of earthquakes (Helmstetter and Sornette, 2003a).

Figure 5. Plot of the magnitude versus time for a few days in the ANSS catalog, which illustrates the fact that a significant fraction of earthquakes that will occur in the next day (between the present time tp and tp + T) may be triggered by earthquakes that will occur in the next day (tp < t < tp + T).

To solve this problem, Helmstetter and Sornette (2003a) proposed to generate synthetic catalogs with the ETES model to predict the seismicity for the next day by averaging the number of earthquakes over many scenarios. This method provides a much better estimation of the number of earthquakes than the direct use of (9) but is much more complex and time consuming. Helmstetter and Sornette (2003a) have shown that, for synthetic ETES catalogs, the use of (9) to predict the number of earthquakes between tp and tp + T underestimates the number of earthquakes that actually occurred by an approximately constant factor, independent of the number of future events. This means that the effect of yet unobserved seismicity is to amplify the aftershock rate of past earthquakes by a constant factor.

This result suggests a simple solution to take into account the effect of yet unobserved earthquakes. We can use the ETES model (9) to predict the number of earthquakes between tp and tp + T but with effective parameters k, μs, and α, which may be different from the true ETES parameters. Instead of using the likelihood of the ETES model to estimate these parameters, as done by Kagan (1991), we will estimate the parameters of the model by optimizing the likelihood of the forecasts, defined in the Definition of the Likelihood and Estimation of the Model Parameters section. These effective parameters depend on the duration (horizon) T of the forecasts.

Threshold Magnitude

An important problem when modeling the occurrence of relatively small earthquakes m ≤ 4 in California is that the completeness magnitude significantly increases after large earthquakes (Kagan, 2004). One effect of missing earthquakes is that the model overestimates the observed number of earthquakes because small earthquakes are not detected. But another effect of missing early aftershocks is to underestimate the predicted seismicity rate, because we miss the contribution from these undetected small earthquakes in the future seismicity rate estimated from the ETES model (9). Indeed, secondary aftershocks (triggered by a previous aftershock) represent an important fraction of aftershocks (Felzer et al., 2003; Helmstetter and Sornette, 2003b).

We have developed a method to correct for both effects of undetected small aftershocks. We first estimate the threshold magnitude as a function of the time from the mainshock and of the mainshock magnitude. We analyzed all aftershock sequences of m ≥6 earthquakes in southern California since 1985. We propose the following relation for the threshold magnitude mc(t, m) at time t (in days) after a mainshock of magnitude m:

$$m_c(t, m) = m - 4.5 - 0.75 \log_{10}(t) \quad \text{and} \quad m_c(t, m) \ge 2. \tag{15}$$

Of course, some fluctuations occur between one sequence and another, but relation (15) is correct within ≈0.2 magnitude units. This relation is illustrated in Figure 6 for the 1992 Joshua Tree m 6.1, 2003 San Simeon m 6.5, and 1992 Landers m 7.3 aftershock sequences.
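A time-dependent completeness threshold of this kind is straightforward to evaluate. The sketch below assumes that relation (15) has the form mc = m − 4.5 − 0.75 log10(t), floored at the usual threshold of 2. The coefficients 4.5 and 0.75 are stated here as an assumption about the published relation and should be checked against the original equation. The function names and the example event list are hypothetical.

```python
import math

def threshold_magnitude(t_days, m_main, m_d=2.0):
    """Assumed form of relation (15): m - 4.5 - 0.75 * log10(t), floored at m_d."""
    if t_days <= 0.0:
        raise ValueError("t_days must be positive")
    return max(m_d, m_main - 4.5 - 0.75 * math.log10(t_days))

def catalog_threshold(t_now, large_events, m_d=2.0):
    """Largest threshold implied by all earlier m >= 5 events, given as (time, magnitude)."""
    levels = [threshold_magnitude(t_now - t0, m0, m_d)
              for t0, m0 in large_events if m0 >= 5.0 and t0 < t_now]
    return max(levels, default=m_d)

# Threshold one day and about 15 minutes (0.01 day) after an m 7.3 mainshock.
after_one_day = threshold_magnitude(1.0, 7.3)
after_minutes = threshold_magnitude(0.01, 7.3)
```

Taking the maximum over all previous large events reproduces the step-like increases of the threshold after large aftershocks, under the assumption that each event raises the threshold independently.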

Figure 6. Magnitude versus time since mainshock for aftershocks of Joshua Tree m 6.1 (a), San Simeon m 6.5 (b), and Landers m 7.3 earthquakes (c). The continuous line represents the threshold magnitude estimated from (15) and includes the effect of all m ≥5 earthquakes. The vertical lines in (c) are due to the increase of mc(t) after large m ≥5 aftershocks. Dates of these earthquakes are shown in Table 2.

We use expression (15) to estimate the detection magnitude mc(t) at the time of each earthquake. The time-dependent detection threshold mc(t) is larger than the usual threshold md for earthquakes that occurred at short times after a large m ≥5 earthquake. We select only earthquakes with m > mc to estimate the seismicity rate (9) and the likelihood of the forecasts (21).

We can also correct the forecasts for the second effect, the missing contribution from undetected aftershocks in the sum (9). We can take into account the effect of missing earthquakes with md < m < mc(t) by adding a contribution to the number ρ(m) of aftershocks of detected earthquakes m > mc(t), that is, by replacing ρ(m) in (11) by

$$\rho^*(m) = \rho(m) + \frac{k\, b}{b - \alpha}\, 10^{b [m_c(t) - m_d]} \left\{ 1 - 10^{-(b - \alpha)[m_c(t) - m_d]} \right\}, \tag{16}$$

where mc(t) is the detection threshold at the time t of the earthquake, estimated by (15), due to the effect of all previous m ≥5 earthquakes. The second contribution corresponds to the effect of all earthquakes with md < m < mc(t) that occur on average for each detected earthquake. Practically, for a reasonable value of α ≈ 0.8, this correction (16) is of the same order as the contribution from observed earthquakes, because a large fraction of aftershocks are secondary aftershocks (Felzer et al., 2003), and because small earthquakes are collectively as important as larger ones for earthquake triggering if α = b.
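The size of this correction can be derived by integrating the productivity of undetected events over the GR distribution. The sketch below assumes ρ(m) = k·10^(α(m − md)) and a GR density of b·ln(10)·10^(−b(m′ − mc)) undetected events per unit magnitude for each detected event. Under those assumptions the integral from md to mc has the closed form used in `hidden_contribution`. The function names and the parameter values are hypothetical.

```python
import math

def productivity(m, k, alpha, m_d=2.0):
    """Direct productivity of a detected event (assumed form)."""
    return k * 10.0 ** (alpha * (m - m_d))

def hidden_contribution(m_c, k, alpha, b=1.0, m_d=2.0):
    """Expected triggering by undetected events (m_d < m < m_c) per detected event."""
    dm = m_c - m_d
    if dm <= 0.0:
        return 0.0
    if abs(b - alpha) < 1e-9:
        # Limit of the general expression as alpha approaches b.
        return k * b * math.log(10.0) * dm * 10.0 ** (b * dm)
    return (k * b / (b - alpha)) * 10.0 ** (b * dm) * (1.0 - 10.0 ** (-(b - alpha) * dm))

def corrected_productivity(m, m_c, k, alpha, b=1.0, m_d=2.0):
    """Productivity of a detected event plus the share attributed to missed events."""
    return productivity(m, k, alpha, m_d) + hidden_contribution(m_c, k, alpha, b, m_d)

# Toy case: threshold raised from 2 to 3, alpha = 0.8, b = 1, k = 1.
observed_part = productivity(3.0, k=1.0, alpha=0.8)
hidden_part = hidden_contribution(3.0, k=1.0, alpha=0.8)
```

In this toy case the hidden term is a few times the direct productivity of an m 3 event, which illustrates why the correction cannot be neglected when the threshold is raised.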

Spatial Distribution of Aftershocks

We have tested different choices for the spatial kernel f(r, m), which models the aftershock density at a distance r from the mainshock of magnitude m. We used a power-law function

$$f_{pl}(\mathbf{r}, m) = \frac{C_{pl}}{[\,|\mathbf{r}|^2 + d(m)^2\,]^{1.5}} \tag{17}$$

and a Gaussian distribution

$$f_{gs}(\mathbf{r}, m) = C_{gs} \exp\left[-\frac{|\mathbf{r}|^2}{2\, d(m)^2}\right], \tag{18}$$

where Cpl and Cgs are normalizing factors, such that the integral of f(r, m) over an infinite surface is equal to 1. The spatial regularization distance d(m) accounts for the finite rupture size and for location errors. We assume that d(m) is given by

$$d(m) = 0.5 + f_d \times 0.01 \times 10^{0.5 m}\ \text{km}, \tag{19}$$

where the first term accounts for location accuracy and the second term represents the aftershock zone length of an earthquake of magnitude m. The parameter fd is adjusted by optimizing the prediction and should be close to 1.0 if the aftershock zone size is equal to the rupture length as estimated by Wells and Coppersmith (1994).
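The difference between the two kernels is easiest to see by comparing how much probability mass each places within a given radius. The sketch below assumes d(m) = 0.5 + fd × 0.01 × 10^(0.5m) km and the normalizing factors Cpl = d/(2π) and Cgs = 1/(2πd²), which follow from requiring each kernel to integrate to 1 over the plane. The function names are hypothetical.

```python
import math

def aftershock_length(m, f_d=1.0):
    """Regularization distance d(m) in km: location accuracy plus scaled rupture length."""
    return 0.5 + f_d * 0.01 * 10.0 ** (0.5 * m)

def power_law_density(r, d):
    """Normalized power-law kernel, with C_pl = d / (2*pi)."""
    return d / (2.0 * math.pi) / (r * r + d * d) ** 1.5

def gaussian_density(r, d):
    """Normalized Gaussian kernel, with C_gs = 1 / (2*pi*d^2)."""
    return math.exp(-0.5 * (r / d) ** 2) / (2.0 * math.pi * d * d)

def mass_within(density, d, radius, steps=20000):
    """Midpoint-rule integral of 2*pi*r*density(r, d) from 0 to radius."""
    h = radius / steps
    return sum(2.0 * math.pi * (i + 0.5) * h * density((i + 0.5) * h, d) * h
               for i in range(steps))

d6 = aftershock_length(6.0)                            # 10.5 km when f_d = 1
gauss_mass = mass_within(gaussian_density, d6, 100.0)  # essentially all of the mass
power_mass = mass_within(power_law_density, d6, 100.0) # a visible share lies beyond 100 km
```

The power-law kernel keeps a noticeable fraction of its mass beyond 100 km for an m 6 mainshock, whereas the Gaussian kernel does not, so the two choices differ mainly in how they treat distant triggering.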

The Gaussian kernel (18), which describes the density of earthquakes at point r, is equivalent to the Rayleigh distribution ∼r exp[−(r/d)^2/2] of distances |r| used by Kagan and Jackson (2000). The choice of an exponent 1.5 in (17) is motivated by recent studies (Ogata, 2004; Console et al., 2003; Zhuang et al., 2004) that inverted this parameter in earthquake catalogs by maximizing the likelihood of the ETES model, and that all found an exponent close to 1.5. This choice is also convenient because the function (17) is integrable analytically. It predicts that the aftershock density decreases with the distance r from the mainshock as 1/r^3 in the far field, proportionally to the static stress change.

For large earthquakes, which have a rupture length larger than the grid cell size of 0.05° (≈ 5 km) and a large number of aftershocks, we can improve the model by using a more complex anisotropic kernel, as done previously by Wiemer and Katsumata (1999), Wiemer (2000), and Gerstenberger et al. (2005). We use the location of early aftershocks as a witness for estimating the mainshock fault plane and the other active faults in the vicinity of the mainshock. We compute the distribution of later aftershocks of large m ≥5.5 mainshocks by smoothing the location of early aftershocks (equation 20), where the sum is on the mainshock and on all earthquakes that occurred within a distance Daft(m) from the mainshock and at a time smaller than the present time tp and not larger than Taft from the mainshock. We took Daft(m) = 0.02 × 10^(0.5m) km (approximately two rupture lengths) and Taft = 2 days.

The kernel f(r, m) in (20) used to smooth the location of early aftershocks is either a power law (17) or a Gaussian distribution (18), with an aftershock zone length given by (19) for the mainshock, but fixed to d = 2 km for the aftershocks. The density of aftershocks estimated using (20) is shown in Figure 7 for the Landers earthquake, using a power-law kernel (Fig. 7a) or a Gaussian kernel (Fig. 7b). The distribution of aftershocks that occurred more than 2 hr after Landers (black dots) is in good agreement with the prediction based on aftershocks that occurred in the first 2 hr (white circles). In particular, the largest aftershock (Big Bear, m 6.4, latitude = 34.2°, longitude = −116.8°), which occurred about 3 hr after Landers, was preceded by other earthquakes in the first 2 hr after Landers, and is well predicted by our method. The Gaussian kernel (18) produces a density of aftershocks that is more localized than with a power-law kernel.

Figure 7. Density of aftershocks estimated by smoothing the location of early aftershocks (white circles) that occurred less than 2 hr after the Landers mainshock (m 7.3, 28 June 1992), using either a Gaussian kernel (18) (a) or a power-law kernel (17) (b).

The advantage of using the observed aftershocks to predict the spatial distribution of future aftershocks is that this method is completely automatic and fast, and it uses only information from the time and location of aftershocks that are available soon after the earthquake. It provides an accurate prediction of the spatial distribution of future aftershocks less than 1 hr after the mainshock, once enough aftershocks have occurred. Our method also has the advantage of taking into account the geometry of the active-fault network close to the mainshock, which is reflected by the spatial distribution of aftershocks.

Therefore, even if the spatial distribution of aftershocks is controlled by the Coulomb stress change, it may be more accurate, much simpler, and faster to use the method described previously rather than to compute the Coulomb stress change. Indeed, the Coulomb stress-change calculation requires the knowledge of the mainshock fault plane and the slip distribution, which are available only several hours or days after a large earthquake (Scotti et al., 2003; Steacy et al., 2004). Felzer et al. (2003) have already shown that a simple forecasting model (simplified ETES model), based on the time, location, and magnitudes of all previous aftershocks, better predicts the location of future aftershocks than the Coulomb stress-change calculations do.

Definition of the Likelihood and Estimation of the Model Parameters

We use a maximum likelihood method to test the forecasts and to estimate the parameters. We have five parameters to estimate: p (Omori exponent defined in equation 12), k and α (see equation 11), μs (number of background events per day, defined by equation 14), and fd (parameter defined by equation 19, which describes the size of the aftershock zone).

The log likelihood (LL) of the forecasts is defined by (Kagan and Jackson, 2000; Kagan et al., 2003b; D. Schorlemmer et al., unpublished manuscript, 2005)

$$LL = \sum_{i_t} \sum_{i_x} \sum_{i_y} \sum_{i_m} \log p[N_p(i_t, i_x, i_y, i_m), n], \tag{21}$$

where n is the number of events that occurred in the bin (it, ix, iy, im).

The expected number of events per bin Np(it, ix, iy, im) is given by the integral over each space-time-magnitude bin of the predicted seismicity rate λ(r, t, m):

$$N_p(i_t, i_x, i_y, i_m) = \int_{i_t} \int_{i_x} \int_{i_y} \int_{i_m} \lambda(\mathbf{r}, t, m)\, dt\, dx\, dy\, dm. \tag{22}$$

We take a step of T = 1 day in time, 0.05° in space, and 0.1 unit in magnitude. The forecasts are updated each day at midnight Los Angeles time. We assume a Poisson process (6) to estimate the probability p(Np, n) of having exactly n events in a bin for which the expected number of events is Np.

We can simplify the expression of LL by noting that we need to compute the seismicity rate only in the bins (ix, iy, im) that have a nonzero number of observed events n. We can rewrite (21) and (6) as

$$LL = -\sum_{i_t} N_p(i_t) + \sum_{n > 0} \left[ n \log N_p(i_t, i_x, i_y, i_m) - \log(n!) \right], \tag{23}$$

where the second sum runs over the bins with n > 0 and Np(it) is the total predicted number of events in the time bin it (between t(it) and t(it) + T), given by equation (24). The factor fi in (24) is the integral of the spatial kernel fi(r − ri) over the grid, which is smaller than 1.0 due to the finite grid size.
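The simplification rests on the fact that an empty bin contributes only −Np to the Poisson log likelihood, so the empty bins can be handled through the total expected count. A minimal check of that equivalence, with hypothetical bin values and function names:

```python
import math

def full_log_likelihood(expected, observed):
    """Sum of Poisson log-probabilities over every bin."""
    return sum(n * math.log(e) - e - math.lgamma(n + 1)
               for e, n in zip(expected, observed))

def sparse_log_likelihood(total_expected, occupied):
    """Same quantity from the total expected count and the occupied bins only.

    `occupied` holds (expected, n) pairs for the bins with n > 0.
    """
    return -total_expected + sum(n * math.log(e) - math.lgamma(n + 1)
                                 for e, n in occupied)

expected = [0.02, 0.5, 0.001, 1.3]   # hypothetical expected counts per bin
observed = [0, 1, 0, 2]              # hypothetical observed counts per bin
occupied = [(e, n) for e, n in zip(expected, observed) if n > 0]

ll_full = full_log_likelihood(expected, observed)
ll_sparse = sparse_log_likelihood(sum(expected), occupied)
```

The practical benefit is that the rate has to be evaluated bin by bin only where events occurred, which is a small fraction of the space-time-magnitude grid.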

We maximize the log likelihood LL defined by (21) using a simplex algorithm (Press et al., 1992, p. 402), and using all earthquakes with m ≥2 since 1 January 1985 and until 10 March 2004 to test the forecasts. We take into account in the seismicity rate (9) the aftershocks of all earthquakes with m ≥2 since 1 January 1980 that occurred within the grid ([32.45° N to 36.65° N] in latitude and [121.55° W to 114.45° W] in longitude) or at less than 1° outside the grid. There are 65,664 target earthquakes above the threshold magnitude mc in the time and space window used to compute the LL. We test different models for the spatial distribution of aftershocks, a power-law kernel (17) or a Gaussian (18).

We use the probability gain per earthquake G to quantify the performance of the short-term prediction by comparison to the time-independent forecasts

\[ G = \exp\left( \frac{LL_{\mathrm{ETES}} - LL_{\mathrm{TI}}}{N} \right) \tag{25} \]

where LLTI is the log likelihood of the time-independent model, LLETES is the log likelihood of the ETES model, and N is the total number of target events. The time-independent model is obtained by taking the background density μ(r) described in the Time-Independent Forecasts section and normalizing μ(r) so that the total forecasted number of m ≥ mmin events is equal to the observed number. The gain defined by (25) is related to the information per earthquake I defined by Kagan and Knopoff (1977) (see also Daley and Vere-Jones [2004] and Harte and Vere-Jones [2005]) by G = 2^I.
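As an illustration of how the gain and the information per earthquake relate, the sketch below assumes that G is the exponential of the log-likelihood difference per target event and that I is measured in bits, so that I = log2(G). The function names and the log-likelihood values are hypothetical.

```python
import math

def probability_gain(ll_ti, ll_etes, n_events):
    """Gain per earthquake: exp of the log-likelihood difference per event."""
    return math.exp((ll_etes - ll_ti) / n_events)

def information_bits(gain):
    """Information per earthquake in bits, assuming G = 2**I."""
    return math.log2(gain)

# Hypothetical log likelihoods for 50 target events.
gain = probability_gain(ll_ti=-200.0, ll_etes=-100.0, n_events=50)
bits = information_bits(gain)

# The gain of 11.5 quoted for the short-term model corresponds to about 3.52 bits.
bits_reported = information_bits(11.5)
```

Because G is a per-event geometric mean, a few events with a very large likelihood ratio can dominate it when N is small.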

Some caution is needed in interpreting the probability gain for the ETES model. Earthquake temporal occurrence is controlled by Omori's law, which diverges as time approaches zero. Calculating the likelihood function for aftershock sequences illustrates this point: the rate of aftershock occurrence after a strong earthquake increases by a factor of thousands. Because ln(1000) ≈ 6.9, one early aftershock yields a contribution to the likelihood function analogous to about seven additional free parameters. This means that the likelihood optimization procedure, as well as the probability gain value, strongly depends on early aftershocks. As Figure 6 demonstrates, many early aftershocks are missing from earthquake catalogs (Kagan, 2004); therefore, the likelihood substantially depends on poor-quality data at the beginning of the aftershock sequence.

Similarly, earthquake hypocenters are concentrated on a fractal set with a correlation dimension slightly above 2.0 (Helmstetter et al., 2005). Because of random location errors, the dimension measured at small interearthquake distances increases to nearly 3.0. This signifies that the likelihood would substantially depend on location uncertainty, because the kernel width Kd(r) (equations 2 and 4) can be made smaller if a catalog with higher location accuracy is used.

Results and Discussion

Model Parameters and Likelihood

The model parameters are obtained by maximizing the LL. The optimization usually converges after about 100 iterations. We have checked that the final values do not depend on the initial parameters. The results are given in Table 1 and in Figure 8. We have tested different versions of the model (various spatial kernels, unconstrained or fixed α value, and different values of the minimum magnitude).

Figure 8. Results of the optimization of the log likelihood LL for the unconstrained model 1 by using a Gaussian kernel. Value of LL as a function of each model parameter (a)–(e) and as a function of the number of iterations (f).

An example of our daily forecasts (using model 3 in Table 1) is shown in Figure 9 for 23 October 2004. All six earthquakes that occurred during that day are located in areas of high predicted seismicity rates (large values of Np). All except one occurred close enough in time and space to a recent earthquake that the short-term predicted number Np(r) is larger than the average rate μ(r). The probability gain per earthquake (25) for this day is G = 26.

Figure 9. (a) Forecasted number of events with m ≥2 per cell for 23 October 2004 (logarithmic scale). Black circles represent observed earthquakes with m ≥2 that occurred during this day. Two of these events are aftershocks of the m 6.5, 22 December 2003 San Simeon earthquake (located at latitude 35.7° and longitude −121.1°), three are associated with the m 6, 28 September 2004 Parkfield mainshock (latitude 35.81° and longitude −120.37°), and one is an aftershock of an m 3.7, 9 September 2004 earthquake (latitude 35.09° and longitude −117.52°). The predicted number of events for this day was 8.39 and the observed number was 6. Most of these earthquakes are better predicted by the ETES model than by the time-independent model (i.e., Np > μ in the cells within which these earthquakes occurred). Only one event, which occurred 11 km from the San Simeon earthquake (latitude 35.81° and longitude −121.02°), was better predicted by the time-independent model, because it occurred just outside the main aftershock zone. (b) Ratio of the forecasted number of events estimated using the ETES and the time-independent models (logarithmic scale). High values of Np/μ (up to 800) are associated with recent large earthquakes, such as Parkfield, San Simeon, Landers, Hector Mine, and Northridge.

Figure 8 shows the LL of the daily forecasts, for the period from 1 January 1985 to 10 March 2004, and for each iteration of the optimization as a function of the model parameters. The variation of the LL with each model parameter gives an idea of the resolution of this parameter. The unconstrained inversion gives a probability gain G = 11.7 and an exponent α = 0.43, much smaller than the direct estimates α = 0.8 ± 0.1 (Helmstetter, 2003) or α = 1 (Felzer et al., 2004; Helmstetter et al., 2005) obtained by fitting the number of aftershocks as a function of the mainshock magnitude. The optimization with α fixed to 0.8, closer to the observed value, provides a probability gain G = 11.1, slightly smaller than that of the best model. Note that there is a negative correlation between the parameters k and α (defined by equation 11) in Table 1: k is larger for a smaller α to keep the number of forecasted earthquakes constant.

Comparison of Predicted and Observed Aftershock Rate

Figure 10 compares the predicted number of events following the Landers mainshock, for the unconstrained model 2 (see Table 1) and for models 3 and 5 with α fixed to 0.8. Model 3 underestimates the number of aftershocks but predicts the correct variation of the seismicity rate with time. In contrast, model 2 (with α = 0.43) greatly underestimates the number of aftershocks until 10 days after Landers, because the low value of α yields a relatively small increase of seismicity at the time of the mainshock. Model 2 then provides a good fit to the end of the aftershock sequence, once enough aftershocks have occurred that secondary aftershocks raise the predicted seismicity rate. The saturation of the number of aftershocks at early times in Figure 10 (for both the model and the data) is due to the increase of the threshold magnitude mc (see equation 15), which recovers the usual value md = 2 about 10 days after Landers. Adding the corrective term ρ*(m) defined by (16), to account for the contribution of undetected early aftershocks to the rate of triggered seismicity, better predicts the rate of aftershocks just after Landers but gives, on average, a smaller probability gain than the model without this corrective term (see models 3 and 5 in Table 1 and Fig. 10).

Figure 10. Observed (solid black line) and predicted number of m ≥2 earthquakes per day as a function of the time since the Landers mainshock, for model 2 (circles), model 3 (crosses), and model 5 (diamonds). The saturation at t ≤ 10 days is due to the incompleteness of the catalog for small magnitudes (see Fig. 6).

Figure 11 shows the predicted number of earthquakes and the probability gain (see equation 25) in the time window 1992–1995 for model 3. The model underestimates the rate of aftershocks for the Joshua Tree (m 6.1) and Landers (m 7.3) mainshocks, slightly overestimates it for Northridge (m 6.6), and provides a good fit (not shown) for Hector Mine (m 7.1) and for San Simeon (m 6.5). All models overestimate by a factor larger than 2 the aftershock productivity of the 1987 m 6.6 Superstition Hills earthquake. This shows that there is a variability of aftershock productivity that the model does not take into account, which may in part be due to errors in magnitudes. This implies that a model that estimates the parameters of each aftershock sequence (aftershock productivity, Omori p exponent, and the GR b-value), such as the STEP model (Gerstenberger et al., 2005), may perform better than the ETES model, which uses the same parameters for all earthquakes (except for the increase in productivity ρ(m) with magnitude).

Figure 11. (a) Observed (black) and predicted (gray) number of m ≥2 earthquakes per day in southern California for model 3 (see Table 1). Dashed line is the background rate μs = 2.81/day. (b) Probability gain per earthquake defined in (25).

Figure 12 shows the predicted number of m ≥2 earthquakes per day for model 3 (see Table 1) as a function of the observed number. Most points in this plot are close to the diagonal, that is, the model, in general, gives a good prediction of the number of events per day. A few points however have a large observed number of earthquakes but a small predicted number. These points correspond to days on which a large earthquake and its first aftershocks occurred, whereas the preceding seismicity was close to its background level, and the predicted seismicity rate was small.

Figure 12. Predicted number of m ≥2 earthquakes per day for model 3 (see Table 1) as a function of the observed number, for the period 1985–2003. The dashed line represents the perfect fit. The horizontal line is the background rate μs = 2.81/day.

We could extend the model to take into account fluctuations of aftershock productivity, as done in the STEP model, by using early aftershocks to estimate the productivity ρ(m) of large earthquakes. Whether magnitude errors, biases, and systematic effects significantly contribute to prediction efficiency needs to be investigated, however. A method that adjusts parameters to available data may seemingly perform better, especially in retrospective testing when various adjustments are possible. But if aftershock rate fluctuations are caused by various technical factors and biases, this forecast advantage can be spurious.

Proportion of Aftershocks in Seismicity

The background seismicity is estimated to be μs = 2.81 earthquakes with m ≥2 per day for model 3, compared with the average seismicity rate μ = 9.4 per day; that is, the proportion of triggered earthquakes is 70%. This number underestimates the actual fraction of triggered earthquakes, because it does not count the early aftershocks that occur a few hours after a mainshock, between the present time tp and the end of the prediction window tp + T (see Fig. 6). We have also removed from the catalog aftershocks smaller than the threshold magnitude mc(t, m) given by (15).
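The 70% figure follows directly from the two rates quoted above, taking the triggered fraction to be the part of the average rate not accounted for by the background:

```latex
\[
1 - \frac{\mu_s}{\mu} \;=\; 1 - \frac{2.81}{9.4} \;\approx\; 1 - 0.30 \;=\; 0.70
\]
```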

Because the background rate represents only a small fraction of the total seismicity rate, the declustering procedure involved to estimate μ(r) has only a minor influence on the performance of our short-term forecast.

Scaling of Aftershock Productivity with Mainshock Magnitude

There may be several reasons for the small value α = 0.43 selected by the optimization, compared with the value α = 1 estimated by Felzer et al. (2004) and Helmstetter et al. (2005). A smaller α value corresponds to a weaker influence of large earthquakes. A model with a small α thus has a shorter memory in time and can adapt faster to fluctuations of the observed seismicity. A smaller α predicts a larger proportion of secondary aftershocks after a large mainshock. Therefore, it can better account for fluctuations of aftershock productivity. Indeed, if the rate of early aftershocks is low, a model with a small α will predict a small number of future aftershocks (fewer secondary aftershocks).

A model with a smaller α is also less sensitive to magnitude errors. An error of 0.3 in the mainshock magnitude changes the rate of direct aftershocks by a factor of 2.0 for α = 1 and by a factor of 1.3 for α = 0.4. Finally, a model with a smaller α may provide a better forecast for the spatial distribution of aftershocks. Because the aftershock spatial distribution is significantly different from the isotropic model (used for m ≤5.5 earthquakes), a model with a smaller α may perform better than the model with the true α. A small α gives more importance to secondary aftershocks and can thus better model the heterogeneity of the spatial distribution of aftershocks. In contrast, a larger α value produces a quasi-isotropic distribution at short times, dominated by the mainshock contribution.
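The sensitivity to magnitude errors can be reproduced with a short calculation. This sketch assumes that the productivity of direct aftershocks scales as 10^(αm), so that a magnitude error Δm multiplies the predicted rate by 10^(αΔm); the function name is illustrative.

```python
def rate_error_factor(alpha, magnitude_error):
    """Factor by which the direct-aftershock rate changes for a magnitude error,
    assuming productivity proportional to 10**(alpha * m)."""
    return 10.0 ** (alpha * magnitude_error)

factor_alpha_1 = rate_error_factor(1.0, 0.3)   # about 2.0
factor_alpha_04 = rate_error_factor(0.4, 0.3)  # about 1.3
```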

The corrective contribution ρ*(m) > ρ(m) (16), introduced to take into account the contribution of missing aftershocks, can also bias the value of α. Using this term ρ*(m) with a value of α smaller than the true value overestimates the contribution of small earthquakes just after a large earthquake when mc > md. For this reason we did not use this contribution (except for model 5 in Table 1).

The main interest of short-term forecasts is to predict the rate of seismicity after a large mainshock, when the best model with α = 0.4 clearly underestimates the observations. Therefore, we fix α = 0.8 (models 3 and 4 in Table 1). This model gives a slightly smaller likelihood than the best model but provides a better fit just after a large mainshock.

Spatial Distribution of Aftershocks

The power-law kernel (17) gives a slightly better LL than the Gaussian kernel (18) (see Table 1) for the unconstrained models 1 and 2 (α is an adjustable parameter), but the Gaussian kernel works a little better when α is fixed to 0.8 (see models 3 and 4 in Table 1). The parameter fd defined in (19) is the ratio of the typical aftershock zone d(m) (19) to the mainshock rupture length L(m) = 0.01 × 10^(0.5m) km. For the Gaussian kernel (18) fd ≈ 1, that is, the average distance between a mainshock and its (direct) aftershocks is close to the mainshock rupture length.
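To give a sense of the length scales involved, the sketch below evaluates the rupture-length relation L(m) = 0.01 × 10^(0.5m) km for a few magnitudes. The function name is illustrative, and the sketch deliberately stops at L(m): the exact form of d(m) in (19) is not restated here, so no attempt is made to reproduce it.

```python
def rupture_length_km(magnitude):
    """Mainshock rupture length L(m) = 0.01 * 10**(0.5 * m), in km."""
    return 0.01 * 10.0 ** (0.5 * magnitude)

length_m55 = rupture_length_km(5.5)  # roughly 5.6 km
length_m60 = rupture_length_km(6.0)  # 10 km
length_m73 = rupture_length_km(7.3)  # roughly 45 km, Landers-sized event
```

With fd ≈ 1 for the Gaussian kernel, these values are also the approximate scale of the direct aftershock zone.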

For the power-law kernel (17), the average distance is not defined. In this case, d(m) is the distance at which fpl(r) starts decreasing with r. The inversion of fd using a power-law kernel gives an unrealistically small value fd ≤ 0.06 for model 2 (see Table 1), so that d(m) ≈ 0.5 km (fixed minimum value of d(m) equal to the location error) independently of the magnitude of the triggering earthquake for m ≤5. It gives short-range interactions, with most of the predicted rate concentrated in the cell of the triggering earthquake. Using a complex spatial distribution of aftershocks for m ≥5.5 earthquakes (obtained by smoothing the location of early aftershocks; see the Spatial Distribution of Aftershocks section) slightly improves the LL compared with the simple isotropic kernel (see models 3 and 6 in Table 1).

Probability Gain as a Function of Magnitude

Table 1 shows the variation of the probability gain G (25) as a function of the minimum magnitude of target events mmin. We used m ≥2 earthquakes in models 7–11 to estimate the forecasted rate of m ≥ mmin earthquakes, with mmin ranging between 3 and 6, and using the same parameters as in model 3 (but rescaling the background rate μs according to the magnitude distribution to estimate the background rate for m ≥ mmin earthquakes). The probability gain is slightly larger for mmin = 3 than for mmin = 2, but then G decreases with mmin for mmin ≥ 4. For mmin = 6 (only eight earthquakes), the time-independent model (with a rate adjusted so that it predicts the exact number of observed events) performs even better than the ETES model (G < 1) for model 10 in Table 1.

We think that this variation with mmin does not mean that our model predicts only small earthquakes (aftershocks), or that larger earthquakes have a different distribution in space and time than smaller ones. Rather, these results simply reflect the large fluctuations of the probability gain from one earthquake to another; the difference in likelihood between the ETES and time-independent models is mainly due to a few large aftershock sequences. We thus need a large number of earthquakes and aftershock sequences to compare different forecasts (see also the discussion at the end of the section on Definition of the Likelihood and Estimation of the Model Parameters).

Table 2 compares the predicted seismicity rate at the time and location of each m ≥6 earthquake, estimated for the ETES model and for the time-independent model. For each earthquake, Table 2 gives two values of the predicted number of earthquakes, using the same ETES model parameters but changing the time at which we update the forecasts: either midnight (universal time) for model 10 (see line 10 in Table 1) or 1:00 p.m. for model 11. The large differences in the predicted seismicity rate between these two models show that the forecasts are very sensitive to short-term clustering, which has a large influence on the predicted seismicity rate. This suggests that the number of m ≥6 earthquakes in the catalog (eight earthquakes from 1985 to 2004) is too small to compare our short-term and time-independent models for this magnitude range.

Table 2. Comparison of the Predicted Number of m ≥6 Events per Day, for the Days When an m ≥6 Earthquake Occurred, at the Location of the Earthquake (i.e., within the Cell of 0.05° × 0.05°), for the ETES Model (NETES, using Models 10 and 11 in Table 1) and for the Time-Independent Model (NTI)

While some of these m ≥6 earthquakes are preceded by a short-term (hours) increase of seismicity (Superstition Hills, Joshua Tree, Landers, Big Bear, Hector Mine), the time-independent model performs better than the ETES model if the forecasts are not updated between the foreshock activity and the mainshock (e.g., with model 10, between Elmore Ranch and Superstition Hills, and between Landers and Big Bear). Landers occurred about two months after Joshua Tree, and its hypocenter was just outside the Joshua Tree aftershock zone, so that the predicted seismicity rate at the location of the Landers hypocenter, before the precursory foreshock activity (which started 6 hr before Landers), was slightly lower than the average rate. Joshua Tree had foreshocks, which started 2 hr before the mainshock and thus were not included in the daily forecasted rate for both ETES models. Hector Mine was also preceded by foreshocks, with m ≥3.6, which started about 20 hr before the mainshock. Therefore, the predicted seismicity rate (using ETES model 10) is 120 times larger than the average rate for Hector Mine. Other large m ≥6 earthquakes (Elmore Ranch, Northridge, San Simeon) were not preceded by any significant foreshock activity. Therefore, the forecasted seismicity rate was smaller than the average rate.

Updating the forecasts more often (each hour, or after each earthquake) would of course improve the performance of our short-term forecasts. But optimizing and testing the forecasts would then be much more difficult and time consuming if the duration of the forecasts (one day) is different from the interval between two forecasts. Moreover, preliminary earthquake catalogs are much less accurate in the first few hours, especially after a strong earthquake.

Discussion and Conclusion

We have first developed a time-independent model of seismicity in southern California, obtained by smoothing the location of previous m ≥2 earthquakes. Including small earthquakes improves the spatial resolution of our model; therefore, our forecasts outperform the previous model of Kagan and Jackson (1994), which used only m ≥5.5 events (both historical and instrumental). Our model also performs better than a more complex one, which incorporates geological data (Frankel et al., 1997), when tested on m ≥ 5 earthquakes since 1996. Note that the difference between those models may be negligible for hazard assessment because of the smoothing inherent in forecasting ground motion. The better resolution obtained with our method may be important however for testing and understanding the physical mechanisms of earthquake triggering.

We have then developed daily earthquake forecasts, which use our time-independent model for the background seismicity level, by adding a time-dependent contribution to model triggered seismicity. Our model is based on empirical laws of seismicity: the G-R magnitude distribution, Omori's law, and the exponential increase of triggered seismicity with the mainshock magnitude. Our model includes only data from earthquake catalogs (time, magnitude, and locations).

Our model also forecasts well the spatial distribution of future aftershocks by smoothing the locations of early aftershocks. We can obtain a good forecast of the aftershocks within a few hours of a large m ≥5.5 earthquake, based on plentiful early aftershocks. Even if the spatial distribution of aftershocks is controlled by Coulomb stress changes, our empirical method may be more accurate and faster than direct calculations of the Coulomb stress change. Our method is accurate because the distribution of early aftershocks represents well the mainshock rupture surface and because our method accounts for secondary aftershocks.

Retrospective tests for m ≥2 earthquakes in the period 1 January 1985 to 10 March 2004 show that our short-term model realizes a probability gain of 11.5 over a stationary Poisson forecast. Several features of our model could be improved. First, geologic slip rate and geodetic strain rate data could be used to better constrain the time-independent seismicity. Second, a better estimate of the magnitude distribution, resulting from statistical studies of the relationship between fault geometry and earthquakes, could improve the forecasting of large quakes. Third, other research (e.g., Gerstenberger et al., 2005) suggests that aftershock productivity and magnitude distribution may vary considerably from one sequence to another. Comparing our model with others proposed to the RELM working group (Kagan et al., 2003a; Jackson et al., 2004; D. Schorlemmer et al., unpublished manuscript, 2005) should help to improve all available models. For example, seismicity forecasts for the STEP model of Gerstenberger et al. (2005) can be compared directly with our model.

Both our models have been tested on the same data as the data used to build the models. The time-independent model (also used as the background rate in ETES) depends on the location of all earthquakes until 2003 (with aftershocks removed). A better test would be a pseudo-real-time test, using completely different data to estimate the model parameters and to compare the models. But there are unfortunately not enough data to do so. The value of the likelihood for a real-time prediction will thus probably be smaller (for both models) than in the tests performed in this article. But the results should not vary too much, because the number of adjusted parameters (4) is much smaller than the number of target earthquakes (65,664).

We have tested our models on relatively “small” earthquakes, using a minimum magnitude mmin = 5 for the time-independent model and an mmin ranging from 2 to 6 for our daily forecasts. Damaging earthquakes are usually m ≥6, but there are not enough large earthquakes to perform meaningful tests. The fact that our time-independent model, obtained by smoothing m ≥2 earthquakes, correctly predicts the location of m ≥5 earthquakes is encouraging, however, and suggests that our model also applies to larger damaging earthquakes, because large events are likely to occur at the same location as smaller ones.

In addition to the epicenter, seismic-hazard estimation also requires the specification of the fault plane. Kagan and Jackson (1994) have developed a method to forecast the orientation of the fault plane by smoothing the focal mechanisms of past m ≥5.5 earthquakes. As for forecasting epicenters, it may be useful to include small earthquakes in the forecasts to improve the resolution.

Acknowledgments

We acknowledge the Advanced National Seismic System for the earthquake catalog. We are grateful to the Editor Andrew Michael, to the reviewer Kristy Tiampo, and to an anonymous reviewer for useful suggestions. This work is partially supported by NSF-EAR02-30429, NSF-EAR-0409890, the Southern California Earthquake Center (SCEC), the James S. McDonnell Foundation 21st century scientist award/studying complex system, and the Brinson Foundation. SCEC is funded by NSF Cooperative Agreement EAR-0106924 and USGS Cooperative Agreement 02HQAG0008. The SCEC contribution number for this article is 895.