Cyclic variations of variables are ubiquitous in biomedical science. A number of methods for detecting rhythms have been developed, but they are often difficult to interpret. A simple procedure for detecting cyclic variations in biological time series and quantification of their probability is presented here. Analysis of rhythmic variance (ANORVA) is based on the premise that the variance in groups of data from rhythmic variables is low when a time distance of one period exists between the data entries. A detailed stepwise calculation is presented including data entry and preparation, variance calculating, and difference testing. An example for the application of the procedure is provided, and a real dataset of the number of papers published per day in January 2003 using selected keywords is compared to randomized datasets. Randomized datasets show no cyclic variations. The number of papers published daily, however, shows a clear and significant (p<0.03) circaseptan (period of 7 days) rhythm, probably of social origin.

Cycles and periodicities can be found nearly everywhere in nature. Nevertheless, chronobiology still has the reputation of something strange and curious. Most rhythms in biomedicine can be classified according to their period lengths into circadian (period ~ 24 h), ultradian (period < 24 h), and infradian (period > 24 h). The best known is the circadian rhythm, e.g. the sleep-wake cycle. Its genetic basis and the interactions between the internal Zeitgeber (pacemaker) and external influences are well understood. Interested readers can find comprehensive information in a number of recent reviews (2; 4; 9; 10; 15; 21). An example for ultradian cycles is the variation of gonadotropin releasing hormone levels (8). Infradian rhythms include circannual variations and the circalunar menstrual cycle in women (13; 16). In a narrow

interpretation, infradian characterizes the cycles with shorter periods than the circalunar cycle. These infradian rhythms are rarely described. One possible reason for this is the number of methodological problems regarding cycle detection (7; 17). A growing number of statistical methods exist that should make the search for cycles in datasets easier (9; 11; 17; 18; 20). On the other hand, the variety of statistical procedures used indicates that there is no clear answer regarding which method is correct. I would like to add some entropy to this chaos and present a new method that is easy to use and understand for detecting periodicities in biological time series.

METHODS

Most methods are very difficult to understand and use without specific commercial software. Moreover, the complexity of the procedures makes it difficult to accurately interpret the results, particularly when considering that every method has its own characteristics and requires specific data conditions. My method is based on the premise that groups of values from cycling time series have a low variance (theoretically zero) if the time distance between the data entries is one period of the cycle. That is why I call the method Analysis of Rhythmic Variance (ANORVA). The detailed step-by-step procedure is presented here so it can be reproduced and used by other scientists.

Step 1 _ Data entry and preparation

The minimum number of subjects/objects or datasets is 5. The quantity of observed data in every dataset determines the provable period lengths. Minimum provable period length (Min PPL) is triple the time distance between the two closest data entries in one subject/object, and the maximum provable period length (Max PPL) is half of the whole minimum observation period. Therefore, when a dataset consisting of daily measurements for 30 days, possible period lengths between 3 (3x1) and 15 (30/2) days can be proven. Randomized number sets represent the negative control used for the comparison with the real data. To show how this method works, two negative controls are used here and compared to each other. The randomized datasets should have the same dimensions as the proven variable. The positive control used was the number of papers published daily in January 2003 using selected keywords according to the PubMed Medline database. See Figure 1 for details.

Datasets must be full; no blank data entries are feasible. If no other possibility exists, entries should be filled using interpolation and averages of the data surrounding the blank entry. Datasets are not required to be of the same extent in every subject, although similar extents are desirable. Values that are clearly different from the rest should be excluded if they are affected by external influence of a known non-cyclic factor.

Figure 1. The keywords 'DNA,' 'Heart,' 'PCR,' 'Brain,' 'RNA,' and 'Kidney' were chosen for analysis. The number of papers published on these topics each day in January 2003 were estimated using the PubMed Medline database search. Data from January 1-2, 2003 were excluded due to the extremely high values probably influenced due to the retention of the papers during Christmas holidays. Following data show a 7-day cyclic variation with clear nadirs on weekends, particularly on Sundays as indicated by arrows.

The values of January 1-2, 2003 were excluded as they were higher by order of magnitude in comparison with other data entries, probably due to the accumulation of the papers over the Christmas holidays. As this exclusion affects the size of the tested negative control dataset, the randomized datasets must be cut off in the same way.

If needed, the original data must be transformed to achieve normal distribution in biological data, often by the log-normal transformation. Data must be transformed into a non-dimensional variable by dividing the data entries through their average value. Thus, the average value of the new dataset must be 1. Moreover, if a significant trend is present, the time series must be detrended using the results of the relevant regression analysis.

Step 2 _ Variance calculating

The process begins with the Min PPL, calculating the variance of a group consisting of data entries with the Min PPL time distance between them. The same is repeated for the remaining phase shifts until the first phase shift is reached.

The first group of data will consist of our example of the 1, 4, 7, 10, , 3n+1 data entry. The second group (next phase shift) will consist of the 2, 5, 8, 11, , 3n+2. With 3 as the Min PPL, the third group (3, 6, 9, 12, , 3n) is the last group.

The average variance is calculated from the results of all phase shiftsand is repeated with every subject/object. The entire procedure is repeated for every period length between Min PPL and Max PPL.

Subjects/objects are represented in our example by the keywords. With Min PPL=3, the average variance can be calculated from 3 variances that respond to 3 possible phase shifts. This must be repeated with every keyword. Testing for the 4-day period is the same except that there are 4 possible phase shifts (4n+1, 4n+2, 4n+3 and 4n). Other periods are treated in the same way with an increasing number of possible phase shifts up to the Max PPL=15.

To obtain a similar dispersion of the results in randomized and real datasets, the resulting mean variances must be divided by the average variance and the dynamic weight of the results increased. A similar calculation was done in Step 1. Repeat the entire procedure with the randomized dataset. The results are the dynamic weighted average variances for phase shifts of every provable period for every subjects/object.

Step 3 _ Testing the difference

If n is the number of subjects/objects in the dataset, then for every tested period p there is a set of n dynamic weighted average variances for real data and n dynamic weighted average variances for randomized data. These two new groups of values must be compared using a simple Student t-test or a non-parametric test such as the Mann-Whitney U test. The t-test is preferred due to its statistical power and robustness; the one-tailed t-test is used here, although the choice of test type used in this step is not crucial for ANORVA. The resulting p-value is a quantification of the probability that a possible observed cycle is based only on coincidence. Thus, a low p-value, particularly one lower than 0.05, indicates a low periodic variance and a high probability for the existence of a cycle with the given period length in the proved dataset. Calculating p-values for every period makes it possible to compare the probabilities of the potential cycles and to find a cycle with a period length between Min PPL and Max PPL.

The process of calculating the p-values is simplified by using the built-in functions of office software packages; sophisticated statistical software is not necessary for these simple calculations, but an adjustment for multiplicity in testing, such as the Bonferroni correction, is needed to control the type I error rate. The resulting p-values are shown in graphs (Fig. 2). The randomized dataset is clearly free of cyclic variations, data on the papers published daily show an apparent and statistically significant rhythm of 7 days. This circaseptan cycle is the best-known rhythm of social origin (1, 6, 15). I believe that this cycle can be explained by the variations of publications due to the work-free weekend days, and indeed, the lowest number of publications coincides with Sundays.

Figure 2a. Periodicities in random data. The P-values in this graph are the result of the Student t-test that compared average variances in period lengths between 2 sets of randomized data, each representing a negative control. Although the p-values express great variability (explained by the small amount of data), they do not reach the alpha level of 0.05.

Figure 2b. Periodicities in real data. The P-values in this graph are the result of the Student t-test that compared average variances in period lengths between the dataset of daily published papers and of randomized data that represent a negative control. P-value for the 7-day period reaches the level of 0.0278. This means that the 7-day cycle variation is low enough to be presented as significant. Low p-value for a 14-day period can easily be explained, as it is twice the significant 7-day period length. Larger datasets might prove its existence.

It should be noted that further extensions of the method can be also used for synchronization of data and thus for preparing periodograms with an average cycle curve. The procedure is far more complicated, however, and beyond the scope of this article. The example contains synchronized data. The sole quantitative detection of rhythmic variations using ANORVA is independent from the de-synchronization of the subjects/objects as all possible phase shifts are analyzed.

CONCLUSION

The aim of this article was to offer a simple presentation of the basic principles and use of a new method for detecting cyclic variations in biological time series. Although other methods do exist, such as the widely-used Cosinor method and specific procedures based on Fast Fourier transformations (9; 11; 14;), they are difficult to interpret because they are dependent upon the morphology of the cycle (comparison to sinus wave) or have particular problems with noisy data. ANORVA is straightforward and easy to use and is independent of the morphology of the cycle curve. Its low specificity improves with the higher dataset sizes, but it can also handle small datasets.

An earlier study indicating the possibilities of the basic principle of ANORVA was the first to describe the circavigintan (period of 20 days) and circatrigintan (period of 30 days) cycles of salivary testosterone levels in men (5). Other studies on hormonal variations found in our lab using ANORVA have been submitted and are currently under consideration. Hopefully, the currently complicated chronobiological analysis will soon be easily understood and reproduced by interested scientists. Chronobiological research is a growing area of research throughout the world (19), and a simple method for the detection of periodical components of biological datasets is now more necessary than ever before.

Of course the ANORVA method must still be subjected to an objective evaluation, a detailed comparison with other methods, and have its weaknesses traced. However, this is a long-term process that requires time and feedback from other groups with experience using ANORVA on other datasets. Problems may occur with the number of randomized datasets, the choice of arithmetical rather than geometrical or other averages, or sensitivity toward hidden cycles. Identifying the method's limitations will help improve it and make it more universal, which is one of the reasons for providing a detailed description of the method's relatively simple calculation steps here. I believe that this clear and simple explanation will prove more helpful to readers than a complicated manuscript that could obscure more than it shows.

ACKNOWLEDGMENTS

I would like to express my appreciation to my mentors Assoc. Prof. Daniela Ostatnikova, MD, PhD and Assoc. Prof. Bozena Chovancova, Ing, PhD for the support, to my co-workers Julius Hodosy, MD and Rastislav Kostka, Ing for the contribution to this manuscript, to Katarina Melisova, Ing, Michaela Prihodova, MD and Ingrid Jurkovicova, MD for the inspiration and especially to Matus Kudela, MSc for the fruitful discussions and sustained criticism, which were most important in the development of ANORVA. This work was supported by Grant of the Comenius University No. 116/2004.