I know I made a thread asking a statistics-related question a while back, but I think this is sufficiently unrelated that it deserves its own thread. Mods, if you see it differently, by all means merge them.

I have been keeping a diary for three years, of which the last half-year is typed, with each day's wordcount recorded in an Excel file. Over the course of the next year I plan to type up all of my old handwritten entries in readiness for (perhaps) posting old entries in a sort of "blog from the past", so I will end up with about 1500 data points. Already I have enough data to notice a few trends:

As expected for a random walk, the cumulative wordcount over time is fractal in nature. Curiously, on various scales it also seems to be fairly close to piecewise linear if the finer detail is ignored. This could just be my perception, but if not then I'd be interested to know what the significance is, if any.

The wordcount probability density function (considered independently of time) seems, so far, to be surprisingly close to uniform up to about 1250 (2/3 of data), and fairly close to uniform but with a lower density from there up to about 2000 (1/12 of data), with a couple of percent of entries longer than that and the rest empty, i.e. no entry written on that day. Again, I'm interested as to why this is. I would have expected a more bell-like distribution.

I'm also curious about how entry lengths on given days are dependent on those from previous days. In short, I'm interested in the various sorts of observations and predictions which could be made about the time series based on the data I have so far.

As well as having the data in raw form, I've represented it with four graphs: three scatter graphs showing entries by actual wordcount, percentage ranking and logarithm of wordcount, and a cumulative graph. The first three currently have moving averages showing trends, but I've been wondering whether there might be a better way to do this. It occurred to me to represent the trend as a function of time which minimized the average distance from any point on the trendline to the data points (using mean squared distance wouldn't work, as it would result in a constant function). Might this idea, or something like it, work?


And, of course, a simple correlation measure between adjacent days might be interesting.
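For what it's worth, here is a minimal sketch of that adjacent-day correlation, computed as the lag-1 sample autocorrelation. The function name and the wordcounts are made up for illustration, and I'm assuming days with no entry are stored as 0.

```python
from statistics import mean

def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    # Denominator: total squared deviation from the overall mean.
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Numerator: products of deviations for points `lag` days apart.
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom

# Hypothetical daily wordcounts (0 = no entry written that day).
wordcounts = [850, 920, 400, 0, 1300, 1250, 700, 640, 0, 1100]
r1 = lag_autocorrelation(wordcounts, lag=1)
print(f"lag-1 autocorrelation: {r1:.3f}")
```

Values near 0 would suggest one day's length tells you little about the next; calling it with lag=7 would be a quick check for a weekly pattern.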

I would have expected a more bell-like distribution.

Bell curves occur when you look at the distribution of the averages of large numbers of independent samples, each drawn from the same distribution with finite variance (i.e. the central limit theorem).

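It seems unlikely that each character (or word) added to an entry is anywhere near independent of the ones before it, so there is no particular reason for the length of a single entry to follow a bell curve.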

Now, presuming the length of each "book" of diaries is uncorrelated with the others (or only lightly correlated), and each book is large enough, then the averages of the entry lengths in each book should be distributed roughly as a bell curve (once again, CLT).
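A quick simulation shows the effect. This is only a sketch under stated assumptions: entry lengths are taken to be independent and uniform on 0 to 1250 words, and a "book" is taken to be 30 consecutive entries; neither figure comes from real data.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

def block_means(samples, block_size):
    """Average consecutive non-overlapping blocks of `block_size` samples."""
    return [
        mean(samples[i:i + block_size])
        for i in range(0, len(samples) - block_size + 1, block_size)
    ]

def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    m, s = mean(values), stdev(values)
    return sum(abs(v - m) <= s for v in values) / len(values)

# Assumption: independent entry lengths, uniform on [0, 1250].
entries = [random.uniform(0, 1250) for _ in range(30 * 400)]
means = block_means(entries, block_size=30)

# A normal distribution has about 68% of its mass within one standard
# deviation of the mean; a uniform distribution has only about 58%.
print("raw entries:", round(share_within_one_sd(entries), 2))
print("book means: ", round(share_within_one_sd(means), 2))
```

The raw entries stay flat, while the per-book averages should land close to the normal figure. With real diary data the independence assumption is the part most likely to fail.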


You might want to try to find an underlying trend using a neural network. There are some NN designs that are specialized for time-series prediction, but I don't have much experience working with them, so I don't really know where to point you.

Using a common perceptron network would probably require you to make some assumptions about the trend function, like splitting up the time dimension into a couple of cyclic parameters wherever you expect there to be repetitive trends in the data. I've predicted time series that were kind of similar to what you described using this method, with pretty good results. The data in my case was the number of customers in line for the checkouts at a warehouse, sampled every hour for a couple of hundred days. This data showed rough cyclic behavior over each day, week, month and year (as one might expect, of course).
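To make "cyclic parameters" concrete, one common encoding (not necessarily the exact one I used) maps each position in a cycle onto a circle with a sine/cosine pair, so that Sunday ends up next to Monday instead of six steps away. The function name and the choice of weekly and yearly periods below are assumptions for illustration.

```python
import math
from datetime import date

def cyclic_features(day):
    """Encode a date as sine/cosine pairs for weekly and yearly cycles."""
    features = []
    for position, period in (
        (day.weekday(), 7),                      # day of week, 0 = Monday
        (day.timetuple().tm_yday - 1, 365.25),   # day of year, 0-based
    ):
        angle = 2 * math.pi * position / period
        features.extend((math.sin(angle), math.cos(angle)))
    return features

# These four numbers (plus anything else you care about) would be the
# network inputs, with the day's wordcount as the training target.
print(cyclic_features(date(2024, 1, 1)))
```

A daily series like the diary has no within-day cycle, so the hourly term from the checkout data is left out here.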

This isn't directly related to what you describe, but something cool which you could do to your data once you've got it all typed up is to try running some clustering algorithm over it to see whether you can make it group 'similar' diary entries together (e.g. group together all the entries which are about girlfriend trouble, and find another cluster for entries where you talk about your job, and so on).

There are a few papers on this sort of unsupervised text clustering where they run algorithms over a database of Reuters news stories and try to make them group together all the stories covering certain events (e.g. separating Israel-Palestine stories from ones about the US election). I could link/send them if you're interested.
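Whatever clustering algorithm you pick, it needs some measure of how alike two entries are. As a rough sketch (this is a plain bag-of-words cosine similarity, not the method from any of those papers, and the sample entries are invented):

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical diary snippets.
work_1 = word_counts("Long day at work, the boss moved the deadline again")
work_2 = word_counts("The deadline at work slipped and the boss was furious")
other = word_counts("Went hiking, saw a heron by the river")

print(cosine_similarity(work_1, work_2), cosine_similarity(work_1, other))
```

In practice you would probably want to down-weight very common words (e.g. with TF-IDF) before feeding the similarities to a clustering algorithm, otherwise "the" and "and" dominate.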

The document that basically explains the approach I used when I was doing time series analysis is an Australian Bureau of Statistics publication entitled An Introductory Course on Time Series Analysis. It's mostly concerned with monthly and quarterly series, though, so you'd have to do some work to adapt it for daily data. You could try creating a spline-based trend, probably a cubic one, so that you can identify turning and inflection points. You probably also want to look into intervention analysis, where you apply corrections to your series to account for things like one-off extreme values and level shifts.
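As a very crude stand-in for the "one-off extreme values" part (proper intervention analysis estimates the correction inside the time series model, which this does not attempt), you could flag entries whose modified z-score, based on the median and the median absolute deviation, is unusually large. The function names, the 3.5 cut-off and the sample data are assumptions for illustration.

```python
from statistics import median

def flag_extremes(series, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`."""
    med = median(series)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(x - med) for x in series])
    if mad == 0:
        return []  # more than half the values are identical; nothing to scale by
    return [
        i for i, x in enumerate(series)
        if 0.6745 * abs(x - med) / mad > threshold
    ]

def replace_extremes(series, threshold=3.5):
    """Copy of `series` with flagged values replaced by the overall median."""
    flagged = set(flag_extremes(series, threshold))
    med = median(series)
    return [med if i in flagged else x for i, x in enumerate(series)]

# Hypothetical wordcounts with one unusually long entry.
wordcounts = [800, 900, 850, 950, 5000, 870, 820]
print(flag_extremes(wordcounts))
print(replace_extremes(wordcounts))
```

Fitting the trend to the corrected series rather than the raw one keeps a single marathon entry from dragging the whole curve upwards. Level shifts need a different treatment and are not handled here.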