Overlord is trying to figure out why one of his servers suddenly stopped running all the tasks (which was nicely detected by that neat network monitoring tool that he wrote)...

Overlord already installed a debugger on the server and attached it to the process.

He now suspects that the mean QueryPerformanceCounter() suddenly jumped in value for no apparent reason...

--That's my holidays

UPDATE: Looks like I was right. The difference between current QPC time and what's in memory is about 64 days. Argh! That's not very nice... Now I have to worry about ridiculous timings in every place I use it.

By the way, x64dbg is a very good program, and they accept Bitcoin donations. If you are a programmer and will find it useful, don't forget to send them some love.

OK, part of the mystery is solved - this is how Windows Time actually works:

Quote

When the time service has determined which time sample is best, based on the above criteria, it adjusts the local clock rate to allow it to converge toward the correct time. If the time difference between the local clock and the selected accurate time sample is too large to correct by adjusting the local clock rate, the time service sets the local clock to the correct time.

What is not clear is why it seems to be adjusting the time in the opposite direction.

It moved the time backwards more than 2 minutes during one hour, before giving up and finally just setting it to the correct value. Need to add more debug logging and run the servers again for a day or two...

This is, by the way, a new and improved time-sync algorithm, completely rewritten. I solved the small drift problem and increased performance by about 3 times.

You can see that when nobody intervenes it holds the value dead still for hours, which means all nodes are synchronized within 1 second.

One thing is clear - VPS sucks. The system time and QPC on it immediately start to diverge in giant steps, compared to dedicated servers. That's probably the reason why Windows Time needs to constantly adjust it.

So when we deploy the real network we should strive to avoid VPS.

This means we will probably still need a monitoring group or two, which will estimate node quality and either provide an independent rating or reward the best N nodes.

This single VPS forces the other two nodes to constantly compensate for its antics. With only 3 servers, this leads to constant time drift; it should be smoother with more servers.

time_diff = ((time_diff << 3) - time_diff + new_diff) >> 3;

To my surprise, when I switched to it the standard deviation of time between nodes dropped from about 10 ms to less than 1 ms! An order of magnitude improvement!

What the...? How is this possible?!

Let's investigate.

====

The first thing I noticed is that the dead zone is not symmetrical:

Oops. But since both formulas give exactly the same results for positive numbers, this can't be it.

Let's look at negative numbers:

Aha. These formulas stop being completely equivalent for negative numbers:

But the difference is 1, how can this create such a dramatic effect? Let's dig deeper.
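The original formula isn't shown in the post, but assuming it was the arithmetically equivalent multiply-and-divide version, the two differ only in how they round negative results: C's `/` truncates toward zero, while an arithmetic right shift floors toward minus infinity. A minimal sketch of that assumption:

```c
/* Hypothetical reconstruction of the original formula (the post does not
   show it): integer division in C truncates toward zero. */
int filter_div(int time_diff, int new_diff) {
    return (time_diff * 7 + new_diff) / 8;
}

/* The new formula: an arithmetic right shift floors toward minus infinity.
   (Shifting negative ints is implementation-defined in standard C, but
   mainstream compilers generate an arithmetic shift.) */
int filter_shift(int time_diff, int new_diff) {
    return ((time_diff << 3) - time_diff + new_diff) >> 3;
}

/* Positive inputs agree:  filter_div(5, 8) == filter_shift(5, 8) == 5
   Negative inputs differ: filter_div(-1, 0) == 0, filter_shift(-1, 0) == -1 */
```

For positive values both round downward, so they match exactly; for a negative result that doesn't divide evenly, the shifted version lands one lower than the divided one.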

Plotting both functions side by side shows that when the value drops, the original function (red) has a dead zone, but the new one (yellow) does not:

What if the value grows? Then the picture flips:

Now the original doesn't have the dead zone and the new one does. So switching to the new function flips the asymmetric dead zone for negative numbers. So what?

Turns out this aligns the dead zones for positive and negative numbers. This means that nodes above network time will not move in one direction on slight changes, and nodes below network time will not move in that same direction either! This creates a resistance wall to which the value "sticks", greatly reducing variation.

Added clock drift + Windows Time correction into time sync simulator and did some tests.

One of the things it revealed is that using the median is a poor choice. It should provide a robust result, and most of the time it does, but it has a nasty property: if the central node drifts a little, but not enough to change its position in the median buffer, it can drag the whole network with it!

Starting to think that using a single time value won't cut it due to conflicting requirements: for convergence you need to arrive at a single value, as close between nodes as possible, but to anchor it to real time and prevent drifting you need to spread the votes apart so they pull in different directions.

Will probably have to split the algorithm into two parts: one for convergence and one for anchoring. This will require sending some additional data with each time vote. Will also probably replace median with interquartile mean or something similar...

The reason I spend so much time on time sync is because it's one of the two absolutely crucial parts of the system. The other is tx consensus.

Having <1 sec confirmations is impossible without tight time synchronization.

Median is not a very good choice for time sync because when the values converge closely, small random variations make it less stable.

The most dramatic case is when the central node drifts a little, but not enough to change its position in the median array: it then can drag the whole network with it since everybody uses its value as the new network time.

What we want is something like a mean; something that integrates over a large number of votes. But the mean has its own problem: outliers (either honest or malicious) can skew the network time and/or make it fluctuate significantly.

It would be nice to have something that behaves like a median when the values are far apart, but like a mean when they converge close together.

Enter the "Mean Median". I haven't read about it before, so I will assume I just invented it.

Here's how it should be calculated:

1. Find the median of the data set.
2. Cap all the values at some distance from it (in the example above it's +-200).
3. Calculate the mean.

You can immediately see that when all the values are outside of the core range it behaves exactly like the median. But once more and more values start to converge and fall within the core range, it starts behaving like the mean.

The core range should be big enough to cover any random variations and measurement errors, but no bigger.
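The three steps above can be sketched in C as follows. This is an illustrative sketch, not the actual implementation; `mean_median` and `core_range` are names I made up:

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for ints. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* "Mean median" sketch: find the median, cap every value at
   median +/- core_range, then average the capped values. */
double mean_median(const int *values, size_t n, int core_range) {
    int *sorted = malloc(n * sizeof *sorted);
    memcpy(sorted, values, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_int);

    /* Median: middle element, or average of the two middle ones. */
    double median = (n % 2) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    free(sorted);

    double lo = median - core_range, hi = median + core_range;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double v = values[i];
        if (v < lo) v = lo;   /* cap outliers below the core range */
        if (v > hi) v = hi;   /* cap outliers above the core range */
        sum += v;
    }
    return sum / n;
}
```

With all values far outside the core range the caps collapse symmetrically around the median, so the result equals the median; with all values inside it, the caps never trigger and the result is the plain mean.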

There is a similar technique called "winsorizing". The problem is that it uses arbitrary criteria (like Nth percentile) to cap the values.

When the user or Windows Time adjusts the node's clock, the time can jump suddenly and mess up our tight synchronization.

That's why we need to correctly handle time jumps.

On Windows there is the WM_TIMECHANGE message precisely for that, but we can't use it because it would require the core DLL to create a hidden window just to listen to Windows events and I don't think that's something a library should do. It's also less portable.

So we will do it manually: we will measure the time since the last call to our main time function and compare it to the time returned.

If the time returned is more than 100 ms in the past or grew by more than 100 ms AND this growth is twice as big as what our time measurement returned, we consider it to be a time jump and calculate the correction value.

Once a time jump is detected, we need to adjust our time_diff variable (the difference between local time and network time), as well as all the time votes we received, because we store them relative to our local time.
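The detection described above could look roughly like this. Everything here is a hypothetical sketch: the struct, the `check_time_jump` name, and the assumption that the caller supplies both the wall-clock reading and a steady reading (e.g. derived from QueryPerformanceCounter) in milliseconds:

```c
#include <stdint.h>

#define JUMP_THRESHOLD_MS 100  /* the 100 ms limits from the text */

typedef struct {
    int64_t last_wall;   /* wall-clock time at the previous call (ms) */
    int64_t last_mono;   /* monotonic time at the previous call (ms) */
    int64_t time_diff;   /* local time minus network time (ms)       */
} time_state;

/* Returns the detected jump in ms (0 if none) and folds the correction
   into time_diff so that wall_time + time_diff stays continuous.
   A real implementation would also shift the stored time votes by the
   same amount, since they are kept relative to local time. */
int64_t check_time_jump(time_state *s, int64_t wall_now, int64_t mono_now) {
    int64_t wall_elapsed = wall_now - s->last_wall;
    int64_t mono_elapsed = mono_now - s->last_mono;
    int64_t jump = 0;

    /* Jump if time went backwards by >100 ms, or grew by >100 ms while
       growing at least twice as much as the monotonic clock measured. */
    if (wall_elapsed < -JUMP_THRESHOLD_MS ||
        (wall_elapsed > JUMP_THRESHOLD_MS &&
         wall_elapsed > 2 * mono_elapsed)) {
        jump = wall_elapsed - mono_elapsed;
        s->time_diff -= jump;
    }
    s->last_wall = wall_now;
    s->last_mono = mono_now;
    return jump;
}
```

A normal tick (wall and monotonic elapsed times roughly equal) leaves time_diff untouched; a sudden clock adjustment produces a nonzero jump that is absorbed into time_diff, so the network time the rest of the program sees never moves.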

In simulation it works perfectly. The rest of the program is completely oblivious that something has changed and the time values remain consistent.