Switch StartupTimeline from PRTime to mozilla::TimeStamp

Categories

(Toolkit :: Startup and Profile System, defect)

The Mozilla Toolkit is a set of APIs, built on top of Gecko, which provide advanced services to XUL applications. These services include Profile Management, Chrome Registration, Browsing History, Extension and Theme Management, Application Update Service, and Safe Mode. (More info)

I've got a WIP patch I started developing while working on bug 689256 for this but I'm still facing a couple of issues that are not trivial to solve. The root of these issues is that mozilla::TimeStamp design enforces (for good reason) that its values are opaque and all taken via a monotonic timer (CLOCK_MONOTONIC on Linux & its POSIX-compatible friends). Some of the samples in StartupTimeline however are not recorded directly but generated from other sources instead:
- The PROCESS_CREATION timestamp is currently calculated on Linux by using the |starttime| field in /proc/self/stat, I have some doubts that the clock source used to calculate it is compatible with clock_gettime(CLOCK_MONOTONIC).
- If the process is restarted then PR_Now() is called and the value it returns is stored in an environment variable which is then read back, parsed and then used to populate PROCESS_CREATION.
- Finally the MAIN timestamp is converted back in seconds for crash detection.
In addition to this, JavaScript consumers also pose a problem. nsIAppStartup exposes the startup information as dates to JavaScript and its consumers obviously rely on this. The problem is that nsAppStartup uses the StartupTimeline to generate these values and obviously there's no reliable way I know of to convert from clock_gettime(CLOCK_MONOTONIC) to dates.
An overall solution would probably call for all these values to be switched to durations instead of absolute values but this would be quite intrusive.

Yeah, I tried looking at this too, but quickly ran into the non-opaqueness of PRTime being used in several interesting ways (0, negative error values, etc.).
(In reply to Gabriele Svelto [:gsvelto] from comment #1)
> I've got a WIP patch I started developing while working on bug 689256 for
> this but I'm still facing a couple of issues that are not trivial to solve.
> The root of these issues is that mozilla::TimeStamp design enforces (for
> good reason) that its values are opaque and all taken via a monotonic timer
> (CLOCK_MONOTONIC on Linux & its POSIX-compatible friends). Some of the
> samples in StartupTimeline however are not recorded directly but generated
> from other sources instead:
>
> - If the process is restarted then PR_Now() is called and the value it
> returns is stored in an environment variable which is then read back, parsed
> and then used to populate PROCESS_CREATION.
I think you can generate a TimeStamp from the environment variable like so:
TimeStamp() + TimeDuration::FromMilliseconds(...);
> In addition to this, JavaScript consumers also pose a problem. nsIAppStartup
> exposes the startup information as dates to JavaScript and its consumers
> obviously rely on this. The problem is that nsAppStartup uses the
> StartupTimeline to generate these values and obviously there's no reliable
> way I know of to convert from clock_gettime(CLOCK_MONOTONIC) to dates.
What's wrong with converting TimeStamps to milliseconds to pass into JS?

(In reply to Nathan Froyd (:froydnj) from comment #2)
> I think you can generate a TimeStamp from the environment variable like so:
>
> TimeStamp() + TimeDuration::FromMilliseconds(...);
I'll try, thanks for the tip.
> What's wrong with converting TimeStamps to milliseconds to pass into JS?
On Linux TimeStamps are taken using clock_gettime(CLOCK_MONOTONIC) which returns timestamps taken from an abstract and unknown point in time. There is thus no reliable way I know of to turn these samples into absolute times, it's meant to be used only for measuring intervals not absolute values.

I gave a long hard look at all the nsIAppStartup consumers as well as those directly using the StartupTimeline and all seem only interested in intervals (starting from process creation time). So my plan to fix this bug would be the following:
- Change StartupTimeline to record only timestamps taken with mozilla::TimeStamp
- Modify C/C++ consumers trying to compare the output with PR_Now() to also use mozilla::TimeStamp
- Modify GetAppStartupInfo() to return intervals from process creation time instead of absolute points in time
- Make GetAppStartupInfo() add an uptime field to the returned JS object for consumers who are interested in the time elapsed since creation (like nsBrowserGlue.js)
- Modify all JS consumers to use the intervals directly instead of trying to "create them" by substracting the values from Date.now() or Date.now().getTime()
There are still two issues to address: the process creation timestamp cannot be obtained directly via mozilla::TimeStamp and when restarts happen it needs to be serialized into an environment variable and then read back.
To address these two issues I'm tempted to add a specialized class for it (ProcessCreationTimeStamp) which would inherit from TimeStamp, be comparable with it and completely opaque to the user. All the logic to get/set the actual process creation timestamp and sanitize the value would then be sealed there so nobody is tempted to mix it with PR_Now() or Date().now() again.
Does this sound like a reasonable approach? I should have a WIP patch ready by tomorrow to foster the discussion.

This WIP patch adds a ProcessCreation() static method to the TimeStamp class that returns an opaque TimeStamp representing the process creation time. The code has been mostly culled from nsAppStartup.cpp with the appropriate modifications to make it work in this context. Next will be a patch that converts the nsIAppStartup to return intervals instead of absolute vales and finally a third one which switches StartupTimeline to using TimeStamps by leveraging the previous two (and gets rid of the now useless code in nsAppStartup.cpp).

This patch actually switches StartupTimeline to using TimeStamp and patches up nsAppStartup to cope with it. I've found a less intrusive way of fixing it up without resorting to changing nsIAppStartup.getStartupInfo() semantics and all its many users.
Among those there's TelemetryTimestamps.jsm which also uses |Date().now()| to create timestamps; we should probably file a follow-up bug to fix that one too because it will also be susceptible to time-jumps and negative intervals.

This is the final patch which re-enabled XRE_StartupTimelineRecord() and adjusts it's callers to submit timestamps taken with the appropriate timers. I've currently fixed only APKOpen.cpp and I'm now testing the resulting functionality. I'll fix-up nsBrowserApp.cpp next, test it too and submit the complete patch.

The times from nsIAppStartup feed into many systems used to crunch performance numbers, metrics, trends, etc. Could you please clarify whether this patch will change the semantic meaning of those numbers?

(In reply to Gregory Szorc [:gps] from comment #12)
> The times from nsIAppStartup feed into many systems used to crunch
> performance numbers, metrics, trends, etc. Could you please clarify whether
> this patch will change the semantic meaning of those numbers?
In the end I've opted for a more conservative approach so the semantics will be the same, the numbers provided by nsIAppStartup are still valid JavaScript date objects.
What changed is how those numbers are computed which is now done by subtracting the intervals obtained by sampling events with monotonic timers from the current date. This means that the dates themselves might not reflect the actual time when the events happened because that would be subject to adjustments (NTP, DST, etc...) but the intervals between the dates will be always consistent and reflect accurately the actual time that has elapsed between the different events.
To give a practical example if event A happened at 14:59 and event B happened at 15:02 but in the meantime the time was dialed back by an hour (because of DST for example) with the old system you would get:
A - 14:59
B - 14:02
With my patches:
A - 13:59
B - 14:02
So essentially the times reported by nsIAppStartup will be relative to the moment when getStartupInfo() is called.
In addition to this my patches make sure that the process creation time and main time are always returned by getStartupInfo() and are always consistent. If an error is detected at any stage the telemetry is updated to record it. All together these changes should provide consistent times for analysis but also fix the oranges that we're seeing because of timing inconsistencies.

Refreshed patch fixing nsAppStartup::TrackStartupCrashEnd() which was computing incorrectly the "toolkit.startup.last_success" preference value. Also made the process creation timestamp validation more defensive to help in tests that do not set a main timestamp value properly.

The patches are in good state now, I'm passing mochitests correctly on all platforms but I still have to iron out some xpcshell failures on Windows & MacOS X. I suspect they might be caused by an incorrect calculation of the process creation timestamp so I'll have to double-check them and that might take some time as I haven't setup a Windows build yet (and I don't have access to a recent-enough Mac).
Nevertheless I think the patch-set should be good enough for an initial review though I'm not yet sure who might be a good candidate for reviewing it.

So this is how it looks now:
https://tbpl.mozilla.org/?tree=Try&rev=5f17f8df8939
The test is "very" green and seems devoid of almost all common intermittent tests caused by timing issues but I've still got two important issues to iron out:
- xpcshell test test_TelemetryTimestamps.js which is using telemetry, apparently one of the extra fields that the tests adds to the telemetry data doesn't show up anymore later when it's checked. Curiously this is happening on Windows and Mac but not on Linux, I've got a Windows build working yesterday so I'll check it there directly.
- mochitest-browser browser_crash_detection.js which is failing on MacOS 10.6 and Win8. The test might be failing because of my changes but I'm not 100% sure because it looks a bit fragile. What it does is check the value of the toolkit.startup.last_success preference (which is expressed in seconds since Epoch) and compare it with the last StartupTimeline::MAIN sample rounded in seconds. The reason for this is that "toolkit.startup.last_success" should be set shortly after a successful startup so the two events should be happening less than a second away from each other. The problem is that even on my desktop machine I've measured ~700ms between the two events so if the machine running the test is a bit slow the test is bound to fail anyway. It might be possible that my patch is making it worse however so I'll double check and see if I can prevent the issue.

I've now integrated the code to sanitize the process creation time, previously it could still be incoherent with regular TimeStamp measurements thus breaking the abstraction of a fully monotonic timer.

Here's a full description of the patches to simplify reviewing them:
- The first patch adds functionality to compute the process creation timestamp to the TimeStamp class. This functionality is taken from nsAppStartup.cpp and adapted; it also includes functionality to properly detect application restarts. Since process creation time cannot be measured directly the computed value is sanitized before being returned to the caller and if an error is detected it is notified so that a consumer can report it to telemetry if need be. Overall this tries to encapsulate this functionality in the TimeStamp classes without breaking their abstraction of always providing monotonic and comparable timestamps.
- The second patch switches StartupTimeline to use the TimeStamp class and adjusts its use in nsAppStartup. The startup info will still be returned in milliseconds since epoch but is now calculated on the fly relative to the current time thus always providing consistent intervals between the various events. JS consumers don't need to be changed as the interface to nsIAppStartup has remained the same. The startup information now always contains some events (such as process creation or main) even if an error was detected; an error is logged through telemetry in these cases and an incorrect but coherent placeholder value is used instead of the actual timestamp. Generally speaking this patch should quiet a few intermittent failures we had because of incoherent or missing events in the values returned by GetStartupInfo().
- The third patch adjusts the StartupTimeline consumers that cannot use the TimeStamp class directly. The timestamps in those consumers are taken using the same timers as the TimeStamp class and fed into XRE_StartupTimelineRecord(). It's implementation is a friend of the TimeStamp class and can thus synthesize timestamps from these values; however to ensure consistency the synthesized values are sanitized and discarded if found wrong (telemetry errors are also logged in the process).

Comment on attachment 729508[details][diff][review]
[PATCH 1/3 v2] Add functionality to create TimeStamp instances representing the process creation time
Review of attachment 729508[details][diff][review]:
-----------------------------------------------------------------
I'm happy with this, but I'd like to see the refreshed patch. Nothing major, just some style and documentation bits.
::: xpcom/ds/TimeStamp.h
@@ +237,5 @@
> + * the @a aTelemetryError parameter will be set to true, the returned
> + * timestamp however will still be valid though inaccurate.
> + *
> + * @param aTelemetryError true if an inconsistency was detected in the
> + * process creation time
Nit: "Set to true if an inconsistency was detected..."
@@ +241,5 @@
> + * process creation time
> + * @returns A timestamp representing the time when the process was created,
> + * this timestamp is always valid even when errors are reported
> + */
> + static TimeStamp ProcessCreation(bool &aTelemetryError);
Nit: style places the ampersand with the type here |bool& aTelemetryError|.
I think you really want to call this aIsInconsistent or similar. The documentation and implementations need to be rewritten accordingly, of course.
::: xpcom/ds/TimeStamp_darwin.cpp
@@ +22,5 @@
> #include <time.h>
>
> +#define KINFO_PROC struct kinfo_proc
> +#define KP_START_SEC kp_proc.p_un.__p_starttime.tv_sec
> +#define KP_START_USEC kp_proc.p_un.__p_starttime.tv_usec
Nit: probably don't need these now that we're in a Darwin-specific source file.
@@ +31,5 @@
> +
> +// Holds the first timestamp taken with this class and is used to sanitize
> +// timestamps synthesized instead of obtained via monotonic timers such as the
> +// process creation timestamp
> +static mozilla::TimeStamp sFirstTimeStamp;
I think moving the definition of this into TimeStamp.h, along with the comment, and then just providing definitions (without comments) in the source files would be better.
@@ +200,4 @@
> }
> +
> +TimeStamp
> +TimeStamp::ProcessCreation(bool &aTelemetryError)
It is too bad that we have to duplicate this function and RecordProcessRestart across all three implementations... :(

Comment on attachment 729516[details][diff][review]
[PATCH 3/3 v2] Make XRE_StartupTimelineRecord() generate TimeStamps and modify its callers to use the appropriate timers
Review of attachment 729516[details][diff][review]:
-----------------------------------------------------------------
No comments here. I would like tellrob to look at the Windows bits in nsBrowserApp.cpp and StartupTimeline.cpp to make sure they're sane, though. They *look* sane to my eyes, but you never can tell with Windows. ;)

I'll post updated patches with the changes you suggested very soon. In the meantime I tried them on my Windows build and unfortunately they always give me telemetry errors for the PROCESS_CREATION and START events; I'll have to do some debugging there to figure out what's wrong with my changes.

Quick update, the Windows code is borked because I hadn't realized that GetTickCount*() returns the time in milliseconds elapsed since boot and not since the epoch. It's result cannot thus be mixed easily with the process creation time returned by GetProcessTimes(); I'll have to find another way around this particular issue.

More testing showed me that I'm having similar issues with process creation time on more than Windows. I tried the patches on NetBSD and the process creation time is also wrong for similar reason (clock_gettime(CLOCK_MONOTONIC) basically returns the uptime while the process creation time is an absolute timestamp). I've already applied all the changes in comment #25 and comment #26 but before posting the updated patches I'll make sure this issue is fixed for good. The START timestamps obtained via the third patch are looking good on Windows too however so at least that part seems to be sound.

This revised patch contains all the changes from comment #25 and more.
First of all the way the process creation timestamp is computed has been changed on all platforms but Linux (which was the only one actually working reliably). In a nutshell the process creation time is now always computed by obtaining the process uptime using a timer compatible with the one used to obtain the process creation timestamp and then subtracting this uptime from the monotonic timer used by the TimeStamp class. I've tested the resulting Firefox build on Linux and Windows and the timings I get are sound. I've also tested the code snippet doing the computation on an old Mac and a NetBSD box I had around and obtained similarly sound results; I couldn't test the full Firefox build on those machines as they're too old to build it there.
The second very important change is that the process creation timestamp is now computed only once when requested for the first time and then the result is cached. This is necessary to ensure consistency across calls; since every call to TimeStamp::ProcessCreation() would recompute the timestamp it could return slightly different timestamps and this problem was already present in the previous patch.

This is a refreshed patch with all the changes from comment #26 added. In my testing the full patch set now returns proper results for all events with the process creation time being always consistent with other timings (still the error checking and reporting is still in place just in case).
The only event that turns out wrong from time to time is START. Being very close to the process creation time it can end up appearing as having been taken just before it; in this case an error is logged with the telemetry and things go on as usual.

First of all thank you for the quick reviews Nathan! I'll add the other changes you're suggested and I think we should be good to go but see below.
(In reply to Nathan Froyd (:froydnj) from comment #34)
> If I understand correctly, this series of patches *should* ensure we never
> hit a case when Get(ev) < Get(PROCESS_CREATION). However, I think we should
> "trust, but verify" and continue to accumulate into this histogram. If
> Telemetry data shows that we're not hitting that case, well and good; if it
> shows we're hitting that case, then we should attempt to understand why.
Right, that sounds like a safer choice. Using the TimeStamp class should never allow measurements to be wrong but it's also true that we could screw up something in its implementation at a later stage and not having the check wouldn't allow us to catch the issue. I'll add it back.
As a very last thing I did some fairly widespread testing of the patches and I'm still hitting consistently a problem in the Win8 runs on try, see this log:
https://tbpl.mozilla.org/?tree=Try&rev=1fc171fae3d4
As I mentioned in comment #19 the browser_crash_detection.js detection checks the value held by the "toolkit.startup.last_success" preference happens within the same second of the MAIN startup event. In this case the test was off by 1s causing the fail but I've seen up to 30s of difference there in a debug run. When I saw this previously on MacOS 10.6 it was a real issue because the MAIN time was off but in this case I cannot tell since the Win7 and WinXP runs were both green. I suspect that one of the system calls I'm using might be behaving differently on Windows 8 but not having a machine to test on I cannot verify it.
Anyway I want to fix this issue before committing the changes because they were meant to reduce the timing-related failures we're seeing and adding a new one wouldn't be acceptable.

Interesting that you should mention time weirdness on Windows 8. We were running into similar issues in bug 831404. I /think/ the Windows SDK we are using for Windows 8 is currently exclusive to those builds. Semantic change or bug in the SDK, perhaps?

Added back the checks for spurious timings & recording the errors in the telemetry, used ComputeAbsoluteTimestamp() and did the other adjustments as per comment 34. Also fixed a couple of places where the indentation was incorrect.

I did a couple of full try-runs with this patch-set and they both came up green save for other intermittent errors:
https://tbpl.mozilla.org/?tree=Try&rev=e93372d70a88https://tbpl.mozilla.org/?tree=Try&rev=6ad089833620
However I launched these runs over the weekend when infrastructure load was very low, I fear this might hide timing-related issues. For this purpose I re-run only the tests that were failing on Windows 8:
https://tbpl.mozilla.org/?tree=Try&rev=5c4461163695https://tbpl.mozilla.org/?tree=Try&rev=2d4ef027a8a2
In both cases the browser_crash_detection.js test failed with "toolkit.startup.last_success" being one second off from the MAIN timestamp. I fear that this is down to the test being too strict; on my machine the two timestamps are off by 700ms so going over 1s when the machine is under load is not unlikely. Nathan if you think my analysis is correct should I patch the test to add a little bit more leeway between the two timestamps? Like checking that they are within 2s from each other instead of 1?

(In reply to Gregory Szorc [:gps] from comment #37)
> Interesting that you should mention time weirdness on Windows 8. We were
> running into similar issues in bug 831404. I /think/ the Windows SDK we are
> using for Windows 8 is currently exclusive to those builds. Semantic change
> or bug in the SDK, perhaps?
I re-did the tests (see above) and I don't see the timestamp being wildly off like before, just a little bit, so maybe there were still issues in my patch-set. My patches use GetTickCount[64](), GetProcessInformation() and QueryPerformanceCounter(); they've been both around for some time so I doubt their behavior might have changed. If it did I'd assume that the last one is more likely to behave differently than the others but it's hard to prove it without running a couple of Windows boxes side by side and checking for results.

(In reply to Gabriele Svelto [:gsvelto] from comment #40)
> However I launched these runs over the weekend when infrastructure load was
> very low, I fear this might hide timing-related issues. For this purpose I
> re-run only the tests that were failing on Windows 8:
>
> https://tbpl.mozilla.org/?tree=Try&rev=5c4461163695
> https://tbpl.mozilla.org/?tree=Try&rev=2d4ef027a8a2
>
> In both cases the browser_crash_detection.js test failed with
> "toolkit.startup.last_success" being one second off from the MAIN timestamp.
> I fear that this is down to the test being too strict; on my machine the two
> timestamps are off by 700ms so going over 1s when the machine is under load
> is not unlikely. Nathan if you think my analysis is correct should I patch
> the test to add a little bit more leeway between the two timestamps? Like
> checking that they are within 2s from each other instead of 1?
Well, the test looks like it's making sure that we stored the MAIN timestamp in the toolkit.startup.last_success pref. And that's perfectly valid; with PRTimes, we're always using the same timestamp for setting the pref and for computing the bits in nsIAppStartup.
But now we're not; for setting the pref we're now computing:
PR_Now() - (TimeStamp::Now() - MAIN)
and for when we getStartupInfo:
PR_Now() - (TimeStamp::Now() - MAIN)
and we're hoping that the (PR_Now() - TimeStamp::Now()) quantity is going to be fixed in both instances...but these tests are telling us it's not. And that strikes me as bad.
I'm not sure what the right fix is. One could just change ComputeAbsoluteTimeStamp to something like:
uint64_t
ComputeAbsoluteTimestamp(TimeStamp stamp)
{
static PRTime pt = PR_Now();
static TimeStamp ts = TimeStamp::Now();
return pt - (ts - stamp).ToMilliseconds();
}
so that we're always computing with fixed values in the above equations. I'm not sure that'd be preferred to:
uint64_t
ComputeAbsoluteTimestamp(TimeStamp stamp)
{
return (stamp - TimeStamp(0)).ToMilliseconds();
}
though. Would have to think about it some more.

(In reply to Nathan Froyd (:froydnj) from comment #42)
> Well, the test looks like it's making sure that we stored the MAIN timestamp
> in the toolkit.startup.last_success pref. And that's perfectly valid; with
> PRTimes, we're always using the same timestamp for setting the pref and for
> computing the bits in nsIAppStartup.
Right, somehow I was under the impression that we were taking another timestamp for that pref and not re-using the MAIN timestamp. I must have misread the code the first time I read it and that assumption stuck while I was doing the testing.
> and we're hoping that the (PR_Now() - TimeStamp::Now()) quantity is going to
> be fixed in both instances...but these tests are telling us it's not. And
> that strikes me as bad.
>
> I'm not sure what the right fix is. One could just change
> ComputeAbsoluteTimeStamp to something like:
>
> uint64_t
> ComputeAbsoluteTimestamp(TimeStamp stamp)
> {
> static PRTime pt = PR_Now();
> static TimeStamp ts = TimeStamp::Now();
> return pt - (ts - stamp).ToMilliseconds();
> }
>
> so that we're always computing with fixed values in the above equations.
I don't think that would work. If the system clock is adjusted after your first call to PR_Now() the results you'll be returning from that calculation will always be wrong.
> I'm not sure that'd be preferred to:
>
> uint64_t
> ComputeAbsoluteTimestamp(TimeStamp stamp)
> {
> return (stamp - TimeStamp(0)).ToMilliseconds();
> }
>
> though.
This wouldn't work as there's no way to turn that number into a time since epoch which is required to pass it back to JS code as a Date object.
> Would have to think about it some more.
In the meantime I'll verify the stability of those time differences across the platforms I can test on. I suspect that we're seeing problems on Windows because it's the only platform where we're using two distinct clocks to compute timestamps. I also want to verify if/how bug 791420 is affecting the checks between these two clocks.

Comment on attachment 729516[details][diff][review]
[PATCH 3/3 v2] Make XRE_StartupTimelineRecord() generate TimeStamps and modify its callers to use the appropriate timers
Review of attachment 729516[details][diff][review]:
-----------------------------------------------------------------
You're not adjusting the XRE_StartupTimelineRecord signature in nsXULAppAPI.h. Note it would be better to ensure that passing a PRTime *will* fail (third party code, like comm-central, could be doing that)
::: browser/app/nsBrowserApp.cpp
@@ +411,5 @@
> + return 0;
> + }
> +
> + uint64_t baseNs = (uint64_t)ts.tv_sec * 1000000000;
> + return baseNs + (uint64_t)ts.tv_nsec;
It would be better if we could somehow use Timestamp::Now. It would probably be fine to do so in a followup.

(In reply to Mike Hommey [:glandium] from comment #44)
> You're not adjusting the XRE_StartupTimelineRecord signature in
> nsXULAppAPI.h. Note it would be better to ensure that passing a PRTime
> *will* fail (third party code, like comm-central, could be doing that)
The reason why I did not change the function signature is that I did not want to break an external interface. Since the function semantics changed however, I put those additional checks in place to ensure that whatever external measurement is passed in is at least consistent with the measurements taken internally.
Besides to make it fully incompatible with PRTime (which is just an integer) we would need to replace the argument type with something completely different such as a structure/class (TimeStamp would obviously be a perfect fit if we could expose it).
> It would be better if we could somehow use Timestamp::Now. It would probably
> be fine to do so in a followup.
Yes, I would say that the biggest issue here is that we don't control how external code takes samples to be fed to XRE_StartupTimelineRecord(); if we did it would be easy to switch it from PR_Time to TimeStamp and remove the code I had to duplicate in nsBrowserApp.cpp and APKOpen.cpp. APKOpen.cpp however is a good example of why this is so hard: the first timestamp it takes happens before libxul.so has even been loaded.

(In reply to Gabriele Svelto [:gsvelto] from comment #45)
> (In reply to Mike Hommey [:glandium] from comment #44)
> > You're not adjusting the XRE_StartupTimelineRecord signature in
> > nsXULAppAPI.h. Note it would be better to ensure that passing a PRTime
> > *will* fail (third party code, like comm-central, could be doing that)
>
> The reason why I did not change the function signature is that I did not
> want to break an external interface. Since the function semantics changed
> however, I put those additional checks in place to ensure that whatever
> external measurement is passed in is at least consistent with the
> measurements taken internally.
>
> Besides to make it fully incompatible with PRTime (which is just an integer)
> we would need to replace the argument type with something completely
> different such as a structure/class (TimeStamp would obviously be a perfect
> fit if we could expose it).
Adding a dummy argument is another option. Changing the function name is yet another. What you suggest would be nice, though (and could be possible, see below)
> > It would be better if we could somehow use Timestamp::Now. It would probably
> > be fine to do so in a followup.
>
> Yes, I would say that the biggest issue here is that we don't control how
> external code takes samples to be fed to XRE_StartupTimelineRecord(); if we
> did it would be easy to switch it from PR_Time to TimeStamp and remove the
> code I had to duplicate in nsBrowserApp.cpp and APKOpen.cpp. APKOpen.cpp
> however is a good example of why this is so hard: the first timestamp it
> takes happens before libxul.so has even been loaded.
nsBrowserApp.cpp too. But it's not impossible to deal with those. There are essentially two approaches. One is to make TimeStamp independent of anything else (which it almost is) and allow to #include its implementation, another is to make TimeStamp part of MFBT (which also requires to make it independent of anything else). In the former case there would be one copy of the TimeStamp binary code (but not source) in each program or library that uses it, and in the latter case, there would be one copy of the TimeStamp binary code in mfbt, wherever it ends up on the given platform (which varies).

(In reply to Mike Hommey [:glandium] from comment #46)
> Adding a dummy argument is another option. Changing the function name is yet
> another. What you suggest would be nice, though (and could be possible, see
> below)
> [...]
> nsBrowserApp.cpp too. But it's not impossible to deal with those. There are
> essentially two approaches. One is to make TimeStamp independent of anything
> else (which it almost is) and allow to #include its implementation, another
> is to make TimeStamp part of MFBT (which also requires to make it
> independent of anything else). In the former case there would be one copy of
> the TimeStamp binary code (but not source) in each program or library that
> uses it, and in the latter case, there would be one copy of the TimeStamp
> binary code in mfbt, wherever it ends up on the given platform (which
> varies).
That sounds like a very good way to solve this issue; of the two solutions you propose I would be more keen into moving the TimeStamp class into MFBT. Considering it has three distinct implementations already lumping it all in an header file feels a bit unpleasant to me.
I'd like to address the problem highlighted by Nathan in comment 42 first however so shall I open a follow up bug for taking care of this?
Another issue that might be worth a follow-up is TelemetryTimestamps.jsm which currently uses Date.now() to gather timestamps; those are then mixed with the results of getStartupInfo() in TelemetryPing::getSimpleMeasurements() in TelemetryPing.js and could thus lead to inconsistent results when the system clock is adjusted. I assume that ideally we would like all those measurements to come from the same source.

I did some tests regarding the problem highlight in comment 42 and the results I got were not very good. On Linux the variation in the value of PR_Now() - TimeStamp::Now() across calls is usually very very small (in the microseconds range) so the resulting telemetry data (rounded in ms) shows up always the same.
On Windows 7 however the variation is much higher, often ~1ms and getting up to ~15ms when mixing timestamps taken with QueryPerformanceCounter() and GetTickCount[64](). The net result is that getStartupInfo() will often return different results across calls for the various events which besides breaking browser_crash_detection.js is not a very desirable behavior.
I can't think of a decent way of fixing this though. The only sensible way seems the first proposal from Nathan (i.e. fixing the PR_Now()/TimeStamp::Now() frame of reference); we should obviously accept that the values it returns will not be adjusted if the system clock is adjusted. However maybe that's preferable to having "wobbly" values when computing absolute timestamps.
On the other hand fixing the issue seen in the test alone is not hard; it would be enough not to use StartupTimeline but just a specific timestamp taken with PR_Now() only for that purpose.

I've modified the patch so that ComputeAbsoluteTimestamp() now uses a fixed interval between PR_Now() and TimeStamp::Now() as you had suggested in comment 42. This seems to work well across different platforms and at the current stage seems the safer choice to be able to land this.
The only side-effect this has is that the timestamps returned from getStartupInfo() won't be adjusted if the system clock is adjusted but this seems to me to be the lesser evil compared to the wobbly timestamps I was getting before. I'm testing this across different platforms and for now it's looking good.

Last-minute fix. While testing on Windows I noticed that during one of the previous changes I started logging telemetry errors incorrectly. Basically a telemetry error was being logged every time we were missing a timestamp in the startup timeline. This patch fixes this issue and logs an error only if the timestamp is present but is incoherent with the process creation time.
I've manually tested this patch-set on both Windows and Linux and the numbers look good, they're stable and consistent. I'll find someone from #qa to check Mac builds too.

I've only been casually following the comments in this bug. But, it now appears that we're using a monotonic clock under the hood and are deriving an absolute time by adding the monotonic elapsed interval to the initially-obtained absolute time. This will yield accurate timings - yay!
Since we're using monotonic clocks for everything now, shouldn't we just abandon the idea of reporting in absolute times and just report offsets instead? At most the only absolute time we need to report is the initial time.
Also, depending on the lifespan of the timeline, what happens when the machine is put to sleep? Does the monotonic clock stop when powered down? Does this vary between operating systems? CPUs? Motherboards? The reason I ask is Firefox Health Report currently reports all times as absolute (Date.now()) and Metrics says that like 14% of the clients have at least one inaccurate absolute time. We're investigating using monotonic clocks for as much as possible. However, we'd potentially be collecting samples across sleeps/hibernations and we'd like to know the guarantees for Gecko's monotonic clock implementation. I understand this bug may not be the best place to discuss FHR issues. But, this bug is currently the best source of clocks and Gecko discussion I've seen :)

(In reply to Gregory Szorc [:gps] from comment #52)
> I've only been casually following the comments in this bug. But, it now
> appears that we're using a monotonic clock under the hood and are deriving
> an absolute time by adding the monotonic elapsed interval to the
> initially-obtained absolute time. This will yield accurate timings - yay!
Yes, they should be :) Or at least they should be for everything that goes in the startup timeline. Unfortunately this is not a silver bullet because JS code will still be using Date(s) and Date.now() to take timestamps.
> Since we're using monotonic clocks for everything now, shouldn't we just
> abandon the idea of reporting in absolute times and just report offsets
> instead? At most the only absolute time we need to report is the initial
> time.
That was one of the ideas I had flirted with when I started working on this bug but it meant changing a couple of interfaces so maybe such work would deserve its own bug.
> Also, depending on the lifespan of the timeline, what happens when the
> machine is put to sleep? Does the monotonic clock stop when powered down?
> Does this vary between operating systems?
It depends on the OS AFAIK. On Linux we use clock_gettime(CLOCK_MONTONIC) which should not report time spent in standby/hibernate IIRC. On Mac it's mach_absolute_time() which should exhibit the same behavior. On Windows we're not so lucky. We use QueryPerformanceCounter() preferably, which AFAIK has weird behavior when putting the machine to sleep in the sense that it might keep counting correctly or be wildly off (in which case we fall back to GetTickCount[64]()). In both cases however we will take sleep time into account and I'm not aware of any Windows timer which can be used to work around this problem. But then again I'm not a Windows expert so somebody with more experience might have a better idea.

Comment on attachment 733355[details][diff][review]
[PATCH 2/3 v6] Make StartupTimeline use TimeStamp instead of PRTime and adjust nsAppStartup to cope with it
Review of attachment 733355[details][diff][review]:
-----------------------------------------------------------------
Thanks for working through this!
I think it would be worth addressing the comments Mike made about XRE_StartupTimelineRecord, but in a followup bug. Moving TimeStamp into MFBT sounds like the best way forward there.

(In reply to Nathan Froyd (:froydnj) from comment #55)
> Thanks for working through this!
And thanks for the speedy and detailed reviews! This is the try run for the final patch-set, I'll add the checkin-needed flags if it turns out nice and green:
https://tbpl.mozilla.org/?tree=Try&rev=3fb18a52037e> I think it would be worth addressing the comments Mike made about
> XRE_StartupTimelineRecord, but in a followup bug. Moving TimeStamp into
> MFBT sounds like the best way forward there.
Yes indeed, I'll add the follow-ups ASAP.

For this to break FHR would imply that nsIStartupInfo changed semantics (e.g. now reporting in microseconds instead of milliseconds). If nsIStartupInfo changed, this would also break a number of other components and I would recommend backout until nsIStartupInfo reverts to its old reporting unit.
But I thought this didn't change the units nsIStartupInfo reported - only how they are calculated (using a monotonic clock).

OK. Just picked up today's nightly. It looks like nsIStartupInfo is reporting backwards-incompatible values. IMO we need to back this out until all nsIStartupInfo consumers are updated to the new convention or until nsIStartupInfo reverts to its old behavior.

Ouch, I was pretty sure I was returning the measurements in milliseconds and I had double-checked by looking at the telemetry results in about:telemetry but apparently something slipped through the cracks. I'll try to figure out what happened ASAP.

(In reply to Gregory Szorc [:gps] from comment #62)
> But I thought this didn't change the units nsIStartupInfo reported - only
> how they are calculated (using a monotonic clock).
I've re-run my tests on both Linux (64-bit, debug build) and Windows 7 (32-bit, debug build) and on both the FHR and about:telemetry data is consistently reported in milliseconds. I need more data to reproduce the issue highlighted in comment 61.
(In reply to Mike Connor [:mconnor] from comment #70)
> Can we make sure there's a test that sanity-checks output to ensure this
> doesn't unintentionally change?
That wouldn't be easy to do, as with every timing check it might be tripped by valid but excessive values.

(In reply to Gabriele Svelto [:gsvelto] from comment #71)
> (In reply to Gregory Szorc [:gps] from comment #62)
> > But I thought this didn't change the units nsIStartupInfo reported - only
> > how they are calculated (using a monotonic clock).
>
> I've re-run my tests on both Linux (64-bit, debug build) and Windows 7
> (32-bit, debug build) and on both the FHR and about:telemetry data is
> consistently reported in milliseconds. I need more data to reproduce the
> issue highlighted in comment 61.
I'll try to obtain something for you.
> (In reply to Mike Connor [:mconnor] from comment #70)
> > Can we make sure there's a test that sanity-checks output to ensure this
> > doesn't unintentionally change?
>
> That wouldn't be easy to do, as with every timing check it might be tripped
> by valid but excessive values.
Well, the numbers we were seeing were off by several magnitudes. I think a simple test of "the browser didn't take more than 10 minutes to start" would be a sufficient sanity test without a realistic chance of tripping.

(In reply to Gregory Szorc [:gps] from comment #72)
> I'll try to obtain something for you.
Thanks!
> Well, the numbers we were seeing were off by several magnitudes. I think a
> simple test of "the browser didn't take more than 10 minutes to start" would
> be a sufficient sanity test without a realistic chance of tripping.
Yes, I was thinking of something along the lines; hopefully with a length long enough it shouldn't trip under pathological conditions (such as the machine being under heavy load or something). Shall I add a unit test for that purpose to the patch-set?

Definitely add a unit test! Otherwise we may run into the same issue later.
Also, if you have any context to add to bug 860930, please do that. AFAIK you are the person with the most context in your head on the issue ATM. The sheriffs are quite frustrated with all the Windows 8 failures and the sooner we can fix the underlying issue the better.

(In reply to Gregory Szorc [:gps] from comment #74)
> Definitely add a unit test! Otherwise we may run into the same issue later.
OK, I'll do that.
> Also, if you have any context to add to bug 860930, please do that. AFAIK
> you are the person with the most context in your head on the issue ATM. The
> sheriffs are quite frustrated with all the Windows 8 failures and the sooner
> we can fix the underlying issue the better.
I've got a hunch so I'll write that down in bug 860930. However without a Windows 8 machine to test I cannot confirm my suspicions; I'll be going to try and get a license via ServiceNow to dig further into these issues.

I wouldn't be too concerned about avoiding pathological conditions. If startup is more than two minutes on a test profile, I am more than okay with the idea that we'll fail a unit test, as it'll probably highlight something seriously wrong with a commit.

So, uh, I applied the 3 patches that were backed out from m-c and did a build on OS X. I couldn't reproduce any weirdness. I even did a clobber build to be sure it wasn't the build system being silly.
I then downloaded the OS X Nightly from https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2013-04-09-03-08-55-mozilla-central/. I was also unable to reproduce any weirdness with that build.
Thinking it may not impact OS X, I also downloaded a Windows Nightly from that same directory. I was also unable to see any weirdness.
In comment #63 I said I was able to see weirdness. However, that was on a different Windows machine (my home desktop). I guess I'll have to try the build on that machine and see if it's behaving differently from the Windows machine in the office.
Hmm...

(In reply to Gregory Szorc [:gps] from comment #78)
> I downloaded the aforementioned Nightly on my home desktop and was unable to
> reproduce the issue. Now I'm really confused. I'm pretty sure I and the
> others reporting issues were not hallucinating.
I believe you :-) I was wondering, do you remember exactly when it happened? Can we link it to some abnormal time behavior such as a jump caused by NTP, DST or something along the lines? The numbers looked off-the-scale by a margin large enough that they could have been caused by negative differences somewhere in the calculations.
BTW I got myself a copy of Windows 8, as soon as the Madrid workweek is over I'll install it on my laptop and try to figure out what's going on there.

I first noticed it on 4/9/13, the morning after this landed.
see: https://bugzilla.mozilla.org/show_bug.cgi?id=859863
I tested with archive builds back to 4/1/13 and the graph still has bogus times on the vertical axis, so even going back to pre-patch FHR is messed up.
Other 'test' profiles on my system seem to be showing 'reasonable' times on the vertical axis of the graph, and the install on my wife's userID on this same machine seems to be reporting reasonable times on the vertical axis.
I'm not sure if this is a profile corruption issue, or if the graph and data are downloaded from a server dedicated to FHR, or something written to the registry ?

Not only is the graph off, but so is total times I would guess.
My Vital Stats:
Vital Stats
version23.0a1
time open 1323402 min 1323402/60 (min/hr) = 22056.7/24 (hrs/day) = 919 days (2.5yrs)
last crash 2013-04-04
bookmarks
2.5 yrs is way before FHR was even thought of.... :)

When I started hearing people complain about things involving nsIStartupInfo being weird I immediately thought of this bug, upgraded to that day's Nightly (2013-04-09), restarted, and noticed that FHR's session recorder was recording absurd values. That is why I wrote comment #63.
FHR does periodically record bogus total session times due to system clock weirdness (it compares wall times instead of using a monotonic clock). I reckon that is why some people have seen and continue to see absurdly large total session times. However, the issue I saw when I upgraded to a build with this patch was FHR was recording the startup timeline (main, firstpaint, sessionrestored) values in unreasonable units (millions of milliseconds). That should never happen. Furthermore, the slow startup feature in Firefox was prompting people. This indicated a systemic failure in nsIAppStartup.

Let's just reland this and see what happens. Worst case Nightly is busted for a few days and it is backed out again. Best case is we get more eyeballs on the problem and fix the lingering issues being uplift.

(In reply to Gregory Szorc [:gps] from comment #85)
> Let's just reland this and see what happens. Worst case Nightly is busted
> for a few days and it is backed out again. Best case is we get more eyeballs
> on the problem and fix the lingering issues being uplift.
I was too busy with the Firefox OS bugs and I didn't have enough time to install Windows 8 and test it there. One of the patches doesn't apply anymore so I'll refresh it before asking for check-in again.

OK, let's try to re-land this as per comment 85; keep your fingers crossed.
This is a try-run of the refreshed patch-set which is all nice and green (even though it seems that the problems we ran into weren't tripping anything in the unit-tests; this reminds me that I should add some sanity check to weed out crazy values):
https://tbpl.mozilla.org/?tree=Try&rev=770278a80507
(And apologies for not being able to re-test the patch-set on Windows 8. I got hang of a copy but was too busy fixing B2G blockers and I didn't have enough time to install it on my laptop.)