Activity streams in stravadata

stravadata now provides data on activity streams, which are “the raw measurement sequences that define” Strava activities (source).
For example, here are some of the streams from when I ran Round the Bays in February:

library(dplyr)
library(stravadata)

rtb_streams <- activities %>%
  filter(name == 'Round the Bays' & grepl('2020', start_time)) %>%
  select(id) %>%
  left_join(streams) %>%
  select(distance, time, speed, hr, cadence)

rtb_streams

The distance and time columns report cumulative distance travelled and time elapsed, while the other three columns provide (unevenly spaced) time series of performance indicators.
I smooth these series by computing means over rolling 10-observation windows, and plot the smoothed series together in the chart below.
The chart shows the gradual increase in my speed and cadence—and, consequently, heart rate—throughout the race, and my final sprint near the finish line.
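
A minimal sketch of the smoothing step on toy data. It assumes a trailing window, since the window alignment is not stated above; `roll_mean` and `toy_streams` are hypothetical names, not part of stravadata.

```r
library(dplyr)

# Hypothetical helper: trailing rolling mean over k observations.
# The first k - 1 values are NA because the window is incomplete there.
roll_mean <- function(x, k = 10) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Toy stand-in for rtb_streams (values are made up for illustration).
toy_streams <- tibble(
  time  = cumsum(c(1, 2, 1, 3, 1, 1, 2, 1, 1, 1, 2, 1)),
  speed = seq(3, by = 0.1, length.out = 12),
  hr    = 150:161
)

# Smooth each performance series with the same 10-observation window.
smoothed <- toy_streams %>%
  mutate(across(c(speed, hr), roll_mean))
```

Note that `stats::filter` is called with its namespace because dplyr masks `filter`.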

As another example, on Tuesday I ran from the Wellington CBD to the Brooklyn wind turbine and back.
That run included 447 metres of total elevation gain.
streams disaggregates this total into a sequence of altitude values, allowing me to plot the activity’s elevation profile:
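
As a rough sketch of how an altitude sequence relates to a total, the example below sums the positive altitude changes in a toy activity and plots its profile. The column name `altitude` and the data are assumptions, and Strava may smooth or correct altitudes before totalling them, so this sum need not match the reported elevation gain.

```r
# Toy stand-in for one activity's stream (values are made up).
toy_run <- data.frame(
  distance = c(0, 500, 1000, 1500, 2000, 2500),  # cumulative metres
  altitude = c(10, 40, 35, 90, 60, 12)           # metres above sea level
)

# Keep only the upward changes between consecutive observations.
climbs <- pmax(diff(toy_run$altitude), 0)
total_gain <- sum(climbs)

# Elevation profile: altitude against cumulative distance.
plot(
  toy_run$distance / 1000, toy_run$altitude,
  type = "l",
  xlab = "Distance (km)",
  ylab = "Altitude (m)"
)
```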

Having disaggregated stream data makes it possible to determine how Strava computes aggregate activity-level features.
For example, suppose I want to reconstruct the mean_hr column of activities using the hr column of streams.
The naive approach is to group streams by id and compute within-activity means.
However, this approach may generate biased estimates of mean_hr because the observations in streams are unevenly spaced with respect to time.
For example, suppose I stop to recover at the end of a short sprint or uphill climb. My measured heart rate spike (which, in my experience, tends to lag the corresponding effort spike) will then be concentrated within a single observation, and subsequent observations will have relatively low heart rates.
Because the naive mean gives every observation the same weight regardless of how much time it covers, this will bias naive estimates downwards.

I can correct for this potential bias in the naive estimator by weighting each observation by the time elapsed since the previous observation, that is, by how much it adds to my total time.
Similarly, Strava may estimate mean heart rates based on moving time only, which I can replicate by weighting observations in streams by the boolean values in its moving column when computing within-activity means.

I compute the naive mean_hr estimates, and the estimates based on total and moving times, as follows:
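
One way to compute these three estimates is sketched below on toy data. It assumes that `time` holds cumulative seconds and that each observation is weighted by the time elapsed since the previous one, so the first observation in each activity gets zero weight. `toy_streams` stands in for streams, and the output column names are hypothetical.

```r
library(dplyr)

# Toy stand-in for streams: one activity with a 10-second stop (made-up values).
toy_streams <- tibble(
  id     = 1,
  time   = c(0, 2, 4, 14, 15, 16),
  hr     = c(150, 160, 170, 190, 140, 130),
  moving = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

estimates <- toy_streams %>%
  group_by(id) %>%
  # Seconds added to the total time by each observation.
  mutate(dt = time - lag(time, default = first(time))) %>%
  summarise(
    mean_hr_naive  = mean(hr),                      # equal weights
    mean_hr_total  = weighted.mean(hr, dt),         # weights: elapsed time
    mean_hr_moving = weighted.mean(hr, dt * moving) # weights: moving time only
  )
```

In this toy activity the heart rate spike sits in the long observation recorded while stopped, so the naive mean falls below the total-time mean, and the moving-time mean drops the spike altogether.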

Computing mean heart rates based on moving time recovers all of the 341 non-missing mean_hr values in my copy of activities.
In contrast, computing mean heart rates based on total time recovers only 55% of these values.
The total-time estimates get within one beat per minute of the “true” value about twice as often as the naive estimates, which appear to be biased downwards (possibly due to the phenomenon described above).
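
Comparisons like these can be summarised with a few shares and a mean signed error. The sketch below uses made-up numbers rather than my data, and the tolerance used to decide whether a value is "recovered" is an assumption.

```r
# Toy values only: hypothetical reported means and one set of estimates.
comparison <- data.frame(
  mean_hr  = c(150.2, 161.0, 148.7, 155.5),
  estimate = c(150.2, 160.4, 151.0, 155.5)
)

abs_error <- abs(comparison$estimate - comparison$mean_hr)

# Share of activities whose reported mean is recovered,
# up to a small numerical tolerance.
share_recovered <- mean(abs_error < 1e-6)

# Share of activities within one beat per minute of the reported mean.
share_within_1bpm <- mean(abs_error <= 1)

# Mean signed error: negative values would indicate downward bias.
mean_bias <- mean(comparison$estimate - comparison$mean_hr)
```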