Meanwhile I managed to generate a 70 kHz PWM signal using MOSI (SPI2), but there is currently an occasional glitch in the signal, and I am trying to figure out where it is coming from. In theory it should not be there.

Just as a note, on the F1 I was using 8 bits of resolution, with the PWM running at a maximum of about 132 kHz (3 x 44.1 kHz). I couldn't hear any noise, and adding an RC filter didn't change the sound at all.
In my latest version I was also using DMA triggered by a timer, with the DMA ISR reloading the buffer from the SD card, so the CPU load from playing the samples was minimal. Most of the time went to SD card access, but all in all I think it took less than 20% CPU on the F1 at 72 MHz to read the file and reformat the samples (2 channels, 8 bits, from whatever the format was in the file).

One main difference I see from what you are doing is that you are using the HardwareTimer class; after a few revisions I decided to skip it and just use the libmaple timer functions. The other is that I use the repeat counter to get the 3x frequency. You should look at that: it's a feature in the timer DMA mode in which, rather than sending a DMA request each time the timer reloads, you can set a number of repetitions.
Since you are only using 8 bits of resolution, as I was, you can run the timer much faster and use the repetition value to output the same sample N times. At 16 kHz, with a base frequency of 72 MHz, you could run the timer 17.5 times faster and still get 8 bits of resolution in the PWM:
(72000000 / 16000)/17.5 = 257.1
The repetition value has to be an integer, so you could set it to 18, 17, or another value, and calculate the maximum amplitude value you can use (ARR of 250 at 18 repeats). I think you get the idea.
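
To make the arithmetic above easy to try for other repeat counts, here is a minimal sketch. The function names are made up for illustration and are not part of libmaple or HardwareTimer; it only does the integer math and assumes the sample values must not exceed the PWM period.

```cpp
#include <cassert>
#include <cstdint>

// Timer ticks per PWM period when every sample is output `repeats` times.
// Integer division truncates, since the timer registers only hold integers.
uint32_t pwmPeriodTicks(uint32_t timerClockHz, uint32_t sampleRateHz, uint32_t repeats) {
    assert(sampleRateHz > 0 && repeats > 0);
    return (timerClockHz / sampleRateHz) / repeats;
}

// Largest sample value usable with that period (assumption: full 8-bit
// range if the period allows it, otherwise limited to the period itself).
uint32_t maxSampleValue(uint32_t periodTicks) {
    return periodTicks >= 255 ? 255 : periodTicks;
}
```

With a 72 MHz clock and 16 kHz samples, 17 repeats leave 264 ticks per period (full 8-bit range), while 18 repeats leave 250 ticks, so the samples would have to be limited to 250.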

Victor, indeed, it seems a good idea to use the DMA burst transfer feature of the timers.
I had read about it before but couldn't imagine what it was good for. Now it makes sense.

If I set the prescaler to an integer fraction of its current value (basically divide it by an integer), I can multiply the PWM frequency while keeping the resolution. Clever.

I try to use the HardwareTimer functions as much as possible because they seem to offer a wide range of setup possibilities.
Now that Roger has included the DMA enable/disable functions, it became even more powerful.

As you can see, I can set up the timer parameters with only 3 function calls, which is very convenient.
The function naming is not always self-explanatory, but if you know what they do then it is OK. For example, setOverflow() sets the reload value (ARR register).

I can only use integer values for the timer registers. For prescaler = 17 I get a reload value of 264 and an effective playback frequency of 16042 Hz (0.26% deviation).
If I set the prescaler to 18, I need to limit the sample values to max. 250 in software.
So I will try to set the DMA burst length to 17 and see what happens. OK, for that I need to use the libmaple timer functions, but I may issue a PR for Roger to add a corresponding new function to the HardwareTimer class.
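
As a quick check of the numbers above, the effective playback rate and its deviation from the 16 kHz target can be computed like this. This is only a sketch of the arithmetic; the function names are hypothetical and it assumes the rate is simply clock / (factor x reload).

```cpp
#include <cassert>
#include <cstdint>

// Effective sample rate for a given integer factor (prescaler or repeat
// count) and reload value, assuming rate = clock / (factor * reload).
double effectiveRateHz(uint32_t timerClockHz, uint32_t factor, uint32_t reload) {
    assert(factor > 0 && reload > 0);
    return static_cast<double>(timerClockHz) / (static_cast<double>(factor) * reload);
}

// Relative deviation from the target rate, in percent.
double deviationPercent(double actualHz, double targetHz) {
    return (actualHz - targetHz) / targetHz * 100.0;
}
```

For 72 MHz, factor 17 and reload 264 this gives about 16042.8 Hz, a little under 0.27% above 16 kHz; factor 18 with reload 250 gives exactly 16000 Hz.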

Currently I have a simple piezo buzzer (without electronics) connected.
Yes, the sound is relatively quiet, although the audio wave has 90% peak-to-peak amplitude.
But I cannot explain the spikes on the falling edges. They probably have to do with using the speaker with the capacitor.
I would recommend using the speaker either directly (for speakers with 8 Ohm or more) or in series with a resistor.
I also added an RC filter to feed a small amplifier, but as I have no speaker I cannot use it. The audio signal after the RC filter (2k2 & 150nF) is very smooth, though.
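
For reference, the corner frequency of such a filter can be estimated as below, assuming it is a simple first-order RC low-pass (the exact circuit is not shown here, so treat this as a rough sketch; the function name is made up).

```cpp
#include <cassert>

// -3 dB cutoff of a first-order RC low-pass: f = 1 / (2 * pi * R * C).
double rcCutoffHz(double resistanceOhm, double capacitanceFarad) {
    assert(resistanceOhm > 0.0 && capacitanceFarad > 0.0);
    const double pi = 3.14159265358979323846;
    return 1.0 / (2.0 * pi * resistanceOhm * capacitanceFarad);
}
```

For 2.2 kOhm and 150 nF this comes out at roughly 482 Hz, far below the PWM frequency, which would fit the smooth output.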

Victor,
As far as I understand, the DMA burst transfer of the timer cannot be used to repeatedly output the same value to the same timer register; it sends a burst of values to a specific number of consecutive timer registers.
So if I want to update CCR1 and CCR2 within the same DMA transfer, I can set the burst length to 2 and get a stereo PWM signal on the CC1 and CC2 outputs, with the samples read in one shot (burst) by the DMA from consecutive memory locations.
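
For a burst of 2, the memory buffer would need the left and right samples interleaved so that each burst picks up one value for CCR1 and one for CCR2. A minimal sketch of building such a buffer follows; the helper name and the 16-bit element width are assumptions, and the timer/DMA setup itself is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave two 8-bit channels as L0 R0 L1 R1 ... so that consecutive
// memory locations hold the values for CCR1 and CCR2 of the same frame.
// Assumption: 16-bit elements, to match a half-word DMA transfer size.
std::vector<uint16_t> interleaveStereo(const std::vector<uint8_t>& left,
                                       const std::vector<uint8_t>& right) {
    const std::size_t frames = std::min(left.size(), right.size());
    std::vector<uint16_t> buffer;
    buffer.reserve(frames * 2);
    for (std::size_t i = 0; i < frames; ++i) {
        buffer.push_back(left[i]);   // goes to CCR1
        buffer.push_back(right[i]);  // goes to CCR2
    }
    return buffer;
}
```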

I think the only way to repeatedly send the same value to CCR1 is to multiply it in software. Set the DMA buffer size to (let's say) 17 and fill it with the same sample value, with the DMA triggered by the timer running at 17x the sample frequency (e.g. 16 kHz * 17). This DMA buffer then has to be updated with the new sample value on every sample period (e.g. at 16 kHz).
Or maybe using only a 4x multiplier, transform a linear buffer of
abcde...
to
aaaabbbbccccddddeeee...
Is this what you did (with 3x multiplier)?
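
A minimal sketch of that buffer expansion, just to illustrate the idea (the function name is hypothetical, and on the MCU you would probably fill a fixed DMA buffer in place rather than allocate a vector):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Repeat every input sample n times: abc -> aaaabbbbcccc for n = 4.
std::vector<uint8_t> repeatSamples(const std::vector<uint8_t>& in, std::size_t n) {
    std::vector<uint8_t> out;
    out.reserve(in.size() * n);
    for (uint8_t sample : in) {
        out.insert(out.end(), n, sample);  // append n copies of this sample
    }
    return out;
}
```

The price is n times the buffer memory and n times as many DMA transfers per sample.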