This can be understood as an artifact of the instruction pipeline: your
x86 CPU likes to perform similar operations in a staggered manner, and
it does not like branches (jumps) because they break the flow.

Comparing the native code reveals that while f1 is jump-free, the if in f2 results in a jump (jae):

In my application the speed gain was more modest, but still
sizeable. Benchmarking a non-branching version of your code is
sometimes worth it, especially if the change is simple and both
branches of the conditional can be run error-free. If, for example, we
had to calculate

g(x) = x ≥ 0 ? √(x+2) : 1-x

then we could not use ifelse without restricting the domain: ifelse
evaluates both of its arguments before choosing one, so √(x+2) would
fail whenever x < -2.
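
The difference can be seen in a minimal sketch. The names `g_branch` and `g_ifelse` are made up for this illustration and do not appear elsewhere in the post; the only assumption is that `ifelse` behaves as an ordinary function whose arguments are all evaluated before the call.

```julia
# Branching version: only the selected branch is evaluated,
# so sqrt never sees a negative argument.
g_branch(x) = x ≥ 0 ? sqrt(x + 2) : 1 - x

# ifelse version: both sqrt(x + 2) and 1 - x are computed first,
# then one of the two results is selected.
g_ifelse(x) = ifelse(x ≥ 0, sqrt(x + 2), 1 - x)

println(g_branch(-3.0))   # second branch only: 1 - (-3.0)
println(g_ifelse(-1.0))   # sqrt(1.0) is computed, then discarded

# For x < -2 the discarded branch is the one that fails.
try
    g_ifelse(-3.0)
catch err
    println("g_ifelse(-3.0) failed with ", typeof(err))
end
```

On the domain x ≥ -2 the two definitions agree, so the `ifelse` form is only a candidate for benchmarking if inputs below -2 can be ruled out.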

Julia Base contains many optimizations like this: for a particularly
nice example see functions that use Base.null_safe_op.