I'm working on optimizing code for a C8051F120 that needs to run at an
extremely fast clip. The following few lines are repeated over and over,
as fast as possible; the looping itself is a straightforward DJNZ. The
bulk of the CPU time is spent on the lines below, which are repeated
(cut and paste, but with the binary value changed) 8 times per loop.

At row1 the next set of those 5 lines executes. Essentially, this takes
the data byte at @DPTR, compares it to the current value of rLEVEL, and
sets a bit in mDATAOUT if the data is greater. It does this for the
sequential bytes at DPTR, one byte per bit position in the mask
(#00000001b through #10000000b).
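The actual listing didn't make it into the post; based on the
description above, each of the eight pasted copies presumably looks
something like this (my reconstruction only — note it still has the
SUBB carry-in issue pointed out in the replies below, since there is
no CLR C before the subtract):

    movx a, @dptr        ; fetch data byte from XRAM
    subb a, rLEVEL       ; borrow: C = 1 if data < rLEVEL (SUBB also subtracts C in!)
    cpl  c               ; C = 1 if data >= rLEVEL
    mov  mDATAOUT.0, c   ; set the output bit when data is greater
    inc  dptr            ; advance to the next data byte
    ; ...the next pasted copy uses mDATAOUT.1, then .2, and so on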

Can anyone see a way to optimize some cycles out of this process?
For reference, this chip is doing video output. Even single-cycle
optimizations are a big win at the rate this block is being iterated.

Note, the SUBB you are using depends on the state of Carry (it subtracts
C as a borrow), so your present code has some LSB strangeness :^)
(You could instead use an ADD instruction and then complement mDATAOUT
after you have processed all 8 bits.)
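A sketch of that ADD idea (hedged — rNEGLEVEL is a name I'm inventing
here for the two's complement of rLEVEL, computed once before the
loop). ADD ignores the incoming carry, so no CLR C is needed:

    ; once, before the loop: rNEGLEVEL = -rLEVEL (two's complement)
    movx a, @dptr        ; fetch data byte
    add  a, rNEGLEVEL    ; C = 1 if data >= rLEVEL (for rLEVEL > 0)
    mov  mDATAOUT.0, c   ; carry-in doesn't disturb ADD
    inc  dptr            ; advance to next data byte

Watch the edge case: if rLEVEL is 0, its two's complement is also 0 and
the carry never sets. Depending on which operand you negate, the bit
sense may come out inverted, which is where the one-shot complement of
mDATAOUT at the end comes in.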

You could also use CJNE for the comparison - it sets carry as well, but
is non-destructive, so you don't need to reload A from XRAM for each
test:

movx a, @dptr        ; fetch the data byte once
cjne a, ar0, $+3     ; C = 1 if A < R0; falls through either way
mov  mDataOut.0, c   ; store the comparison result
cjne a, ...          ; next test, same pattern
mov  ..., c

Like Frieder's approach, this one also resets the output bit in case
the data value is lower than the threshold. If you want latching
operation, you need to use conditional jumps again or consider the
previous state:
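For instance, a latching variant might look like this (sketch only —
next0 is a label I'm making up; remember CJNE leaves C = 1 when A is
below the compared value):

    movx a, @dptr        ; fetch data byte
    cjne a, ar0, $+3     ; compare; C = 1 if A < R0
    jc   next0           ; below threshold: leave the bit as it was
    setb mDataOut.0      ; at/above threshold: latch the bit on
next0:
    ; ...same pattern for the remaining seven bits

This costs a conditional jump per bit, so it's slower than the straight
mov bit,c form - the price of latching.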

I forgot to mention: the polarity of the output bits is opposite to that
of your code. You'd have to negate the data before the comparison or CPL
the output bits afterwards (if you can't switch the polarity otherwise).
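If mDATAOUT is a directly addressable byte, the complement after all
eight bits can be a single instruction rather than eight per-bit CPLs,
something like:

    xrl  mDATAOUT, #0FFh  ; invert all 8 output bits in one go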