I have come up with the following snippet by building upon the answers given to my StackOverflow question. Just trying to get some other eyeballs to review things so that they can point out any egregious mistakes that I probably made.

One area that was a bit tricky to me to get working is the bottom half of the readFileHeader label; there seemed to be some conflicting information about the cmov instruction and I still don't quite understand if I "should've" gone with an alternative that involves a jump or two. Sure, such a change is a micro-optimization but a huge part of this exercise to me is trying to understand the way different constructs affect the instruction pipeline and whatnot.

\$\begingroup\$What is the conflicting information about cmov? btw there cannot be information that unconditionally either recommends cmov or disrecommends it because it is situational.\$\endgroup\$
– haroldMar 4 at 0:19

\$\begingroup\$@harold Maybe my wording there was quite poor; I suppose a better way to put things is that I don't quite understand the conditions of when cmov should and should not be used. I also suppose it is worth noting that I originally choose to go with cmov because 1) it was easier for me to parse the code and 2) it seemed "bad" to jump "within a loop".\$\endgroup\$
– Kittoes0124Mar 4 at 0:20

This is totally subjective and dependent on your preferred coding style, so you're free to ignore it, but I like to write assembler directives in all uppercase to make them easier to distinguish from instruction opcodes and registers. So, I'd write PROC, END, TEXTEQU, ALIGN, and so on in all caps.

Presumably, these are intended to encapsulate the Windows x64 calling convention for argument passing. But if that's the case, they are misnamed. They're not variables at all—they're arguments. If you must define these, I suggest calling them arg*.

But I'd really suggest just not defining them at all. If you're going to program in assembly, you need to know your standard calling conventions like the back of your hand. Maintenance programmers who don't know them or can't recall them need to stop and look them up. They're not going to change, so there's little point in trying to hide them away behind constants. That just obfuscates the code, in my opinion. It also hides the fact that some uses of var* access memory (dereferencing an address to obtain a value), while others are just reading an enregistered value. There's a big difference between the two, both in semantics and performance, so good code should make that difference obvious to the reader.

Also, if you are going to keep them, your capitalization is inconsistent. Other constants (e.g., TRUE) are written in all-caps (which is a good convention). Why aren't these constants also written in all-caps?

Finally, if you're going to keep them, use them consistently. You forgot to use var0 when calling the ExitProcess function:

Although JE == JZ, and JNE == JNZ, prefer to use the mnemonic with the most appropriate semantic meaning. After a CMP instruction, when you're checking for equality, either JE or JNE makes the most sense. But after a TEST instruction, where you're ANDing a register with itself just to set the zero flag, a JZ or JNZ makes the most sense. For example, I would rewrite:

Instead of packing the information about how you arrived at a magic number in a comment, write the arithmetic out explicitly. The assembler will fold the operations at assembly-time, so there won't be any cost. It's just more self-documenting for people looking at the code. In particular, this instruction:

sub rsp, 1048h ; align with 16 while simultaneously making room on the stack for the "home space", some parameters, and a 4096 byte buffer

Some of your comments are a bit too verbose. This is subjective, and it's ironic for me to be the one saying this, because I tend to like to write long, descriptive comments, but you do have to maintain the right balance. Per-line comments that I write in "real" code are much shorter than the ones I regularly write in Stack Overflow and Code Golf answers, since those are intended more for expository purposes than real documentation. They're also intended to be understandable by people who aren't assembly-language programmers, but your assembly-language code shouldn't necessarily be targeting that same audience.

You know that the ExitProcess function should not ever return control to your code, but I (and compilers) like to assert that by placing a trap immediately after the call. A simple int 3 will do; it's a simple, one-byte opcode (0xCC) that causes either a break into the debugger (if one is attached) or a crash.

Danger, Will Robinson!

You are allocating more than 4K bytes on the stack, which may be larger than the size of a page. Given the way the virtual memory manager works, you need to touch every 4Kth byte in order to force the stack to grow to the requested size. Otherwise, you risk getting access violations from hitting an uncommitted page.

Microsoft's compiler automatically inserts stack-walking code that does this for you whenever you allocate more than 4K bytes of local variables inside of a function. This comes in the form of a call to the __chkstk function, which just reads every 4Kth byte from the previous stack top to the new stack top, ensuring that all of the necessary stack-reserved pages have actually been committed. (If there is no more memory available to commit the pages, then __chkstk fails.)

So, in this case, the compiler would generate prologue code like the following:

Peephole Optimization

Don't use LEA when MOV will do. The mnemonic suggests that LEA is the way to load an address, and indeed it will do that, but so will MOV, as long as you use the OFFSET operator. Substitute:

lea var0, filePath

with

mov var0, OFFSET filePath

Honestly, the biggest use of LEA in assembly code is as a fancy way to do general-purpose integer arithmetic on non-address values, since it can perform addition with multiple operands, add and scale (by limited powers of two), simulate a three-operand instruction, and not clobber the flags. You will need it for scaled loads of addresses, but again, there, you're using its fancy address-calculation machinery, not simply to load an offset.

Use the non-volatile registers (RBX, RBP, RDI, RSI, and R12 through R15) to store temporary values that you need to persist across function calls. The calling convention requires the contents of these registers to be persisted across calls, so anything you have in them will be safe. This abundance of registers on x64 allows you to avoid storing to the stack, and gain a bit more speed. So, replace:

mov var5, rax ; save a reference to the file handle for later (taking advantage of the unused parameter slot 5)

with something like:

mov r15, rax

there seemed to be some conflicting information about the cmov instruction and I still don't quite understand if I "should've" gone with an alternative that involves a jump or two.

I'm not sure what the conflicting information was, but I can make it pretty simple for you:

If the branch is predictable, then use an actual branch (jump). That will be faster (less overhead). Trust the branch predictor to do its job.

If the branch is not predictable (because you have completely random input, or because it oscillates back and forth), then it is probably better to write branchless code using something like CMOVcc or SETcc. This will avoid the possibility of branch misprediction, at the cost of executing additional code each time.

That's a good enough rule of thumb. Correctly-predicted branches are virtually free; mispredicted branches are extremely slow (relatively speaking). If you want more information, see this answer.

Also keep in mind, though, that it is generally pointless to optimize code that is not a performance hotspot (like an inner loop). The branching code is shorter and simpler to write, so that should be your first instinct.

Branching code is especially appropriate for error handling code like what you have here, since:

Error handling is (almost) never a performance hotspot, and

Errors should be rare occurrences, and thus the direction of execution in an error check will be correctly predicted by the CPU's built-in branch predictor.

That is similar to your CMOVcc code, except that it uses TESTcc instead. It elides one instruction, but does risk a performance decrease on some microarchitectures where partial registers are not renamed separately (because we're using DL and DH, the two byte-accessible portions of the EDX register). We could solve this problem by rewriting as:

This clobbers EAX (or, at least, the low 8 bytes of it), but we don't care at this point, because we've already gotten the information we need from the ReadFile function's return value. It still isn't extremely performant code, though. The SETcc instructions have a relatively high latency, and there is a long dependency chain in these instructions. Your original CMOVcc version isn't much better on either of those fronts. I personally find the SETcc version more idiomatic and therefore more readable, but not strongly so; choose whichever you like better.

Your revised CMOVcc version is better. Or, at least, it is shorter. Whether it's actually better is a matter for a profiler to decide. Shorter code is not necessarily "better" or faster.

\$\begingroup\$Dumb question but how does one include __chkstk? I'm using VS2017 (MASM) and Google Fu has failed me.\$\endgroup\$
– Kittoes0124Mar 9 at 13:35

1

\$\begingroup\$@Kittoes Not a dumb question. It isn't really designed to be called from your own code. It's an internal helper function used by Microsoft's compiler. You can find the implementation at "C:\Program Files (x86)\Microsoft Visual Studio <version>\VC\crt\src\intel\chkstk.asm". You can probably figure out a way to import it so it's callable from MASM with a declaration like EXTRN __chkstk:NEAR, but I've never done it before. See the documentation for a more detailed discussion of how it works, if you want to reimplement.\$\endgroup\$
– Cody GrayMar 9 at 19:51