A video that supplements this content can be found here: https://youtu.be/4PmjTFgEybI

The Immediate Issue

Those familiar with the ‘Immediate’ form of many ARM instructions may know that the 12 bits used to encode the immediate value aren’t as simple as they seem at first glance. For a point of reference, we will use the MOV instruction as an example, as in MOV r0, #1337.

The above MOV example is a great one to work with, because it doesn’t actually exist. Even so, your assembler probably wouldn’t give you an error; it would just quietly use the wide version of MOV instead (MOVW r0, #1337).

But don’t 12 bits give us 4,096 values? They could, but the 12-bit immediate field isn’t that flat. Again, this is probably not news to ARM veterans. The 12-bit field is divided into two parts: the first 4 bits specify how many times to rotate-right the last 8 bits. So if the first 4 bits specify no rotation at all, then the last 8 bits are interpreted literally. The 4-bit rotate field counts in multiples of two, meaning 0010 rotates right by four, and 0101 rotates right by ten. In this context, the rest of this post attempts to address the following type of question: “Is 2684354500 easily encodable; if not, what are the closest numbers surrounding it?”

Why would the architects complicate things like this? Well, this method does make it easy to load a wide range of larger numbers into a 32-bit register. Otherwise, it kind of sucks to only be able to load a 12-bit number into a 32-bit register. And we can’t directly encode a full 32-bit number into an instruction, because the instructions themselves are exactly 32 bits wide, meaning the immediate value would take up all of the space, leaving none for the instruction itself.

Let’s look at some 12-bit examples:

000000001010
This one is simple: the first four bits specify no rotation, and the last 8 bits equate to 10 (0xA), so that is the value that would be used.

001000001010
We still have 10 in our last 8 bits, but now our rotate-right value is 0010, which means that we will rotate 10 (as a 32-bit value) to the right by four places. So 0x0000000A becomes 0xA0000000, which is a much larger value in decimal (2,684,354,560).
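Mentally rotating these gets tedious, so here’s a small Python sketch (my own, not part of the post’s tooling) that decodes a 12-bit immediate exactly as described above:

```python
def decode_imm12(imm12):
    """Decode an ARM 12-bit immediate: top 4 bits give a rotate-right
    amount (in multiples of 2), low 8 bits give the base value."""
    rot = ((imm12 >> 8) & 0xF) * 2
    val = imm12 & 0xFF
    # rotate right within a 32-bit word
    return ((val >> rot) | (val << (32 - rot))) & 0xFFFFFFFF

print(hex(decode_imm12(0b000000001010)))  # 0xa
print(hex(decode_imm12(0b001000001010)))  # 0xa0000000
```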

This is why an instruction like ‘MOV r0, #2684354560’ is completely valid while ‘MOV r0, #2684354500’ is not. The distinction is unfortunately not super obvious to the naked eye.

Valid Numbers

Before digging into types of solutions to this issue, I would like to take a small detour through the range of possible numbers you can use, as the behavior here is incredibly interesting in my opinion. Thanks to the power of rotating bits around, you really can use numbers anywhere from zero to 4 billion and some change, with only 12 bits, or 4,096 possible encodings. Did I say 4,096 values? That isn’t true either! There are redundancies: we can rotate one 8-bit value by a certain amount and get the same result as rotating a different 8-bit value by a different amount (we will circle back to this soon). In reality, 25% of these encodings are redundant; there are only 3,073 unique values after eliminating the duplicates.
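That 3,073 figure is easy to check by brute force; a quick sketch of my own:

```python
# Brute-force every (rotate, value) pair and collect the distinct results.
def decode_imm12(imm12):
    rot = ((imm12 >> 8) & 0xF) * 2
    val = imm12 & 0xFF
    return ((val >> rot) | (val << (32 - rot))) & 0xFFFFFFFF

unique = {decode_imm12(i) for i in range(4096)}
print(len(unique))  # 3073 distinct values out of 4096 encodings
```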

We might also have some intuition that the spread of numbers isn’t linear. We know we can literally use values 0-255 with no issue (as we have an un-rotated version of the last 8 bits). Starting at 256 we start counting in fours: 256, 260, 264, etc. After 1,024 we start counting in 16’s, and so on. However, if graphed, the result isn’t as exponential as one would expect. There are clusters of numbers that increment by 1 for some stretches, conveniently around the halfway point as well (two billion and some change). I say conveniently because this allows us to capture some two’s complement negative numbers. All of the unsigned numbers 2147483648-2147483711 (0x80000000 – 0x8000003F) can be encoded without modification. So this gives us a run of the 64 most negative signed integers; not bad. Below is a graph of this spread:

Imm12 Encoding is Too High Level

Because redundant. Though it does turn out that there is corresponding assembly for these redundant forms that achieves the same thing (due to some syntactical sweetness). Let’s look at an example of MOVing 192 into r0. This can be done vanilla with just ‘MOV r0, #192’; the 12-bit encoded value would be 0x0c0. But there are other ways. We could encode 3 into the 8-bit part of the 12-bit encoding and ‘0xd‘ into the 4-bit ROR field, which effectively rotates the value 3 in a 32-bit field to the right by 26 places (0xd03). The result is 192. We could also rotate 12 to the right by 28 places (0xe0c). Finally, we could rotate 48 to the right by 30 places to get 192 (0xf30).

In the ARM manual, the syntax for this MOV instruction is: MOV{S}<c> <Rd>, #<const>
To break this down, the {S} is a 1-bit field that specifies whether we want to set any conditional flags. The <c> is a 4-bit field specifying under which condition to execute the MOV; if you don’t include this, the default is to execute under any condition. <Rd> is the destination register that you want your result stored in. Finally, #<const> is the number you want placed into Rd. What isn’t shown in this part of the manual (though I remember reading elsewhere in the manual that you can do this) is that you can also specify the ROR amount that you want. I would revise the syntax to MOV{S}<c> <Rd>, #<const>, ror_amount for my own personal use.
All of the below assembly instructions are valid and all achieve the same thing:
mov r0, #192 @e3a000c0
mov r0, #3, 26 @e3a00d03
mov r0, #12, 28 @e3a00e0c
mov r0, #48, 30 @e3a00f30
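All four encodings can be recovered by brute force as well; a sketch of my own, reusing the decode logic described earlier:

```python
# Find every 12-bit encoding that decodes to 192.
def decode_imm12(imm12):
    rot = ((imm12 >> 8) & 0xF) * 2
    val = imm12 & 0xFF
    return ((val >> rot) | (val << (32 - rot))) & 0xFFFFFFFF

encodings = [i for i in range(4096) if decode_imm12(i) == 192]
print([hex(e) for e in encodings])  # ['0xc0', '0xd03', '0xe0c', '0xf30']
```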

Solutions

So say you’re in a situation where you want to move a large immediate value into a register, but you don’t have a good idea of whether it is compatible with this format. There are plenty of unsatisfying answers to this question online, although I probably spent more time on this blog post than I did researching good solutions to this issue. I’ll talk about some of the different approaches I have seen, including my own approach and my own tool.

Visualize for Better Guesses

One of the first resources I came across was https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/. It’s a good post that explains the issue, most likely more elegantly than I have. It also has a cool little visualization tool that interactively shows the bits being rotated around. But it doesn’t quickly solve my problem of: “Is 2684354500 easily encodable; if not, what are the closest numbers surrounding it?”

Next up, you will likely find numerous scripts that take your number as input and tell you whether or not it is encodable. I feel like this kind of tool is a complete waste of my time. Why? I can just write ‘MOV r0, #2684354500’ in assembly, and my assembler will quickly tell me “Error: invalid constant (9fffffc4) after fixup” and at which line this was found. This kind of tool also doesn’t give me context as to which nearby numbers I CAN use.

LDR Method

There’s another approach that kind of side-steps my question altogether. What I like about it is that it is consistent, and you don’t have to worry about post-processing the number to get it to fit. The solution is to put the number in a .data section and ldr it. What isn’t immediately obvious about this solution, however, is that it takes up more space than it first appears.

The number in the data section itself takes up 32 bits. As far as the assembly goes, you need 2 instructions: one to load the pointer address (‘ldr r0, =pointer‘), and another to de-reference it (‘ldr r0, [r0]‘). That’s an additional 64 bits of data. However, there’s another 32 bits needed for something way less obvious. In the encoding of this ldr instruction, it always de-references; there’s no such thing as ldr just getting a pointer. In actuality, the assembler does some behind-the-scenes magic. At the end of your .text section, the address of ‘pointer’ (or whatever you name it) will be stored. The instruction ‘ldr r0, =pointer‘ replaces =pointer with that end-of-.text address, which contains the real pointer. In that context, ‘ldr r0, =pointer‘ is still dereferencing and getting a value; it’s just that this value is an address for the value you really want.
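A sketch of the resulting layout (my own illustration of what a GNU-style assembler does with the ‘ldr r0, =pointer’ pseudo-instruction; the label names and literal value are made up):

```armasm
    ldr r0, =pointer    @ 4 bytes: really ldr r0, [pc, #offset] into the literal pool
    ldr r0, [r0]        @ 4 bytes: de-reference to get the actual value

    @ emitted by the assembler at the end of the .text section:
    .word pointer       @ 4 bytes: literal pool entry holding pointer's address

.data
pointer:
    .word 0xDEADBEEF    @ 4 bytes: the value you actually wanted
```

Adding those four words up is where the 16-byte total below comes from.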

To conclude this method: it takes 16 bytes (the equivalent of 4 instructions) to load any arbitrary value into a register. The benefit of this method is that it is easy and consistent (in both the instructions used and the size taken). Though it consumes 16 bytes of storage, run-time only requires 2 instructions of execution (although they are memory operations).

Post Processing Method

This is my favorite method when it makes sense to use it. This method involves moving the closest valid number to the one you want to move into a register and then using an ADD or SUB to make up for the difference.

There are challenges to this, though. First, it’s not easy (without some kind of tool) to guess the closest encodable number to the one you want. Next, it is also possible that the difference you need to ADD or SUB is itself not encodable, meaning you may have to do more than one ADD or SUB. Finally, and related to the previous point, if you hit a point where you’re doing 4 or more ADD/SUB operations, you would be taking up more disk space than the LDR method. Additionally, it’s not only 4 instructions worth of storage, but 4 instructions executed too; however, these are all direct register operations (quicker).

Why do I prefer this method? Assuming I’m not going to have to deal with all of these annoying calculations myself, I like the elegance of it; subjective, I know. I like that all the information regarding the operation is right there in-line: I don’t have to refer to a .data section. It also usually takes up less space than the LDR method. The space savings may not be enough to justify it for some, but less space is less space. However, if I didn’t have some external tool to make this process easy, I would probably just be lazy and use the LDR method.

The Tool

Regarding the following question: “Is 2684354500 easily encodable; if not, what are the closest numbers surrounding it?” I wrote a tool to answer it and, additionally, suggest a solution for the Post Processing method. I was going to write this tool in a high-level scripting language like a bitch, but then decided to write it in ARM assembly.

The tool is called ImmSuggest; it takes one argument: the number you want to encode. If the number is encodable, it just gives an example instruction such as ‘mov r0, #200’ (assuming your number was 200).

If your number is not encodable, the tool will display your number surrounded by the next lowest and next highest numbers that are encodable. It then gives a series of instructions that would get your number into register r0 with MOV and ADD instructions. For example, say you wanted to encode 301 (not encodable); the output of the tool would look like this:
300 < 301 > 304
mov r0, #300
add r0, #1

300 is encodable, and so is 304. If those numbers are good enough for your use-case, you can just choose one of those instead. Otherwise, move 300 into your register and then add 1 to it.
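For reference, a rough Python sketch of what the tool does (my own approximation; ImmSuggest itself is ARM assembly, and the function names here are made up):

```python
import bisect

def decode_imm12(imm12):
    rot = ((imm12 >> 8) & 0xF) * 2
    val = imm12 & 0xFF
    return ((val >> rot) | (val << (32 - rot))) & 0xFFFFFFFF

VALID = sorted({decode_imm12(i) for i in range(4096)})  # all 3,073 values
VALID_SET = set(VALID)

def suggest(n):
    if n in VALID_SET:
        return 'mov r0, #%d' % n
    i = bisect.bisect_left(VALID, n)
    lo, hi = VALID[i - 1], VALID[i]
    print('%d < %d > %d' % (lo, n, hi))
    # note: the difference may itself not be encodable, in which case
    # more than one add would be needed; this sketch ignores that case
    return 'mov r0, #%d\nadd r0, #%d' % (lo, n - lo)

print(suggest(301))
```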

Note, this approach isn’t perfect at all, which is why it’s nice that the tool also displays the surrounding closest numbers for us. Going back to 2684354500: you may notice that it is actually only 60 away from the next highest encodable integer. This means that the following code would also work:
mov r0, #2684354560
sub r0, #60

For simplicity, I wanted to keep this to MOV and ADD instructions. I thought about emitting the most efficient combination of ADD and SUB instructions, but it weirdly got more complicated than I felt like dealing with, especially with the limitations I gave myself.

Stupid Coding Limitations

This is all self-imposed, hence “stupid.” This program is about 16k and runs pretty fast. I know, it would run fast in perl/python/ruby/etc. as well, as it’s not that complicated. I mostly chose the challenge of writing this in assembly because the tool itself is about assembly. I also tried to be pure and not rely on external libraries (notably libc).

Speed over Size

This is often a mutually exclusive trade-off, and this program is no exception. Where this really stands out is that I pre-generated a header file containing a sorted list of all 3,073 valid encodable integers. This is about 80% of the resulting program. The program could be significantly smaller if it dynamically generated all of these values and put them into memory; however, that would also add a lot of cycles executed every single time the program is run. With the pre-computed header file, this work never has to be done at run-time.

Re-invent All Of The Wheels

Even though I have been using the libc library from assembly quite a bit lately, I decided to throw away all of its usefulness and overhead and rely only on Linux syscall functionality to do my work. Just linking with gcc costs 3,000 instructions of execution, and then another 1,000 instructions for the first use of any function (your printf’s, malloc’s, isupper’s, etc.). Granted, either strategy will still result in this program executing in an unnoticeable fraction of a second; my version currently executes in anywhere from about 325 to 6,900 instructions (depending on how large the argument is, as this affects how complicated the result is).

In other words, on average, this program finishes before a gcc-linked executable would even be done loading its functions. If you use more than 4 libc functions (printf, malloc, getopt, etc.), your libc overhead alone already runs longer than even the worst-case run of my program. In a best-case run of my program, a gcc-linked program wouldn’t even have gotten to your part of the code (the main function) yet. Don’t get me wrong, I don’t hate libc, and I understand many of the benefits of dynamically linking shared libraries. But I also rage when I see shit like this: https://stackoverflow.com/questions/3233560/in-c-is-it-faster-to-use-the-standard-library-or-write-your-own-function (honest question, naive and misled text-book answers). So I guess what I’m trying to say is I wrote my program this way out of principle.

Throwing away libc brings challenges though, even for the simplest things. For example, I want to print the value of a register to stdout, but as an ASCII decimal number. printf would make that effortless, but I’m not using printf. One good way to get a decimal equivalent is to massage the data through many rounds of dividing by 10 and extracting the remainders. This is all well and good, but my Raspberry Pi doesn’t support the div instruction, so the division code actually has to be written as well. I cheated a little here by manually ‘statically’ including a version of divsi3 from libc (as it both does division and keeps the remainder). Though halfway through modifying divsi3 to my needs, I realized it worked on signed values instead of unsigned, and the udivsi3 function didn’t seem to capture the remainders of the divisions. So I just strategically commented out most of the rsb (negate) instructions from divsi3 and things worked out.
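The digit-extraction loop is easy to sketch in Python (my own sketch of the general shift-and-subtract technique, not a port of the modified divsi3):

```python
def udiv10(n):
    """Unsigned divide-by-10 via restoring shift-and-subtract long
    division -- the kind of loop a soft-division routine uses when
    there is no hardware div instruction."""
    q = r = 0
    for i in range(31, -1, -1):
        r = (r << 1) | ((n >> i) & 1)  # bring down the next bit
        q <<= 1
        if r >= 10:                    # does 10 'go into' the partial remainder?
            r -= 10
            q |= 1
    return q, r

def to_decimal(n):
    """Peel off digits with repeated divmod-by-10, as the post describes."""
    digits = ''
    while True:
        n, r = udiv10(n)
        digits = chr(ord('0') + r) + digits
        if n == 0:
            return digits

print(to_decimal(2684354560))  # '2684354560'
```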

Getting the argument from the command line was actually pretty easy, as arguments are on the stack. Converting the ASCII number into a register value takes more work than acquiring it from the stack.

Validation

Therez none. If you supply an integer too large to fit in 32 bits, you’ll get stupid answers back. If it’s not an integer, stupid crap back. If it’s not even a number, stupid crap. I also removed the divide-by-0 error checking in the soft division routine; I’m always dividing by 10. I hope that decision enrages at least one person.

The video version (for the illiterate) can be found at: https://youtu.be/QyjXBv3sqRY

I’m in the process of re-certifying for the GREM certification (GIAC Reverse Engineering Malware). Although I’m pretty good with assembly language in a handful of architectures (Motorola, x86, Propeller, and ARM), my skills are shit with Windows and its APIs. As far as GREM and static code analysis go, I still have a ways to go; a ‘not seeing the forest for the trees’ issue. I will still likely pass the certification like last time, because I understand most of the concepts in their compartmentalized pieces. My problem is some of the big-picture stuff (always has been). I joke about everything being too high level, and honestly, most of the time it really is a joke or an extreme exaggeration. But I sometimes do have a harder time comprehending an abstraction when it abstracts away how things actually work. For most people, it doesn’t matter how the technology works, so long as it does. However, as a hacker, I have technology ‘trust issues’; things don’t always ‘just work.’ And the abstraction likely won’t give you any hints as to why the thing failed; the answers are revealed at a lower layer.

Blah blah blah, I digress. I wanted to set out to learn many of these Windows APIs in a bit more detail. Reverse engineering usually teaches how to read the code, but my (and probably your) comprehension magnifies when we actually write code. So in this case, I wanted to set out and write a few very simple assembly programs that put the correct arguments on the stack and call a Windows API, just how I see this happening when debugging some malware, just how it is supposed to work. As a point of reference I am using the FLARE VM setup from FireEye. It comes with fasm, so that’s the assembler I will use (I don’t really have religious preferences with an assembler).

For APIs, the Windows way is a bit different than the Linux way. On Linux, generally, you put all of your arguments in registers and then do an int 0x80 (interrupt into the kernel). On Windows, with ‘stdcall’ functions, you push all of your arguments onto the stack and call the Windows API function by name (the corresponding addresses of these functions end up getting linked in). I’m not really opposed to this method; it allows for a large number of arguments by default, as it’s the stack, not a limited set of registers.

As I didn’t know the fasm ways of assembly, I looked to the Internet for some examples. I wanted to create a simple dialog box. I expected to see a simple assembly program with a .data section containing the strings, and then a .text (.code) section with some instructions pushing the arguments onto the stack, followed by a call to the API function. For pretty much every Google result, what I got back was a heavily abstracted version of how this is generally done, with the ironic bonus: NO ASSEMBLY INSTRUCTIONS!

Before I get to that, I will say that I eventually figured out the way to do this with real assembly language in the source file. And it was as straightforward as I would have expected it to be. For reference, here is a screenshot of the source program:

This is what it looks like in the x64dbg debugger:

Note that the assembly looks awfully similar to the source. This is no mistake. This is exactly what I’m going for here. Remembering that my goal is to try and understand what is actually going on with these API functions, this is the most comprehensible way to go about this. You’ll notice that all the arguments are on the stack and ready to go for when I’m about to call them. And it is extremely clear how they all got onto the stack (the 4 preceding push instructions).

Okay. Now let’s talk about the ‘no assembly required’ way that is recommended to write this. Because the source code is easier to read. Because it’s ‘cleaner code.’ Because assembly language is so ‘hard’ to write that you might as well write assembly programs that don’t use assembly instructions (then just give up and fucking use python). Anyway, here’s a screenshot of the ‘clean’ way to do this:

It is clearer to read. If there were no comments in my version, then the ‘invoke’ version would be much more obvious in its intentions. But now, here’s a screenshot of how dirty and incomprehensible this is in the debugger:

Before I start ranting and criticizing, I have to be fair and state that the examples I found on the Internet didn’t use a .data section and instead inlined the strings in the invoke statement (cleaner source code). This is the real cause of the messy disassembly. Had I used a .data section with this invoke command: ‘invoke MessageBox,HWND_DESKTOP,message,title1,MB_OKCANCEL’, it wouldn’t be so bad. I digress.

So note that even though the source code is ‘clean,’ what’s actually being ‘assembled’ (compiled, really) is anything but. As we are about to make the call, all the right arguments are on the stack. I see two of the original pushes needed for two of our arguments (push 1 and push 0). We also need two more arguments: pointers to our strings for the title of the window and the message in the window. How on earth did these get onto the stack, and what the fuck are these confusing instructions doing in our program? Do we really need ARPL, INSB, OUTSD, DAA, and IMUL instructions? Well, no, that’s not what is happening. What we are actually seeing is a disassembled representation of our strings.

See our first call to ‘syscalls.40201B’: it’s jumping past our first string. A call normally knows how to return to where it came from by pushing the address of the next instruction onto the stack. In this case, though, our program doesn’t intend to return there at all; it is using that pushed address as a side effect. That address really is the first byte of our string, so it serves as a pointer to it, and it is now on the stack, conveniently, as an argument. That call jumps us to another call that does the same thing: it skips over the next string that follows it, indirectly getting a pointer to it onto the stack. That second call instruction brings us all the way down to the ‘push 0’ instruction right before our API call to MessageBoxA. These abused CALL instructions are how the string arguments got onto the stack.

The end result is the same. For somebody who has to read or write the assembly source, using invoke is likely a better way to write and collaborate. However, nothing about it is actual assembly language; it abstracts it away. It’s not like this behavior is uncommon or indefensible; compilers do this kind of thing all the time, even when they aren’t optimizing much (and when they are optimizing, wow). Joking aside, using invoke is probably the way to go if you’re writing something more serious; although, at that point, why not just use C? Writing “assembly” in shortcuts and macros with no actual assembly instructions sounds a lot like a higher-level language (like C). This is why I always found HLA (High Level Assembly) so objectionable. Though to be clear, I respect the author of HLA, and he has done other really amazing work.

A lot of arguments about which way is better (with many things) come down to what you’re doing at the moment. In the use case from the paragraph above, invoke away. But to return to my use case: I’m trying to familiarize myself with some simple Windows API calls by playing with different arguments in assembly, calling them, and then watching them perform their actions in a debugger (as not all APIs do something visual; I might have to watch the stack, registers, and memory getting manipulated). Using invoke for this strategy makes the process all the more confusing.

All this said, you might be able to see why I have a little ways to go when it comes to fully reverse engineering Windows binaries (not to be confused with targeted reversing). I’m somewhat adequate at looking at particular APIs, pulling out IOCs from the artifacts they leave behind, and all the other ‘cheater’ dynamic forms of analysis. But if I ever want to see a bigger and fuller picture, I’m going to want to start writing the assembly that I’m reading and put bigger pieces of the puzzle together. At least, that’s the plan.

Looking at instruction encodings, ‘ROR r0, #0’ should be the same as ‘RRX r0, r0’.

Let’s first take a look at the encoding for the ROR instruction:

So Rm gets rotated imm5 places and gets stored into Rd

Now let’s look at the encoding for RRX:

Note that the encoding is identical to ROR, with the exception that the imm5 field is hardcoded to 0.

So if we were to write ‘ROR r0, #0’, we should expect this to disassemble as an RRX instruction; as we are merely mimicking these hardcoded zeros by providing ‘#0’ as the imm5 value:

Yep! Wait, wut? Very much no. Why is the encoding for this a MOV instruction?

Come on, assembler, why don’t you know that I wanted an RRX when properly writing ROR in such a way that should encode into RRX? You have no idea how many times I legitimately need ROR to become RRX instead of actually typing RRX in assembly (even if it’s zero times).

As a reminder from ARM Assembly is Too High Level: Moving by Shifting, the MOV instruction is just an LSL with its immediate value set to all 0’s. If we were to take the 5 (of the 8) zeros following Rd and make them something else, the above MOV instruction would become an LSL by that amount. After these 5 bits, the next 2 bits define what kind of shift or rotate (or MOV) we are doing. For example, MOV and LSL are ’00’, LSR is ’01’, ASR is ’10’, and RRX and ROR share ’11’ (you can see this ’11’ field occur after the imm5 field in the first screenshot above). I really don’t know why the assembler jumped from a ROR instruction to the MOV/LSL type of instruction when given an imm5 value of zero. Sometimes assembly language is too high level for me to comprehend.
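The two-bit-type-plus-imm5 scheme can be summarized in a few lines; here is my own rough Python transliteration of the decode logic described above (a simplified sketch, not the manual’s exact pseudocode):

```python
def decode_imm_shift(type2, imm5):
    """Interpret the 2-bit shift-type field and imm5 value,
    including the special cases for imm5 == 0."""
    if type2 == 0b00:                       # LSL; imm5 == 0 is effectively a MOV
        return ('LSL', imm5)
    if type2 == 0b01:                       # LSR; imm5 == 0 encodes a shift of 32
        return ('LSR', imm5 if imm5 else 32)
    if type2 == 0b10:                       # ASR; same 0-means-32 trick
        return ('ASR', imm5 if imm5 else 32)
    return ('RRX', 1) if imm5 == 0 else ('ROR', imm5)  # type '11' is shared

print(decode_imm_shift(0b11, 0))  # ('RRX', 1)
print(decode_imm_shift(0b01, 0))  # ('LSR', 32)
```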

ARM Assembly is Too High Level: Moving by Shifting
Sat, 14 Jul 2018

The syntax for the register form of a Logical Shift Left is:
LSL{S}<c> <Rd>, <Rm>, #<imm5>

This will take a value stored in the source register Rm and shift the bits left by the amount defined in #imm5. The result of this operation is stored in the destination register Rd. Like many ARM instructions, you can make it conditional with a condition code (c) and specify whether you want the flags set with {S}.

The real question (for freaks like me) is: what happens if you use #0 as the shift value? Doesn’t it just effectively move Rm into Rd? Wouldn’t LSL r0, r1, #0 do the same thing as MOV r0, r1?

To say that the answer is ‘yes’ should be obvious and also boring. The interesting thing is that ARM makes the machine encoding for LSL Rd, Rm, #0 identical to MOV Rd, Rm. Look at the encoding graphic near the bottom of the post for further context on this and the following info.

What about LSR? It effectively does the same thing as our MOV when imm5 is 0, but the machine encoding is off by one bit. Interestingly, LSL can have shifts from 0-31 (well, the manual says 1-31…), but LSR has shifts from 1-32, and there is no fucking way to force it to shift by 0. If you directly write the binary for the imm5 field of this instruction and set it to 00000, it actually ends up being LSR Rd, Rm, #32; that’s how they fit 32 in there (as 11111 is still only 31 in binary). You can look to the last image in this post for a comparison of how the source assembly is encoded.

So the next question is: what happens when we just write LSR Rd, Rm, #0? If what the ARM manual says about the 1-32 range is true, this instruction should be illegal (it kind of is). In the case of the ‘as’ assembler, it changes this to MOV Rd, Rm, with the exact same machine encoding as the LSL version. See, the 3 bits between the imm5 and Rm fields should be ‘010’ for an LSR; they are ‘000’ for MOV and LSL. So LSR Rd, Rm, #0 gets encoded as an actually different instruction, even at the machine encoding level.

This is all interesting. It just goes to show that it is regular behavior for an assembler to just take your shitty assembly and do whatever it fucking feels like with it. This is, I should remind everyone, because assembly language is too high level.

For reference, here is a nice little graphic illustrating the machine encodings of each instruction discussed:

Also for reference is some source assembly instructions (cat’ed out) followed by a disassembly of the assembled program (using objdump):

sed/regex Based BrainFuck Compiler
Thu, 13 Apr 2017

BrainFuck is an ‘esoteric’ programming language with only 8 one-character instructions. I’ve used it here and there for well over a decade. I love minimalist languages; so RISCy. A BrainFuck environment operates on a large array of data. There are instructions to move a pointer forwards and backwards through this array, and instructions to increment or decrement the pointed-to value…that’s already half the language. There are also instructions to input and output 1 character. Finally, there are instructions to start and stop a loop. That’s the whole language. There are many interpreters out there, and I’m sure there are some compilers too, but I wanted to write a one-liner-esque compile command with common, simple *nix tools. I actually ended up with 4 separate commands, but we will get to all of that.

Technical notes:

sed is the main tool I used to compile the BrainFuck code into x86. dd, printf, tr, and cat are helpers with formatting and staging the resulting 512-byte boot image output.

This does mean that the output file is limited to 512 bytes (more like 470-ish bytes after some of the overhead).

dd bs=512 count=1 if=/dev/zero of=out.bin
We are using an output file of out.bin. We want to start with a blank 512 byte file. So we use dd, set the block size to 512 bytes and use /dev/zero as the source of nulls to pad this file with.

printf ‘\x55\xaa’ | dd of=out.bin bs=1 seek=510 count=2 conv=notrunc
The image file needs to end with the bytes 0x55 0xaa to be considered a bootable image, so we use printf and dd to inject these bytes into the image.

First we feed the source file to our chain of commands with ‘cat‘.
We then use tr (translate) to replace the ending newline with an ‘x’ character (because we can’t just remove it or make it blank with tr).
Then we have a long line of find/replace commands with sed. The first one is simple: now we can just remove that ‘x’ that we put in there, so there are no longer any newlines.
The rest of the find/replace (regex) sed arguments literally find BrainFuck instructions and replace them with equivalent machine code. Before we explore each instruction, let’s make some notes about our environment. I mentioned that we are using a large array; we will be using the 16-bit ‘bx’ register as the pointer into it, incrementing and decrementing it as needed. I considered not using a real stack for the resulting program, because the concept is mostly not needed, but I ended up needing it for the looping instructions, which I will get to.
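For reference, the environment being compiled to can be pinned down with a tiny interpreter (my own sketch in Python; the actual compiler emits x86, this just nails down the semantics of the 8 instructions):

```python
def brainfuck(src, data_len=30000, inp=''):
    """Minimal BrainFuck interpreter: a data array, a pointer into it
    (the compiler uses bx for this), and the 8 one-character ops."""
    data = [0] * data_len
    ptr = ip = 0
    out = []
    inp = iter(inp)
    # pre-match brackets for the loop instructions
    stack, match = [], {}
    for i, c in enumerate(src):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            match[i], match[j] = j, i
    while ip < len(src):
        c = src[ip]
        if c == '>': ptr += 1
        elif c == '<': ptr -= 1
        elif c == '+': data[ptr] = (data[ptr] + 1) & 0xFF
        elif c == '-': data[ptr] = (data[ptr] - 1) & 0xFF
        elif c == '.': out.append(chr(data[ptr]))
        elif c == ',': data[ptr] = ord(next(inp, '\0'))
        elif c == '[' and data[ptr] == 0: ip = match[ip]
        elif c == ']' and data[ptr] != 0: ip = match[ip]
        ip += 1
    return ''.join(out)

print(brainfuck('++++++++[>++++++++<-]>+.'))  # A
```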

I remember learning these properties in basic algebra: associative, distributive, and commutative. It’s the commutative property that states that a + b = b + a; the same principle is true of multiplication. In x86 pointer math, of course the results of these operations follow the commutative property; that’s just math. However, the machine encoding doesn’t consistently take this into account. To be facetious with the blog title image: machine code takes apple color into account most of the time; assembly language just looks at the number of apples.

To spoil everything from the beginning: ‘xor byte [esp + eax], 0’ is encoded the same as ‘xor byte [eax + esp], 0’. In machine code, when using esp as one of two non-scaled registers in a pointer, the commutative property is acknowledged. However, the other general purpose registers are not treated this way! In other words, ‘xor byte [ebx + ecx], 0’ is not encoded the same as ‘xor byte [ecx + ebx], 0’.

The claim I just made is slightly unfair though, or unfair-ish. Assemblers (like nasm) allow us to get kind of loose with our assembly. An instruction like ‘xor byte [ebx + ecx], 0’ isn’t really showing the whole story. In assembly, these pointers are made up of 3 parts (all 3 are optional…ish): one base register, one scaled register, and one displacement (an 8- or 32-bit offset). Scaled registers can be multiplied by 1, 2, 4, or 8. So more accurately, the above instruction is actually a base register of ‘ebx’ plus a scaled register of ‘ecx’ with a scale of 1 (multiplied by 1). In machine code, a base register and a scaled register are encoded entirely differently.

With the above knowledge in mind, it’s no surprise that ‘xor byte [ebx + ecx * 1], 0’ is not the same as ‘xor byte [ecx + ebx * 1], 0’. Even if the result of what memory location this points to is the same (it is), it is now obvious why these are encoded differently…except for when it’s the ‘esp’ register…

When writing assembly, ‘xor byte [eax + esp * 1], 0’ gets encoded the same as ‘xor byte [esp + eax * 1], 0’. The actual encoding for both more accurately represents the 2nd form of this instruction: xor byte [esp + eax * 1]. Remember, the esp register can not be scaled (see Why ESP doesn’t scale (But EBP can still Base)). If I were to write ‘xor byte [eax + esp * 2], 0’, I would get an error from my assembler. But with ‘* 1’, my assembler (nasm) is clever enough to know that even though my instruction (scaling esp by 1) is not encodable, it can be replaced with an equivalent instruction (using the commutative property), and all is well. Without knowing machine code, this would all be happening magically behind the scenes, because assembly is too high level.
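Assuming the standard ModR/M+SIB layout, a minimal Python sketch (not a real assembler) can show why the two orderings collapse into one encoding only when esp is involved. The register numbers and field layout are from the x86 encoding; the function itself is just an illustration:

```python
# Sketch of how the SIB byte encodes 'xor byte [base + index*1], 0'
# (opcode 80 /6, imm8 = 0). esp (100b) can never be the index field,
# because index=100b means "no index" in the SIB byte.
REG = {'eax': 0, 'ecx': 1, 'edx': 2, 'ebx': 3,
       'esp': 4, 'esi': 6, 'edi': 7}   # ebp omitted: it's a special case

def encode_xor0(base: str, index: str) -> bytes:
    if index == 'esp':                       # esp can't be the index,
        base, index = index, base            # so swap (commutative property)
    modrm = (0b00 << 6) | (6 << 3) | 0b100   # mod=00, /6, rm=100 -> SIB follows
    sib = (0 << 6) | (REG[index] << 3) | REG[base]  # scale=1
    return bytes([0x80, modrm, sib, 0x00])

# esp as one of the two registers: the order can't matter in the encoding
assert encode_xor0('esp', 'eax') == encode_xor0('eax', 'esp')
# any other pair: the two orders encode differently
assert encode_xor0('ebx', 'ecx') != encode_xor0('ecx', 'ebx')
```

The swap in the first two lines of the function is exactly the substitution nasm performs silently.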

Before we forget, let’s take a look at ebp, because the ebp register also gets encoded differently sometimes in memory pointers. Even though esp can never be encoded as a scaled register, making ebp the base register can be done, but it comes with a compromise: the displacement component of the pointer is no longer optional. So if you didn’t include a displacement in your assembly, the assembler will add a zeroed-out byte as a displacement for you. In other words, ‘xor byte [ebp + eax], 0’ is actually more accurately ‘xor byte [ebp + eax + 0x00], 0’.

So given the above information about the encoding of ebp, we are faced with a trade-off. This is one of those times where my assembler takes me literally, in the sense that it obeys making ebp the base (first) register (even though it may add an extra null byte behind our backs to do it). So even though ‘xor byte [eax + ebp * 1], 0’ logically does the same thing (commutative property), nasm does not choose this form, because it doesn’t need to, like it does with the scaled esp register. The interesting thing is that this alternate form of the instruction is a byte shorter (because it doesn’t need that displacement byte). The takeaway: if you are using two unscaled registers in your pointer, ebp is one of them, and you didn’t already have a displacement, make ebp the last one (all to save one byte).
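Building the two encodings by hand (a rough sketch under the standard ModR/M+SIB rules, not nasm’s actual code path) shows where the extra byte comes from:

```python
# Sketch of why '[ebp + eax]' costs a byte more than '[eax + ebp*1]'
# in the encoding of 'xor byte [...], 0' (opcode 80 /6).
def xor0_ebp_base() -> bytes:
    # base=ebp with mod=00 means "disp32, no base", so the assembler
    # must use mod=01 and emit a zero disp8 byte instead
    modrm = (0b01 << 6) | (6 << 3) | 0b100   # mod=01: disp8 follows SIB
    sib = (0 << 6) | (0 << 3) | 5            # index=eax(0), base=ebp(5)
    return bytes([0x80, modrm, sib, 0x00, 0x00])  # disp8=0, then imm8=0

def xor0_ebp_index() -> bytes:
    modrm = (0b00 << 6) | (6 << 3) | 0b100   # mod=00: no displacement
    sib = (0 << 6) | (5 << 3) | 0            # index=ebp(5), base=eax(0)
    return bytes([0x80, modrm, sib, 0x00])   # imm8=0 only

assert len(xor0_ebp_base()) == len(xor0_ebp_index()) + 1  # the byte saved
```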

Assembly is Too High-Level: Signed Displacements
http://xlogicx.net/?p=626 (Wed, 01 Feb 2017)

For those that don’t know about unsigned and signed data types, it’s not all that complicated. One byte can hold a total of 256 possible values. If these values were only positive numbers and included zero, we would have a number range of 0-255. But what if we wanted negative numbers? The byte is divided; we now have a range of -128 through 127. When including zero, this is all 256 possible values. The data is formatted as two’s complement: https://en.wikipedia.org/wiki/Two’s_complement

With a 32 bit register, we have four bytes, giving us 0-4,294,967,295 for unsigned values and -2,147,483,648 through 2,147,483,647 for a signed value.

When using instructions that use memory locations for an operand, we are allowed to use literal displacements. For example, we can use an address/pointer stored in EAX and add 128 to the address (not value). This 128 is called a displacement, or offset. This displacement can be up to 32 bits, but if the displacement is small enough, it can be encoded as 8 bits.

One thing to note however, is that this displacement is a signed value. At least for me, the documentation of this fact was subtle. And in practice, if you operate at the level of abstraction that assembly language provides and don’t know this subtlety, it is possible to run into some unfortunate bugs.

A good example is the two following lines of assembly language:
lea ebx, [eax - 1337]
lea ebx, [eax + 4294965959]

As it turns out, when it comes to machine code, both of these instructions are completely identical. Looking back at the range of a 32 bit signed value, the range stops after around positive 2 billion; it doesn’t go all the way up to 4 billion as in the above lea instruction example. This 4 billion value would be valid as an unsigned 32 bit number, however. Nasm doesn’t appear to care about warning us; it just encodes this positive 4 billion number as if it were unsigned. But when you actually execute this instruction, it is surely treated as a signed number (which turns out to be -1337).

Exhibit A:

We see that the 2nd line of assembly has eax-1337, even though the machine code has a value of 0xfffffac7 (2’s complement, little endian). My assembly source actually used lea ebx, [eax + 4294965959] to generate that line. Even though my source added a large positive number, and the machine code appears to be a large number (if it weren’t 2’s complement), the disassembled version in edb (not how my original source was written) is what is actually executed. We know this because 0x80000000 - 1337 is 0x7ffffac7 (the result that gets stored into ebx, shown in the screenshot).

Here’s another thing to think about that can get glossed over when not paying attention to the details. In 32 bit, negative 1337 is represented as 0xfffffac7 in 2’s complement. If this value were to be interpreted as unsigned, its value would be 4,294,965,959. This is where I got this value to use in my examples above.
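That arithmetic is easy to sanity-check in Python:

```python
import struct

# -1337 and 4294965959 share the same 32-bit pattern: 0xfffffac7
assert struct.pack('<i', -1337) == struct.pack('<I', 4294965959)
assert struct.pack('<i', -1337).hex() == 'c7faffff'  # little endian bytes
assert (-1337) % 2**32 == 4294965959                 # how the value was derived
```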

Does this same logic work in 8 bit? Say we use negative 100. This is 0x9c in 2’s complement, and as an unsigned value, 0x9c is 156. So to review, these two instructions are the same:
lea ebx, [eax - 1337]
lea ebx, [eax + 4294965959]

Does that mean that these two are too?
lea ebx, [eax - 100]
lea ebx, [eax + 156]

The answer is no. Note the below screenshot showing the difference between adding 127 and 156:

As a signed number, 156 would be too big to fit into 8 bits, so nasm graduates this to 32 bits instead, and it actually gets interpreted as positive, whereas the 4 billion positive number is unable to graduate to anything higher than 32 bits, as there is no machine encoding for that (even on a 64 bit architecture; such an encoding is not part of the Mod/RM encoding).

What about this instruction?
lea ebx, [eax + 4294967292]
Below is the same instruction represented in hex instead of decimal:
lea ebx, [eax + 0xfffffffc]

The resulting machine code is 8d58fc. Where’d all the f’s go? They are leading f’s, just as there is such a thing as leading zeros (00003 is the same thing as 3). In two’s complement, the f’s in a negative number are leading f’s. This number represents -4, which is small enough to fit into 1 byte; all the leading 0xff bytes can be dropped.

And on the other hand, the below assembly instruction:

lea ebx, [eax + 0xfc]

Has the machine code of 8d98fc000000; a bunch of leading zeros (they look like trailing zeros due to little endian) get added to the machine code. Nasm must pick between the ambiguities of 0xfc: does the author mean ‘+ 252’, or do they mean this to be a signed value of ‘-4’? Nasm chooses to interpret 0xfc as positive, but since 0xfc in a single signed byte would actually represent negative 4, the value has to be placed into 4 bytes instead.
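The selection logic can be sketched in Python (a simplification of what nasm really does, for this particular lea form only):

```python
import struct

# Sketch of the assembler's choice for 'lea ebx, [eax + disp]':
# a signed displacement in [-128, 127] fits in one byte (mod=01),
# anything else needs four bytes (mod=10).
def lea_ebx_eax(disp: int) -> bytes:
    disp &= 0xFFFFFFFF                           # 32-bit wrap
    signed = disp - 2**32 if disp >= 2**31 else disp
    if -128 <= signed <= 127:
        return bytes([0x8D, 0x58]) + struct.pack('<b', signed)  # disp8
    return bytes([0x8D, 0x98]) + struct.pack('<i', signed)      # disp32

assert lea_ebx_eax(0xfffffffc).hex() == '8d58fc'      # -4: leading f's dropped
assert lea_ebx_eax(0xfc).hex() == '8d98fc000000'      # +252: needs disp32
```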

Boot Sector Graphical Programming (Tutorial)
http://xlogicx.net/?p=583 (Mon, 29 Aug 2016)

This tutorial is aimed at those that have some assembly experience, but very minimal 16-bit BIOS programming experience; in other words, a short list of some of my friends that I want to coerce into doing some BIOS programming.

Assembling:

VirtualBox
Create floppy image: Use this padding in the 2nd to last line of code: times (1440 * 1024) - ($ - $$) db 0 (instead of times 510-($-$$) db 0)
Run floppy image in VirtualBox: Create a low spec VM and set it to boot to yourboot.bin as the floppy image. Either rename image file to tronsolitare.img or use: nasm yourboot.asm -f bin -o yourboot.img

BIOS Programming Environment:

These programs are small, as in only 512 bytes of code. Fortunately, you don’t have to do absolutely everything yourself; there are some extremely useful BIOS routines that you can call to do some heavy lifting. A good guide/lookup of these routines can be found at http://wiki.osdev.org/BIOS.

In the below examples, an 80x25 (80 columns and 25 rows of characters) display is assumed. This isn’t the only mode; it’s just a mode I feel comfortable with. It’s actually not too small. The main reason not to go much bigger is that the memory challenges you will already have will get even more noticeable.

Each ‘character’ is actually 2 bytes (16 bits) of information: 4 bits for the background color, 4 bits for the text color, and 8 bits for the actual character. The 8-bit character is not ASCII, however (it is similar); it is code page 437. https://en.wikipedia.org/wiki/Code_page_437 has more information on this.

You will have a register (di) that points to this video memory. It starts at the upper left of the screen. As you increment di, it stays on the same row and moves to the right. After it reaches the end of the row, it moves back to the left on the next row.

Code Basics:

This section will have snippets of code to get you started

BIOS signature and padding

This is really the only required part that you need in your bootable image. These should be your last 2 lines of code.

times 510-($-$$) db 0
dw 0xAA55

The last line of code is a 2-byte signature that must be at the end of your image file. The line of code above it makes sure that no matter how much code you write, your image file will be exactly 512 bytes (after your code, the rest of the file will be filled with nulls, and then end with the signature).
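As a quick sanity check (the filename here is hypothetical), a short Python sketch can verify that the signature lands where the BIOS expects it: bytes 510-511 must be 0x55, 0xAA (dw 0xAA55 stored little endian):

```python
# Check that a built image carries the boot signature at bytes 510-511.
def is_bootable(path: str) -> bool:
    with open(path, 'rb') as f:
        data = f.read()
    return len(data) >= 512 and data[510:512] == b'\x55\xaa'

# Example: a minimal 512-byte image, nulls then the signature
with open('yourboot.img', 'wb') as f:   # hypothetical output of nasm
    f.write(b'\x00' * 510 + b'\x55\xaa')
print(is_bootable('yourboot.img'))  # True
```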

Nasm ORG directive

This should be the first line of your assembly file when using the nasm assembler. The BIOS loads the boot sector at address 0x7C00, so:

org 0x7c00

Basic Video (and stack) Setup

xor ax, ax ;make it zero
mov ds, ax ;DS=0
mov ss, ax ;stack segment = 0
mov sp, 0x9c00 ;stack pointer, 0x2000 past the 0x7c00 load address

mov al, 0x03 ;AH=0 (set video mode), AL=3 (80x25 text)
int 0x10

mov ax, 0xb800 ;text video memory
mov es, ax ;ES=0xB800

mov ah, 1 ;set cursor shape...
mov ch, 0x26 ;...to invisible (hides the cursor)
int 0x10

This code does all of the BIOS video overhead. It initializes the data segment (to zero), it allocates an area of memory for the stack and puts the pointer at 0x9c00 (assuming you’ll be using the stack, and you probably should), initializes video memory at 0xb800, sets the video mode to 80×25 (80 columns by 25 rows), and hides the cursor. For more video BIOS info, check out https://en.wikipedia.org/wiki/INT_10H.

A Simple Time Delay Loop

Example of keyboard input for arrow keys

This may look a little complicated, but it’s not so bad. It’s also the most reliable way that I’ve found to take keyboard input without lag, with continuous polling, and that remembers the last key pressed. For more keyboard information, check out https://en.wikipedia.org/wiki/INT_16H

;Infinite Loop To end the program
endloop:
jmp endloop

;BIOS sig and padding
times 510-($-$$) db 0
dw 0xAA55

Colors:

I made a boot sector program (https://github.com/XlogicX/colors) to display all of the foreground/background colors. These are the hex codes that would be in the upper part of the ax register (ah) right before a stosw. As an example, knowing that 0x58 is the character code for ‘X’, 0xE458 in ax right before a stosw would produce a red ‘X’ on a yellow background.
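A small sketch of composing that stosw word, assuming the attribute byte sits in AH (high nibble = background, low nibble = foreground) and the character in AL:

```python
# Compose the 16-bit value for stosw: attribute byte in the high byte,
# code page 437 character in the low byte.
def cell(background: int, foreground: int, char: str) -> int:
    attr = (background << 4) | foreground
    return (attr << 8) | ord(char)

# Red 'X' (foreground 0x4) on a yellow background (0xE), as above
assert hex(cell(0xE, 0x4, 'X')) == '0xe458'
```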

Showcase:

Some examples of projects out there that are boot sectors that don’t boot to an OS (that I know of):

Tetris (Game)

https://github.com/Shikhin/tetranglix

TronSolitare (Game)

https://github.com/XlogicX/tronsolitare

Goatse (Image)

https://github.com/jbremer/goatse.mbr

Boot2Sol (Game)

https://github.com/masneyb/boot2sol

Nyanboot (Animated Image)

https://github.com/XanClic/nyanboot

Phosphene (Hi-Def Animated Fractal)

https://github.com/kmcallister/phosphene

512B-bootloader-effect (Animated Graphic)

https://github.com/pjanczyk/512B-bootloader-effect

CactusCon Slides: Machining, A Love Story
http://xlogicx.net/?p=515 (Wed, 11 May 2016)

Here is the full ~6MB image that I used as my slide deck (within MS Paint in Windows 3.1) for my CactusCon 2016 presentation: Machining, A Love Story. Below the large image are all the images again, slide-by-slide, with brief notes, so there can be some context. All non-screenshot art was done by KRT c0c4!N (my lovely girlfriend); it should be noted that I limited her to 16 colors with a specific palette.

The intro slide:

A slide showing the 2016 CactusCon art in less than 16 colors:

The ‘ToC’ slide summing up what’s to come:

As a teenager, I got my first family computer, a 486DX. Playing games on a SNES or SEGA, I felt like I was playing games from the ‘gods.’ But playing games on a computer made me realize I was using the same platform that could allow me to be one of the gods: I wanted to learn to program.

I tried QBasic, which was fun at first. But I didn’t want an interpreted language. I wanted to write software where I could just run the executable standalone; where the program was the machine code meant for the processor, not an interpreter.

I wanted to see what a ‘real’ program looked like. So I dropped a program into Notepad and inspected it. Even though I knew this code was not printable, I still had a feeling that if I could understand these characters, and had the right editor, I would have all I needed to write software (this assumption turns out to be correct; it’s just too bad I didn’t find the answer until way later in life).

I found the nerdiest friend I knew in school and asked him:

He responds:

So I ask:

He responds:

Remembering my exploration of a program inspected in notepad:

He still persists, only knowing what he has heard, with no appreciation that there are lower levels handed to us by the false gods of abstraction:

I remember this conversation for eternity. It is the moment I start to hate abstractions, to fundamentally know that if something can’t be done at the layer of abstraction we are dealing with, one must only go a level deeper and repeat if needed (even though lower levels of abstraction are more difficult to deal with, they always come with more control and power):

This starts my journey to learning programming, assembly language, and machine code. I remember programming in BASIC for this TI-82. But then I learned you can program in assembly for it (Z80 chip). My first program cleared the screen (as intended). My second experiment cleared the memory (it was meant to be ‘Hello World’). I gave up on this for a little while.

Then I formally learned assembly (and even machine code) for the Motorola 68HC11 embedded system. For class, we didn’t get a textbook. Instead we had a lab manual and the Motorola reference manual for this chip. The reference manual had every instruction and even the corresponding machine code for each instruction. After doing all of the labs, my personal project was to try to write some code that would replicate itself into memory right after itself. This required an appreciation for machine code.

The next architecture I learned assembly for was the Parallax Propeller chip. I wrote a 4-channel wave table based audio driver in assembly. I put the chip in my 4-string bass with a NES controller as input. It was only until later that I experimented with Propeller machine code, only to find out that this architecture is the closest to 1-to-1 between assembly and machine code that I had ever seen. More on this project: Bass + Computer

I finally learn x86 assembly. I learn it from some SANS GREM (GIAC Reverse Engineering Malware) training that a previous employer sent me to. It was actually a fantastic intro to x86 assembly. It also offered/explained a tool that can be used to convert ‘shellcode’ into a real executable program under windows. I liked this, but really wanted one for GNU/Linux instead (one did not exist at the time)

I then read more than 10 books on assembly and all 3 volumes (3,500 pages) of the Intel Manual.

I learned that assembly is too high level. I won’t go into too much detail on the next 3 slides, as the deeper explanation of these topics is contained within this same blog (in other posts), and is enough for a dedicated talk…

My rant on responses I see on stackoverflow (not about the platform itself). Remember, I wanted something like shellcode2exe.py, but for GNU/Linux ELF. To see if there was anything like this, I started with a search, and found someone asked this question:

This was the first moronic (and highest upvoted) answer. It is assembly (not machine code, like the question asked for):

In the comment of that first answer was this (correct):

This is probably the best answer, as it fully satisfies the question of having no headers (PE nor ELF). But there was no proof of concept

Right above (with the moar shit), someone gives an ‘example’. It is moar shit because the example is just more assembly (not machine code):

Then there are these unhelpful tidbits. ELF is not machine code, and a.out is not an appropriate alternative to ELF in the context of wanting to do pure machine code.

Finally, I create something for my own needs; writing pure machine code / ‘shellcode’ and being able to run it, albeit in ELF format. It takes a machine code (ascii hex) source file, and makes an ELF executable of it. I respond with my tool and a proof of concept:

This is the closest thing to a helpful answer. Not only are we back to the DOS .COM format file (pure machine code, no headers), but there is a proof of concept: the fully functional and executable EICAR antivirus test file. But it wouldn’t be stackoverflow if the most helpful answer weren’t also the most downvoted, with the most ignorant responses. ‘compiler’ says that it doesn’t look like machine code (it is). ‘petersaints’ states that this isn’t machine code, and that it’s just the EICAR test string (it is machine code, and it also tests AV). For an in-depth debug of EICAR, see http://thestarman.pcministry.com/asm/eicar/eicarcom.html (it’s elite). Also, my friend did a write-up on the same topic: http://www.biebermalware.info/2016/05/playing-with-eicar-my-nerdiest-post-ever/

So now starts the section where I give various ways to write raw machine code and execute it. Starting with the Windows platform and shellcode2exe.py. The screenshot itself shows how it is run:

This ImmunityDBG screenshot is the output of the above shellcode2exe.py command. Note that I used the assembly and machine code from the examples above about assembly being too high-level; hence a few ‘???’ disassemblies.

Below is a source file for my m2elf script. I have another blog entry that goes into more depth on these tools: How to Machine

A screenshot showing the running of the script and the executing of the result

NASM directives are another way of inserting literal bytes into otherwise-assembly source files. The advantage of this is that it allows for 64-bit code (my m2elf script only supports 32-bit). The thing to be aware of is byte ordering (little endian), as things can tend to get reversed if you’re not paying attention.

Another method is to write boot sector code. The slide below outlines the features of coding this way. I wrote a PoC that I call TronSolitare, https://github.com/XlogicX/tronsolitare

And to return to a way to write raw machine code without headers, a method I could have used as a teenager, if only I had the right knowledge:

I also wrote a program to interpret commented machine code (like above) and output a .COM file. I demo’d this as well during the talk. Even though this is assembly, I took the machine code from the assembled output and wrote this entire program using debug (in machine code), because I’m a purist…

Assembly is Too High Level: Repetition of REP Instructions That Don’t Repeat Anything
http://xlogicx.net/?p=493 (Sat, 13 Feb 2016)

The REP (Repeat String Operation) prefix is pretty cool; it modifies a single string instruction to repeat until the ECX register reaches zero. As this only applies to one instruction (as opposed to a block of code), ECX needs a way to decrement; REP automatically decrements ECX by 1 on each execution of the string instruction. So the idea is to set ECX to the number of times you want the string operation to execute, and then run the string operation with the REP prefix. The instructions that REP is supposed to be appropriate for are: INS, MOVS, OUTS, LODS, STOS, CMPS, and SCAS.

Now that I’ve described the coloring book, let’s color outside the lines. What happens when we try to REP-prefix an instruction not on this list? For example:

And this is in a debugger after executing the last INC.

You’ll notice that ECX didn’t decrement at all, and EAX is only up to ‘2’, as there were only 2 INC instructions. In other words, the REP prefix was completely ignored. I like this, because gratuitous prefixes can be abused…

Consistent Instruction Sizes:

Revisiting the concept discussed in the Consistent Instruction Sizes blog post, this can be done with an ironic repetition of REP instructions that don’t repeat anything (best phrase ever):

Just like the previous blog post’s PoC, this one launches a /bin/sh shell as well. The thing I love so much more about this one is that the 0xF3 prefix (REP) doesn’t really change the original meaning of the code (unless it precedes one of the very few string instructions). In comparison, the 0x66 and 0x67 override prefixes will change register sizes all over the place and have to be treated carefully. And with the REP prefix, even if the instruction were a string instruction, all you have to do is set ECX to 1 before it and it will work like a non-REP instruction as well. But don’t just do a normal ‘XOR ECX, ECX’ with machine code 0x31c9; you should do the full REP version of 0xf3f3f3f3f3f3f3f3f3f3f3f3f331c9
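The padding idea above can be sketched in a few lines of Python (the 15-byte cap is the x86 architectural instruction length limit):

```python
# Pad an instruction's machine code with 0xF3 (REP) prefixes so every
# instruction occupies the same number of bytes.
def pad_with_rep(machine_code: bytes, size: int = 15) -> bytes:
    assert len(machine_code) <= size <= 15  # x86 caps instructions at 15 bytes
    return b'\xf3' * (size - len(machine_code)) + machine_code

padded = pad_with_rep(bytes.fromhex('31c9'))  # xor ecx, ecx
assert padded.hex() == 'f3' * 13 + '31c9'     # the byte string quoted above
assert len(padded) == 15
```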

Repeating a NOP to “pause”:

The above source file produces the below results in most debuggers/disassemblers.

So we take a normal NOP instruction:

And put the REP prefix in front of it and get a pause:

It’s really just a cool backwards compatibility hack on Intel’s part. PAUSE was introduced with the Pentium 4. But if you used this instruction on an older Intel processor, what would it do? Well, like we discovered with a useless (non-string-based) REP prefix, it wouldn’t modify the instruction after it; in this case, the machine code for a NOP. In other words, machine code for a PAUSE on a pre-Pentium 4 processor would just be a glorified NOP.