Marcan's tis101 crackme

Here are my findings on marcan's challenge, codenamed 'crackme.tis101'.

Setup

The challenge is available at the following URLs:

nc marcan.st 10847 (preferred)

telnet marcan.st 10847

If we telnet to the above host and port, we are greeted with the following prompt (it does not vary):

================================================================================
WELCOME TO THE TIS-101 DEVELOPMENT AND TEST ENVIRONMENT
================================================================================
Enter object code followed by NEWLINE:

We need to send object code after this. Marcan provides such object code at the following URLs:

We can guess that the goal of this challenge is to find the password, or to be able to upload another valid binary, or both.
As a bonus challenge, it may be desirable to understand what this 'object code' is.

Methodology

For this challenge, we'll try to gather as much information as we can in order to 'understand' the underlying principle.

First remarks

The fact that the problem is named tis-101 (probably related to the game tis-100) and the general form of the object code make us think this is some sort of bytecode.
Also, there seem to be both 'code' and a 'data store'.
There are also 'SECURE' and 'INSECURE' modes, so there are probably some checks (maybe done outside the program itself, since the 'Launching application' message appears afterwards).

While trying to fuzz the input, we crashed the VM; we didn't include this in the analysis (and it has been patched since).

First, we can rule out an encoding of two characters per unit, since 1923 is odd (however, the file may contain an odd-length header; we'll revisit this later).
A 3-character encoding may be possible, since 1923/3 == 641. However, there is a 33% chance of this happening by accident, so the size is not really helpful yet.

Next, we can confirm that there are an awful lot of 0s: more than the second and third most frequent digits combined! These 0s could be used for padding.
Or they may be there to confuse us (carrying no real information and simply being ignored).

Glancing at the text, we can't see any obvious encoding. However, some patterns repeat. For example, the pattern 47343 appears 4 times; it may be code that exits.

The total information in the file is at most around 3.3 bits per symbol (log2(10)) times 1923 symbols.
So we have at most around 6.4 Kbits, or 798 bytes. A lot of things are possible with such a length!
It all depends on the underlying VM (it may have very 'complicated' instructions, such as 'output a string of X length').
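The estimate above can be reproduced with a few lines of Python. This is only a sketch: `digit_stats` is a hypothetical helper name, and the dummy input merely has the right length (1923 symbols), since the real object code is not reproduced here.

```python
import math
from collections import Counter

def digit_stats(object_code: str):
    """Return (digit counts, upper bound on the information in bits)."""
    digits = [c for c in object_code if c in "0123456789"]
    counts = Counter(digits)
    # Each decimal symbol carries at most log2(10) ~ 3.32 bits.
    max_bits = len(digits) * math.log2(10)
    return counts, max_bits

# Only the length matters for the bound, so a 1923-symbol dummy is enough here.
counts, max_bits = digit_stats("0" * 1923)
print(round(max_bits), "bits, i.e. about", int(max_bits // 8), "bytes")
# prints: 6388 bits, i.e. about 798 bytes
```

Run on the real file, `counts.most_common(3)` would also give the digit frequencies discussed above.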

Modifying the binary one byte at a time

The nice thing is we don't have to send the binary as is to the server: we can manipulate it.
If we send anything other than the characters '0123456789', the server seems to refuse our binary with a simple 'Invalid data.' answer.
We can use this property later to annotate the source file, by stripping all characters outside the '0123456789' range before sending (disclaimer: this was suggested by marcan himself).
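A minimal sketch of that stripping step, assuming the annotations themselves never contain digits (any digit in a comment would leak into the object code). The function name `strip_annotations` is made up for illustration.

```python
def strip_annotations(annotated: str) -> str:
    """Keep only the decimal digits, dropping spaces, newlines and comments."""
    return "".join(c for c in annotated if c in "0123456789")

annotated = "001001 005 78080\n"
print(strip_annotations(annotated))  # 00100100578080
```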

The first step is to try removing or adding a few bytes in the 'binary' (we call it a binary for lack of a better name). No dice: we get the dreaded 'Invalid data.' answer.

However, if we modify a single byte, say the one at offset 0x100, to 0, then we get a different answer! Time to write a Python script that will do the following (abbreviated):
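A minimal sketch of such a script might look like this. It is not the original code: the file name, the output naming scheme and the timeout handling are all assumptions, and only the host and port come from the Setup section.

```python
import socket

HOST, PORT = "marcan.st", 10847  # taken from the Setup section

def mutations(code: str, new_digit: str = "0"):
    """Yield (offset, mutated_code) for every position whose digit would change."""
    for i, old in enumerate(code):
        if old != new_digit:
            yield i, code[:i] + new_digit + code[i + 1:]

def send(code: str, timeout: float = 10.0) -> bytes:
    """Send one object code and return everything the server answers."""
    with socket.create_connection((HOST, PORT), timeout=timeout) as s:
        s.sendall(code.encode("ascii") + b"\n")
        chunks = []
        try:
            while chunk := s.recv(4096):
                chunks.append(chunk)
        except socket.timeout:
            pass  # server stopped talking; keep what we got
        return b"".join(chunks)

def run(path: str = "crackme.tis101") -> None:  # hypothetical file name
    """Write one answer file per mutated offset."""
    with open(path) as f:
        code = f.read().strip()
    for offset, mutated in mutations(code):
        with open(f"answer_{offset:04d}.txt", "wb") as out:
            out.write(send(mutated))
```

Calling `run()` would produce one answer file per offset, which can then be compared against the answer for the unmodified binary.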

Upon reading the answer files, we can see several things:
* Some bytes corrupt a single byte of output, for example byte #1768.
* Some bytes corrupt sequences of bytes, for example byte #1147.
* Some bytes corrupt the output logic, for example byte #150. Given that the problem is named tis-101, we can infer that some 'cells' may interact in a weird way when we corrupt this byte.
* Some bytes produce a very large answer and the VM stops at 5000 cycles, for example byte #145. We probably hit an infinite loop.
* Two bytes make the program output 'Great' and exit: #1322 and #161. They may be related to checking code and data.

Interactive analysis

At that point, we take a break from analysing the file and try to confirm what happens at the above locations.
We create another program that takes a file, sends it and prints the answer, then repeats. That way, we can modify the binary and get interactive results.
The loop has the following format:

Strings

So far, we still don't know the encoding. It may be a variable-length encoding. However, we do find many strings in the binary.
They have the following form:

78
072 H
78
101 e
78
108 l
78
108 l
78
111 o
78
033 !
78
010 \n
75

or

48
071 G
48
114 r
48
101 e
48
097 a
48
116 t
48
033 !
48
010 \n
45

However, it is not clear how the 4 and the 7 are related. If we replace the 8 with a 9, then we obtain 256-X, where X is the following 3 characters interpreted as a byte.
So we can write things like 48108 and 49148, which both correspond to the same letter 'l' in the output.
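The string layout above can be decoded mechanically. The sketch below assumes every character is a 5-digit group whose second digit is 8 (plain value) or 9 (256 minus the value); `decode_string` is a hypothetical helper, and the sample is the 'Great' listing concatenated.

```python
def decode_string(code: str) -> str:
    """Decode a run of 5-digit 'D8XXX' / 'D9XXX' groups into text."""
    out = []
    i = 0
    while i + 5 <= len(code) and code[i + 1] in "89":
        value = int(code[i + 2:i + 5])
        if code[i + 1] == "9":
            value = 256 - value  # the '9' variant encodes 256-X
        out.append(chr(value % 256))
        i += 5
    return "".join(out)

great = "48071" "48114" "48101" "48097" "48116" "48033" "48010" "45"
print(repr(decode_string(great)))                        # 'Great!\n'
print(decode_string("48108") == decode_string("49148"))  # True
```

Decoding stops at the first group that does not match the pattern, here the trailing 45.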

Blocks

As in tis-100, we suspect there is some kind of 'block' concept. This could explain the difference between 78 and 48: they don't output the data in the same direction.
Also, if we tinker with the 90XXX block after the strings, the following string gets interleaved with the current one: this supports the idea that it takes the data from another block.
In pseudo-code, it would have the following structure (line numbers courtesy of GW-BASIC):

If we change the offset of the goto, then more characters of the current string are repeated.
However, offset manipulation doesn't yield a clear rule: 9002 is equivalent to 9006. Maybe the offset is encoded later.

Number of cycles

Let's try something different. The prompt invites us to enter a password, so we do.
Of course, it fails. However, we are given a bit of information: the number of cycles used by the machine, 553.
This number is quite low, but remember that tis-like machines are massively parallel.
Maybe the number of cycles used differs when one character is valid? Let's try.

We first discover we have to enter at least 16 bytes of data to the program.

So, it does vary. Not a lot, but a little. If we take a closer look, we discover that only the upper 4 bits of the input change the value.
Moreover, only the following values yield results:

Byte       NumCycle   4 bit upper
241-256    552        1111
192-240    551        1100->1111
177-191    552        1011
128-176    553        1000->1011
113-127    552        0111
 64-112    551        0100->0111
 49- 63    552        0011
  0- 48    553        0000->0011

We can see that the pattern repeats if we ignore the upper bit, so maybe values are treated as signed and truncated to 7 bits (& 0x7f).
The strange values are 48 and 176, which do not fall 'properly' on the bit-range boundaries. So there may be a comparison involved.
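Both observations can be checked against the table with a short script. The rows are copied from the table above (with 256 clipped to 255, since a byte cannot hold more); the `model` function is only one possible comparison-based guess that happens to fit the measurements, not the VM's actual logic.

```python
# (first byte, last byte, cycles) rows copied from the table above
ROWS = [
    (0, 48, 553), (49, 63, 552), (64, 112, 551), (113, 127, 552),
    (128, 176, 553), (177, 191, 552), (192, 240, 551), (241, 255, 552),
]

def cycles(byte: int) -> int:
    """Look up the measured cycle count for one input byte."""
    for lo, hi, n in ROWS:
        if lo <= byte <= hi:
            return n
    raise ValueError(byte)

def model(byte: int) -> int:
    """Hypothetical model: drop the top bit, then compare against thresholds."""
    v = byte & 0x7F
    return 553 if v <= 48 else 552 if v < 64 else 551 if v <= 112 else 552

# Dropping the top bit never changes the cycle count.
print(all(cycles(b) == cycles(b & 0x7F) for b in range(256)))  # True
# The threshold model reproduces every row, odd boundaries included.
print(all(cycles(b) == model(b) for b in range(256)))          # True
```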

Back to block analysis

Not getting very far with this cycle analysis, we go back to the blocks.
One thing we note is that sometimes the system answers: Units: 63
Hmm, could it be that we deactivated one unit? 64 units looks a lot like an 8x8 matrix.

Getting the blocks!

So, we have around 64 markers for possible places inside the file. Upon inspection, we see they are of the form X00Y00 or 00X00Y.
This could explain the '+3' rule: if two blocks are already defined at 004000 and 000003 for example, then the block 004003 would overlap in both +2 and +5.
We quickly check whether our hypothesis can be valid:
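A check of this kind could be sketched as follows. The helper name, the template syntax and the tiny sample string are made up for illustration; which coordinate comes first is also an assumption.

```python
def find_markers(code: str, template: str, size: int = 8):
    """Return {(x, y): first offset or -1} for every marker built from template."""
    found = {}
    for x in range(size):
        for y in range(size):
            marker = template.format(x=x, y=y)
            found[(x, y)] = code.find(marker)
    return found

# Tiny made-up object code containing two markers, for illustration only.
sample = "00000000216" + "00700100245"
hits = find_markers(sample, "00{x}00{y}")
print(hits[(0, 0)], hits[(7, 1)], hits[(3, 3)])  # 0 11 -1
```

The other candidate format would be tested with the template `"{x}00{y}00"`. Note that `str.find` only reports the first occurrence, so a marker such as 000000 can match inside a run of padding zeros; hits still need a manual look.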

We see that some of them are not found. So the format X00Y00 is either wrong, or there is something more.
But by testing 00X00Y we find all occurrences! With some manual work, we get the following formatted file:

That's better! We can see a proper structure here. Note that all the strings are at the end.

More analysis

Now that we have proper blocks, we can first guess the structure of the program:

B(0,0)  B(0,1)  B(0,2)  ...  B(0,7)
B(1,0)  B(1,1)  B(1,2)  ...  B(1,7)
...
B(7,0)  B(7,1)  B(7,2)  ...  B(7,7)

In that notation, B(7,7) is the following:

007007 051164191044480704809748105481084803348010900394590044 #Great

Now that we understand basic blocks and how they relate to each other, we can try some tests:

001001 005 78080

This displays a series of P on the screen. With that, we start reverse engineering the opcodes.

The opcodes

It's pretty easy to see that 78 puts something on the screen. However, it only does so when there is nothing below.
It seems that the tis-101 interpreter creates every cell up to the maximum cell coordinates (which would make sense).
So, if we create a cell at 099099, all 100x100=10000 cells would be created.
It also means that the maximum number of cells is one million.

The first opcodes are quite easy to guess:

90XXX JMP XXX #XXX is the number of bytes relative to start of cell
78YYY MOV YYY, DOWN #we suppose for now that '7' is down
79YYY MOV 256-YYY, DOWN
48YYY MOV YYY, LEFT #we suppose for now that '4' is left
49YYY MOV 256-YYY, LEFT
91XXX CONDJMP
92XXX CONDJMP
93XXX CONDJMP
94XXX CONDJMP

Try all the opcodes (again)!

This time, we have better knowledge. We will try a simple program in the following format:

000 000 002 II

Where II goes from 00 to 99.
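Generating the 100 probe programs is easy to script. The sketch below assumes the cell layout suggested by the examples in this write-up: two 3-digit coordinates, a 3-digit code length, then the code itself. The names `make_cell` and `opcode_probes` are hypothetical.

```python
def make_cell(x: int, y: int, code: str) -> str:
    """Serialize one cell as XXXYYYLLL followed by LLL digits of code."""
    return f"{x:03d}{y:03d}{len(code):03d}{code}"

def opcode_probes():
    """Yield (opcode, object_code) for the 100 two-digit probes 00..99."""
    for ii in range(100):
        opcode = f"{ii:02d}"
        yield opcode, make_cell(0, 0, opcode)

probes = dict(opcode_probes())
print(probes["75"])              # 00000000275
print(make_cell(1, 1, "78080"))  # 00100100578080
```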

With that method, we find another way to crash the interpreter. Once that is fixed, we have a list of interesting answers.
75 and 73 print:

Here 111 is very close to 56x2, and 56 is the number of characters in the string displayed.
We also note that 03, 05, 13 and 15 take 56 cycles, so they are probably related to reading from this location.

For now, we will assume that the tis-101 behaves this way:
If it reads from a location that is on a border, then it reads from STDIN.
If it writes to a location that is on a border, then it writes to STDOUT.

From this, we think that 71 moves stuff down and 16 takes it from up.
If we try the following program:

000000 009 167178080

Then that program outputs the byte we give it as input, followed by P.
From that, we conclude that 1 may be related to some ACC register.

Disassembler

From that point, it becomes tiresome to disassemble by hand.
We will write a small disassembler that formats things nicely for us.

Writing the disassembler is very straightforward. Take some opcodes, disassemble them.
Then see what the structure of the program is and what holes are left; rinse and repeat.

Two main difficulties were encountered:
* Only the bottom left cell is outputting things and only the top left cell is taking input;
* The opcode 98XXX is very special

98 from hell

Up to this point, we considered the program to be made of variable-length opcodes, with two variants:
one is a two-byte opcode, and the other is a five-byte opcode.
However, the cell (0,2) resists this analysis.

000002 009 98497 51 14

If 98 is a two-byte opcode, then the next instruction is 49751, which would be MOV 256-751, LEFT, which doesn't make sense.
After many trials, we conclude that this opcode is special.
In the form 98Y, with Y between 1 and 7, it is SUB ACC, ZZZ where ZZZ uses the normal encoding.
However, in the form 98(8|9) it is SUB ACC, (-)IMM.
So the opcode takes either 3 bytes or 6 bytes.
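The length rules gathered so far are enough to split a cell's code into instructions. This is only a sketch of that first disassembler stage, not the author's tool: it assumes 90 to 94 carry a 3-digit target, any opcode whose second digit is 8 or 9 carries a 3-digit immediate, 98 follows the 3-or-6 rule above, and everything else is two digits. Opcodes not covered in this write-up may be split wrongly.

```python
def split_instructions(code: str):
    """Split a cell's code into instruction strings under the guessed lengths."""
    out, i = [], 0
    while i < len(code):
        op = code[i:i + 2]
        if op == "98":                               # special SUB opcode
            size = 6 if code[i + 2:i + 3] in ("8", "9") else 3
        elif op in ("90", "91", "92", "93", "94"):   # JMP / CONDJMP + XXX
            size = 5
        elif op[1:] in ("8", "9"):                   # +/- immediate YYY follows
            size = 5
        else:                                        # plain two-digit move
            size = 2
        out.append(code[i:i + size])
        i += size
    return out

print(split_instructions("167178080"))  # ['16', '71', '78080']
```

Mapping each piece to a mnemonic (for example 78YYY to MOV YYY, DOWN) is then a table lookup that grows as more opcodes are understood.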

Upon looking at the disassembly, we can infer the following behaviour.
Take 8 bytes of input -> Subtract them from the previous ones -> Explode into 8 bits -> Mangle the bits -> Reassemble 8 bytes from each bit column.
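The 'explode' and 'reassemble' steps amount to transposing an 8x8 bit matrix. The sketch below illustrates only those two steps: the bit order (most significant bit first) is an assumption, the subtraction step is left out, and the mangling is a placeholder identity function because its details are not given here.

```python
def explode(values):
    """Turn 8 byte values into an 8x8 matrix of bits (MSB first, assumed)."""
    return [[(v >> (7 - bit)) & 1 for bit in range(8)] for v in values]

def reassemble_columns(bits):
    """Build one byte per bit column, reading each column top to bottom."""
    out = []
    for col in range(8):
        byte = 0
        for row in range(8):
            byte = (byte << 1) | bits[row][col]
        out.append(byte)
    return out

def transform(values, mangle=lambda bits: bits):
    """Explode, apply the (placeholder) mangling, then reassemble."""
    return reassemble_columns(mangle(explode(values)))

print(transform([0xFF, 0, 0, 0, 0, 0, 0, 0]))
# [128, 128, 128, 128, 128, 128, 128, 128]
```

With the identity mangling, applying `transform` twice returns the original bytes, which is a handy sanity check when reversing each stage separately.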

Reversing

We now know what the goal is: find 16 values that, when input, will trigger the code path that reads from the secure location.
In order to reverse the algorithm, we create the following helper block:

This block basically consumes what comes from the top and then outputs it to the console.
That way, we can verify that we are reversing each step correctly.
We know that the first 8 values we should obtain are:
[6, 249, 145, 63, 239, 187, 160, 114]

Solution

Which translates to 'gr1dc0mPut1n60MG'. Success!!
We also get confirmation that this is a correct password.

Your flag is: D3naryCPUs4r3ALLth3r4g3theseD4ys
Fun fact: You should look up GreenArrays chips, they are real and have a
very similar concept!
Now go play some TIS-100 ;-).

Bonus points

We tried to make an enormous program to see whether the cells all operate in parallel, which would bypass the 5000-cycle limit at least for the first step.
However, the system limits both the size of the binary and the number of cells.