Explaining technology, software development, security, and other things. And we can't forget good BBQ.

Sunday, July 27, 2014

Calculating primes numbers on a C64

I’ve been (very) slowly working my way through various project Euler problems as a side project. For some reason, calculating primes became a fascination for me. I eventually wrote a fast, one line python script that uses a sieve. Not long after that, I decided to revive my C64 hobby.

I’ve had a Commodore since I was 10 or 11.. Started writing assembly when I was 14 or so, and soon discovered the world of demos in the late 80’s. I’ve always wanted to do demos, and even coded a few simple ones as a teenager.

So coming back to the C64 after 25 years, refreshing my 6502 limited coding skills, I set to work. I set a challenge for myself. Could I get this C64 to calculate the prime numbers up to 1,000,000. At first glance, this seems impossible. Then on a lark, I decided to add another challenge: Could I do it in less than one minute.

Firstly, the C64 only has 65536 bytes of memory total. There are 74,000+ primes less than 1,000,000. How would I store them? Even the amount of processing seems to be too much. Having to do multi-byte math on an 8bit processor. Looping over the data hundreds of times. I wasn’t too concerned by the time. I knew if I could get everything to fit in memory, given enough time, it could calculate it.

The first challenge was storing the data. Going to the disk drive for swap is notoriously slow. First thing is first, can I eliminate some of the problem set? Why yes. I can cut the problem set in half, by only needed to look at the odd numbers. So now we’re down to 500,000 numbers. Next on the elimination list is anything divisible by 5. We already eliminated anything that ends in zero, so now we eliminate anything ending in 5. This is another 100,000 numbers. So we are down to 400,000. Still way exceeding the limits of the C64.

My break through came when I figured out that I do not have to store the actual numbers. I just need something to represent each number. In this case, I need to represent if a number is prime or not. A boolean variable. A bit. Which means I can store 8 numbers in a single bit. There’s 400,000 numbers I need to store, and at 8 numbers per byte, I need 50,000 bytes. Yes folks, I can fit the data into the C64’s memory!

To access a particular bit, I used two different numbers. One is an offset, and another is the index into the byte. Each byte contains 8 numbers, but in theory actually represents 20 numbers, but with the unneeded ones removed. So the index is always under 20.

In order to save time doing loop addition in order to multiply (remember, the 6502 doesn’t have multiply or divide instructions), I decided to pre calculate some stuff. What I noticed is that there’s a pattern to the multiplication. Starting with the prime we’re working with, we actually add 2 times the prime (2 X prime) to get the next number, which will not be prime. This skips the numbers that will result in even numbers. The resulting number is really (prime X 3). If we added the (2 X prime) again, we get the result of (5 X prime). We don’t really need this, since we already eliminated all stuff time 5 in our storage. So we add the (2 X prime) again, which really means that we’re adding (4 X prime). The result is (7 X prime). Again, we add (2 X prime) to get (9 X prime). To loop back around, we go one more time, (2 X prime) to give us (11 X prime). If you continue this pattern, you’ll notice that you’re adding like this: 2X, 4X, 2X, 2X.. Here’s a couple examples:

prime = 3

3 + (2 x prime) = 9

9 + (4 x prime) = 21

21 + (2 x prime) = 27

27 + (2 x prime) = 33 (loop around to beginning)

33 + (2 x prime) = 39

39 + (4 x prime) = 51

etc...

prime = 7

7 + (2 x prime) = 21

21 + (4 x prime) = 49

49 + (2 x prime) = 63

63 + (2 x prime) = 77 (loop around top beginning)

77 + (2 x prime) = 91

91 + (4 x prime) = 119

etc...

And so on. So what I did at the beginning of each prime number, was to pre-calculate (2 x prime) and (4 x prime). Store that into a table that looks like this:

(2 x prime), (4 x prime), (2 x prime), (2 x prime)

Due to the way I’m storing the numbers, I’m actually using 3 bytes. 2 bytes for the offset, and a single byte for the index. So when I do my addition, I add the table’s index to the current value’s index, check to see if it carries, if so, increase the offset by one, then add the table’s offset. That gives me the next value I need to turn off.

I start with a prime of 3, pre-calculate my adding table described above, and then start adding. On each add, I turn off a bit. I keep adding until I equal or exceed a value of 1,000,000. In this case, that is when the offset is greater than 50,000.

Once I do that, I search the data for the next prime. In this case it’s 7. Then when it searches after 7, it see that the 9 is not prime, and continues to 11. And so on and so on until we hit the square root of 1,000,000, which is 1,000.

So thats the general algorithm that I used.

Some implementation details. First, I didn’t write the division routine. I used one that I found online, and then modified it slightly. I didn’t need the full 32 bits that it was calculating, and I also didn’t need the error check. The routine did exactly what I wanted though, it allowed me to divide by 20.

Why 20? Because thats the number of ‘numbers’ stored in a byte. This means that the quotient is my offset. The remained tell me which bit I need to turn off, but how do I go from 20 to 8 bits? With a table:

The reminder from the division is the index to this table. This tells me exactly which bit I need to work with. I look at the data in the table via the offset, and AND it with the value from this table. Then store it back in the data at the offset.

To get the speed, I played the ‘follow the carry bit’ game. I looked through my code for places where I could remove a CLC(CLear Carry) or SEC (SEt Carry) before an ADC (ADd with Carry) or SBC(SuBtract with Carry). Sometimes I reordered something so I didn’t have to do one of those. Every one I removed was 2 cycles per loop iteration.

At one point in time, I had some self modifying code. I eventually got rid of it, because it wasn’t giving me enough, and I needed to the space.

Speaking of space, this whole implementation takes 1K. Thats 1024 bytes. That also includes a summing routine that adds all the prime numbers together after the sieve is complete. I needed to keep it under 1K, because of the memory layout. If it was any larger, I would have to play games moving pieces of the app around in memory, turning off the kernel and I/O pages. It was bad enough i turned off BASIC.

All in all, the sieve run in 55 seconds. Not too shabby for a 30 year old computer running at 1Mhz. I’m sure some of the demo scene people could get it faster, probably a lot faster. This happens to be my first 6502 assembly program in nearly 25 years.

And now the source. For my development environment, I use my MacBookPro with OSX, Kick Assembler, and VICE. I’ll also show my makefile, so others can build this easily. After I got it working, I transfer it to my real, stock, c64, and ran it there as a verification. See my other blog post on how I transferred it. Also, here’s a video of it running on my machine: https://www.youtube.com/watch?v=bl8DUJAPyFU

1 comment:

I wonder what the fastest one could do this in BASIC would be - without resorting to ML subroutines and PEEK/POKE memory manipulation. Due to memory limitations, you'd have to resort to sending to disk, printer, or even modem/RS232 output. I remember writing these sorts of things as a kid, but as someone who wasn't a mathematical prodigy, more often than not my efforts were thoroughly brute-force.