Tachyum is developing a processor that it alleges will run everyday applications as well as AI code that would normally require a GPU-like hardware accelerator.
Anandtech live-blogged the biz's Prodigy CPU presentation at this week's Hot Chips conference in California, where a lot of promises were made. We don't yet know if …

The objective

Seriously, actually quite easy......

A TensorCore in an Nvidia GPU is just a 4x4 matrix multiply. That’s where almost all the deep learning “TOPS” come from.

If you took a 32-bit processor, and added a single FMA 4x4 packed INT8 matrix multiply instruction, this would easily trounce a Xeon.
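To make the proposed instruction concrete, here is a minimal software model of what a single 4x4 packed INT8 multiply-accumulate might compute. The name `mma_4x4_int8`, the operand layout, and the wide accumulator are all assumptions for illustration; nothing here comes from Tachyum, Nvidia, or Intel documentation.

```python
def mma_4x4_int8(a, b, c):
    """Hypothetical fused multiply-accumulate on 4x4 tiles: returns a @ b + c.

    a, b: 4x4 lists of INT8 values (-128..127).
    c:    4x4 list of wider accumulator values (assumed 32-bit in hardware).
    """
    for tile in (a, b):
        for row in tile:
            for v in row:
                if not -128 <= v <= 127:
                    raise ValueError("inputs must fit in INT8")
    # 16 outputs x 4 products each = 64 multiply-adds per "instruction".
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
zero = [[0] * 4 for _ in range(4)]
a = [[1, 2, 3, 4], [5, 6, 7, 8], [-1, -2, -3, -4], [127, 127, 127, 127]]

result = mma_4x4_int8(a, a, zero)
```

The point of the sketch is the operation count: one such instruction retires 64 multiply-adds, which is where the headline "TOPS" figures come from.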

Since they probably based it off an old dumbish RISC core they licensed for almost free, yes it will be smaller than a modern out-of-order ARM core.

They presumably think that the clever thing is being able to do this without handoff to an accelerator (internal or external), saving cycles and latency.

But these things are *never* about the hardware. They are about the software and ecosystem. Basically, who is going to write the equivalent of TensorRT for them, and even if that gets done, are prospective customers really going to buy into a small proprietary chip ecosystem?

But yeah, if you actually want this doing, I could probably do the hardware with a team of five or ten in less than a year.

Re: Seriously, actually quite easy......

I had second thoughts....it’s both better and worse than that....

a) Honestly, I forgot that a 4x4 matrix of INT8 values would be 128 bits. So, it needs some equivalent of an AVX wide register and (better) an additional 128-bit register load/store operation. Not exactly rocket science though.

b) There is a *major* advantage putting something like this into software. You get to use Strassen (about O(N^2.81)) or an asymptotically better algorithm (approaching O(N^2.4)) rather than the naive O(N^3) for large matrix multiplies, which I doubt you can do on a GPU (although I haven’t thought this through, someone else may know more)

c) Ummm, what’s to stop Intel doing this on their next Xeon? Once Nvidia have pipecleaned whether it is really commercial with the TensorCore, this is just another fairly trivial AVX instruction for Intel. If this is a disruptive winner, Tachyum *still* get destroyed by Intel without even a course change.
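For anyone wondering what point (b) buys you, here is a rough sketch of Strassen's recursion next to the naive triple loop, counting scalar multiplications. It assumes square matrices whose size is a power of two and recurses all the way down to 1x1, which no practical implementation would do; the helper names are made up for the example.

```python
def naive_mul(a, b, counter):
    """Textbook O(N^3) multiply; counter[0] tallies scalar multiplications."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i][j] += a[i][k] * b[k][j]
                counter[0] += 1
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Return the four quadrants (top-left, top-right, bottom-left, bottom-right)."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, counter):
    """Strassen multiply: 7 recursive products per level instead of 8."""
    if len(a) == 1:
        counter[0] += 1
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    m1 = strassen(add(a11, a22), add(b11, b22), counter)
    m2 = strassen(add(a21, a22), b11, counter)
    m3 = strassen(a11, sub(b12, b22), counter)
    m4 = strassen(a22, sub(b21, b11), counter)
    m5 = strassen(add(a11, a12), b22, counter)
    m6 = strassen(sub(a21, a11), add(b11, b12), counter)
    m7 = strassen(sub(a12, a22), add(b21, b22), counter)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the quadrants back into one matrix.
    return ([l + r for l, r in zip(c11, c12)] +
            [l + r for l, r in zip(c21, c22)])


n = 8
a = [[(3 * i + j) % 7 - 3 for j in range(n)] for i in range(n)]
b = [[(i + 2 * j) % 5 - 2 for j in range(n)] for i in range(n)]
naive_count, strassen_count = [0], [0]
c_naive = naive_mul(a, b, naive_count)
c_strassen = strassen(a, b, strassen_count)
# For n = 8: 8**3 = 512 naive multiplications versus 7**3 = 343 for Strassen.
```

Note the catch: Strassen trades multiplications for extra additions, and the intermediate sums of INT8 values no longer fit in INT8, so it is not obvious how well it maps onto a fixed-width 4x4 INT8 instruction.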

Validation & Prior art.

First, I believe that you MIGHT be able to pull off your design in a year with such a team. I don't believe for a minute that you could get it validated.

Second, your proposal (once you realized you want 128-bit registers) sounds A LOT like a key aspect of the STI Cell project from around 2004. So it's not like the industry doesn't understand this sort of idea.

But architecting off a 12-year-old spec? When architecting by spec in the first place results in years-late designs?

Compiler?

The key problem with this and similar chip designs is not the hardware, but having a working compiler that will actually let one do all the great things this chip is supposed to do. If Tachyum can demonstrate an early version of a compiler that at least sort-of works, they have a chance. For example, Tachyum wants to implement out-of-order execution on their in-order design through the compiler - far from a trivial task, and that's just one of many.
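As a toy illustration of what "out-of-order through the compiler" means, here is a sketch of greedy list scheduling for a single-issue in-order core. Everything in it is assumed for the example: the five-op program, the fixed latencies, and the machine model. It is not a description of Tachyum's compiler.

```python
def run_in_order(order, deps, latency):
    """Cycles to finish on a single-issue in-order core that stalls on unready operands."""
    ready_at = {}
    cycle = 0
    for op in order:
        start = max([cycle] + [ready_at[d] for d in deps[op]])
        ready_at[op] = start + latency[op]
        cycle = start + 1
    return max(ready_at.values())


def list_schedule(order, deps, latency):
    """Greedy static reordering: always issue the op that can start soonest."""
    ready_at = {}
    scheduled = []
    remaining = list(order)
    cycle = 0

    def earliest(op):
        return max([cycle] + [ready_at[d] for d in deps[op]])

    while remaining:
        # Only ops whose producers have already been placed are legal to issue.
        candidates = [op for op in remaining
                      if all(d in ready_at for d in deps[op])]
        best = min(candidates, key=earliest)  # ties keep program order
        start = earliest(best)
        ready_at[best] = start + latency[best]
        cycle = start + 1
        scheduled.append(best)
        remaining.remove(best)
    return scheduled


# Hypothetical program: two loads (3 cycles each), each feeding an add, then a final sum.
program = ["ld_a", "use_a", "ld_b", "use_b", "sum"]
deps = {"ld_a": [], "use_a": ["ld_a"], "ld_b": [], "use_b": ["ld_b"],
        "sum": ["use_a", "use_b"]}
latency = {"ld_a": 3, "use_a": 1, "ld_b": 3, "use_b": 1, "sum": 1}

scheduled = list_schedule(program, deps, latency)
cycles_before = run_in_order(program, deps, latency)    # 9 cycles in program order
cycles_after = run_in_order(scheduled, deps, latency)   # 6 cycles after reordering
```

The toy version works because the latencies are known up front. A real compiler does not know whether a given load will hit or miss in cache, which is a large part of why this is far from trivial.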