Something I've often dreamed about is an FPGA board in PCIe card form with a sane toolset alongside it, so I can treat it as software instead of getting advanced degrees in desktop cable management and I/O pin mapping. Does something like that exist?

What you're describing is OpenCL, and yes, it exists; both Xilinx and Intel produce toolsets. No, they aren't sane by software standards, but they're fantastic compared to hardware engineering. A card will cost you ~$10k for something you'd actually get acceleration from (https://www.xilinx.com/products/boards-and-kits/alveo/u250.h...), and you'll still need a degree in electronic engineering to produce something that convincingly accelerates your task.

Most FPGAs that are viable accelerators aren't for hobbyists, but they're also not as expensive as you might think. I can't find it anymore, but I once saw an online shop with a huge variety of 500k+ LUT FPGA modules (just the chip itself on a small PCB) for around 1000€, plus 500€ for a breakout board/mainboard. At those prices it makes more sense as an individual to invest in more CPU cores or a GPU (if your problem maps to it).

Depending on how you define "sane toolset", they do exist[1]...except they're in that class where if you have to ask the price you probably can't afford it, and it doesn't relieve the developer of the vendor's place-and-route toolchain to build the application pipeline.

I'd think it would be cool to have an FPGA in my PC for various kinds of emulation. If I want to play some old games I can use it for accurate emulation (like the MiSTer project[1]) or if I'm in a DAW and want to produce audio from some old synthesizer I can do that on the FPGA to get a more authentic sound. Likely niche but I'd be all over it.

What's proving to be a problem, though, is where this fits. If you don't have a clear need for an FPGA, just buy a normal Xeon. If you do need an FPGA, then why compromise your Xeon? Have an FPGA card, or hell, a group of FPGA cards.

The only place this makes sense is if you can think of a use case where an FPGA task needs low-latency communication with your CPU. Even with this chip, though, you have an uphill struggle, because the cache hierarchy of a Xeon makes access to memory non-deterministic, which traditionally isn't what FPGAs are designed for. It's much more difficult to design your algorithm on an FPGA to deal with arbitrary memory latency.

It doesn't have a use case, at least not yet. But easy, cheap gains are running out in general-purpose computing as we approach the 1 nm process node. Heterogeneous computing will then become more relevant, and an FPGA is a great way to do that.

Because 2 FPGA cores don't give you the same bang for your computational buck as 2-4 general-purpose cores. You're better off hanging an FPGA off a fast internal bus via an expansion card rather than trying to cram an FPGA onto a CPU die.

Think of them like graphics cards, but even more niche. Trying to stick them directly into the CPU isn't going to provide the power of a dedicated add on.

Although if they are on-die, you can benefit from the shared L2/L3 cache, the lower power and higher performance of tight CPU-FPGA coupling, and a shared memory path (lower cost than dedicated memory, although the two can compete with and starve each other if there isn't good synergy at the OS level).

Yeah, the problem is that any FPGA solution that integrates directly with the CPU cache is going to be a bit underpowered, since it has to fit on the same silicon. Even the integrated CPU/FPGA SoCs I've seen have the ARM core separated by an interconnect.

The article seems to say an FPGA on a high-latency bus can only accelerate workloads that are streamed via DMA, and implies that a general-purpose accelerator has to be closer to the CPU. Sounds like a coprocessor, like putting an FPGA into the slot where the 8087 used to be.

That made me think: why not get even closer? Why not have an FPGA as an execution unit? Modern CPUs have multiple ALUs, multiple FPUs, multiple vector units. Wouldn't it be great if an FPGA were added to that, so the instruction set becomes extensible?

The idea is too obvious to assume nobody ever thought of it. Why isn't it done?

If you want performance, you're still better off doing it through DMA transfers that bypass the CPU, because otherwise the CPU will sit waiting thousands of cycles to fetch data from the device on the other side of the bus.

And the transfers that are done by the CPU should be write-only to the bus as much as possible.

Data transfer from the host CPU to the GPU card can kill the performance of offloading. You need a hefty data-parallel kernel, with high-ish work per element, to get a speedup that's worth the data transfer costs.

GPUs worked well because you could transfer all your large art assets upfront and then only communicate your mesh and shader logic as the game ran. They don't work so well if you need frequent access to system memory.