Customers Care About Their Shit, Not Mine!

I re-factored the library of VHDL modules for the XuLA and XuLA2 boards last week (that's fancy talk for cleaning up the shit that had accumulated in it). Basically, I built a single, merged library that could be used for both types of boards even though they're based on different FPGA families.

Of course, after building the new library, I had to test it. I usually re-build the board diagnostic and run it because that design uses some of the more complicated library components, so it's more likely to pick up an error. There were a lot of easy-to-fix compilation errors, but nothing indicating any fundamental problems with the new library.

So I downloaded the compiled bitstream to a XuLA2 board as a formality, expecting to get the normal "SUCCESS" result from the diagnostic. And, of course, THAT DIDN'T HAPPEN!

My first response was to run the old diagnostic just to make sure the board wasn't actually damaged. It wasn't.

Naturally, I figured I'd bungled up some of the code and introduced an error. But I really hadn't changed the guts of any of the VHDL components, just edited some "library" and "use" statements and collected a bunch of constants into central board definition files. I had probably made an error in the definitions file, or else the definitions weren't getting passed correctly to the rest of the modules.

So I traced through the compilation log and confirmed all the constant definitions were correct and getting to where they needed to be. Then I re-compiled the old diagnostic and compared its log with the new one. The statistics on the old and new modules were identical: same number of register bits, adders, muxes, etc. However, at the end of the log, the total size listed for the new design was a few LUTs smaller than the old one. Where did those go to? No clue.

I thought the diagnostic failure might be frequency dependent: possibly the mapping of the new design to the FPGA fabric was incurring some extra delays. The diagnostic design uses a Digital Clock Manager (DCM) to multiply a 12 MHz input clock by 25 and then divide it by 3 to get a final clock of 100 MHz. By increasing the divider, I was able to run the diagnostic at lower frequencies.

As the frequency went lower, the problems got stranger. Now the design would throw an error because its JTAG-to-USB communication channel reported an incorrect signature. I would have attributed that to another frequency-dependent effect, except the communication channel runs off a separate clock and doesn't even touch the clock from the DCM. I also found I could shift the problem between the signature error and the diagnostic failure by optimizing the design for either area or speed.

Finally, I got the diagnostic to work by using the default settings for the synthesizer and place-and-route tools, except for the specific case where the DCM multiplier/divider was set to 25/6 to generate a 50 MHz clock. I thought that might happen because 50 MHz is close to the 48 MHz used internally by the PIC18F14K50 MCU that manages the USB interface on the XuLA boards. Maybe some type of weird resonance was causing the malfunction. But I tried a multiplier/divider of 29/7 to create a 49.7 MHz clock and the diagnostic ran fine with that.
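The clock numbers above all come from the same arithmetic: the 12 MHz input times the DCM multiplier, divided by the DCM divider. Here is a quick sanity check of those figures in Python. The function name `dcm_output_mhz` is made up for illustration; it is not part of any Xilinx tool or of the XuLA library.

```python
from fractions import Fraction

def dcm_output_mhz(input_mhz, multiply, divide):
    """Hypothetical helper: output frequency = input * multiply / divide."""
    # Fraction keeps the ratio exact until we choose to round it.
    return Fraction(input_mhz) * multiply / divide

# The three multiplier/divider settings mentioned in the text.
for multiply, divide in [(25, 3), (25, 6), (29, 7)]:
    freq = dcm_output_mhz(12, multiply, divide)
    print(f"{multiply}/{divide}: {float(freq):.1f} MHz")
```

This should print 100.0, 50.0 and 49.7 MHz, matching the values quoted above. It only checks the frequency math and says nothing about why 25/6 fails.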

So I was stuck. I had a problem that originated merely because I rearranged some library files, and which manifested itself differently depending upon the optimizations applied by the synthesizer and upon some very specific parameters passed to the DCM. This had the look of something that would take a long time to track down and fix.

And then I thought: Who gives a shit? Obviously I did, but did my customers? I think my customers' concerns look like this:

Are they getting reliable hardware...

at a reasonable price...

with supporting documentation and examples...

in a timely manner.

In other words, are my customers getting what they need to get their shit done? They don't care about my shit; they've got their own to worry about. They don't care if there's some strange problem with the board diagnostic; the chance of that affecting them is 1 in 10,000. They want me to be working on things that help them do their work. In a world full of severed limbs, they don't want me placing band-aids on boo-boos.

So in the end, although it bothered me, I placed a notation in the VHDL:

-- Using a MUL/DIV of 25/6 causes the diagnostic to fail for an unknown reason.

and I checked it into the repository. Maybe in the future, a customer will encounter the same problem and we will finally do the work to solve it. Or maybe Xilinx will change their synthesis software and the problem will go away. Or appear somewhere else. But probably it will lie there, dormant, never to be heard from again.