Does DDR3 have a maximum theoretical capacity per SODIMM?

I'm not asking what's on the market; I'm asking if the specification contains limitations on bus widths, address lines, chip architecture, etc. that limit how much RAM can be crammed onto a single SODIMM.

Interesting question. Bank Address (BA) lines come in threes, so you have BA0-BA2, which means we can have 2^3 = 8 banks. Calling them "Bank Addresses" rather than "Bank Address Lines" is a holdover from Ye Olden Dayes, when there was only one of them, and it almost always selected which side of the SIMM to use.

Address lines number 16 (A0 to A15) but pull double duty: they select both rows and columns, so we can have up to 65,536 rows and 65,536 columns. (There are always many more rows than columns in real DRAM.)

So we multiply 'em all up: 8 × 65,536 × 65,536 = 34,359,738,368 possible addressable locations per DDR3 DIMM (SODIMM is identical). Each addressable location is 64 bits deep and can (will) span multiple chips. Now that we know how many bits we have per addressable location, we know how many bytes (8), and so we know the maximum capacity: 274,877,906,944 bytes, or 256 GB.

Same workings for ECC DIMMs, but they're 72 bits wide and so have a higher raw capacity of 288 GB. Of course, only 256 GB of that is actual addressable memory; the rest is check bits.
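A quick sketch of that arithmetic in Python, using the limits worked out above (the 72-bit ECC width included for comparison):

```python
# DDR3 addressing limits, per the reasoning above
BANK_ADDRESS_LINES = 3    # BA0-BA2
ADDRESS_LINES = 16        # A0-A15, multiplexed for row and column

banks = 2 ** BANK_ADDRESS_LINES       # 8
rows = cols = 2 ** ADDRESS_LINES      # 65,536 each
locations = banks * rows * cols       # 34,359,738,368

for label, width_bits in (("standard", 64), ("ECC", 72)):
    total_bytes = locations * width_bits // 8
    print(f"{label}: {total_bytes:,} bytes ({total_bytes / 2**30:.0f} GB)")
# standard: 274,877,906,944 bytes (256 GB)
# ECC: 309,237,645,312 bytes (288 GB)
```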

Not really. Each slot has four chip selects (these tell the desired rank that it is active), so a slot can contain as many as four ranks (it isn't a bus and it isn't binary coded; it's a direct map, one line per rank). There is no reason I can see why a memory controller cannot have unlimited chip selects and simply route them as needed. For example, a memory controller could have sixteen chip selects routed to four slots, allowing sixteen ranks per channel.

DDR3 cannot address on a per-rank basis (as far as I can tell), but DDR4 will be able to, so the theoretical maximum capacity of a DDR4 system should be unlimited: 256 GB per rank, with the same four ranks per slot as DDR3. Believe it or not, the 512 GB limit of DDR3 dual-channel systems will become quite pressing relatively soon. With pin count at a premium in future designs, I'm expecting a future DDRx spec to change the CS lines into a binary code, allowing 16 ranks from four CS lines.
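The pin-count argument in miniature (a toy sketch; nothing here comes from any spec, it's just the counting):

```python
# Direct-mapped (one-hot) chip selects, as DDR3 uses, vs. a hypothetical
# binary-coded scheme: n lines select n ranks vs. 2**n ranks.
def ranks_one_hot(cs_lines: int) -> int:
    return cs_lines           # one CS line per rank

def ranks_binary(cs_lines: int) -> int:
    return 2 ** cs_lines      # CS lines form a binary rank address

for n in (1, 2, 4):
    print(f"{n} lines: {ranks_one_hot(n)} ranks one-hot, "
          f"{ranks_binary(n)} ranks binary-coded")
# 4 lines: 4 ranks one-hot, 16 ranks binary-coded
```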

We are considering a design for a pretty specific type of data acquisition card, and I'd like to know how much RAM we can cram onto a single 6U/4HP CompactPCI card (Eurocard form factor; same overall bounding box as VME). I'd like to have as much capacity on the board as reasonably possible, using commodity RAM if possible.

I'm looking at the double-stacked DDR SODIMM sockets in a MacBook Pro, and I'm guessing that, after allocating space for external I/O, a PCI interface chip, and a big-ass FPGA, I could probably cram 4 of those sockets onto a card, i.e. 8 SODIMMs, if the spec will physically permit that.

8 GB SODIMMs are available and cheap today, and I'm guessing there's another 2x or 4x before DDR3 runs out of steam. So if I design a card that can accommodate 8 SODIMMs, that gives me 64 GB today and potentially 128-256 GB down the road before I have to redesign.

DDR3 will be replaced by DDR4 in the 2013-14 timeframe. There's considerable difficulty clocking DDR3 higher than 1 GHz, which is why your commodity DDR3s are all 1333 (666 MHz) or 1600 (800 MHz) and have been for the last two years. JEDEC specifies a maximum of 1066 MHz, although DDR3-2133 in 11-11-11 timings (~10 ns CAS) will run you a pretty penny. We STILL can't reliably pull CAS faster than around 8 ns in volume parts, with almost all RAM on the market being around 10 ns... as it has been for the last ten years.
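For reference, the cycles-to-nanoseconds conversion behind those CAS figures (a quick sketch; the rates and CL values are just common examples):

```python
# CAS latency in ns = CL cycles / clock rate; DDR does two transfers per
# clock, so the clock is half the advertised transfer rate.
def cas_ns(transfer_rate_mt_s: float, cl_cycles: int) -> float:
    clock_mhz = transfer_rate_mt_s / 2
    return cl_cycles / clock_mhz * 1000

print(f"DDR3-2133 CL11: {cas_ns(2133, 11):.1f} ns")  # ~10.3 ns
print(f"DDR3-1333 CL9:  {cas_ns(1333, 9):.1f} ns")   # ~13.5 ns
```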

If you're engineering for a release sooner than that, you need to be talking to the guys making the memory controller, as they will determine how many slots and what densities you can use.

Unless you're going fully custom with the FPGA, you'll probably be limited by the memory controller Xilinx or whoever gives you, rather than any theoretical format limits. You might want to talk to their engineers about what they can offer you.

It'll be fully custom with an FPGA, so yeah, we probably have to talk to Xilinx/Lattice/Altera and see what sort of DDRx cores we can drop into their parts. But I wanted to get up to speed with the limitations of the specifications as well.

And we don't care about random access. It'll pretty much all be very long sequential writes and reads. Any random access will be on the order of 1 MB at a time, and when we do that we won't really care about speed.

I haven't followed this stuff closely in a long time, but is a soft memory controller even viable? I thought the reason they included fixed-function memory controllers on FPGAs was that it wasn't really feasible to make one out of programmable logic, given the analog and performance constraints.

Thanks, BTW, for the suggestion to talk to Xilinx et al. For some reason I hadn't thought to do that earlier. So I just googled "Virtex DDR3 core," and Xilinx has a dedicated page for DDRx controllers in a Xilinx part.

It's sort of a semi-soft cheese. It's a soft core, but with some dedicated bits for getting the speed.

Why are you using DRAM at all then, if you don't care about random access? With DRAM you're paying a very large premium for fast random access and speed, qualities you don't want. It's much like using a backhoe to take the kids to school!

Flash would present a massively simpler interface; it'd be much cheaper, use far less power, and fit way more gigabytes in the same space.

Generally, the reason one sticks a few GB of DDR memory on a PCI card is that they can't (or haven't yet) upgraded to PCI Express but still need to capture data much faster than the ~100 MB/s real-world PCI transfer rate allows. Since he's asking about 100 GB+ of memory, I'm guessing he's acquiring a lot faster than that.

That said, unless you really need CompactPCI, perhaps looking at CompactPCI Express would be cheaper than building such a large DRAM cache? I've had very good results using PCI-E RF boards streaming at 1500+ MB/s over PCI-E 8x. Or are you stuck interfacing with some legacy device?

The reason we're looking at DRAM is for write speed and ease of configurability/upgrading*. We want to build a card with 0 GB and lots of sockets, and then be able to populate it whenever we want to with whatever is available on the market at the time. Flash doesn't come on DIMMs, and putting a pseudo HDD interface (e.g. CF) on there adds a lot of complexity. Flash won't write nearly as fast as DRAM, and we want to go as fast as the FPGAs and external LVDS interconnects will run.

And yes, we're trying to stay off the backplane completely, for two reasons. 1) We're on legacy 32-bit/33 MHz CompactPCI, in Win32, with about a dozen other boards to support, so CPCIe isn't available to us any time soon. 2) One of our selling points is that we absolutely do not care what the heck is going on on the backplane or in the CPU, and we guarantee that we will capture every damn byte of your data. With all the RAM on the card, next to the FPGA and interconnect, there's no bus contention or CPU to worry about. Data comes in, we write it to DRAM, and all latencies are known.

The Win32 problem also limits how much RAM we can use on the CPU board, but it doesn't affect how much we can put on the card. Once we've acquired our bajillion bytes on the card, we'll just map each section and read it as needed. Speed on the back-end processing is not nearly as important as write speed on the front end, because you don't want to miss a single frame, line, or pixel when the million dollar experiment is happening.

* Of course, if there's a better/cheaper/faster solution I'm all ears!

If you're pulling frames down, all you need is to be able to write a full frame and empty it during the VBlank; your data rate is known. That lends itself well to a few frames of fast DRAM cache and then flash.

Ahh, there's the rub:

We/our customers work with all sorts of different detectors, many of which have no horizontal or vertical blanking interval. These aren't cameras; they're raw detector arrays, so we can't count on the data being in any sort of standard video format. We have to assume it's just a continuous screaming firehose of bits at X MHz.

Compression also doesn't work in our application, because a lot of the analysis is across the Z (e.g. time or frame) dimension. For example, we need to calculate the standard deviation of pixel (2000,2000) across 1000 acquired frames. If we have a big uncompressed data cube, then it's a fairly trivial operation to calculate the offsets and scan across the 1000 frames.

If the data is compressed, then I can't get to pixel (2000,2000) of frame N without decompressing the entire frame.
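That offset arithmetic, sketched in Python (the detector dimensions, file name, and dtype here are made up for illustration):

```python
import numpy as np

# Hypothetical 4096x4096 16-bit detector, 1000 frames, mapped as one raw cube.
FRAMES, HEIGHT, WIDTH = 1000, 4096, 4096
cube = np.memmap("capture.raw", dtype=np.uint16, mode="r",
                 shape=(FRAMES, HEIGHT, WIDTH))

# The offset of pixel (x, y) in frame n is just n*HEIGHT*WIDTH + y*WIDTH + x,
# so the whole Z column for one pixel is a single strided read.
z_column = cube[:, 2000, 2000]   # pixel (2000,2000) across all 1000 frames
print(z_column.std())
```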

Does mSATA work for you? The sockets are small, you can add several to a card, it is reasonably fast, and you should be able to get SATA cores for your FPGAs as easily as DDR3 cores.

He probably won't like mSATA for the same reasons he doesn't like CF - it's an entire extra layer of overhead (drive controller). Dumping bits to memory is much easier than dealing with blocks and storage and such.

Of course you can get at pixel (2000,2000) without decompressing the entire frame: you decode just the block containing it, and there's your pixel. Commodity ARM cores are fast enough to make the compression/decompression process completely realtime (fuck, so is a 68020 at 20 MHz!). Split your data into 32x32 blocks (or whatever) and Huffman-code them independently (Huffman is really easy to code; you'll want LZW for better efficiency), and your Huffman- or LZW-coded blocks are retrievable at DRAM-bottleneck speed.

You'll tailor your block size to what your decoding CPU IP block can keep in its local cache. The small blocks limit encoding efficiency, but they give you explosively fast per-pixel data access.
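A toy sketch of that scheme (zlib stands in for the Huffman/LZW coder, and the frame size and dtype are made up):

```python
import numpy as np
import zlib  # stand-in codec; the post suggests Huffman or LZW

BLOCK = 32  # compress each 32x32 tile independently

def compress_frame(frame: np.ndarray) -> dict:
    """Map (block_row, block_col) -> that tile's compressed bytes."""
    tiles = {}
    for by in range(0, frame.shape[0], BLOCK):
        for bx in range(0, frame.shape[1], BLOCK):
            tile = frame[by:by + BLOCK, bx:bx + BLOCK]
            tiles[(by // BLOCK, bx // BLOCK)] = zlib.compress(tile.tobytes())
    return tiles

def read_pixel(tiles: dict, x: int, y: int, dtype=np.uint16) -> int:
    """Fetch one pixel by decompressing only its own tile."""
    raw = zlib.decompress(tiles[(y // BLOCK, x // BLOCK)])
    tile = np.frombuffer(raw, dtype=dtype).reshape(BLOCK, BLOCK)
    return int(tile[y % BLOCK, x % BLOCK])

frame = np.random.randint(0, 4096, (4096, 4096), dtype=np.uint16)
tiles = compress_frame(frame)
assert read_pixel(tiles, 2000, 2000) == int(frame[2000, 2000])
```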
