IDT’s DDR4 RCD register and DB data buffer enable RDIMMs and LRDIMMs to reach faster speeds and deeper memory capacities. This video helps you understand the DDR4 feature enhancements of IDT’s DDR4 RCD and DB compared to earlier DDR3 technology. An introduction to some available LeCroy testing and debug tools completes the video. Presented by Douglas Malech, Product Marketing Manager at IDT, and Mike Micheletti, Product Manager at Teledyne LeCroy. To learn more about IDT’s leading portfolio of memory interface products, visit www.idt.com/go/MIP
Doug: First I’m going to compare the DDR4 RDIMM with the DDR4 LRDIMM. On the RDIMM, all command, address and control signals are buffered by the register, shown here as the device in the middle of the RDIMM labeled RCD, which stands for register clock driver. That’s the acronym used by the industry. However, as you can see, the data outputs from the DRAMs are not buffered on the RDIMM. Therefore, there can be from one to four DRAM loads presented at the RDIMM connector. This picture shows the front side of the RDIMM; there are just as many DRAMs on the back side, so you can imagine four DRAMs vertically placed where I’m showing one to four DRAM loads. These additional loads degrade signal integrity. Off to the right, on the LRDIMM, command, address and control are similarly buffered by the register, like on the RDIMM. However, the DRAM outputs are also buffered by the data buffers, shown here as the chips labeled DB. This means that there will only be one load presented at the LRDIMM connector instead of four, as there were on the RDIMM. This leads to better signal integrity on the data signals at the edge connector, also referred to as the DQ bits on the edge connector. Having fewer loads at the connector is why more LRDIMMs can be populated into your server with less degradation in performance. Imagine three RDIMMs populating a server. That would mean up to three times four loads, or twelve DRAM loads, connected onto the DQ path of the motherboard. With LRDIMMs, that would mean only three times one, or three loads, on the DQ path. Fewer loads: that’s the LRDIMM.
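The three-RDIMMs-versus-three-LRDIMMs arithmetic above can be sketched as a tiny model. The per-DIMM load counts (up to four unbuffered DRAM loads for an RDIMM, one data-buffer load for an LRDIMM) come from the discussion; the function itself is just illustrative arithmetic, not a signal-integrity model.

```python
# Back-of-the-envelope model of DQ-net loading from the RDIMM vs.
# LRDIMM comparison. Numbers come from the discussion above.
RDIMM_LOADS_PER_DIMM = 4   # up to four unbuffered DRAM loads per slot
LRDIMM_LOADS_PER_DIMM = 1  # the data buffer presents a single load

def dq_loads(num_dimms, loads_per_dimm):
    """Total DRAM loads presented on one DQ net of the motherboard."""
    return num_dimms * loads_per_dimm

# Three RDIMMs: 3 x 4 = 12 loads; three LRDIMMs: 3 x 1 = 3 loads.
```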
This concept of RDIMM vs. LRDIMM is the same for DDR4 as it was in DDR3. However, we will see later that the DDR4 LRDIMM has a better architecture improving signal integrity.
So now I’m going to start showing the advantages of DDR4 versus DDR3, and here I’m going to be talking about both RDIMMs and LRDIMMs. As far as scalability goes, DDR4 RDIMMs and DDR4 LRDIMMs use the same approach of having a central register device to buffer command and address for the memory module. However, in my opinion, DDR4 load-reduced LRDIMMs are more scalable than DDR3 LRDIMMs.
And I’m showing it here by bringing in the red checks on the top, okay? As you can see in the pictures, the same central RCD is used in both DDR4 LRDIMMs and DDR4 RDIMMs. Therefore, all of the software used to control the DRAMs through the RCD on a DDR4 RDIMM can be reused for controlling DRAMs on LRDIMMs. Hence the scalability. In addition to the register, LRDIMMs also have nine data buffers, located between the lower DRAMs and the edge connector. The data buffers are controlled through the register and intercept the memory read and write data. In a DDR3 LRDIMM, a completely different buffering device, called the memory buffer, communicates with the host controller. Therefore, completely new software must be developed to address the DRAMs on the DDR3 LRDIMM, because the register software cannot be reused. Second, DDR4 modules will eventually reach speeds of 3200 mega transfers per second, whereas DDR3 modules have topped out at 2133. Currently DDR4 is defined for operating speeds between 1600 and 2400 mega transfers per second, but there are plans to increase the speed in future products. Additionally, an incredible amount of DRAM memory is possible on these DDR4 modules. DDR4 addressing schemes are prepared to handle one terabyte of memory. In DDR3, while 64-gigabyte modules might be realizable, I don’t expect any higher densities from DDR3 module vendors.
I wanted to get back to comparing DDR3 and DDR4 again. In this case, I want to point out the DDR4 improvements in the signal flow for the DQ bits. These pictures show the DQ data paths. You can see that the DDR4 DQ data path is very consistent from column to column of DRAMs, shown here as vertical path connections. In DDR3, the trace length from one DQ bit to the next varies by between two and six inches. This means that data arriving simultaneously for two different DRAMs at the edge connector will have two very different flight times to those DRAMs. The modules calibrate out this variability in trace lengths, but signal integrity and performance are compromised. Another benefit of DDR4 is that the training algorithms that remove trace length variability in the command, address and DQ paths are completely controlled by the host controller. And finally, because these training algorithms are done completely by the host controller and the backside bus is not isolated, the host can be responsible for creating all of the training software, and therefore the training software can be standardized even if different chipsets are used from companies such as IDT.
I wanted to quickly show how the DDR4 commands flow from the host to the register, data buffer and DRAM. For the RCD, the register command words are called RCWs. You can see here that RCW writes are done directly from the host to the RCD. For reads, an RCW command is sent by the host to the RCD to move the bits into a special multipurpose register in the DRAM, and then the data that has been written into these special registers, called MPR registers, is read out of the DRAM onto the DQ bus. The data buffer works in a similar way. For the data buffer, the buffer control words are called BCWs. You can see here that writes go from the host, via the RCD, to the data buffer. For reads, a command is sent by the host to the data buffer to move the appropriate data buffer bits into a special multipurpose register located in the data buffer. Then the register’s data is read out of the data buffer onto the DQ bus. And finally, DRAM mode registers, called MRS registers, are written to from the host, via the RCD, to the DRAM. Reading from the MRS registers is also done through the same multipurpose registers inside the DRAM that are used to read out the RCD register contents. In addition, I wanted to point out that you can write to individual DRAM registers and data buffer registers, which is something that was not available in DDR3 and which helps with training. And finally, there are even parity options available to make sure that the signals flowing into the RCD from the host, as well as into the DRAMs and the data buffers, carry valid data. The difference here is that in DDR3, only the signals flowing from the host into the RCD had parity checking. There was no parity checking at the DRAMs.
This slide is meant to show you how complicated it can become when writing to RCWs, BCWs, MRS registers and so on. I’m going to let Mike and LeCroy go into the specifics of some of these protocols in their presentation, but I wanted to show them to you here. You can see the table below for details, but there are seven DRAM mode registers, MRS0 through MRS6, which are used to write control information into the DRAMs. Writing to MRS7, rank zero, side A, with address bit A12 equal to zero writes to the RCWs. Writing to MRS7, rank zero, side A, with A12 equal to one writes to the data buffers. You can write to them via the host controller. You can also write to them through a serial bus called I2C. There are different ways of writing, so, for example, you may be booting the server, and that would use MRS commands to communicate with all the chipsets on the modules. And simultaneously, you could communicate with some kind of debugging software and hardware through the I2C interface at the same time. So again, I’m going to let LeCroy go into the specifics of these protocols in their presentation, so I’ll stop here and pass the baton over to Mike.
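The MRS7 routing rule above can be reduced to a short decode sketch. This is a simplification of what Doug describes (the rank/side selection is omitted, and the return strings are descriptive labels, not spec terminology): MRS0 through MRS6 target the DRAM, while MRS7 is repurposed, with address bit A12 steering the write to either the register or the data buffers.

```python
# Simplified sketch of the LRDIMM MRS7 routing rule: MRS0-MRS6 are
# normal DRAM mode-register writes; MRS7 carries chipset control words,
# with A12 = 0 selecting the RCD and A12 = 1 selecting the data buffers.
def route_mrs(mrs_number, a12=0):
    """Return which device on the LRDIMM an MRS write is intended for."""
    if 0 <= mrs_number <= 6:
        return "DRAM"
    if mrs_number == 7:
        return "RCD (RCW)" if a12 == 0 else "DB (BCW)"
    raise ValueError("MRS number out of range")
```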
Mike: Okay, thanks for that, Doug. Well, I think it’s pretty clear that these are not your garden-variety DIMMs; DDR4 LRDIMMs are going to be pretty sophisticated devices, definitely a quantum leap in terms of capacity and performance. Definitely, there’s going to be more complexity and more testing when you are working with LRDIMMs. So for the second part of our webinar, I’m going to start by showing you how the Teledyne LeCroy bus analyzer is used to test DDR4. I’ll be using some screen captures from the analyzer throughout the presentation to show you real DDR4 traffic, and this is one way to get visibility into the DDR4 bus. And then we’ll focus the discussion around the configuration process. We’ll look at some of the different register and buffer commands that get set at power on. Along the way, we’ll see how the bus analyzer can help you troubleshoot problems that might come up.
The analyzer uses a slot interposer-style probe, which sits in line in the DIMM slot. You remove the memory, insert the probe and then put the DIMMs on top. These are actually UDIMMs in this picture. But some good news: DDR4 RDIMM and LRDIMM are going to continue to use the same connector as UDIMM, so we have a single interposer that can be used to test both unbuffered and buffered DIMMs.
The blue cable is attached directly to the Kibra 480 analyzer. The interposer probe pulls its power from the analyzer rather than from the slot, so it can capture the boot-up sequence. When your memory is getting configured, it’s going to record the command, address and control signals, allowing you to analyze the timing. You can see any state changes, and basically you see every command that gets sent to the LRDIMMs.
There’s no calibration step, so this makes it quick and easy to get started. The only thing you need to do is go in through the software and tell the analyzer what speed and latency you’re running. You specify these parameters on the left and then the software loads the exact timing intervals for that DIMM on the right. It can track up to 65 different protocol and timing violations. And it will actually trigger if it detects any errors in real-time.
Okay, there’s a full-blown timing waveform viewer that allows you to see what violations occurred. This is an example of a timing violation. Here, the controller is issuing two write commands that are too close together. You can see all the control signals, the rank, the bank; the full command is decoded. The software uses timestamps to actually show you which commands caused the violations, and you can, of course, do your own measurements. But verifying compliance at the JEDEC timing table level is really just one part of the feature set here. We have users right now doing DDR4 bring-up testing, functional testing, debugging real problems at power on. And basically, you can do all this at much lower cost than using a logic analyzer by using the LeCroy DDR bus analyzer.
Okay, let’s start by walking through the configuration sequence for an LRDIMM, which starts at power on: you apply voltage and then stay in the reset state for 200 microseconds. Just like DDR3, the host will read the SPD to get the device’s operating parameters. The command and address lines between the host and the register are trained first. This is usually initiated with a register command. This just helps make sure that the input side of the register is clocking in a clean signal. This is how all your commands are sent to the DIMM, so it’s pretty critical. The next step is to set up the register itself. This is where you see more register commands, or RCWs. We’ll look at these in detail in a second. The first MRS commands that are going to be sent to the DRAM are the typical MRS commands that are needed to set up your CAS latency and your burst length. You still need to do this at the DRAM level, right? You need to send the MRSs, and this is done per rank. Then your normal ZQ calibration commands are sent. These two stages look a lot like what occurs today with UDIMM. But then you need to train the DQ/DQS lines on the data buffers. This is, again, initiated by the host with a register command. Really, it’s an MPR override read that allows the host to pick one of several training patterns that then gets pipelined from the buffer back to the host over the data lines. So this is really for the host to optimize its receivers, and buffer chips like the IDT 0124 have the ability to be tuned independently from the DRAM. There’s a large set of special buffer control words, which Doug mentioned, that the host is going to use to customize how the data buffer operates, mostly to get better signal integrity. Then it’s going to move to the full DQ/DQS training on the DRAM side of the buffer. This is your write leveling and your read training. Then it exits to normal operation. So, definitely, the initialization sequence is more elaborate, for sure, than what we do with UDIMM.
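The power-on flow just described can be captured as an ordered checklist. The step names below are descriptive labels summarizing the walkthrough, not JEDEC terminology, and the exact grouping of training steps is an assumption for illustration.

```python
# The LRDIMM configuration sequence from the walkthrough, as an
# ordered list (labels are informal summaries, not spec names).
LRDIMM_INIT_SEQUENCE = [
    "apply voltage and hold reset for 200 microseconds",
    "read the SPD for operating parameters",
    "train the command/address lines into the register",
    "configure the register with RCWs",
    "send per-rank MRS commands (CAS latency, burst length, ...)",
    "issue ZQ calibration commands",
    "host receiver training via MPR override reads from the data buffers",
    "tune the data buffers with BCWs",
    "DRAM-side DQ/DQS training (write leveling, read training)",
    "exit to normal operation",
]
```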
It’s all done serially for each DIMM, so this could take several minutes with a fully populated system. And again, it’s all directed by the memory controller so, this needs a lot of testing.
Okay, we’re looking here at the different DIMM commands, the different commands on the host controller side that it’s going to use to set up and operate an LRDIMM. It needs a large command set, primarily to configure the register and the buffer. The first two are standard DDR commands, right? Common to UDIMM and RDIMM. For basic memory operation, you’ve got your normal reads and writes. These are, again, targeted at the DRAM. You still need to write the mode registers with the MRS commands, MRS0 through MRS6. Then we now have these MPR, or multipurpose register, reads and writes that can be sent to the DRAM or to the buffer. Memory guys kind of refer to this as a scratch pad inside the DRAM where you can read and write custom patterns for doing things like retraining and error recovery. The last two sequences are special commands; again, these are specific to RDIMM and LRDIMM. You’ve got your register control words that configure the register, and you’ve got your buffer control words that configure the data buffers. We’ll look at these in a second.
Okay, if you’re going to be involved with testing and qualifying LRDIMMs, it’s really helpful to understand the structure and flow of these commands. This is some of the stuff that Doug talked about, but I think it bears repeating. Again, standard commands, your writes, your refreshes, your MRSs, are going to flow through the register to the DRAM. But during initialization, when the host is programming the LRDIMMs, it will send this mode register seven. This MR7 is not considered a normal mode register. With LRDIMM it becomes a register or data buffer command. It’s always sent on chip select zero, and if address bit twelve is zero, then it is routed to the RCD. If address bit twelve is one, it’s routed to the data buffer over the BCOM bus. I’ll talk about that in a second, but the key point here is where the analyzer sits: it’s monitoring the command and control lines, so we can see your standard commands, and we can see all your buffer commands with the analyzer.
Let’s stay with the register commands for a second. Again, they flow from the controller to the RCD. These are usually the first commands on the bus at power on. They’re used to configure the register. We can set things like the input bus termination and parity checking. They also flow through to the rank zero DRAMs behind the register, but because it looks like an MR7, it’s ignored by the DRAM, okay? MR7 is supposed to be ignored, but just by the DRAM. The address bits, A0 through A12, identify the control word and basically carry the payload of the RCW.
What does it look like on the bus? This is an RCW; it requires three clock cycles to transmit a typical RCW command. Your chip select zero is low, okay? Your bank group one is low. So this is going to look just like an MRS. Your command pins are going to look like an MRS, where you’ve got RAS, CAS and write enable all low and activate high, and your address pins A0 through A12 contain the address and the payload of the register command. So, just like a normal MRS, there’s payload traveling on the address bus. Okay?
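The pin pattern just described can be written as a toy framing check. The encoding here is an assumption for illustration (active-low pins represented as 0 when asserted, and chip select, activate, RAS, CAS and write enable reduced to single integers); it is only meant to restate the rule that an RCW is framed like an MRS.

```python
# Toy decode of the MRS-style framing used by register command words:
# chip select 0 asserted (low), ACT high, RAS/CAS/WE all low. The
# control-word address and payload then travel on A0-A12 (not modeled).
def is_mrs_like(cs0_n, act_n, ras_n, cas_n, we_n):
    """True if the command pins show the MRS framing used by RCWs."""
    return cs0_n == 0 and act_n == 1 and (ras_n, cas_n, we_n) == (0, 0, 0)
```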
Alright, so there are 27 register control words, really more than I can list. This table just gives you a few examples; I’m actually only showing the register address bits. So these bits here are the part that contains the payload, right? The rest of the address just identifies which command. So of the 27 commands, about half use a four-bit payload and the other half use an eight-bit word size. These settings are what gets written into the register. One of the main roles of register control words is to tune the output side of the register chip. These outputs route signals, you know, your command and your address, directly to the DRAM. Alright, there are hooks to set the speed, to enable parity checking, to turn off chip selects that aren’t getting used. There are lots of controls, okay? So, for example, RC05 gives us a four-bit configuration for setting the clock driver strength. This allows the host to boost the clock signal to a higher level, maybe to compensate for a taller DRAM stack or a different raw card design. So, register commands will play an important part in getting the system configured and really running at the higher DDR4 speeds.
Alright, another quick example: RC09. There are several power-saving controls within this register control word. It’s a four-bit RCW; these are the settings. They’re configured by the host, alright? So, things like the input bus termination can be enabled or disabled. This is similar to ODT, only for the register. So the IBT resistors are actually integrated on the RCD, and they help reduce stub reflections. There’s something called CKE power-down mode, which lets the register put the DRAM channel in power down when all the chip selects are idle. Okay, so a lot of controls for the RCW, and similar controls for the buffer chips.
But getting visibility into these register commands is one of the first steps in bringing up an LRDIMM system. So, say you’re trying to debug a problem getting a DIMM into a lower power state. Okay, using the bus analyzer, it’s possible to trigger on any standard command, okay? So in this window, we’re actually seeing the RC09 as it’s defined in the spec. So, to verify that you are enabling the right bits, there is a built-in option to trigger on the RC command, including the entire 12-bit address. For RC09, all we need to do is set address bits four and seven to one. Those bits go high, it’s a simple bit mask, and now you’re set to trigger on the RC09 command.
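The bit mask just mentioned falls out of simple arithmetic. Assuming the encoding implied by the 4-bit control-word table (A[7:4] carries the control-word number and A[3:0] the payload, which is an assumption of this sketch), word 9 is binary 1001, so exactly bits A7 and A4 go high.

```python
# RC09 trigger mask worked out as arithmetic: for a 4-bit register
# control word, the word number sits in address bits A[7:4] (assumed
# layout). Word 9 = 0b1001 -> bits A7 and A4 set.
def rc_word_address_mask(word_number):
    """Address bits that select a 4-bit register control word."""
    return (word_number & 0xF) << 4
```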
Okay, so the analyzer triggers on the first occurrence. The trigger marker is right here; it’s this red line. Again, it’s usually a clock or two past the actual event. So floating the mouse over the command itself will basically give you the decoded bits. This is exactly what’s getting set, all right, with power down and IBT on. That’s getting enabled. Again, the idea behind this specific command is that this type of termination consumes a little more power but does allow faster power-down exits. So it’s something that a lot of vendors are going to use. Basically, there’s any number of RC commands that you might be interested in capturing, and the analyzer gives you an easy way to do that.
Alright, so the data buffers have their own set of commands, just like the register commands, only they’re used to customize the operation of the data buffer. They flow from the memory controller, through the register, to the data buffer itself. Like the register commands, these look like MR7s, where the address lines are really the payload. And as Doug mentioned, the physical link between the RCD and the data buffer is called the BCOM bus. It’s a nine-pin control bus that connects the register to each data buffer, and this is how the buffer commands get to the data buffer. It’s only nine pins, though, so the register can’t really pass it through as an MRS7. The register actually has to convert, or mux, the MRS7 into this BCOM format.
So, for a buffer command, the RCD maps the 12 address bits into five separate data transfers that are basically four bits wide. They’re sent over this BCOM bus back to back; it looks like this on the bus, where each buffer chip gets a BCOM command similar to this. It takes five clock cycles plus a parity cycle to complete the command. Once it gets there, the data buffer decodes the command and writes it to its function space to complete the command. So it’s a little unusual. Like most things from JEDEC, it’s a bit convoluted, but it works. The main function of this BCOM interface is to transmit your buffer control words, but the spec also requires that the register send all read/write commands and MRS information over the BCOM bus, sort of to allow the data buffers to keep tabs on what the DRAM is doing.
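The muxing step above can be sketched as nibble-slicing. The framing details are assumptions here: the two leading beats stand in for the BCOM command encoding (which the talk doesn’t spell out), the 12 address bits are split into three 4-bit beats, and the parity bit sent in the extra cycle is modeled as a simple XOR-reduction over the transmitted bits.

```python
# Sketch of the RCD's MRS7-to-BCOM conversion: slice the 12-bit BCW
# address field into 4-bit beats. Leading beats are placeholders for
# the command encoding, which is not specified in this summary.
def mrs7_to_bcom_beats(addr12):
    """Split a 12-bit BCW address field into five 4-bit BCOM beats."""
    framing = [0x0, 0x0]  # placeholder command-encoding beats
    nibbles = [(addr12 >> shift) & 0xF for shift in (8, 4, 0)]
    return framing + nibbles

def bcom_parity(beats):
    """Parity bit over all transmitted bits (scheme assumed), sent in
    the extra parity cycle after the five data beats."""
    return sum(bin(b).count("1") for b in beats) & 1
```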
So again, just to review that one more time, because it is confusing: the buffer command data flow. The host is going to want to change a parameter, typically at boot up. It sends the MRS7 to the register; the register sees that it is an MRS7 with address bit twelve set to one, so it’s a buffer command, not meant for the register itself. It converts it into a BCW and sends it out the BCOM bus to the data buffer. The DB writes it to its function space to complete the command. Okay, so normal commands, your reads, your writes, your MRS, those are going to flow directly between the host and the DRAM. So that’s one point that shouldn’t be lost on anyone; this is, again, how these LRDIMMs are going to decrease your loading and give you the higher capacity that they’re promising. So again, pretty sophisticated, no doubt about it; these are highly configurable. They allow you to customize every aspect of the termination and the signal strength on the DQ lines in both directions. There are commands to optimize power consumption on the data buffer. There are read/write delays that can be added per data line. So, no doubt, a lot of controls. The RCD carries some of this load, but it’s mostly on the host controller to get this working. So having an analyzer to get visibility into the host commands is pretty key, both to get the system configured and to reach the higher speeds that DDR4 is promising.
Okay, so that is the end of the material that I formally wanted to present. And so, by all means, if there are any questions, please type them into the chat box. We’ll certainly do our best to answer any questions. Even if they come in later, we’ll respond to you by email. Okay, well, here’s one question: someone is asking if it’s possible to configure these data buffer chips in parallel, or do they have to be configured individually? That is a good question, and I believe the answer is that, because of the sensitivity to each system, you’re going to have to configure each buffer chip individually, in a serial fashion, every time you boot up. I don’t think there is any way you’re going to be able to come up with a series of settings that will work in all environments, or even in a given environment from one boot to the next. So this configuration process is one you won’t be able to get around, as far as the current spec goes. Okay, well, it looks like we haven’t got any other questions at this instant, so I will wrap up the webinar now. I certainly appreciate you joining us today. Apologies for the audio glitch at the beginning, and we look forward to having you join us next time for our next topic, which will be coming out next quarter. We’ll certainly be notifying you about that. Thanks again.