mcf52259 FEC RXF ISR stops sometimes

We have a project where, after some unknown event, the FEC stops triggering the RXF interrupt. It can take minutes or days before the event happens. Wireshark captures haven't shown anything suspicious. Tx is still working; in fact, our clue that there is a problem is that the device refreshes its ARP table on its 20-minute boundary but doesn't see the response.

Things I've checked when this occurs:

EBERR is 0

RXF is 0

RXB is usually 1, but that mask is not set

ETHER_EN is 1

RDAR is 1

Also, not every device seems to do this. Out of 100 devices probably 5-10 exhibit this behavior.

> A: I don't think it's the PHY due to the fec_RXDV pin continually sending pulses.

There are 16 FEC-related pins on the CPU. See if you can see a difference (in level, frequency, slope) between "working" and "not working".

Do you have any code monitoring the LINK bit in the PHY over the MII? Just in case some of its internal bits flip when it goes wrong it might show something. You must have something monitoring the PHY (to get 10/100 MHz auto-negotiation information for matching the FEC to the PHY), but I noticed the MII interrupt wasn't enabled. Is it polling with a timer?

We have some very simple code monitoring the link. In case this is useful for someone else looking to do this, we periodically send this:

// start a fresh read
MCF_FEC_MMFR = MCF_FEC_MMFR_ST_01 | MCF_FEC_MMFR_OP_READ |
               MCF_FEC_MMFR_PA(phy_addr) | MCF_FEC_MMFR_RA(1) | MCF_FEC_MMFR_TA_10;

And the MII ISR does this:

void fec_mii_isr( void )
{
    MCF_FEC_EIR = MCF_FEC_EIR_MII;                  // acknowledge the MII event
    fec_link_stat = ( MCF_FEC_MMFR & 0x04 ) != 0;   // bit 2 of PHY register 1 = link status
}
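For anyone without the MCF_FEC_MMFR_* header macros, here is a rough sketch of how that management frame is put together and how the link bit is pulled out of the result. The macro and function names (MMFR_*, mii_read_frame, mii_link_is_up) are made up for illustration; the field positions are the standard MII management frame layout as I understand it, so check them against your reference manual.

```c
#include <stdint.h>
#include <stdbool.h>

/* Field layout of the FEC MII management frame register (names are illustrative). */
#define MMFR_ST_01    (0x1u << 30)                      /* start of frame, always 01 */
#define MMFR_OP_READ  (0x2u << 28)                      /* opcode 10 = read          */
#define MMFR_PA(x)    (((uint32_t)(x) & 0x1Fu) << 23)   /* PHY address               */
#define MMFR_RA(x)    (((uint32_t)(x) & 0x1Fu) << 18)   /* PHY register number       */
#define MMFR_TA_10    (0x2u << 16)                      /* turnaround, always 10     */

#define PHY_REG_STATUS     1u        /* basic status register          */
#define PHY_STATUS_LINK    0x0004u   /* bit 2 of the status register   */

/* Build the word that starts a read of one PHY register. */
static uint32_t mii_read_frame(uint8_t phy_addr, uint8_t reg)
{
    return MMFR_ST_01 | MMFR_OP_READ | MMFR_PA(phy_addr) | MMFR_RA(reg) | MMFR_TA_10;
}

/* Decode the link bit from the value read back after the MII event fires. */
static bool mii_link_is_up(uint32_t mmfr_after_read)
{
    return (mmfr_after_read & PHY_STATUS_LINK) != 0;
}
```

One thing to keep in mind: the link status bit in the PHY status register latches low, so after a link drop the first read can still report "down" even if the link has come back. Reading it on every timer tick, as above, takes care of that on the following read.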

> > Have you found a way to recover from this?

> A: No. it requires a chip reset (restarting the debugger works too), which does a hard reset on the PHY.

So a reset of the FEC doesn't fix it. So what does a chip reset do that resetting the FEC doesn't? It reprograms the GPIO pins.

You should check to see if any of the GPIO registers that control assignment of the 16 FEC pins have flipped, been corrupted, or been stomped on by a rogue pointer somehow.

"Table 2-1. Pin Functions by Primary and Alternate Purpose" shows the FEC pins are PTI and PTJ (selected by PTIPAR and PTJPAR) as well as being affected by PSRRH and PDSRH.

You have a powerful tool there. In the screen snapshot you provided the debugger helpfully highlighted the I/O registers that had changed since the last time it sampled them. That removes the drudgery of manually comparing or decoding them all. That didn't show anything interesting with the FEC registers, so I suggest you see if any *OTHER* I/O registers changed.

Does the register window only show changes in the ones showing in the window, in the ones where that part of the tree is "open" or does it sample and list ALL of them? It might show a change that is causing this.

You could also get the "foo" code to call the function that reprograms all the GPIO pins, and see if that fixes it.
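If the debugger view turns out to be awkward for this, the same comparison can be done in code. This is only a sketch: it assumes you copy the pin-assignment registers (PTIPAR, PTJPAR and so on) into an array right after init and again when the fault shows up. The function name changed_regs and the array layout are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Compare a snapshot taken right after init with one taken at fault time.
 * Returns a bitmask with bit i set when entry i differs (first 32 entries only). */
static uint32_t changed_regs(const uint32_t *at_init, const uint32_t *now, size_t n)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < n && i < 32; i++) {
        if (at_init[i] != now[i])
            mask |= (uint32_t)1 << i;
    }
    return mask;
}
```

A non-zero result tells you which snapshot entries to look at, and calling the GPIO init function again at that point would show whether restoring them brings reception back.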

You should be reading the FEC event register (EIR), writing the exact value you read back into EIR (its bits are write-1-to-clear), and THEN checking the event bits in the value you read to see what you should be doing (like scanning through the Receive BDs). That's the safest and most efficient way to handle the events without losing any.

Are you sure you're not taking the interrupt, scanning the (one or more ready) received buffer descriptors and THEN clearing the interrupt? That's guaranteed to lose them. You can also get multiple receives and only one interrupt, so you have to read "all that are there" and not just one.
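To make the ordering concrete, here is a host-side sketch of "acknowledge first, then read all that are there". The event register is modelled as a plain variable with write-1-to-clear behaviour, and the names (rx_bd_t, eir_ack, rx_drain, rx_isr_body) are invented for the example. The E and W bit positions are the usual FEC receive descriptor bits; verify them against your headers.

```c
#include <stdint.h>

#define EIR_RXF  0x02000000u   /* receive frame event                     */
#define BD_E     0x8000u       /* RxBD "empty": descriptor owned by FEC   */
#define BD_W     0x2000u       /* RxBD "wrap": last descriptor in ring    */

typedef struct {
    uint16_t status;
    uint16_t length;
    uint32_t data;             /* buffer address */
} rx_bd_t;

/* Read the pending events and clear exactly those bits. On hardware this
 * is a read of EIR followed by writing the same value back. */
static uint32_t eir_ack(uint32_t *eir)
{
    uint32_t events = *eir;
    *eir &= ~events;           /* models the write-1-to-clear write-back */
    return events;
}

/* Consume every filled descriptor starting at *next; returns the count. */
static unsigned rx_drain(rx_bd_t *ring, unsigned ring_len, unsigned *next)
{
    unsigned frames = 0;
    while (frames < ring_len && !(ring[*next].status & BD_E)) {
        /* ... hand ring[*next] to the network stack here ... */
        ring[*next].status = (uint16_t)((ring[*next].status & BD_W) | BD_E);
        *next = (ring[*next].status & BD_W) ? 0u : (*next + 1u) % ring_len;
        frames++;
    }
    /* On hardware, write RDAR here so the FEC rescans the ring. */
    return frames;
}

/* Interrupt body: acknowledge first, then process whatever is ready. */
static unsigned rx_isr_body(uint32_t *eir, rx_bd_t *ring,
                            unsigned ring_len, unsigned *next)
{
    uint32_t events = eir_ack(eir);
    if (events & EIR_RXF)
        return rx_drain(ring, ring_len, next);
    return 0;
}
```

Because the event is cleared before the ring is scanned, a frame that lands during the scan either gets picked up by the loop or raises a fresh interrupt. Clearing after the scan is what opens the window where one gets lost.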

Thanks for the reply. Our receive interrupt is "stand alone". Each interrupt has its own vector, so we don't need to figure out why we are in the interrupt. So (as can be seen in the code snippet below) we immediately clear the EIR bit for this interrupt.

__declspec(interrupt) static void fecISR_RXF ( void )
{
    devCtrl->eir = 0x02000000; // clear the bit
    // stuff a message to send to the handler task
    ...
}

Attached is a screen shot showing the FEC registers and the buffer descriptors. From what I can understand, the buffers are pretty much all free. No dropped frames etc.

As a side note, we are sending frames just fine. The device is sending an ARP request for the gateway repeatedly, which is responding each time. But we don't see the answer.


Is that 5-10 out of 100 under identical test conditions in your shop, or at 100 different customer locations (receiving different data patterns)?

The usual then. Check all the voltages. Check for noise. Add lots more bypass caps to a failing unit. Check for crystal/oscillator stability. Check for undershoots/overshoots on the external pins (between the CPU and the PHY). Run hot and cold. Run at low and high supply voltages. Change the clock speed (if you can). Play with the Arbiter settings.

Hit it with "sudo ping -f -l 20 -s 100" from a Linux box with small and large packet sizes and see if you can get it to fail more often.

> From what I can understand they buffers are pretty much all free.

I'll analyse that next, but the only thing that should stop the receive interrupts, is either there's no data arriving (something went wrong with the PHY or the pin programming) or the receiver is "stuck" on a full descriptor and is waiting for you to read it. The latter can happen if the ring is full, or if it is just out of sync.

First I want to check the EMRBR and RCR[MAX_FL] against the programmed buffer sizes.

That's all fine. You're using 8 Receive Descriptors, all are empty and the last one has the "Wrap" bit set.

What about RDAR? It is 0x01000000, so the FEC thinks it has free descriptors.

I still think your PHY has locked up or you've got a hardware problem external to the chip. Are you monitoring for Link Status through the MII? Do you have a Link LED connected to the PHY?

You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5". That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.

Have you found a way to recover from this? Does disabling and enabling the FEC fix it (ECR[ETHER_EN])? How about resetting and reprogramming (ECR[RESET])? Do you have a reset control to the PHY? I see you have a "foo" variable in the code that looks like you're using it to force reset and reprogramming. What have you found using that?

Can you try and test it with Internal Loopback? Can you switch it to Loopback when it has failed?

A: I only have one in the shop that will do this "consistently": sometimes every 20 minutes, sometimes not for a week. Most are out in the wild at customer locations.

>Check all the voltages. Check for noise. Add lots more bypass caps to a failing unit. Check for crystal/oscillator stability.

A: I did check the supply, 3.3 V with a DMM, but you make me think I should scope it; it could be sag/spike related.

>Check for undershoots/overshoots on the external pins (between the CPU and the PHY).

A: I did put a scope on the fec_RXDV pin on the MII interface, to see if the PHY was stopped. The pulses look good on that one, and they never stopped.

>Run hot and cold.

A: did this, couldn't find a correlation.

>Run at low and high supply voltages.

A: this would be a challenge, I'll look into it.

>Change the clock speed (if you can).

A: I'll look to see what our PLL is set to. The xtal is 25 MHz, the CPU runs at 50 MHz.

>Play with the Arbiter settings.

A: I'll look into this, not sure what/where it is.

>but the only thing that should stop the receive interrupts, is either there's no data arriving (something went wrong with the PHY or the pin programming) or the receiver is "stuck" on a full descriptor and is waiting for you to read it. The latter can happen if the ring is full, or if it is just out of sync.

A: I don't think it's the PHY due to the fec_RXDV pin continually sending pulses. And as you verified, the BD are all marked as empty.

>I still think your PHY has locked up or you've got a hardware problem external to the chip.

A: The PHY isn't completely locked up, since the system is sending ARP requests, and we are getting fec_RXDV pulses for new frames.

>Are you monitoring for Link Status through the MII?

A: no

>Do you have a Link LED connected to the PHY?

A: yes link is on, and the activity LED blinks with every TX frame the device sends. It is not designed to blink with RX packets.

>You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5".

>That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.

A: The CPU runs at 50 MHz. I'll look at the MSCR setting to see what is more appropriate.

>Have you found a way to recover from this?

A: No. it requires a chip reset (restarting the debugger works too), which does a hard reset on the PHY.

>Does disabling and enabling the FEC fix it (ECR[ETHER_EN])? How about resetting and reprogramming (ECR[RESET])?

>I see you have a "foo" variable in the code that looks like you're using it to force reset and reprogramming. What have you found using that?

A: nope, and nope. yeah, the foo variable is to force the re-init of the fec and phy, but to no avail.

>Do you have a reset control to the PHY?

A: yes, thanks for making me look. I'll try playing with this in addition to the above re-initing the fec and phy.

>Can you try and test it with Internal Loopback? Can you switch it to Loopback when it has failed?

A: I hadn't thought of this, I'll try and see what happens.

I have some homework to do. It'll probably be a day or so to check all this out. However, if you see anything in my current answers that points to something, please let me know.

>You've got MSCR set to 0x0000000A which means MSCR[MII_SPEED] is "5".

>That's the recommended value for a System Clock of 25MHz. Is that what you're running at? If you're running at 66 or 80MHz you should change that divider.

So, as I said, the CPU runs at 50 MHz. The driver code tried to write the value 0xA into the register, but since MII_SPEED is shifted left by one bit within MSCR, the value 5 ended up in MII_SPEED.

5 would make the MDC clock 5 MHz, which is out of spec. I've fixed the driver to write an actual 0xA into MII_SPEED, and am testing now.
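For anyone else checking their own setting, here is a small sketch of the arithmetic. It assumes MDC = system clock / (2 x MII_SPEED), which matches the numbers quoted in this thread (5 for 25 MHz, and 5 MHz out of 50 MHz with a value of 5), and a 2.5 MHz ceiling for MDC. The function names are made up for illustration.

```c
#include <stdint.h>

#define MDC_MAX_HZ  2500000u   /* upper limit for the MII management clock */

/* Smallest MII_SPEED that keeps MDC at or below 2.5 MHz (rounds up). */
static uint32_t mii_speed_for(uint32_t sysclk_hz)
{
    uint32_t div = 2u * MDC_MAX_HZ;
    return (sysclk_hz + div - 1u) / div;
}

/* MII_SPEED sits in MSCR bits 6..1, so the field value is shifted left by one. */
static uint32_t mscr_value_for(uint32_t sysclk_hz)
{
    return (mii_speed_for(sysclk_hz) & 0x3Fu) << 1;
}

/* Resulting MDC frequency for a given field value; 0 means MDC is off. */
static uint32_t mdc_hz(uint32_t sysclk_hz, uint32_t mii_speed)
{
    return mii_speed ? sysclk_hz / (2u * mii_speed) : 0u;
}
```

With a 50 MHz clock this gives MII_SPEED = 10, so the full register value is 0x14 rather than 0x0A, and mdc_hz(50000000, 5) comes out at 5 MHz, which is the out-of-spec case described above.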


I'm assuming you found a genuine dry or cracked joint, verified it visually, verified by bending the board (or flexing the joint) and then fixed by resoldering. My suggestion of "heat it, cool it" might have found that. I didn't suggest "bend it" because you said you had lots of other units with the "same fault".

> Argh! I think the fact that the problem would go away after a reset was leading me astray.

I can't see how a reset would fix a bad solder joint, unless you have a reset push-button on the board and you bent the board while resetting it. Or you reset it by physically unplugging it and mechanically stressed the board while doing that. But you said "reset from the debugger" fixed it, so that's not the case.

The bad joint doesn't explain all the other units. They must have different faults to the one with the bad joint. It would be a very strange hardware design or production problem that consistently creates one specific bad joint.

It might still be possible that you have some tracks shorted together. You might have some other pin on the CPU shorted to a FEC signal (maybe even RX_CLK), and when that pin is programmed as an output, there's a "fight" between the PHY pin and the CPU pin and the resulting signal exceeds the receiver threshold on some boards at some times and not on others. Throwing a bad joint into that mix would certainly show a bad signal level. Shorted tracks could be a design/production problem.