(usr-tc) Crashing HiperARC
OK, this is kind of odd. Got a HiperARC that's rebooting itself periodically. The card's been replaced already, flash has been upgraded to 5.1.99-8 (was acting up with 5.0.9 as well), and I've tried setting up a sniffer to see if anyone's throwing some sort of DoS attack at it (nothing showed up). Did get a crashdump, though. We've got a case open with 3Com, so I'm posting here in hopes that one of the more clueful 3Com techs will pick up on it.

Support recommended replacing the HARC (did that), reflashing (did that), rebuilding the config (did that too), swapping power supplies (did that). They now say our power supplies aren't big enough, and we need to have 130 amp ones for this configuration. Possible, but I kind of doubt it, since they say that one 130 is sufficient, but we currently have two 70s, and removing either one of them and running the chassis on the other makes no noticeable difference to the frequency of the crashes. Nothing interesting gets logged to syslog or RADIUS accounting, and nothing of interest shows up on the console before the crash dump.

Configuration is one ARC, 14 DSPs on PRI, dual 70A power supplies. Routing IP only, PAP/RADIUS authentication, OSPF. NMC and DSPs are at the code versions required for compatibility with 5.1.99 according to 3Com's site: 2.1.9 across the board on DSPs, 8.0.9 on HiperNMC.

This hub had been stable for months, and just started rebooting a few weeks ago. No configuration or firmware changes had been made to the ARC, although we have added a few more elements to the network it sits on, including one new OSPF-speaking router. Looked over the config on that, and don't see anything that could be causing it to spew bad OSPF packets.

One other thing I have noticed -- the past few times it's blown up, my RADIUS server logs an authentication request with no username received from the ARC.
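To make the empty-username observation concrete, here is a minimal Python sketch of a PAP Authenticate-Request in the RFC 1334 layout, built with a zero-length Peer-ID. It is only an assumption that the blank RADIUS request corresponds to a packet like this; nothing in the logs confirms it. The function names are hypothetical, and the sketch only builds and decodes bytes, it does not send anything.

```python
import struct

PAP_PROTOCOL = 0xC023   # PPP protocol number for PAP (RFC 1334)
PAP_AUTH_REQUEST = 1    # Authenticate-Request code


def build_pap_request(identifier: int, peer_id: bytes, password: bytes) -> bytes:
    """Build the PAP payload (without PPP framing) for an Authenticate-Request.

    Hypothetical helper for illustration only.
    """
    if len(peer_id) > 255 or len(password) > 255:
        raise ValueError("Peer-ID and password are limited to 255 octets each")
    data = bytes([len(peer_id)]) + peer_id + bytes([len(password)]) + password
    length = 4 + len(data)  # code + identifier + 2-byte length + data
    return struct.pack("!BBH", PAP_AUTH_REQUEST, identifier, length) + data


def parse_pap_request(packet: bytes):
    """Decode an Authenticate-Request into (identifier, peer_id, password)."""
    code, identifier, length = struct.unpack("!BBH", packet[:4])
    if code != PAP_AUTH_REQUEST:
        raise ValueError("not an Authenticate-Request")
    if length != len(packet):
        raise ValueError("length field does not match packet size")
    peer_len = packet[4]
    peer_id = packet[5:5 + peer_len]
    pw_len = packet[5 + peer_len]
    password = packet[6 + peer_len:6 + peer_len + pw_len]
    return identifier, peer_id, password


# A request with an empty username (zero-length Peer-ID).
empty_user = build_pap_request(1, b"", b"secret")
```

A zero-length Peer-ID still yields a well-formed packet, so a test harness would probably also want variants with a mismatched length field or truncated data to cover the genuinely invalid cases.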
I don't know yet if the empty-username request is a cause or a side-effect of the thing crashing, but I'm soon going to do some testing with throwing invalid LCP/PAP packets at it and see if I can provoke it into bombing.

EXCEPTION 0200 CRASH DUMP:

GPRs:
R0:  0x00385584  R1:  0x07F55E00  R2:  0x000F19E0  R3:  0x02B1A2F0
R4:  0x0000000F  R5:  0x01938430  R6:  0x001A2001  R7:  0x00000001
R8:  0x3F91CE34  R9:  0x00000000  R10: 0x00000016  R11: 0x00000000
R12: 0x01CB7AB0  R13: 0x000FBA20  R14: 0x00982320  R15: 0x00982304
R16: 0x0098230C  R17: 0x009822F8  R18: 0x00982300  R19: 0x00982310
R20: 0xFFFFFEFF  R21: 0x00C03B50  R22: 0x00000001  R23: 0x00000000
R24: 0x00000000  R25: 0x00000001  R26: 0x032584B0  R27: 0x07F55F20
R28: 0x01F86370  R29: 0x02B1A2F0  R30: 0x00981A64  R31: 0x04010801

SPRs:
CR:    0x40000400  XER:   0x20000000  LR:    0x00385584  CTR:   0x007EE224
SRR0:  0x003B4F10  SRR1:  0x00089030  DSISR: 0x00000000  DAR:   0x00000000
DMISS: 0x00000000  DCMP:  0x00000000  HASH1: 0x00010000  HASH2: 0x0001FFC0
IMISS: 0x00000000  ICMP:  0x00000000  RPA:   0x00000000  IABR:  0x00000000

82660 Registers:
Err Status 1: 0x20, Err Status 2: 0x00, CPU Err: 0x72, PCI Err: 0x06
CPU/PCI Addr: 0x0F0E7368, Sys Error Addr: 0x0F0E7368

Call Stack:
0x003B4F10 (Exception return address - SRR0)
0x00385584
0x004BDDC0
0x004C0D40
0x004B58F0
0x004B2688
0x007E290C
0x007E2AB4
0x002008D4
0x0020024C
0x0020009C
0x000A7D80

BOOT PROM Version 1.16 (Built on June 9th, 1998 at 12:24:24)

-
To unsubscribe to usr-tc, send an email to "majordomo@xmission.com" with "unsubscribe usr-tc" in the body of the message. For information on digests or retrieving files and old messages send "help" to the same address. Do not use quotes in your message.
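For anyone who wants to compare crash dumps from several reboots, the register dump in the original post can be turned into structured data with a few lines of Python. This is a rough sketch that assumes the dump text always follows the `NAME: 0x...` layout shown in the post; the function names are hypothetical and the sample is only an excerpt of the posted dump.

```python
import re

# Excerpt of the dump from the post; paste the full console text in practice.
SAMPLE = """
GPRs: R0: 0x00385584 R1: 0x07F55E00 R2: 0x000F19E0 R3: 0x02B1A2F0
SPRs: CR: 0x40000400 XER: 0x20000000 LR: 0x00385584 CTR: 0x007EE224
SRR0: 0x003B4F10 SRR1: 0x00089030
82660 Registers: Err Status 1: 0x20, Err Status 2: 0x00, CPU Err: 0x72, PCI Err: 0x06
Call Stack: 0x003B4F10 (Exception return address - SRR0) 0x00385584 0x004BDDC0
BOOT PROM Version 1.16 (Built on June 9th, 1998 at 12:24:24)
"""

# All-caps register names (R0, LR, SRR0, ...) followed by a hex value.
# Mixed-case labels such as "CPU Err" are deliberately not matched.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]*):\s*(0x[0-9A-Fa-f]+)")


def parse_registers(text):
    """Return a dict mapping register name to integer value."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def parse_call_stack(text):
    """Return the call-stack addresses, in the order they were printed."""
    _, _, tail = text.partition("Call Stack:")
    tail = tail.split("BOOT PROM")[0]  # stop before the boot banner
    return [int(addr, 16) for addr in re.findall(r"0x[0-9A-Fa-f]{8}", tail)]


regs = parse_registers(SAMPLE)
stack = parse_call_stack(SAMPLE)
```

In the posted dump the first call-stack entry equals SRR0 and the second equals LR, which matches the dump's own "(Exception return address - SRR0)" annotation. Diffing the parsed stacks from several crashes would show whether the hub dies in the same place each time.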
We have the same problem you describe on two HiPer chassis (four ARCs), and have come a long way with an open case at 3Com. The latest word on that case is "possible memory corruption", and that if the problem is still present on their newest code (V5.1.94/Non-Encr) they will start building a fix. :((

=== cut
I was able to have our developers look at the crash dumps and for them, it seems like this is a memory corruption. They are suggesting to have you try our Emergency code of 5.1.94-1 which addressed many of these kinds of issues. However, you must understand that this code may not fix your particular problem. If, for some reason, this code does not fix the memory issue, I will escalate this case.
====

Since loading this code on one of our ARCs (4 days ago) we have had no reboots. We are waiting and watching. On Monday I will load this code on the other three of our ARCs and watch again. (With this problem, going 14 days without a reboot after loading new code is normal.)

Good luck!

======================
Andrey Zimin | AVZ7-RIPE
MTU-Intel ISP
Moscow, Russia
======================

----- Original Message -----
From: "ROC Services" <roc@itol.com>
To: <usr-tc@lists.xmission.com>
Sent: 25 January 2001, 18:57
Subject: (usr-tc) Crashing HiperARC