Mellanox 10Gbit ConnectX-3 EN Crashing
Hello everyone.

I have a Mellanox 10Gbit ConnectX-3 EN CX311A (with optical fiber SFP+ ubi modules) and I thought I'd try it on the ROCKPro64 board.

First of all, every official Armbian image with a 5.x kernel gets stuck at "Starting kernel" during boot. No errors or anything.

All 4.x kernel images boot fine and the card is recognized, but no driver is available (lspci OK, no driver).

Long story short, I compiled an Armbian image with kernel 4.4.213 and the Mellanox driver enabled as a module. I can see the card (lspci OK, driver OK), but it's impossible to get an IP address.

After a lot of tests I got the card to obtain an IP address with this image:
buster-minimal-rockpro64-0.10.12-1184-arm64.img.xz from ayufan
with this kernel: 5.11.0-rc4-1147-ayufan-gbf2a8ef692d2


Code:
root@rockpro64:/home/rock64# lspci -vvv
00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd RK3399 PCI Express Root Port (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 79
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00000000-00000fff [size=4K]
        Memory behind bridge: fa000000-fabfffff [size=12M]
        Prefetchable memory behind bridge: 00000000-000fffff [size=1M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee30040  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
                Vector table: BAR=0 offset=00000000
                PBA: BAR=0 offset=00000008
        Capabilities: [c0] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
                LnkSta: Speed 5GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Off, PwrInd Off, Power+ Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL+ CmdCplt- PresDet- Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCap: CRSVisible-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
                         10BitTagComp-, 10BitTagReq-, OBFF Via message, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, LN System CLS Not Supported, TPHComp+, ExtTPHComp-, ARIFwd+
                         AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd+
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
                RootCmd: CERptEn- NFERptEn- FERptEn-
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
        Capabilities: [274 v1] Transaction Processing Hints
                Interrupt vector mode supported
                Device specific mode supported
                Steering table in TPH capability structure
        Kernel driver in use: pcieport
lspci: Unable to load libkmod resources: error -12

01:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Mellanox Technologies MT27500 Family [ConnectX-3]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 78
        Region 0: Memory at fa800000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at fa000000 (64-bit, prefetchable) [size=8M]
        Expansion ROM at fa900000 [virtual] [disabled] [size=1M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
                Product Name: CX311A - ConnectX-3 SFP+
                Read-only fields:
                        [PN] Part number: MCX311A-XCAT
                        [EC] Engineering changes: A9
                        [SN] Serial number: MT1502K00886
                        [V0] Vendor specific: PCIe Gen3 x4
                        [RV] Reserved: checksum good, 0 byte(s) reserved
                Read/write fields:
                        [V1] Vendor specific: N/A
                        [YA] Asset tag: N/A
                        [RW] Read-write area: 109 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 253 byte(s) free
                        [RW] Read-write area: 252 byte(s) free
                End
        Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
                Vector table: BAR=0 offset=0007c000
                PBA: BAR=0 offset=0007d000
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [148 v1] Device Serial Number e4-1d-2d-03-00-6d-89-a0
        Capabilities: [154 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [18c v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: mlx4_core
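
For anyone comparing their own output, here is a minimal Python sketch (my own helper, not part of lspci or the Mellanox tools) that pulls the LnkCap and LnkSta speed and width per device out of saved `lspci -vvv` text. The name `link_summary` and the trimmed `SAMPLE` are only for illustration, and it assumes device lines start with a plain bus:device.function address without a PCI domain prefix.

```python
import re

# Trimmed sample taken from the lspci -vvv output in this post.
SAMPLE = """\
00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd RK3399 PCI Express Root Port (prog-if 00 [Normal decode])
                LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
                LnkSta: Speed 5GT/s (ok), Width x4 (ok)
01:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
                LnkCap: Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited
                LnkSta: Speed 5GT/s (downgraded), Width x4 (ok)
"""

DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+GT/s).*?Width (x\d+)")


def link_summary(lspci_text):
    """Map each device address to its LnkCap / LnkSta (speed, width) pairs."""
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        # Unindented lines either start a new device or are stray messages.
        if line and not line[0].isspace():
            m = DEVICE_RE.match(line)
            current = m.group(1) if m else None
            continue
        m = LINK_RE.search(line)
        if m and current:
            devices.setdefault(current, {})[m.group(1)] = (m.group(2), m.group(3))
    return devices


summary = link_summary(SAMPLE)
for dev, links in summary.items():
    print(dev, "capable:", links.get("LnkCap"), "negotiated:", links.get("LnkSta"))
```

On the output posted here this reports the card as capable of 8GT/s x4 but linked at 5GT/s x4, which matches the root port's own 5GT/s capability.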

Code:
rock64@rockpro64:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether mac brd ff:ff:ff:ff:ff:ff
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether mac brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.102/24 brd 192.168.2.255 scope global dynamic enp1s0
       valid_lft 259139sec preferred_lft 259139sec
    inet6 fe80::e61d:2dff:fe6d:89a0/64 scope link
       valid_lft forever preferred_lft forever

Now the problem is that the system is extremely unstable.

If I try to download something, the system crashes after 2-3 seconds. Sometimes it freezes completely, but most of the time the card shuts down after these messages:

Code:
[   57.444678] ------------[ cut here ]------------

[  36.631026] mlx4_core 0000:01:00.0: mlx4_cmd_post:cmd_pending failed
[   36.631603] mlx4_core 0000:01:00.0: device is going to be reset
[   37.671544] mlx4_core 0000:01:00.0: device was reset successfully
[   37.672117] mlx4_en 0000:01:00.0: Internal error detected, restarting device
[   37.672823] mlx4_core 0000:01:00.0: command 0x49 failed: fw status = 0x1
[   43.191166] mlx4_core 0000:01:00.0: mlx4_restart_one_up: ERROR: mlx4_load_one failed, pci_name=0000:01:00.0, err=-22
[   57.445177] NETDEV WATCHDOG: enp1s0 (mlx4_core): transmit queue 0 timed out
[   57.445969] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:443 dev_watchdog+0x328/0x330
[   57.446713] Modules linked in: snd_soc_hdmi_codec dw_hdmi_i2s_audio dw_hdmi_cec hci_uart rockchipdrm btqca dw_mipi_dsi btbcm dw_hdmi btintel analogix_dp panfrost cec bluetooth gpu_sched drm_kms_helper snd_soc_simple_card rockchip_rga snd_soc_audio_graph_card pwm_fan drm syscopyarea snd_soc_simple_card_utils ecdh_generic sysfillrect ecc sysimgblt drm_panel_orientation_quirks dw_wdt fb_sys_fops snd_soc_rockchip_i2s rfkill videobuf2_dma_sg snd_soc_rockchip_pcm snd_soc_es8316 rockchip_thermal rockchip_saradc nfsd btrfs blake2b_generic zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq raid1 raid0 multipath linear mlx4_en realtek gpio_keys mlx4_core dwmac_rk stmmac_platform stmmac phylink
[   57.452861] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-1137-ayufan-ge57f05e7bf8f #ayufan
[   57.453631] Hardware name: Pine64 RockPro64 v2.1 (DT)
[   57.454095] pstate: 20000005 (nzCv daif -PAN -UAO)
[   57.454540] pc : dev_watchdog+0x328/0x330
[   57.454914] lr : dev_watchdog+0x328/0x330
[   57.455283] sp : ffff80001000bdb0
[   57.455591] x29: ffff80001000bdb0 x28: 0000000000000004
[   57.456080] x27: 0000000000000140 x26: 00000000ffffffff
[   57.456568] x25: 0000000000000001 x24: ffff0000e5860000
[   57.457056] x23: 0000000000000000 x22: 0000000000000001
[   57.457545] x21: ffff800011547000 x20: ffff0000e5860480
[   57.458033] x19: 0000000000000000 x18: ffffffffffffffff
[   57.458520] x17: 0000000000000000 x16: 0000000000000000
[   57.459009] x15: ffff800011549888 x14: ffff80009000bad7
[   57.459498] x13: ffff80001000bae5 x12: ffff800011562000
[   57.459986] x11: 0000000005f5e0ff x10: ffff8000115498c0
[   57.460474] x9 : 00000000ffffffd0 x8 : ffff8000107c0b20
[   57.460962] x7 : 000000000000026b x6 : 0000000000000002
[   57.461448] x5 : 0000000000000000 x4 : 0000000000000000
[   57.461935] x3 : 0000000000000006 x2 : 0000000000000001
[   57.462423] x1 : 6fc68b1e2d377c00 x0 : 0000000000000000
[   57.462911] Call trace:
[   57.463151]  dev_watchdog+0x328/0x330
[   57.463501]  call_timer_fn.isra.34+0x20/0x78
[   57.463899]  run_timer_softirq+0x468/0x4e8
[   57.464282]  efi_header_end+0x114/0x234
[   57.464641]  irq_exit+0xd0/0xd8
[   57.464939]  __handle_domain_irq+0x60/0xb0
[   57.465324]  gic_handle_irq+0x5c/0x148
[   57.465675]  el1_irq+0xb8/0x140
[   57.465974]  arch_cpu_idle+0x10/0x18
[   57.466308]  do_idle+0x1d8/0x2b0
[   57.466612]  cpu_startup_entry+0x20/0x60
[   57.466981]  secondary_start_kernel+0x19c/0x1f0
[   57.467397] ---[ end trace dd9ceca56ad3e078 ]---
[   57.467851] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[   73.444697] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[   89.444697] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[   90.084653] mlx4_core 0000:01:00.0: command 0x49 timed out (go bit not cleared)
[   90.085345] mlx4_core 0000:01:00.0: device is going to be reset
[   90.085890] mlx4_core 0000:01:00.0: crdump: devlink snapshot disabled, skipping
[   91.133684] mlx4_core 0000:01:00.0: device was reset successfully
[   91.134269] mlx4_en 0000:01:00.0: Internal error detected, restarting device
[   91.135035] mlx4_en: enp1s0: Failed disabling multicast filter
[   91.135588] mlx4_en: enp1s0: Failed enabling multicast filter
[   91.136132] mlx4_en: enp1s0: Fail to attach multicast address
[   91.164623] mlx4_core 0000:01:00.0: Fail to set mac in port 1 during unregister
[   91.186059] mlx4_en: enp1s0: Failed activating Rx CQ
[   91.196219] mlx4_en: enp1s0: Failed restarting port 1
[   94.004577] mlx4_core 0000:01:00.0: Internal error mark was detected on device
[   94.096323] mlx4_en 0000:01:00.0: removed PHC
[  OK  ] Stopped Serial Getty on ttyS2.
[  OK  ] Started Serial Getty on ttyS2.
[  105.204611] mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
[  105.205288] mlx4_core 0000:01:00.0: Failed to reset HCA, aborting
[  106.244652] mlx4_core 0000:01:00.0: mlx4_restart_one_up: ERROR: mlx4_load_one failed, pci_name=0000:01:00.0, err=-11
[  106.245627] mlx4_core 0000:01:00.0: mlx4_restart_one was ended, ret=-11
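
To make a log like this easier to compare between kernels, here is a rough Python sketch that keeps only the lines logged by mlx4_core / mlx4_en and counts how often the driver announces a reset. The names `mlx4_events` and `count_resets` are my own, the sample is a few lines copied from the log in this post, and it assumes the usual `[ seconds.micros] ` dmesg timestamp prefix.

```python
import re

# A few lines copied from the kernel log in this post.
SAMPLE = """\
[   36.631026] mlx4_core 0000:01:00.0: mlx4_cmd_post:cmd_pending failed
[   36.631603] mlx4_core 0000:01:00.0: device is going to be reset
[   57.445177] NETDEV WATCHDOG: enp1s0 (mlx4_core): transmit queue 0 timed out
[   57.467851] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[   90.085345] mlx4_core 0000:01:00.0: device is going to be reset
"""

# Timestamp, then a module name starting with "mlx4_", then a space or colon.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(mlx4_\w+)[ :]")


def mlx4_events(dmesg_text):
    """Return (timestamp, module, message) for lines logged by an mlx4 module."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2), line[m.end():].strip()))
    return events


def count_resets(events):
    """Count how many times the driver announced a device reset."""
    return sum(1 for _, _, msg in events if "device is going to be reset" in msg)


events = mlx4_events(SAMPLE)
print("mlx4 lines:", len(events))
print("first mlx4 message at:", events[0][0], "s")
print("resets announced:", count_resets(events))
```

It only filters and counts, it does not diagnose anything, but it makes it quicker to see when the first mlx4 error appears after boot.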

I also tried to install the official drivers from Mellanox, but I got an error about a missing dkms config.
I really don't know what else to try.

Any help is really appreciated.

P.S. Yes, the card works fine on any other Linux x86/x64 system.