02-07-2021, 11:26 PM
Hello everyone.
I have a Mellanox ConnectX-3 EN CX311A 10 Gbit card (with SFP+ optical fiber ubi modules) and I wanted to try it on the ROCKPRO64 board.
First of all, every official Armbian image with a 5.x kernel hangs at "Starting kernel" during boot, with no errors or any other output.
All 4.x kernel images boot fine and the card is recognized, but no driver is available (lspci shows the device, but no driver is bound).
Long story short, I compiled an Armbian image (kernel 4.4.213) with the Mellanox driver enabled as a module. With it I can see the card (lspci shows the device and the driver), but it is impossible to get an IP address.
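For anyone repeating the custom build, a quick way to confirm that the Mellanox options actually ended up in the kernel config is sketched below. It assumes the relevant symbols are CONFIG_MLX4_CORE and CONFIG_MLX4_EN and that the running kernel's config is stored under /boot; the function name check_mlx4_config is made up for this example, and the path may differ on your image.

```shell
# Report whether the mlx4 options are built in (y), modular (m), or not set
# in a kernel config file. The default path is an assumption; pass your own.
check_mlx4_config() {
    config_file="${1:-/boot/config-$(uname -r)}"
    for opt in CONFIG_MLX4_CORE CONFIG_MLX4_EN; do
        # Match lines such as CONFIG_MLX4_EN=m and keep the value after "="
        value=$(grep -E "^${opt}=" "$config_file" | cut -d= -f2 || true)
        echo "${opt}=${value:-not set}"
    done
}

# Usage: check_mlx4_config /path/to/kernel/.config
```

If either option is reported as "not set", the module was not built and the card will show up in lspci without a driver.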
After a lot of testing, I finally got the card to obtain an IP address with this image from ayufan:
buster-minimal-rockpro64-0.10.12-1184-arm64.img.xz
running this kernel:
5.11.0-rc4-1147-ayufan-gbf2a8ef692d2
Code:
root@rockpro64:/home/rock64# lspci -vvv
00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd RK3399 PCI Express Root Port (prog-if 00 [Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 79
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 00000000-00000fff [size=4K]
Memory behind bridge: fa000000-fabfffff [size=12M]
Prefetchable memory behind bridge: 00000000-000fffff [size=1M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [80] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable+ 64bit+
Address: 00000000fee30040 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=0 offset=00000000
PBA: BAR=0 offset=00000008
Capabilities: [c0] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
LnkSta: Speed 5GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd Off, Power+ Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL+ CmdCplt- PresDet- Interlock-
Changed: MRL- PresDet- LinkState-
RootCap: CRSVisible-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
10BitTagComp-, 10BitTagReq-, OBFF Via message, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, LN System CLS Not Supported, TPHComp+, ExtTPHComp-, ARIFwd+
AtomicOpsCap: Routing+ 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd+
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn- NFERptEn- FERptEn-
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [274 v1] Transaction Processing Hints
Interrupt vector mode supported
Device specific mode supported
Steering table in TPH capability structure
Kernel driver in use: pcieport
lspci: Unable to load libkmod resources: error -12
01:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies MT27500 Family [ConnectX-3]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 78
Region 0: Memory at fa800000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at fa000000 (64-bit, prefetchable) [size=8M]
Expansion ROM at fa900000 [virtual] [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: CX311A - ConnectX-3 SFP+
Read-only fields:
[PN] Part number: MCX311A-XCAT
[EC] Engineering changes: A9
[SN] Serial number: MT1502K00886
[V0] Vendor specific: PCIe Gen3 x4
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 109 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 253 byte(s) free
[RW] Read-write area: 252 byte(s) free
End
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number e4-1d-2d-03-00-6d-89-a0
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [18c v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Kernel driver in use: mlx4_core
Code:
rock64@rockpro64:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether mac brd ff:ff:ff:ff:ff:ff
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether mac brd ff:ff:ff:ff:ff:ff
inet 192.168.2.102/24 brd 192.168.2.255 scope global dynamic enp1s0
valid_lft 259139sec preferred_lft 259139sec
inet6 fe80::e61d:2dff:fe6d:89a0/64 scope link
valid_lft forever preferred_lft forever
The problem now is that the system is extremely unstable.
If I try to download something, the system crashes after 2-3 seconds. Sometimes it freezes completely, but most of the time the card shuts down after these messages:
Code:
[ 57.444678] ------------[ cut here ]------------
[ 36.631026] mlx4_core 0000:01:00.0: mlx4_cmd_post:cmd_pending failed
[ 36.631603] mlx4_core 0000:01:00.0: device is going to be reset
[ 37.671544] mlx4_core 0000:01:00.0: device was reset successfully
[ 37.672117] mlx4_en 0000:01:00.0: Internal error detected, restarting device
[ 37.672823] mlx4_core 0000:01:00.0: command 0x49 failed: fw status = 0x1
[ 43.191166] mlx4_core 0000:01:00.0: mlx4_restart_one_up: ERROR: mlx4_load_one failed, pci_name=0000:01:00.0, err=-22
[ 57.445177] NETDEV WATCHDOG: enp1s0 (mlx4_core): transmit queue 0 timed out
[ 57.445969] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:443 dev_watchdog+0x328/0x330
[ 57.446713] Modules linked in: snd_soc_hdmi_codec dw_hdmi_i2s_audio dw_hdmi_cec hci_uart rockchipdrm btqca dw_mipi_dsi btbcm dw_hdmi btintel analogix_dp panfrost cec bluetooth gpu_sched drm_kms_helper snd_soc_simple_card rockchip_rga snd_soc_audio_graph_card pwm_fan drm syscopyarea snd_soc_simple_card_utils ecdh_generic sysfillrect ecc sysimgblt drm_panel_orientation_quirks dw_wdt fb_sys_fops snd_soc_rockchip_i2s rfkill videobuf2_dma_sg snd_soc_rockchip_pcm snd_soc_es8316 rockchip_thermal rockchip_saradc nfsd btrfs blake2b_generic zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq raid1 raid0 multipath linear mlx4_en realtek gpio_keys mlx4_core dwmac_rk stmmac_platform stmmac phylink
[ 57.452861] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-1137-ayufan-ge57f05e7bf8f #ayufan
[ 57.453631] Hardware name: Pine64 RockPro64 v2.1 (DT)
[ 57.454095] pstate: 20000005 (nzCv daif -PAN -UAO)
[ 57.454540] pc : dev_watchdog+0x328/0x330
[ 57.454914] lr : dev_watchdog+0x328/0x330
[ 57.455283] sp : ffff80001000bdb0
[ 57.455591] x29: ffff80001000bdb0 x28: 0000000000000004
[ 57.456080] x27: 0000000000000140 x26: 00000000ffffffff
[ 57.456568] x25: 0000000000000001 x24: ffff0000e5860000
[ 57.457056] x23: 0000000000000000 x22: 0000000000000001
[ 57.457545] x21: ffff800011547000 x20: ffff0000e5860480
[ 57.458033] x19: 0000000000000000 x18: ffffffffffffffff
[ 57.458520] x17: 0000000000000000 x16: 0000000000000000
[ 57.459009] x15: ffff800011549888 x14: ffff80009000bad7
[ 57.459498] x13: ffff80001000bae5 x12: ffff800011562000
[ 57.459986] x11: 0000000005f5e0ff x10: ffff8000115498c0
[ 57.460474] x9 : 00000000ffffffd0 x8 : ffff8000107c0b20
[ 57.460962] x7 : 000000000000026b x6 : 0000000000000002
[ 57.461448] x5 : 0000000000000000 x4 : 0000000000000000
[ 57.461935] x3 : 0000000000000006 x2 : 0000000000000001
[ 57.462423] x1 : 6fc68b1e2d377c00 x0 : 0000000000000000
[ 57.462911] Call trace:
[ 57.463151] dev_watchdog+0x328/0x330
[ 57.463501] call_timer_fn.isra.34+0x20/0x78
[ 57.463899] run_timer_softirq+0x468/0x4e8
[ 57.464282] efi_header_end+0x114/0x234
[ 57.464641] irq_exit+0xd0/0xd8
[ 57.464939] __handle_domain_irq+0x60/0xb0
[ 57.465324] gic_handle_irq+0x5c/0x148
[ 57.465675] el1_irq+0xb8/0x140
[ 57.465974] arch_cpu_idle+0x10/0x18
[ 57.466308] do_idle+0x1d8/0x2b0
[ 57.466612] cpu_startup_entry+0x20/0x60
[ 57.466981] secondary_start_kernel+0x19c/0x1f0
[ 57.467397] ---[ end trace dd9ceca56ad3e078 ]---
[ 57.467851] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[ 73.444697] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[ 89.444697] mlx4_en: enp1s0: TX timeout on queue: 0, QP: 0x208, CQ: 0x84, Cons: 0xffffffff, Prod: 0x1
[ 90.084653] mlx4_core 0000:01:00.0: command 0x49 timed out (go bit not cleared)
[ 90.085345] mlx4_core 0000:01:00.0: device is going to be reset
[ 90.085890] mlx4_core 0000:01:00.0: crdump: devlink snapshot disabled, skipping
[ 91.133684] mlx4_core 0000:01:00.0: device was reset successfully
[ 91.134269] mlx4_en 0000:01:00.0: Internal error detected, restarting device
[ 91.135035] mlx4_en: enp1s0: Failed disabling multicast filter
[ 91.135588] mlx4_en: enp1s0: Failed enabling multicast filter
[ 91.136132] mlx4_en: enp1s0: Fail to attach multicast address
[ 91.164623] mlx4_core 0000:01:00.0: Fail to set mac in port 1 during unregister
[ 91.186059] mlx4_en: enp1s0: Failed activating Rx CQ
[ 91.196219] mlx4_en: enp1s0: Failed restarting port 1
[ 94.004577] mlx4_core 0000:01:00.0: Internal error mark was detected on device
[ 94.096323] mlx4_en 0000:01:00.0: removed PHC
[ OK ] Stopped Serial Getty on ttyS2.
[ OK ] Started Serial Getty on ttyS2.
[ 105.204611] mlx4_core 0000:01:00.0: Failed to obtain HW semaphore, aborting
[ 105.205288] mlx4_core 0000:01:00.0: Failed to reset HCA, aborting
[ 106.244652] mlx4_core 0000:01:00.0: mlx4_restart_one_up: ERROR: mlx4_load_one failed, pci_name=0000:01:00.0, err=-11
[ 106.245627] mlx4_core 0000:01:00.0: mlx4_restart_one was ended, ret=-11
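Because the console output above is interleaved (the 57.x "cut here" line appears before the 36.x messages), it can help to pull out only the driver messages and order them by timestamp. A minimal sketch, assuming the usual "[ seconds.micros] message" kernel log format; the function name mlx4_log_lines is made up for this example.

```shell
# Keep only lines that mention mlx4_core or mlx4_en and sort them by the
# numeric timestamp inside the leading "[ ... ]" of each kernel log line.
mlx4_log_lines() {
    grep -E 'mlx4_(core|en)' | LC_ALL=C sort -t']' -k1.2,1n
}

# Usage: dmesg | mlx4_log_lines
```

Sorted this way, the log above starts with the "mlx4_cmd_post:cmd_pending failed" message at 36.6 s, and the TX watchdog timeout at 57.4 s follows it.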
I also tried to install the official drivers from Mellanox, but I got an error about a missing dkms config.
I really don't know what else to try.
Any help would be greatly appreciated.
P.S. Yes, the card works fine on any other x86/x64 Linux system.