Freezes and kernel panics with Debian trixie
#1
Hi there,

After running quite smoothly for several years, my ROCKPro64 has become quite unstable since the upgrade to Debian trixie.

I am using a JMicron Technology Corp. JMB58x AHCI SATA controller in the PCIe slot. The first symptom was that, on the first reboot after the upgrade, the status LEDs on the controller did not turn on. I connected a monitor via HDMI and there was no signal.

I have since gone through a couple of debugging iterations, and here is what I have so far.

If the PCIe card is in, I get panics related to PCIe, such as:

Code:
[    4.965205] SError Interrupt on CPU5, code 0x00000000bf000002 -- SError
[    4.965230] CPU: 5 UID: 0 PID: 52 Comm: kworker/u25:3 Tainted: G   M               6.12.63+deb13-arm64 #1  Debian 6.12.63-1
[    4.965249] Tainted: [M]=MACHINE_CHECK
[    4.965253] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    4.965260] Workqueue: events_unbound deferred_probe_work_func
[    4.965285] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.965297] pc : rockchip_pcie_rd_conf+0x194/0x2c0
[    4.965315] lr : rockchip_pcie_rd_conf+0x188/0x2c0
[    4.965326] sp : ffff8000828037a0
[    4.965331] x29: ffff8000828037a0 x28: ffff000001fbf800 x27: 0000000000000001
[    4.965348] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
[    4.965362] x23: ffff800082485000 x22: 0000000000000000 x21: ffff8000828037e4
[    4.965377] x20: 0000000000000000 x19: 0000000000000004 x18: ffffffffffffffff
[    4.965390] x17: 30302f30303a3030 x16: 30306963702f6569 x15: 63702e3030303030
[    4.965404] x14: ffff8000824bb460 x13: 0000000000000326 x12: 0000000000000000
[    4.965418] x11: 0000000000000001 x10: 0000000000000000 x9 : ffff8000808698d0
[    4.965431] x8 : 0000000124f798bc x7 : ffff000005740380 x6 : ffff000005747000
[    4.965445] x5 : ffff000001fbf800 x4 : ffff800087000000 x3 : 0000000000c00008
[    4.965458] x2 : 000000000080000a x1 : ffff800087c00008 x0 : ffff800087c0000c
[    4.965475] Kernel panic - not syncing: Asynchronous SError Interrupt
[    4.965481] CPU: 5 UID: 0 PID: 52 Comm: kworker/u25:3 Tainted: G   M               6.12.63+deb13-arm64 #1  Debian 6.12.63-1
[    4.965496] Tainted: [M]=MACHINE_CHECK
[    4.965500] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    4.965505] Workqueue: events_unbound deferred_probe_work_func
[    4.965517] Call trace:
[    4.965521]  dump_backtrace+0xd8/0x130
[    4.965534]  show_stack+0x20/0x38
[    4.965543]  dump_stack_lvl+0x60/0x80
[    4.965556]  dump_stack+0x18/0x28
[    4.965566]  panic+0x164/0x378
[    4.965582]  nmi_panic+0x90/0x98
[    4.965598]  arm64_serror_panic+0x78/0x90
[    4.965608]  do_serror+0x30/0x80
[    4.965617]  el1h_64_error_handler+0x30/0x48
[    4.965629]  el1h_64_error+0x64/0x68
[    4.965638]  rockchip_pcie_rd_conf+0x194/0x2c0
[    4.965650]  pci_bus_read_config_dword+0x8c/0x140
[    4.965663]  pci_bus_generic_read_dev_vendor_id+0x38/0x178
[    4.965678]  pci_scan_single_device+0xb4/0x120
[    4.965691]  pci_scan_slot+0x60/0x230
[    4.965703]  pci_scan_child_bus_extend+0x4c/0x2e0
[    4.965717]  pci_scan_bridge_extend+0x180/0x5a8
[    4.965731]  pci_scan_child_bus_extend+0x1c4/0x2e0
[    4.965744]  pci_scan_root_bus_bridge+0x6c/0xe8
[    4.965758]  pci_host_probe+0x38/0xe0
[    4.965771]  rockchip_pcie_probe+0x3a0/0x530
[    4.965782]  platform_probe+0x70/0xe8
[    4.965796]  really_probe+0xc8/0x3a0
[    4.965806]  __driver_probe_device+0x84/0x160
[    4.965815]  driver_probe_device+0x44/0x130
[    4.965825]  __device_attach_driver+0xc4/0x170
[    4.965836]  bus_for_each_drv+0x90/0x100
[    4.965845]  __device_attach+0xa8/0x1c8
[    4.965854]  device_initial_probe+0x1c/0x30
[    4.965864]  bus_probe_device+0xb0/0xc0
[    4.965873]  deferred_probe_work_func+0xbc/0x120
[    4.965883]  process_one_work+0x178/0x3e0
[    4.965895]  worker_thread+0x204/0x3f0
[    4.965907]  kthread+0xe8/0xf8
[    4.965916]  ret_from_fork+0x10/0x20
[    4.965929] SMP: stopping secondary CPUs
[    4.965945] Kernel Offset: disabled
[    4.965948] CPU features: 0x08,00002082,c0200000,4200421b
[    4.965957] Memory Limit: none
[    4.994272] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
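
For reference, the SError code on the first line of this trace is an ESR_ELx syndrome value. The sketch below splits it into the architectural fields (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The names decode_esr and EC_NAMES are made up for illustration, and the table only covers the two exception classes that show up in this thread.

```python
# Minimal ESR_ELx field splitter (illustrative helper, not a kernel tool).
# Field layout per the Arm architecture: EC = bits [31:26], IL = bit 25,
# ISS = bits [24:0]. Only the exception classes seen in this thread are named.
EC_NAMES = {
    0x25: "Data Abort taken without a change in Exception level",
    0x2F: "SError interrupt",
}


def decode_esr(esr: int) -> dict:
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fields = {"ec": ec, "ec_name": EC_NAMES.get(ec, "other"), "il": il, "iss": iss}
    if ec == 0x2F:
        # For SError, ISS bit 24 (IDS) set means the remaining syndrome bits
        # are implementation defined rather than architecturally specified.
        fields["ids"] = (iss >> 24) & 0x1
    return fields


serror = decode_esr(0xBF000002)
print(f"EC={serror['ec']:#x} ({serror['ec_name']}), IDS={serror['ids']}")
```

With IDS set, the low syndrome bits are implementation defined, so the value by itself says little beyond "asynchronous SError"; the useful information is that it was raised while rockchip_pcie_rd_conf was reading config space.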

When I booted without the PCIe card, I got what looked like a freeze on HDMI, but the UART logged this kernel panic:

Code:
[  106.672016] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000037
[  106.676856] Mem abort info:
[  106.681157]   ESR = 0x0000000096000004
[  106.685537]   EC = 0x25: DABT (current EL), IL = 32 bits
[  106.690074]   SET = 0, FnV = 0
[  106.694416]   EA = 0, S1PTW = 0
[  106.698736]   FSC = 0x04: level 0 translation fault
[  106.703208] Data abort info:
[  106.707515]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  106.712069]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  106.716632]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  106.721166] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000df71000
[  106.725827] [0000000000000037] pgd=0000000000000000, p4d=0000000000000000
[  106.730563] Internal error: Oops: 0000000096000004 [#1] SMP
[  106.735036] Modules linked in: nft_limit nft_masq nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables binfmt_misc snd_soc_hdmi_codec hantro_vpu aes_ce_blk rockchip_vdec(C) hci_uart v4l2_jpeg aes_ce_cipher crct10dif_ce v4l2_vp9 btqca polyval_ce v4l2_h264 polyval_generic rockchip_rga btrtl videobuf2_dma_contig btintel ghash_ce videobuf2_dma_sg gf128mul v4l2_mem2mem btbcm sha2_ce videobuf2_memops sha256_arm64 videobuf2_v4l2 snd_soc_audio_graph_card snd_soc_simple_card sha1_ce panfrost snd_soc_rockchip_i2s bluetooth snd_soc_spdif_tx snd_soc_es8316 snd_soc_simple_card_utils snd_soc_core videodev ofpart gpu_sched dw_hdmi_i2s_audio gpio_ir_recv des_generic dw_hdmi_cec pwm_fan snd_compress ecdh_generic leds_gpio spi_nor snd_pcm_dmaengine rk_crypto rfkill drm_shmem_helper snd_pcm videobuf2_common pwrseq_core crypto_engine snd_timer mtd mc libdes snd rockchip_saradc coresight_cpu_debug industrialio_triggered_buffer soundcore coresight_etm4x kfifo_buf rockchip_thermal industrialio coresight
[  106.735344]  cpufreq_dt evdev dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc efi_pstore configfs nfnetlink ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c crc32c_generic raid1 raid0 realtek md_mod xhci_plat_hcd xhci_hcd dwc3 rockchipdrm fusb302 udc_core rk808_regulator dw_hdmi tcpm cec ulpi dwmac_rk typec rc_core stmmac_platform fan53555 stmmac dw_mipi_dsi analogix_dp pwm_regulator gpio_rockchip drm_display_helper pcs_xpcs dwc3_of_simple phylink ohci_platform gpio_keys sdhci_of_arasan ohci_hcd mdio_devres drm_dma_helper ehci_platform phy_rockchip_inno_usb2 ehci_hcd of_mdio drm_kms_helper phy_rockchip_emmc sdhci_pltfm phy_rockchip_typec fixed_phy phy_rockchip_pcie usbcore nvmem_rockchip_efuse pl330 drm dw_wdt fwnode_mdio pwm_rockchip io_domain rockchip_dfi libphy cqhci dw_mmc_rockchip i2c_rk3x usb_common spi_rockchip dw_mmc_pltfm sdhci dw_mmc fixed
[  106.806464] CPU: 2 UID: 0 PID: 900 Comm: nft Tainted: G         C         6.12.63+deb13-arm64 #1  Debian 6.12.63-1
[  106.811298] Tainted: [C]=CRAP
[  106.815326] Hardware name: Pine64 RockPro64 v2.1 (DT)
[  106.819469] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  106.823730] pc : nf_ct_iterate_cleanup+0xd4/0x240 [nf_conntrack]
[  106.827937] lr : nf_ct_iterate_cleanup+0xc0/0x240 [nf_conntrack]
[  106.832067] sp : ffff8000832f33b0
[  106.835817] x29: ffff8000832f33b0 x28: ffff8000832f3450 x27: 0000000000000000
[  106.839909] x26: ffff80007b451680 x25: ffff00000b594a00 x24: ffff80007b443538
[  106.844018] x23: ffff80007b452688 x22: ffff80007b451c40 x21: 000000000001eb80
[  106.848126] x20: 0000000000003d70 x19: ffff000020700000 x18: ffffffffffffffff
[  106.852210] x17: 000000000f7574be x16: 0000000094c09be4 x15: ffff00000529f895
[  106.856295] x14: ffff8000832f3240 x13: 0000000000000801 x12: ffff0000f77e0178
[  106.860330] x11: 000000007fffffff x10: 0000000000000064 x9 : ffff80007b43b308
[  106.864328] x8 : ffff0000f1072920 x7 : ffff0000f105a9c0 x6 : ffff80007b47eb10
[  106.868342] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  106.872386] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000000
[  106.876346] Call trace:
[  106.879844]  nf_ct_iterate_cleanup+0xd4/0x240 [nf_conntrack]
[  106.883757]  nf_ct_iterate_cleanup_net+0x50/0x70 [nf_conntrack]
[  106.887678]  nf_ct_netns_do_get+0x1c0/0x220 [nf_conntrack]
[  106.891556]  nf_ct_netns_get+0xc8/0x100 [nf_conntrack]
[  106.895426]  nft_ct_get_init+0xa8/0x1b0 [nft_ct]
[  106.899134]  nf_tables_newrule+0x2d4/0x898 [nf_tables]
[  106.902984]  nfnetlink_rcv_batch+0x698/0x960 [nfnetlink]
[  106.906751]  nfnetlink_rcv+0x16c/0x1b0 [nfnetlink]
[  106.910483]  netlink_unicast+0x304/0x380
[  106.914126]  netlink_sendmsg+0x1ac/0x410
[  106.917709]  __sock_sendmsg+0x64/0xc0
[  106.921245]  ____sys_sendmsg+0x270/0x308
[  106.924786]  ___sys_sendmsg+0xb8/0x118
[  106.928209]  __sys_sendmsg+0x90/0x100
[  106.931533]  __arm64_sys_sendmsg+0x2c/0x40
[  106.934808]  invoke_syscall+0x6c/0x100
[  106.937919]  el0_svc_common.constprop.0+0x48/0xf0
[  106.941022]  do_el0_svc+0x24/0x38
[  106.943932]  el0_svc+0x38/0x150
[  106.946699]  el0t_64_sync_handler+0x120/0x138
[  106.949477]  el0t_64_sync+0x190/0x198
[  106.952089] Code: 3600009b 1400003b f940037b 3700073b (3940df60)
[  106.954847] ---[ end trace 0000000000000000 ]---
[  106.957390] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  106.960080] SMP: stopping secondary CPUs
[  106.962488] Kernel Offset: disabled
[  106.964664] CPU features: 0x08,00002082,c0200000,4200421b
[  106.966962] Memory Limit: none
[  106.968990] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
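
The faulting address 0x37 can be cross-checked against the instruction word in parentheses on the "Code:" line of this oops. The sketch below hand-decodes it as an A64 LDRB (immediate, unsigned offset) instruction. decode_ldrb_uimm is a hypothetical helper name, not part of any existing tool, and it only recognises that one encoding.

```python
# Hand-decode of the faulting instruction word from the "Code:" line.
# Assumption: the word is an A64 LDRB (immediate, unsigned offset), whose
# fixed bits are 0x39400000 under the mask 0xFFC00000.
def decode_ldrb_uimm(word: int):
    if (word & 0xFFC00000) != 0x39400000:
        return None  # some other instruction; this sketch does not handle it
    imm12 = (word >> 10) & 0xFFF   # byte offset (LDRB does not scale it)
    rn = (word >> 5) & 0x1F        # base register number
    rt = word & 0x1F               # destination register number
    return {"rt": rt, "rn": rn, "offset": imm12}


insn = decode_ldrb_uimm(0x3940DF60)
print(f"ldrb w{insn['rt']}, [x{insn['rn']}, #{insn['offset']:#x}]")
```

This comes out as a byte load from x27 plus 0x37. x27 is 0 in the register dump, so the access lands exactly on the reported address 0000000000000037.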

Both panics were captured with the Debian trixie kernel 6.12.63+deb13-arm64. With the Debian bookworm kernel 6.1.0-42-arm64 I also managed to get the same HDMI-level symptom as with the second panic (the cursor stops blinking).

I am currently running memtest86.com on the board, but so far (58% of the first pass) I see no errors. The panics do not always occur: sometimes the board boots through completely, at which point it seems to be stable for multiple days. A successful boot is less likely with the SATA controller installed, to the point that I haven't yet checked whether (other) panics occur once the system is up with the controller installed. I don't want to find out, because that would likely put the data on the attached disks at risk.

(There are no peripherals attached to the Pi header, except a Raspberry Pico-based UART adapter on UART0. Nothing is connected to any other port except a keyboard on USB, a display on HDMI, and a network cable on the 8P8C/RJ-45 port.)

Once memtest86 is done, I'll try to capture tracebacks with the 6.1.0 kernel. As mentioned, though, the system used to run fine. (As far as I can tell, that is: there *were* issues where reboots got stuck, but I had attributed those to something on UART0 interrupting u-boot. There is a non-zero chance that similar issues were in fact already present before the trixie upgrade.)

One thing I already investigated is the kernel_comp_size variable for u-boot, which I found mentioned in another thread as a possible cause for odd crashes. I raised it to 128 MiB, which initially seemed to fix things, but then I managed to reproduce the errors. That is plausible, because the distance between kernel and initramfs (according to kernel_addr_r and ramdisk_addr_r in u-boot) was ~94 MiB anyway and the trixie kernel is only ~38 MiB in size, so there was never a shortage of space. I'm running u-boot from June 2021 from here: https://github.com/sigmaris/u-boot/releases
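
The arithmetic behind that can be sketched as follows. The two load addresses are placeholders chosen only so that the gap comes out at 94 MiB; the real values have to be read from printenv in u-boot. image_fits is a hypothetical helper name.

```python
MIB = 1024 * 1024

# Placeholder addresses (not taken from the board): substitute the values of
# kernel_addr_r and ramdisk_addr_r printed by `printenv` in u-boot.
KERNEL_ADDR_R = 0x02080000
RAMDISK_ADDR_R = KERNEL_ADDR_R + 94 * MIB


def image_fits(kernel_addr: int, ramdisk_addr: int, image_size: int) -> bool:
    """True if an image of image_size bytes loaded at kernel_addr ends
    at or before ramdisk_addr, i.e. it cannot overlap the initramfs."""
    return kernel_addr + image_size <= ramdisk_addr


gap_mib = (RAMDISK_ADDR_R - KERNEL_ADDR_R) / MIB
print(f"gap between load addresses: {gap_mib:.0f} MiB")
print("38 MiB kernel fits:", image_fits(KERNEL_ADDR_R, RAMDISK_ADDR_R, 38 * MIB))
```

The same check with a larger size shows where an overlap would start, which is the situation the other thread was about.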

To me, this looks like some kind of hardware fault, most likely bad RAM. Does anyone have another idea?
#2
So memtest86 finished a complete pass successfully ("Finished pass #1 (of 4) (Total errors: 0, ECC errors: 0)"). Given the rate at which boots fail, I don't consider a second pass sensible.

I also ran dpkg -V to see whether the kernel image was corrupted, but that doesn't seem to be the case either.
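
For anyone wanting to repeat that check, here is a rough sketch that filters dpkg -V output for kernel-related paths. It assumes the rpm-style output format (nine status characters, a space, one attribute character, a space, then the path). parse_verify_line, kernel_failures and the sample lines are made up for illustration, and the path prefixes are only a guess at what counts as kernel-related.

```python
import subprocess

# Assumed locations of kernel images and modules; adjust as needed.
KERNEL_PREFIXES = ("/boot/", "/lib/modules/", "/usr/lib/modules/")


def parse_verify_line(line: str):
    """Split one line of rpm-style verify output into (flags, attr, path)."""
    if len(line) < 13:
        return None
    return line[:9], line[10], line[12:]


def kernel_failures(output: str):
    """Return the paths under KERNEL_PREFIXES that failed verification."""
    failed = []
    for line in output.splitlines():
        parsed = parse_verify_line(line)
        if parsed and parsed[2].startswith(KERNEL_PREFIXES):
            failed.append(parsed[2])
    return failed


def run_dpkg_verify() -> str:
    # dpkg -V prints one line per file that fails a check, so empty output
    # means nothing was flagged.
    return subprocess.run(["dpkg", "-V"], capture_output=True, text=True).stdout


# Made-up sample output, not from a real system:
sample = "??5?????? c /etc/default/grub\n??5??????   /boot/vmlinuz-example\n"
print(kernel_failures(sample))
```

On the board itself, kernel_failures(run_dpkg_verify()) would be the call to make; an empty list corresponds to the "doesn't seem to be the case" result above.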
#3
Okay, more insights:
  • The bookworm kernel 6.1.0-39 manages to enumerate the SATA card reliably if I limit the number of CPUs to 2.
  • The trixie kernel 6.12-something reliably fails to boot with the SATA card in no matter the number of CPUs allowed.
  • Without the SATA card, the trixie kernel boots cleanly with only two CPUs.

For the things dependent on the number of CPUs, I found a suspect:

I suspect this to be L9 or L10 from the schematics (EDIT: it is in fact more likely to be L2000; I'm opening a separate thread about that). I'll try to find someone who can help me replace that sucker. Until then, there's probably little sense in trying to resolve the other issue.
#4
I got the inductor replaced, which didn't change the error pattern at all.

* 6.1.0 from Debian bookworm: boots fine with or without the PCIe card, sometimes crashes when running nft(1) for the first time after a boot.
* 6.12.63 from Debian trixie: boots fine without the PCIe card and may also have the nft(1) issue (I haven't investigated). With the SATA card in, it always crashes while enumerating the SATA card, before HDMI comes up.
* The 6.17.x backports kernel for trixie has the same issue.