Freezes and kernel panics with Debian trixie
#1
Hi there,

After running quite smoothly for several years, my ROCKPro64 has become quite unstable since the upgrade to Debian trixie.

I am using a JMicron Technology Corp. JMB58x AHCI SATA controller in the PCIe slot. The first symptom was that, on the first reboot after the upgrade, the status LEDs on the controller did not turn on. I connected a monitor via HDMI and got no signal.

I have since gone through a couple of debugging iterations, and here is what I have so far.

If the PCIe card is in, I get panics related to PCIe, such as:

Code:
[    4.965205] SError Interrupt on CPU5, code 0x00000000bf000002 -- SError
[    4.965230] CPU: 5 UID: 0 PID: 52 Comm: kworker/u25:3 Tainted: G   M               6.12.63+deb13-arm64 #1  Debian 6.12.63-1
[    4.965249] Tainted: [M]=MACHINE_CHECK
[    4.965253] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    4.965260] Workqueue: events_unbound deferred_probe_work_func
[    4.965285] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.965297] pc : rockchip_pcie_rd_conf+0x194/0x2c0
[    4.965315] lr : rockchip_pcie_rd_conf+0x188/0x2c0
[    4.965326] sp : ffff8000828037a0
[    4.965331] x29: ffff8000828037a0 x28: ffff000001fbf800 x27: 0000000000000001
[    4.965348] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
[    4.965362] x23: ffff800082485000 x22: 0000000000000000 x21: ffff8000828037e4
[    4.965377] x20: 0000000000000000 x19: 0000000000000004 x18: ffffffffffffffff
[    4.965390] x17: 30302f30303a3030 x16: 30306963702f6569 x15: 63702e3030303030
[    4.965404] x14: ffff8000824bb460 x13: 0000000000000326 x12: 0000000000000000
[    4.965418] x11: 0000000000000001 x10: 0000000000000000 x9 : ffff8000808698d0
[    4.965431] x8 : 0000000124f798bc x7 : ffff000005740380 x6 : ffff000005747000
[    4.965445] x5 : ffff000001fbf800 x4 : ffff800087000000 x3 : 0000000000c00008
[    4.965458] x2 : 000000000080000a x1 : ffff800087c00008 x0 : ffff800087c0000c
[    4.965475] Kernel panic - not syncing: Asynchronous SError Interrupt
[    4.965481] CPU: 5 UID: 0 PID: 52 Comm: kworker/u25:3 Tainted: G   M               6.12.63+deb13-arm64 #1  Debian 6.12.63-1
[    4.965496] Tainted: [M]=MACHINE_CHECK
[    4.965500] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    4.965505] Workqueue: events_unbound deferred_probe_work_func
[    4.965517] Call trace:
[    4.965521]  dump_backtrace+0xd8/0x130
[    4.965534]  show_stack+0x20/0x38
[    4.965543]  dump_stack_lvl+0x60/0x80
[    4.965556]  dump_stack+0x18/0x28
[    4.965566]  panic+0x164/0x378
[    4.965582]  nmi_panic+0x90/0x98
[    4.965598]  arm64_serror_panic+0x78/0x90
[    4.965608]  do_serror+0x30/0x80
[    4.965617]  el1h_64_error_handler+0x30/0x48
[    4.965629]  el1h_64_error+0x64/0x68
[    4.965638]  rockchip_pcie_rd_conf+0x194/0x2c0
[    4.965650]  pci_bus_read_config_dword+0x8c/0x140
[    4.965663]  pci_bus_generic_read_dev_vendor_id+0x38/0x178
[    4.965678]  pci_scan_single_device+0xb4/0x120
[    4.965691]  pci_scan_slot+0x60/0x230
[    4.965703]  pci_scan_child_bus_extend+0x4c/0x2e0
[    4.965717]  pci_scan_bridge_extend+0x180/0x5a8
[    4.965731]  pci_scan_child_bus_extend+0x1c4/0x2e0
[    4.965744]  pci_scan_root_bus_bridge+0x6c/0xe8
[    4.965758]  pci_host_probe+0x38/0xe0
[    4.965771]  rockchip_pcie_probe+0x3a0/0x530
[    4.965782]  platform_probe+0x70/0xe8
[    4.965796]  really_probe+0xc8/0x3a0
[    4.965806]  __driver_probe_device+0x84/0x160
[    4.965815]  driver_probe_device+0x44/0x130
[    4.965825]  __device_attach_driver+0xc4/0x170
[    4.965836]  bus_for_each_drv+0x90/0x100
[    4.965845]  __device_attach+0xa8/0x1c8
[    4.965854]  device_initial_probe+0x1c/0x30
[    4.965864]  bus_probe_device+0xb0/0xc0
[    4.965873]  deferred_probe_work_func+0xbc/0x120
[    4.965883]  process_one_work+0x178/0x3e0
[    4.965895]  worker_thread+0x204/0x3f0
[    4.965907]  kthread+0xe8/0xf8
[    4.965916]  ret_from_fork+0x10/0x20
[    4.965929] SMP: stopping secondary CPUs
[    4.965945] Kernel Offset: disabled
[    4.965948] CPU features: 0x08,00002082,c0200000,4200421b
[    4.965957] Memory Limit: none
[    4.994272] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
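
A side observation on the register dump above: x15 to x17 hold printable ASCII. The following is a minimal sketch for decoding them, assuming they are leftovers of a little-endian string copy (that is a guess, the log does not say so). The helper name `regs_to_ascii` is made up for this example.

```python
def regs_to_ascii(values):
    """Interpret 64-bit register values as little-endian bytes and join them."""
    raw = b"".join(v.to_bytes(8, "little") for v in values)
    # Replace non-printable bytes with '.' so stray binary data stays readable.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# x15, x16, x17 copied from the register dump above, in ascending order.
x15 = 0x63702E3030303030
x16 = 0x30306963702F6569
x17 = 0x30302F30303A3030

print(regs_to_ascii([x15, x16, x17]))  # 00000.pcie/pci0000:00/00
```

The fragment looks like part of a PCI device path, which fits the call trace (a config read during the bus scan). It says nothing about the cause by itself.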

When I booted without the PCIe card, I got what looked like a freeze on HDMI, but the UART logged this kernel panic:

Code:
[  106.672016] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000037
[  106.676856] Mem abort info:
[  106.681157]   ESR = 0x0000000096000004
[  106.685537]   EC = 0x25: DABT (current EL), IL = 32 bits
[  106.690074]   SET = 0, FnV = 0
[  106.694416]   EA = 0, S1PTW = 0
[  106.698736]   FSC = 0x04: level 0 translation fault
[  106.703208] Data abort info:
[  106.707515]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  106.712069]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  106.716632]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  106.721166] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000df71000
[  106.725827] [0000000000000037] pgd=0000000000000000, p4d=0000000000000000
[  106.730563] Internal error: Oops: 0000000096000004 [#1] SMP
[  106.735036] Modules linked in: nft_limit nft_masq nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables binfmt_misc snd_soc_hdmi_codec hantro_vpu aes_ce_blk rockchip_vdec(C) hci_uart v4l2_jpeg aes_ce_cipher crct10dif_ce v4l2_vp9 btqca polyval_ce v4l2_h264 polyval_generic rockchip_rga btrtl videobuf2_dma_contig btintel ghash_ce videobuf2_dma_sg gf128mul v4l2_mem2mem btbcm sha2_ce videobuf2_memops sha256_arm64 videobuf2_v4l2 snd_soc_audio_graph_card snd_soc_simple_card sha1_ce panfrost snd_soc_rockchip_i2s bluetooth snd_soc_spdif_tx snd_soc_es8316 snd_soc_simple_card_utils snd_soc_core videodev ofpart gpu_sched dw_hdmi_i2s_audio gpio_ir_recv des_generic dw_hdmi_cec pwm_fan snd_compress ecdh_generic leds_gpio spi_nor snd_pcm_dmaengine rk_crypto rfkill drm_shmem_helper snd_pcm videobuf2_common pwrseq_core crypto_engine snd_timer mtd mc libdes snd rockchip_saradc coresight_cpu_debug industrialio_triggered_buffer soundcore coresight_etm4x kfifo_buf rockchip_thermal industrialio coresight
[  106.735344]  cpufreq_dt evdev dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc efi_pstore configfs nfnetlink ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c crc32c_generic raid1 raid0 realtek md_mod xhci_plat_hcd xhci_hcd dwc3 rockchipdrm fusb302 udc_core rk808_regulator dw_hdmi tcpm cec ulpi dwmac_rk typec rc_core stmmac_platform fan53555 stmmac dw_mipi_dsi analogix_dp pwm_regulator gpio_rockchip drm_display_helper pcs_xpcs dwc3_of_simple phylink ohci_platform gpio_keys sdhci_of_arasan ohci_hcd mdio_devres drm_dma_helper ehci_platform phy_rockchip_inno_usb2 ehci_hcd of_mdio drm_kms_helper phy_rockchip_emmc sdhci_pltfm phy_rockchip_typec fixed_phy phy_rockchip_pcie usbcore nvmem_rockchip_efuse pl330 drm dw_wdt fwnode_mdio pwm_rockchip io_domain rockchip_dfi libphy cqhci dw_mmc_rockchip i2c_rk3x usb_common spi_rockchip dw_mmc_pltfm sdhci dw_mmc fixed
[  106.806464] CPU: 2 UID: 0 PID: 900 Comm: nft Tainted: G         C         6.12.63+deb13-arm64 #1  Debian 6.12.63-1
[  106.811298] Tainted: [C]=CRAP
[  106.815326] Hardware name: Pine64 RockPro64 v2.1 (DT)
[  106.819469] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  106.823730] pc : nf_ct_iterate_cleanup+0xd4/0x240 [nf_conntrack]
[  106.827937] lr : nf_ct_iterate_cleanup+0xc0/0x240 [nf_conntrack]
[  106.832067] sp : ffff8000832f33b0
[  106.835817] x29: ffff8000832f33b0 x28: ffff8000832f3450 x27: 0000000000000000
[  106.839909] x26: ffff80007b451680 x25: ffff00000b594a00 x24: ffff80007b443538
[  106.844018] x23: ffff80007b452688 x22: ffff80007b451c40 x21: 000000000001eb80
[  106.848126] x20: 0000000000003d70 x19: ffff000020700000 x18: ffffffffffffffff
[  106.852210] x17: 000000000f7574be x16: 0000000094c09be4 x15: ffff00000529f895
[  106.856295] x14: ffff8000832f3240 x13: 0000000000000801 x12: ffff0000f77e0178
[  106.860330] x11: 000000007fffffff x10: 0000000000000064 x9 : ffff80007b43b308
[  106.864328] x8 : ffff0000f1072920 x7 : ffff0000f105a9c0 x6 : ffff80007b47eb10
[  106.868342] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  106.872386] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000000
[  106.876346] Call trace:
[  106.879844]  nf_ct_iterate_cleanup+0xd4/0x240 [nf_conntrack]
[  106.883757]  nf_ct_iterate_cleanup_net+0x50/0x70 [nf_conntrack]
[  106.887678]  nf_ct_netns_do_get+0x1c0/0x220 [nf_conntrack]
[  106.891556]  nf_ct_netns_get+0xc8/0x100 [nf_conntrack]
[  106.895426]  nft_ct_get_init+0xa8/0x1b0 [nft_ct]
[  106.899134]  nf_tables_newrule+0x2d4/0x898 [nf_tables]
[  106.902984]  nfnetlink_rcv_batch+0x698/0x960 [nfnetlink]
[  106.906751]  nfnetlink_rcv+0x16c/0x1b0 [nfnetlink]
[  106.910483]  netlink_unicast+0x304/0x380
[  106.914126]  netlink_sendmsg+0x1ac/0x410
[  106.917709]  __sock_sendmsg+0x64/0xc0
[  106.921245]  ____sys_sendmsg+0x270/0x308
[  106.924786]  ___sys_sendmsg+0xb8/0x118
[  106.928209]  __sys_sendmsg+0x90/0x100
[  106.931533]  __arm64_sys_sendmsg+0x2c/0x40
[  106.934808]  invoke_syscall+0x6c/0x100
[  106.937919]  el0_svc_common.constprop.0+0x48/0xf0
[  106.941022]  do_el0_svc+0x24/0x38
[  106.943932]  el0_svc+0x38/0x150
[  106.946699]  el0t_64_sync_handler+0x120/0x138
[  106.949477]  el0t_64_sync+0x190/0x198
[  106.952089] Code: 3600009b 1400003b f940037b 3700073b (3940df60)
[  106.954847] ---[ end trace 0000000000000000 ]---
[  106.957390] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  106.960080] SMP: stopping secondary CPUs
[  106.962488] Kernel Offset: disabled
[  106.964664] CPU features: 0x08,00002082,c0200000,4200421b
[  106.966962] Memory Limit: none
[  106.968990] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
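
The ESR value in this log and the SError code in the first one can be split into fields by hand. Below is a minimal sketch, assuming the standard AArch64 ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); the function names are made up for this example.

```python
def decode_esr(esr):
    """Split an AArch64 ESR_ELx value into its top-level fields."""
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,    # instruction length bit, bit 25
        "ISS": esr & 0x1FFFFFF,     # instruction specific syndrome, bits [24:0]
    }

def decode_data_abort_iss(iss):
    """Two of the ISS fields the kernel prints for a data abort."""
    return {
        "DFSC": iss & 0x3F,         # data fault status code, bits [5:0]
        "WnR": (iss >> 6) & 0x1,    # 1 = write access, 0 = read access
    }

# Second panic: ESR = 0x0000000096000004
oops = decode_esr(0x96000004)
print(hex(oops["EC"]), oops["IL"], hex(oops["ISS"]))  # 0x25 1 0x4
print(decode_data_abort_iss(oops["ISS"]))             # {'DFSC': 4, 'WnR': 0}

# First panic: SError code 0x00000000bf000002
serror = decode_esr(0xBF000002)
print(hex(serror["EC"]))                              # 0x2f
```

The results agree with what the kernel printed: EC 0x25 is a data abort at the current EL, DFSC 0x04 is a level 0 translation fault, and WnR = 0 means a read. Since x27 is zero in the register dump and the fault address is 0x37, this is consistent with a load at a small offset from a NULL pointer. EC 0x2f is the SError interrupt class.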

Both panics were captured with the Debian trixie kernel 6.12.63+deb13-arm64. With the Debian bookworm kernel 6.1.0-42-arm64, I managed to get the same HDMI-level symptoms as with the second panic (cursor stops blinking).

I am currently running memtest86.com on the board, but so far (58% of the first pass) I see no errors. The panics do not always occur. Sometimes the system boots through completely, at which point it seems to be stable for multiple days. A successful boot is less likely with the SATA controller installed, so much so that I haven't yet checked whether (other) panics occur once the system has booted with the controller in. I'd rather not find out, because that would likely put the data on the attached disks at risk.

(There are no peripherals attached to the Pi header except a Raspberry Pi Pico-based UART adapter on UART0. Nothing is connected to any other port except a keyboard on USB, a display on HDMI, and a network cable on the 8P8C/RJ-45 port.)

Once memtest86 is done, I'll try to capture tracebacks with the 6.1.0 kernel. As mentioned, though, the system used to run fine as far as I can tell. There *were* issues where reboots got stuck, but I attributed those to something on UART0 interrupting u-boot. There is a non-zero chance that similar issues were in fact already present before the trixie upgrade.

One thing I already investigated is the kernel_comp_size variable for u-boot, which I found in another thread as a possible cause of funny crashes. I raised it to 128 MiB, which initially seemed to fix things, but then I managed to reproduce the errors. It makes sense that this was not the real cause, because the distance between initramfs and kernel (according to kernel_addr_r and ramdisk_addr_r in u-boot) was ~94 MiB anyway and the trixie kernel is only ~38 MiB in size. I'm running u-boot from June 2021 from here: https://github.com/sigmaris/u-boot/releases
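
The size argument above can be checked with a few lines of arithmetic. This is a minimal sketch: the two addresses are hypothetical placeholders chosen only to reproduce the ~94 MiB gap mentioned in the post, so the real values would have to be read from the u-boot environment (for example with `printenv`). It only checks whether the kernel would overlap the initramfs and does not model how u-boot uses kernel_comp_size.

```python
MIB = 1024 * 1024

def gap_mib(kernel_addr_r, ramdisk_addr_r):
    """Distance between the kernel and initramfs load addresses in MiB."""
    return (ramdisk_addr_r - kernel_addr_r) / MIB

def kernel_fits(kernel_addr_r, ramdisk_addr_r, kernel_size_bytes):
    """True if a kernel of the given size ends at or before the initramfs."""
    return kernel_addr_r + kernel_size_bytes <= ramdisk_addr_r

# Hypothetical addresses (NOT taken from the post), picked to give a 94 MiB gap.
kernel_addr_r = 0x02080000
ramdisk_addr_r = kernel_addr_r + 94 * MIB
kernel_size = 38 * MIB  # approximate trixie kernel size from the post

print(gap_mib(kernel_addr_r, ramdisk_addr_r))                   # 94.0
print(kernel_fits(kernel_addr_r, ramdisk_addr_r, kernel_size))  # True
```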

To me, this looks like some kind of hardware fault, most likely bad RAM. Does anyone have another idea?
#2
So memtest86 finished a complete pass successfully ("Finished pass #1 (of 4) (Total errors: 0, ECC errors: 0)"). Given the rate at which boots fail, I don't consider a second pass sensible.

I also ran dpkg -V to check whether the kernel image was corrupted, but that doesn't seem to be the case either.
#3
Okay, more insights:
  • The bookworm kernel 6.1.0-39 manages to enumerate the SATA card reliably if I limit the number of CPUs to 2.
  • The trixie kernel 6.12-something reliably fails to boot with the SATA card in no matter the number of CPUs allowed.
  • Without the SATA card, the trixie kernel boots cleanly with only two CPUs.
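
To confirm that a CPU limit is actually in effect after boot, the list of online CPUs can be read back from sysfs. This is a minimal sketch; the post does not say how the limit was applied, so a kernel command-line option such as `maxcpus=2` is an assumption here, and the function names are made up for this example.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1' or '0-3,5' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus

def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the list of currently online CPUs from sysfs (Linux only)."""
    with open(path) as f:
        return parse_cpu_list(f.read())

print(parse_cpu_list("0-1"))    # [0, 1]
print(parse_cpu_list("0-3,5"))  # [0, 1, 2, 3, 5]
```

On the board itself, `len(online_cpus())` should then report 2 when the limit is active.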

For the things dependent on the number of CPUs, I found a suspect:

I suspect this to be L9 or L10 from the schematics. I'll try to find someone who can help me replace that sucker. Before that, there's probably little sense in trying to resolve the other issue.