Best way to avoid SMP internal errors when building RAID?
#1
Hi,

As some of you might know, there is a known issue with the handling of PCIe errors on the RP64, as discussed here:

https://forum.pine64.org/showthread.php?tid=8374
https://forum.pine64.org/showthread.php?tid=6329

I am getting this error when I try to build/rebuild my RAID for the first time. Sometimes I am lucky and it works, but most of the time it does not, and I don't have much confidence in putting my backup on a machine with a malfunctioning PCIe interface. When it happens, I see entries like these in the logs:

Code:
kernel:[  658.490457] Internal error: synchronous external abort: 96000210 [#1] SMP
Message from syslogd@debian at Feb 26 23:35:03 ...
kernel:[  658.518345] Code: b8615881 340001c1 8b21c061 8b010001 (b9400021)
Message from syslogd@debian at Feb 26 23:35:03 ...
kernel:[  658.490457] Internal error: synchronous external abort: 96000210 [#1] SMP
Message from syslogd@debian at Feb 26 23:35:03 ...
kernel:[  658.518345] Code: b8615881 340001c1 8b21c061 8b010001 (b9400021)

As I am still encountering these issues when running the latest Debian unstable kernel, I fear that this issue won't be fixed in the upcoming Debian stable either (because it's a hardware issue and not Debian's fault). Now I wonder what the best workaround could be. Recompiling the kernel with the hack discussed in said thread seems to work, but it's not a long-term solution if we want to get regular security updates for our kernel without manual patching and recompiling.

Is there any workaround we could apply in software to avoid these issues? I would be happy with anything, even at the cost of performance, such as disabling all but one CPU core.

Thank you!

I tried the most radical approach and completely disabled SMP by adding the following kernel command line parameter:

Code:
nosmp

As a result, my RP64 now runs with only one Cortex-A53 core. Yet the performance seems to be enough to build the RAID:

Code:
%Cpu(s):  0.7 us, 44.6 sy,  0.0 ni, 49.4 id,  0.0 wa,  0.0 hi,  5.2 si,  0.0 st

At least I haven't gotten any SMP error yet, so I am optimistic this will work...

Suggestions for less drastic and more performant workarounds (e.g. enabling one of the A72 cores) welcome!
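For anyone who wants to experiment with a reduced set of cores without rebooting, Linux exposes CPU hotplug through sysfs. The following is a minimal Python sketch of that mechanism, not something taken from the linked threads. The helper names are made up for illustration, and which CPU numbers map to the A72 cores is an assumption that should be checked on the actual board (e.g. in /proc/cpuinfo).

```python
from pathlib import Path

# Standard sysfs location for CPU hotplug state on Linux.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,5' into [0, 1, 2, 3, 5]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus():
    """Return the CPU numbers the kernel currently reports as online."""
    return parse_cpu_list((CPU_SYSFS / "online").read_text())


def set_cpu_online(cpu, online):
    """Bring a CPU online or offline. Needs root; some CPUs may refuse."""
    (CPU_SYSFS / f"cpu{cpu}" / "online").write_text("1" if online else "0")


# Hypothetical usage, assuming cpu4 is one of the A72 cores:
#   for cpu in online_cpus():
#       if cpu not in (0, 4):
#           set_cpu_online(cpu, False)
```

Unlike nosmp, this takes effect at runtime and is undone by a reboot, so it is easy to try and easy to revert.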

Bad news and correction - I encountered the same issue now even with SMP disabled :-(

Code:
kernel:[  922.683235] Internal error: synchronous external abort: 96000210 [#1] SMP

Message from syslogd@debian at Feb 27 00:32:22 ...
kernel:[  922.711924] Code: b8615881 340001c1 8b21c061 8b010001 (b9400021)
Message from syslogd@debian at Feb 27 00:32:22 ...
kernel:[  922.683235] Internal error: synchronous external abort: 96000210 [#1] SMP

Message from syslogd@debian at Feb 27 00:32:22 ...
kernel:[  922.711924] Code: b8615881 340001c1 8b21c061 8b010001 (b9400021)

Details in dmesg:

Code:
[  922.696664] CPU: 0 PID: 171 Comm: scsi_eh_1 Not tainted 5.10.0-3-arm64 #1 Debian 5.10.13-1
[  922.697415] Hardware name: Pine64 RockPro64 v2.1 (DT)                                  
[  922.697888] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)                        
[  922.698456] pc : ahci_scr_read+0x50/0x90 [libahci]                                      
[  922.698935] lr : sata_scr_read+0x7c/0xa0 [libata]                                      
[  922.699371] sp : ffff8000120cbab0      
[  922.699684] x29: ffff8000120cbab0 x28: ffff0000f0e50000                                
[  922.700185] x27: ffff0000f0e52368 x26: ffff0000f0e523e0                                  
[  922.700686] x25: 0000000000000000 x24: 0000000000000000                                  
[  922.701187] x23: 0000000000000001 x22: ffff0000f0e539b8                
[  922.701688] x21: ffff0000f0e523e0 x20: ffff0000f0e52040                                  
[  922.702189] x19: ffff0000f0e52440 x18: 0000000000000000                                  
[  922.702689] x17: 0000000000000000 x16: 0000000000000000                                  
[  922.703189] x15: 0000000000000000 x14: 0000000000000000                                  
[  922.703689] x13: 0000000000000000 x12: 0000000000000000                                  
[  922.704189] x11: 0000000000000000 x10: 0000000000000000                                  
[  922.704690] x9 : ffff800008ea46fc x8 : 0000000000000000                                  
[  922.705191] x7 : ffff0000f0e52040 x6 : ffff8000120cbb44                                
[  922.705691] x5 : 0000000000000001 x4 : ffff800008e30078                                  
[  922.706191] x3 : 0000000000000180 x2 : ffff8000120cbb44                                  
[  922.706692] x1 : ffff800011e5d1b0 x0 : ffff800011e5d000                                  
[  922.707192] Call trace:                                                                  
[  922.707440]  ahci_scr_read+0x50/0x90 [libahci]                                          
[  922.707881]  ata_eh_link_autopsy+0x8c/0xb4c [libata]                                    
[  922.708368]  ata_eh_autopsy+0x40/0x144 [libata]                                          
[  922.708817]  sata_pmp_error_handler+0x48/0x930 [libata]                                
[  922.709308]  ahci_error_handler+0x4c/0x90 [libahci]                                    
[  922.709786]  ata_scsi_port_error_handler+0x2a4/0x744 [libata]                          
[  922.710343]  ata_scsi_error+0xa4/0xec [libata]                                          
[  922.710786]  scsi_error_handler+0xc0/0x5d0 [scsi_mod]                                  
[  922.711261]  kthread+0x130/0x134        
[  922.711574]  ret_from_fork+0x10/0x38                                                    
[  922.711924] Code: b8615881 340001c1 8b21c061 8b010001 (b9400021)                        
[  922.712486] ---[ end trace 687bc2ded22b1d30 ]---
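When comparing several failed runs, it can help to pull just the abort line and the faulting function out of the kernel log. The following is a rough Python sketch for that; the function name and regular expressions are illustrative and only assume the log layout shown in the traces above.

```python
import re

# Matches e.g. "[  922.683235] Internal error: synchronous external abort: 96000210"
ABORT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+Internal error: synchronous external abort: ([0-9a-f]+)"
)
# Matches e.g. "[  922.698456] pc : ahci_scr_read+0x50/0x90 [libahci]"
PC_RE = re.compile(r"\[\s*\d+\.\d+\]\s+pc : (\S+)")


def summarize_aborts(log_text):
    """Return ([(timestamp, abort_code), ...], [faulting_pc, ...]) from dmesg text."""
    aborts = [(float(ts), code) for ts, code in ABORT_RE.findall(log_text)]
    pcs = PC_RE.findall(log_text)
    return aborts, pcs


# Hypothetical usage with a saved log file:
#   aborts, pcs = summarize_aborts(open("dmesg.txt").read())
#   print(aborts, pcs)
```

For the trace above, the faulting location is ahci_scr_read in libahci, reached from the SATA error handler.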
#2
One more update: After I patched the dtb file that uses the faster gen2 link speed, my resync has been progressing for half an hour without issues so far (before, it usually failed after approx. 5-10 minutes).

Marking as solved, since this does the trick for me.
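To watch whether a resync keeps progressing, the status line in /proc/mdstat can be polled. The following is a small Python sketch that parses it; the function and field names are illustrative, and the regular expression assumes the usual mdstat progress line format (action, percentage, finish= and speed= fields).

```python
import re

# An array header line starts with the md device name, e.g. "md0 : active raid1 ..."
DEVICE_RE = re.compile(r"^(md\S+)\s*:")
# Progress line, e.g. "... resync = <pct>% (...) finish=<min>min speed=<n>K/sec"
PROGRESS_RE = re.compile(
    r"(resync|recovery|reshape|check)\s*=\s*([\d.]+)%.*?"
    r"finish=([\d.]+)min\s+speed=(\d+)K/sec"
)


def sync_progress(mdstat_text):
    """Map each md device with a running sync action to its progress figures."""
    results = {}
    device = None
    for line in mdstat_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            device = header.group(1)
            continue
        progress = PROGRESS_RE.search(line)
        if progress and device:
            results[device] = {
                "action": progress.group(1),
                "percent": float(progress.group(2)),
                "finish_min": float(progress.group(3)),
                "speed_k_per_s": int(progress.group(4)),
            }
    return results


# Hypothetical usage:
#   print(sync_progress(open("/proc/mdstat").read()))
```

An empty result means no array is currently resyncing, either because the resync finished or because it was interrupted.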