BETA Edition RCU errors
#1
As required by Pine64 technical support, I am posting here to see whether anyone else has this problem and knows a solution. Ticket ##9675##


Quote:    Please be inform you are require to post your question at PINE64 forum, http://forum.pine64.org and there is forum user that has same experience and provide answer to you.
    Thank you.
    Regards,
    support team #4...


I am having random lockups that freeze the phone for 5-10 minutes at a time. In dmesg there are RCU errors that correspond to these lockups.
Occasionally the clock jumps forward to the year 2116; sometimes the phone freezes until it is power-cycled, or it reboots itself. This happens 2-3 times a day.
Here are the errors from Manjaro Phosh. https://gist.github.com/8bitgc/84a68844c...1107-L1244
I have also tried Mobian with similar results.

Pine64 tech support had me try Manjaro Plasma, with similar results: https://gist.github.com/8bitgc/691edf69b...-L502-L855

Someone with similar issues: https://forum.pine64.org/showthread.php?tid=10354
It is not a timezone problem; I have mine set manually. A wrong timezone would put the clock off by at most about a day, not by ~100 years.

Another time traveler: https://social.librem.one/@chrichri/105179322289326690

I suspect this is a hardware problem as I have another PinePhone that does not do this, running from the same Manjaro Phosh SD card.

There is a known problem with the timer in the A64 SoC: https://forum.armbian.com/topic/7423-pin...k-problem/
There are kernel patches that try to work around the timer issue, but I suspect that some A64 SoCs are especially out of spec; otherwise everyone would be having these issues.

8bit
#2
Some users have reported "time travel" or general instability issues when the DRAM inside the PinePhone is configured to run at a higher frequency.  I would suggest that you try lowering the DRAM frequency and see if the issues go away.

Manjaro ARM builds and packages a few variants of U-Boot for the PinePhone, which allow different DRAM frequencies (492, 528, 552, 592 and 624 MHz) to be easily tried out.  Just write the desired U-Boot variant to your boot device, reboot, and test.
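For anyone who prefers a script to dd, here is a minimal Python sketch of that write step. It assumes the usual Allwinner layout, where the combined SPL+U-Boot image is placed 8 KiB into the SD card or eMMC; the helper name write_uboot is hypothetical, the exact file names of the Manjaro variants are not given here, and writing to the wrong device will destroy data, so double-check the target.

```python
import os

# ASSUMPTION: the A64 boot ROM loads the combined SPL+U-Boot image from byte
# offset 8192 (8 KiB) of the SD card/eMMC. write_uboot is a hypothetical helper.
SUNXI_SPL_OFFSET = 8 * 1024


def write_uboot(image_path: str, device_path: str, offset: int = SUNXI_SPL_OFFSET) -> int:
    """Copy image_path into device_path at `offset`; return the bytes written."""
    with open(image_path, "rb") as src:
        data = src.read()
    # "r+b" writes in place without truncating, so the partition table in the
    # first sectors and everything after the image are left untouched.
    with open(device_path, "r+b") as dst:
        dst.seek(offset)
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the data reaches the card before reboot
    return len(data)
```

Run it as root with the chosen image and the whole-disk device (not a partition), then reboot and test as described above.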

See also this thread for further information.
#3
(05-03-2021, 11:12 PM)dsimic Wrote: Some users have reported "time travel" or general instability issues when the DRAM inside the PinePhone is configured to run at a higher frequency.  I would suggest that you try lowering the DRAM frequency and see if the issues go away.

Thanks. Testing it now at 492MHz on Manjaro Plasma.
Code:
sudo cat /sys/kernel/debug/clk/dram/clk_rate
984000000
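The clk_rate reported above is exactly twice the 492 MHz being tested. Assuming that factor of two also holds for the other U-Boot variants (an assumption, not something confirmed in this thread), a small Python helper can turn the debugfs value into the nominal DRAM frequency. The helper names are hypothetical, and reading debugfs needs root.

```python
from pathlib import Path

# ASSUMPTION: the debugfs "dram" clk_rate is twice the nominal DRAM frequency,
# as in the 984000000 reading for the 492 MHz build above. Needs root to read.
DRAM_CLK_RATE = Path("/sys/kernel/debug/clk/dram/clk_rate")


def dram_mhz_from_clk_rate(clk_rate_hz: int) -> float:
    """Convert the raw clk_rate (Hz) to the nominal DRAM frequency in MHz."""
    return clk_rate_hz / 2 / 1_000_000


def read_dram_mhz(path: Path = DRAM_CLK_RATE) -> float:
    """Read clk_rate (a decimal number plus newline) and convert it."""
    return dram_mhz_from_clk_rate(int(path.read_text().strip()))
```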
#4
I ran it overnight, and it had 3 separate RCU errors while running at 492MHz.
https://gist.github.com/8bitgc/05a893684...-L500-L948

It may be related to RAM speed, but this board is still unstable at 492MHz. Either the SoC is bad or the RAM is bad.

Interestingly, this time there was also an error with the SD card: https://gist.github.com/8bitgc/05a893684...s-txt-L499
That is probably why it sometimes locks up completely, depending on which devices are being accessed when the time changes.
#5
You could also try lowering the maximum frequency of the CPU cores and applying more restrictive thermal throttling to the CPU.  It's worth trying to see whether that makes the issue go away.  If the issue does turn out to be related to thermals, though, that would still point to a defective SoC as the root cause.

Edit: You could also try taking each of the four CPU cores offline in turn, or leaving only one of them online, to see whether a particular CPU core causes the issue.  That would also just pinpoint a defective SoC, but it would provide very good evidence for requesting a replacement board from Pine64.
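One way to do this is through the standard CPU hotplug files in sysfs (/sys/devices/system/cpu/cpuN/online). Below is a minimal Python sketch, to be run as root, assuming four cores numbered cpu0 to cpu3. On some kernels cpu0 has no online file and cannot be taken offline; the sketch simply skips such cores. The helper name keep_only_core is hypothetical.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def keep_only_core(keep: int, ncores: int = 4, root: Path = CPU_ROOT) -> list:
    """Bring core `keep` online and take the others offline (hypothetical helper).

    Returns the cores whose "online" file was written. Cores without an
    "online" file (often cpu0) cannot be hot-unplugged and are skipped.
    """
    changed = []
    # Handle the chosen core first so at least one core is always online.
    order = [keep] + [c for c in range(ncores) if c != keep]
    for cpu in order:
        online = root / f"cpu{cpu}" / "online"
        if not online.exists():
            continue
        online.write_text("1" if cpu == keep else "0")
        changed.append(cpu)
    return changed
```

For example, keep_only_core(2) leaves only cpu2 online (plus cpu0 if it cannot be unplugged). Writing "1" back to each online file brings all cores back.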
#6
Unfortunately, none of those suggestions are going to help. The problem has nothing to do with the timezone, and nothing to do with DRAM. There may be a relationship to CPU clocks or thermals, but if there is one, it is only a weak effect. The issue is a bug in the timer of the A64 SoC (all of them are affected to some degree, so a replacement board could behave the same). The timer tends to jump backwards, which the kernel (incorrectly) interprets as the counter wrapping around, so the date jumps 2^56/24000000 seconds (about 95 years) forward.
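To see where the ~95 years comes from, here is a simplified Python model. It is an illustration, not the kernel's code, and it assumes a 56-bit counter ticking at 24 MHz whose elapsed time is computed modulo 2^56, as the formula above implies. A read that lands slightly behind the previous one turns into an almost complete wraparound.

```python
# Simplified model (not kernel code): a 56-bit counter at 24 MHz, with elapsed
# ticks computed modulo 2**56 on the assumption the counter only moves forward.
COUNTER_BITS = 56
COUNTER_HZ = 24_000_000
MASK = (1 << COUNTER_BITS) - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def elapsed_ticks(prev: int, now: int) -> int:
    """Ticks between two reads; a backward step wraps to a huge value."""
    return (now - prev) & MASK


def jump_years(prev: int, now: int) -> float:
    return elapsed_ticks(prev, now) / COUNTER_HZ / SECONDS_PER_YEAR


# Example (hypothetical values): a read 0x800 ticks *behind* the previous one.
prev = 0x1234_5000
now = prev - 0x800
print(f"{elapsed_ticks(prev, now)} ticks = {jump_years(prev, now):.1f} years")
```

It reports a jump of about 95.1 years, consistent with a clock in 2021 landing in 2116.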

There is already a workaround in Linux to filter out those timer jumps. However, as you can plainly see, the workaround is insufficient. I have been aware of this for a while thanks to reports from users like yourself, but so far I had been unable to trigger any further jumps on my handful of A64 boards/devices, so I didn't know how to improve the workaround. I recently improved my testing tools, which can now catch some jumps on my PinePhone that the kernel workaround misses.
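For readers wondering what "filtering out timer jumps" means in practice, here is a simplified Python sketch of that kind of filter, modeled loosely on the general idea behind the mainline A64 workaround: reads whose low bits are all ones or all zeros are treated as possibly corrupted (those are the reads that can go wrong while higher bits roll over), and the counter is read again, with a bounded number of retries. This is an illustration under assumptions, not the kernel code; low_bits and max_retries are illustrative parameters, not values taken from the kernel source.

```python
from typing import Callable


def filtered_read(read_counter: Callable[[], int], low_bits: int = 9,
                  max_retries: int = 150) -> int:
    """Re-read the counter while its low bits look suspicious (illustrative)."""
    mask = (1 << low_bits) - 1
    val = read_counter()
    for _ in range(max_retries):
        # (val + 1) & mask is 0 when the low bits are all ones and 1 when they
        # are all zeros; anything larger means the read looks trustworthy.
        if ((val + 1) & mask) > 1:
            return val
        val = read_counter()
    return val  # bounded loop: give up and return the last read
```

Making the check catch more of the bad reads (for example by changing which low bits are examined) is the kind of adjustment a better workaround would tune, which is why data from a "worst-case" chip is useful.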

However, the likelihood of any one timer jump causing the date to skip forward (and thus an RCU stall) is incredibly low. Since you have been experiencing many of these, it appears you have close to a "worst-case" A64 chip as it relates to this bug. This is helpful, as making the workaround work for you will ensure it covers as many chips as possible.

Can you try running the tool at https://github.com/smaeul/timer-tools and paste the output here? On the PinePhone, you should be able to build it with "make". Then please try running it for an hour or so:
Code:
for i in $(seq 30); do ./build/target/src/timer_test -Cc; done

You'll want to do this with the phone plugged in, since it will use quite a bit of CPU. If there are any lines that say "Failed after XXXXX reads", that means something got past the kernel's filter.
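If you would rather collect the results from a script than scroll through the output, a Python sketch like this runs the same loop and picks out the failure lines. It relies only on what the post states (the binary path, the -Cc flags, and the "Failed after XXXXX reads" wording); it captures both stdout and stderr because the thread doesn't say which one the tool writes to. The function names are hypothetical.

```python
import re
import subprocess

FAIL_RE = re.compile(r"Failed after (\d+) reads")


def find_failures(lines):
    """Return the read count from every 'Failed after N reads' line."""
    return [int(m.group(1)) for line in lines if (m := FAIL_RE.search(line))]


def run_timer_test(runs=30, binary="./build/target/src/timer_test"):
    """Rough equivalent of the shell loop above; returns all output lines."""
    output = []
    for _ in range(runs):
        proc = subprocess.run([binary, "-Cc"], capture_output=True, text=True)
        output.extend((proc.stdout + proc.stderr).splitlines())
    return output
```

For example, find_failures(run_timer_test()) gives one entry per "Failed after" line; an empty list means nothing got past the filter during that session.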

Thanks!
#7
(05-10-2021, 12:41 AM)smaeul Wrote: Can you try running the tool at https://github.com/smaeul/timer-tools and paste the output here? On the PinePhone, you should be able to build it with "make". Then please try running it for an hour or so: "for i in $(seq 30); do ./build/target/src/timer_test -Cc; done". You'll want to do this with the phone plugged in, since it will use quite a bit of CPU. If there are any lines that say "Failed after XXXXX reads", that means something got past the kernel's filter.

Hi smaeul,

I had to add -pthread to compile the utility.

I ran this on 3 different devices that experience the RCU errors. The files are quite large:
https://github.com/8bitgc/timer-tools/ra...ut/beta.7z
https://github.com/8bitgc/timer-tools/ra...put/eb3.7z
https://github.com/8bitgc/timer-tools/ra...t/frank.7z

Let me know if there are more tests to run.
#8
@smaeul 
Setting the max frequency to the minimum frequency eliminates almost all the errors.
Code:
echo 648000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq

This only seems to work reliably at 648MHz. I tried setting both min and max to 816MHz and the errors returned, though less often than when allowing the full frequency range.
Code:
echo 816000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo 816000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
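The same settings can be scripted. Below is a minimal Python sketch, to be run as root, that writes equal min and max limits in kHz. Like the commands above, it assumes that all four cores are governed by policy0. The helper name pin_frequency is hypothetical. The writes are ordered so that min never exceeds max in between, in case the kernel rejects such an intermediate state.

```python
from pathlib import Path

POLICY0 = Path("/sys/devices/system/cpu/cpufreq/policy0")


def pin_frequency(khz: int, policy: Path = POLICY0) -> None:
    """Set scaling_min_freq and scaling_max_freq to the same value (kHz)."""
    cur_max = int((policy / "scaling_max_freq").read_text())
    # Raising above the current max: move max up first. Otherwise move min
    # first. Either way min <= max holds after every single write.
    if khz > cur_max:
        order = ("scaling_max_freq", "scaling_min_freq")
    else:
        order = ("scaling_min_freq", "scaling_max_freq")
    for name in order:
        (policy / name).write_text(str(khz))
```

pin_frequency(648000) corresponds to the 648MHz setting above, and pin_frequency(816000) to the 816MHz one.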
#9
Yes, the frequency of the errors depends greatly on the CPU clock rate. A lower CPU clock seems to be better, though there's no obvious pattern. It also varies greatly for me from run to run; I added random delays to the tool to help compensate for that.

I sent a patch to Linux based on your data (thanks again!): https://lore.kernel.org/linux-sunxi/2021...and.org/T/

If you are able to recompile your kernel, please try it with that patch and see if your issues are resolved. If so, feel free to respond to that email with:

Quote: Reported-and-tested-by: Your Name <your@email>

You don't have to by any means, though (if you prefer to stay pseudonymous).
#10
(05-13-2021, 10:53 AM)8bit Wrote: Setting the max frequency to the minimum frequency eliminates almost all the errors.

As a note, you can do the same using the "powersave" CPU governor.  It's a bit easier than setting the frequencies.
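The powersave governor keeps the CPU at the policy's minimum frequency (scaling_min_freq), which is why it has the same effect. Here is a minimal Python sketch of switching to it through sysfs, to be run as root. The helper name set_governor is hypothetical, and it assumes policy0 covers all cores, as in the commands earlier in the thread.

```python
from pathlib import Path

POLICY0 = Path("/sys/devices/system/cpu/cpufreq/policy0")


def set_governor(name: str = "powersave", policy: Path = POLICY0) -> str:
    """Select `name` as the cpufreq governor; return the previous governor."""
    available = (policy / "scaling_available_governors").read_text().split()
    if name not in available:
        raise ValueError(f"governor {name!r} not available: {available}")
    previous = (policy / "scaling_governor").read_text().strip()
    (policy / "scaling_governor").write_text(name)
    return previous
```

Because it returns the previous governor, passing that value back to set_governor restores the original setting.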