Time drift issues
#1
Some users and I have experienced time drift on some SOPINE modules. It manifests as the date randomly jumping into the future, for example to around the year 2114.

Some system output:


Code:
root@master:~# date
Fri Nov 30 15:16:29 UTC 2114
root@master:~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01
root@master:~# hwclock
2019-10-10 23:45:40.014568+00:00


This was discussed before in another thread (original), but I've decided to start a new thread because the original describes a different problem.


EDIT1: Strangely, the drift is consistent between nodes. Here is the output of another affected node:


Code:
root@worker1:~# date
Fri Nov 30 15:33:07 UTC 2114
root@worker1:~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01
root@worker1:~# hwclock
2019-10-10 23:54:15.532921+00:00


The actual hardware clock shows a different time than the master, but the 'fake' hardware clock time is exactly the same.
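
Since the hardware RTC above still shows a sane 2019 time, a minimal recovery sketch (assuming the Debian fake-hwclock package; its save action may differ elsewhere) is to set the system clock from the RTC and then overwrite fake-hwclock's saved timestamp so the 2114 date doesn't come back on the next boot:

Code:
# set the system clock from the RTC (equivalent to hwclock --hctosys)
hwclock -s
# overwrite the bad value stored in /etc/fake-hwclock.data with the corrected time
fake-hwclock save
# confirm that all three now agree
date; hwclock; cat /etc/fake-hwclock.data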
#2
(10-17-2019, 06:30 AM)Unkn0wn Wrote: Thanks for your help. Would you say running hwclock -s is a reliable temporary solution?
Those kernel messages are interesting though.

Moved from the linked thread; see it for background details. Unfortunately, I've had varied results with hwclock -s. When it works it's better than nothing, but it can throw the clock a few microseconds in either direction, and software may not like that. However, while module 1 appears to be completely cured, module 2 is still going down. If we treat the PSU as a changed parameter, then I believe the serial wires attached to module 1 may be affecting this as well. If that's true, then it's very possibly a hardware issue with the board itself: changing the PSU and having dangling wires could change the noise and parasitic capacitance, causing the SoC to misbehave. I'm not an EE and lack the tools to properly investigate that line of thinking.

I've updated module 2 to use hwclock -w; hwclock -s. This should minimize clock jumps by first writing the system time to the RTC and then setting the system time back from it, but it's only been 24 hours. Next time it goes down, I'm pulling the serial wires, moving them to module 2, and observing what happens.
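
For reference, a minimal sketch of running that pair periodically via cron; the interval and the cron.d file name are assumptions for illustration, not necessarily how it's deployed here:

Code:
# /etc/cron.d/rtc-resync  (hypothetical file name)
# every 10 minutes: write the system time to the RTC, then set the system time
# back from it, so the two clocks never drift far apart
*/10 * * * * root /sbin/hwclock -w && /sbin/hwclock -s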
#3
(10-17-2019, 09:16 AM)venix1 Wrote: [...] Changing the PSU and having dangling wires could change the noise and parasitic capacitance, causing the SoC to misbehave. I'm not an EE and lack the tools to properly investigate that line of thinking.

I'm no EE either, but shouldn't the electrical noise from the PSU be filtered out? Anyway, I had varying results as to which node went haywire first.

In your other comment you said this:

Quote:However, in my case the time jump results in a network outage so I believe both issues are symptoms of the same underlying root problem.

When one of my nodes is affected, it does remain accessible over the network. I'm able to SSH in, and it still has a valid IP address. The OS more or less keeps working; just everything that uses time stops working (certificates, apt, SSL, Kubernetes). Are you completely unable to access a node over the network?
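
While a node is in this state but still reachable, a rough sketch for quantifying how far off it is (plain HTTP is used because TLS validation fails with a clock in 2114; the mirror hostname is only an example):

Code:
# the node's idea of UTC
date -u
# compare against the Date header of any plain-HTTP server
curl -sI http://deb.debian.org | grep -i '^date:'
# if ntpdate happens to be installed, query-only mode reports the offset without changing anything
ntpdate -q pool.ntp.org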
#4
(10-17-2019, 10:01 AM)Unkn0wn Wrote: I'm no EE either, but shouldn't the electrical noise from the PSU be filtered out? Anyway, I had varying results as to which node went haywire first.
My basic understanding is that component quality largely determines PSU quality. I would expect a modern ATX power supply to be better built, with better filtering and a cleaner output than the $15 brick recommended for use. As for the order, mine did not deviate from it until I began playing with the RTC and system clocks. I also didn't let more than 2 nodes go down.

(10-17-2019, 10:01 AM)Unkn0wn Wrote: Are you completely unable to access a node over the network?

Completely unable to access it. I was also unable to reach the network from the node. My standard test is a simple ping. I haven't looked for ARPs on the node side; I'll try that the next time I get a serial console on it when it goes down.
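
For when I do get a console on a downed node, a rough sketch of the checks from the node side (the interface name eth0 is an assumption):

Code:
ip -s link show eth0        # link state and error/drop counters
ip neigh show dev eth0      # current ARP/neighbour table
ping -c 3 <gateway-ip>      # does anything make it off the node at all?
tcpdump -ni eth0 arp        # watch for ARP requests/replies on the wire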

EDIT 1: Moved the serial console from module 1 to module 2. Module 1 is losing networking again, after a week of stability. I don't believe that's a coincidence. The serial console is connected to a low-cost CH340 TTL USB dongle. The USB connector isn't plugged in, but an onboard LED is lit by the serial console, so power is flowing through it.
#5
This appears to be a known and long-running issue.

https://github.com/torvalds/linux/commit...ff7e1d975c

You can check the SoC with this script by Andre Przywara. For me, 2 of the 7 modules have the problem, and the script reports time skew issues on exactly those 2.

pine64so-cluster2 will time-jump within a couple of days.
Code:
pine64so-cluster2 | CHANGED | rc=0 >>
TAP version 13
# number of cores: 4
ok 1 same timer frequency on all cores
# timer frequency is 24000000 Hz (24 MHz)
# time1: 93dc77ff9, time2: 93dc741ff, diff: -15866
# time1: 93de37dff, time2: 93de37c06, diff: -505
# time1: 93de8dff8, time2: 93de8de00, diff: -504
# time1: 93dee1ff9, time2: 93dee1e00, diff: -505
# time1: 93dfd1ff9, time2: 93dfd1e00, diff: -505
# time1: 93e061ff9, time2: 93e061e00, diff: -505
# time1: 93e09aff9, time2: 93e09ae00, diff: -505
# time1: 93e0fd3f9, time2: 93e0fd200, diff: -505
# time1: 93e116bf9, time2: 93e116a00, diff: -505
# time1: 93e145ff9, time2: 93e145e00, diff: -505
# time1: 93e157ff9, time2: 93e1541ff, diff: -15866
# time1: 93e1f6ff9, time2: 93e1f6e00, diff: -505
# time1: 93e283ff9, time2: 93e2801ff, diff: -15866
# time1: 93e32bff9, time2: 93e3281ff, diff: -15866
# time1: 93e3c1ff9, time2: 93e3c1e00, diff: -505
# time1: 93e3d3ff9, time2: 93e3d01ff, diff: -15866
# too many errors, stopping reports
not ok 2 native counter reads are monotonic # 166 errors
# min: -15866, avg: 6, max: 6307335
# diffs: -660791, -20792, -20750, -660833, -20792, -20750, -20750, -660833, -20792, -20792, -20792, -20792, -20792, -20750, -660833, -20792
# too many errors, stopping reports
not ok 3 Linux counter reads are monotonic # 661 errors
# min: -88042125, avg: 533, max: 661917
# core 0: counter value: 40082686052 => 1670 sec
# core 0: offsets: back-to-back: 8, b-t-b synced: 11, b-t-b w/ delay: 8
# core 1: counter value: 40082687216 => 1670 sec
# core 1: offsets: back-to-back: 10, b-t-b synced: 6, b-t-b w/ delay: 13
# core 2: counter value: 40082688217 => 1670 sec
# core 2: offsets: back-to-back: 8, b-t-b synced: 6, b-t-b w/ delay: 8
# core 3: counter value: 40082689151 => 1670 sec
# core 3: offsets: back-to-back: 8, b-t-b synced: 6, b-t-b w/ delay: 8
1..3

pine64so-cluster3 has never experienced this jump.
Code:
pine64so-cluster3 | CHANGED | rc=0 >>
TAP version 13
# number of cores: 4
ok 1 same timer frequency on all cores
# timer frequency is 24000000 Hz (24 MHz)
ok 2 native counter reads are monotonic # 0 errors
# min: 7, avg: 7, max: 4541
ok 3 Linux counter reads are monotonic # 0 errors
# min: 541, avg: 550, max: 213417
# core 0: counter value: 40118769909 => 1671 sec
# core 0: offsets: back-to-back: 10, b-t-b synced: 13, b-t-b w/ delay: 11
# core 1: counter value: 40118770884 => 1671 sec
# core 1: offsets: back-to-back: 10, b-t-b synced: 9, b-t-b w/ delay: 10
# core 2: counter value: 40118772200 => 1671 sec
# core 2: offsets: back-to-back: 13, b-t-b synced: 8, b-t-b w/ delay: 10
# core 3: counter value: 40118773120 => 1671 sec
# core 3: offsets: back-to-back: 9, b-t-b synced: 8, b-t-b w/ delay: 11
1..3

I haven't been able to find any activity related to this since the patch went mainline, other than additional reports that the issue still isn't resolved.
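
A hedged way to check whether a given kernel build actually carries that workaround (this assumes the Kconfig symbol and boot log text haven't changed since the patch landed):

Code:
# is the Allwinner A64 timer erratum workaround compiled in?
zgrep SUN50I_ERRATUM_UNKNOWN1 /proc/config.gz 2>/dev/null \
  || grep SUN50I_ERRATUM_UNKNOWN1 /boot/config-$(uname -r)
# the arch timer driver logs at boot when an erratum workaround is enabled
dmesg | grep -i 'arch_timer.*workaround'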
#6
This may finally have been resolved with Armbian Buster and the 5.4 kernel. Since upgrading the SOPINEs from Stretch, the issue has not resurfaced.

Code:
pine64so-cluster2 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster2 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:16 up 9 days, 21:09,  1 user,  load average: 1.81, 1.16, 0.83

pine64so-cluster3 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster3 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:16 up 9 days, 21:09,  1 user,  load average: 0.78, 0.63, 0.61

pine64so-cluster1 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster1 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:16 up 9 days, 21:09,  1 user,  load average: 0.39, 0.48, 0.53

pine64so-cluster4 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster4 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:16 up 9 days, 21:09,  1 user,  load average: 1.46, 0.90, 0.77

pine64so-cluster0 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster0 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:16 up 9 days, 21:09,  1 user,  load average: 2.24, 2.57, 2.54

pine64so-cluster5 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster5 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:18 up 9 days, 21:09,  1 user,  load average: 0.57, 0.94, 1.14

pine64so-cluster6 | CHANGED | rc=0 >>
Description: Debian GNU/Linux 10 (buster)
Linux pine64so-cluster6 5.4.14-sunxi64 #rc1 SMP Sat Jan 25 15:46:08 CET 2020 aarch64 GNU/Linux
14:56:18 up 9 days, 21:09,  1 user,  load average: 0.31, 0.50, 0.54

Prior to the upgrade, 2 days of uptime would have been a miracle for pine64so-cluster1 and pine64so-cluster2.

