Time drift issues
#1
Several users, myself included, have experienced time drift on some SOPINE modules. It manifests as the date randomly jumping into the future, for example to around the year 2114.

Some system output:


Code:
[email protected]:~# date
Fri Nov 30 15:16:29 UTC 2114
[email protected]:~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01
[email protected]:~# hwclock
2019-10-10 23:45:40.014568+00:00


This was discussed before in another thread (original), but I've started a new thread because the original describes a different problem.


EDIT1: Strangely, the drift is consistent between nodes. Here is the output of another affected node:


Code:
[email protected]:~# date
Fri Nov 30 15:33:07 UTC 2114
[email protected]:~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01
[email protected]:~# hwclock
2019-10-10 23:54:15.532921+00:00


The actual hardware clock shows a different time than on the master node, but the fake-hwclock time is exactly the same.
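For anyone wanting to catch this early, a minimal sketch of a drift check (the `clock_skew` helper and the one-hour threshold are my own assumptions, not part of fake-hwclock or any existing tool):

```shell
#!/bin/sh
# clock_skew SYS_EPOCH RTC_EPOCH: print the absolute difference in seconds.
clock_skew() {
    d=$(( $1 - $2 ))
    [ "$d" -lt 0 ] && d=$(( -d ))
    echo "$d"
}

# In practice (needs root so hwclock can read the RTC):
#   skew=$(clock_skew "$(date -u +%s)" "$(date -u -d "$(hwclock)" +%s)")
#   [ "$skew" -gt 3600 ] && echo "system clock and RTC differ by ${skew}s"
```

Run from cron, something like this could at least alert before certificates and apt start failing.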
#2
(10-17-2019, 06:30 AM)Unkn0wn Wrote: Thanks for your help. Would you say running hwclock -s is a reliable temporary solution?
Those kernel messages are interesting though.

Moved from link; see it for background details.  Unfortunately, I've had varied results with hwclock -s. When it works it's better than nothing, but it can throw the clock a few microseconds in either direction, and software may not like that.  However, while module 1 appears to be completely cured, module 2 is still going down.  If we treat the PSU as one changed parameter, then I believe the serial wires attached to module 1 may be affecting this as well.  If that's true, it's quite possibly a hardware issue with the board itself: changing the PSU and having dangling wires could change the noise and parasitic capacitance, causing the SoC to misbehave. I'm not an EE and lack the tools to properly investigate that line of thinking.

I've updated module 2 to use hwclock -w; hwclock -s.  This should minimize clock jumps by first saving the system time to the RTC and then loading it straight back, but it's only been 24 hours.  Next time it goes down, I'm pulling the serial wires, moving them to module 2, and observing what happens.
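For reference, the save-then-load pairing could be wired into cron with something like this (the file path and five-minute interval are my guesses, not what venix1 actually uses):

```shell
# /etc/cron.d/rtc-sync (hypothetical): periodically write the system clock
# to the RTC and immediately read it back, so any later restore from the
# RTC starts from a recently saved value and the visible jump stays small.
*/5 * * * * root hwclock -w && hwclock -s
```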
#3
(10-17-2019, 09:16 AM)venix1 Wrote: [...] Changing PSU and having dangling wires could change the noise and parasitic capacitance causing the SoC to misbehave. [...]

I'm no EE either, but shouldn't the electrical noise from the PSU be filtered out? Anyway, I had varying results as to which node went haywire first.

In your other comment you said this:

Quote:However, in my case the time jump results in a network outage so I believe both issues are symptoms of the same underlying root problem.

When one of my nodes is affected, it does remain accessible over the network. I'm able to SSH in, and it still has a valid IP address. The OS more or less keeps working; it's just that everything depending on time stops working (certificates, apt, SSL, Kubernetes). Are you completely unable to access a node over the network?
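Since only the time-dependent services die, one stopgap (purely my own sketch; the year bounds are arbitrary assumptions) would be a cron-run sanity check that restores from the RTC whenever the date is obviously implausible:

```shell
#!/bin/sh
# Hypothetical sanity check: restore the system clock from the RTC when
# the year is obviously wrong (e.g. the 2114 jump above).
year_is_sane() {
    # Bounds are assumptions: the thread's year and a generous upper limit.
    [ "$1" -ge 2019 ] && [ "$1" -le 2060 ]
}

if ! year_is_sane "$(date -u +%Y)"; then
    echo "implausible system date $(date -u), restoring from RTC" >&2
    hwclock -s   # only helps while the RTC itself is still sane
fi
```

This obviously does nothing if the RTC has drifted too, but in the output above the RTC was still on 2019 while the system clock was in 2114, so it would have caught this case.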
#4
(10-17-2019, 10:01 AM)Unkn0wn Wrote: I'm no EE either, but shouldn't the electrical noise from the PSU be filtered out? Anyway, I had varying results with what node went haywire first.
My basic understanding is that component quality largely determines a PSU's output quality. I would expect a modern ATX power supply to be better built, with better filtering and a cleaner output, than the $15 brick recommended for use.  As for the order, mine didn't deviate from it until I began playing with the RTC and system clocks. I also never let more than 2 nodes go down.

(10-17-2019, 10:01 AM)Unkn0wn Wrote: Are you completely unable to access a node over the network?

Completely unable to access it.  I was also unable to reach the network from the node. My standard test is a simple ping. I haven't looked for ARPs on the node side; I'll try that next time I have a serial console attached when it goes down.

EDIT 1: Moved the serial console from module 1 to module 2.  Module 1 is losing networking again, after a week of stability. I don't believe that's a coincidence.  The serial console is a low-cost CH340 USB-TTL dongle.  Its USB connector isn't plugged in, but an onboard LED is lit by the serial lines, so power is flowing through it.