Network problems (actually bad power supply)
#1
But different from the problems I've read about so far.

Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week only 2 of the 7 modules are reachable over the network. All modules have a static IP configured locally; I have also tried DHCP with a static IP assigned on the gateway. Here is a ping to an "off-line" module:


Code:
C:\WINDOWS\system32>ping 10.10.10.70

Pinging 10.10.10.70 with 32 bytes of data:
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.

Ping statistics for 10.10.10.70:
   Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),

C:\WINDOWS\system32>
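
Note that the statistics above can be misleading: every reply came from 169.254.1.1, not from 10.10.10.70, and Windows ping counts such "Destination host unreachable" messages as received packets, so "0% loss" does not mean the module answered. Below is a minimal Python sketch of how a monitoring script might count only genuine replies. The function name and the sample text are made up for illustration.

```python
import re

def count_real_replies(ping_output: str, target: str) -> int:
    """Count replies that came from `target` itself and are not error reports."""
    real = 0
    for line in ping_output.splitlines():
        # Windows ping prints "Reply from <address>: <details>"
        match = re.match(r"\s*Reply from (\S+): (.*)", line)
        if not match:
            continue
        source, details = match.group(1), match.group(2)
        # A reply from another address, or an "unreachable" report,
        # does not prove that the target is alive.
        if source == target and "unreachable" not in details.lower():
            real += 1
    return real

# Illustrative sample shaped like the output above
sample = (
    "Pinging 10.10.10.70 with 32 bytes of data:\n"
    "Reply from 169.254.1.1: Destination host unreachable.\n"
    "Reply from 169.254.1.1: Destination host unreachable.\n"
)
print(count_real_replies(sample, "10.10.10.70"))
```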

I do not know whether the system is actually running. What kind of serial interface should I be using to debug this?
The extra software I am running on the modules is docker (18.06.3~ce~3-0~debian), containerd.io (latest) and kubernetes (1.15.2).


p.s.
The system date on some of the modules (as reported by the date command) randomly jumps to October 2119, but the hardware clock (hwclock) shows the correct time. This is resolved after a reboot. It is unknown whether this has something to do with the network; I just thought it was important enough to mention.

EDIT: After having physical access I think I've found the issue. I believe it is twofold:
1. Misconfigured switch and gateway. Couldn't prove this one, but I believe it caused a part of the issues I had.
2. Faulty PSU. I use the 5V 15A PSU from the Pine64 store. On top of making poor contact between the AC cord and the adapter, it is faulty and not able to sustain all modules under full load. I've asked for a replacement unit.

EDIT2: Ordered an ATX PSU, will see if that goes better.

EDIT3: With an ATX power supply the modules are working perfectly, even under heavy load and for sustained periods of time. However, I still have the time drift issue.
#2
Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day and all seem to be staying up ok.
#3
(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day and all seem to be staying up ok.

I'll try that, thanks.
#4
Just thought I would update you. I've not had a node go down since I made the changes, and after an unexpected power-off they all came back up as well. Reboots still don't work :S
#5
Thank you for your update, that's really good to hear. I'm still waiting for a new power supply and will update this thread accordingly.
#6
(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day and all seem to be staying up ok.

Had any problems since then? I installed 2 AA batteries in the RTC slot, but the time drift still occurs.

Now I've installed chrony and will test if this improves things.
#7
(08-26-2019, 03:00 PM)Unkn0wn Wrote: Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week only 2 of the 7 modules are reachable over the network. All modules have a static IP configured locally; I have also tried DHCP with a static IP assigned on the gateway. Here is a ping to an "off-line" module:

p.s.
The system date on some of the modules (as reported by the date command) randomly jumps to October 2119, but the hardware clock (hwclock) shows the correct time. This is resolved after a reboot. It is unknown whether this has something to do with the network; I just thought it was important enough to mention.

Just want to add that my clusterboard has the exact same symptoms and is also using the recommended PSU. The date is slightly different but still in the far future: module 1 was at 2210-01-15 09:14:01 and module 2 at 2210-01-14 03:14:01, so they're surprisingly consistent. Such consistency is not usually chance, so I think the current time probably factors in. For reference, I number the modules 0-6, with the eMMC-capable module being 0.

To add more to this, the failure always happens in the same order: first module 1 goes, followed by module 2 (I haven't let it get further than that). I have since enabled the RTC and chronyd, and the issue became less frequent. I got a couple of days instead of a few hours, but it still occurred, and in the same order. I believe the delayed onset may have been due to the update interval of chronyd, which can be seen with chronyc tracking. Tracking this on module 1 showed that it reached the maximum interval before the problem occurred.
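
For anyone who wants to log that value over time, here is a minimal Python sketch that pulls the "Update interval" line out of chronyc tracking output. The function names and the sample text are hypothetical, and the sample value is invented for illustration, not taken from a real node.

```python
import re
import subprocess

def update_interval_seconds(tracking_text: str):
    """Return the 'Update interval' value (in seconds) from
    `chronyc tracking` output, or None if the line is missing."""
    match = re.search(
        r"^Update interval\s*:\s*([0-9.]+)\s+seconds",
        tracking_text,
        re.MULTILINE,
    )
    return float(match.group(1)) if match else None

def read_tracking() -> str:
    """Run `chronyc tracking` on the local node and return its text output."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return result.stdout

# Made-up sample shaped like chronyc tracking output
sample = (
    "Stratum         : 3\n"
    "Update interval : 1031.2 seconds\n"
    "Leap status     : Normal\n"
)
print(update_interval_seconds(sample))
```

On a node, update_interval_seconds(read_tracking()) could be called from a cron job or a small loop to see how the interval grows before a failure.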

Quote:EDIT: After having physical access I think I've found the issue. I believe it is twofold:
1. Misconfigured switch and gateway. Couldn't prove this one, but I believe it caused a part of the issues I had.
2. Faulty PSU. I use the 5v 15A PSU from the Pine64 store, and on top of making poor contact between the AC cord and the adapter, it is faulty and not able to sustain all modules under a full load. I've asked for a replacement unit.

...

EDIT3: With an ATX power supply the modules are working perfectly, even under heavy load and for sustained amounts of time. However I still have the time drifting issues.

This strongly points a finger at the PSU. The only counterpoint I can add is that my system is not loaded: only one board is using the CPU (<50%), and the other 6 are idle. Not sure what to make of the time drift with the ATX supply, as that does seem to rule out most of what I observed and based my hypothesis on. Even so, I'm going to experiment with chronyd and minimize the time between updates. Such a large unexpected jump could be causing kernel panics and/or issues with kernel modules.

EDIT1: The power supply doesn't* kill networking; the time jump does. I had the serial console connected and was monitoring for this event. Eventually networking died, but the device was still up and responsive. I managed to record some state information, but the device eventually locked up after I attempted to reset the time using the RTC. Here's a copy of the logs.
Code:
Oct  8 13:30:03 pine64so-cluster1 systemd[1]: apt-daily.timer: Adding 10h 39min 34.521783s random time.
Oct  8 13:39:15 pine64so-cluster1 chronyd[9132]: Forward time jump detected!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-udevd.service: Killing process 30217 (systemd-udevd) with signal SIGABRT.
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-logind.service: Killing process 592 (systemd-logind) with signal SIGABRT.
Nov 25 08:39:12 pine64so-cluster1 kernel: [78496.273992] rcu: INFO: rcu_sched self-detected stall on CPU
Nov 25 08:39:12 pine64so-cluster1 kernel: [78496.274017] rcu:   2-...!: (2 GPs behind) idle=206/1/0x4000000000000002 softirq=2784284/2784285 fqs=0


Attempts to reset the time proved pointless and eventually things locked up.

Code:
~# date
Sun Nov 25 10:28:13 UTC 2114

~# hwclock -D -s
hwclock from util-linux 2.29.2
Using the /dev interface to the clock.
Last drift adjustment done at 1568752335 seconds after 1969
Last calibration done at 1568752335 seconds after 1969
Hardware clock is on UTC time
Assuming hardware clock is kept in UTC time.
Waiting for clock tick...
...got clock tick
Time read from Hardware Clock: 2019/10/08 15:26:41
Hw clock time : 2019/10/08 15:26:41 = 1570548401 seconds since 1969
Time since last adjustment is 1796066 seconds
Calculated Hardware Clock drift is 0.000000 seconds
Calling settimeofday:
       tv.tv_sec = 1570548401, tv.tv_usec = 0
       tz.tz_minuteswest = 0
hwclock: settimeofday() failed: Invalid argument
Unable to set system clock.

The same command worked without issue after a restart, even though fake-hwclock is enabled and had restored the future date. In my case, the time jump is the point where everything breaks down, and I think it's not recoverable or preventable in software without a reboot.
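
To put the size of the jump in perspective, here is a rough Python calculation based on the timestamps in the logs above. The syslog lines do not include a year, so 2019 (from the hwclock reading) and 2114 (from the date output) are assumptions. The function name is made up for illustration.

```python
from datetime import datetime

def jump_in_years(before: datetime, after: datetime) -> float:
    """Return the gap between two timestamps in (average Gregorian) years."""
    return (after - before).days / 365.2425

# Last log line before the jump; year assumed from the hwclock reading
before = datetime(2019, 10, 8, 13, 39, 15)
# First kernel log line after the jump; year assumed from the date output
after = datetime(2114, 11, 25, 8, 39, 12)

print(f"jump of roughly {jump_in_years(before, after):.1f} years")
```

So the system clock moved forward by a little over 95 years in one step.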
#8
I only just booted mine after they had been off for a few weeks. All OK except one, which I can't connect to; not sure of the reason yet.
#9
Pull the SD card and check /etc/fake-hwclock.data. It may be restoring a future time, which breaks DHCP.
#10
(10-09-2019, 01:22 PM)venix1 Wrote: Pull the sd and check /etc/fake-hwclock.data .  It may be restoring a future time which breaks DHCP.

This is the output from a faulty node:
Code:
~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01
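
A check like this could be scripted. Below is a minimal Python sketch that parses the saved timestamp (in the layout shown above) and compares it against a reference time you trust, for example the RTC reading. The function name, the tolerance and the reference value are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout as seen in /etc/fake-hwclock.data above
SAVED_FORMAT = "%Y-%m-%d %H:%M:%S"

def saved_time_is_plausible(saved_text: str, reference: datetime,
                            tolerance: timedelta = timedelta(days=1)) -> bool:
    """Return False if the saved time lies further in the future than
    `reference` plus `tolerance`."""
    saved = datetime.strptime(saved_text.strip(), SAVED_FORMAT)
    return saved <= reference + tolerance

# Hypothetical reference time, standing in for a trusted RTC reading
reference = datetime(2019, 10, 9, 13, 22, 0)
print(saved_time_is_plausible("2114-11-30 14:46:01\n", reference))
```

On a node, the text would come from reading /etc/fake-hwclock.data. A False result suggests the file should be corrected before the next boot restores it.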

EDIT: I've made a thread for this issue, as this thread solves a different one: new thread.