Network problems (actually bad power supply)

Network problems (actually bad power supply) - Unkn0wn - 08-26-2019

I have network problems, but they are different from the ones I've read about so far.

Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week only 2 of the 7 modules are still reachable over the network. All modules have a static IP configured natively; I have also tried DHCP with a static IP assigned on the gateway. Here is a ping to an "offline" module:

Code:
C:\WINDOWS\system32>ping 10.10.10.70

Pinging 10.10.10.70 with 32 bytes of data:
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.

Ping statistics for 10.10.10.70:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),

C:\WINDOWS\system32>

I do not know whether the system is actually running. What kind of serial interface should I be using to debug this?
The extra software I am running on the modules is docker (18.06.3~ce~3-0~debian), containerd.io (latest) and kubernetes (1.15.2).
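
To pin down when each module drops off, it may help to log reachability from a Linux host on the same network. This is only a minimal sketch: `check_nodes` and the `PROBE` variable are hypothetical names, the ping flags assume Linux iputils `ping`, and 10.10.10.70 is the only address taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: report which cluster nodes answer a single probe.
# PROBE is an assumption: by default one ping with a 2 second timeout
# (Linux iputils syntax). Override it to use a different check.
PROBE=${PROBE:-ping -c 1 -W 2}

# Print "<host> up" or "<host> DOWN" for every host given as an argument.
check_nodes() {
    for host in "$@"; do
        if $PROBE "$host" >/dev/null 2>&1; then
            echo "$host up"
        else
            echo "$host DOWN"
        fi
    done
}

# Example: check_nodes 10.10.10.70
```

Calling it every few minutes and prepending the output of `date` would give a rough timeline of the order in which modules disappear.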

P.S.
The system date on some of the modules randomly jumps to October 2119 according to the date command, while the hardware clock (hwclock) still shows the correct time. A reboot resolves this. I don't know whether it has anything to do with the network; I just thought it was important enough to mention.

EDIT: After having physical access I think I've found the issue. I believe it is twofold:
1. Misconfigured switch and gateway. Couldn't prove this one, but I believe it caused a part of the issues I had.
2. Faulty PSU. I use the 5V 15A PSU from the Pine64 store. On top of making poor contact between the AC cord and the adapter, it is faulty and cannot sustain all modules under full load. I've asked for a replacement unit.

EDIT2: Ordered an ATX PSU, will see if that goes better.

EDIT3: With an ATX power supply the modules are working perfectly, even under heavy load and for sustained amounts of time. However I still have the time drifting issues.

RE: Network problems - Dreamwalker - 09-11-2019

Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day and all seem to be staying up ok.

RE: Network problems - Unkn0wn - 09-11-2019

(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day and all seem to be staying up ok.

I'll try that, thanks.

RE: Network problems - Dreamwalker - 09-17-2019

Just thought I would update you. None of my nodes has gone down since I made the changes, and after an unexpected switch-off they all came back up as well. Reboots still don't work :S

RE: Network problems - Unkn0wn - 09-17-2019

Thank you for your update, that's really good to hear. I'm still waiting for a new power supply and will update this thread accordingly.

RE: Network problems - Unkn0wn - 09-27-2019

(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day and all seem to be staying up ok.

Have you had any problems since then? I installed 2 AA batteries in the RTC slot, but the time drift still occurs.

Now I've installed chrony and will test if this improves things.

RE: Network problems (actually bad power supply) - venix1 - 10-06-2019

(08-26-2019, 03:00 PM)Unkn0wn Wrote: Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week only 2 of the 7 modules are still reachable over the network. All modules have a static IP configured natively; I have also tried DHCP with a static IP assigned on the gateway.

P.S.
The system date on some of the modules randomly jumps to October 2119 according to the date command, while the hardware clock (hwclock) still shows the correct time. A reboot resolves this. I don't know whether it has anything to do with the network; I just thought it was important enough to mention.

Just want to add that my clusterboard is having the exact same symptoms and is also using the recommended PSU. The date is slightly different but still in the far future: module 1 was at 2210-01-15 09:14:01 and module 2 was at 2210-01-14 03:14:01, so they're surprisingly consistent. Such consistency is not usually chance, so I think the current time probably factors in. For reference, I number them 0-6, with the eMMC-capable module being 0.

To add more to this, the failure is always in the same order. First module 1 goes, followed by module 2 (I haven't let it get further than that). I have since enabled the RTC and chronyd, and the issue became less frequent: I got a couple of days instead of a few hours, but it still occurred, and in the same order. I believe the delayed onset may have been due to chronyd's update interval, which can be seen with chronyc tracking. Tracking this on module 1 showed that it reached the maximum interval before having the problem.
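
The update interval mentioned above can be pulled out of `chronyc tracking` with a small filter. A sketch, assuming the usual `Update interval : N seconds` line in that output; the function name `update_interval` is made up for illustration.

```shell
# Read `chronyc tracking` output on stdin and print the update interval
# in seconds (the number only, without the "seconds" suffix).
update_interval() {
    awk -F' *: *' '/^Update interval/ { print $2 + 0 }'
}

# Example: chronyc tracking | update_interval
```

Logging this value periodically would show whether the interval really sits at its maximum just before a failure. If it does, the `maxpoll` option on the `server`/`pool` lines in chrony.conf is what caps it; whether a shorter interval actually prevents the jump is untested here.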

Quote: EDIT: After having physical access I think I've found the issue. I believe it is twofold:
1. Misconfigured switch and gateway. Couldn't prove this one, but I believe it caused a part of the issues I had.
2. Faulty PSU. I use the 5V 15A PSU from the Pine64 store, and on top of making poor contact between the AC cord and the adapter, it is faulty and not able to sustain all modules under a full load. I've asked for a replacement unit.

...

EDIT3: With an ATX power supply the modules are working perfectly, even under heavy load and for sustained amounts of time. However I still have the time drifting issues.

This strongly points a finger at the PSU. The only counterpoint I can add is that my system is not loaded: only one board is using CPU, at <50%, and the other 6 are idle. I'm not sure what to make of the time drift persisting with the ATX supply, as that does seem to rule out most of what I observed and based my hypothesis on. Even so, I'm going to experiment with chronyd and minimize the time between updates. Such a large unexpected jump could be causing kernel panics and/or issues with kernel modules.

EDIT1: The power supply doesn't* kill networking; the time jump does. I had the serial console connected and was monitoring for this event. Eventually networking died, but the device was still up and responsive. I managed to record some state information, but it eventually locked up after I attempted to reset the time from the RTC. Here's a copy of the logs.

Code:
Oct  8 13:30:03 pine64so-cluster1 systemd[1]: apt-daily.timer: Adding 10h 39min 34.521783s random time.
Oct  8 13:39:15 pine64so-cluster1 chronyd[9132]: Forward time jump detected!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-udevd.service: Killing process 30217 (systemd-udevd) with signal SIGABRT.
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-logind.service: Killing process 592 (systemd-logind) with signal SIGABRT.
Nov 25 08:39:12 pine64so-cluster1 kernel: [78496.273992] rcu: INFO: rcu_sched self-detected stall on CPU
Nov 25 08:39:12 pine64so-cluster1 kernel: [78496.274017] rcu:   2-...!: (2 GPs behind) idle=206/1/0x4000000000000002 softirq=2784284/2784285 fqs=0

Attempts to reset the time proved pointless and eventually things locked up.

Code:
root@pine64so-cluster1:~# date
Sun Nov 25 10:28:13 UTC 2114
root@pine64so-cluster1:~# hwclock -D -s
hwclock from util-linux 2.29.2
Using the /dev interface to the clock.
Last drift adjustment done at 1568752335 seconds after 1969
Last calibration done at 1568752335 seconds after 1969
Hardware clock is on UTC time
Assuming hardware clock is kept in UTC time.
Waiting for clock tick...
...got clock tick
Time read from Hardware Clock: 2019/10/08 15:26:41
Hw clock time : 2019/10/08 15:26:41 = 1570548401 seconds since 1969
Time since last adjustment is 1796066 seconds
Calculated Hardware Clock drift is 0.000000 seconds
Calling settimeofday:
        tv.tv_sec = 1570548401, tv.tv_usec = 0
        tz.tz_minuteswest = 0
hwclock: settimeofday() failed: Invalid argument
Unable to set system clock.

The same command worked without issue after a restart; fake-hwclock is enabled and had restored the future date. In my case, the time jump is the point where everything breaks down, and I don't think it is recoverable or preventable in software without a reboot.
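
As a sanity check on the numbers above, GNU `date` can convert between the timestamps and the epoch values that hwclock prints. A small sketch: `epoch_of` and `jump_seconds` are hypothetical helper names, GNU `date -d` is assumed, and treating the syslog timestamps as UTC with the years 2019 and 2114 is an assumption, since the log lines carry neither a year nor a time zone.

```shell
# Convert a "YYYY-MM-DD HH:MM:SS" timestamp, read as UTC, to seconds
# since the epoch. Requires GNU date for the -d option.
epoch_of() {
    date -u -d "$1" +%s
}

# Print the number of seconds from the first timestamp to the second.
jump_seconds() {
    echo $(( $(epoch_of "$2") - $(epoch_of "$1") ))
}

# The RTC time from the hwclock output:
# epoch_of "2019-10-08 15:26:41"    # prints 1570548401, matching tv_sec
#
# Size of the forward jump between the last sane log line and the first
# line after the jump (years and UTC are assumed, see above):
# jump_seconds "2019-10-08 13:39:15" "2114-11-25 08:39:12"
```

The first call confirms that the value handed to settimeofday() was the correct current time, so the "Invalid argument" error is not caused by a bad RTC reading.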

RE: Network problems (actually bad power supply) - Dreamwalker - 10-09-2019

I only just booted mine after they had been off for a few weeks. All OK except one that I can't connect to; not sure of the reason yet.

RE: Network problems (actually bad power supply) - venix1 - 10-09-2019

Pull the SD card and check /etc/fake-hwclock.data. It may be restoring a future time, which breaks DHCP.
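
A quick way to test a pulled card is to compare the saved stamp against a trusted clock on another machine. A sketch only: `stamp_in_future` is a hypothetical name, GNU `date -d` is assumed, the stamp is assumed to be UTC in `YYYY-MM-DD HH:MM:SS` form, and `/mnt/sd` stands in for wherever the card is mounted.

```shell
# Succeed (exit 0) if the saved stamp is later than the reference time.
#   $1: stamp as stored in fake-hwclock.data, e.g. "2019-10-08 15:26:41"
#   $2: reference time in epoch seconds (optional, defaults to now)
# Returns 2 if the stamp cannot be parsed.
stamp_in_future() {
    saved=$(date -u -d "$1" +%s) || return 2
    ref=${2:-$(date -u +%s)}
    [ "$saved" -gt "$ref" ]
}

# Example (mount point is hypothetical):
# stamp_in_future "$(cat /mnt/sd/etc/fake-hwclock.data)" && echo "stamp is in the future"
```

If the stamp is in the future, overwriting the file with the current UTC time before reinserting the card should stop fake-hwclock from restoring the bad date on the next boot.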

RE: Network problems (actually bad power supply) - Unkn0wn - 10-10-2019

(10-09-2019, 01:22 PM)venix1 Wrote: Pull the sd and check /etc/fake-hwclock.data . It may be restoring a future time which breaks DHCP.

This is the output from a faulty node:

Code:
root@master:~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01

EDIT: I've made a thread for this issue, as this thread solves a different one: new thread.