Network problems (actually bad power supply)
#1
But it's different from the problems I've read about so far.

Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week, only 2 of the 7 modules are still reachable over the network. All modules have a static IP configured locally, though I have also tried DHCP with a static lease on the gateway. Here is a ping to an "off-line" module:


Code:
C:\WINDOWS\system32>ping 10.10.10.70

Pinging 10.10.10.70 with 32 bytes of data:
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.
Reply from 169.254.1.1: Destination host unreachable.

Ping statistics for 10.10.10.70:
   Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),

C:\WINDOWS\system32>

I do not know whether the systems are actually still running. What kind of serial interface should I use to debug this?
The extra software I am running on the modules is Docker (18.06.3~ce~3-0~debian), containerd.io (latest), and Kubernetes (1.15.2).
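
(For reference, attaching to the serial console on these modules usually just takes a 3.3V USB-UART adapter on a module's UART header; 115200 8N1 is the usual rate for Allwinner boards. A minimal sketch, assuming the adapter shows up as /dev/ttyUSB0:)

Code:
# Find the adapter's device node after plugging it in
dmesg | grep tty
# Attach at 115200 baud, 8N1 (picocom or minicom work too)
screen /dev/ttyUSB0 115200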


P.S.
The software date on some of the modules randomly jumps to October 2119 according to the date command, but the hardware clock (hwclock) still shows the correct time. This resolves itself after a reboot, and it is unknown whether it has anything to do with the network issue; I just thought it was worth mentioning.

EDIT: After getting physical access, I think I've found the issue. I believe it is twofold:
1. A misconfigured switch and gateway. I couldn't prove this one, but I believe it caused part of the issues I had.
2. A faulty PSU. I use the 5V 15A PSU from the Pine64 store, and on top of making poor contact between the AC cord and the adapter, it is faulty and unable to sustain all modules under full load. I've asked for a replacement unit.

EDIT2: Ordered an ATX PSU; we'll see if that goes better.

EDIT3: With an ATX power supply the modules work perfectly, even under heavy load and for sustained periods. However, I still have the time-drift issue.
#2
Yeah, the clock is wrong on my nodes as well, and it causes problems with networking and such (certificate errors, basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day, and all seem to be staying up OK.
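
(For anyone following along, a minimal sketch of that chrony setup on Armbian/Debian Buster; these are standard Debian commands, nothing Clusterboard-specific:)

Code:
# Install and enable chrony as the NTP client
sudo apt-get update && sudo apt-get install -y chrony
sudo systemctl enable --now chrony
# Verify it is synchronising
chronyc tracking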
#3
(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah, the clock is wrong on my nodes as well, and it causes problems with networking and such (certificate errors, basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day, and all seem to be staying up OK.

I'll try that, thanks.
#4
Just thought I would update you. I've not had a node go down since I made the changes, and even after an unexpected power-off they all come back up. Reboots still don't work :S
#5
Thank you for your update, that's really good to hear. I'm still waiting for a new power supply and will update this thread accordingly.
#6
(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah, the clock is wrong on my nodes as well, and it causes problems with networking and such (certificate errors, basically).

Have you inserted the 2 batteries for the RTC? I also installed chrony on each of mine to ensure the time stays correct.

I only did this the other day, and all seem to be staying up OK.

Have you had any problems since then? I installed 2 AA batteries in the RTC slot, but the time drift still occurs.

Now I've installed chrony and will test if this improves things.
#7
(08-26-2019, 03:00 PM)Unkn0wn Wrote: Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week, only 2 of the 7 modules are still reachable over the network. All modules have a static IP configured locally, though I have also tried DHCP with a static lease on the gateway. Here is a ping to an "off-line" module:

P.S.
The software date on some of the modules randomly jumps to October 2119 according to the date command, but the hardware clock (hwclock) still shows the correct time. This resolves itself after a reboot, and it is unknown whether it has anything to do with the network issue; I just thought it was worth mentioning.

Just want to add that my Clusterboard is showing the exact same symptoms and is also using the recommended PSU. The date is slightly different but still far in the future: module 1 was 2210-01-15 09:14:01 and module 2 was 2210-01-14 03:14:01, so they're surprisingly consistent. Such consistency is not usually chance, so I think the current time probably factors into the jump. For reference, I name them 0-6, starting with the eMMC-capable module as 0.

To add more to this, the failure always happens in the same order: first module 1 goes, followed by module 2 (I haven't let it get further than that). I have since enabled the RTC and chronyd, and the issue became less frequent; I got a couple of days instead of a few hours, but it still occurred, and in the same order. I believe the delayed onset may have been due to chronyd's update interval, which can be seen with chronyc tracking (examples below). Watching this on module 1 showed it had reached the maximum polling interval before the problem hit.
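
(A minimal sketch of the commands in question, for anyone wanting to watch the interval grow:)

Code:
# Show chronyd's sync state, including the current update interval
chronyc tracking
# Show each time source; the "Poll" column is the polling interval as log2 seconds
chronyc sources -v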

Quote:EDIT: After getting physical access, I think I've found the issue. I believe it is twofold:
1. A misconfigured switch and gateway. I couldn't prove this one, but I believe it caused part of the issues I had.
2. A faulty PSU. I use the 5V 15A PSU from the Pine64 store, and on top of making poor contact between the AC cord and the adapter, it is faulty and unable to sustain all modules under full load. I've asked for a replacement unit.

...

EDIT3: With an ATX power supply the modules work perfectly, even under heavy load and for sustained periods. However, I still have the time-drift issue.

This strongly points a finger at the PSU. The only counterpoint I can add is that my system is not loaded: only one board is using CPU (<50%), and the other 6 are idle. I'm not sure what to make of the time drift persisting with an ATX supply, as that seems to rule out most of what I observed and based my hypothesis on. Even so, I'm going to experiment with chronyd and minimize the time between updates (see the sketch below). Such a large unexpected jump could be causing kernel panics and/or issues with kernel modules.
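
(Something like this in /etc/chrony/chrony.conf is what I have in mind; the pool line is just Debian's default, and the values are illustrative, not tested on the Clusterboard:)

Code:
# Poll every 16-64 s (2^4-2^6) instead of backing off to the 1024 s default
pool 2.debian.pool.ntp.org iburst minpoll 4 maxpoll 6
# Step the clock rather than slewing whenever the offset exceeds 1 second,
# with no limit on how often (relevant for a 95-year jump)
makestep 1 -1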

EDIT1: The power supply doesn't* kill networking; the time jump does. I had the serial console connected and was monitoring for this event. Eventually networking died, but the device was up and responsive. I managed to record some state information, but the module eventually locked up after I attempted to reset the time using the RTC. Here's a copy of the logs:
Code:
Oct  8 13:30:03 pine64so-cluster1 systemd[1]: apt-daily.timer: Adding 10h 39min 34.521783s random time.
Oct  8 13:39:15 pine64so-cluster1 chronyd[9132]: Forward time jump detected!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-udevd.service: Killing process 30217 (systemd-udevd) with signal SIGABRT.
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Oct  8 13:39:15 pine64so-cluster1 systemd[1]: systemd-logind.service: Killing process 592 (systemd-logind) with signal SIGABRT.
Nov 25 08:39:12 pine64so-cluster1 kernel: [78496.273992] rcu: INFO: rcu_sched self-detected stall on CPU
Nov 25 08:39:12 pine64so-cluster1 kernel: [78496.274017] rcu:   2-...!: (2 GPs behind) idle=206/1/0x4000000000000002 softirq=2784284/2784285 fqs=0


Attempts to reset the time proved pointless and eventually things locked up.

Code:
root@pine64so-cluster1:~# date
Sun Nov 25 10:28:13 UTC 2114

root@pine64so-cluster1:~# hwclock -D -s
hwclock from util-linux 2.29.2
Using the /dev interface to the clock.
Last drift adjustment done at 1568752335 seconds after 1969
Last calibration done at 1568752335 seconds after 1969
Hardware clock is on UTC time
Assuming hardware clock is kept in UTC time.
Waiting for clock tick...
...got clock tick
Time read from Hardware Clock: 2019/10/08 15:26:41
Hw clock time : 2019/10/08 15:26:41 = 1570548401 seconds since 1969
Time since last adjustment is 1796066 seconds
Calculated Hardware Clock drift is 0.000000 seconds
Calling settimeofday:
       tv.tv_sec = 1570548401, tv.tv_usec = 0
       tz.tz_minuteswest = 0
hwclock: settimeofday() failed: Invalid argument
Unable to set system clock.

The same command worked without issue after a restart. fake-hwclock is enabled and had restored the future date. In my case, the time jump is the point where everything breaks down, and I don't think it's recoverable or preventable in software without a reboot.
#8
Only just booted mine after it had been off for a few weeks. All OK except one I can't connect to; not sure of the reason yet.
#9
Pull the SD card and check /etc/fake-hwclock.data. It may be restoring a future time, which breaks DHCP.
#10
(10-09-2019, 01:22 PM)venix1 Wrote: Pull the SD card and check /etc/fake-hwclock.data. It may be restoring a future time, which breaks DHCP.

This is the output from a faulty node:
Code:
root@master:~# cat /etc/fake-hwclock.data
2114-11-30 14:46:01
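
(If the saved timestamp is in the future like this, a minimal sketch for clearing it, assuming the battery-backed RTC still holds the correct time; a reboot may be needed first, since settimeofday failed above:)

Code:
sudo hwclock --hctosys     # restore system time from the RTC
sudo fake-hwclock save     # overwrite /etc/fake-hwclock.data with the current time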

EDIT: I've made a thread for this issue, as this thread solves a different one: new thread.

