Network problems (actually bad power supply)
PINE64 forum, Clusterboard (https://forum.pine64.org/forumdisplay.php?fid=91), thread /showthread.php?tid=7907

Network problems (actually bad power supply) - Unkn0wn - 08-26-2019

These are different from the network problems I've read about so far.

Using the latest version of Armbian Buster, all the modules boot and are accessible. However, after a couple of hours the modules become unreachable one by one. After ~1 week only 2 of the 7 modules are resolvable over the network. All modules have a static IP configured natively; I have also tried DHCP with a static IP assigned on the gateway.

Here is a ping to an "off-line" module:

Code:
C:\WINDOWS\system32>ping 10.10.10.70

I do not know whether the system is actually running. What kind of serial interface should I be using to debug this?

The extra software I am running on the modules is docker (18.06.3~ce~3-0~debian), containerd.io (latest) and kubernetes (1.15.2).

P.S. The system date on some of the modules jumps randomly to October 2119 according to the date command, but the hardware clock (hwclock) shows the correct time. This is resolved after a reboot, and it is unknown whether it has anything to do with the network; I just thought it was important enough to mention.

EDIT: After having physical access I think I've found the issue. I believe it is twofold:
1. Misconfigured switch and gateway. I couldn't prove this one, but I believe it caused part of the issues I had.
2. Faulty PSU. I use the 5V 15A PSU from the Pine64 store. On top of making poor contact between the AC cord and the adapter, it is faulty and cannot sustain all modules under full load. I've asked for a replacement unit.

EDIT2: Ordered an ATX PSU, will see if that goes better.

EDIT3: With an ATX power supply the modules are working perfectly, even under heavy load and for sustained periods of time. However, I still have the time drift issue.

RE: Network problems - Dreamwalker - 09-11-2019

Yeah, the clock is wrong on my nodes as well, so it causes problems with networking and such (certificate errors, basically). Have you inserted the 2 batteries for the RTC?
I also installed chrony on each of mine to ensure the time stays correct. I only did this the other day and all seem to be staying up OK.

RE: Network problems - Unkn0wn - 09-11-2019

(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

I'll try that, thanks.

RE: Network problems - Dreamwalker - 09-17-2019

Just thought I would update you. I've not had a node go down since I made the changes, and after an unexpected power-off they all came back up as well. Reboots still don't work :S

RE: Network problems - Unkn0wn - 09-17-2019

Thank you for your update, that's really good to hear. I'm still waiting for a new power supply and will update this thread accordingly.

RE: Network problems - Unkn0wn - 09-27-2019

(09-11-2019, 10:31 AM)Dreamwalker Wrote: Yeah the clock is wrong on my nodes as well so it causes problems with networking and such (certificate errors basically).

Had any problems since then? I installed 2 AA batteries in the RTC slot, but the time drift still occurs. Now I've installed chrony and will test whether this improves things.

RE: Network problems (actually bad power supply) - venix1 - 10-06-2019

(08-26-2019, 03:00 PM)Unkn0wn Wrote: Using the latest version of Armbian Buster all the modules boot and are accessible. However after a couple of hours the modules become unreachable one by one. After ~1 week only 2 of the 7 modules are resolve-able over the network. All modules have a static IP natively, however I have tried with DHCP and a static IP on the gateway. Here a network ping to a "off-line" module:

Just want to add that my clusterboard is showing the exact same symptoms and is also using the recommended PSU. The date is slightly different but still in the far future: module 1 was at 2210-01-15 09:14:01 and module 2 was at 2210-01-14 03:14:01, so they're surprisingly consistent. Such consistency is not usually chance, so I think the current time probably factors in.
For reference, I name them 0-6, with the eMMC-capable module being 0. To add to this, the failure is always in the same order: first module 1 goes, followed by module 2 (I haven't let it get further than that).

I have since enabled the RTC and chronyd, and the issue became less frequent. I got a couple of days instead of a few hours, but it still occurred, and in the same order. I believe the delayed start may have been due to the update interval of chronyd. This can be seen with chronyc tracking. Tracking this on module 1 showed that it reached the maximum interval before having the problem.

Quote: EDIT: After having physical access I think I've found the issue. I believe it is twofold:

This strongly points a finger at the PSU. The only counterpoint I can add is that my system is not loaded. Only one board is using CPU at <50%, and the other 6 are idle. I'm not sure what to make of the time drift persisting with the ATX supply, as that does seem to rule out most of what I observed and based my hypothesis on. Even so, I'm going to experiment with chronyd and minimize the time between updates. Such a large unexpected jump could be causing kernel panics and/or issues with kernel modules.

EDIT1: The power supply doesn't* kill networking; the time jump does. I had the serial console connected and was monitoring for this event. Eventually networking died, but the device was up and responsive. I managed to record some state information, but it eventually locked up after I attempted to reset the time using the RTC. Here's a copy of the logs.

Code:
Oct 8 13:30:03 pine64so-cluster1 systemd[1]: apt-daily.timer: Adding 10h 39min 34.521783s random time.

Attempts to reset the time proved pointless and eventually things locked up.

Code:
root@pine64so-cluster1:~# date

The same command worked without issue after a restart; fake-hwclock is enabled and restored the future date. In my case, the time jump is the point at which everything breaks down, and I think it's not recoverable or preventable in software without a reboot.
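
The symptom reported above (the system clock jumping far into the future while the RTC stays correct) can be checked for mechanically. The following is a minimal Python sketch, not something any poster in this thread ran. The name `is_time_jump` and the one-day threshold are hypothetical, chosen only for illustration, and the two timestamps are assumed to have been read from `date` and `hwclock` respectively.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: more than a day of disagreement counts as a jump.
MAX_SKEW = timedelta(days=1)


def is_time_jump(system_time: datetime, rtc_time: datetime,
                 max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True when the system clock and the RTC differ by more than max_skew."""
    return abs(system_time - rtc_time) > max_skew


# Example values loosely modelled on the thread: RTC in 2019, system clock in 2210.
rtc = datetime(2019, 10, 8, 13, 30, 3)
jumped = datetime(2210, 1, 15, 9, 14, 1)

print(is_time_jump(jumped, rtc))                      # far-future system clock
print(is_time_jump(rtc + timedelta(seconds=2), rtc))  # ordinary small drift
```

A check like this only detects the jump. As noted above, recovering from it in software did not work on the affected module.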

RE: Network problems (actually bad power supply) - Dreamwalker - 10-09-2019

Only just booted mine after it had been off for a few weeks. All OK except one that I can't connect to; not sure of the reason yet.

RE: Network problems (actually bad power supply) - venix1 - 10-09-2019

Pull the SD card and check /etc/fake-hwclock.data. It may be restoring a future time, which breaks DHCP.

RE: Network problems (actually bad power supply) - Unkn0wn - 10-10-2019

(10-09-2019, 01:22 PM)venix1 Wrote: Pull the sd and check /etc/fake-hwclock.data . It may be restoring a future time which breaks DHCP.

This is the output from a faulty node:

Code:
root@master:~# cat /etc/fake-hwclock.data

EDIT: I've made a thread for this issue, as this thread solves a different one: new thread.
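
The check venix1 suggests, whether /etc/fake-hwclock.data holds a time in the future, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a tested tool. It assumes the file contains a single UTC timestamp in the form `YYYY-MM-DD HH:MM:SS` (the same shape as the dates quoted earlier in the thread). The function name `saved_time_is_future` is hypothetical, and the sample strings are made up for the example, not taken from any node.

```python
from datetime import datetime, timezone

# Assumed file format: one UTC timestamp, e.g. "2019-10-09 13:22:00".
FORMAT = "%Y-%m-%d %H:%M:%S"


def saved_time_is_future(data: str, now: datetime) -> bool:
    """Return True if the timestamp stored by fake-hwclock is later than `now`."""
    saved = datetime.strptime(data.strip(), FORMAT).replace(tzinfo=timezone.utc)
    return saved > now


# `now` would normally come from a machine with a trustworthy clock.
now = datetime(2019, 10, 10, tzinfo=timezone.utc)

print(saved_time_is_future("2210-01-15 09:14:01\n", now))  # future date, suspicious
print(saved_time_is_future("2019-10-09 13:22:00\n", now))  # plausible saved time
```

On a real card the string would be read from the mounted SD card's /etc/fake-hwclock.data on another machine, since the affected node's own clock cannot be trusted for the comparison.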