10-14-2019, 08:20 AM
Some additional notes, and a possible fix that doesn't involve an ATX power supply. During the last week, I ran module 1 with a cron job that resets the system clock from the RTC by running /sbin/hwclock -s at 5 minute intervals. The module would usually go down within 24 hours, but it has not. Instead, module 2 has been the first and only one to go down.
The systems also run chronyd to maintain time, and it is set to synchronize with the RTC every 11 minutes. That by itself was not sufficient. After adding the cron job to module 2, both are now at 2 days of uptime, so it has had a positive effect. Five minutes is arbitrary; I tried 1 minute, but it confused chronyd. Five minutes has the effect of keeping the "Update Interval" at 60 seconds, so an internal time server is probably recommended with this to avoid frequent polling of external servers.
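For reference, the cron setup described above might look something like this (a sketch: it assumes a root crontab, hwclock at /sbin/hwclock, and that chronyd's 11-minute RTC sync comes from the rtcsync directive in chrony.conf):

```shell
# Root crontab entry (edit with: crontab -e as root).
# Resets the system clock from the hardware RTC every 5 minutes,
# matching the workaround described above.
*/5 * * * * /sbin/hwclock -s

# In /etc/chrony.conf, the periodic RTC synchronization is typically
# enabled by this directive (kernel 11-minute mode):
#   rtcsync
```

Note that hwclock -s writes RTC time into the system clock, while chrony's rtcsync works in the other direction (system clock into the RTC), so the two shouldn't fight each other.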
Even with this, I still get the following errors, which may be related to the underlying issue. Those messages have appeared only on modules 1 and 2, which so far have been the only devices to exhibit time jumps and network outages.
Code:
[Mon Oct 14 08:00:36 2019] rcu: INFO: rcu_sched self-detected stall on CPU
[Mon Oct 14 08:00:36 2019] rcu: 1-...!: (102 GPs behind) idle=23e/0/0x1 softirq=5005523/5005524 fqs=12
[Mon Oct 14 08:00:36 2019] rcu: (t=259866 jiffies g=12543365 q=28)
[Mon Oct 14 08:00:36 2019] rcu: rcu_sched kthread starved for 259842 jiffies! g12543365 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[Mon Oct 14 08:00:36 2019] rcu: RCU grace-period kthread stack dump:
[Mon Oct 14 08:00:36 2019] rcu_sched I 0 10 2 0x00000028
[Mon Oct 14 08:00:36 2019] Call trace:
[Mon Oct 14 08:00:36 2019] __switch_to+0x94/0xd8
[Mon Oct 14 08:00:36 2019] __schedule+0x1e8/0x640
[Mon Oct 14 08:00:36 2019] schedule+0x24/0x80
[Mon Oct 14 08:00:36 2019] schedule_timeout+0x90/0x398
[Mon Oct 14 08:00:36 2019] rcu_gp_kthread+0x550/0x8f8
[Mon Oct 14 08:00:36 2019] kthread+0x128/0x130
[Mon Oct 14 08:00:36 2019] ret_from_fork+0x10/0x1c
[Mon Oct 14 08:00:36 2019] Task dump for CPU 1:
[Mon Oct 14 08:00:36 2019] swapper/1 R running task 0 0 1 0x0000002a
[Mon Oct 14 08:00:36 2019] Call trace:
[Mon Oct 14 08:00:36 2019] dump_backtrace+0x0/0x1a0
[Mon Oct 14 08:00:36 2019] show_stack+0x14/0x20
[Mon Oct 14 08:00:36 2019] sched_show_task+0x160/0x198
[Mon Oct 14 08:00:36 2019] dump_cpu_task+0x40/0x50
[Mon Oct 14 08:00:36 2019] rcu_dump_cpu_stacks+0xc0/0x100
[Mon Oct 14 08:00:36 2019] rcu_check_callbacks+0x594/0x780
[Mon Oct 14 08:00:36 2019] update_process_times+0x2c/0x58
[Mon Oct 14 08:00:36 2019] tick_sched_handle.isra.5+0x30/0x48
[Mon Oct 14 08:00:36 2019] tick_sched_timer+0x48/0x98
[Mon Oct 14 08:00:36 2019] __hrtimer_run_queues+0xe4/0x1f8
[Mon Oct 14 08:00:36 2019] hrtimer_interrupt+0xf4/0x2b0
[Mon Oct 14 08:00:36 2019] arch_timer_handler_phys+0x28/0x40
[Mon Oct 14 08:00:36 2019] handle_percpu_devid_irq+0x80/0x138
[Mon Oct 14 08:00:36 2019] generic_handle_irq+0x24/0x38
[Mon Oct 14 08:00:36 2019] __handle_domain_irq+0x5c/0xb0
[Mon Oct 14 08:00:36 2019] gic_handle_irq+0x58/0xa8
[Mon Oct 14 08:00:36 2019] el1_irq+0xb0/0x140
[Mon Oct 14 08:00:36 2019] arch_cpu_idle+0x10/0x18
[Mon Oct 14 08:00:36 2019] do_idle+0x1d4/0x298
[Mon Oct 14 08:00:36 2019] cpu_startup_entry+0x24/0x28
[Mon Oct 14 08:00:36 2019] secondary_start_kernel+0x18c/0x1c8