10-14-2019, 08:20 AM
Some additional notes, and a possible fix that doesn't involve an ATX power supply. During the last week, I ran module 1 with a cron job that resets the system clock from the RTC by running /sbin/hwclock -s at 5 minute intervals. The module would usually go down within 24 hours, but it has not. Instead, module 2 has been the first and only one to go down.
The systems also run chronyd to maintain time, and it is set to synchronize with the RTC every 11 minutes. That by itself was not sufficient. After adding the cron job to module 2, both are now at 2 days of uptime, so it has had a positive effect. Five minutes is arbitrary; I tried 1 minute, but it confused chronyd. Five minutes has the effect of keeping the "Update Interval" at 60 seconds, so an internal time server is probably recommended with this to avoid frequent polling of external servers.
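For reference, the cron setup described above might look something like this (a sketch: it assumes a root crontab, hwclock at /sbin/hwclock, and that chronyd's 11-minute RTC sync comes from the rtcsync directive in chrony.conf):

```shell
# Root crontab entry (edit with: crontab -e as root).
# Resets the system clock from the hardware RTC every 5 minutes,
# matching the workaround described above.
*/5 * * * * /sbin/hwclock -s

# In /etc/chrony.conf, the periodic RTC synchronization is typically
# enabled by this directive (kernel 11-minute mode):
#   rtcsync
```

Note that hwclock -s writes RTC time into the system clock, while chrony's rtcsync works in the other direction (system clock into the RTC), so the two shouldn't fight each other.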
Even with this, I still get the following errors, which may be related to the underlying issue. Those messages have appeared only on modules 1 and 2, which so far have been the only devices to exhibit time jumps and network outages.
Code:
[Mon Oct 14 08:00:36 2019] rcu: INFO: rcu_sched self-detected stall on CPU
[Mon Oct 14 08:00:36 2019] rcu: 1-...!: (102 GPs behind) idle=23e/0/0x1 softirq=5005523/5005524 fqs=12
[Mon Oct 14 08:00:36 2019] rcu: (t=259866 jiffies g=12543365 q=28)
[Mon Oct 14 08:00:36 2019] rcu: rcu_sched kthread starved for 259842 jiffies! g12543365 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[Mon Oct 14 08:00:36 2019] rcu: RCU grace-period kthread stack dump:
[Mon Oct 14 08:00:36 2019] rcu_sched I 0 10 2 0x00000028
[Mon Oct 14 08:00:36 2019] Call trace:
[Mon Oct 14 08:00:36 2019] __switch_to+0x94/0xd8
[Mon Oct 14 08:00:36 2019] __schedule+0x1e8/0x640
[Mon Oct 14 08:00:36 2019] schedule+0x24/0x80
[Mon Oct 14 08:00:36 2019] schedule_timeout+0x90/0x398
[Mon Oct 14 08:00:36 2019] rcu_gp_kthread+0x550/0x8f8
[Mon Oct 14 08:00:36 2019] kthread+0x128/0x130
[Mon Oct 14 08:00:36 2019] ret_from_fork+0x10/0x1c
[Mon Oct 14 08:00:36 2019] Task dump for CPU 1:
[Mon Oct 14 08:00:36 2019] swapper/1 R running task 0 0 1 0x0000002a
[Mon Oct 14 08:00:36 2019] Call trace:
[Mon Oct 14 08:00:36 2019] dump_backtrace+0x0/0x1a0
[Mon Oct 14 08:00:36 2019] show_stack+0x14/0x20
[Mon Oct 14 08:00:36 2019] sched_show_task+0x160/0x198
[Mon Oct 14 08:00:36 2019] dump_cpu_task+0x40/0x50
[Mon Oct 14 08:00:36 2019] rcu_dump_cpu_stacks+0xc0/0x100
[Mon Oct 14 08:00:36 2019] rcu_check_callbacks+0x594/0x780
[Mon Oct 14 08:00:36 2019] update_process_times+0x2c/0x58
[Mon Oct 14 08:00:36 2019] tick_sched_handle.isra.5+0x30/0x48
[Mon Oct 14 08:00:36 2019] tick_sched_timer+0x48/0x98
[Mon Oct 14 08:00:36 2019] __hrtimer_run_queues+0xe4/0x1f8
[Mon Oct 14 08:00:36 2019] hrtimer_interrupt+0xf4/0x2b0
[Mon Oct 14 08:00:36 2019] arch_timer_handler_phys+0x28/0x40
[Mon Oct 14 08:00:36 2019] handle_percpu_devid_irq+0x80/0x138
[Mon Oct 14 08:00:36 2019] generic_handle_irq+0x24/0x38
[Mon Oct 14 08:00:36 2019] __handle_domain_irq+0x5c/0xb0
[Mon Oct 14 08:00:36 2019] gic_handle_irq+0x58/0xa8
[Mon Oct 14 08:00:36 2019] el1_irq+0xb0/0x140
[Mon Oct 14 08:00:36 2019] arch_cpu_idle+0x10/0x18
[Mon Oct 14 08:00:36 2019] do_idle+0x1d4/0x298
[Mon Oct 14 08:00:36 2019] cpu_startup_entry+0x24/0x28
[Mon Oct 14 08:00:36 2019] secondary_start_kernel+0x18c/0x1c8