Network problems (actually bad power supply)
#11
Some additional notes, and a possible fix that doesn't involve an ATX power supply.  For the last week, I ran module 1 with a cron job that resets the system clock from the RTC by running /sbin/hwclock -s every 5 minutes.  That module would usually go down within 24 hours, but it has not.  Instead, module 2 was the first and only one to go down.

The systems also run chronyd to maintain time, and it is set to synchronize with the RTC every 11 minutes.  This by itself was not sufficient.  After adding the cron job to module 2 as well, both are at 2 days of uptime, so it has had a positive effect.  Five minutes is arbitrary; I tried 1 minute, but it confused chronyd.  Five minutes has the effect of keeping the "Update Interval" at 60 seconds, so an internal time server is probably advisable with this setup to avoid frequent polling of external servers.

Even with this, I still get the following errors, which may be related to the underlying issue.  These messages have only appeared on modules 1 and 2, which so far have been the only devices to exhibit time jumps and network outages.
Code:
[Mon Oct 14 08:00:36 2019] rcu: INFO: rcu_sched self-detected stall on CPU
[Mon Oct 14 08:00:36 2019] rcu:         1-...!: (102 GPs behind) idle=23e/0/0x1 softirq=5005523/5005524 fqs=12 
[Mon Oct 14 08:00:36 2019] rcu:          (t=259866 jiffies g=12543365 q=28)
[Mon Oct 14 08:00:36 2019] rcu: rcu_sched kthread starved for 259842 jiffies! g12543365 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[Mon Oct 14 08:00:36 2019] rcu: RCU grace-period kthread stack dump:
[Mon Oct 14 08:00:36 2019] rcu_sched       I    0    10      2 0x00000028
[Mon Oct 14 08:00:36 2019] Call trace:
[Mon Oct 14 08:00:36 2019]  __switch_to+0x94/0xd8
[Mon Oct 14 08:00:36 2019]  __schedule+0x1e8/0x640
[Mon Oct 14 08:00:36 2019]  schedule+0x24/0x80
[Mon Oct 14 08:00:36 2019]  schedule_timeout+0x90/0x398
[Mon Oct 14 08:00:36 2019]  rcu_gp_kthread+0x550/0x8f8
[Mon Oct 14 08:00:36 2019]  kthread+0x128/0x130
[Mon Oct 14 08:00:36 2019]  ret_from_fork+0x10/0x1c
[Mon Oct 14 08:00:36 2019] Task dump for CPU 1:
[Mon Oct 14 08:00:36 2019] swapper/1       R  running task        0     0      1 0x0000002a
[Mon Oct 14 08:00:36 2019] Call trace:
[Mon Oct 14 08:00:36 2019]  dump_backtrace+0x0/0x1a0
[Mon Oct 14 08:00:36 2019]  show_stack+0x14/0x20
[Mon Oct 14 08:00:36 2019]  sched_show_task+0x160/0x198
[Mon Oct 14 08:00:36 2019]  dump_cpu_task+0x40/0x50
[Mon Oct 14 08:00:36 2019]  rcu_dump_cpu_stacks+0xc0/0x100
[Mon Oct 14 08:00:36 2019]  rcu_check_callbacks+0x594/0x780
[Mon Oct 14 08:00:36 2019]  update_process_times+0x2c/0x58
[Mon Oct 14 08:00:36 2019]  tick_sched_handle.isra.5+0x30/0x48
[Mon Oct 14 08:00:36 2019]  tick_sched_timer+0x48/0x98
[Mon Oct 14 08:00:36 2019]  __hrtimer_run_queues+0xe4/0x1f8
[Mon Oct 14 08:00:36 2019]  hrtimer_interrupt+0xf4/0x2b0
[Mon Oct 14 08:00:36 2019]  arch_timer_handler_phys+0x28/0x40
[Mon Oct 14 08:00:36 2019]  handle_percpu_devid_irq+0x80/0x138
[Mon Oct 14 08:00:36 2019]  generic_handle_irq+0x24/0x38
[Mon Oct 14 08:00:36 2019]  __handle_domain_irq+0x5c/0xb0
[Mon Oct 14 08:00:36 2019]  gic_handle_irq+0x58/0xa8
[Mon Oct 14 08:00:36 2019]  el1_irq+0xb0/0x140
[Mon Oct 14 08:00:36 2019]  arch_cpu_idle+0x10/0x18
[Mon Oct 14 08:00:36 2019]  do_idle+0x1d4/0x298
[Mon Oct 14 08:00:36 2019]  cpu_startup_entry+0x24/0x28
[Mon Oct 14 08:00:36 2019]  secondary_start_kernel+0x18c/0x1c8
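
For a sense of scale, the "starved for N jiffies" figure in the log above can be converted to seconds. This is only a minimal Python sketch: the kernel tick rate (CONFIG_HZ) is not stated anywhere in this thread, so the values 100, 250 and 1000 below are assumptions, and the function names are made up for illustration.

```python
import re

# Matches the "kthread starved for N jiffies" message shown in the log above.
STARVED_RE = re.compile(r"rcu_sched kthread starved for (\d+) jiffies")


def starved_jiffies(dmesg_text):
    """Return the jiffies count from every 'kthread starved' line found."""
    return [int(m.group(1)) for m in STARVED_RE.finditer(dmesg_text)]


def jiffies_to_seconds(jiffies, hz):
    """Convert jiffies to seconds; hz is the kernel tick rate (assumed, not from the log)."""
    return jiffies / hz


if __name__ == "__main__":
    line = ("[Mon Oct 14 08:00:36 2019] rcu: rcu_sched kthread starved for "
            "259842 jiffies! g12543365 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3")
    for count in starved_jiffies(line):
        # Common CONFIG_HZ values; check your own kernel config for the real one.
        for hz in (100, 250, 1000):
            print(f"{count} jiffies at HZ={hz}: {jiffies_to_seconds(count, hz):.0f} s")
```

Whichever of those tick rates applies, the reported stall is on the order of minutes rather than seconds. Since these modules also show time jumps, the count itself may be inflated by a jump, so treat it as a hint rather than a measurement.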
#12
(10-14-2019, 08:20 AM)venix1 Wrote: Some additional notes, and a possible fix that doesn't involve an ATX power supply.  During the last week, I ran module 1 with a cron to reset the system clock from RTC by running /sbin/hwclock -s at 5 minute intervals.  The module would usually go down in 24 hours but has not.  Instead module 2 is the first and only one to go down.


Thanks for your help. Would you say running hwclock -s is a reliable temporary solution?
Those kernel messages are interesting though.

Also, would everyone mind continuing in the other thread, as the original problem of this one has been fixed? :)  (link)
#13
(10-17-2019, 06:30 AM)Unkn0wn Wrote: Also, would everyone mind continuing in the other thread, as the original problem of this one has been fixed? :)  (link)

Sure, I've put your answer over there.  However, in my case the time jump results in a network outage, so I believe both issues are symptoms of the same underlying root problem.
#14
Hello all,
I am having network timeouts to all SoCs of the clusterboard.
After reading the posts regarding the PSU, I removed two SoCs and it is now running with 5, but the problem still persists.
I no longer have the time drift problem (date set to a far future) since installing the recommended kernel package from the community [linux-image-next-sunxi64_5.99_arm64.deb].

I rely heavily on the network of the clusterboard and this timeout issue makes the clusterboard quite useless for me.
Does anyone have a solid idea of what the problem could be?

Sample of the timeouts:


Code:
64 bytes from 10.13.3.74: icmp_seq=600 ttl=64 time=2.118 ms
Request timeout for icmp_seq 601
Request timeout for icmp_seq 602
Request timeout for icmp_seq 603
Request timeout for icmp_seq 604
64 bytes from 10.13.3.74: icmp_seq=605 ttl=64 time=1.943 ms
64 bytes from 10.13.3.74: icmp_seq=606 ttl=64 time=2.017 ms

thanks

y
#15
(11-26-2019, 12:48 PM)ykanello Wrote: I am having network timeouts to all SoCs of the clusterboard.

I have never noticed this. However, due to the other issues the board doesn't see much network traffic. I would suggest checking bandwidth utilization and pinging between the modules. They only have 100Mbit interfaces and saturation causes packet loss.
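
One rough way to compare modules is to save the ping output for each one and summarize it. Here is a minimal Python sketch. It assumes the output looks like the sample posted above: the "Request timeout for icmp_seq" wording is taken from that sample, and other ping implementations print timeouts differently or not at all. The function name is made up for illustration.

```python
import re

# Patterns based on the ping sample in this thread; other ping versions may differ.
REPLY_RE = re.compile(r"bytes from .*icmp_seq=(\d+)")
TIMEOUT_RE = re.compile(r"Request timeout for icmp_seq (\d+)")


def summarize_ping(output):
    """Count replies and timeouts, and find the longest run of consecutive timeouts."""
    replies = timeouts = run = longest_run = 0
    for line in output.splitlines():
        if REPLY_RE.search(line):
            replies += 1
            run = 0  # a reply ends the current outage
        elif TIMEOUT_RE.search(line):
            timeouts += 1
            run += 1
            longest_run = max(longest_run, run)
    total = replies + timeouts
    loss_pct = 100.0 * timeouts / total if total else 0.0
    return {"replies": replies, "timeouts": timeouts,
            "longest_outage": longest_run, "loss_pct": loss_pct}


if __name__ == "__main__":
    sample = """64 bytes from 10.13.3.74: icmp_seq=600 ttl=64 time=2.118 ms
Request timeout for icmp_seq 601
Request timeout for icmp_seq 602
Request timeout for icmp_seq 603
Request timeout for icmp_seq 604
64 bytes from 10.13.3.74: icmp_seq=605 ttl=64 time=1.943 ms
64 bytes from 10.13.3.74: icmp_seq=606 ttl=64 time=2.017 ms"""
    print(summarize_ping(sample))
```

If the longest outage is similar on every module at the same moment, that points more toward the switch or power than toward a single SoC. Outages that only show up under load would fit the saturation explanation better.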

(11-26-2019, 12:48 PM)ykanello Wrote: After reading the posts regarding the PSU, I removed two SoCs and it is now running with 5, but the problem still persists.
My problem SoCs produce these errors in dmesg. Check whether any of yours show them. You may also want to try using an ATX power supply.
Code:
[15507.597377] rcu: rcu_sched kthread starved for 15755 jiffies! g1064201 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[15507.597380] rcu: RCU grace-period kthread stack dump:
[15507.597384] rcu_sched       I    0    10      2 0x00000028

(11-26-2019, 12:48 PM)ykanello Wrote: I do not have the time drift problem (date set to a far future) anymore after I installed the recommended kernel package from the community. [linux-image-next-sunxi64_5.99_arm64.deb]
I recently moved to linux-image-current-sunxi64_5.3.9 and the time jumps have ceased. However, now the Ethernet device will shut off.

(11-26-2019, 12:48 PM)ykanello Wrote: I rely heavily on the network of the clusterboard and this timeout issue makes the clusterboard quite useless for me.
Does anyone have a solid idea of what the problem could be?
Not that my research has shown. You could try adding some fans to the case and heatsinks to cool the chips. Additional cooling and an ATX power supply are on my to-do list. The clusterboard chips show around 90F when ambient is around 60F.

