RockPro64 has bad Memory (Software segfaults and kernel panics)
#11
(11-19-2020, 12:27 PM)LMM Wrote:
(11-19-2020, 12:36 AM)wildering Wrote: I cracked open the two other boards and ran memtest on them with no errors reported. I then transplanted that eMMC module onto the date code 5219 board and ran the test again. I was presented with a slew of errors. It's evident that this ROCKPro64 (v2.1, 2018-07-02 5219) is also defective and will warrant an RMA.

I ran memtest with Debian and it seems OK (v2.1, 2018-07-02). What is noticeable is the high temperature reached (70°C) in spite of a heatsink. Then I put it over a fan and it dropped below 50°C after 9 min.

I don't know if it is good practice (or a good idea), but I cut the conductive pad so that I could double the layer on the DDR chip and make contact with the heatsink. Otherwise it probably does not make contact.

I don’t see how doubling the conductive pad would hurt, but I imagine thermal conductivity would be impacted. It shouldn’t be necessary though, as I’ve found a single layer makes good contact with the heat sink. Plus in my case I didn’t have the chance to do anything other than run apt update, which shouldn’t cause a temperature spike at all. Both other boards also run at a cool 45-50 degrees under moderate load with the tall heat sink and no fan.
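
For comparing boards, temperatures can be read from sysfs instead of estimated. A minimal sketch, assuming the kernel exposes the SoC sensor as `/sys/class/thermal/thermal_zone0/temp` in millidegrees Celsius; the zone index is an assumption and may differ per image, and the helper names are made up for illustration:

```python
from pathlib import Path

# Assumed path: zone 0 is often the CPU sensor, but check the "type" file
# next to it on your own image before trusting the number.
DEFAULT_ZONE = "/sys/class/thermal/thermal_zone0/temp"


def parse_temp(raw):
    """Convert the sysfs value (millidegrees Celsius, as text) to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_temp_c(path=DEFAULT_ZONE):
    """Read one temperature sample from a thermal zone file."""
    return parse_temp(Path(path).read_text())
```

Calling `read_temp_c()` once before and once during a memtester run gives two numbers that can be compared directly between boards.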

The most likely cause of the segmentation faults we’ve been experiencing is chips that were defective from the manufacturer or were damaged during product assembly.
#12
I started the RMA process using the Pine64 support portal at: https://support.pine64.org/ and submitting a ticket. Started it just after my last post was made and 4 hours later got a reply to start the process. Probably the quickest RMA reply I’ve ever had, 10/10.
#13
(11-19-2020, 06:53 PM)wildering Wrote:
I don’t see how doubling the conductive pad would hurt, but I imagine thermal conductivity would be impacted. It shouldn’t be necessary though, as I’ve found a single layer makes good contact with the heat sink. Plus in my case I didn’t have the chance to do anything other than run apt update, which shouldn’t cause a temperature spike at all. Both other boards also run at a cool 45-50 degrees under moderate load with the tall heat sink and no fan.

The most likely cause of the segmentation faults we’ve been experiencing is chips that were defective from the manufacturer or were damaged during product assembly.

I doubled the pad on the DDR chips because they sit lower than the processor and the heatsink covers both.
You're right, running apt should not hurt!
#14
I'm having a very similar experience with a RockPro64 4GB that I received a few weeks ago. I'm trying to decide whether to RMA the board or not. I'm not sure how to check the hardware version to compare with yours.

Initially I started with Armbian_20.08.1_Rockpro64_focal_current_5.8.6_desktop.img.xz from https://www.armbian.com/rockpro64/ and the desktop came up. But "apt update" was hitting segmentation faults and then showing what looks like memory corruption (non-ASCII characters in dependency names, etc.).

Thinking that the desktop put more stress on the system, I then flashed Armbian_20.08.1_Rockpro64_focal_current_5.8.6.img.xz and booted. In this configuration, I was able to apt update, install, and run the memtester:
memtester 3G 1

It reported 549 errors similar to:
FAILURE: 0x00000000 != 0xa0000000000 at offset 0x47f44688.

Out of the reported errors, 520 have only 1 or 2 bits incorrect. This leads me to think that it's a hardware problem, since software typically overwrites whole bytes.
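
The bit counting can be reproduced from a saved memtester log by XOR-ing the two values on each FAILURE line. A minimal sketch, assuming the log lines have the same shape as the one quoted above; the function names are made up for illustration:

```python
import re
from collections import Counter

# Matches lines shaped like the FAILURE line quoted above.
FAILURE_RE = re.compile(
    r"FAILURE: 0x([0-9a-fA-F]+) != 0x([0-9a-fA-F]+) at offset 0x([0-9a-fA-F]+)"
)


def flipped_bits(line):
    """Return the bit positions that differ in one FAILURE line, or None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    left = int(match.group(1), 16)
    right = int(match.group(2), 16)
    diff = left ^ right  # a set bit marks a position where the values disagree
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def summarize(lines):
    """Count failures by number of flipped bits and by bit position."""
    by_count = Counter()
    by_position = Counter()
    for line in lines:
        bits = flipped_bits(line)
        if bits is None:
            continue  # not a FAILURE line
        by_count[len(bits)] += 1
        by_position.update(bits)
    return by_count, by_position
```

For the line quoted above, `flipped_bits` reports two differing bits (positions 41 and 43). If the same few positions dominate `by_position` across the whole log, that would point toward a stuck data line or cell rather than software overwriting memory.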

But I'm still considering the "Older firmware overwrites actively used memory" issue noted at https://wiki.pine64.org/wiki/ROCKPro64#H...ility_Page which was mentioned earlier in this thread.

From the instructions there and additional details at https://forum.pine64.org/showthread.php?tid=8174 I tried to build the bootloader and add it to the SD card with the Ubuntu Focal server image, but the device didn't boot. I ordered the necessary hardware to debug with the serial console, but it hasn't arrived yet.

If anyone knows for certain that a particular image doesn't have the "blob" firmware which can overwrite memory, I'd love to flash it and run memtester so that I could determine whether to RMA the board.
#15
Hi!

I'm booting Debian stable, based on the official Debian unstable installer image. It does not include Rockchip binary blobs in u-boot. I roughly outlined my approach here.

You could also try the u-boot version made by sigmaris, which is also built from mainline u-boot (without Rockchip blobs). Read the first post he made; there are links to eMMC/SD card versions you could try, so there's no need to flash SPI to test this out.

On my system I don't see any errors with memtest.
#16
(11-25-2020, 04:42 AM)n4tter4ngell Wrote: I'm booting Debian stable, based on the official Debian unstable installer image. It does not include Rockchip binary blobs in u-boot. I roughly outlined my approach here.

Thanks for this idea, I gather that you're following this:
https://www.kulesz.me/post/140-debian-de...4-install/

I found your comment detailing the differences in your procedure:
http://forum.pine64.org/showthread.php?t...1#pid82701

This looked promising, but fairly involved, so I decided to try your second option.

(11-25-2020, 04:42 AM)n4tter4ngell Wrote: You could also try the u-boot version made by sigmaris, which is also built from mainline u-boot (without Rockchip blobs). Read the first post he made; there are links to eMMC/SD card versions you could try, so there's no need to flash SPI to test this out.

Thanks for the pointer to this; it seems like sigmaris has been doing some great work. I used dd to install mmc_idbloader.img and mmc_u-boot.itb from https://github.com/sigmaris/u-boot/relea...ckpro64-ci to my SD card. It was really nice to see output from the bootloader on the display, and my Armbian/Debian server image booted fine with this bootloader.

In less than 5 min memtester has already produced many errors. Since this test has eliminated the blob overwriting memory as a potential cause, I'll proceed with the RMA process.

Thanks for your help!
#17
Well, another issue here could be bad DRAM clocks. As far as I know, there is a blob in u-boot (or is it open?) that does the RAM training. If this does not work out correctly, your RP64 might run with wrong timings, resulting in these errors. To quote some dated u-boot docs:

Quote: The most frequent cause of problems when porting U-Boot to new
hardware, or when using a sloppy port on some board, is memory errors.
In most cases these are not caused by failing hardware, but by
incorrect initialization of the memory controller.  So it appears to
be a good idea to always test if the memory is working correctly,
before looking for any other potential causes of any problems.

Source: https://gitlab.denx.de/u-boot/u-boot/-/b...emory-test

I am not sure how to check the current RAM clock settings, but it might be worth comparing them between your boards.

That being said, it's common for DRAMs to become defunct over time. If you are out of warranty, replacing the DRAMs is not easy as they are soldered. However, you could try to map out the damaged areas using badmem as described here:

https://web.archive.org/web/201408061750...mory-howto

Also, it would be better to run a memory test directly from u-boot without booting a whole OS like Linux, as you could test much more memory this way. It seems like there is an option in u-boot for doing this (I haven't tried it, but it sounds interesting). Again, see this document in u-boot:

https://gitlab.denx.de/u-boot/u-boot/-/b...emory-test
#18
Hello.

I would like to know whether the Pine64 team knows about this issue.
I have 2 RockPro64 boards. One is working fine. The other has memory corruption. I bought both boards maybe 2 years ago.
My 2nd board never worked. I would like to buy a new one, but I am afraid of getting another bad board. This is sad because the other one is still running fine with Manjaro (I am using it as a NAS).
The bad one is in the small fanless case, but the temperature seemed OK to me. (I don't remember the value, and I wasn't doing anything intensive on the board, just installing some packages.)

memtester is throwing lots of errors on my 2nd board, and Debian, Manjaro and FreeBSD all hit lots of random errors (for example during package installation).
I tried to use the blob-less u-boot, following the Pine64 wiki, but got the same issue.
#19
I recommend decreasing the memory system clock and trying again.
#20
(08-03-2021, 02:21 PM)t4_4t Wrote: I recommend decreasing the memory system clock and trying again.

Thank you. Will check that.