Channel: Raspberry Pi Forums
Viewing all articles
Browse latest Browse all 8082

Off topic discussion • Re: A bit of Raspberry with 144C/288T


While below arrived today, I will not switch 8-socket CPUs right now. Instead I will complete OpenMP work and determine OpenMP sweet spot for TSP greedy R&R on my systems. In case it is not 144C/288T but 28C/56T or 16C/32T, I will upgrade and see if the picture changes with 192C/384T system then. I have 30 days to return, so no hurry to upgrade the 8-socket system:
I think it's a good idea to write the multithreaded recreate code, as that will verify reliability before swapping the processors.

As another stress test I'd suggest installing Julia and executing something like

Code:

using LinearAlgebra   # provides norm

A=randn(N,N)
b=randn(N)
x=A\b
println(norm(A*x-b))
in a loop over different sizes of N with OpenBLAS threading turned on.

The kitten named Scratchy created a script that saves the matrices in a list so each computation exercises a different memory region. Let me know if you want me to look and post it.

Check with

Code:

$ edac-util -v
to make sure there are no ECC errors. Also try

Code:

$ sudo dmesg
and check for weird processor faults.
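
Since the kernel log can be long, it may help to filter it down to hardware-related lines first. The following is only a sketch: `hwfaults` is a made-up helper name, and the keyword list is my assumption about what machine-check and EDAC messages typically contain, so it could miss faults that are worded differently.

```shell
#!/bin/sh
# hwfaults: hypothetical helper, not a standard tool.
# Reads kernel log text on stdin and prints only the lines that mention
# machine checks, EDAC or hardware errors (case-insensitive).
# The keyword list is an assumption and may need adjusting per platform.
hwfaults() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE 'mce:|machine check|hardware error|edac' || true
}

# Usage: sudo dmesg | hwfaults
```

An empty result only means none of the assumed keywords appeared, so a quick read of the full log is still worthwhile.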

After a number of years an unhappy system here has developed ECC errors on five out of twelve DIMMs. You don't want the output to look like

Code:

$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 4 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#2_DIMM#0: 8375 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 1394 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#2_DIMM#0: 24 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#1_DIMM#0: 234 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
If that computer were not secured in the North Data Center, I would reseat the DIMMs and clean their fingers. Fortunately, enabling additional redundancy keeps it from crashing every day. I wouldn't accept such errors on newly acquired hardware even if recycled.
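
On a machine with many DIMMs, picking the bad ones out of a listing like the one above by eye gets tedious. Here is a minimal sketch that keeps only the lines reporting a nonzero count. The helper name `edac_nonzero` is made up for illustration, and the script assumes the count is the field immediately before the word "Corrected" or "Uncorrected", as in the output shown in this post.

```shell
#!/bin/sh
# edac_nonzero: hypothetical helper, not part of edac-utils.
# Reads `edac-util -v` style text on stdin and prints only the lines
# that report a nonzero number of corrected or uncorrected errors.
# Assumption: the count is the field right before "Corrected" or
# "Uncorrected", matching the output format shown in this post.
edac_nonzero() {
    awk '{
        for (i = 2; i <= NF; i++)
            if (($i == "Corrected" || $i == "Uncorrected") && $(i-1) + 0 > 0) {
                print
                break
            }
    }'
}

# Usage: edac-util -v | edac_nonzero
```

On healthy hardware this should print nothing at all, which makes it easy to run after each stress test.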

Statistics: Posted by ejolson — Fri Sep 19, 2025 6:02 am


