I think it's a good idea to write the multithreaded recreate code, as that will verify reliability before swapping the processors.
While the below arrived today, I will not switch the 8-socket CPUs right now. Instead I will complete the OpenMP work and determine the OpenMP sweet spot for TSP greedy R&R on my systems. If it turns out to be not 144C/288T but 28C/56T or 16C/32T, I will upgrade and then see whether the picture changes with the 192C/384T system. I have 30 days to make a return, so there is no hurry to upgrade the 8-socket system:
As another stress test I'd suggest installing Julia and executing something like
Code:
using LinearAlgebra   # provides norm
# N is the matrix dimension; set it before running
A=randn(N,N)
b=randn(N)
x=A\b
println(norm(A*x-b))
The kitten named Scratchy created a script that saves the matrices in a list so each computation exercises a different memory region. Let me know if you want me to look for it and post it.
Check with
Code:
$ edac-util -v
Code:
$ sudo dmesg
After a number of years an unhappy system here has developed ECC errors on five out of twelve DIMMs. You don't want the output to look like
Code:
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 4 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#2_DIMM#0: 8375 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 1394 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#2_DIMM#0: 24 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#1_DIMM#0: 234 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#2_DIMM#0: 0 Corrected Errors

Statistics: Posted by ejolson — Fri Sep 19, 2025 6:02 am
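On a machine with many DIMMs it may be easier to filter a listing like the one above than to read every line. The following is only a rough sketch: the function name flag_dimms is made up for illustration, and it assumes the per-DIMM lines have exactly the six-field layout shown in this post, which may differ with other EDAC drivers or edac-utils versions.

```shell
# Hypothetical helper (not part of edac-utils): reads `edac-util -v`
# style output on stdin and prints only the DIMMs whose corrected-error
# count is nonzero, as "<DIMM label> <count>".
# Assumes per-DIMM lines look like
#   mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 4 Corrected Errors
# i.e. six whitespace-separated fields with "Corrected" in field 5.
flag_dimms() {
    awk 'NF == 6 && $5 == "Corrected" && $4 + 0 > 0 { print $3, $4 }'
}
```

It would be used as `edac-util -v | flag_dimms`. No output then means no corrected errors were reported on any DIMM that matches the assumed format.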