Wednesday, 1 September 2021

LinuxCNC freezing intermittently - fixed? - stress testing

What's all this about, Fatty?

Damned Linux PC has been running in the background on and off over the past month or so. I'm doing this to try to convince myself that it's robust, having seen a number of worrying issues during that time.

Generally that's involved the screen, mouse and keyboard randomly freezing. This can happen at any moment and the only remedy is a hard reboot. Not quite what you hope for in a CNC controller really.

It doesn't seem to be associated with any particular operation or program - or indeed any actual input from the user. True, it's happened while I've been using it but equally it's more often happened when it's been left to run by itself. 

I'm no Linux power user by any means, so what can I do to troubleshoot this? What's available by way of tools? 

Crash logs?

Yes, Linux creates crash logs that might give some insight, but when I looked at them they were (for me) overwhelmingly detailed and voluminous. The text output also looked pretty inconsistent, as if it had been written by a whole host of authors; I'm guessing each module of code came with its own debug messages. So I deleted mine, so that any new messages would be easier to spot.
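
One way to make a log that size less daunting is to count how many lines each "author" contributed before reading any of them. This is only a rough sketch: it assumes the traditional syslog line layout (month, day, time, host, then the program name), and `messages_per_source` is a name made up for the example.

```shell
# Count log lines per source program, busiest first.
# Assumes the traditional syslog layout: "Sep  1 09:00:01 host program[pid]: message".
messages_per_source() {
    awk '{
        tag = $5                      # fifth field is "program[pid]:" or "program:"
        sub(/\[[0-9]+\]:$/, "", tag)  # drop a trailing "[pid]:"
        sub(/:$/, "", tag)            # drop a trailing ":"
        count[tag]++
    }
    END { for (t in count) printf "%d %s\n", count[t], t }' "$1" |
        sort -k1,1nr -k2
}

# Example (the path is an assumption; adjust to wherever your logs live):
#   messages_per_source /var/log/syslog | head
```

The few sources at the top of that list are usually the ones worth reading first.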

Update Manager - was that the answer?

These crashes could happen minutes or even a day or more apart. In line with good / obvious practice, it seemed sensible to do an(other) system update before getting more serious about this. This happened 2 days ago and in itself caused a crash towards the end of the update process, followed by another crash after restarting. However, since then, I've been unable to provoke any further misbehaviour. 

Fixed, Fatty?

Can it really have been fixed by that last update? The problem with intermittent faults is that you can never be 100% certain you've fixed them. Another failure could happen seconds after you celebrate the problem being "fixed", particularly if you didn't actually find a clear "smoking gun" root cause and implement a convincing corrective action.

Stress testing!

Yes, let's try to stress test the system by loading it up with a range of programs and then run it flat out for several days.

  • Run LinuxCNC Axis Sim with an endless program loop. 
  • Stream music videos on YouTube
  • Gobble up more memory with Visual Studio Code
  • Run the "stress" program from Linux command prompt from time to time
  • Report system status (up time, core temp etc)
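
For the "stress" step, a small helper along these lines would build the command. The worker counts, memory size and duration are illustrative assumptions, not the values actually used here, and `stress_cmd` is a hypothetical name.

```shell
# Build a "stress" command line that loads the CPU cores plus some memory.
# Sizes and durations are illustrative assumptions.
stress_cmd() {
    cores="${1:-$(nproc)}"      # number of CPU workers, default: all cores
    seconds="${2:-600}"         # how long to run, default: 10 minutes
    echo "stress --cpu $cores --vm 2 --vm-bytes 512M --timeout ${seconds}s"
}

# Print the command first; run it once you are happy with it:
#   stress_cmd 4 600
#   sh -c "$(stress_cmd 4 600)"
```

Printing the command rather than running it straight away makes it harder to swamp the machine by accident.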

The whole lot running together looked like this:


Screencast video:

System status:

Here's where to see stuff like the core temp, uptime etc:


I've never managed to get the core temp up above 59°C (local ambient is 25°C at max). The 8GB of memory is almost maxed out at ~7.8GB. I'm using all the cores, on-board graphics, network connection, real time kernel etc.
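
The same figures can also be logged from the command line, so there is a record of the last known state if the machine freezes. This is a minimal sketch: it assumes `thermal_zone0` is the CPU sensor (that varies between machines), and the function names and log file path are made up for the example.

```shell
# Convert a raw sysfs temperature (millidegrees C) to degrees with one decimal.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Convert seconds of uptime to the "1d 19h 02m" style used by inxi.
format_uptime() {
    s=${1%.*}   # drop any fractional part, as found in /proc/uptime
    printf '%dd %02dh %02dm\n' $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60))
}

# One status line; thermal_zone0 being the CPU sensor is an assumption.
status_line() {
    temp=$(cat /sys/class/thermal/thermal_zone0/temp)
    up=$(cut -d' ' -f1 /proc/uptime)
    free_mb=$(awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo)
    echo "$(date '+%F %T') cpu=$(millideg_to_c "$temp")C up=$(format_uptime "$up") mem_available=${free_mb}MiB"
}

# Example: log once a minute until interrupted.
#   while true; do status_line >> "$HOME/stress-status.log"; sleep 60; done
```

After a freeze and hard reboot, the last line in the log gives a rough time of death plus the temperature and free memory just before it.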

The real time kernel is shown by typing the "uname -a" command:

muzzer@LinuxCNC:~$ uname -a
Linux LinuxCNC 4.19.106-rt46-lcnc #1 SMP PREEMPT RT Mon Sep 14 12:23:06 AEST 2020 x86_64 x86_64 x86_64 GNU/Linux

The "PREEMPT RT" in that output says this is the preempt-rt real time kernel. It's what got installed by default when I installed Linux Mint and LinuxCNC 2.8.0.
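
For a quick scripted check rather than reading the string by eye, something like the following would do. It is only a sketch that pattern-matches the version string; `kernel_flavour` and its three labels are made up for the example.

```shell
# Classify a kernel version string such as the output of "uname -v".
kernel_flavour() {
    case "$1" in
        *"PREEMPT RT"*|*PREEMPT_RT*) echo "rt" ;;       # fully preemptible (real-time patch)
        *PREEMPT*)                   echo "preempt" ;;  # preemptible, but not the RT patch set
        *)                           echo "generic" ;;
    esac
}

kernel_flavour "$(uname -v)"
```

With the version string shown above, this prints "rt".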

Let's leave this to run....

Dammit!

No, not fixed yet, although it seems to be a lot more stable. Since just after the Update Manager did its magic, I've not had any further crashes apart from one different kind of freeze - the screen went blank and wouldn't recover. Feels like a different problem.

Let's have a look at that crash report again.....
 

There seem to be 2 types of error mentioned: the "pam_kwallet5.so" and "drm:intel_pipe_update_end[i915].....Atomic update failure....". 
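
To get a feel for how often each of these turns up, the two strings can simply be counted in the logs. A rough sketch: the log file paths in the comments are assumptions, and `count_signature` is a made-up helper name.

```shell
# Count lines in a log file that contain a fixed string (no regex surprises).
count_signature() {
    grep -c -F -- "$1" "$2" || true   # grep exits non-zero on zero matches; still prints 0
}

# Example (log paths are assumptions):
#   count_signature "pam_kwallet5.so" /var/log/auth.log
#   count_signature "Atomic update failure" /var/log/kern.log
```

Running the same counts before and after a change gives a crude measure of whether the change made any difference.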

The first seems to come from some sort of password locker app (KWallet) which is of no relevance here, so it needs to be disabled - this post tells you how. Sorted - and the thing still boots up and works.

The second is one of those annoyingly cryptic messages that needs a good slapping. Seems that "i915" refers to the kernel driver for the Intel HD graphics chipset. As far as I can tell, "Atomic" refers to the way the driver applies a display update as a single all-or-nothing operation, and the error means one of those updates didn't complete in time. Sounds like a plausible basis for a system freeze, yet the feature it's been linked to (the "Panel Self-Refresh (PSR) implementation") seems to be something we can live without. Interestingly, that sounds like the same material that was referred to as "micro code" and was installed as part of that Update Manager event, although strictly speaking microcode is CPU firmware and the i915 driver is part of the kernel. So let's disable that fucker for starters.
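
For anyone wanting to do the same: one common way to turn PSR off is the i915 module option `enable_psr=0`. That is an assumption here, as the exact method isn't spelled out above. The option is passed as `i915.enable_psr=0` on the kernel command line, for example via GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by `sudo update-grub` and a reboot. The sketch below only checks whether the option is present; `psr_disabled_on_cmdline` is a made-up helper name.

```shell
# Succeeds if the given kernel command line contains i915.enable_psr=0.
psr_disabled_on_cmdline() {
    case " $1 " in
        *" i915.enable_psr=0 "*) return 0 ;;
        *)                       return 1 ;;
    esac
}

if psr_disabled_on_cmdline "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "PSR is disabled on the kernel command line"
else
    echo "PSR is not disabled on the kernel command line"
fi
```

Checking /proc/cmdline after the reboot confirms that the option actually reached the kernel, not just the GRUB config file.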
 

Again, that seems to have stuck without breaking anything. Are we onto something here?

1st Sept, 09:00
Sensors:   System Temperatures: cpu: 53.0 C mobo: N/A
           Fan Speeds (RPM): N/A
Repos:     No active apt repos in: /etc/apt/sources.list
           Active apt repos in: /etc/apt/sources.list.d/additional-repositories.list
           1: deb http: //cnc.beaglebrainz.net/mintcnc/ bionic 2.8-rtpreempt
           Active apt repos in: /etc/apt/sources.list.d/hardkernel-ppa-bionic.list
           1: deb http: //ppa.launchpad.net/hardkernel/ppa/ubuntu bionic main
           Active apt repos in: /etc/apt/sources.list.d/kelebek333-kablosuz-bionic.list
           1: deb http: //ppa.launchpad.net/kelebek333/kablosuz/ubuntu bionic main
           Active apt repos in: /etc/apt/sources.list.d/official-package-repositories.list
           1: deb http: //packages.linuxmint.com tricia main upstream import backport #id:linuxmint_main
           2: deb http: //archive.ubuntu.com/ubuntu bionic main restricted universe multiverse
           3: deb http: //archive.ubuntu.com/ubuntu bionic-updates main restricted universe multiverse
           4: deb http: //archive.ubuntu.com/ubuntu bionic-backports main restricted universe multiverse
           5: deb http: //security.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse
           6: deb http: //archive.canonical.com/ubuntu/ bionic partner
           Active apt repos in: /etc/apt/sources.list.d/vscode.list
           1: deb [arch=amd64,arm64,armhf] http: //packages.microsoft.com/repos/code stable main
Info:      Processes: 217 Uptime: 1d 19h 02m Memory: 7.50 GiB used: 2.19 GiB (29.2%) Init: systemd
           v: 237 runlevel: 5 Compilers: gcc: 7.5.0 alt: 7 Client: Unknown python3.6 client
           inxi: 3.0.32


Looking fixed now - conclusions:
  • I suspect I didn't have the "micro code" installed until I ran the Update Manager. I assumed this includes (comprises?) the i915 graphics driver, although that driver is really part of the kernel.
  • I've disabled the Intel HD graphics "PSR" implementation that came with the "micro code", as it seems to have an "open bug".
  • I've also disabled that KWallet password app that was causing errors.
  • No crashes since then, 4 days later, with LinuxCNC and YouTube running continuously.
I think I may have fixed it now but of course I'll only know for sure either way if it freezes...
