How to remotely reboot a Linux host if SSH fails to connect?
Edit2: Thanks all for your responses! I have checked the logs, https://lemmy.nz/comment/6192604, and based on that removed tracker-miner-fs as it's a search/index tool which I don't need. No idea why it took over all memory. I'll also get a WiFi Smartplug as a kill switch. Hopefully that solves it.
Thanks again heaps!
I've got an HP ProDesk G3 that I'm using as a home server, with Ubuntu installed on it. Earlier this week the services I host on it (Immich & Frigate) stopped. I tried to SSH in, but it just hung after asking for a password.
I could ping it, but it was just unresponsive.
I had to force reboot it manually. This is fine, but I'm not always at home.
The chip has Intel vPro as far as I know, which could be an option, but I have no idea how this works. The documentation on the Intel site seems focused on enterprises. I tried to connect with RealVNC which does not work, so I think I've got to install/configure something on the server first.
I also asked Bing Chat but it came up with non-existent packages & commands.
Welcome your thoughts!
This is how we handled camera servers at one of my former jobs: we just set up HP SFF desktops with Windows and the software and turned on the watchdog timer. It always did the trick when power outages or system hangups happened.
No, this is a tool that can be used in a well-designed architecture. Would I do this with a single database server? Probably not. Would I ever run a single database server? Also probably not.
Also, by this point, you've probably already kernel panicked or something. There's not much left that can be saved and you probably needed that backup five minutes before the host came up.
A UniFi power strip on a UniFi network, so you can control the power switch remotely, plus setting the motherboard to power on automatically after a power failure. Though this is the nuclear option for restarting the system. Maybe while you're at it, diagnose why it keeps hanging up on you.
I wouldn't say that. Sure, it's not the preferred way of restarting a system, but it is a good backup to have if nothing else works, for example after remotely messing up the network connections.
When you power on your ProDesk G3, you can enter the MEBx setup by pressing Ctrl+P (some sources also say F6 or Escape will get you there). Intel AMT runs on a different IP address than the one your OS gets. You can use DHCP or assign a static IP address, and set up your admin password. You can then access the portal at http://ipaddress:16992. There should also be a way to see what would show on the screen, KVM-style, but I use MeshCentral for that so I couldn't tell you how to do it without.
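Once AMT has an IP address, a quick way to confirm the portal is actually listening is to try a plain TCP connection to port 16992 from another machine. This is only a minimal sketch: the function name and the default timeout are my own choices, and it only tells you the port is open, not that AMT is configured correctly.

```python
import socket

AMT_HTTP_PORT = 16992  # the AMT web portal port mentioned above


def port_is_open(host, port=AMT_HTTP_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Call it as port_is_open("ipaddress"), with the address you assigned in MEBx. If it returns False while the OS is up and pingable, AMT networking is probably not enabled yet.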
Hopefully, that gives you a start. Feel free to reach back out if you have any questions. Thank you!
You might also want to bump up logging and try to identify what all is causing that unresponsiveness in the first place. It could be something that pam_limits could solve.
Yes, thanks for that. Good point. I checked the logs, and minutes before it crashed I can see the following. Seems like either a GPU error or an out-of-memory error. I've deleted tracker-miner-fs as I don't need it.
It also shows a massive list of processes with their memory usage.
Feb 21 17:27:49 hppd600-g3 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:0:00000000
Feb 21 17:32:43 hppd600-g3 kernel: 1305621 total pagecache pages
Feb 21 17:32:43 hppd600-g3 kernel: 16258 pages in swap cache
Feb 21 17:32:43 hppd600-g3 kernel: Free swap = 0kB
Feb 21 17:32:43 hppd600-g3 kernel: Total swap = 1000444kB
Feb 21 17:32:43 hppd600-g3 kernel: 2065206 pages RAM
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages HighMem/MovableOnly
Feb 21 17:32:43 hppd600-g3 kernel: 64196 pages reserved
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages hwpoisoned
Feb 21 17:32:43 hppd600-g3 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-113.slice/user@113.service/background.slice/tracker-miner-fs-3.service,task=t>
Feb 21 17:32:43 hppd600-g3 kernel: Out of memory: Killed process 833 (tracker-miner-f) total-vm:625676kB, anon-rss:3144kB, file-rss:4816kB, shmem-rss:4kB, UID:113 pgtables:280kB oom_score_adj:200
Feb 21 17:32:43 hppd600-g3 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
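For anyone reading along, here is a small sketch of how the numbers in that excerpt can be pulled apart. It assumes 4 KiB pages (the usual x86-64 default, not something the log states), and the function names are made up for illustration. The regular expression only targets the "Out of memory: Killed process" line format shown above.

```python
import re

PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages, the common x86-64 default

OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon>\d+)kB, "
    r"file-rss:(?P<file>\d+)kB.*?oom_score_adj:(?P<adj>-?\d+)"
)


def parse_oom_kill(line):
    """Extract the victim's details from an OOM-killer log line, or None."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "total_vm_kib": int(m.group("total_vm")),
        # Resident memory = anonymous + file-backed pages actually in RAM.
        "rss_kib": int(m.group("anon")) + int(m.group("file")),
        "oom_score_adj": int(m.group("adj")),
    }


def pages_to_gib(pages):
    """Convert a page count (e.g. from 'pages RAM') to GiB."""
    return pages * PAGE_SIZE_KIB / (1024 * 1024)
```

Under that page-size assumption, "2065206 pages RAM" works out to roughly 7.9 GiB, and the swap was completely full. The killed tracker-miner-f process only had about 8 MB resident (3144 kB + 4816 kB), so it may have been picked because of its oom_score_adj of 200 rather than because it was the one holding the memory. The long per-process list printed just before this line would be the place to look for the real consumer.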
Yes, thanks for that. Good point. I checked the logs, and minutes before it crashed I can see the following. Seems like either a GPU error or an out-of-memory error.
No idea what tracker-miner-f is by the way.
It also shows a massive list of processes with their memory usage.
This goes beyond my knowledge :(
Feb 21 17:27:49 hppd600-g3 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:0:00000000
Feb 21 17:32:43 hppd600-g3 kernel: 1305621 total pagecache pages
Feb 21 17:32:43 hppd600-g3 kernel: 16258 pages in swap cache
Feb 21 17:32:43 hppd600-g3 kernel: Free swap = 0kB
Feb 21 17:32:43 hppd600-g3 kernel: Total swap = 1000444kB
Feb 21 17:32:43 hppd600-g3 kernel: 2065206 pages RAM
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages HighMem/MovableOnly
Feb 21 17:32:43 hppd600-g3 kernel: 64196 pages reserved
Feb 21 17:32:43 hppd600-g3 kernel: 0 pages hwpoisoned
Feb 21 17:32:43 hppd600-g3 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-113.slice/user@113.service/background.slice/tracker-miner-fs-3.service,task=t>
Feb 21 17:32:43 hppd600-g3 kernel: Out of memory: Killed process 833 (tracker-miner-f) total-vm:625676kB, anon-rss:3144kB, file-rss:4816kB, shmem-rss:4kB, UID:113 pgtables:280kB oom_score_adj:200
Feb 21 17:32:43 hppd600-g3 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Tracker miner fs indexes files and extracts their metadata for desktop search, iirc. There was a recent vulnerability where a malicious file could crash it and execute code just by being on disk. Make sure you haven't been hit by malware.
You could connect an ESP32 to the power and reset switches through opto-isolators or relays. You will have to do a little bit of programming, but you can host a website on the ESP32 that will allow you to operate the switches remotely.
If you want to get a bit fancier, you could connect the UART on the ESP32 to a serial port on the server through a TTL to RS-232 level converter and have a remote serial terminal embedded in the web page too. That won't do much good if the server is completely locked up though.
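As a rough idea of the "little bit of programming" for the switch part, here is a sketch in Python syntax that should also suit MicroPython on the ESP32. Everything in it is an assumption for illustration: the hold times, the function names, and the pin object, which only needs a value() method (on MicroPython that would be something like machine.Pin(n, machine.Pin.OUT) wired to the opto-isolator). The web page that calls these functions is left out.

```python
import time

SHORT_PRESS_S = 0.5  # assumption: a brief tap, like pressing the power button
LONG_PRESS_S = 6.0   # assumption: long hold, which typically forces power-off


def press_button(pin, hold_s=SHORT_PRESS_S, sleep=time.sleep):
    """Close the switch for hold_s seconds, then release it."""
    pin.value(1)  # drive the opto-isolator / relay: switch closed
    try:
        sleep(hold_s)
    finally:
        pin.value(0)  # always release, even if the sleep is interrupted


def hard_power_cycle(pin, off_wait_s=3.0, sleep=time.sleep):
    """Force the machine off with a long press, wait, then tap to power on."""
    press_button(pin, LONG_PRESS_S, sleep)
    sleep(off_wait_s)
    press_button(pin, SHORT_PRESS_S, sleep)
```

Passing the sleep function in makes it easy to try the logic on a desktop with a fake pin before touching the real hardware.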
On actual server motherboards (as opposed to repurposed home PCs) there is sometimes a special KVM-like interface (keyboard/video/mouse, not the VM hypervisor) so you can connect to it with VNC and have the equivalent of local access. This is called iDRAC on Dell servers and other vendors have something similar.
On a home PC, hmm, you might be able to set up some kind of remote power cycle and serial console connection, using a second computer (Raspberry Pi or the like). I'm unfamiliar with the Intel AMT that you mentioned, but it seems like another option.
I do remember hearing of a DRAC-like board for PCs but the name of it escapes me right now.
At the end of the day, if you want a long running server, you probably should host it in a data center, maybe with failover and other HA provisions. Home environments are a pain to set up for that. If your computer goes offline and you can't reach it, how do you even know that your home isn't having a power outage? Home ISPs are flaky too, so maybe you want a backup route over mobile data, etc. Yes you can make workarounds for everything but it amounts to turning your home into a crappy low capacity data center.
PiKVM or a similar device could work for OP - is that what you are thinking of? I've used it and it works well.
I think a lot of people who self-host get caught up in the excitement of getting the services up and running and neglect disaster planning, prevention, and recovery (myself included). Either they put it off for later or don't realize it could be a problem down the road until it happens. We always say not to self host anything you can't live without, and most take that advice, others don't. Not saying OP falls in either category, necessarily, just adding on to some of your points.
Self hosting really is the land of compromise where we all have to balance our requirements, budget, time and effort. Personally, I have a little disposable income that I spend on hardware to host non-critical services so I can learn and tinker. It could all go away and all I will have lost is the time and money I put into it, but I gained some knowledge and enjoyment. Needless to say, I don't have much in the way of backups and monitoring.
Thanks, but a data center is probably overkill for my needs. I've got it power loss protected with a UPS, and that's more than enough for us. Thanks anyway :)
I have a RPI, but of course that one can hang too. I'll buy a simple WiFi smart plug, standalone, as a kill switch.
Thanks! Yeah, it seemed to be an OOM issue, but based on my Kagi queries it seems like an OS issue.
But, it also has an error about the GPU.
Normal memory usage is more than fine, so perhaps it was a one time thing.
See logs: https://lemmy.nz/comment/6192604
I’m not in front of my computer atm, but I think I have something that can help you out. I have a 3-node Lenovo thin client cluster, and I manage the nodes through KVM access using Intel vPro. I even went a step further and run MeshCentral on a VM to centralize my KVM access since I have 3 of them, but that’s another story.
Anyway, I’ll see if I can grab you some URLs in the morning if someone else doesn’t beat me to it or you find it on your own running Google queries.
Thanks mate. It was a bit of a rabbit hole, I found stuff about the watchdog package, and you can configure it to use the iTCO_wdt module, but I also read it was blacklisted, and then I just gave up.
I posted somewhere else in the thread what led up to the hang. And, I think I'll buy a WiFi smartplug so I can remotely reboot everything; assuming the WiFi still works :D
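For reference, this is roughly what a watchdog daemon does under the hood, in case the idea gets revisited. It is a minimal sketch, not the watchdog package's implementation: it assumes a working driver exposes /dev/watchdog (whether iTCO_wdt does so on this machine is exactly the open question above), that the driver supports the "magic close" character, and the function name and parameters are my own. It needs root, and on real hardware a mistake here reboots the box.

```python
import time


def pet_watchdog(device_path="/dev/watchdog", interval_s=10.0,
                 max_pets=None, healthy=lambda: True):
    """Keep the watchdog fed while healthy() is true; return pets written."""
    pets = 0
    # Opening the device arms the hardware timer.
    with open(device_path, "wb", buffering=0) as wd:
        while max_pets is None or pets < max_pets:
            if not healthy():
                # Stop feeding and leave without the magic close. On drivers
                # that support magic close, the timer keeps running and the
                # hardware resets the machine when it expires.
                return pets
            wd.write(b"\0")  # any write restarts the countdown
            pets += 1
            time.sleep(interval_s)
        # Clean shutdown: 'V' asks the driver to disarm when the file closes
        # (ignored if the driver was built or loaded with nowayout).
        wd.write(b"V")
    return pets
```

The interval has to be shorter than the hardware timeout. The point of the design is that if the whole system locks up the way it did here, this loop stops running too, nothing feeds the timer, and the board resets itself without needing the network at all.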