SCOM: What’s wrong with my Unix agents? Why are they greyed out?

Offline Unix agents are difficult to work with, especially in a large monitoring environment. Although SCOM offers a native heartbeat monitor, it is hard to tell quickly whether the computer is actually down or something is wrong with the agent configuration.

A Unix agent may go offline for various reasons: the SCX process is not running, a Run As account password was changed, the certificate was reset, or the computer itself is down. SCOM has the “UNIX/Linux Heartbeat Monitor”, “WS-Management Run As Account Health” and “WS-Management Certificate Health” monitors to cover these conditions and alert on offline agents. But it is tedious for a support engineer to handle multiple alerts for the same issue and correlate them before fixing the agent, which can cost considerable time.

Wouldn’t it be easier to have a single alert on heartbeat failure, with the status of all the other monitors in the summary?

But wait, should we also include the ping status in the alert summary, so that the support engineer knows what to check first?

Yes, that’s what we are going to do now using PowerShell. The script below, which can be run on any management server, logs an event in the “Operations Manager” event log.

You can create a rule that looks for these events (event ID 1041, source “Health Service Script”) and generates an alert. The alert will identify the offline agent and give the state of the other monitors.

Import-Module OperationsManager
New-SCOMManagementGroupConnection

# Find all Unix/Linux computers that SCOM currently reports as unavailable
$mc = Get-SCOMClass -Name Microsoft.Unix.Computer
$agents = Get-SCOMClassInstance -Class $mc | Where-Object { -not $_.IsAvailable }

foreach ($agent in $agents) {
    # Ignore servers in maintenance mode
    if (-not $agent.InMaintenanceMode) {
        $agentname = $agent.DisplayName
        $RespondsToPing = Test-Connection -ComputerName $agentname -Quiet
        $sh = $agent.GetMonitoringStateHierarchy()
        $avail_mon = $sh.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'Availability' }
        $hb_mon = $avail_mon.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'Unix/Linux Heartbeat Monitor' }
        $hb_mon_state = $hb_mon.Item.HealthState
        # Report only agents whose heartbeat monitor is neither healthy nor uninitialized
        if ($hb_mon_state -ne "Success" -and $hb_mon_state -ne "Uninitialized") {
            $config_mon = $sh.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'Configuration' }
            $cert_mon = $config_mon.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'WS-Management Certificate Health' }
            $runas_mon = $config_mon.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'WS-Management Run As Account Health' }
            $cert_mon_state = $cert_mon.Item.HealthState
            $runas_mon_state = $runas_mon.Item.HealthState
            if ($RespondsToPing) {
                $pingable = "Pingable"
            }
            else {
                $pingable = "Not Pingable"
            }
            # One event per offline agent, carrying the state of every related check
            $status = "PING_STATUS: $pingable HEARTBEAT_STATUS: $hb_mon_state CERTIFICATE_STATUS: $cert_mon_state, USER_ACCOUNT_STATUS: $runas_mon_state"
            Write-EventLog -LogName 'Operations Manager' -Source 'Health Service Script' -EventId 1041 -EntryType Error -Category 0 -Message "UNIX SCOM agent on $agentname is not sending a heartbeat - $status"
        }
    }
}
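
The status text in the event is meant to tell the support engineer where to start. As a rough illustration of how that text could be read back, here is a minimal sketch that splits the message into its four fields and suggests a first check. The function name `Get-AgentTriageHint`, the order of the checks, and the host name `host01` are assumptions made for this example; they are not part of SCOM or of the script above.

```powershell
# Hypothetical helper (not part of SCOM): parses the status text written by the
# script and suggests which check to start with.
function Get-AgentTriageHint {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Message
    )

    # Field layout assumed from the $status string built by the script
    $pattern = 'PING_STATUS:\s*(?<Ping>.+?)\s+HEARTBEAT_STATUS:\s*(?<Heartbeat>\S+)\s+' +
               'CERTIFICATE_STATUS:\s*(?<Certificate>[^,\s]*),?\s*USER_ACCOUNT_STATUS:\s*(?<RunAs>\S*)'

    if (-not ($Message -match $pattern)) {
        return $null   # not one of our events
    }

    $unhealthy = 'Warning', 'Error'

    # Assumed triage order: network first, then certificate, then credentials
    if ($Matches['Ping'] -eq 'Not Pingable') {
        $hint = 'Check whether the computer is down or unreachable'
    }
    elseif ($unhealthy -contains $Matches['Certificate']) {
        $hint = 'Check the agent certificate'
    }
    elseif ($unhealthy -contains $Matches['RunAs']) {
        $hint = 'Check the Run As account credentials'
    }
    else {
        $hint = 'Check that the SCX process is running on the agent'
    }

    [pscustomobject]@{
        Ping        = $Matches['Ping']
        Heartbeat   = $Matches['Heartbeat']
        Certificate = $Matches['Certificate']
        RunAs       = $Matches['RunAs']
        FirstCheck  = $hint
    }
}

# Example with a made-up event message
$sample = 'UNIX SCOM agent on host01 is not sending a heartbeat - PING_STATUS: Pingable HEARTBEAT_STATUS: Error CERTIFICATE_STATUS: Success, USER_ACCOUNT_STATUS: Error'
Get-AgentTriageHint -Message $sample
```

The function returns `$null` for any text that does not follow the layout, so it can be pointed at a mix of events without extra filtering.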
