Monitoring Best Practices

Practical guidance for implementing effective monitoring in MSP and enterprise environments. Focus on what actually matters and avoid alert fatigue.

Monitoring Philosophy

Monitor for Impact, Not Activity

Bad: Alert when disk usage reaches 80%
Good: Alert when the usage trend projects a full disk within 48 hours
Bad: Alert on every failed login attempt
Good: Alert on 5 failed attempts in 5 minutes from the same IP
Bad: Alert on every CPU spike
Good: Alert when CPU stays over 90% for 15 minutes
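
As an illustration of the failed-login rule, here is a minimal bash sketch. It assumes a hypothetical input format of one failed login per line, `<epoch_seconds> <source_ip>`, sorted by time; real log formats differ and would need their own parsing step. The function name and its defaults are illustrative, not part of any monitoring product.

```bash
#!/bin/bash
# find_bruteforce_ips [THRESHOLD] [WINDOW_SECONDS]
# Reads "<epoch_seconds> <source_ip>" lines (oldest first) on stdin and prints
# each IP once, the first time it reaches THRESHOLD failures within WINDOW_SECONDS.
find_bruteforce_ips() {
    local threshold="${1:-5}" window="${2:-300}"
    awk -v threshold="$threshold" -v window="$window" '
    {
        ts = $1; ip = $2
        n = ++count[ip]
        times[ip, n] = ts
        # Compare this failure with the one (threshold - 1) entries earlier:
        # if both fall inside the window, so do all the failures between them.
        if (n >= threshold && ts - times[ip, n - threshold + 1] <= window && !(ip in flagged)) {
            flagged[ip] = 1
            print ip
        }
    }'
}

# Usage (failed-logins.txt is a hypothetical pre-parsed file):
#   find_bruteforce_ips 5 300 < failed-logins.txt
```

A single failed login never triggers output here; only a burst from one source does, which is the point of alerting on impact rather than activity.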

Alert Priorities

Critical (Page immediately)

Services completely down
Data loss in progress
Security breach detected
Primary systems offline
Backup failures
Certificate expired

Warning (Email/ticket during business hours)

Services degraded
Disk space trending to full
Patch compliance falling behind
Non-critical service failures
Performance degradation

Information (Log only)

Normal operations
Successful backups
Routine maintenance
Trend data
Capacity planning metrics

What to Monitor

Servers

Critical Metrics

Lang: text

- [ ] CPU usage (sustained high load)
- [ ] Memory usage (low available memory)
- [ ] Disk space (all volumes)
- [ ] Disk I/O (performance degradation)
- [ ] Network connectivity (ping/port checks)
- [ ] Critical services (running status)
- [ ] Windows Update status
- [ ] Antivirus status and definitions
- [ ] System event log errors

Service-Specific

Lang: text

Domain Controllers:
- [ ] AD replication status
- [ ] SYSVOL replication
- [ ] DNS service responding
- [ ] DHCP scope utilization
- [ ] FSMO role holder status

File Servers:
- [ ] Share accessibility
- [ ] DFS replication status
- [ ] Shadow copy success
- [ ] Disk queue length

Database Servers:
- [ ] Database service status
- [ ] Transaction log size
- [ ] Deadlocks and blocking
- [ ] Backup job status
- [ ] Connection pool usage

Web Servers:
- [ ] HTTP/HTTPS response
- [ ] Application pool status
- [ ] Certificate expiration
- [ ] Response time
- [ ] Error rates

Workstations

Essential Only

Lang: text

- [ ] Online/offline status
- [ ] Antivirus status
- [ ] Last boot time (stale systems)
- [ ] Patch compliance
- [ ] Disk encryption status

Note: Too much workstation monitoring creates noise. Focus on security and compliance.

Network Devices

Lang: text

Firewalls:
- [ ] VPN tunnel status
- [ ] Interface status
- [ ] High availability failover
- [ ] Policy sync status
- [ ] License expiration

Switches:
- [ ] Port status (critical uplinks)
- [ ] VLAN configuration
- [ ] Spanning tree changes
- [ ] Power supply status (if redundant)

Wireless:
- [ ] Controller connectivity
- [ ] AP status
- [ ] Client association issues
- [ ] Channel utilization

Backups

Lang: text

- [ ] Backup job completion status
- [ ] Backup duration trends
- [ ] Backup size trends
- [ ] Failed files/errors
- [ ] Backup storage capacity
- [ ] Offsite/cloud backup status
- [ ] Test restore success

Cloud Services

Lang: text

Microsoft 365:
- [ ] Service health
- [ ] Mailbox full warnings
- [ ] License expiration
- [ ] Exchange Online Protection alerts
- [ ] SharePoint site quota

Azure/AWS:
- [ ] Virtual machine status
- [ ] Storage account capacity
- [ ] Subscription spending alerts
- [ ] Security Center alerts
- [ ] Backup job status

Alert Thresholds

Disk Space

Bad Approach: Static threshold

Lang: text

Alert when C: drive reaches 80% full

Good Approach: Trend-based

Lang: text

Alert when disk will fill in 7 days at current rate

Example calculation:
- Monday: 60GB free
- Friday: 55GB free
- Rate: 5GB/4 days = 1.25GB/day
- Days until full: 55GB / 1.25GB = 44 days
- Action: Monitor, no alert yet

If rate increases to 10GB/day:
- Days until full: 55GB / 10GB = 5.5 days
- Action: Alert (below the 7-day threshold)

Implementation Example

Lang: powershell

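# Check disk space trend
# Note: Get-DiskSpaceHistory and Send-Alert are placeholders for your own
# history store and alerting integration; they are not built-in cmdlets.
$drive = "C"          # Get-PSDrive expects the drive name without a colon
$threshold = 7        # days
$historyDays = 7

# Assumes history is returned oldest first, with a FreeGB property per sample
$history = Get-DiskSpaceHistory -Drive $drive -Days $historyDays
$currentFree = (Get-PSDrive -Name $drive).Free / 1GB
$avgDailyDecrease = ($history[0].FreeGB - $history[-1].FreeGB) / $historyDays

# Only project a fill date when free space is actually shrinking
if ($avgDailyDecrease -gt 0) {
    $daysUntilFull = [math]::Round($currentFree / $avgDailyDecrease, 1)

    if ($daysUntilFull -lt $threshold) {
        Send-Alert -Severity Critical -Message "Drive ${drive}: will fill in $daysUntilFull days"
    }
}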
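
On Linux hosts, the same projection can be sketched in bash from two free-space samples. This is a minimal illustration, not a complete collector: the function name `days_until_full` is hypothetical, and how the samples are stored between runs is left to your monitoring system.

```bash
#!/bin/bash
# days_until_full FREE_NOW_GB FREE_EARLIER_GB DAYS_BETWEEN_SAMPLES
# Prints the projected number of days until the volume is full,
# or "stable" when free space is not shrinking.
days_until_full() {
    awk -v now="$1" -v earlier="$2" -v days="$3" 'BEGIN {
        rate = (earlier - now) / days      # GB consumed per day
        if (rate <= 0) { print "stable"; exit }
        printf "%.1f\n", now / rate
    }'
}

# Usage with the Monday/Friday figures from the worked example:
#   days_until_full 55 60 4
```

Returning "stable" instead of a number avoids a division by zero and keeps volumes that are growing in free space out of the alert path.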

CPU Usage

Appropriate Thresholds

Lang: text

Critical: >90% for 15 minutes
Warning: >80% for 30 minutes
Info: >70% for 1 hour

Avoid: Alert on any spike above 80%. Temporary spikes are normal.
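
One way to express "above the threshold for N minutes" is to require that every recent sample exceeds it. The sketch below assumes one CPU reading per minute, passed oldest first; the function name `sustained_above` and the sampling interval are assumptions for illustration.

```bash
#!/bin/bash
# sustained_above THRESHOLD REQUIRED_SAMPLES SAMPLE...
# Succeeds (exit 0) only if the most recent REQUIRED_SAMPLES readings
# all exceed THRESHOLD. Samples are passed oldest first.
sustained_above() {
    local threshold="$1" required="$2"
    shift 2
    # Not enough history yet: do not alert.
    [ "$#" -ge "$required" ] || return 1
    # Drop older readings so only the newest REQUIRED_SAMPLES remain.
    shift $(( $# - required ))
    local sample
    for sample in "$@"; do
        awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s > t) }' || return 1
    done
    return 0
}

# Usage: critical if the last 15 one-minute samples are all above 90
#   sustained_above 90 15 "${samples[@]}" && echo "CRITICAL: sustained high CPU"
```

A single reading below the threshold resets the condition, so short spikes never reach the alert.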

Memory

For Windows Servers

Lang: text

Critical: <10% available physical RAM for 10 minutes
Warning: <20% available for 15 minutes

Note: Measure available memory, not percent used. Some page file usage is normal.

For Linux Servers

Lang: text

Critical: <100MB available (not cached/buffered)
Warning: <500MB available

Note: Linux uses RAM for caching. Check actual available, not just free.
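
A minimal sketch of the available-memory check, reading the kernel's `MemAvailable` field from `/proc/meminfo`. The function name is hypothetical, the default thresholds mirror the values above, and the exit codes follow the common 0/1/2/3 (OK/warning/critical/unknown) plugin convention as an assumption about how the result would be consumed.

```bash
#!/bin/bash
# check_mem_available [MEMINFO_FILE] [CRIT_MB] [WARN_MB]
# Classifies available memory using MemAvailable, which already accounts
# for reclaimable cache, rather than the raw "free" figure.
check_mem_available() {
    local meminfo="${1:-/proc/meminfo}" crit_mb="${2:-100}" warn_mb="${3:-500}"
    local avail_kb avail_mb
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
    if [ -z "$avail_kb" ]; then
        echo "UNKNOWN: MemAvailable not reported"
        return 3
    fi
    avail_mb=$(( avail_kb / 1024 ))
    if [ "$avail_mb" -lt "$crit_mb" ]; then
        echo "CRITICAL: ${avail_mb}MB available"
        return 2
    elif [ "$avail_mb" -lt "$warn_mb" ]; then
        echo "WARNING: ${avail_mb}MB available"
        return 1
    fi
    echo "OK: ${avail_mb}MB available"
}

# Usage:
#   check_mem_available
```

Fixed megabyte thresholds suit small hosts; on large servers a percentage of total memory may be a better fit.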

Service Monitoring

HTTP/HTTPS

Lang: text

Check frequency: Every 5 minutes
Timeout: 30 seconds
Alert after: 2 consecutive failures
Check: HTTP 200 response code
Response time warning: >3 seconds
Response time critical: >10 seconds
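
The settings above could be wired together roughly as follows. This is a sketch under assumptions: the function names and the state-file approach are illustrative, and `probe_url` relies on curl being installed. A scheduler such as cron would run the probe every 5 minutes.

```bash
#!/bin/bash
# classify_http STATUS_CODE RESPONSE_SECONDS -> prints OK, WARNING, CRITICAL or FAIL
classify_http() {
    local code="$1" secs="$2"
    [ "$code" = "200" ] || { echo "FAIL"; return; }
    awk -v s="$secs" 'BEGIN {
        if (s > 10)     print "CRITICAL"
        else if (s > 3) print "WARNING"
        else            print "OK"
    }'
}

# probe_url URL -> prints "<status_code> <total_seconds>"
# curl reports status 000 when no HTTP response arrives within the 30-second timeout.
probe_url() {
    curl -s -o /dev/null --max-time 30 -w '%{http_code} %{time_total}' "$1"
}

# record_result STATE_FILE RESULT
# Tracks consecutive failures in STATE_FILE; succeeds once 2 in a row are seen.
record_result() {
    local state_file="$1" result="$2" failures=0
    [ -f "$state_file" ] && failures=$(cat "$state_file")
    if [ "$result" = "FAIL" ]; then
        failures=$(( failures + 1 ))
    else
        failures=0
    fi
    echo "$failures" > "$state_file"
    [ "$failures" -ge 2 ]
}

# Usage (URL and state path are placeholders):
#   read -r code secs < <(probe_url "https://example.com/")
#   result=$(classify_http "$code" "$secs")
#   record_result /var/tmp/web01.failures "$result" && echo "ALERT: 2 consecutive failures"
```

Keeping the classification separate from the probe makes the thresholds easy to adjust and test without network access.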

Database Connectivity

Lang: text

Check frequency: Every 5 minutes
Alert after: 2 consecutive failures
Test: Actual query execution, not just port check

Windows Services

Lang: text

Check: Service status (Running/Stopped)
Action: Attempt automatic restart when a critical service stops
Alert: Immediately if the restart fails
Escalate: If the service stops 3 times in 1 hour

Preventing Alert Fatigue

Maintenance Windows

Lang: powershell

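# Silence alerts during maintenance.
# Prefer the maintenance mode built into your monitoring system.
# Disable-Monitoring and Enable-Monitoring below are placeholder names for
# whatever your platform's module or API provides.

# Example: Disable monitoring for server during patching
Disable-Monitoring -Server "SERVER01" -Duration 60 -Reason "Windows Updates"

# Monitoring re-enables automatically after the duration, or re-enable it manually:
Enable-Monitoring -Server "SERVER01"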

Alert Grouping

Instead of:

Lang: text

EMAIL 1: Server1 - Disk C: high
EMAIL 2: Server1 - Disk D: high
EMAIL 3: Server1 - Disk E: high

Group to:

Lang: text

EMAIL 1: Server1 - Multiple disk space alerts (C:, D:, E:)
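
A minimal sketch of this grouping, assuming pending disk alerts arrive as `<host> <volume>` lines (a hypothetical intermediate format; most platforms offer grouping as a built-in setting).

```bash
#!/bin/bash
# group_disk_alerts
# Reads "<host> <volume>" lines on stdin and prints one message per host,
# in order of first appearance.
group_disk_alerts() {
    awk '
    {
        host = $1; vol = $2
        if (!(host in vols)) { order[++n] = host; vols[host] = vol; count[host] = 1 }
        else                 { vols[host] = vols[host] ", " vol; count[host]++ }
    }
    END {
        for (i = 1; i <= n; i++) {
            h = order[i]
            if (count[h] > 1) printf "%s - Multiple disk space alerts (%s)\n", h, vols[h]
            else              printf "%s - Disk %s high\n", h, vols[h]
        }
    }'
}

# Usage:
#   printf '%s\n' "Server1 C:" "Server1 D:" "Server1 E:" | group_disk_alerts
```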

Alert Suppression Rules

Lang: text

If backup fails between 2 AM - 3 AM:
  Wait 30 minutes
  Check again
  If still failing: Alert

If service restarts successfully:
  Log event
  Don't alert

If service restarts 3 times in 1 hour:
  Alert critical

Dependency-Aware Monitoring

Lang: text

If Firewall offline:
  Suppress alerts for:
    - All internet-dependent checks
    - VPN tunnel status
    - External website monitoring

If Domain Controller offline:
  Suppress alerts for:
    - AD authentication failures on other systems
    - Group Policy errors

Send single alert: "DC01 offline - 15 dependent checks suppressed"
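
A minimal sketch of dependency-aware filtering. It assumes a hypothetical dependency file with `<check> <parent>` lines and a list of currently failing checks on stdin, and it handles one level of dependency only. Most monitoring platforms expose this as parent/child relationships in configuration, which is preferable where available.

```bash
#!/bin/bash
# filter_alerts DEPS_FILE
# DEPS_FILE lines: "<check> <parent>". stdin: one failing check name per line.
# A failing check is suppressed when its direct parent is also failing.
filter_alerts() {
    awk -v deps="$1" '
    BEGIN {
        # Load the dependency map before reading the failing checks.
        while ((getline line < deps) > 0) {
            split(line, f, " ")
            parent[f[1]] = f[2]
        }
    }
    { failing[$1] = 1; order[++n] = $1 }
    END {
        for (i = 1; i <= n; i++) {
            c = order[i]
            p = parent[c]
            if (p != "" && (p in failing)) suppressed[p]++
            else roots[++r] = c
        }
        for (i = 1; i <= r; i++) {
            c = roots[i]
            if (suppressed[c] > 0)
                printf "%s offline - %d dependent checks suppressed\n", c, suppressed[c]
            else
                printf "%s offline\n", c
        }
    }'
}

# Usage (file names are placeholders):
#   filter_alerts dependencies.txt < failing-checks.txt
```

A check whose parent is healthy still alerts on its own, so a real fault behind a working firewall is not hidden.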

Monitoring Tools

RMM Platforms

Commercial Options

Datto RMM
ConnectWise Automate
NinjaOne
Atera
Kaseya VSA
N-able N-central

Key Features to Look For

Agent-based monitoring
Automated remediation (restart services, clear space)
Patch management integration
Alert escalation
Mobile app for alerts
Reporting and dashboards
Scripting capabilities

Specialized Tools

Network Monitoring

PRTG Network Monitor
Nagios / Icinga
Zabbix
LibreNMS

Application Performance

New Relic
Datadog
Application Insights
Dynatrace

Log Management

Splunk
ELK Stack (Elasticsearch, Logstash, Kibana)
Graylog
Papertrail

Uptime Monitoring (External)

Pingdom
UptimeRobot
StatusCake
Site24x7

Free/Built-in Options

Windows

Lang: powershell

# Performance Monitor
perfmon

# Event Viewer
eventvwr.msc

# Resource Monitor
resmon

# Task Scheduler for automated checks
$action = New-ScheduledTaskAction -Execute 'PowerShell.exe' -Argument '-File C:\Scripts\DailyCheck.ps1'
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "Daily Health Check"

Linux

Lang: bash

# Built-in monitoring (htop and iotop may need to be installed)
top
htop
iotop
netstat
ss

# System logs
journalctl
tail -f /var/log/syslog

# Cron for scheduled checks: run `crontab -e` and add a line such as the
# following to run the check daily at 08:00
crontab -e
# 0 8 * * * /usr/local/bin/daily-check.sh

Monitoring Scripts

Windows Server Health Check

Lang: powershell

# Basic server health check
# Uses CIM cmdlets because Get-WmiObject is not available in PowerShell 7+.
# Get-CimInstance -ComputerName requires WinRM to be enabled on the target.
function Get-ServerHealth {
    param(
        [Parameter(Mandatory)]
        [string]$ComputerName
    )

    $result = @{}

    # CPU (single sample; average several samples before alerting)
    $cpu = Get-Counter "\Processor(_Total)\% Processor Time" -ComputerName $ComputerName
    $result.CPU = [math]::Round($cpu.CounterSamples[0].CookedValue, 2)

    # Memory (these properties are reported in KB, so dividing by 1MB yields GB)
    $os = Get-CimInstance -ClassName Win32_OperatingSystem -ComputerName $ComputerName
    $freeMemory = $os.FreePhysicalMemory / 1MB
    $totalMemory = $os.TotalVisibleMemorySize / 1MB
    $result.MemoryFreeGB = [math]::Round($freeMemory, 2)
    $result.MemoryUsedPercent = [math]::Round((($totalMemory - $freeMemory) / $totalMemory) * 100, 2)

    # Disk space (DriveType 3 = local fixed disks)
    $disks = Get-CimInstance -ClassName Win32_LogicalDisk -Filter "DriveType=3" -ComputerName $ComputerName
    $result.Disks = $disks | ForEach-Object {
        [PSCustomObject]@{
            Drive = $_.DeviceID
            SizeGB = [math]::Round($_.Size / 1GB, 2)
            FreeGB = [math]::Round($_.FreeSpace / 1GB, 2)
            PercentFree = [math]::Round(($_.FreeSpace / $_.Size) * 100, 2)
        }
    }

    # Services (queried through CIM; Get-Service -ComputerName was removed in PowerShell 6+)
    $criticalServices = @("W32Time", "DNS", "NTDS")  # Adjust as needed
    $serviceFilter = ($criticalServices | ForEach-Object { "Name='$_'" }) -join " OR "
    $result.Services = Get-CimInstance -ClassName Win32_Service -Filter $serviceFilter -ComputerName $ComputerName |
        Select-Object Name, State

    # Last boot time (Get-CimInstance already returns a DateTime)
    $result.LastBootTime = $os.LastBootUpTime
    $result.UptimeDays = ((Get-Date) - $result.LastBootTime).Days

    return $result
}

# Usage (Send-Alert is a placeholder for your alerting integration)
$health = Get-ServerHealth -ComputerName "SERVER01"
if ($health.CPU -gt 90) {
    Send-Alert "High CPU on SERVER01: $($health.CPU)%"
}

Linux Server Health Check

Lang: bash

#!/bin/bash
# linux-health-check.sh

# CPU: 100 minus the idle column ("id", field 15) of a 1-second vmstat sample.
# The second field of top's Cpu(s) line is user time only, not total usage.
CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')

# Memory
MEM_TOTAL=$(free -m | awk 'NR==2{print $2}')
MEM_USED=$(free -m | awk 'NR==2{print $3}')
MEM_PERCENT=$(awk "BEGIN {printf \"%.2f\", ($MEM_USED/$MEM_TOTAL)*100}")

# Disk (-P keeps each filesystem on one line so the column position is stable)
DISK_USAGE=$(df -P / | awk 'NR==2{print $5}' | cut -d'%' -f1)

# Services (adjust to your distribution's unit names, e.g. ssh or crond)
SERVICES=("sshd" "cron" "rsyslog")
for service in "${SERVICES[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        echo "CRITICAL: $service is not running"
        # Send alert
    fi
done

# Uptime in whole days (parsing `uptime` output breaks when the host has
# been up for less than a day)
UPTIME_DAYS=$(( $(cut -d. -f1 /proc/uptime) / 86400 ))

# Check thresholds (bc handles the decimal comparisons)
if [ "$(echo "$CPU_USAGE > 90" | bc)" -eq 1 ]; then
    echo "CRITICAL: CPU usage at ${CPU_USAGE}%"
fi

if [ "$(echo "$MEM_PERCENT > 90" | bc)" -eq 1 ]; then
    echo "CRITICAL: Memory usage at ${MEM_PERCENT}%"
fi

if [ "$DISK_USAGE" -gt 85 ]; then
    echo "WARNING: Disk usage at ${DISK_USAGE}%"
fi

Reporting

Daily Summary Email

Lang: text

Subject: Daily IT Summary - $(Get-Date -Format "yyyy-MM-dd")

ALERTS (Last 24 Hours):
- 2 Critical
- 5 Warnings
- 12 Information

BACKUP STATUS:
- Successful: 15/15
- Failed: 0
- Warnings: 0

SERVER STATUS:
- Online: 18/18
- CPU Avg: 35%
- Memory Avg: 60%
- Disk Space: Healthy

WORKSTATIONS:
- Online: 45/50
- Patch Compliance: 92%
- AV Up-to-Date: 100%

ACTION REQUIRED:
- Review certificate on WEB01 (expires in 30 days)
- Plan disk upgrade for FILE01 (trending to full in 45 days)

Monthly Report for Management

Include:

Uptime percentage by server
Mean time to resolution for issues
Patch compliance trend
Backup success rate
Security alerts summary
Capacity planning recommendations
Upcoming renewals/expirations
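
Two of these figures are simple arithmetic and can be sketched in bash. The function names are hypothetical, and the inputs (total minutes in the period, downtime minutes, per-incident resolution times) are assumed to come from your monitoring or ticketing system.

```bash
#!/bin/bash
# uptime_percent TOTAL_MINUTES DOWNTIME_MINUTES -> uptime as a percentage
uptime_percent() {
    awk -v total="$1" -v down="$2" 'BEGIN { printf "%.3f\n", (total - down) / total * 100 }'
}

# mttr_minutes RESOLUTION_MINUTES... -> mean time to resolution in minutes
mttr_minutes() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# Usage: a 30-day month has 43200 minutes
#   uptime_percent 43200 43
#   mttr_minutes 30 45 120
```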

Monitoring Checklist

Initial Setup

Identify all devices to monitor
Define critical vs non-critical systems
Set appropriate thresholds
Configure alert escalation
Test alerting (send test alerts)
Document monitoring procedures
Train team on responding to alerts

Weekly

Review alert trends
Tune thresholds to reduce noise
Address recurring warnings
Update monitoring for new systems

Monthly

Review monitoring coverage
Check for stale monitors (offline systems)
Review alert response times
Update documentation
Capacity planning review

Quarterly

Full monitoring audit
Review and update escalation procedures
Test disaster recovery monitoring
Review tool licensing/costs

Common Mistakes

Monitoring Too Much

Every service on every server
All performance counters
Every log entry
Result: Alert fatigue, important alerts missed

Monitoring Too Little

Only basic ping checks
No service-level monitoring
No trend analysis
Result: Issues discovered too late

Wrong Thresholds

Static values for dynamic workloads
Too sensitive (false positives)
Not sensitive enough (miss real issues)
Result: Either alert fatigue or missed problems

No Response Plan

Alerts sent but no action taken
No escalation procedures
Unclear ownership
Result: Monitoring becomes useless

Not Reviewing Alerts

Thresholds never tuned
False positives never analyzed
Monitoring not updated for environment changes
Result: Declining effectiveness over time
