Practical guidance for implementing effective monitoring in MSP and enterprise environments. Focus on what actually matters and avoid alert fatigue.

Monitoring Philosophy

Monitor for Impact, Not Activity

  • Bad: Alert when disk usage reaches 80%

  • Good: Alert when the disk usage trend predicts a full disk within 48 hours

  • Bad: Alert on every failed login attempt

  • Good: Alert on 5 failed attempts in 5 minutes from the same IP

  • Bad: Monitor CPU usage continuously

  • Good: Alert when CPU stays above 90% for 15 minutes
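
The failed-login rule above can be sketched as a sliding-window count per source IP. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name flag_repeated_failures and the input format (one "epoch_seconds ip" pair per line, sorted by time) are hypothetical, and extracting those pairs from your own auth logs is left to your environment.

```bash
# flag_repeated_failures LIMIT WINDOW_SECONDS
# stdin: "epoch_seconds ip" per failed login, sorted by time (assumed format).
# Prints each IP that reaches LIMIT failures inside any WINDOW_SECONDS span.
flag_repeated_failures() {
    awk -v limit="${1:-5}" -v window="${2:-300}" '
    {
        ip = $2
        n[ip]++
        t[ip, n[ip]] = $1
        # Look back to the failure (limit - 1) entries earlier for this IP
        first = n[ip] - limit + 1
        if (first >= 1 && $1 - t[ip, first] <= window && !seen[ip]++) print ip
    }'
}
```

Each IP is printed once, the first time it crosses the limit, so a noisy source produces one alert instead of one per attempt.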

Alert Priorities

Critical (Page immediately)

  • Services completely down
  • Data loss in progress
  • Security breach detected
  • Primary systems offline
  • Backup failures
  • Certificate expired

Warning (Email/ticket during business hours)

  • Services degraded
  • Disk space trending to full
  • Patch compliance falling behind
  • Non-critical service failures
  • Performance degradation

Information (Log only)

  • Normal operations
  • Successful backups
  • Routine maintenance
  • Trend data
  • Capacity planning metrics

What to Monitor

Servers

Critical Metrics

Lang: text
- [ ] CPU usage (sustained high load)
- [ ] Memory usage (low available memory)
- [ ] Disk space (all volumes)
- [ ] Disk I/O (performance degradation)
- [ ] Network connectivity (ping/port checks)
- [ ] Critical services (running status)
- [ ] Windows Update status
- [ ] Antivirus status and definitions
- [ ] System event log errors

Service-Specific

Lang: text
Domain Controllers:
- [ ] AD replication status
- [ ] SYSVOL replication
- [ ] DNS service responding
- [ ] DHCP scope utilization
- [ ] FSMO role holder status

File Servers:
- [ ] Share accessibility
- [ ] DFS replication status
- [ ] Shadow copy success
- [ ] Disk queue length

Database Servers:
- [ ] Database service status
- [ ] Transaction log size
- [ ] Deadlocks and blocking
- [ ] Backup job status
- [ ] Connection pool usage

Web Servers:
- [ ] HTTP/HTTPS response
- [ ] Application pool status
- [ ] Certificate expiration
- [ ] Response time
- [ ] Error rates

Workstations

Essential Only

Lang: text
- [ ] Online/offline status
- [ ] Antivirus status
- [ ] Last boot time (stale systems)
- [ ] Patch compliance
- [ ] Disk encryption status

Note: Too much workstation monitoring creates noise. Focus on security and compliance.

Network Devices

Lang: text
Firewalls:
- [ ] VPN tunnel status
- [ ] Interface status
- [ ] High availability failover
- [ ] Policy sync status
- [ ] License expiration

Switches:
- [ ] Port status (critical uplinks)
- [ ] VLAN configuration
- [ ] Spanning tree changes
- [ ] Power supply status (if redundant)

Wireless:
- [ ] Controller connectivity
- [ ] AP status
- [ ] Client association issues
- [ ] Channel utilization

Backups

Lang: text
- [ ] Backup job completion status
- [ ] Backup duration trends
- [ ] Backup size trends
- [ ] Failed files/errors
- [ ] Backup storage capacity
- [ ] Offsite/cloud backup status
- [ ] Test restore success

Cloud Services

Lang: text
Microsoft 365:
- [ ] Service health
- [ ] Mailbox full warnings
- [ ] License expiration
- [ ] Exchange Online Protection alerts
- [ ] SharePoint site quota

Azure/AWS:
- [ ] Virtual machine status
- [ ] Storage account capacity
- [ ] Subscription spending alerts
- [ ] Security alerts (Microsoft Defender for Cloud, formerly Azure Security Center)
- [ ] Backup job status

Alert Thresholds

Disk Space

Bad Approach: Static threshold

Lang: text
Alert when C: drive reaches 80% full

Good Approach: Trend-based

Lang: text
Alert when the disk is projected to fill within 7 days at the current rate

Example calculation:
- Monday: 60GB free
- Friday: 55GB free
- Rate: 5GB / 4 days = 1.25GB/day
- Days until full: 55GB / 1.25GB/day = 44 days
- Action: Monitor, no alert yet

If rate increases to 5GB/day:
- Days until full: 55GB / 5GB/day = 11 days
- Action: Warning alert (approaching the 7-day threshold)

Implementation Example

Lang: powershell
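# Check disk space trend
# Get-DiskSpaceHistory and Send-Alert are placeholders for your own
# history store and alerting integration; they are not built-in cmdlets.
$drive = "C"         # Get-PSDrive expects the drive name without the colon
$threshold = 7       # days
$historyDays = 7

# Assumes history is returned oldest first, with a FreeGB property per sample
$history = Get-DiskSpaceHistory -Drive $drive -Days $historyDays
$currentFree = (Get-PSDrive -Name $drive).Free / 1GB
$avgDailyDecrease = ($history[0].FreeGB - $history[-1].FreeGB) / $historyDays

# Only project a fill date when free space is actually shrinking
if ($avgDailyDecrease -gt 0) {
    $daysUntilFull = [math]::Round($currentFree / $avgDailyDecrease, 1)

    if ($daysUntilFull -lt $threshold) {
        Send-Alert -Severity Critical -Message "Drive ${drive}: will fill in $daysUntilFull days"
    }
}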
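
For Linux hosts, the same projection can be sketched in shell. This is an illustrative sketch only: the function name days_until_full is hypothetical, and it assumes you already store an earlier free-space reading somewhere (the variable free_7_days_ago in the commented wiring is a placeholder for that stored value).

```bash
# days_until_full FREE_START_GB FREE_NOW_GB DAYS_ELAPSED
# Prints whole days until the volume fills, or "stable" when free space is not shrinking.
days_until_full() {
    awk -v start="$1" -v now="$2" -v days="$3" 'BEGIN {
        rate = (start - now) / days        # GB consumed per day
        if (rate <= 0) { print "stable"; exit }
        printf "%d\n", now / rate
    }'
}

# Example wiring (the stored reading from 7 days ago is an assumption):
# free_now=$(df -Pk / | awk 'NR==2 {print $4 / 1048576}')
# remaining=$(days_until_full "$free_7_days_ago" "$free_now" 7)
# [ "$remaining" != "stable" ] && [ "$remaining" -lt 7 ] && echo "WARNING: / fills in $remaining days"
```

With the figures from the worked example, days_until_full 60 55 4 prints 44, and days_until_full 60 55 1 prints 11.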

CPU Usage

Appropriate Thresholds

Lang: text
Critical: >90% for 15 minutes
Warning: >80% for 30 minutes
Info: >70% for 1 hour

Avoid: Alert on any spike above 80%. Temporary spikes are normal.
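
One way to evaluate these sustained thresholds is to keep a rolling list of samples and only raise a severity when every sample in the window exceeds the limit. The sketch below assumes one CPU sample per minute, oldest first, on standard input; the function name cpu_severity is hypothetical and collecting the samples is left to your monitoring agent.

```bash
# cpu_severity
# stdin: one CPU percentage per line, one sample per minute, oldest first (assumed).
# Prints critical, warning, info, or ok based on the thresholds listed above.
cpu_severity() {
    awk '
    { s[NR] = $1 + 0 }
    # True when the most recent "count" samples all exceed "limit"
    function sustained(limit, count,    i) {
        if (NR < count) return 0
        for (i = NR - count + 1; i <= NR; i++) if (s[i] <= limit) return 0
        return 1
    }
    END {
        if (sustained(90, 15))      print "critical"
        else if (sustained(80, 30)) print "warning"
        else if (sustained(70, 60)) print "info"
        else                        print "ok"
    }'
}
```

A single sample at or below the limit resets the window, so a short spike never reaches an alert on its own.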

Memory

For Windows Servers

Lang: text
Critical: <10% available physical RAM for 10 minutes
Warning: <20% available for 15 minutes

Note: Alert on available memory, not percent used. Page file usage on its own is normal.

For Linux Servers

Lang: text
Critical: <100MB available (the "available" figure, which already counts reclaimable cache/buffers)
Warning: <500MB available

Note: Linux uses RAM for caching. Check actual available, not just free.
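
On Linux the kernel exposes this figure as MemAvailable in /proc/meminfo. A minimal sketch of reading it and applying the thresholds above follows; the function names mem_available_mb and mem_severity are hypothetical, and the optional file argument exists only so the parsing can be tried against a saved copy.

```bash
# mem_available_mb [MEMINFO_FILE]
# Prints MemAvailable in whole MB (the kernel reports it in kB).
mem_available_mb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# mem_severity AVAILABLE_MB -> critical | warning | ok
mem_severity() {
    if [ "$1" -lt 100 ]; then echo critical
    elif [ "$1" -lt 500 ]; then echo warning
    else echo ok
    fi
}

# Example: mem_severity "$(mem_available_mb)"
```

Fixed MB thresholds suit small servers; on hosts with a lot of RAM a percentage of total may be the better fit.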

Service Monitoring

HTTP/HTTPS

Lang: text
Check frequency: Every 5 minutes
Timeout: 30 seconds
Alert after: 2 consecutive failures
Check: HTTP 200 response code
Response time warning: >3 seconds
Response time critical: >10 seconds
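
The settings above could be wired together roughly as follows. This is a sketch under assumptions: the function names classify_http and should_alert, the state file, and the choice to treat a non-200 response the same as a critical response time are all illustrative. The curl options shown in the comments (--max-time and the http_code and time_total write-out variables) are standard curl features.

```bash
# classify_http STATUS_CODE SECONDS -> ok | warning | critical
classify_http() {
    awk -v code="$1" -v secs="$2" 'BEGIN {
        if (code != 200 || secs > 10) print "critical"
        else if (secs > 3)            print "warning"
        else                          print "ok"
    }'
}

# should_alert STATE_FILE RESULT
# Succeeds (exit 0) once RESULT has been "critical" on 2 consecutive checks.
should_alert() {
    count=$(cat "$1" 2>/dev/null)
    count=${count:-0}
    if [ "$2" = "critical" ]; then count=$((count + 1)); else count=0; fi
    echo "$count" > "$1"
    [ "$count" -ge 2 ]
}

# Example run from cron every 5 minutes (URL and state path are placeholders):
# set -- $(curl -s -o /dev/null --max-time 30 -w '%{http_code} %{time_total}' "https://example.com/")
# result=$(classify_http "$1" "$2")
# should_alert /var/tmp/http-check.state "$result" && echo "ALERT: site check $result"
```

Any healthy check resets the counter, so one dropped request does not page anyone.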

Database Connectivity

Lang: text
Check frequency: Every 5 minutes
Alert after: 2 consecutive failures
Test: Actual query execution, not just port check

Windows Services

Lang: text
Check: Service status (Running/Stopped)
Action: Attempt one automatic restart when a critical service stops
Alert: Immediately if the restart fails
Escalate: If the service stops 3 times in 1 hour, even when each restart succeeds

Preventing Alert Fatigue

Maintenance Windows

Lang: powershell
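# Silence alerts during maintenance
# Configure this in your monitoring system where possible. The cmdlets below
# are placeholders for whatever your RMM or monitoring tool exposes.

# Example: Disable monitoring for a server during patching
Disable-Monitoring -Server "SERVER01" -Duration 60 -Reason "Windows Updates"

# Monitoring resumes automatically after the duration, or re-enable it manually:
Enable-Monitoring -Server "SERVER01"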

Alert Grouping

Instead of:

Lang: text
EMAIL 1: Server1 - Disk C: high
EMAIL 2: Server1 - Disk D: high
EMAIL 3: Server1 - Disk E: high

Group to:

Lang: text
EMAIL 1: Server1 - Multiple disk space alerts (C:, D:, E:)
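
If your tool cannot group alerts natively, the grouping can be approximated in a small script before notifications go out. The sketch below assumes pending disk alerts arrive as "server drive" pairs on standard input; the function name group_disk_alerts and that input format are hypothetical.

```bash
# group_disk_alerts
# stdin: "server drive" per pending disk alert (assumed format).
# Prints one line per server, combining multiple drives into a single message.
group_disk_alerts() {
    awk '
    {
        if (!($1 in drives)) { order[++n] = $1; drives[$1] = $2; count[$1] = 1 }
        else                 { drives[$1] = drives[$1] ", " $2; count[$1]++ }
    }
    END {
        for (i = 1; i <= n; i++) {
            host = order[i]
            if (count[host] > 1) printf "%s - Multiple disk space alerts (%s)\n", host, drives[host]
            else                 printf "%s - Disk %s high\n", host, drives[host]
        }
    }'
}
```

Servers are reported in the order they first appear, and a server with a single affected drive keeps the original one-line wording.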

Alert Suppression Rules

Lang: text
If backup fails between 2 AM - 3 AM:
  Wait 30 minutes
  Check again
  If still failing: Alert

If service restarts successfully:
  Log event
  Don't alert

If service restarts 3 times in 1 hour:
  Alert critical
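
The restart rules above amount to counting restarts in a rolling one-hour window. A minimal sketch, assuming each restart attempt is appended to a per-service log file as an epoch timestamp; the function names, the log file, and the output labels are hypothetical.

```bash
# count_recent_restarts LOG_FILE NOW_EPOCH -> restarts recorded in the past hour
count_recent_restarts() {
    awk -v now="$2" '$1 > now - 3600 { n++ } END { print n + 0 }' "$1"
}

# restart_action LOG_FILE NOW_EPOCH RESTART_OK
# RESTART_OK is 1 when the automatic restart succeeded, 0 when it failed.
# Prints "log-only" or "alert-critical".
restart_action() {
    echo "$2" >> "$1"    # record this restart attempt
    if [ "$3" -ne 1 ]; then
        echo "alert-critical"                                   # restart failed
    elif [ "$(count_recent_restarts "$1" "$2")" -ge 3 ]; then
        echo "alert-critical"                                   # flapping service
    else
        echo "log-only"
    fi
}

# Example: restart_action /var/tmp/nginx.restarts "$(date +%s)" 1
```

Restarts older than an hour drop out of the count, so an occasional recovery stays in the log without paging anyone.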

Dependency-Aware Monitoring

Lang: text
If Firewall offline:
  Suppress alerts for:
    - All internet-dependent checks
    - VPN tunnel status
    - External website monitoring

If Domain Controller offline:
  Suppress alerts for:
    - AD authentication failures on other systems
    - Group Policy errors

Send single alert: "DC01 offline - 15 dependent checks suppressed"
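
Most monitoring platforms support parent/child dependencies directly; where they do not, the suppression logic can be sketched as a filter over the list of failing checks. The function name filter_by_dependency and both input formats are assumptions, and this sketch only looks at direct parents, not multi-level chains.

```bash
# filter_by_dependency DEPS_FILE
# DEPS_FILE lines: "check parent" (assumed format).
# stdin: one failing check or device per line.
# Prints one line per alert that should still be sent.
filter_by_dependency() {
    awk '
    NR == FNR { parent[$1] = $2; next }     # first input: dependency map
    { failing[$1] = 1; order[++n] = $1 }    # second input: currently failing items
    END {
        # Count the failing checks hidden behind each failing parent
        for (i = 1; i <= n; i++) {
            item = order[i]
            if (item in parent && parent[item] in failing) suppressed[parent[item]]++
        }
        for (i = 1; i <= n; i++) {
            item = order[i]
            if (item in parent && parent[item] in failing) continue
            if (suppressed[item]) printf "%s offline - %d dependent checks suppressed\n", item, suppressed[item]
            else                  printf "%s failing\n", item
        }
    }' "$1" -
}
```

A check whose parent is healthy still alerts on its own, so suppression only hides failures that the parent outage already explains.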

Monitoring Tools

RMM Platforms

Commercial Options

  • Datto RMM
  • ConnectWise Automate
  • NinjaOne
  • Atera
  • Kaseya VSA
  • N-able N-central

Key Features to Look For

  • Agent-based monitoring
  • Automated remediation (restart services, clear space)
  • Patch management integration
  • Alert escalation
  • Mobile app for alerts
  • Reporting and dashboards
  • Scripting capabilities

Specialized Tools

Network Monitoring

  • PRTG Network Monitor
  • Nagios / Icinga
  • Zabbix
  • LibreNMS

Application Performance

  • New Relic
  • Datadog
  • Application Insights
  • Dynatrace

Log Management

  • Splunk
  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Graylog
  • Papertrail

Uptime Monitoring (External)

  • Pingdom
  • UptimeRobot
  • StatusCake
  • Site24x7

Free/Built-in Options

Windows

Lang: powershell
# Performance Monitor
perfmon

# Event Viewer
eventvwr.msc

# Resource Monitor
resmon

# Task Scheduler for automated checks
$action = New-ScheduledTaskAction -Execute 'PowerShell.exe' -Argument '-File C:\Scripts\DailyCheck.ps1'
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "Daily Health Check"

Linux

Lang: bash
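# Built-in monitoring
top
htop          # may need to be installed
iotop         # may need to be installed; usually requires root
ss            # socket statistics (replaces the older netstat)
netstat       # legacy; provided by net-tools where still installed

# System logs
journalctl
tail -f /var/log/syslog     # /var/log/messages on RHEL-based systems

# Cron for scheduled checks
crontab -e
# then add an entry such as (runs daily at 08:00):
# 0 8 * * * /usr/local/bin/daily-check.sh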

Monitoring Scripts

Windows Server Health Check

Lang: powershell
# Basic server health check
# Uses CIM cmdlets so it runs on both Windows PowerShell 5.1 and PowerShell 7
# (Get-WmiObject and Get-Service -ComputerName are not available in PowerShell 7).
# Remote CIM queries need WinRM enabled on the target server.
function Get-ServerHealth {
    param(
        [Parameter(Mandatory)]
        [string]$ComputerName
    )

    $result = @{}

    # CPU (a single sample; check for sustained load before alerting)
    $cpu = Get-Counter "\Processor(_Total)\% Processor Time" -ComputerName $ComputerName
    $result.CPU = [math]::Round($cpu.CounterSamples[0].CookedValue, 2)

    # Memory (WMI reports these values in KB, so dividing by 1MB yields GB)
    $os = Get-CimInstance Win32_OperatingSystem -ComputerName $ComputerName
    $freeMemory = $os.FreePhysicalMemory / 1MB
    $totalMemory = $os.TotalVisibleMemorySize / 1MB
    $result.MemoryFreeGB = [math]::Round($freeMemory, 2)
    $result.MemoryUsedPercent = [math]::Round((($totalMemory - $freeMemory) / $totalMemory) * 100, 2)

    # Disk space (fixed local disks only)
    $disks = Get-CimInstance Win32_LogicalDisk -Filter "DriveType=3" -ComputerName $ComputerName
    $result.Disks = $disks | ForEach-Object {
        [PSCustomObject]@{
            Drive = $_.DeviceID
            SizeGB = [math]::Round($_.Size / 1GB, 2)
            FreeGB = [math]::Round($_.FreeSpace / 1GB, 2)
            PercentFree = [math]::Round(($_.FreeSpace / $_.Size) * 100, 2)
        }
    }

    # Services (names not present on the server are simply omitted)
    $criticalServices = @("W32Time", "DNS", "NTDS")  # Adjust as needed
    $result.Services = Get-CimInstance Win32_Service -ComputerName $ComputerName |
        Where-Object { $criticalServices -contains $_.Name } |
        Select-Object Name, State

    # Last boot time (CIM returns LastBootUpTime as a DateTime)
    $result.LastBootTime = $os.LastBootUpTime
    $result.UptimeDays = ((Get-Date) - $result.LastBootTime).Days

    return $result
}

# Usage (Send-Alert is a placeholder for your alerting integration)
$health = Get-ServerHealth -ComputerName "SERVER01"
if ($health.CPU -gt 90) {
    Send-Alert "High CPU on SERVER01: $($health.CPU)%"
}

Linux Server Health Check

Lang: bash
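#!/bin/bash
# linux-health-check.sh

# CPU: busy percentage over a 1-second sample (100 minus the idle column of vmstat)
CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')

# Memory: based on the "available" column, which accounts for reclaimable cache
# (requires a version of free that prints that column)
MEM_TOTAL=$(free -m | awk 'NR==2{print $2}')
MEM_AVAILABLE=$(free -m | awk 'NR==2{print $7}')
MEM_PERCENT=$(awk "BEGIN {printf \"%.2f\", (($MEM_TOTAL-$MEM_AVAILABLE)/$MEM_TOTAL)*100}")

# Disk (-P keeps each filesystem on a single line)
DISK_USAGE=$(df -P / | awk 'NR==2{print $5}' | tr -d '%')

# Services
SERVICES=("sshd" "cron" "rsyslog")  # Adjust as needed; names vary by distribution
for service in "${SERVICES[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        echo "CRITICAL: $service is not running"
        # Send alert
    fi
done

# Uptime in whole days (parsing `uptime` output breaks when uptime is under a day)
UPTIME_DAYS=$(awk '{print int($1 / 86400)}' /proc/uptime)
echo "INFO: Uptime ${UPTIME_DAYS} days"

# Check thresholds (awk handles the decimal comparisons, so bc is not required)
if awk -v v="$CPU_USAGE" 'BEGIN {exit !(v > 90)}'; then
    echo "CRITICAL: CPU usage at ${CPU_USAGE}%"
fi

if awk -v v="$MEM_PERCENT" 'BEGIN {exit !(v > 90)}'; then
    echo "CRITICAL: Memory usage at ${MEM_PERCENT}%"
fi

if [ "$DISK_USAGE" -gt 85 ]; then
    echo "WARNING: Disk usage at ${DISK_USAGE}%"
fi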

Reporting

Daily Summary Email

Lang: text
Subject: Daily IT Summary - $(Get-Date -Format "yyyy-MM-dd")

ALERTS (Last 24 Hours):
- 2 Critical
- 5 Warnings
- 12 Information

BACKUP STATUS:
- Successful: 15/15
- Failed: 0
- Warnings: 0

SERVER STATUS:
- Online: 18/18
- CPU Avg: 35%
- Memory Avg: 60%
- Disk Space: Healthy

WORKSTATIONS:
- Online: 45/50
- Patch Compliance: 92%
- AV Up-to-Date: 100%

ACTION REQUIRED:
- Review certificate on WEB01 (expires in 30 days)
- Plan disk upgrade for FILE01 (trending to full in 45 days)

Monthly Report for Management

Include:

  • Uptime percentage by server
  • Mean time to resolution for issues
  • Patch compliance trend
  • Backup success rate
  • Security alerts summary
  • Capacity planning recommendations
  • Upcoming renewals/expirations
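
The first two figures in this list are simple arithmetic once downtime and resolution times are recorded. A minimal sketch follows; the function names uptime_percent and mean_minutes are hypothetical, and it assumes downtime and resolution times are tracked in minutes.

```bash
# uptime_percent DOWNTIME_MINUTES DAYS_IN_PERIOD
# Prints availability for the period as a percentage with two decimals.
uptime_percent() {
    awk -v down="$1" -v days="$2" 'BEGIN {
        total = days * 24 * 60               # minutes in the reporting period
        printf "%.2f\n", (total - down) / total * 100
    }'
}

# mean_minutes
# stdin: one resolution time in minutes per line. Prints the mean (MTTR).
mean_minutes() {
    awk '{ sum += $1 } END { if (NR) printf "%.1f\n", sum / NR }'
}
```

For example, 43.2 minutes of downtime in a 30-day month gives uptime_percent 43.2 30, which prints 99.90.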

Monitoring Checklist

Initial Setup

  • Identify all devices to monitor
  • Define critical vs non-critical systems
  • Set appropriate thresholds
  • Configure alert escalation
  • Test alerting (send test alerts)
  • Document monitoring procedures
  • Train team on responding to alerts

Weekly

  • Review alert trends
  • Tune thresholds to reduce noise
  • Address recurring warnings
  • Update monitoring for new systems

Monthly

  • Review monitoring coverage
  • Check for stale monitors (offline systems)
  • Review alert response times
  • Update documentation
  • Capacity planning review

Quarterly

  • Full monitoring audit
  • Review and update escalation procedures
  • Test disaster recovery monitoring
  • Review tool licensing/costs

Common Mistakes

Monitoring Too Much

  • Every service on every server
  • All performance counters
  • Every log entry
  • Result: Alert fatigue, important alerts missed

Monitoring Too Little

  • Only basic ping checks
  • No service-level monitoring
  • No trend analysis
  • Result: Issues discovered too late

Wrong Thresholds

  • Static values for dynamic workloads
  • Too sensitive (false positives)
  • Not sensitive enough (miss real issues)
  • Result: Either alert fatigue or missed problems

No Response Plan

  • Alerts sent but no action taken
  • No escalation procedures
  • Unclear ownership
  • Result: Monitoring becomes useless

Not Reviewing Alerts

  • Never tune thresholds
  • Don’t analyze false positives
  • Don’t update for environment changes
  • Result: Declining effectiveness over time