Monitoring Best Practices
Best practices for effective IT infrastructure monitoring
Practical guidance for implementing effective monitoring in MSP and enterprise environments. Focus on what actually matters and avoid alert fatigue.
Monitoring Philosophy
Monitor for Impact, Not Activity
Bad: Alert when disk usage reaches 80%
Good: Alert when the usage trend shows the disk filling within 48 hours
Bad: Alert on every failed login attempt
Good: Alert on 5 failed attempts within 5 minutes from the same IP
Bad: Alert on every CPU spike
Good: Alert when CPU stays above 90% for 15 minutes
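The failed-login rule is a sliding-window count per source IP. Here is a minimal Python sketch of that idea. `FailedLoginTracker` and `record_failure` are hypothetical names, and timestamps are assumed to be seconds from whatever log source feeds the check.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # 5-minute window from the rule above
MAX_FAILURES = 5         # alert once 5 failures fall inside the window


class FailedLoginTracker:
    """Tracks failed logins per source IP over a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_FAILURES):
        self.window = window
        self.limit = limit
        self._events = defaultdict(deque)  # ip -> timestamps of recent failures

    def record_failure(self, ip, timestamp):
        """Record one failure; return True when this IP has reached the limit."""
        events = self._events[ip]
        events.append(timestamp)
        # Drop failures that have aged out of the window.
        while events and timestamp - events[0] >= self.window:
            events.popleft()
        return len(events) >= self.limit


# Usage: five failures inside five minutes trip the alert on the fifth one.
tracker = FailedLoginTracker()
burst = [tracker.record_failure("203.0.113.7", t) for t in (0, 60, 120, 180, 240)]
```

A single mistyped password never reaches the limit, so it is logged but never pages anyone.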
Alert Priorities
Critical (Page immediately)
- Services completely down
- Data loss in progress
- Security breach detected
- Primary systems offline
- Backup failures
- Certificate expired
Warning (Email/ticket during business hours)
- Services degraded
- Disk space trending to full
- Patch compliance falling behind
- Non-critical service failures
- Performance degradation
Information (Log only)
- Normal operations
- Successful backups
- Routine maintenance
- Trend data
- Capacity planning metrics
What to Monitor
Servers
Critical Metrics
- [ ] CPU usage (sustained high load)
- [ ] Memory usage (low available memory)
- [ ] Disk space (all volumes)
- [ ] Disk I/O (performance degradation)
- [ ] Network connectivity (ping/port checks)
- [ ] Critical services (running status)
- [ ] Windows Update status
- [ ] Antivirus status and definitions
- [ ] System event log errors
Service-Specific
Domain Controllers:
- [ ] AD replication status
- [ ] SYSVOL replication
- [ ] DNS service responding
- [ ] DHCP scope utilization
- [ ] FSMO role holder status
File Servers:
- [ ] Share accessibility
- [ ] DFS replication status
- [ ] Shadow copy success
- [ ] Disk queue length
Database Servers:
- [ ] Database service status
- [ ] Transaction log size
- [ ] Deadlocks and blocking
- [ ] Backup job status
- [ ] Connection pool usage
Web Servers:
- [ ] HTTP/HTTPS response
- [ ] Application pool status
- [ ] Certificate expiration
- [ ] Response time
- [ ] Error rates
Workstations
Essential Only
- [ ] Online/offline status
- [ ] Antivirus status
- [ ] Last boot time (stale systems)
- [ ] Patch compliance
- [ ] Disk encryption status
Note: Too much workstation monitoring creates noise. Focus on security and compliance.
Network Devices
Firewalls:
- [ ] VPN tunnel status
- [ ] Interface status
- [ ] High availability failover
- [ ] Policy sync status
- [ ] License expiration
Switches:
- [ ] Port status (critical uplinks)
- [ ] VLAN configuration
- [ ] Spanning tree changes
- [ ] Power supply status (if redundant)
Wireless:
- [ ] Controller connectivity
- [ ] AP status
- [ ] Client association issues
- [ ] Channel utilization
Backups
- [ ] Backup job completion status
- [ ] Backup duration trends
- [ ] Backup size trends
- [ ] Failed files/errors
- [ ] Backup storage capacity
- [ ] Offsite/cloud backup status
- [ ] Test restore success
Cloud Services
Microsoft 365:
- [ ] Service health
- [ ] Mailbox full warnings
- [ ] License expiration
- [ ] Exchange Online Protection alerts
- [ ] SharePoint site quota
Azure/AWS:
- [ ] Virtual machine status
- [ ] Storage account capacity
- [ ] Subscription spending alerts
- [ ] Security Center alerts
- [ ] Backup job status
Alert Thresholds
Disk Space
Bad Approach: Static threshold
Alert when C: drive reaches 80% full
Good Approach: Trend-based
Alert when disk will fill in 7 days at current rate
Example calculation:
- Monday: 60GB free
- Friday: 55GB free
- Rate: 5GB/4 days = 1.25GB/day
- Days until full: 55GB / 1.25GB = 44 days
- Action: Monitor, no alert yet
If rate increases to 5GB/day:
- Days until full: 55GB / 5GB = 11 days
- Action: Warning alert
Implementation Example
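# Check disk space trend
# Get-DiskSpaceHistory and Send-Alert are placeholders for your own tooling
$drive = "C:"
$threshold = 7 # days
$days = 7      # length of the history window
# Assumes history is ordered oldest first, with a FreeGB property per sample
$history = Get-DiskSpaceHistory -Drive $drive -Days $days
# Get-PSDrive expects the drive name without the trailing colon
$currentFree = (Get-PSDrive -Name $drive.TrimEnd(':')).Free / 1GB
# Positive when free space is shrinking
$avgDailyDecrease = ($history[0].FreeGB - $history[-1].FreeGB) / $days
if ($avgDailyDecrease -gt 0) {
    $daysUntilFull = [math]::Round($currentFree / $avgDailyDecrease, 1)
    if ($daysUntilFull -lt $threshold) {
        Send-Alert -Severity Critical -Message "Drive $drive will fill in $daysUntilFull days"
    }
}
CPU Usage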
Appropriate Thresholds
Critical: >90% for 15 minutes
Warning: >80% for 30 minutes
Info: >70% for 1 hour
Avoid: Alert on any spike above 80%. Temporary spikes are normal.
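A "sustained" threshold means every sample across the whole duration is above the limit, not just the latest one. The Python sketch below shows one way to evaluate the three tiers. `sustained_breach` and `cpu_severity` are hypothetical helpers, and samples are assumed to be `(timestamp_seconds, percent)` tuples in ascending time order.

```python
def sustained_breach(samples, threshold, duration_s):
    """Return True if all samples in the trailing duration_s seconds exceed threshold."""
    if not samples:
        return False
    latest = samples[-1][0]
    # Without history covering the full duration we cannot call it sustained.
    if latest - samples[0][0] < duration_s:
        return False
    window = [value for ts, value in samples if ts >= latest - duration_s]
    return all(value > threshold for value in window)


def cpu_severity(samples):
    """Map CPU samples to a tier, checking the most severe rule first."""
    rules = [
        ("critical", 90, 15 * 60),
        ("warning", 80, 30 * 60),
        ("info", 70, 60 * 60),
    ]
    for name, threshold, duration_s in rules:
        if sustained_breach(samples, threshold, duration_s):
            return name
    return None


# Usage: one sample per minute; a lone spike in an otherwise idle hour stays silent.
spike = [(minute * 60, 99.0 if minute == 10 else 40.0) for minute in range(61)]
severity = cpu_severity(spike)
```

Requiring every sample in the window to exceed the limit is deliberately strict. Averaging over the window is a common alternative if brief dips should not reset the condition.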
Memory
For Windows Servers
Critical: <10% available physical RAM for 10 minutes
Warning: <20% available for 15 minutes
Note: Measure available memory, not percent used. Some page file usage is normal.
For Linux Servers
Critical: <100MB available (not cached/buffered)
Warning: <500MB available
Note: Linux uses spare RAM for caching. Check the "available" figure, not just "free".
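On Linux, the "available" figure is the `MemAvailable` field in `/proc/meminfo` (present on kernel 3.14 and later). The following Python sketch applies the thresholds above to it. `parse_meminfo` and `memory_severity` are hypothetical helpers, and `SAMPLE` is made-up data chosen so that free memory is low while available memory is healthy.

```python
SAMPLE = """MemTotal:        8000000 kB
MemFree:           80000 kB
MemAvailable:    2048000 kB
Buffers:          100000 kB
Cached:          1900000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number}; most fields are in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_severity(meminfo, critical_mb=100, warning_mb=500):
    """Classify by MemAvailable, which already accounts for reclaimable cache."""
    available_mb = meminfo["MemAvailable"] / 1024
    if available_mb < critical_mb:
        return "critical"
    if available_mb < warning_mb:
        return "warning"
    return "ok"


# Usage with the sample text; on a real host, read the text from /proc/meminfo.
status = memory_severity(parse_meminfo(SAMPLE))
```

In the sample, `MemFree` is under 100 MB, but the check stays quiet because roughly 2 GB is still available once reclaimable cache is counted.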
Service Monitoring
HTTP/HTTPS
Check frequency: Every 5 minutes
Timeout: 30 seconds
Alert after: 2 consecutive failures
Check: HTTP 200 response code
Response time warning: >3 seconds
Response time critical: >10 seconds
Database Connectivity
Check frequency: Every 5 minutes
Alert after: 2 consecutive failures
Test: Actual query execution, not just port check
Windows Services
Check: Service status (Running/Stopped)
Alert: Immediately if stopped (for critical services)
Action: Attempt automatic restart before alerting
Escalate: If restart fails or service stops 3 times in 1 hour
Preventing Alert Fatigue
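Restarting a stopped service before alerting, and escalating only when the restart fails or the service keeps stopping, is one of the simplest ways to cut alert volume. The Python sketch below illustrates that rule under stated assumptions. `ServiceWatch`, `on_stopped`, and the `restart` callback are hypothetical names rather than part of any RMM product, and the limit of 3 stops in 1 hour comes from the Windows Services rule.

```python
from collections import deque


class ServiceWatch:
    """Decides whether a stopped service should be logged quietly or escalated."""

    def __init__(self, restart, max_stops=3, window_s=3600):
        self.restart = restart      # callable returning True if the restart worked
        self.max_stops = max_stops
        self.window_s = window_s
        self._stops = deque()       # timestamps of recent stops

    def on_stopped(self, now):
        """Handle one 'service stopped' observation; return 'log' or 'escalate'."""
        self._stops.append(now)
        # Forget stops that are older than the window.
        while self._stops and now - self._stops[0] > self.window_s:
            self._stops.popleft()
        restarted = self.restart()
        if not restarted or len(self._stops) >= self.max_stops:
            return "escalate"
        return "log"


# Usage: the restart always succeeds here, yet the third stop within an hour escalates.
watch = ServiceWatch(restart=lambda: True)
outcomes = [watch.on_stopped(t) for t in (0, 600, 1200)]
```

A service that restarts cleanly once is only logged, while a flapping service still reaches a human.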
Maintenance Windows
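# Silence alerts during maintenance
# Most monitoring platforms have a built-in maintenance mode; otherwise script it
# Disable-Monitoring and Enable-Monitoring are placeholder cmdlets; -Duration is assumed to be in minutes
# Example: Disable monitoring for a server during patching
Disable-Monitoring -Server "SERVER01" -Duration 60 -Reason "Windows Updates"
# Monitoring resumes automatically after the duration, or re-enable it manually:
Enable-Monitoring -Server "SERVER01"
Alert Grouping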
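Alert grouping amounts to collecting alerts by host and category before anything is sent. A minimal Python sketch, assuming alerts arrive as `(host, category, detail)` tuples: `group_alerts` is a hypothetical helper, and the message wording only mirrors the sample emails in this section.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse alerts that share a host and category into one message each."""
    grouped = defaultdict(list)
    for host, category, detail in alerts:
        grouped[(host, category)].append(detail)

    messages = []
    for (host, category), details in grouped.items():
        if len(details) == 1:
            messages.append(f"{host} - {category} alert ({details[0]})")
        else:
            joined = ", ".join(details)
            messages.append(f"{host} - Multiple {category} alerts ({joined})")
    return messages


# Usage: three disk alerts for one server become a single message.
pending = [
    ("Server1", "disk space", "C:"),
    ("Server1", "disk space", "D:"),
    ("Server1", "disk space", "E:"),
]
emails = group_alerts(pending)
```

In practice the grouping runs over a short collection window, for example a few minutes, so related alerts have time to arrive before the message goes out.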
Instead of:
EMAIL 1: Server1 - Disk C: high
EMAIL 2: Server1 - Disk D: high
EMAIL 3: Server1 - Disk E: high
Group to:
EMAIL 1: Server1 - Multiple disk space alerts (C:, D:, E:)
Alert Suppression Rules
If backup fails between 2 AM - 3 AM:
    Wait 30 minutes
    Check again
    If still failing: Alert
If service restarts successfully:
    Log event
    Don't alert
If service restarts 3 times in 1 hour:
    Alert critical
Dependency-Aware Monitoring
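Dependency-aware suppression can be modeled as a map from a parent device to the checks that rely on it: when the parent is down, its failing children collapse into one alert. The Python sketch below shows that mechanism. The device and check names (`FW01`, `DC01`, `vpn-tunnel`, and so on) and the `filter_alerts` helper are illustrative assumptions, and a real deployment would take the map from its own inventory.

```python
# Hypothetical dependency map: parent device -> checks that depend on it.
DEPENDENCIES = {
    "FW01": ["internet-check", "vpn-tunnel", "external-website"],
    "DC01": ["ad-auth-failures", "group-policy"],
}


def filter_alerts(failing, dependencies=DEPENDENCIES):
    """Collapse alerts for checks whose parent device is also failing.

    failing: set of device or check names that are currently failing.
    Returns the list of alert messages to send.
    """
    suppressed = set()
    messages = []
    for parent, children in dependencies.items():
        if parent in failing:
            hidden = [child for child in children if child in failing]
            suppressed.update(hidden)
            messages.append(
                f"{parent} offline - {len(hidden)} dependent checks suppressed"
            )
    # Anything not explained by a failed parent still alerts on its own.
    for name in sorted(failing):
        if name not in suppressed and name not in dependencies:
            messages.append(f"{name} failing")
    return messages


# Usage: the domain controller is down, along with two checks that depend on it.
alerts = filter_alerts({"DC01", "ad-auth-failures", "group-policy", "disk-c"})
```

The unrelated `disk-c` failure is still reported, so suppression hides only the failures that the parent outage explains.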
If Firewall offline:
    Suppress alerts for:
    - All internet-dependent checks
    - VPN tunnel status
    - External website monitoring
If Domain Controller offline:
    Suppress alerts for:
    - AD authentication failures on other systems
    - Group Policy errors
    Send single alert: "DC01 offline - 15 dependent checks suppressed"
Monitoring Tools
RMM Platforms
Commercial Options
- Datto RMM
- ConnectWise Automate
- NinjaOne
- Atera
- Kaseya VSA
- N-able N-central
Key Features to Look For
- Agent-based monitoring
- Automated remediation (restart services, clear space)
- Patch management integration
- Alert escalation
- Mobile app for alerts
- Reporting and dashboards
- Scripting capabilities
Specialized Tools
Network Monitoring
- PRTG Network Monitor
- Nagios / Icinga
- Zabbix
- LibreNMS
Application Performance
- New Relic
- Datadog
- Application Insights
- Dynatrace
Log Management
- Splunk
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Graylog
- Papertrail
Uptime Monitoring (External)
- Pingdom
- UptimeRobot
- StatusCake
- Site24x7
Free/Built-in Options
Windows
# Performance Monitor
perfmon
# Event Viewer
eventvwr.msc
# Resource Monitor
resmon
# Task Scheduler for automated checks
$action = New-ScheduledTaskAction -Execute 'PowerShell.exe' -Argument '-File C:\Scripts\DailyCheck.ps1'
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "Daily Health Check"
Linux
# Built-in monitoring
top
htop
iotop
netstat
ss
# System logs
journalctl
tail -f /var/log/syslog
# Cron for scheduled checks: run crontab -e, then add a line such as this one (daily at 08:00)
crontab -e
0 8 * * * /usr/local/bin/daily-check.sh
Monitoring Scripts
Windows Server Health Check
# Basic server health check
# Uses Get-CimInstance (over WinRM) because Get-WmiObject is not available in PowerShell 7
function Get-ServerHealth {
    param(
        [Parameter(Mandatory)]
        [string]$ComputerName
    )
    $result = @{}
    # CPU (a single instantaneous sample)
    $cpu = Get-Counter "\Processor(_Total)\% Processor Time" -ComputerName $ComputerName
    $result.CPU = [math]::Round($cpu.CounterSamples[0].CookedValue, 2)
    # Memory (these properties are reported in KB, so dividing by 1MB yields GB)
    $os = Get-CimInstance Win32_OperatingSystem -ComputerName $ComputerName
    $freeMemory = $os.FreePhysicalMemory / 1MB
    $totalMemory = $os.TotalVisibleMemorySize / 1MB
    $result.MemoryFreeGB = [math]::Round($freeMemory, 2)
    $result.MemoryUsedPercent = [math]::Round((($totalMemory - $freeMemory) / $totalMemory) * 100, 2)
    # Disk space (DriveType 3 = local fixed disks)
    $disks = Get-CimInstance Win32_LogicalDisk -Filter "DriveType=3" -ComputerName $ComputerName
    $result.Disks = $disks | ForEach-Object {
        [PSCustomObject]@{
            Drive = $_.DeviceID
            SizeGB = [math]::Round($_.Size / 1GB, 2)
            FreeGB = [math]::Round($_.FreeSpace / 1GB, 2)
            PercentFree = [math]::Round(($_.FreeSpace / $_.Size) * 100, 2)
        }
    }
    # Services (Get-Service lost -ComputerName in PowerShell 6+, so query Win32_Service instead)
    $criticalServices = @("W32Time", "DNS", "NTDS") # Adjust as needed
    $result.Services = Get-CimInstance Win32_Service -ComputerName $ComputerName |
        Where-Object { $_.Name -in $criticalServices } |
        Select-Object Name, @{ Name = 'Status'; Expression = { $_.State } }
    # Last boot time (CIM already returns a DateTime)
    $result.LastBootTime = $os.LastBootUpTime
    $result.UptimeDays = ((Get-Date) - $result.LastBootTime).Days
    return $result
}
# Usage (Send-Alert is a placeholder for your own alerting function)
$health = Get-ServerHealth -ComputerName "SERVER01"
if ($health.CPU -gt 90) {
    Send-Alert "High CPU on SERVER01: $($health.CPU)%"
}
Linux Server Health Check
#!/bin/bash
# linux-health-check.sh
# CPU: 100 minus the idle column (field 15) of the second vmstat sample.
# vmstat replaces the original top parsing, which captured user CPU only.
CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
# Memory: uses the "available" column of free (procps-ng 3.3.10+),
# so reclaimable cache is not counted as used
MEM_TOTAL=$(free -m | awk 'NR==2{print $2}')
MEM_AVAILABLE=$(free -m | awk 'NR==2{print $7}')
MEM_PERCENT=$(awk "BEGIN {printf \"%.2f\", (($MEM_TOTAL-$MEM_AVAILABLE)/$MEM_TOTAL)*100}")
# Disk (-P keeps each filesystem on one line)
DISK_USAGE=$(df -P / | awk 'NR==2{print $5}' | cut -d'%' -f1)
# Services (names vary by distribution, e.g. ssh/sshd, cron/crond)
SERVICES=("sshd" "cron" "rsyslog")
for service in "${SERVICES[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        echo "CRITICAL: $service is not running"
        # Send alert
    fi
done
# Uptime in whole days, from /proc/uptime (seconds since boot)
UPTIME_DAYS=$(awk '{print int($1 / 86400)}' /proc/uptime)
# Check thresholds (bc handles the decimal comparisons)
if [ "$(echo "$CPU_USAGE > 90" | bc)" -eq 1 ]; then
    echo "CRITICAL: CPU usage at ${CPU_USAGE}%"
fi
if [ "$(echo "$MEM_PERCENT > 90" | bc)" -eq 1 ]; then
    echo "CRITICAL: Memory usage at ${MEM_PERCENT}%"
fi
if [ "$DISK_USAGE" -gt 85 ]; then
    echo "WARNING: Disk usage at ${DISK_USAGE}%"
fi
Reporting
Daily Summary Email
Subject: Daily IT Summary - $(Get-Date -Format "yyyy-MM-dd")
ALERTS (Last 24 Hours):
- 2 Critical
- 5 Warnings
- 12 Information
BACKUP STATUS:
- Successful: 15/15
- Failed: 0
- Warnings: 0
SERVER STATUS:
- Online: 18/18
- CPU Avg: 35%
- Memory Avg: 60%
- Disk Space: Healthy
WORKSTATIONS:
- Online: 45/50
- Patch Compliance: 92%
- AV Up-to-Date: 100%
ACTION REQUIRED:
- Review certificate expiration on WEB01 (30 days)
- Plan disk upgrade for FILE01 (trending to full in 45 days)
Monthly Report for Management
Include:
- Uptime percentage by server
- Mean time to resolution for issues
- Patch compliance trend
- Backup success rate
- Security alerts summary
- Capacity planning recommendations
- Upcoming renewals/expirations
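Two of the report figures, uptime percentage and mean time to resolution, are simple arithmetic over outage and incident records. A minimal Python sketch, assuming outages and incidents are available as pairs of `datetime` values; `uptime_percent` and `mean_time_to_resolution` are hypothetical helpers, and the dates are made-up sample data.

```python
from datetime import datetime, timedelta


def uptime_percent(period_start, period_end, outages):
    """Uptime over the period; outages is a list of (start, end) datetimes.

    Overlapping outages are not merged, so pass de-duplicated records.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage to the reporting period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return round(100.0 * (total - down) / total, 3)


def mean_time_to_resolution(incidents):
    """Average of (resolved - opened); returns a timedelta, or None if empty."""
    if not incidents:
        return None
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Usage: a 30-day month with a single outage of 7 hours 12 minutes.
june_start, june_end = datetime(2024, 6, 1), datetime(2024, 7, 1)
outages = [(datetime(2024, 6, 10, 2, 0), datetime(2024, 6, 10, 9, 12))]
june_uptime = uptime_percent(june_start, june_end, outages)
```

Here 7.2 hours of downtime out of 720 hours in the month gives 99.0% uptime, a useful reminder of how little downtime a "two nines" target allows.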
Monitoring Checklist
Initial Setup
- Identify all devices to monitor
- Define critical vs non-critical systems
- Set appropriate thresholds
- Configure alert escalation
- Test alerting (send test alerts)
- Document monitoring procedures
- Train team on responding to alerts
Weekly
- Review alert trends
- Tune thresholds to reduce noise
- Address recurring warnings
- Update monitoring for new systems
Monthly
- Review monitoring coverage
- Check for stale monitors (offline systems)
- Review alert response times
- Update documentation
- Capacity planning review
Quarterly
- Full monitoring audit
- Review and update escalation procedures
- Test disaster recovery monitoring
- Review tool licensing/costs
Common Mistakes
Monitoring Too Much
- Every service on every server
- All performance counters
- Every log entry
- Result: Alert fatigue, important alerts missed
Monitoring Too Little
- Only basic ping checks
- No service-level monitoring
- No trend analysis
- Result: Issues discovered too late
Wrong Thresholds
- Static values for dynamic workloads
- Too sensitive (false positives)
- Not sensitive enough (miss real issues)
- Result: Either alert fatigue or missed problems
No Response Plan
- Alerts sent but no action taken
- No escalation procedures
- Unclear ownership
- Result: Monitoring becomes useless
Not Reviewing Alerts
- Never tune thresholds
- Don’t analyze false positives
- Don’t update for environment changes
- Result: Declining effectiveness over time