A comprehensive guide to documenting server infrastructure in MSP and enterprise environments.

Why Document Servers?

Proper server documentation:

  • Reduces mean time to repair (MTTR)
  • Simplifies onboarding of new team members
  • Enables consistent configurations
  • Supports compliance and audits
  • Facilitates capacity planning
  • Underpins disaster recovery

Server Inventory

Essential Server Information

Minimum Documentation per Server:

Lang: yaml
Hostname: WEBSRV-01
Purpose: Production web server
Environment: Production
OS: Ubuntu Server 22.04 LTS
IP Address: 10.10.10.10
Subnet Mask: 255.255.255.0
Gateway: 10.10.10.1
DNS Servers: 10.10.0.10, 10.10.0.11
VLAN: 10 (Servers)

Hardware:
  Type: Physical / Virtual / Cloud
  Make/Model: Dell PowerEdge R740 / VMware / AWS EC2 t3.large
  Serial Number: SVC1234567
  CPU: 2x Intel Xeon Gold 6248R (48 cores total)
  RAM: 128GB DDR4
  Storage:
    - OS Drive: 2x 480GB SSD RAID1
    - Data Drive: 6x 1.8TB SAS 10K RAID10
  Network: 2x 10GbE (bonded)

Location:
  Data Center: HQ-DC1
  Rack: A-12
  Rack Units: 24-25

Virtualization:
  Hypervisor: VMware ESXi 8.0 / Hyper-V / N/A
  Host: ESX-HOST-03
  Datastore: SAN-PROD-01

Purchase Info:
  Vendor: Dell
  Purchase Date: 2023-06-15
  Warranty Expiration: 2028-06-14
  Support Contract: Dell ProSupport 24x7
  Asset Tag: ASSET-12345
  PO Number: PO-2023-0456

Access:
  Console: iDRAC at https://10.10.0.210 / iLO / IPMI
  SSH: Yes, port 22
  RDP: No
  Management URL: https://websrv-01.company.local

Backup:
  Method: Veeam Backup & Replication
  Schedule: Daily incremental, weekly full
  Retention: 30 days daily, 12 months monthly
  Last Backup: 2024-11-01 23:00
  Backup Size: 120GB

Services Running:
  - Nginx 1.24.0 (ports 80, 443)
  - PHP-FPM 8.2
  - MySQL 8.0.34 (port 3306)

Dependencies:
  - Database: DBSRV-01 (MySQL replica)
  - Storage: SAN-PROD-01 via iSCSI
  - Authentication: DC-01 (LDAP)
  - Monitoring: Zabbix server at 10.10.0.100

Monitoring:
  - Zabbix agent installed
  - SNMP enabled (community string: [stored in vault])
  - Alerts sent to: ops@company.com

Change Log:
  - 2024-10-15: Upgraded Nginx to 1.24.0
  - 2024-09-01: Increased RAM from 64GB to 128GB
  - 2024-08-10: Migrated to new SAN storage

Notes:
  - SSL cert expires 2025-03-01
  - Requires monthly OS updates (2nd Tuesday)
  - Database connection pooling configured
  - Rate limiting enabled on nginx
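
The dated fields in this record (warranty expiration, SSL certificate expiry) only help if someone checks them before they lapse. The following is a minimal sketch of such a check, assuming GNU date as shipped with Ubuntu. The function names days_between and expiry_status and the 30-day warning window are illustrative choices, not part of any standard tool.

```bash
# Print the whole number of days from START to END (both YYYY-MM-DD).
# Relies on GNU date's -d option; dates are treated as UTC midnight.
days_between() {
  local start_s end_s
  start_s=$(date -u -d "$1" +%s) || return 1
  end_s=$(date -u -d "$2" +%s) || return 1
  echo $(( (end_s - start_s) / 86400 ))
}

# Report whether a documented expiry date falls inside the warning window.
# Arguments: label, expiry date, window in days, optional "today" (YYYY-MM-DD).
expiry_status() {
  local label="$1" expires="$2" window="$3" today="${4:-$(date -u +%F)}"
  local left
  left=$(days_between "$today" "$expires") || return 1
  if [ "$left" -lt 0 ]; then
    echo "$label: EXPIRED $(( -left )) days ago"
  elif [ "$left" -le "$window" ]; then
    echo "$label: WARNING, $left days left"
  else
    echo "$label: OK, $left days left"
  fi
}
```

For example, `expiry_status "SSL cert" 2025-03-01 30 2024-11-01` prints `SSL cert: OK, 120 days left`. The optional fourth argument pins "today" so the result is reproducible.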

Configuration Documentation

Operating System Configuration

Document all OS-level configurations:

Lang: yaml
OS Configuration:
  Hostname: WEBSRV-01
  Domain: company.local
  Timezone: America/New_York
  NTP Servers:
    - 10.10.0.10
    - 10.10.0.11

  Firewall:
    Status: Enabled
    Rules:
      - Allow 80/tcp from 0.0.0.0/0
      - Allow 443/tcp from 0.0.0.0/0
      - Allow 22/tcp from 10.10.0.0/24
      - Allow 3306/tcp from 10.10.20.0/24

  Users and Groups:
    Local Admins: admin, backup_user
    Service Accounts: nginx_svc, mysql_svc
    SSH Keys: /root/.ssh/authorized_keys (ops team)

  File Systems:
    - /dev/sda1: / (root) - 200GB ext4
    - /dev/sda2: /var - 100GB ext4
    - /dev/sdb1: /data - 2TB xfs
    - /dev/sdc1: /backup - 500GB ext4

  Network Interfaces:
    eth0: 10.10.10.10/24 (Production)
    eth1: 10.10.20.10/24 (Backup network)
    bond0: eth2+eth3 (10GbE bonded)
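
Documented firewall rules are easier to audit when they can be compared with what is actually applied. The sketch below translates one rule in the format used above into the equivalent ufw command and prints it without executing it. It assumes the host uses ufw, which the original record does not state. The function name rule_to_ufw is hypothetical.

```bash
# Translate a documented rule into a ufw command line (printed, not run).
# Call it with the rule's four words as separate arguments:
#   rule_to_ufw Allow 22/tcp from 10.10.0.0/24
rule_to_ufw() {
  if [ "$#" -ne 4 ] || [ "$3" != "from" ]; then
    echo "usage: rule_to_ufw Allow|Deny PORT/PROTO from SOURCE" >&2
    return 1
  fi
  local action port proto src
  case "$1" in
    Allow) action=allow ;;
    Deny)  action=deny ;;
    *) echo "unknown action: $1" >&2; return 1 ;;
  esac
  port="${2%/*}"    # text before the slash, e.g. 22
  proto="${2#*/}"   # text after the slash, e.g. tcp
  src="$4"
  echo "ufw $action proto $proto from $src to any port $port"
}
```

Running `rule_to_ufw Allow 22/tcp from 10.10.0.0/24` prints `ufw allow proto tcp from 10.10.0.0/24 to any port 22`, which can be reviewed against the output of `ufw status` before anything is changed.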

Application Configuration

Web Application Stack:

Lang: yaml
Application: Customer Portal
Version: 3.2.1
Installation Path: /var/www/portal

Components:
  Web Server:
    Software: Nginx 1.24.0
    Config: /etc/nginx/nginx.conf
    Sites: /etc/nginx/sites-enabled/
    SSL Cert: /etc/ssl/certs/portal.company.com.crt
    SSL Key: /etc/ssl/private/portal.company.com.key

  Application:
    Runtime: PHP 8.2-FPM
    Config: /etc/php/8.2/fpm/php.ini
    Pool Config: /etc/php/8.2/fpm/pool.d/www.conf
    Max Workers: 50
    Memory Limit: 256M

  Database:
    Type: MySQL 8.0.34
    Config: /etc/mysql/my.cnf
    Data Dir: /var/lib/mysql
    Port: 3306
    Max Connections: 200
    Buffer Pool Size: 16GB

Dependencies:
  - Redis 7.0 (session cache) - port 6379
  - Memcached 1.6 (object cache) - port 11211
  - Elasticsearch 8.10 (search) - port 9200

Environment Variables:
  APP_ENV: production
  DB_HOST: 10.10.10.11
  DB_NAME: portal_prod
  DB_USER: portal_app
  REDIS_HOST: 10.10.10.12
  API_KEY: [stored in vault]

Cron Jobs:
  - "0 2 * * * /var/www/portal/scripts/daily_cleanup.sh"
  - "*/15 * * * * /var/www/portal/scripts/queue_worker.php"
  - "0 0 * * 0 /var/www/portal/scripts/weekly_report.sh"

Security Documentation

Security Hardening

Document all security configurations:

  • Patch Management:

    • OS patches: Monthly (2nd Tuesday)
    • Application patches: As needed, tested in staging first
    • Last patched: 2024-10-08
  • Access Control:

    • SSH: Key-based auth only, no password login
    • Sudo: Limited to ops team members
    • Service accounts: No interactive login
    • MFA: Required for all admin access
  • Encryption:

    • Data at rest: LUKS encryption on /data partition
    • Data in transit: TLS 1.2+ only, strong ciphers
    • Database: Encrypted connections required
  • Logging and Auditing:

    • Syslog forwarding: To SIEM at 10.10.0.100
    • Audit logs: /var/log/audit/
    • Retention: 90 days local, 1 year in SIEM
    • Monitored events: Login attempts, sudo usage, file changes
  • Vulnerability Scanning:

    • Tool: Nessus
    • Schedule: Weekly
    • Last scan: 2024-11-01
    • Critical vulns: 0
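
The access-control entries above are claims that can drift from reality, so it helps to verify them against the running configuration. The sketch below checks the "key-based auth only" claim, assuming OpenSSH, whose `sshd -T` prints the effective configuration with lowercase keywords. The function name ssh_policy_ok is hypothetical, and the check covers only these two settings, not every way a password prompt could be enabled.

```bash
# Read "sshd -T" style output on stdin and confirm the documented policy:
# password logins disabled, public-key logins enabled.
ssh_policy_ok() {
  local config
  config=$(cat)
  printf '%s\n' "$config" | grep -qx 'passwordauthentication no' || return 1
  printf '%s\n' "$config" | grep -qx 'pubkeyauthentication yes' || return 1
}
```

A typical invocation would be `sudo sshd -T | ssh_policy_ok && echo "SSH policy matches documentation"`.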

Compliance Requirements

Industry Standards:

  • SOC 2 Type II: Yes
  • PCI DSS: N/A
  • HIPAA: No
  • GDPR: Yes (EU customer data)

Required Controls:

  • Access logging enabled
  • Data encryption at rest and in transit
  • Regular vulnerability scanning
  • Incident response procedures
  • Backup verification

Maintenance and Operations

Maintenance Schedule

Lang: yaml
Daily Tasks:
  - 00:00: Incremental backup starts
  - 02:00: Database optimization
  - 03:00: Log rotation
  - 04:00: Cleanup temp files

Weekly Tasks:
  - Sunday 01:00: Security scan
  - Sunday 02:00: Weekly report generation
  - Sunday 23:00: Full system backup

Monthly Tasks:
  - 2nd Tuesday: OS patches (maintenance window)
  - Last Sunday: Certificate renewal check
  - 1st Monday: Capacity review

Quarterly Tasks:
  - Disaster recovery test
  - Access review
  - Documentation review
  - Vulnerability assessment

Performance Baselines

Normal Operating Metrics:

Lang: yaml
CPU Usage:
  Average: 25-35%
  Peak: 60-70% (business hours)
  Alert Threshold: ">80% for 15 min"

Memory Usage:
  Average: 60GB/128GB (47%)
  Alert Threshold: ">90% (115GB)"

Disk I/O:
  Read: 50-100 MB/s average
  Write: 20-50 MB/s average
  IOPS: 1000-2000 average
  Alert Threshold: ">80% capacity"

Network:
  Inbound: 100-200 Mbps average
  Outbound: 50-100 Mbps average
  Connections: 500-1000 concurrent
  Alert Threshold: ">8 Gbps sustained"

Application Metrics:
  Response Time: <200ms (95th percentile)
  Request Rate: 1000-2000 req/sec
  Error Rate: <0.1%
  Active Sessions: 500-1500
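
The percentages in the baseline follow from simple integer arithmetic, and the same arithmetic can drive an alert check. This is a small sketch using the memory figures above. The function names percent_used and over_threshold are illustrative, and a monitoring system such as Zabbix would normally own this logic.

```bash
# Integer percentage of USED out of TOTAL, rounded to the nearest whole number.
percent_used() {
  echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# Succeeds (exit 0) when USED/TOTAL is strictly above THRESHOLD percent.
# Compares used*100 with threshold*total to avoid rounding errors.
over_threshold() {
  local used="$1" total="$2" threshold="$3"
  [ $(( used * 100 )) -gt $(( threshold * total )) ]
}
```

`percent_used 60 128` prints `47`, matching the documented average. `over_threshold 115 128 90` fails while `over_threshold 116 128 90` succeeds, because 90% of 128GB is 115.2GB.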

Disaster Recovery

Recovery Procedures

Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 1 hour

Disaster Recovery Steps:

  • Server Failure:

    • Restore from last full backup to standby hardware
    • Apply incremental backups taken since that full backup, if any
    • Verify application functionality
    • Update DNS to point to standby server
    • Monitor for issues
  • Data Corruption:

    • Identify last known good backup
    • Restore to staging environment
    • Verify data integrity
    • Promote to production during maintenance window
  • Ransomware:

    • Isolate affected server (disconnect network)
    • Preserve evidence for forensics
    • Rebuild server from clean image
    • Restore data from backup (verify backup is clean)
    • Restore services incrementally
    • Conduct security review

Backup Restoration Test:

  • Frequency: Quarterly
  • Last Test: 2024-10-01
  • Result: Success - 3.5 hour recovery time
  • Issues: None
  • Next Test: 2025-01-01

Runbooks and Procedures

Common Procedures

Server Restart Procedure:

Lang: bash
# 1. Notify stakeholders (30 min advance notice)
# 2. Stop application gracefully
systemctl stop nginx
systemctl stop php8.2-fpm

# 3. Stop database with proper shutdown
systemctl stop mysql

# 4. Reboot server
reboot

# 5. Verify services after boot
systemctl status mysql
systemctl status php8.2-fpm
systemctl status nginx

# 6. Test application access
curl -I https://portal.company.com

# 7. Monitor logs for errors (one tail follows both files; a second
#    "tail -f" on its own line would never start)
tail -f /var/log/nginx/error.log /var/log/mysql/error.log
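
Step 5 of the restart procedure can be scripted so that a failed service is reported rather than overlooked in scrolling status output. This is a minimal sketch built on `systemctl is-active --quiet`. The function name check_services and the SYSTEMCTL override variable are assumptions added for illustration.

```bash
# Report the state of each named service; return non-zero if any is down.
# SYSTEMCTL may be set to another command name, which allows dry runs.
check_services() {
  local failed=0 svc
  for svc in "$@"; do
    if "${SYSTEMCTL:-systemctl}" is-active --quiet "$svc"; then
      echo "OK   $svc"
    else
      echo "DOWN $svc"
      failed=1
    fi
  done
  return "$failed"
}
```

After a reboot, `check_services mysql php8.2-fpm nginx || echo "investigate before announcing completion"` gives a single pass/fail answer for the three services in this runbook.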

SSL Certificate Renewal:

Lang: bash
# 1. Generate CSR
openssl req -new -key /etc/ssl/private/portal.company.com.key \
  -out /tmp/portal.company.com.csr

# 2. Submit CSR to CA (DigiCert)
# 3. Download new certificate files
# 4. Backup old certificates
cp /etc/ssl/certs/portal.company.com.crt \
   /etc/ssl/certs/portal.company.com.crt.backup

# 5. Install new certificate (nginx reads the leaf and intermediate
#    from a single file, leaf certificate first)
cat new_cert.crt intermediate.crt > /etc/ssl/certs/portal.company.com.crt

# 6. Test certificate
openssl x509 -in /etc/ssl/certs/portal.company.com.crt -text -noout

# 7. Reload nginx
nginx -t && systemctl reload nginx

# 8. Verify in browser and with SSL checker
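
Renewals are usually missed because nobody notices the expiry date approaching. The sketch below wraps `openssl x509 -checkend`, which exits 0 only when the certificate will still be valid after the given number of seconds. The function name cert_expires_within is hypothetical, and the 30-day window in the usage line is an assumed policy, not one stated in this guide.

```bash
# Succeeds (exit 0) when the certificate in FILE expires within DAYS days.
# Returns 2 if FILE cannot be read, so a missing file is not mistaken
# for an expiring certificate.
cert_expires_within() {
  local file="$1" days="$2"
  [ -r "$file" ] || return 2
  ! openssl x509 -in "$file" -noout -checkend "$(( days * 86400 ))" >/dev/null
}
```

A monthly check could then be as short as `cert_expires_within /etc/ssl/certs/portal.company.com.crt 30 && echo "start renewal"`.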

Application Deployment:

  • Update code from Git repository
  • Run database migrations if needed
  • Clear application cache
  • Restart PHP-FPM
  • Verify no errors in logs
  • Smoke test critical paths
  • Monitor error rates for 30 minutes

Database Backup Verification:

Lang: bash
# Weekly backup verification script
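# 1. Restore backup into the test database (portal_test must already
#    exist; assumes the dump carries no CREATE DATABASE/USE statements)
mysql -h test-db-server -u root -p portal_test < /backup/portal_backup.sql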

# 2. Verify table counts
mysql -h test-db-server -u root -p -e \
  "SELECT COUNT(*) FROM portal_test.users;"

# 3. Check data integrity
mysql -h test-db-server -u root -p -e \
  "SELECT MAX(created_at) FROM portal_test.orders;"

# 4. Document results in backup log
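
Before restoring a dump it is worth confirming that the file is actually there, non-empty, and recent, since a stale or zero-byte file would make the verification meaningless. The sketch below assumes GNU stat. The function name backup_is_fresh and the 24-hour limit in the usage line are illustrative.

```bash
# Succeeds when FILE exists, is non-empty, and was modified within MAX_HOURS.
# Uses GNU stat: -c %Y prints the modification time as epoch seconds.
backup_is_fresh() {
  local file="$1" max_hours="$2" now mtime
  [ -s "$file" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$file") || return 1
  [ $(( now - mtime )) -le $(( max_hours * 3600 )) ]
}
```

For example, `backup_is_fresh /backup/portal_backup.sql 24 || echo "backup missing or stale"` can run as step 0 of the script above.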

Network Documentation

Network Connectivity

Lang: yaml
Network Segments:
  Production VLAN:
    VLAN ID: 10
    Subnet: 10.10.10.0/24
    Gateway: 10.10.10.1
    Purpose: Production servers

  Database VLAN:
    VLAN ID: 20
    Subnet: 10.10.20.0/24
    Gateway: 10.10.20.1
    Purpose: Database servers

  Backup VLAN:
    VLAN ID: 30
    Subnet: 10.10.30.0/24
    Gateway: 10.10.30.1
    Purpose: Backup traffic

Firewall Rules:
  - Production VLAN → Database VLAN: port 3306
  - Production VLAN → Internet: ports 80, 443 (outbound)
  - Ops Network → Production VLAN: port 22
  - Backup VLAN → Storage: iSCSI ports

Load Balancer:
  VIP: 10.10.10.100
  Backend Servers:
    - WEBSRV-01: 10.10.10.10
    - WEBSRV-02: 10.10.10.11
  Algorithm: Least connections
  Health Check: GET /health HTTP/1.1
  SSL Offloading: Enabled
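
With addresses recorded in several places, a quick consistency check is to confirm that each documented IP really belongs to the subnet of the VLAN it is labelled with. This is a sketch in plain shell arithmetic for IPv4. The function names ip_to_int and ip_in_cidr are hypothetical, and input validation is omitted for brevity.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local ip="$1" a b c d
  a="${ip%%.*}"; ip="${ip#*.}"
  b="${ip%%.*}"; ip="${ip#*.}"
  c="${ip%%.*}"; d="${ip#*.}"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Succeeds when IP lies inside CIDR, e.g. ip_in_cidr 10.10.10.10 10.10.10.0/24
ip_in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  # Build the netmask: the top $bits bits set, the rest cleared.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

`ip_in_cidr 10.10.10.10 10.10.10.0/24` succeeds, while `ip_in_cidr 10.10.20.10 10.10.30.0/24` fails. A failure like the second one is the kind of mismatch a documentation audit should flag.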

Diagrams and Visual Documentation

Required Diagrams

Network Diagram:

  • Physical topology showing switches, routers, firewalls
  • Logical topology showing VLANs and subnets
  • External connections (internet, VPN, site-to-site)

Application Architecture:

  • Load balancer configuration
  • Web tier (Nginx servers)
  • Application tier (PHP-FPM)
  • Data tier (MySQL, Redis, Elasticsearch)
  • Integration points (APIs, external services)

Data Flow:

  • User request flow through load balancer
  • Database replication topology
  • Backup data flow
  • Log aggregation flow

Rack Elevation:

  • Physical server locations in rack
  • Network switch locations
  • PDU connections
  • Cable management

Change Management

Change Documentation Template

Lang: yaml
Change Request: CR-2024-1234
Date: 2024-11-01
Requested By: John Doe
Implemented By: Jane Smith
Priority: Medium

Description:
  Upgrade MySQL from 8.0.33 to 8.0.34 for security patches

Impact Analysis:
  Systems Affected: WEBSRV-01, DBSRV-01
  Downtime Required: 15 minutes
  Risk Level: Low
  Rollback Plan: Restore from snapshot

Testing:
  - Tested in staging environment
  - Backup verified before change
  - Rollback procedure documented

Implementation Steps:
  - Stop application
  - Backup database
  - Upgrade MySQL packages
  - Restart MySQL (since 8.0.16 the server upgrades system tables at startup; mysql_upgrade is deprecated)
  - Verify replication status
  - Start application
  - Monitor for 1 hour

Verification:
  - mysql --version shows 8.0.34
  - All services running normally
  - No errors in logs
  - Application responding correctly
  - Replication lag: 0 seconds

Post-Implementation:
  - Update documentation
  - Notify stakeholders
  - Schedule follow-up review in 1 week

Documentation Best Practices

Maintenance Guidelines

  • Keep Current: Review monthly, update immediately after changes
  • Version Control: Use Git for documentation files
  • Access Control: Sensitive info (passwords, keys) in separate vault
  • Automation: Generate inventory from monitoring tools where possible
  • Templates: Use consistent templates across all servers
  • Validation: Quarterly audit to verify accuracy
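
The monthly review is easier to enforce when stale files are listed automatically. The sketch below assumes the documentation lives as files in a directory tree, for example a Git checkout. The function name stale_docs and the 30-day default are illustrative.

```bash
# List files under DIR whose last modification is more than DAYS days old.
# find's -mtime +N matches files whose age in whole days is greater than N.
stale_docs() {
  local dir="$1" days="${2:-30}"
  find "$dir" -type f -mtime +"$days" -print | sort
}
```

Note that file modification times are reset by a fresh clone, so in a Git-based workflow the commit date of each file may be a more reliable signal than mtime.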

Information to NEVER Document in Plain Text

  • Passwords or API keys
  • Private encryption keys
  • Personal data
  • Credit card information
  • Social Security numbers

Instead use:

  • Password manager (1Password, LastPass)
  • Secrets management (HashiCorp Vault, AWS Secrets Manager)
  • Environment variables
  • Encrypted configuration files
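
A simple scan can catch credentials that slip into documentation despite the rule above. The sketch below is a rough heuristic, not a replacement for a dedicated secret scanner. The keyword list, the function name find_plaintext_secrets, and the convention of treating "[stored in vault]" as a safe placeholder are all assumptions.

```bash
# Print lines under DIR that look like a credential assigned inline,
# e.g. "password: hunter2". Lines using the "[stored in vault]"
# placeholder are skipped. Exit status is 0 when something was found
# and non-zero when the scan came back clean.
find_plaintext_secrets() {
  grep -rnEi '(password|passwd|secret|api_key|token)[[:space:]]*[:=][[:space:]]*[^[:space:]]' "$1" \
    | grep -vi 'stored in vault'
}
```

Run as a pre-commit check, `find_plaintext_secrets docs/ && echo "remove secrets before committing"` blocks the most obvious mistakes. The docs/ path is a placeholder for wherever the documentation is kept.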

Documentation Storage

Recommended Tools:

  • Wiki: Confluence, BookStack, Gitea Wiki
  • Version Control: GitLab, GitHub, Bitbucket
  • Diagrams: Draw.io, Lucidchart, Visio
  • Password Vault: 1Password, Bitwarden, KeePass
  • CMDB: ServiceNow, Device42, Netbox