Proxmox VE Clustering & High Availability
Proxmox VE clustering enables you to join multiple physical servers into a single logical unit with centralized management, automatic failover, and zero-downtime maintenance capabilities. The integrated HA Manager ensures critical services remain available even when hardware fails.
[Diagram: 4-node Proxmox cluster with shared storage]
Cluster Features
Centralized Management
Single web interface to manage all nodes and resources in the cluster.
- Unified dashboard
- Cross-node operations
- Synchronized configuration
- Cluster-wide monitoring
High Availability (HA)
Automatic restart of VMs on healthy nodes when a host fails.
- Watchdog-based fencing
- Automatic recovery
- Priority-based restart
- Service monitoring
Live Migration
Move running VMs between hosts with zero downtime for maintenance.
- Online migration (VMs)
- Offline migration (VMs/CTs)
- Shared or local storage
- Automatic or manual
Quorum-Based
Voting system prevents split-brain scenarios in network partitions.
- Majority voting
- External QDevice support
- Auto-fencing
- Safe failover
Setting Up a Cluster
Prerequisites
- All nodes must have unique hostnames
- All nodes must have their time synchronized (NTP or chrony)
- All nodes must be able to reach each other over the cluster network with low latency (Corosync is latency-sensitive)
- A minimum of three nodes is recommended for reliable quorum
- Dedicated network interface for cluster communication recommended
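A few quick checks on each node can catch most of these issues before the cluster is created; the IP address below is only an example of another node:
hostnamectl              # confirm a unique, resolvable hostname
timedatectl              # "System clock synchronized: yes" confirms working time sync
ping -c 3 192.168.1.10   # example: confirm the other nodes are reachable
pveversion               # confirm all nodes run the same Proxmox VE version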
Creating a Cluster
pvecm create my-cluster   # run on the first node; the cluster name cannot be changed later
pvecm status              # verify membership and quorum
Adding Nodes
pvecm add 192.168.1.10   # run on the joining node, pointing at an existing cluster member
pvecm nodes              # list current cluster members
Removing Nodes
pvecm delnode node-name   # power the node off first; it must not rejoin with the same identity without reinstalling
High Availability Manager
The HA Manager monitors services and automatically restarts them on other nodes if their host fails. It uses a priority system to determine which node should run which services.
Configuring HA Services
ha-manager add vm:100 --state started --max_restart 3 --max_relocate 3   # put VM 100 under HA control
ha-manager remove vm:100   # remove VM 100 from HA management (the VM itself is untouched)
ha-manager status          # current state of all HA resources
ha-manager config          # show the HA resource configuration
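Beyond adding and removing resources, ha-manager can also change a resource's requested state and request HA-aware moves; the VM ID and node name below are illustrative:
ha-manager set vm:100 --state stopped   # ask the HA stack to stop the resource (it stays managed)
ha-manager migrate vm:100 pve-node2     # request an online migration through the HA stack
ha-manager relocate vm:100 pve-node2    # stop on the current node, restart on the target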
HA Groups
HA groups allow you to restrict which nodes can run specific HA services, useful for licensing constraints or hardware requirements.
ha-manager groupadd production-nodes --nodes "pve-node1,pve-node2" --nofailback 0   # group limited to two nodes
ha-manager add vm:100 --group production-nodes   # use 'ha-manager set' instead if vm:100 is already an HA resource
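Groups can also express node preferences through priorities (a higher number is preferred); the group name and priorities below are made up for illustration:
ha-manager groupadd prefer-node1 --nodes "pve-node1:2,pve-node2:1"   # prefer pve-node1, fall back to pve-node2
With nofailback left at its default of 0, the service moves back to the higher-priority node once that node returns.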
Live Migration
| Migration Type | Downtime | Requirements |
|---|---|---|
| Online (Live) - Shared Storage | None (seconds) | Shared storage, network connectivity |
| Online (Live) - Local Storage | Very brief | High bandwidth network for storage sync |
| Offline | VM shutdown time | Target node has capacity |
Migration Commands
qm migrate 100 pve-node2 --online                 # live-migrate running VM 100 to pve-node2
qm migrate 100 pve-node2                          # offline migration (VM must be stopped)
pct migrate 101 pve-node2                         # migrate container 101 (offline; use --restart for a running container)
qm migrate 100 pve-node2 --online --bwlimit 100   # limit migration bandwidth (value in KiB/s)
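If VM 100 keeps its disks on local storage, an online migration also has to copy those disks to the target node; assuming that setup, a sketch would be:
qm migrate 100 pve-node2 --online --with-local-disks   # live storage migration for local disks
The --targetstorage option can additionally redirect the copied disks to a different storage on the target node.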
Quorum & Fencing
Understanding Quorum
Quorum ensures that only one partition of a split cluster can make changes, preventing data corruption. A cluster needs a majority of nodes (n/2 + 1) to have quorum.
- 3-node cluster: Needs 2 nodes for quorum (can lose 1 node)
- 4-node cluster: Needs 3 nodes for quorum (can lose 1 node)
- 5-node cluster: Needs 3 nodes for quorum (can lose 2 nodes)
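The current vote count and quorum state can be checked at any time; both commands below ship with a standard Proxmox VE installation:
pvecm status             # "Quorum information" shows expected votes, total votes, and quorate yes/no
corosync-quorumtool -s   # lower-level view of the same quorum state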
Two-Node Clusters
Two-node clusters are special cases. Use a QDevice (external vote provider) or adjust expected votes:
- QDevice: External system providing third vote (recommended)
- Expected votes: Temporary adjustment for maintenance (use with caution)
pvecm expected 1   # temporarily lower expected votes so a single node stays quorate (maintenance only)
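A QDevice needs a small external host running corosync-qnetd; assuming that host is reachable at 192.168.1.50 (a placeholder address), the setup looks roughly like this:
apt install corosync-qnetd         # on the external QDevice host
apt install corosync-qdevice       # on every cluster node
pvecm qdevice setup 192.168.1.50   # run once from any cluster node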
Fencing
Fencing ensures that a failed node is truly offline before starting its HA services elsewhere. Proxmox uses watchdog-based fencing by default.
- Watchdog timers: Hardware or software watchdog automatically reboots unresponsive nodes
- External fencing: IPMI, iLO, iDRAC for power-based fencing
- Network fencing: Switch port shutdown
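On a default installation the software watchdog (softdog) drives fencing through the watchdog-mux service; a quick sanity check might look like this:
systemctl status watchdog-mux    # multiplexer feeding the active watchdog
lsmod | grep -e softdog -e wdt   # show which watchdog kernel module is loaded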
Storage for Clusters
Shared Storage Options
- Ceph: Hyper-converged storage integrated with Proxmox, best for clusters
- NFS: Simple to set up with good performance, but a single point of failure unless the NFS server itself is made highly available
- iSCSI: Block-level storage, often backed by commercial SANs
- GlusterFS: Distributed filesystem, redundant
- ZFS Replication: Asynchronous block-level replication between nodes (not truly shared storage; failover may lose the most recent changes)
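As one concrete example, a shared NFS export can be registered cluster-wide with pvesm; the storage name, server address, and export path below are placeholders:
pvesm add nfs shared-nfs --server 192.168.1.20 --export /export/vms --content images,rootdir
Because the storage definition lives in the cluster-wide /etc/pve/storage.cfg, every node sees it immediately, which is what allows live migration without copying disks.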
Best Practices
- Odd number of nodes: Use 3, 5, or 7 nodes for proper quorum
- Dedicated cluster network: Separate physical network for Corosync traffic
- Redundant links: Configure Corosync with multiple links (see the sketch after this list)
- Time synchronization: Use NTP on all nodes
- Identical versions: Keep all nodes on the same Proxmox version
- Shared storage: Use for best live migration performance
- Regular testing: Test failover procedures periodically
- Monitor cluster health: Watch quorum status, fencing, and HA service states
- Document procedures: Maintain runbooks for emergency situations
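A dedicated, redundant Corosync network is easiest to set up at cluster-creation time; the two addresses below are placeholders for interfaces on independent networks:
pvecm create my-cluster --link0 10.10.10.1 --link1 10.10.20.1   # two Corosync links on separate networks
For an existing cluster, additional links are added by editing /etc/pve/corosync.conf.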
Troubleshooting
systemctl status corosync     # is the cluster communication service running?
corosync-cfgtool -s           # status of each Corosync link
pvecm status                  # quorum and membership overview
journalctl -u pve-ha-lrm -f   # follow the local HA resource manager log
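When the HA stack itself misbehaves, the cluster filesystem and the cluster-wide HA manager are worth checking as well; these services are part of a standard installation:
systemctl status pve-cluster   # pmxcfs, the cluster configuration filesystem
journalctl -u pve-ha-crm -f    # follow the cluster resource manager log (active on the current HA master)
journalctl -u corosync -b      # Corosync messages since the last boot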
Proxmox VE clustering transforms multiple servers into a resilient, enterprise-grade infrastructure platform with automatic failover, centralized management, and zero-downtime maintenance capabilities.