Proxmox VE Clustering & High Availability
Proxmox VE clustering enables you to join multiple physical servers into a single logical unit with centralized management, automatic failover, and zero-downtime maintenance capabilities. The integrated HA Manager ensures critical services remain available even when hardware fails.
[Diagram: 4-node Proxmox cluster with shared storage]
Cluster Features
Centralized Management
Single web interface to manage all nodes and resources in the cluster.
- Unified dashboard
- Cross-node operations
- Synchronized configuration
- Cluster-wide monitoring
High Availability (HA)
Automatic restart of VMs on healthy nodes when a host fails.
- Watchdog-based fencing
- Automatic recovery
- Priority-based restart
- Service monitoring
Live Migration
Move running VMs between hosts with zero downtime for maintenance.
- Online migration (VMs)
- Offline migration (VMs/CTs)
- Shared or local storage
- Automatic or manual
Quorum-Based
Voting system prevents split-brain scenarios in network partitions.
- Majority voting
- External QDevice support
- Auto-fencing
- Safe failover
Setting Up a Cluster
Prerequisites
- All nodes must have unique hostnames
- All nodes must have their time synchronized (NTP or chrony)
- All nodes must be able to reach each other over the cluster network with low latency (Corosync is latency-sensitive)
- A minimum of three nodes is recommended for reliable quorum
- Dedicated network interface for cluster communication recommended
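A few quick checks on each node can catch most of these issues before the cluster is created; the IP address below is only an example of another node:
hostnamectl              # confirm a unique, resolvable hostname
timedatectl              # "System clock synchronized: yes" confirms working time sync
ping -c 3 192.168.1.10   # example: confirm the other nodes are reachable
pveversion               # confirm all nodes run the same Proxmox VE version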
Creating a Cluster
pvecm create my-cluster   # run on the first node; the cluster name cannot be changed later
pvecm status              # verify membership and quorum
Adding Nodes
pvecm add 192.168.1.10   # run on the joining node, pointing at an existing cluster member
pvecm nodes              # list current cluster members
Removing Nodes
pvecm delnode node-name   # power the node off first; it must not rejoin with the same identity without reinstalling
High Availability Manager
The HA Manager monitors services and automatically restarts them on other nodes if their host fails. It uses a priority system to determine which node should run which services.
Configuring HA Services
ha-manager add vm:100 --state started --max_restart 3 --max_relocate 3   # put VM 100 under HA control
ha-manager remove vm:100   # remove VM 100 from HA management (the VM itself is untouched)
ha-manager status          # current state of all HA resources
ha-manager config          # show the HA resource configuration
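Beyond adding and removing resources, ha-manager can also change a resource's requested state and request HA-aware moves; the VM ID and node name below are illustrative:
ha-manager set vm:100 --state stopped   # ask the HA stack to stop the resource (it stays managed)
ha-manager migrate vm:100 pve-node2     # request an online migration through the HA stack
ha-manager relocate vm:100 pve-node2    # stop on the current node, restart on the target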
HA Groups
HA groups allow you to restrict which nodes can run specific HA services, useful for licensing constraints or hardware requirements.
ha-manager groupadd production-nodes --nodes "pve-node1,pve-node2" --nofailback 0   # group limited to two nodes
ha-manager add vm:100 --group production-nodes   # use 'ha-manager set' instead if vm:100 is already an HA resource
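Groups can also express node preferences through priorities (a higher number is preferred); the group name and priorities below are made up for illustration:
ha-manager groupadd prefer-node1 --nodes "pve-node1:2,pve-node2:1"   # prefer pve-node1, fall back to pve-node2
With nofailback left at its default of 0, the service moves back to the higher-priority node once that node returns.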
Live Migration
| Migration Type | Downtime | Requirements |
|---|---|---|
| Online (Live) - Shared Storage | None (seconds) | Shared storage, network connectivity |
| Online (Live) - Local Storage | Very brief | High bandwidth network for storage sync |
| Offline | VM shutdown time | Target node has capacity |
Migration Commands
qm migrate 100 pve-node2 --online                 # live-migrate running VM 100 to pve-node2
qm migrate 100 pve-node2                          # offline migration (VM must be stopped)
pct migrate 101 pve-node2                         # migrate container 101 (offline; use --restart for a running container)
qm migrate 100 pve-node2 --online --bwlimit 100   # limit migration bandwidth (value in KiB/s)
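If VM 100 keeps its disks on local storage, an online migration also has to copy those disks to the target node; assuming that setup, a sketch would be:
qm migrate 100 pve-node2 --online --with-local-disks   # live storage migration for local disks
The --targetstorage option can additionally redirect the copied disks to a different storage on the target node.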
Quorum & Fencing
Understanding Quorum
Quorum ensures that only one partition of a split cluster can make changes, preventing data corruption. A cluster needs a majority of nodes (n/2 + 1) to have quorum.
- 3-node cluster: Needs 2 nodes for quorum (can lose 1 node)
- 4-node cluster: Needs 3 nodes for quorum (can lose 1 node)
- 5-node cluster: Needs 3 nodes for quorum (can lose 2 nodes)
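The current vote count and quorum state can be checked at any time; both commands below ship with a standard Proxmox VE installation:
pvecm status             # "Quorum information" shows expected votes, total votes, and quorate yes/no
corosync-quorumtool -s   # lower-level view of the same quorum state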
Two-Node Clusters
Two-node clusters are special cases. Use a QDevice (external vote provider) or adjust expected votes:
- QDevice: External system providing third vote (recommended)
- Expected votes: Temporary adjustment for maintenance (use with caution)
pvecm expected 1   # temporarily lower expected votes so a single node stays quorate (maintenance only)
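A QDevice needs a small external host running corosync-qnetd; assuming that host is reachable at 192.168.1.50 (a placeholder address), the setup looks roughly like this:
apt install corosync-qnetd         # on the external QDevice host
apt install corosync-qdevice       # on every cluster node
pvecm qdevice setup 192.168.1.50   # run once from any cluster node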
Fencing
Fencing ensures that a failed node is truly offline before starting its HA services elsewhere. Proxmox uses watchdog-based fencing by default.
- Watchdog timers: Hardware or software watchdog automatically reboots unresponsive nodes
- External fencing: IPMI, iLO, iDRAC for power-based fencing
- Network fencing: Switch port shutdown
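On a default installation the software watchdog (softdog) drives fencing through the watchdog-mux service; a quick sanity check might look like this:
systemctl status watchdog-mux    # multiplexer feeding the active watchdog
lsmod | grep -e softdog -e wdt   # show which watchdog kernel module is loaded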
Storage for Clusters
Shared Storage Options
- Ceph: Hyper-converged storage integrated with Proxmox, best for clusters
- NFS: Simple to set up with good performance, but a single point of failure unless the NFS server itself is made highly available
- iSCSI: Block-level storage, often backed by commercial SANs
- GlusterFS: Distributed filesystem, redundant
- ZFS Replication: Asynchronous block-level replication between nodes (not truly shared storage; failover may lose the most recent changes)
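As one concrete example, a shared NFS export can be registered cluster-wide with pvesm; the storage name, server address, and export path below are placeholders:
pvesm add nfs shared-nfs --server 192.168.1.20 --export /export/vms --content images,rootdir
Because the storage definition lives in the cluster-wide /etc/pve/storage.cfg, every node sees it immediately, which is what allows live migration without copying disks.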
Best Practices
- Odd number of nodes: Use 3, 5, or 7 nodes for proper quorum
- Dedicated cluster network: Separate physical network for Corosync traffic
- Redundant links: Configure Corosync with multiple links (see the sketch after this list)
- Time synchronization: Use NTP on all nodes
- Identical versions: Keep all nodes on the same Proxmox version
- Shared storage: Use for best live migration performance
- Regular testing: Test failover procedures periodically
- Monitor cluster health: Watch quorum status, fencing, and HA service states
- Document procedures: Maintain runbooks for emergency situations
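A dedicated, redundant Corosync network is easiest to set up at cluster-creation time; the two addresses below are placeholders for interfaces on independent networks:
pvecm create my-cluster --link0 10.10.10.1 --link1 10.10.20.1   # two Corosync links on separate networks
For an existing cluster, additional links are added by editing /etc/pve/corosync.conf.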
Troubleshooting
systemctl status corosync     # is the cluster communication service running?
corosync-cfgtool -s           # status of each Corosync link
pvecm status                  # quorum and membership overview
journalctl -u pve-ha-lrm -f   # follow the local HA resource manager log
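When the HA stack itself misbehaves, the cluster filesystem and the cluster-wide HA manager are worth checking as well; these services are part of a standard installation:
systemctl status pve-cluster   # pmxcfs, the cluster configuration filesystem
journalctl -u pve-ha-crm -f    # follow the cluster resource manager log (active on the current HA master)
journalctl -u corosync -b      # Corosync messages since the last boot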
Proxmox VE clustering transforms multiple servers into a resilient, enterprise-grade infrastructure platform with automatic failover, centralized management, and zero-downtime maintenance capabilities.