
Site Recovery

Enterprise Feature

Site Recovery is available exclusively with an Enterprise license. The required feature flag is ceph_replication. Learn more about licensing.

Site Recovery provides disaster recovery (DR) capabilities for your Proxmox environments. It manages data replication between nodes or clusters, orchestrates recovery plans, and supports failover, failback, and emergency DR operations -- giving you confidence that critical workloads can be restored quickly when disaster strikes.

Overview

Site Recovery is built around two core concepts:

  1. Replication Jobs -- Continuous or scheduled data replication from a source node/cluster to a target, ensuring an up-to-date copy of your VMs is always available.
  2. Recovery Plans -- Predefined sequences of actions that describe how to restore a set of VMs on a target cluster in case of failure.

Together, these allow you to protect workloads, test your DR strategy regularly, and execute real failovers with minimal downtime.

Interface Tabs

The Site Recovery page is organized into four tabs: Dashboard, Protection, Recovery Plans, and Emergency.

Dashboard

The Dashboard tab provides a high-level view of your replication health:

  • Overall replication status -- healthy, degraded, or critical
  • Active replication job count and their current states
  • Error count -- jobs in an error state are flagged immediately
  • Recovery plan status overview

Use this tab as a daily check-in to verify that your DR posture is healthy.

Protection

The Protection tab manages replication jobs. Each job defines what data is replicated, from where, and to where.

Creating a Replication Job

Click Create Job to open the creation dialog. You can configure:

  • Source connection -- the Proxmox cluster containing the VMs to protect
  • Target connection -- the destination cluster for replicated data
  • VMs to replicate -- select individual VMs from the source cluster
  • Schedule -- how often replication runs
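
As an illustration only, the four settings above can be pictured as a small record with a sanity check. The class name, field names, and the minutes-based schedule are assumptions made for this sketch; they are not ProxCenter's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicationJob:
    """Hypothetical shape of a replication job (illustrative field names)."""
    source_connection: str                            # cluster holding the VMs to protect
    target_connection: str                            # destination for replicated data
    vm_ids: List[int] = field(default_factory=list)   # VMs selected on the source
    schedule_minutes: int = 15                        # assumed: interval between runs

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the job looks sane."""
        problems = []
        if not self.source_connection or not self.target_connection:
            problems.append("source and target connections are required")
        if not self.vm_ids:
            problems.append("select at least one VM")
        if self.schedule_minutes <= 0:
            problems.append("schedule interval must be positive")
        return problems
```

For example, `ReplicationJob("pve-prod", "pve-dr", [101, 102]).validate()` returns an empty list, while a job with no VMs selected reports one problem.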

Managing Replication Jobs

For each job, the following actions are available:

| Action | Description |
|--------|-------------|
| Sync | Trigger an immediate replication sync |
| Pause | Temporarily suspend replication |
| Resume | Resume a paused replication job |
| Delete | Remove the replication job entirely |

Selecting a job displays its execution logs in a detail panel, showing the history of sync operations with timestamps and results.

Tip

Run a manual sync after making significant changes to a protected VM to ensure the latest state is replicated before relying on it for recovery.

Recovery Plans

Recovery Plans define the procedure for restoring services on a target cluster. The tab lists all existing plans and lets you create new ones.

Creating a Recovery Plan

Click Create Plan to define:

  • Plan name and description
  • Source and target clusters
  • Associated replication jobs -- which replication jobs feed into this plan
  • VM startup order and dependencies
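
To make "startup order and dependencies" concrete, here is a minimal sketch of how a boot sequence could be derived from declared dependencies using a topological sort. The VM names and the `startup_order` helper are hypothetical, and this is not necessarily how ProxCenter computes the order internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def startup_order(dependencies):
    """Return VM names in an order where every dependency starts first.

    `dependencies` maps a VM name to the set of VMs that must already be
    running before it boots. Raises graphlib.CycleError on circular
    dependencies.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-tier application: database, then app server, then web.
plan = {
    "web": {"app"},
    "app": {"db"},
    "db": set(),
}
print(startup_order(plan))
```

A circular dependency (for example, `db` also depending on `web`) raises an error instead of producing an order, which is the kind of mistake worth catching when the plan is defined rather than during a failover.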

Recovery Plan Operations

Each recovery plan supports three operations:

| Operation | Description |
|-----------|-------------|
| Test Failover | Executes the recovery plan in a network-isolated environment. Production workloads are not affected. Use this to validate your DR strategy regularly. |
| Failover | Activates the recovery plan for real. VMs are started on the target cluster using the most recent replicated data. Use this during an actual disaster. |
| Failback | After the primary site is restored, failback reverses the direction, migrating workloads from the DR site back to the original production cluster. |

When any operation is executed, ProxCenter tracks its progress in real time, polling the execution status every 3 seconds and displaying step-by-step updates.
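
The polling behavior described above can be sketched as a simple loop. Only the 3-second interval comes from this page; `fetch_status`, the shape of the status dictionary, and the state names are assumptions made for illustration, not a documented ProxCenter API.

```python
import time
from typing import Callable, Dict, List, Tuple

# Assumed state names; the real status values are not documented here.
TERMINAL_STATES = {"completed", "failed"}


def watch_execution(
    fetch_status: Callable[[], Dict],
    interval: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 1200,
) -> Tuple[str, List[str]]:
    """Poll until the execution reaches a terminal state.

    `fetch_status` is a hypothetical callable returning
    {"state": str, "steps": [str, ...]} for one plan execution.
    """
    seen: List[str] = []
    for _ in range(max_polls):
        status = fetch_status()
        # Report only the steps that appeared since the previous poll.
        for step in status["steps"][len(seen):]:
            seen.append(step)
            print(step)
        if status["state"] in TERMINAL_STATES:
            return status["state"], seen
        sleep(interval)
    raise TimeoutError("execution did not finish within max_polls polls")
```

Passing `sleep` as a parameter keeps the loop testable without real delays, and `max_polls` bounds the wait so a stuck execution cannot block the caller indefinitely.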

Warning

Failover is a disruptive operation. Ensure the source site is truly unavailable before initiating a production failover, as running the same VMs on both sites simultaneously can cause data corruption.

Test Cleanup

After running a test failover, use the Cleanup action to tear down the test environment and release resources on the target cluster. This ensures that test artifacts do not consume storage or interfere with future tests.

Execution History

Select a recovery plan to view its execution history -- a chronological list of all test, failover, and failback operations with their outcomes, timestamps, and any errors encountered.

Emergency

The Emergency tab is designed for critical situations where you need to act fast without going through the full recovery plan workflow.

Emergency DR Mode allows you to:

  • Start individual VMs on a target cluster directly from their most recent replication snapshot
  • Execute immediate failover of an entire recovery plan
  • Execute failback to restore services to the original site

This tab aggregates all replication jobs and recovery plans with quick-action buttons, giving operators a single view to manage a crisis.

Warning

Emergency operations bypass the normal validation steps. Use them only when time is critical and you understand the implications of starting replicated VMs without a full plan execution.

Workflow Example

A typical Site Recovery workflow looks like this:

  1. Set up replication: Create replication jobs for your critical VMs, pointing to a secondary Proxmox cluster
  2. Create a recovery plan: Group the replication jobs into a recovery plan with the correct startup order
  3. Test regularly: Run test failovers monthly to validate that recovery works as expected, then clean up
  4. Respond to incidents: If the primary site fails, execute a failover from the Emergency tab or Recovery Plans tab
  5. Restore normal operations: Once the primary site is back, perform a failback to return workloads to production

Permissions

| Permission | Description |
|------------|-------------|
| vm.config | Required to access Site Recovery and manage replication jobs and recovery plans |

Users without the vm.config permission will not see the Site Recovery entry in the navigation sidebar.
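
Combining the license requirement with the permission rule, access to Site Recovery can be thought of as two independent checks. The sketch below is illustrative: the function name and the use of plain sets are assumptions, and only the identifiers `ceph_replication` and `vm.config` come from this page.

```python
REQUIRED_FEATURE_FLAG = "ceph_replication"  # license feature flag named on this page
REQUIRED_PERMISSION = "vm.config"           # user permission named on this page


def site_recovery_available(license_features, user_permissions):
    """Return True when both the license and the user allow Site Recovery.

    Both arguments are assumed to be sets of strings; how ProxCenter
    actually stores them is not documented here.
    """
    return (
        REQUIRED_FEATURE_FLAG in license_features
        and REQUIRED_PERMISSION in user_permissions
    )
```

For example, a user holding `vm.config` on an installation whose license lacks the `ceph_replication` flag would still get `False`.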