Database Replication and High Availability: Ensuring Zero Downtime

1 December 2024 · CodeMatic Team

Database Replication

High availability is critical for production applications. Database replication ensures your application remains available even when primary databases fail. This guide covers replication strategies, failover mechanisms, and disaster recovery for zero-downtime deployments.

Types of Database Replication

Master-Slave Replication

Master-Slave (or Primary-Replica) replication involves one master database that accepts writes and multiple replica databases that copy data from the master.

  • Master: Handles all write operations
  • Replicas: Handle read operations, reducing load on master
  • Failover: Promote replica to master if master fails

Master-Master Replication

Master-Master replication allows writes to multiple databases simultaneously, improving write availability.

  • Multiple masters accept writes
  • Data synchronized bidirectionally
  • More complex conflict resolution
  • Better for multi-region deployments

Synchronous vs Asynchronous Replication

Synchronous Replication

Synchronous replication waits for confirmation from replica before committing transaction.

  • Pros: No data loss, strong consistency
  • Cons: Higher latency, slower writes
  • Use Case: Financial systems, critical data

Asynchronous Replication

Asynchronous replication commits locally and replicates in background.

  • Pros: Lower latency, faster writes
  • Cons: Possible data loss, eventual consistency
  • Use Case: High-traffic applications, read-heavy workloads

PostgreSQL Replication Setup

Streaming Replication Configuration

-- Master (postgresql.conf)
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3

-- Master (pg_hba.conf)
host replication replica_user REPLICA_IP/32 md5

-- Replica (postgresql.conf)
primary_conninfo = 'host=MASTER_IP port=5432 user=replica_user'
primary_slot_name = 'replica_slot'

-- Enable replication slot
SELECT pg_create_physical_replication_slot('replica_slot');

-- Create replica from backup
pg_basebackup -h MASTER_IP -U replica_user -D /var/lib/postgresql/data -P -W -R

Read Replicas

Read replicas distribute read load across multiple databases, improving performance and availability.

Load Balancing Reads

// Application-level routing
const dbConfig = {
  write: {
    host: process.env.DB_MASTER_HOST,
    port: 5432,
  },
  read: [
    { host: process.env.DB_REPLICA_1_HOST },
    { host: process.env.DB_REPLICA_2_HOST },
  ],
};

async function query(sql, params, options = {}) {
  const isRead = sql.trim().toUpperCase().startsWith('SELECT');
  const pool = isRead 
    ? getReadPool() // Round-robin or least-connections
    : getWritePool();
  
  return pool.query(sql, params);
}

Automatic Failover

Implement automatic failover to minimize downtime when primary database fails.

PostgreSQL Failover with Patroni

# patroni.yml
scope: mycluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: NODE1_IP:8008

etcd:
  hosts: ETCD1_IP:2379,ETCD2_IP:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 30
    maximum_lag_on_failover: 1048576

postgresql:
  parameters:
    wal_level: replica
    max_wal_senders: 3

Monitoring Replication Lag

-- Check replication lag
SELECT 
  client_addr,
  state,
  sync_state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
  pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
  pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
  pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Alert on high lag
SELECT 
  EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
WHERE lag_seconds > 60;

Multi-Region Replication

For global applications, replicate databases across multiple regions for disaster recovery and low latency.

  • Primary Region: Write operations
  • Secondary Regions: Read replicas, disaster recovery
  • Conflict Resolution: Last-write-wins or custom logic
  • Latency: Use regional replicas for reads

Disaster Recovery Planning

RPO (Recovery Point Objective)

Maximum acceptable data loss. Synchronous replication: RPO = 0. Asynchronous: RPO = replication lag time.

RTO (Recovery Time Objective)

Maximum acceptable downtime. Automated failover: RTO = minutes. Manual: RTO = hours.

Backup Strategies

  • Full Backups: Daily or weekly complete database dumps
  • Incremental Backups: Continuous WAL archiving
  • Point-in-Time Recovery: Restore to specific timestamp
  • Testing: Regularly test backup restoration

Real-World Architecture

High-availability setup for a financial application:

  • Primary database with 2 synchronous replicas (RPO = 0)
  • 3 read replicas in different regions
  • Automated failover with Patroni (RTO < 30 seconds)
  • Continuous WAL archiving to S3
  • Daily backups with 30-day retention
  • Result: 99.99% uptime, zero data loss incidents

Best Practices

  • Monitor replication lag continuously
  • Test failover procedures regularly
  • Use connection pooling with read/write splitting
  • Implement circuit breakers for replica failures
  • Document recovery procedures
  • Set appropriate RPO and RTO based on business needs

Conclusion

Database replication is essential for high availability. Choose synchronous replication for zero data loss, or asynchronous for better performance. Implement read replicas to scale reads, use automatic failover for minimal downtime, and plan for disaster recovery. Monitor replication lag and test failover procedures regularly to ensure your system can handle failures gracefully.