Database Replication and High Availability: Ensuring Zero Downtime - CodeMatic Blog

High availability is critical for production applications. Database replication ensures your application remains available even when primary databases fail. This guide covers replication strategies, failover mechanisms, and disaster recovery for zero-downtime deployments.

Types of Database Replication

Master-Slave Replication

Master-Slave (or Primary-Replica) replication involves one master database that accepts writes and multiple replica databases that copy data from the master.

Master: Handles all write operations
Replicas: Handle read operations, reducing load on master
Failover: Promote replica to master if master fails

Master-Master Replication

Master-Master replication allows writes to multiple databases simultaneously, improving write availability.

Multiple masters accept writes
Data synchronized bidirectionally
More complex conflict resolution
Better for multi-region deployments

Synchronous vs Asynchronous Replication

Synchronous Replication

Synchronous replication waits for confirmation from replica before committing transaction.

Pros: No data loss, strong consistency
Cons: Higher latency, slower writes
Use Case: Financial systems, critical data

Asynchronous Replication

Asynchronous replication commits locally and replicates in background.

Pros: Lower latency, faster writes
Cons: Possible data loss, eventual consistency
Use Case: High-traffic applications, read-heavy workloads

PostgreSQL Replication Setup

Streaming Replication Configuration

-- Master (postgresql.conf)
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3

-- Master (pg_hba.conf)
host replication replica_user REPLICA_IP/32 md5

-- Replica (postgresql.conf)
primary_conninfo = 'host=MASTER_IP port=5432 user=replica_user'
primary_slot_name = 'replica_slot'

-- Enable replication slot
SELECT pg_create_physical_replication_slot('replica_slot');

-- Create replica from backup
pg_basebackup -h MASTER_IP -U replica_user -D /var/lib/postgresql/data -P -W -R

Read Replicas

Read replicas distribute read load across multiple databases, improving performance and availability.

Load Balancing Reads

// Application-level routing
const dbConfig = {
  write: {
    host: process.env.DB_MASTER_HOST,
    port: 5432,
  },
  read: [
    { host: process.env.DB_REPLICA_1_HOST },
    { host: process.env.DB_REPLICA_2_HOST },
  ],
};

async function query(sql, params, options = {}) {
  const isRead = sql.trim().toUpperCase().startsWith('SELECT');
  const pool = isRead 
    ? getReadPool() // Round-robin or least-connections
    : getWritePool();
  
  return pool.query(sql, params);
}

Automatic Failover

Implement automatic failover to minimize downtime when primary database fails.

PostgreSQL Failover with Patroni

# patroni.yml
scope: mycluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: NODE1_IP:8008

etcd:
  hosts: ETCD1_IP:2379,ETCD2_IP:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 30
    maximum_lag_on_failover: 1048576

postgresql:
  parameters:
    wal_level: replica
    max_wal_senders: 3

Monitoring Replication Lag

-- Check replication lag
SELECT 
  client_addr,
  state,
  sync_state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
  pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
  pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
  pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Alert on high lag
SELECT 
  EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
WHERE lag_seconds > 60;

Multi-Region Replication

For global applications, replicate databases across multiple regions for disaster recovery and low latency.

Primary Region: Write operations
Secondary Regions: Read replicas, disaster recovery
Conflict Resolution: Last-write-wins or custom logic
Latency: Use regional replicas for reads

Disaster Recovery Planning

RPO (Recovery Point Objective)

Maximum acceptable data loss. Synchronous replication: RPO = 0. Asynchronous: RPO = replication lag time.

RTO (Recovery Time Objective)

Maximum acceptable downtime. Automated failover: RTO = minutes. Manual: RTO = hours.

Backup Strategies

Full Backups: Daily or weekly complete database dumps
Incremental Backups: Continuous WAL archiving
Point-in-Time Recovery: Restore to specific timestamp
Testing: Regularly test backup restoration

Real-World Architecture

High-availability setup for a financial application:

Primary database with 2 synchronous replicas (RPO = 0)
3 read replicas in different regions
Automated failover with Patroni (RTO < 30 seconds)
Continuous WAL archiving to S3
Daily backups with 30-day retention
Result: 99.99% uptime, zero data loss incidents

Best Practices

Monitor replication lag continuously
Test failover procedures regularly
Use connection pooling with read/write splitting
Implement circuit breakers for replica failures
Document recovery procedures
Set appropriate RPO and RTO based on business needs

Conclusion

Database replication is essential for high availability. Choose synchronous replication for zero data loss, or asynchronous for better performance. Implement read replicas to scale reads, use automatic failover for minimal downtime, and plan for disaster recovery. Monitor replication lag and test failover procedures regularly to ensure your system can handle failures gracefully.