High availability is critical for production applications. Database replication keeps your application available even when the primary database fails. This guide covers replication strategies, failover mechanisms, and disaster recovery planning for zero-downtime operation.
Types of Database Replication
Master-Slave Replication
Master-Slave (or Primary-Replica) replication involves one master database that accepts writes and multiple replica databases that copy data from the master.
- Master: Handles all write operations
- Replicas: Handle read operations, reducing load on the master
- Failover: Promote a replica to master if the master fails
Master-Master Replication
Master-Master replication allows writes to multiple databases simultaneously, improving write availability.
- Multiple masters accept writes
- Data synchronized bidirectionally
- More complex conflict resolution (see the sketch after this list)
- Better for multi-region deployments
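As a rough illustration, last-write-wins resolution can be as simple as comparing timestamps, with a deterministic tiebreaker so every master converges on the same winner. The updated_at and node_id fields below are illustrative, not part of any particular replication tool:
// Last-write-wins conflict resolution (illustrative sketch)
function resolveConflict(rowA, rowB) {
  // Prefer the most recently updated version of the row
  if (rowA.updated_at.getTime() !== rowB.updated_at.getTime()) {
    return rowA.updated_at > rowB.updated_at ? rowA : rowB;
  }
  // Equal timestamps: break the tie deterministically (e.g. by node id)
  // so all masters pick the same winner and the cluster converges
  return rowA.node_id > rowB.node_id ? rowA : rowB;
}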
Synchronous vs Asynchronous Replication
Synchronous Replication
Synchronous replication waits for at least one replica to confirm the change before the commit is acknowledged to the client.
- Pros: No data loss on failover, strong consistency
- Cons: Higher commit latency; writes stall if the synchronous replica becomes unreachable
- Use Case: Financial systems, critical data
Asynchronous Replication
Asynchronous replication acknowledges the commit as soon as it is written locally and ships changes to replicas in the background.
- Pros: Lower latency, faster writes
- Cons: Possible data loss if the primary fails before changes reach a replica; replicas may serve slightly stale reads (eventual consistency)
- Use Case: High-traffic applications, read-heavy workloads
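In PostgreSQL the two modes can even be mixed per transaction via synchronous_commit, provided synchronous_standby_names is configured on the primary. A minimal sketch with node-postgres; the accounts table and column names are illustrative:
// Critical write: wait until the standby has applied the change (remote_apply)
const { Pool } = require('pg');
const pool = new Pool({ host: process.env.DB_MASTER_HOST, port: 5432 });

async function transferFunds(fromId, toId, amount) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // SET LOCAL applies only to this transaction
    await client.query(`SET LOCAL synchronous_commit TO 'remote_apply'`);
    await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [amount, fromId]);
    await client.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [amount, toId]);
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
Non-critical bulk writes can instead run SET LOCAL synchronous_commit TO 'off' in their own transactions, trading some durability for speed.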
PostgreSQL Replication Setup
Streaming Replication Configuration
# Master (postgresql.conf)
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
# Master (pg_hba.conf)
host replication replica_user REPLICA_IP/32 md5
# Replica (postgresql.conf, PostgreSQL 12+)
primary_conninfo = 'host=MASTER_IP port=5432 user=replica_user'
primary_slot_name = 'replica_slot'
-- Create the replication slot (run on the master)
SELECT pg_create_physical_replication_slot('replica_slot');
# Create the replica from a base backup (run on the replica host;
# -R writes standby.signal and the connection settings for you)
pg_basebackup -h MASTER_IP -U replica_user -D /var/lib/postgresql/data -P -W -R
Read Replicas
Read replicas distribute read load across multiple databases, improving performance and availability.
Load Balancing Reads
// Application-level routing with node-postgres (pg)
const { Pool } = require('pg');

const dbConfig = {
  write: {
    host: process.env.DB_MASTER_HOST,
    port: 5432,
  },
  read: [
    { host: process.env.DB_REPLICA_1_HOST },
    { host: process.env.DB_REPLICA_2_HOST },
  ],
};

const writePool = new Pool(dbConfig.write);
const readPools = dbConfig.read.map((cfg) => new Pool(cfg));
let next = 0;

const getWritePool = () => writePool;
const getReadPool = () => readPools[next++ % readPools.length]; // round-robin

async function query(sql, params, options = {}) {
  // Crude heuristic: treat SELECTs as reads unless the caller forces the primary
  const isRead = !options.forcePrimary && sql.trim().toUpperCase().startsWith('SELECT');
  const pool = isRead
    ? getReadPool() // Round-robin or least-connections
    : getWritePool();
  return pool.query(sql, params);
}
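Note that the SELECT-prefix check above is only a heuristic: it misses read queries that start with WITH, and it ignores read-after-write consistency, since a replica may lag slightly behind the primary. Reads that must observe a just-committed write should be routed to the primary (for example via the forcePrimary option sketched above).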
Automatic Failover
Implement automatic failover to minimize downtime when primary database fails.
PostgreSQL Failover with Patroni
# patroni.yml (excerpt; a full config also needs a top-level postgresql
# section with data_dir, listen/connect addresses, and authentication)
scope: mycluster
name: node1
restapi:
  listen: 0.0.0.0:8008
  connect_address: NODE1_IP:8008
etcd:
  hosts: ETCD1_IP:2379,ETCD2_IP:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      parameters:
        wal_level: replica
        max_wal_senders: 3
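Clients also need a way to follow the primary after a failover. Production setups usually put HAProxy or a virtual IP in front of the cluster; as a rough sketch, an application can also probe Patroni's REST API directly, since the /primary health endpoint answers 200 only on the current leader. Node addresses below are placeholders and Node 18+ is assumed for the built-in fetch:
// Discover the current primary by probing each node's Patroni REST API
const PATRONI_NODES = ['http://NODE1_IP:8008', 'http://NODE2_IP:8008'];

async function findPrimary() {
  for (const base of PATRONI_NODES) {
    try {
      const res = await fetch(`${base}/primary`, { signal: AbortSignal.timeout(2000) });
      if (res.ok) return base; // only the leader answers 200 here
    } catch {
      // node down or unreachable; try the next one
    }
  }
  throw new Error('no primary found');
}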
Monitoring Replication Lag
-- Check replication lag (run on the primary)
SELECT
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- Alert on high lag (run on the replica; this value also grows while the
-- primary is idle, even if the replica is fully caught up)
SELECT lag_seconds
FROM (
  SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
) AS replay_lag
WHERE lag_seconds > 60;
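The same lag query can feed application-level alerting. A minimal sketch with node-postgres, polling the primary every 30 seconds; the threshold is an assumption to tune for your workload:
// Poll replication lag on the primary and warn when a replica falls too far behind
const { Pool } = require('pg');
const primary = new Pool({ host: process.env.DB_MASTER_HOST, port: 5432 });
const LAG_THRESHOLD_BYTES = 16 * 1024 * 1024; // 16 MB, illustrative

async function checkLag() {
  const { rows } = await primary.query(`
    SELECT client_addr,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
    FROM pg_stat_replication`);
  for (const row of rows) {
    if (Number(row.replay_lag_bytes) > LAG_THRESHOLD_BYTES) {
      console.warn(`replica ${row.client_addr} is ${row.replay_lag_bytes} bytes behind`);
    }
  }
}

setInterval(() => checkLag().catch(console.error), 30_000);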
Multi-Region Replication
For global applications, replicate databases across multiple regions for disaster recovery and low latency.
- Primary Region: Write operations
- Secondary Regions: Read replicas, disaster recovery
- Conflict Resolution: Last-write-wins or custom logic
- Latency: Use regional replicas for reads
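Pinning reads to a nearby replica is usually just a small routing table in the application. A rough sketch; the region names and environment variables are illustrative:
// Map each application region to its nearest read replica
const replicasByRegion = {
  'us-east-1': process.env.DB_REPLICA_US_EAST_HOST,
  'eu-west-1': process.env.DB_REPLICA_EU_WEST_HOST,
  'ap-south-1': process.env.DB_REPLICA_AP_SOUTH_HOST,
};

// Fall back to the primary region's replica when the local region has none
function readHostForRegion(region = process.env.APP_REGION) {
  return replicasByRegion[region] ?? process.env.DB_REPLICA_US_EAST_HOST;
}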
Disaster Recovery Planning
RPO (Recovery Point Objective)
Maximum acceptable data loss, measured in time. With synchronous replication, RPO is effectively zero; with asynchronous replication, RPO equals the replication lag at the moment of failure.
RTO (Recovery Time Objective)
Maximum acceptable downtime. With automated failover, RTO is typically seconds to a few minutes; with manual failover, it can stretch to hours.
Backup Strategies
- Full Backups: Daily or weekly complete database dumps
- Incremental Backups: Continuous WAL archiving
- Point-in-Time Recovery: Restore to specific timestamp
- Testing: Regularly test backup restoration
Real-World Architecture
High-availability setup for a financial application:
- Primary database with 2 synchronous replicas (RPO = 0)
- 3 read replicas in different regions
- Automated failover with Patroni (RTO < 30 seconds)
- Continuous WAL archiving to S3
- Daily backups with 30-day retention
- Result: 99.99% uptime, zero data loss incidents
Best Practices
- Monitor replication lag continuously
- Test failover procedures regularly
- Use connection pooling with read/write splitting
- Implement circuit breakers for replica failures (see the sketch after this list)
- Document recovery procedures
- Set appropriate RPO and RTO based on business needs
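A replica circuit breaker can be layered on top of the read/write routing shown earlier. A minimal sketch reusing the getReadPool and getWritePool helpers from that example; the thresholds are illustrative:
// Replica circuit breaker: after repeated replica failures, send reads
// to the primary for a cooldown period (reuses getReadPool/getWritePool
// from the earlier routing example)
const FAILURE_THRESHOLD = 3;
const COOLDOWN_MS = 30_000;

let consecutiveFailures = 0;
let breakerOpenUntil = 0;

async function readQuery(sql, params) {
  const useReplica = Date.now() >= breakerOpenUntil;
  const pool = useReplica ? getReadPool() : getWritePool();
  try {
    const result = await pool.query(sql, params);
    if (useReplica) consecutiveFailures = 0;
    return result;
  } catch (err) {
    // Production code should inspect err and only trip on connection errors
    if (!useReplica) throw err;
    consecutiveFailures += 1;
    if (consecutiveFailures >= FAILURE_THRESHOLD) {
      breakerOpenUntil = Date.now() + COOLDOWN_MS; // open the breaker
      consecutiveFailures = 0;
    }
    return getWritePool().query(sql, params); // retry once against the primary
  }
}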
Conclusion
Database replication is essential for high availability. Choose synchronous replication for zero data loss, or asynchronous for better performance. Implement read replicas to scale reads, use automatic failover for minimal downtime, and plan for disaster recovery. Monitor replication lag and test failover procedures regularly to ensure your system can handle failures gracefully.