A Complete Guide to PostgreSQL Change Data Capture (CDC)

PostgreSQL, also referred to as Postgres is one of the most common and widely used open-source relational databases. It is the preferred platform for several use cases ranging from data warehouse and analytics to OLTP workloads.

Typically, organizations handle the transactional workloads through a relational database and then have a separate data warehouse for performing the aggregated reporting and analytics use cases.

However, the issue is often to make sure that the current version of the data in the transactional database is in the data warehouse. It is not uncommon to have time-bound reporting needs that do not match hourly or daily batch sync between these databases.

To get around this problem, it is necessary to have continuous sync between the instances based on the Change Data Capture feature.

CDC or Change Data Capture is a software design pattern that tracks and monitors changes made to a database so that the required action can be taken on them. This functionality, based on the Postgres CDC, has many benefits that include:

  • Capturing change events in real-time by keeping data warehouses and other downstream systems in sync with PostgreSQL.
  • Reducing the load on PostgreSQL since only the relevant changes are processed.
  • Enabling the efficient implementation of use cases that require access to changes made in PostgreSQL without modifying the application code.

This post will take readers through the various aspects of the Postgres CDC, its three main approaches, and their pros and cons.

Trigger-based Postgres CDC

With the Postgres CDC Trigger model, it is possible to identify changes such as Insert, Delete, and Update taking place in the table of interest, and for each change, insert one row into a changing table, thereby building a changelog.

The Postgres generic Trigger function or code supports the 9.1 version and later stores all the change events in the audit.logged actions. This approach to Postgres CDC stores events that are captured inside PostgreSQL only.

If the need to sync change events to other data systems such as a data warehouse arises, the PostgreSQL table that holds the change events has to be recurringly queried. This often increases the implementation complexity of the process.

Pros and Cons of Postgres CDC with Triggers

Pros

  • Real-time processing of change events is enabled as any change is captured immediately
  • The PostgreSQL Trigger function automatically adds useful metadata to the change events. Examples include the transaction ID, the session user name, or the statement that caused the change.

Cons

  • The performance of Postgres is adversely impacted by trigger-based Postgres CDC as they increase the time of execution of the original statement.
  • Changes have to be made to the PostgreSQL database for triggers to work.
  • There is a need to set up a separate data pipeline that polls the table filled by the trigger function. This is when change events have to be synced to a data store other than the same PostgreSQL database.
  • Users face more operational complexity when creating and managing triggers.

Query-based Postgres CDC

The next way to implement Postgres CDC is the query-based method.

When the schema of the tracked database has a timestamp column that indicates when was the last time that a row has been changed, PostgreSQL has to be recurringly queried using that column.

This would provide all records that have been modified since the last time that PostgreSQL was queried. It has to be noted here though that query-based Postgres CDC cannot capture Delete and is limited to Update and Insert events only.

Pros and Cons of Query-based Postgres CDC

Pros

  • Changes are not required to be made to PostgreSQL to implement query-based CDC. However, this is only possible when the schema has a timestamp column that indicates the time of modification of the rows.

Cons

  • Additional stress is put on PostgreSQL since the query-based Postgres CDC uses the query layer for extracting data.
  • Since recurring polling of the monitored table is required for query-based CDC, resources are often wasted if data does not change.
  • A column needs to be present in this form of Postgres CDC that tracks the time of the last modification of records.
  • Delete changes cannot be captured by the Query-based Postgres CDC unless it is soft deletions.

Logical Replication-based Postgres CDC

The Logical Replication-based Postgres CDC was launched by PostgreSQL with its 9.4 version.

It can quickly and seamlessly replicate data between different PostgreSQL instances on different systems. Primarily and technically speaking, it’s a write-ahead log on disk that comprises all events that change the data in the PostgreSQL database such as Insert, Update, and Delete.

Several database systems use the replication model but it is not always enabled by default. On the other hand, in Postgres CDC, logical replication can be enabled by introducing changes to the configuration file.

Logical replication is implemented by a decoding plugin automatically but for PostgreSQL versions older than 10, the same has to be done manually. Most managed PostgreSQL services offer logical replication support including AWS RDS, Google Cloud SQL, or Azure Database.

Pros and cons of logical replication-based Postgres CDC

Pros

  • Log-based CDC captures changes to the data in real-time and hence downstream applications always get access to the current data from PostgreSQL.
  • As distinct from the other two approaches, log-based CDC detects all forms of changes such as Insert, Update, and Delete.
  • Implementing changes through logical replication-based Postgres CDC leads to direct access to file systems and hence does not in any way affect the performance of the PostgreSQL database.

Cons

Versions of PostgreSQL that are older than 9.4 do not support logical replication.

When comparing the three approaches, the winner is the logical replication Postgres CDC as it is highly efficient and captures all change types without affecting the performance of the database.