How to update 1 billion rows without blocking the table
In today's data-driven world, managing large datasets efficiently is a common challenge for many organizations. A particularly tricky scenario arises when we need to update a massive table — say, one with 1 billion rows — without causing downtime or significantly impacting database performance.
This article explores a practical case where we added a new column to such a table and needed to update it based on values from another table, all without blocking access to the data.
The Problem
Our database contains a table with over 1 billion rows, a testament to the vast amount of data we manage daily. The challenge begins when we decide to add a new column to this table.
The requirement is not just to add an empty column but to populate it with data derived from another table.
The catch? This operation needs to be performed without locking the table or degrading the performance of the database, ensuring that the application relying on this data remains fully operational.
Updating a table of this size poses several challenges:
- Performance Impact: Traditional update operations on a table of this magnitude can severely impact database performance, leading to slow response times for other queries.
- Locking Issues: Large update operations often require table locks, preventing other operations from accessing the data and leading to application downtime.
- Write-Ahead Log (WAL) Pressure: A massive single-transaction update floods the write-ahead log and bloats the table with dead tuples, potentially filling up the allocated disk space (the single-statement sketch below shows the pattern to avoid).
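For context, the straightforward way to populate the new column would be a single set-based UPDATE like the sketch below, using the same placeholder table and column names as the script later in this article. On a billion-row table that one statement runs in a single transaction: it keeps every touched row locked until it finishes, writes the whole change to the WAL in one go, and leaves a dead tuple behind for every updated row.
-- Naive approach: one statement, one huge transaction (what we want to avoid)
UPDATE "Your_large_Data" YLD
SET "Your_New_Column" = YDD."Your_Derived_Column"
FROM "Your_Derived_Data" YDD
WHERE YLD."Your_Join_Column" = YDD."Id";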
The Solution
To address these challenges, we employed a multi-step strategy that minimized the impact on database performance and avoided locking the table for extended periods. Here’s how we did it:
Batch Processing: Instead of updating all 1 billion rows in a single transaction, we broke down the task into smaller, more manageable batches.
Committing each batch before starting the next keeps lock durations short and WAL growth manageable, and allows other operations to proceed with minimal interference.
Disabling Indexes: Indexes are normally there to speed up reads, but in our update of the 1-billion-row table the secondary (non-primary-key) indexes only added overhead, since every changed row also has to update its index entries. Taking them out of play for the duration of the update significantly improved performance; note that PostgreSQL cannot disable an index in place, so in practice this means dropping the indexes and recreating them afterwards.
This approach requires caution, as it temporarily degrades the performance of queries relying on these indexes.
After completing the updates, we promptly rebuilt the indexes to restore query efficiency to optimal levels, balancing speed and system integrity.
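One way to make that rebuild step reliable, even if the work is interrupted or spread across sessions, is to persist the index definitions before dropping anything. The sketch below uses a hypothetical scratch table named index_backup; the indexdef column of pg_indexes already contains the complete CREATE INDEX statement for each index.
-- Hypothetical scratch table holding the definitions of the indexes we drop,
-- so they can be recreated exactly once the update has finished.
CREATE TABLE IF NOT EXISTS index_backup (indexname TEXT PRIMARY KEY, indexdef TEXT);

INSERT INTO index_backup (indexname, indexdef)
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'Your_large_Data'
ON CONFLICT (indexname) DO NOTHING;

-- After the update: run each saved indexdef statement (optionally rewritten as
-- CREATE INDEX CONCURRENTLY), then drop index_backup.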
Optimization to Avoid Repeat Visits: To keep the update process efficient, we initialized the new column with a sentinel default value (0 in the script below) that never occurs among the derived values.
This technique is critical for avoiding repeat visits to rows that have already been updated.
By doing so, we ensured that each row was processed only once, significantly reducing the overall time and resources required for the update.
This step is crucial in scenarios where the process might be interrupted or executed in phases, as it prevents redundant processing of rows.
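In practice this means adding the column with the sentinel as its default rather than leaving it NULL. A minimal sketch, assuming the column is a BIGINT and that 0 never occurs among the derived values; since PostgreSQL 11, adding a column with a constant default is a metadata-only change, so the statement itself does not rewrite the billion-row table.
-- Add the new column pre-filled with the sentinel value 0.
-- In PostgreSQL 11+ existing rows are not physically rewritten; they simply
-- report the default until they are updated.
ALTER TABLE "Your_large_Data"
    ADD COLUMN "Your_New_Column" BIGINT NOT NULL DEFAULT 0;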
Monitoring and Adjusting: Throughout the process, we closely monitored the database's performance metrics.
This vigilance allowed us to adjust our batch size and concurrency settings in real-time, optimizing the update process without overburdening the system.
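In practice, monitoring can be as simple as keeping an eye on long-running or waiting statements while the batches execute. A rough sketch using the standard pg_stat_activity view:
-- Active statements, how long they have been running, and whether they are waiting
SELECT pid,
       now() - query_start AS runtime,
       wait_event_type,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;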
The Code
To get the most out of the strategies outlined above, expect to run the script below more than once: start with a conservative batch size, watch how the database responds, and adjust. Because rows that have already been updated are filtered out by the sentinel value, the script can be stopped and restarted safely, and each execution gives you feedback on whether the batch size, pauses, and index handling suit your configuration and workload.
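A simple way to gauge progress between executions is to count how many rows still carry the sentinel value. With no index on the new column this is a sequential scan of the large table, so run it sparingly; the NOTICE output of the script itself is a cheaper running total.
-- Rows not yet updated (full table scan: use sparingly on a billion-row table)
SELECT count(*) AS remaining_rows
FROM "Your_large_Data"
WHERE "Your_New_Column" = 0;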
Important Considerations
Execution of Generated Commands: The script only generates the DDL for dropping and later recreating the indexes; it does not execute it. Run the generated statements manually, or automate their execution, for example with psql's \gexec as shown after the script.
Performance Impact: Removing the indexes can dramatically speed up a large update, but do it with caution in a live production environment: queries that rely on those indexes lose their benefit until the rebuild is finished.
Monitoring and Adjustment: The script includes logging to help monitor its progress. Depending on the actual data and database performance, you might need to adjust the PageSize and TblCnt variables for optimal performance.
This script demonstrates an efficient approach to performing large-scale updates in PostgreSQL, especially useful for databases with a significant amount of data and high transaction rates.
-- PostgreSQL has no ALTER INDEX ... DISABLE, so "disabling" an index here means
-- dropping it before the update and recreating it from its saved definition afterwards.
-- This query captures each index definition (needed for the rebuild) and generates the
-- matching DROP statement. Keep the recreate_statement output somewhere safe.
-- Note: indexes that back constraints (e.g. the primary key) cannot be dropped this way.
SELECT indexdef || ';' AS recreate_statement,
       format('DROP INDEX %I;', indexname) AS drop_statement
FROM pg_indexes
WHERE tablename = 'Your_large_Data';
-- Anonymous code block to perform batch update operation
DO $$
DECLARE
    -- Batching variables
    PageNumber INT := 0;            -- Current batch number, starting from 0
    PageSize INT := 100000;         -- Number of distinct join keys to process per batch
    TblCnt INT := 1000000000;       -- Estimated total number of rows to update (safety cap for the loop)
    AffectedRows INT := 0;          -- Number of rows affected by the current batch
    TotalAffectedRows BIGINT := 0;  -- Total number of rows updated across all batches
BEGIN
    -- Keep going until the safety cap is reached or no unprocessed rows remain
    WHILE PageNumber * PageSize < TblCnt
    LOOP
        -- Log which batch is being processed
        RAISE NOTICE 'Processing batch % (up to % keys)', PageNumber + 1, PageSize;

        -- Perform the batch update
        UPDATE "Your_large_Data" YLD
        SET "Your_New_Column" = subquery."Your_Derived_Column"
        FROM (
            -- Fetch "Your_Derived_Column" from "Your_Derived_Data" for one batch of join keys
            SELECT YDD."Id" AS "Your_Join_Column", YDD."Your_Derived_Column"
            FROM "Your_Derived_Data" YDD
            WHERE YDD."Id" IN (
                -- Pick distinct join keys from rows that still hold the sentinel value 0,
                -- i.e. rows not yet updated. No OFFSET is needed: the sentinel filter itself
                -- advances the batch window, so keys are neither skipped nor revisited.
                SELECT DISTINCT OYLD."Your_Join_Column"
                FROM "Your_large_Data" OYLD -- not-yet-updated rows
                WHERE OYLD."Your_New_Column" = 0
                LIMIT PageSize
            )
        ) AS subquery
        WHERE YLD."Your_New_Column" = 0
          AND YLD."Your_Join_Column" = subquery."Your_Join_Column";

        -- Retrieve and log the number of rows affected by the current batch
        GET DIAGNOSTICS AffectedRows = ROW_COUNT;
        RAISE NOTICE 'Affected Rows: %', AffectedRows;

        -- Stop once a batch updates nothing (all remaining keys are either done
        -- or have no match in "Your_Derived_Data")
        EXIT WHEN AffectedRows = 0;

        -- Update the running total and move on to the next batch
        TotalAffectedRows := TotalAffectedRows + AffectedRows;
        PageNumber := PageNumber + 1;

        -- Commit the batch so its row locks are released and the work so far is durable,
        -- instead of holding one huge transaction open for the whole run.
        -- Requires PostgreSQL 11+ and running this DO block outside an explicit transaction block.
        COMMIT;

        -- Optional: pause briefly to give other transactions room to run
        PERFORM PG_SLEEP(1); -- Adjust the sleep time as needed
    END LOOP;

    RAISE NOTICE 'Total Affected Rows: %', TotalAffectedRows;
END
$$;
-- Recreate the indexes after the batch update completes, by executing the
-- recreate_statement values captured before the indexes were dropped.
-- For a table of this size, consider rewriting them as CREATE INDEX CONCURRENTLY
-- so the rebuild does not block concurrent writes (slower, and it must run outside
-- a transaction block).
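If you are working from psql, the generated DDL does not have to be copied by hand: the \gexec meta-command executes every value returned by the preceding query as its own statement. A sketch for the drop step (the recreate step works the same way against the saved definitions); this really does drop the indexes, so review the output of the plain SELECT first.
-- Generate and immediately execute the DROP statements for the table's indexes.
-- Indexes that back constraints (e.g. the primary key) will raise an error.
SELECT format('DROP INDEX %I', indexname)
FROM pg_indexes
WHERE tablename = 'Your_large_Data'
\gexec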
Conclusion
Our update strategy for a 1-billion-row table exemplifies the importance of balance and adaptability in large-scale database management. By employing batch processing, careful index management, and real-time monitoring, we minimized operational impact while ensuring seamless functionality. Our approach underscores the significance of strategic planning and continuous improvement in navigating complex database challenges effectively.