How we migrated 50TB of relational data using DMS (part 2)

Miguel Angel Fernandez-Nieto
Engineering at FundApps
8 min read · Feb 27, 2024


As we continue enhancing our infrastructure, it’s worth recapping the foundation laid in part one, where we introduced the challenges our company faced and the need to modernize our systems. Now we delve deeper into how we tackled those challenges and pushed innovation forward.

Infrastructure Overhaul: Terraform and DMS Settings

In terms of infrastructure, we built an internal Terraform module that let us deploy to all 300+ tenants simultaneously using Scalr. The module covered not only the DMS instance, but also the DMS endpoints connecting source and target, the replication tasks with their settings, and all the supporting infrastructure.
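The Terraform module itself is internal, but as a rough sketch, these are the resources it provisions per tenant, expressed through the equivalent AWS API calls. The engine names reflect our SQL Server to PostgreSQL setup; all identifiers, hosts, and sizes below are illustrative, not our real configuration.

import boto3

dms = boto3.client("dms")

# The EC2-backed worker that will run the replication tasks.
dms.create_replication_instance(
    ReplicationInstanceIdentifier="tenant-dms-instance",
    ReplicationInstanceClass="dms.c5.xlarge",  # illustrative sizing
    AllocatedStorage=500,
)

# Connection details for both sides of the migration.
dms.create_endpoint(
    EndpointIdentifier="tenant-source",
    EndpointType="source",
    EngineName="sqlserver",
    ServerName="source.tenant.internal",
    Port=1433,
    DatabaseName="TenantDb",
    Username="dms_user",
    Password="read-from-a-secret-store",
)
dms.create_endpoint(
    EndpointIdentifier="tenant-target",
    EndpointType="target",
    EngineName="postgres",
    ServerName="target.tenant.internal",
    Port=5432,
    DatabaseName="tenantdb",
    Username="dms_user",
    Password="read-from-a-secret-store",
)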

Under the hood, the Amazon DMS instance is an EC2 instance deployed within the chosen Virtual Private Cloud (VPC). While configuring replication, we tested the available settings extensively.

Amazon Database Migration Service

Fine-Tuning Data Migration: Key Settings and Strategies

Let’s first discuss the types of replication tasks provided by Amazon DMS. Currently, Amazon DMS offers two modes: Full Load and CDC (Change Data Capture). The key distinction is that Full Load performs a one-time copy of existing data, while CDC continuously monitors the source for changes and replicates them to the target. A third option combines the two: it migrates the existing data and then keeps the target in sync. In our case, we opted for Full Load mode with a planned maintenance window, as CDC presented some challenges we wanted to avoid; we kept it as a backup plan.
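For illustration, the mode is a single field on the replication task. A minimal boto3 sketch, where the ARNs, identifiers, and the select-everything mapping are placeholders rather than our real configuration:

import boto3
import json

dms = boto3.client("dms")

# A select-everything mapping for brevity; our real rules are shown below.
table_mappings = {"rules": [{
    "rule-type": "selection",
    "rule-id": "1",
    "rule-name": "include-dbo",
    "object-locator": {"schema-name": "dbo", "table-name": "%"},
    "rule-action": "include",
}]}

dms.create_replication_task(
    ReplicationTaskIdentifier="tenant-full-load",
    SourceEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:111111111111:rep:INST",
    MigrationType="full-load",  # the alternatives are "cdc" and "full-load-and-cdc"
    TableMappings=json.dumps(table_mappings),
)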

To minimize the duration of the maintenance window, we divided the replication job into two separate tasks; the shorter the downtime window, the sooner our clients can use the platform again. The first task migrated all existing data up to a cut-off point using Full Load mode during normal operational hours. AWS DMS lets you create a selection rule that filters rows on a column value (a timestamp, or in our case a monotonically increasing id), giving us two tasks: one for rows up to a specific value and another for rows after it. This approach leveraged the immutable nature of our data.

{
  "rule-type": "selection",
  "rule-id": "1",
  "rule-name": "select-all-tables-from-dbo",
  "object-locator": {
    "schema-name": "dbo",
    "table-name": "ResultData"
  },
  "rule-action": "include",
  "filters": [
    {
      "filter-type": "source",
      "column-name": "id",
      "filter-conditions": [
        {
          "filter-operator": "ste",
          "value": "MAXRESULTIDHERE"
        }
      ]
    }
  ]
}
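The second task’s selection rule is simply the complement. A sketch of it follows; the placeholder value is ours from above, offset by one, because both the less-than-or-equal and greater-than-or-equal operators are inclusive and the boundary row must not be copied twice:

{
  "rule-type": "selection",
  "rule-id": "2",
  "rule-name": "select-new-rows-from-dbo",
  "object-locator": {
    "schema-name": "dbo",
    "table-name": "ResultData"
  },
  "rule-action": "include",
  "filters": [
    {
      "filter-type": "source",
      "column-name": "id",
      "filter-conditions": [
        {
          "filter-operator": "gte",
          "value": "MAXRESULTIDHERE_PLUS_ONE"
        }
      ]
    }
  ]
}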

Testing and Optimization: Task Settings

During our testing phase, we identified several task settings worth discussing: CommitRate, MaxFullLoadSubTasks, InlineLobMaxSize, LobMaxSize, and TargetTablePrepMode. These settings played a significant role in shaping our data migration strategy, so let's look at each of them in turn; the sketch after the list shows where each one lives in the task-settings document:

  • CommitRate: This setting controls how many records DMS transfers together in a single batch during the full load, so it trades throughput against memory pressure and commit granularity. We tried a range of CommitRate values to find the optimal setting for our workload.
  • MaxFullLoadSubTasks: This controls how many tables (or table segments) are loaded in parallel during a full load, so it's essential for balancing resource utilization against transfer speed. The higher the value, the more load on the DMS instance. Because our tables come in very different shapes and sizes, raising MaxFullLoadSubTasks kept the instance busy and moved everything through noticeably faster.
  • InlineLobMaxSize: Efficiently handling Large Objects (LOBs) is crucial for a smooth migration. InlineLobMaxSize specifies the maximum size (in KB) of LOBs that DMS transfers inline with the row data rather than fetching separately; it applies when the task runs in full LOB mode. For most of our tables we wanted DMS to check all LOBs and transfer the small ones inline, since we have a mix of small and large LOBs and wanted to optimize the transfer.
  • LobMaxSize: Related to InlineLobMaxSize, LobMaxSize governs LOB handling in limited LOB mode: it sets the maximum size (in KB) for migrated LOBs, and anything larger is truncated. What we did here was measure the largest LOB column in the source and use that as the LobMaxSize.
  • TargetTablePrepMode: This setting determines how DMS prepares the target tables before loading: DROP_AND_CREATE, TRUNCATE_BEFORE_LOAD, or DO_NOTHING. After evaluating the modes, we settled on DO_NOTHING, because we wanted to stay in control and be able to rerun tasks without touching the target tables. Dropping or truncating them would have destroyed both the rows already loaded by the other task (remember, we had split the work across tasks) and the schema we had already deployed. DO_NOTHING avoided both drawbacks.
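For orientation, here is roughly where each of those knobs lives in the task-settings JSON that DMS accepts. The values below are illustrative defaults, not our production numbers; both LOB sizes are expressed in KB.

import json

task_settings = {
    "TargetMetadata": {
        "SupportLobs": True,
        "FullLobMode": False,        # when True, InlineLobMaxSize applies
        "LimitedSizeLobMode": True,  # when True, LobMaxSize caps (and truncates) LOBs
        "LobMaxSize": 32,            # KB
        "InlineLobMaxSize": 0,       # KB
    },
    "FullLoadSettings": {
        "TargetTablePrepMode": "DO_NOTHING",  # or DROP_AND_CREATE, TRUNCATE_BEFORE_LOAD
        "CommitRate": 10000,         # max records transferred per batch
        "MaxFullLoadSubTasks": 8,    # tables loaded in parallel
    },
}

# The whole document is passed to the task as a JSON string.
settings_json = json.dumps(task_settings)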

It’s also worth noting one more wrinkle we encountered: due to variations in table structures and underlying logic, we found it necessary to create distinct DMS tasks for specific tables. We therefore ended up managing three separate DMS tasks, each configured with settings tailored to its specific requirements.

Long Tables: For tables like dbo.Result, which can contain more than 1 billion rows, specific settings were crucial and were fine-tuned to optimize performance and resource utilization. LobMaxSize and InlineLobMaxSize were set to 32 KB: this table doesn't need large-object storage, so a small cap balanced the storage requirements. The CommitRate was set to 20,000, which determines how many records are committed per batch and kept the migration moving smoothly. MaxFullLoadSubTasks was configured as 24, so the migration could be divided across many parallel subtasks for efficient data transfer.

Big Blob Tables: This table has a 1:1 relationship with dbo.Result and is used to store BLOB data, so different settings were applied to handle large binary objects. LobMaxSize was increased to 175,000 KB to accommodate the largest objects, while InlineLobMaxSize was set to 0, as there was no need to transfer BLOBs inline. A higher MaxFullLoadSubTasks of 49 was chosen to maximize data transfer speed, and the CommitRate was raised to 50,000. The objective here was to move data as quickly as possible given the size of the BLOBs.

Other Tables: For the more typical tables, a common set of settings struck a balance between performance and resource utilization. LobMaxSize and InlineLobMaxSize were set to 32 KB, reflecting modest storage requirements for binary objects. The CommitRate was configured at 20,000, a reasonable batch size, and MaxFullLoadSubTasks was set to 23, breaking the process into manageable parallel subtasks.
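Pulling those three profiles together as data, here is a sketch of how the tasks could be parameterized; the numbers are exactly the ones above, the field names are DMS's own, but this is an illustration rather than our actual module code.

import json

PROFILES = {
    "long-tables":     {"LobMaxSize": 32,     "InlineLobMaxSize": 32,
                        "CommitRate": 20000,  "MaxFullLoadSubTasks": 24},
    "big-blob-tables": {"LobMaxSize": 175000, "InlineLobMaxSize": 0,
                        "CommitRate": 50000,  "MaxFullLoadSubTasks": 49},
    "other-tables":    {"LobMaxSize": 32,     "InlineLobMaxSize": 32,
                        "CommitRate": 20000,  "MaxFullLoadSubTasks": 23},
}

def settings_for(profile: str) -> str:
    """Build the ReplicationTaskSettings JSON for one of the three task types."""
    p = PROFILES[profile]
    return json.dumps({
        "TargetMetadata": {
            "SupportLobs": True,
            "LobMaxSize": p["LobMaxSize"],
            "InlineLobMaxSize": p["InlineLobMaxSize"],
        },
        "FullLoadSettings": {
            "TargetTablePrepMode": "DO_NOTHING",
            "CommitRate": p["CommitRate"],
            "MaxFullLoadSubTasks": p["MaxFullLoadSubTasks"],
        },
    })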

Deploying DB Schemas and Overcoming PostgreSQL Challenges

With the migration plan in place and the infrastructure deployed, the next step was to deploy the DB schemas to the target database. While Amazon DMS can migrate the schema from the source, we ran into problems caused by differences in how PostgreSQL stores defaults and formats. To address this, we ended up using DBUp, a set of .NET libraries that helps you deploy changes to databases such as SQL Server and PostgreSQL: it tracks which SQL scripts have already been run and executes only the change scripts needed to bring your database up to date. To run this across every tenant, we used an internal development tool called Hammond, which enabled us to deploy a serverless application across all tenants. Further information can be found here.

Constraints and Data Migration: ANext Data Migration + Airplane to the Rescue

It was important to disable constraints before starting the DMS tasks: in databases with a high volume of rows we discovered a notable impact on migration speed, because as the index size expanded, the growing amount of data that had to be cached gradually slowed the migration down. That's why we disabled constraints and indexes before starting DMS.
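As a sketch of the kind of script involved, assuming a PostgreSQL target; the constraint and index names here are purely illustrative, and in practice the statements lived in versioned SQL scripts rather than inline strings:

import psycopg2

# Drop the expensive bits before the full load; they are recreated afterwards.
DISABLE_SQL = """
ALTER TABLE dbo."ResultData" DROP CONSTRAINT IF EXISTS fk_resultdata_result;
DROP INDEX IF EXISTS ix_resultdata_result_id;
"""

with psycopg2.connect("dbname=tenantdb user=dms_user") as conn:
    with conn.cursor() as cur:
        cur.execute(DISABLE_SQL)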

Since we had already deployed a DB schema including primary keys, foreign keys, and other constraints, moving the data with those in place would have brought exactly the problems we wanted to avoid this time. So we opted to use an existing application called ANext Data Migration, which is essentially an ECS auto-scaling group that scales up when a message arrives in an SQS queue and executes SQL scripts in the tenant account where the DB lives.
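Triggering a migration script could look roughly like this; the queue URL and message shape are hypothetical stand-ins for ANext's actual contract:

import boto3
import json

sqs = boto3.client("sqs")

# Hypothetical queue and payload: one message per tenant, naming the script to run.
sqs.send_message(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/111111111111/anext-data-migration",
    MessageBody=json.dumps({
        "tenant": "tenant-042",
        "script": "disable_constraints.sql",
    }),
)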

Sending that message for every tenant is where Airplane came into play: it allowed us to execute tasks in each of the AWS tenants' accounts. With our infrastructure, DMS settings, and DB schema in place, it was time to start the DMS tasks.

Initiating and Monitoring DMS Tasks for 300+ Tenants

We encountered a challenge related to starting and stopping: how do you start more than 300 DMS tasks simultaneously? As mentioned earlier, ANext Data Migration plus Airplane let us trigger tasks for all 300+ tenants. The task itself was written in Python and called the AWS API to start or stop the replication tasks, which proved highly beneficial because we could kick off every tenant at the same time. Starting the tasks was not the significant challenge; tracking the progress of each tenant was. When did they finish? We tracked each tenant's progress through the AWS API and built a per-tenant dashboard to monitor progress and determine whether the process had finished.
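A minimal sketch of that start-and-poll logic, assuming each tenant's task ARN is already known; the ARN lookup and the dashboard wiring are omitted:

import time
import boto3

dms = boto3.client("dms")

def start_task(task_arn: str) -> None:
    # "resume-processing" and "reload-target" are the other start types.
    dms.start_replication_task(
        ReplicationTaskArn=task_arn,
        StartReplicationTaskType="start-replication",
    )

def wait_until_done(task_arn: str, poll_seconds: int = 60) -> str:
    """Poll the task until it stops or fails, reporting full-load progress."""
    while True:
        task = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
        )["ReplicationTasks"][0]
        status = task["Status"]
        pct = task.get("ReplicationTaskStats", {}).get("FullLoadProgressPercent", 0)
        print(f"{task_arn}: {status} ({pct}% of full load)")
        if status in ("stopped", "failed"):
            return status
        time.sleep(poll_seconds)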

We’re excited to continue this journey towards improvement. We’re all about doing great work and finding new ways to make things better, so stay tuned for the next part of the series. We’re on a path to a brighter future, and we’re glad you’re joining us for the ride!
