As part of the PantheraHive workflow "Data Migration Planner," this deliverable provides a comprehensive, professional output for step 2: gemini → generate_code. This output includes detailed configuration examples and a Python script designed to facilitate robust data migration planning, execution, and validation.
The goal of this project step is to generate code and configuration templates that encapsulate the critical aspects of a data migration. This includes defining field mappings, specifying transformation rules, outlining validation procedures, establishing rollback strategies, and providing a framework for timeline estimation. The generated code is designed to be highly configurable, maintainable, and extensible, serving as a foundation for your data migration efforts.
This section provides the core components for your data migration planning: a structured YAML configuration file and a Python script that interprets and executes the migration logic.
### 1. Configuration File: `migration_config.yaml`

This YAML file serves as the central definition for your data migration. It allows you to declaratively define source and target systems, specify which entities (tables/collections) to migrate, detail field mappings, and outline transformation and validation rules without modifying core code.
The migration_config.yaml is structured to cover all necessary aspects of a migration for multiple entities.
* `source_system`: Metadata about the source database/system.
* `target_system`: Metadata about the target database/system.
* `entities_to_migrate`: A list of dictionaries, where each dictionary represents an entity (e.g., a table, a collection) to be migrated.
  * `name`: A unique identifier for the migration entity (e.g., 'Users', 'Orders').
  * `source_entity_name`: The actual name of the entity in the source system (e.g., 'TBL_USERS').
  * `target_entity_name`: The actual name of the entity in the target system (e.g., 'users').
  * `field_mapping`: A list of mappings from source field names to target field names.
    * `source_field`: The name of the field in the source entity.
    * `target_field`: The name of the field in the target entity.
  * `transformation_rules`: A dictionary where keys are `target_field` names and values are transformation rule definitions.
    * `target_field`: The field in the target entity that requires transformation.
    * `rule`: A dictionary specifying the transformation function and its arguments.
      * `function`: The name of a predefined transformation function (see `data_migration_planner.py`).
      * `args`: A list of arguments to pass to the transformation function. These can be literal values or references to other source fields (e.g., `{'type': 'source_field', 'value': 'first_name'}`).
  * `validation_rules`: A list of validation rules to apply to the target data.
    * `field`: The target field to validate.
    * `rule`: A dictionary specifying the validation type and parameters.
      * `type`: The type of validation (e.g., 'not_null', 'unique', 'regex', 'custom_function').
      * `params`: Parameters specific to the validation type (e.g., `pattern` for regex).
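To make the schema concrete, a minimal illustrative `migration_config.yaml` for a single entity might look like the following. All system, entity, and field names here are placeholders, and the exact nesting of `rule` under `transformation_rules` is one reasonable reading of the schema above:

```yaml
source_system:
  name: "[Source System Name]"
  type: postgresql
target_system:
  name: "[Target System Name]"
  type: mysql
entities_to_migrate:
  - name: Users
    source_entity_name: TBL_USERS
    target_entity_name: users
    field_mapping:
      - source_field: USER_ID
        target_field: id
      - source_field: EMAIL_ADDR
        target_field: email
    transformation_rules:
      email:
        rule:
          function: to_lower
          args:
            - type: source_field
              value: EMAIL_ADDR
    validation_rules:
      - field: id
        rule:
          type: not_null
      - field: email
        rule:
          type: regex
          params:
            pattern: "^[^@]+@[^@]+\\.[^@]+$"
```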
---

### 2. Python Script: `data_migration_planner.py`

This Python script is the operational core of your data migration plan. It is designed to read the `migration_config.yaml`, provide extensible transformation and validation logic, and outline a framework for execution and rollback.

#### Purpose

* **Configuration Interpretation:** Load and parse the `migration_config.yaml`.
* **Extensible Logic:** Provide a framework for custom transformation and validation functions.
* **Migration Orchestration:** Outline the steps for data extraction, transformation, loading, and validation.
* **Rollback Framework:** Define a structured approach to rollback procedures.
* **Timeline Estimation:** Offer a module for estimating migration effort.

#### Core Components Explained

##### Configuration Loading (`load_migration_config`)

Loads the YAML configuration file, ensuring it is well-formed and accessible.

##### Data Transformation Engine (`TransformationEngine`)

A class that manages and applies transformation rules defined in the configuration. It includes common, reusable transformation functions and allows for easy addition of custom logic.

* **Predefined Transformations:** Examples include `to_upper`, `to_lower`, `format_datetime`, `concatenate_fields`, `map_boolean_to_string`, `generate_uuid_if_null`, `passthrough`, `convert_to_decimal`, `capitalize_first_letter`.
* **`apply_transformation` Method:** Takes a source row, a target field, and its rule, then executes the specified function, resolving arguments from source fields or literal values.

##### Data Validation Engine (`ValidationEngine`)

A class responsible for applying validation rules to transformed data. This helps ensure data quality before and after loading into the target system.

* **Predefined Validations:** Examples include `not_null`, `not_empty_string`, `is_numeric`, `is_datetime`, `regex`, `in_list`, `unique_check` (which would require external data access).
* **`validate_row` Method:** Iterates through all validation rules for an entity and reports any failures.

##### Rollback Procedure Framework (`RollbackProcedure`)

A conceptual class outlining the essential steps for a robust rollback strategy. This is crucial for minimizing risk during migration.

* **`pre_migration_backup`:** Steps to back up critical data in the target system.
* **`post_migration_restore`:** Steps to restore the target system to its pre-migration state.

##### Timeline Estimation Module (`estimate_timeline`)

A function that provides a high-level estimate of migration effort based on configurable parameters such as data volume, complexity, and number of entities. This helps in project planning.

#### Example `data_migration_planner.py`
The following detailed study plan for Data Migration Excellence is designed to equip a learner with the comprehensive knowledge and practical skills required for successful data migration projects.
This study plan outlines a structured approach to mastering the complexities of data migration. It is tailored for professionals seeking to develop expertise in planning, executing, and validating data migration initiatives. The plan integrates theoretical knowledge with practical application, covering all critical phases from initial assessment to post-migration activities.
Upon successful completion of this study plan, the learner will be able to plan, execute, and validate data migration initiatives end to end, applying the concepts covered in the weekly schedule below.
This schedule assumes approximately 10-15 hours of study per week, including reading, exercises, and project work.
Week 1: Introduction to Data Migration & Fundamentals
Week 2: Discovery & Assessment – Source & Target Analysis
Week 3: Planning & Strategy – Methodologies & Scope
Week 4: Field Mapping & Data Modeling
Week 5: Data Transformation & Cleansing Rules
Week 6: Data Extraction Techniques
Week 7: Data Loading Techniques
Week 8: Data Validation & Quality Assurance
Week 9: Testing, Cutover & Rollback Procedures
Week 10: Post-Migration Activities & Monitoring
Week 11: Security, Compliance & Best Practices
Week 12: Project Simulation & Review
* "Building a Data Warehouse for Dummies" (Wiley, for foundational data concepts).
* "The Data Warehouse Toolkit" by Ralph Kimball and Margy Ross (for data modeling and ETL principles).
* Specific books on ETL tools (e.g., "Microsoft SQL Server 2019 Integration Services: A Practical Guide").
* Coursera/edX/Udemy: "Data Engineering with Google Cloud," "AWS Certified Database – Specialty," "Azure Data Engineer Associate." Look for courses on ETL, data warehousing, and cloud migration.
* Vendor-Specific Training: Official training programs from Microsoft (Azure Data Factory, SSIS), AWS (DMS, Glue), Google Cloud (Dataflow), Informatica, Talend.
* Cloud Providers: AWS Database Migration Service (DMS) documentation, Azure Data Factory documentation, Google Cloud Dataflow documentation.
* ETL Tools: Official documentation for Talend, Informatica PowerCenter, Microsoft SSIS, Apache NiFi.
* Industry Reports: Gartner, Forrester reports on data migration trends, tools, and best practices.
* Tech blogs from major cloud providers (AWS, Azure, Google Cloud).
* Blogs from data migration solution providers and consulting firms.
* Medium, Towards Data Science for practical guides and case studies.
* Databases: PostgreSQL, MySQL, SQL Server (free developer editions).
* ETL Tools: Talend Open Studio (free), SQL Server Integration Services (SSIS - part of SQL Server Developer Edition), Python with pandas, petl libraries.
* Cloud Services (Free Tiers): AWS Free Tier (DMS, Glue), Azure Free Account (Data Factory), Google Cloud Free Tier (Dataflow).
This detailed study plan provides a robust framework for developing expertise in data migration. Consistent effort, hands-on practice, and engagement with real-world scenarios will be key to achieving the defined learning objectives and becoming a proficient data migration planner.
```python
import yaml
import datetime
import re
import uuid
import logging
from decimal import Decimal, InvalidOperation

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def load_migration_config(config_path: str) -> dict:
    """
    Loads the data migration configuration from a YAML file.

    Args:
        config_path (str): The path to the YAML configuration file.

    Returns:
        dict: The loaded configuration dictionary.

    Raises:
        FileNotFoundError: If the configuration file does not exist.
        yaml.YAMLError: If there is an error parsing the YAML file.
    """
    try:
        with open(config_path, 'r') as file:
            config = yaml.safe_load(file)
        logging.info(f"Configuration loaded successfully from {config_path}")
        return config
    except FileNotFoundError:
        logging.error(f"Configuration file not found at {config_path}")
        raise
    except yaml.YAMLError as e:
        logging.error(f"Error parsing YAML configuration file: {e}")
        raise


class TransformationEngine:
    """
    Manages and applies data transformation rules defined in the migration configuration.
    """
    def __init__(self):
        # Register common transformation functions
        self.transformations = {
            "passthrough": self._passthrough,
            "to_upper": self._to_upper,
            "to_lower": self._to_lower,
            "capitalize_first_letter": self._capitalize_first_letter,
            "format_datetime": self._format_datetime,
            "concatenate_fields": self._concatenate_fields,
            "map_boolean_to_string": self._map_boolean_to_string,
            "generate_uuid_if_null": self._generate_uuid_if_null,
            "convert_to_decimal": self._convert_to_decimal,
            # Add more custom transformation functions here as needed
        }

    # Minimal implementations of the registered transformations
    def _passthrough(self, value):
        return value

    def _to_upper(self, value):
        return value.upper() if isinstance(value, str) else value

    def _to_lower(self, value):
        return value.lower() if isinstance(value, str) else value

    def _capitalize_first_letter(self, value):
        return value[:1].upper() + value[1:] if isinstance(value, str) else value

    def _format_datetime(self, value, input_format, output_format):
        return datetime.datetime.strptime(value, input_format).strftime(output_format)

    def _concatenate_fields(self, *values, separator=" "):
        return separator.join(str(v) for v in values if v is not None)

    def _map_boolean_to_string(self, value, true_str="Yes", false_str="No"):
        return true_str if value else false_str

    def _generate_uuid_if_null(self, value):
        return value if value not in (None, "") else str(uuid.uuid4())

    def _convert_to_decimal(self, value):
        try:
            return Decimal(str(value))
        except InvalidOperation:
            logging.warning(f"Could not convert value to Decimal: {value!r}")
            return None
```
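The script excerpt covers configuration loading and the TransformationEngine; the ValidationEngine described earlier is not shown. A minimal sketch of how it might look, assuming the `type`/`params` rule structure from `migration_config.yaml` (the `allowed_values` parameter name is an assumption):

```python
import re
import logging


class ValidationEngine:
    """Applies validation rules from the migration config to transformed rows."""

    def __init__(self):
        # Map rule types to checker functions; each returns True when the value passes.
        self.validators = {
            "not_null": lambda value, params: value is not None,
            "not_empty_string": lambda value, params: value != "",
            "regex": lambda value, params: re.fullmatch(params["pattern"], str(value)) is not None,
            "in_list": lambda value, params: value in params["allowed_values"],
        }

    def validate_row(self, row: dict, validation_rules: list) -> list:
        """Returns a list of failure messages; an empty list means the row is valid."""
        failures = []
        for rule_def in validation_rules:
            field = rule_def["field"]
            rule = rule_def["rule"]
            checker = self.validators.get(rule["type"])
            if checker is None:
                logging.warning(f"Unknown validation type: {rule['type']}")
                continue
            if not checker(row.get(field), rule.get("params", {})):
                failures.append(f"Field '{field}' failed '{rule['type']}' validation")
        return failures
```

For example, validating `{'email': None}` against a `not_null` rule on `email` returns a single failure message, while a well-formed address passes a `regex` rule.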
Document Version: 1.0
Date: October 26, 2023
Prepared For: [Client Name]
Prepared By: PantheraHive Solutions
This document outlines the comprehensive plan for the data migration from [Source System Name] to [Destination System Name]. The objective is to ensure a secure, efficient, and accurate transfer of all identified data, minimizing business disruption and preserving data integrity. This plan details the scope, methodology, data mapping, transformation rules, validation procedures, rollback strategy, and an estimated timeline to guide the successful execution of this critical initiative.
The purpose of this document is to provide a detailed roadmap for migrating existing data from [Source System Name] (the "Source System") to [Destination System Name] (the "Destination System"). This plan will serve as the foundational guide for all project stakeholders, ensuring clarity, accountability, and a structured approach to the migration.
The migration will encompass the following data entities and modules:
The following data or functionalities are explicitly excluded from this migration plan:
* Historical data older than [e.g., 5 years].

**Source System:**

* Name: [Source System Name]
* Type: [e.g., ERP, CRM, Custom Application]
* Technology: [e.g., Microsoft SQL Server 2017, Oracle 12c, PostgreSQL]
* Key Modules: [e.g., Sales, Inventory, Finance, Customer Management]
* Data Volume: [e.g., 500 GB, 10 million records] across [e.g., 150 tables]
* Access Methods: [e.g., Direct database access, API endpoints, Flat file exports]

**Destination System:**

* Name: [Destination System Name]
* Type: [e.g., New ERP, Cloud-based CRM, Custom Application]
* Technology: [e.g., Azure SQL Database, AWS Aurora PostgreSQL, MongoDB]
* Key Modules: [e.g., Customer Management, Order Processing, Product Catalog]
* Access Methods: [e.g., API for bulk inserts, Direct database connection via secure tunnel]

Given the complexity and volume of data, a Phased Migration approach is recommended. This strategy allows for:
Phases:
A downtime window of [e.g., 12-24 hours] will be required during the final cutover phase for the delta load and system switch.

**Data Sources:**

* [Source System Database Name]
* [e.g., CSV exports for specific legacy modules]
* [e.g., for real-time customer updates]

Data will be extracted using a combination of:
* [e.g., Talend, Azure Data Factory, AWS Glue] connectors for specific data sources that require more complex extraction logic or API interaction.

This section outlines the detailed mapping of source fields to destination fields. A comprehensive mapping document will be maintained in [e.g., an Excel spreadsheet, a dedicated mapping tool] and will include all identified entities. Below is an illustrative example:
Entity: Customer
| Source Table.Field Name | Source Data Type | Destination Table.Field Name | Destination Data Type | Transformation Rule(s) | Notes / Comments |
| :---------------------- | :--------------- | :--------------------------- | :-------------------- | :--------------------- | :----------------------------------------------------------------------------------------------------------------------------- |
| Customers.CustomerID | INT | CRM.Customer.ExternalID | VARCHAR(50) | CAST(CustomerID AS VARCHAR(50)) | Source PK, mapped to an external ID field in CRM for traceability. |
| Customers.Name | NVARCHAR(255) | CRM.Customer.FullName | NVARCHAR(255) | TRIM() | Remove leading/trailing spaces. |
| Customers.Address1 | NVARCHAR(255) | CRM.Customer.StreetAddress | NVARCHAR(255) | TRIM() | |
| Customers.City | NVARCHAR(100) | CRM.Customer.City | NVARCHAR(100) | TRIM() | |
| Customers.StateCode | CHAR(2) | CRM.Customer.State | VARCHAR(2) | UPPER() | Standardize to uppercase two-letter codes. |
| Customers.ZipCode | VARCHAR(10) | CRM.Customer.PostalCode | VARCHAR(10) | TRIM() | |
| Customers.Email | NVARCHAR(255) | CRM.Customer.Email | NVARCHAR(255) | LOWER(), TRIM() | Convert to lowercase, remove spaces. |
| Customers.Phone | VARCHAR(20) | CRM.Customer.PhoneNumber | VARCHAR(20) | CleanPhoneNumber() | Custom function: Remove non-numeric characters, format as (XXX) XXX-XXXX. |
| Customers.CreationDate | DATETIME | CRM.Customer.CreatedAt | DATETIME | CONVERT_TZ('UTC') | Convert to UTC timezone. |
| Orders.OrderID | INT | CRM.Order.ExternalOrderID | VARCHAR(50) | CAST(OrderID AS VARCHAR(50)) | Source PK for orders. |
| Orders.OrderTotal | DECIMAL(10,2) | CRM.Order.TotalAmount | DECIMAL(10,2) | ROUND(Value, 2) | Ensure 2 decimal places. |
| Orders.OrderStatus | VARCHAR(50) | CRM.Order.Status | VARCHAR(50) | MapOrderStatus() | Custom function: Map source statuses to the destination system's status values. |
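The custom rules referenced in the table, `CleanPhoneNumber()` and `MapOrderStatus()`, are not defined in this document. A possible Python sketch is shown below; the status mapping table is purely hypothetical and would have to be derived from the source system's documented status codes:

```python
import re


def clean_phone_number(raw: str) -> str:
    """Remove non-numeric characters and format a 10-digit number as (XXX) XXX-XXXX."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return digits  # leave non-10-digit numbers unformatted for manual review


# Hypothetical source-to-destination status mapping; replace with the
# actual code table agreed during field-mapping workshops.
ORDER_STATUS_MAP = {
    "OPN": "Open",
    "SHP": "Shipped",
    "CMP": "Completed",
    "CAN": "Cancelled",
}


def map_order_status(source_status: str, default: str = "Unknown") -> str:
    """Map a source order status code to the destination system's value."""
    return ORDER_STATUS_MAP.get((source_status or "").strip().upper(), default)
```

Routing unmappable values to a `default` such as "Unknown" keeps the load running while flagging rows for data-quality review.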