Simple ETL pipeline using Python, MySQL, and crontab
Here’s a simple ETL (extract, transform, load) process that can be setup quickly using Python, MySQL, and crontab. For more complicated processes, it’s better to substitute crontab with a more sophisticated tool, like Airflow. This setup is a ‘naive’ setup and I’ll probably iterate on it, but for a simple project it seems to be working.
Project Planning
When planning the development of this project my objectives were:
- To only use Python, MySQL, and crontab for the ETL pipeline. Partly due to constraints on where I will host the production environment and also my programming experience.
- Not have to make any changes in the development environment code and the production code. If it works in development, a push to GitHub from development, and a pull from GitHub in production should be the only required steps.
- Not have to deal with installing MySQL in my development machine. Instead, create a docker container for the development MySQL server and use volumes to persist the data.
Project Setup
The general idea is:
- Create a GitHub repository for your ETL process. Make a clone of it in your development environment and in your production environment. In my case, the development environment is my laptop running Ubuntu, and the production environment is my personal website hosted on Electric Embers running FreeBSD.
- Setup the local development environment as shown in the figure. It consists of the local repository and a dockerized MySQL server. Code your Python scripts and test the code.
- When the ETL pipeline has been debugged in the development environment and is ready for production, push the main branch to the GitHub repo and pull it from the production environment. The code should be ready to be run in the production environment with no further modifications required.
- Since one repository is used for both development and production environments, the code logic will be the same. Any logic that works in th development environment should be exactly the same as the production environment. However, any configuration settings that are specific to development or production are stored in either environment variables or configuration files listed in .gitignore. In addition, there may be a difference in terms of how the pipeline is implemented. For example, in the development environment you may manually run Python scripts but in the production environment, a server such as httpd or cron may run the Python scripts. See more about this in this post .
- Finally, code your crontab file in the development environment to call your Python scripts.