Airflow: your ally for automating machine learning and data pipelines
Enrica Pasqua, Bahadir Uyarer
After working hard to develop a machine learning model, you know that there is still a relatively small step left: moving it to production.
In a common scenario, what you probably want is a workflow that automates:
- gathering and preprocessing the data
- running inference on them
- storing the predictions
Ideally, you would want a tool that helps you:
- deal with big data
- guarantee robustness and resilience
- execute your workflows on a schedule or when certain pre-conditions are met
- resolve dependencies between tasks
If until today you have been using cron to schedule jobs, this could be the right time to adopt a well-established tool like Apache Airflow to handle this complexity.
Apache Airflow is an open source project, written in Python, for programmatically authoring, scheduling and monitoring batch execution of tasks.
You can design your pipelines according to the logic you need: decide which actions to perform, retry them if errors occur, skip tasks if dependencies are not met, monitor execution status and access logs through a friendly and powerful web UI, and a lot more.
A very nice feature of Airflow is that all of the above is configured and defined in Python code. Airflow pipelines can therefore benefit from the advantages of the software development process, such as peer review, automated testing and version control.
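To make this concrete, here is a minimal sketch of what such a pipeline definition (a DAG) can look like. The task functions and names below are placeholders rather than the workshop code, and the imports follow the Airflow 1.10 style shipped in the puckel/docker-airflow image:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder callables standing in for the real pipeline steps.
def gather_data():
    pass

def run_inference():
    pass

def store_predictions():
    pass

default_args = {
    "owner": "airflow",
    "retries": 2,                        # retry a failed task twice
    "retry_delay": timedelta(minutes=5),
    "start_date": datetime(2019, 10, 1),
}

dag = DAG(
    "inference_pipeline",
    default_args=default_args,
    schedule_interval="@daily",          # run once per day
)

gather = PythonOperator(task_id="gather_data", python_callable=gather_data, dag=dag)
infer = PythonOperator(task_id="run_inference", python_callable=run_inference, dag=dag)
store = PythonOperator(task_id="store_predictions", python_callable=store_predictions, dag=dag)

# Declare the dependencies: gather -> infer -> store
gather >> infer >> store

Once a file like this lives in Airflow's DAG folder, the scheduler picks it up, triggers a run every day and retries failing tasks according to the arguments above.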
In this workshop we’ll go over basic Airflow concepts and set up an instance to orchestrate an inference pipeline for a machine learning model.
Details for Audience
- It assumes no previous Airflow knowledge.
- The main purpose is to create a basic training and inference pipeline with Airflow.
- It is not about a particular model / ML method.
- It's not an advanced Airflow workshop.
- It is not suitable for Python beginners.
Workshop Requirements
- Docker installed.
- Any editor (Sublime, PyCharm, Vim, Atom).
- Verify that Docker works properly.
- Ensure that at least 4 GB of RAM is allocated to the Docker Engine (this can be done in the Preferences section of the Docker desktop app; restart Docker afterwards).
- Download the Airflow Docker image:
docker pull puckel/docker-airflow
- Clone the repository into the $HOME directory:
git clone https://github.com/deliveryhero/pyconde2019-airflow-ml-workshop
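As a quick smoke test of the setup, you can start the image in standalone mode, following the usage described in the puckel/docker-airflow README (the exact command used in the workshop may differ):

docker run -d -p 8080:8080 puckel/docker-airflow webserver

The Airflow web UI should then be reachable at http://localhost:8080.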
Enrica Pasqua
Affiliation: Delivery Hero SE
Enrica works in Berlin as a Senior Data Engineer at Delivery Hero, where she develops and maintains large scale data pipelines using Python. Her interests include Big Data Architecture, Process Automation and Machine Learning.
Bahadir Uyarer
Affiliation: Delivery Hero SE
Bahadir is a Data Scientist in the Global Marketing Tech Department of Delivery Hero, where he creates various data products to increase the efficiency of marketing activities. Before jumping into the tech world, he worked as a research economist in Istanbul. He holds an M.A. in Public Policy (Applied Economics) and an M.Sc. in Economics, and is pursuing his PhD at Bogazici University.