How Does Tokopedia Take Airflow to the Next Level?

The Data Engineering team at Tokopedia is the backbone of the whole Data team. We provide a highly available system that handles complex use cases and supports all business units in gathering insights from data. These insights then feed into decision-making, and even into building predictions. We believe data can support business use cases as well.

In the first phase, we used Airflow to support all of these cases with "physical" DAG (Directed Acyclic Graph) Python files. You can imagine how arduous this was: we had to create a DAG Python file for every single request, while also keeping and managing numerous scripts. Whether the request was for a table migration, an aggregated table, or a metric, we had to support it by creating a DAG manually.

“Physical” DAG (Directed Acyclic Graph) Python files
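To make "physical" concrete, here is a minimal sketch of the kind of one-off DAG file this phase required. The DAG id, table names, and SQL are hypothetical, and the operator choice (BigQueryInsertJobOperator from the Google provider) is just one reasonable way to express a BigQuery-to-BigQuery request, not necessarily what Tokopedia used:

```python
# A hypothetical hand-written "physical" DAG file: one file like this had
# to be created and maintained for every single request.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="agg_daily_orders",           # hypothetical request
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Build one aggregated table in BigQuery (illustrative SQL).
    build_aggregate = BigQueryInsertJobOperator(
        task_id="build_aggregate",
        configuration={
            "query": {
                "query": (
                    "SELECT order_date, COUNT(*) AS orders "
                    "FROM `project.raw.orders` GROUP BY order_date"
                ),
                "destinationTable": {
                    "projectId": "project",
                    "datasetId": "agg",
                    "tableId": "daily_orders",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )
```

Multiply a file like this by every migration, aggregation, and metric request, and the maintenance burden becomes clear.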

Faced with too many such requests (migration tables, aggregated tables, and so on) from business units, we finally made a plan to build an automation process in Airflow, as we did not want to maintain a separate script for every single request. One of the reasons is that we wanted to reduce code bloat in the repository.

Basically, the main pipelines were for migrating data from a database, daily reporting, aggregated tables, metric creation, and dumping data to external storage. So, in the first phase, we collected our cases and converted them into a metadata design.
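For illustration, one record in such a metadata design might look like the sketch below. Every field name here is a hypothetical stand-in, not Tokopedia's actual schema; the point is that a new request becomes a new row of metadata rather than a new Python file:

```python
# Hypothetical metadata record: one pipeline request expressed as data
# instead of a hand-written DAG file.
task_metadata = {
    "task_id": "agg_daily_orders",
    "pipeline_type": "bq_to_bq",   # or "bq_to_gcs", "bq_to_bigtable"
    "schedule": "@daily",
    "sql": (
        "SELECT order_date, COUNT(*) AS orders "
        "FROM `project.raw.orders` GROUP BY order_date"
    ),
    "destination": "project.agg.daily_orders",
    "write_mode": "WRITE_TRUNCATE",
    "depends_on": [],              # upstream task_ids, for chained pipelines
}
```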

In the first release, the Airflow Automation Job was intended to reduce the manual intervention of DAG creation for each request. For example:

Task Automation Workflow

Generator Tasks Workflow

The metadata cache is synchronized every 3 hours, or it can be refreshed manually. The Task Creator (Python) then connects to the metadata cache to get the task mappings and generates the tasks in a loop for each DAG. The Task Creator combines all of the functions in this feature: BigQuery to BigQuery, BigQuery to Google Cloud Storage, and BigQuery to Bigtable.

Task Metadata Architecture
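The following is a sketch of the Task Creator idea: read the task mapping from the metadata cache and generate tasks in a loop, instead of writing one file per request. The function fetch_task_metadata(), the record fields, and the operator choices are all hypothetical stand-ins under the assumptions above, not Tokopedia's actual implementation:

```python
# Sketch of a generator DAG: tasks are created in a loop from metadata,
# and cross-task dependencies are wired from the metadata as well.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import (
    BigQueryToGCSOperator,
)


def fetch_task_metadata():
    """Stand-in for reading the metadata cache (synchronized every 3 hours
    in the real system, or refreshed manually)."""
    return [
        {
            "task_id": "agg_daily_orders",
            "pipeline_type": "bq_to_bq",
            "sql": (
                "SELECT order_date, COUNT(*) AS orders "
                "FROM `project.raw.orders` GROUP BY order_date"
            ),
            "destination": {
                "projectId": "project",
                "datasetId": "agg",
                "tableId": "daily_orders",
            },
            "depends_on": [],
        },
        {
            "task_id": "export_daily_orders",
            "pipeline_type": "bq_to_gcs",
            "source_table": "project.agg.daily_orders",
            "gcs_uri": "gs://bucket/exports/daily_orders/*.csv",
            "depends_on": ["agg_daily_orders"],
        },
    ]


with DAG(
    dag_id="generated_daily_pipelines",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    metadata = fetch_task_metadata()
    tasks = {}
    for meta in metadata:
        if meta["pipeline_type"] == "bq_to_bq":
            tasks[meta["task_id"]] = BigQueryInsertJobOperator(
                task_id=meta["task_id"],
                configuration={
                    "query": {
                        "query": meta["sql"],
                        "destinationTable": meta["destination"],
                        "writeDisposition": "WRITE_TRUNCATE",
                        "useLegacySql": False,
                    }
                },
            )
        elif meta["pipeline_type"] == "bq_to_gcs":
            tasks[meta["task_id"]] = BigQueryToGCSOperator(
                task_id=meta["task_id"],
                source_project_dataset_table=meta["source_table"],
                destination_cloud_storage_uris=[meta["gcs_uri"]],
                export_format="CSV",
            )

    # Wire up dependencies declared in the metadata, so generated
    # pipelines can depend on each other.
    for meta in metadata:
        for upstream in meta["depends_on"]:
            tasks[upstream] >> tasks[meta["task_id"]]
```

Under this design, adding a pipeline means adding a metadata record; the loop picks it up on the next cache sync with no new code deployed.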

We have resolved the manual-intervention problem of pipeline creation. The automation now runs in the production environment and supports pipelines that can depend on each other.

With this Airflow Automation Pipeline, Business Analysts at Tokopedia can now improve an existing pipeline or integrate a new one in only 5 minutes. We are now scheduling more than 1,000 DAGs and more than 9,000 tasks every day.
