Data Integrity

Using Apache Airflow and the Snowflake Data Warehouse to ingest Flume S3 data

Do you use Apache Flume to stage event-based log files in Amazon S3 before ingesting them into your database? Have you noticed .tmp files scattered throughout S3? Have you wondered what they are and how to deal with them? This article describes a simple solution to this common problem, using the Apache Airflow workflow manager and the Snowflake Data Warehouse.

Data Integrity Goal

Your goal is to ingest each event exactly once into your analytic database during ETL (extract-transform-load). You do not want to leave any events behind, nor do you want to ingest any event more than once. Otherwise, your event counts will be wrong. If we assume that any particular event is in exactly one log file, the goal becomes ingesting each log file exactly once. At Sharethrough, we have seen that this data integrity goal cannot be met without dealing with those darn .tmp files.
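To make the "each log file exactly once" goal concrete, here is a minimal sketch in plain Python. It is illustrative only and is not Sharethrough's actual pipeline. The function name, the example S3 keys, and the in-memory set of already-ingested keys are all hypothetical. The sketch also assumes that a key ending in .tmp should be skipped for now rather than loaded.

```python
def select_files_to_ingest(listed_keys, already_ingested):
    """Return the keys that are safe to load on this run.

    A key qualifies if it does not end in ".tmp" (assumption: such files
    are skipped for now) and has not been ingested before.
    """
    return sorted(
        key
        for key in set(listed_keys)  # de-duplicate the listing
        if not key.endswith(".tmp") and key not in already_ingested
    )


# Hypothetical S3 keys, for illustration only.
listed = [
    "events/2017/01/01/log.1.gz",
    "events/2017/01/01/log.2.gz",
    "events/2017/01/01/log.3.gz.tmp",
]
ingested = {"events/2017/01/01/log.1.gz"}

to_load = select_files_to_ingest(listed, ingested)
print(to_load)

# Record what was loaded so that a second run loads nothing twice.
ingested.update(to_load)
second_run = select_files_to_ingest(listed, ingested)
print(second_run)
```

The second call returns an empty list, which is the "never more than once" half of the goal. The skipped .tmp key is the "leave nothing behind" half: it still has to be accounted for somehow, which is why those files cannot simply be ignored.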
