
Efficient Log Management for Airflow on Kubernetes: A Practical Guide by Agilno

At Agilno, we strive to provide seamless and efficient solutions for our clients. One of our recent projects involved developing a marketing reporting tool for an out-of-home (OOH) marketing company. This tool aggregates data from various sources to generate comprehensive reports that can be easily shared with customers.

Project Overview: Marketing Reporting Tool

Our client, an out-of-home marketing company, needed a robust reporting tool that could consolidate data from multiple sources, including digital and static billboards, social media platforms, ad tracking services, and spreadsheet inputs. The tool is designed to create detailed reports that offer valuable insights into the performance of their marketing campaigns. To achieve this, we leveraged Apache Airflow to orchestrate the data collection, processing, and report generation tasks.

Understanding the Tools

  • Apache Airflow is an open-source platform used to programmatically author, schedule, and monitor workflows. It’s widely adopted for its robust scheduling capabilities and extensibility.
  • Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. It’s essential for orchestrating complex applications in a scalable and resilient manner.
  • Amazon S3 (Simple Storage Service) is a scalable object storage service provided by AWS, known for its durability, availability, and performance.

The Challenge: Log Retention in Airflow

When running Apache Airflow in a Kubernetes cluster, each task writes its logs to the local filesystem of the worker pod that executes it. These logs are ephemeral – they are lost once the run completes and the worker pods are terminated. This poses a significant challenge for monitoring, debugging, and auditing.

Real-Life Example: Why We Chose S3 for Log Storage

During the development of our marketing reporting tool, we encountered a significant issue. One of our team members was trying to debug a DAG that was failing in the staging environment but working perfectly on his local machine. However, he could not see the logs from the DAG after the run completed, making it difficult to diagnose the problem.

This lack of access to logs was a major bottleneck in our development process. We realized that to effectively debug and monitor our Airflow tasks, we needed a reliable and persistent log storage solution. This led us to implement log storage in Amazon S3.

By using S3 for saving logs, we were able to:

  • Easily Debug Issues: With logs readily accessible in S3, we could quickly identify and resolve issues that occurred in different environments.
  • Track DAG Runs: We could monitor what was happening after each DAG run, providing better insights and control over our workflows.
  • Improve Collaboration: Team members could access the logs from any environment, facilitating better collaboration and faster problem-solving.

Our Solution: Using Amazon S3 Buckets for Log Storage

To address this issue, we decided to store the logs in an Amazon S3 bucket. Here’s a step-by-step guide on how we implemented this solution.

Step 1: Configure S3 Connection in Airflow

First, you need to set up a connection in Airflow to your S3 bucket. This involves adding your AWS credentials and specifying the S3 bucket details.

In the Airflow web interface, navigate to **Admin > Connections**.
Add a new connection with the following parameters (a programmatic alternative is sketched below):
- **Conn Id**: `aws_s3_log_storage` (this must match the `REMOTE_LOG_CONN_ID` configured in Step 2)
- **Conn Type**: `Amazon Web Services`
- **Extra**: `{"aws_access_key_id": "YOUR_ACCESS_KEY", "aws_secret_access_key": "YOUR_SECRET_KEY"}`
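
If you prefer to manage connections as code rather than through the UI, the same connection can be created directly in Airflow's metadata database. The sketch below is an illustration of that approach, assuming Airflow 2.x and a container that can reach the metadata database; it is not part of our original setup:

```python
# Sketch: creating the S3 logging connection programmatically (Airflow 2.x assumed).
# Run inside an Airflow container that can reach the metadata database.
import json

from airflow import settings
from airflow.models import Connection

session = settings.Session()

conn = Connection(
    conn_id="aws_s3_log_storage",   # must match REMOTE_LOG_CONN_ID in Step 2
    conn_type="aws",                # shown as "Amazon Web Services" in the UI
    extra=json.dumps(
        {
            "aws_access_key_id": "YOUR_ACCESS_KEY",
            "aws_secret_access_key": "YOUR_SECRET_KEY",
        }
    ),
)

# Only add the connection if it does not already exist.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
```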

Step 2: Update Airflow Configuration via Helm Release

Instead of editing the `airflow.cfg` file directly, we updated the values of our Airflow Helm release. This approach keeps the logging configuration consistent across environments such as development, staging, and production. The following environment variables were added to the Helm values file (a quick way to confirm they were picked up is sketched after the snippet):

```yaml
env:
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID"
    value: "aws_s3_log_storage"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "s3://bucket-name/airflow-logs"
```
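
To double-check that the scheduler and worker pods actually picked up these values, you can read them back through Airflow's configuration API from inside a pod (for example after `kubectl exec` into it). This is a minimal sketch assuming Airflow 2.x section and option names:

```python
# Sketch: reading the remote logging settings back from Airflow's configuration.
from airflow.configuration import conf

print(conf.getboolean("logging", "remote_logging"))    # expect: True
print(conf.get("logging", "remote_log_conn_id"))       # expect: aws_s3_log_storage
print(conf.get("logging", "remote_base_log_folder"))   # expect: s3://bucket-name/airflow-logs
```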

Step 3: Verify Logs in S3

After setting up the above configurations, run your DAG and verify that the logs are being stored in your S3 bucket. Navigate to your S3 console and check the `airflow-logs/` directory to ensure the logs are properly saved and accessible.
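
If you would rather verify this from a script than from the S3 console, a short boto3 listing works as well. The bucket name and prefix below are the placeholders from the configuration above, and credentials are assumed to come from boto3's standard resolution chain (environment variables, IAM role, etc.):

```python
# Sketch: listing task logs under the configured prefix with boto3.
import boto3

s3 = boto3.client("s3")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bucket-name", Prefix="airflow-logs/"):
    for obj in page.get("Contents", []):
        # Each key corresponds to one task attempt's log file.
        print(obj["Key"], obj["Size"])
```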

Step 4: Pull Logs into Central Log Monitoring System

To ensure that logs are easily accessible and centrally monitored, we configured a central log monitoring system to pull logs from the S3 bucket. This system aggregates logs from various sources, providing a unified view of all logs for easier debugging, monitoring, and analysis.
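
The exact integration depends on the monitoring system in use; many platforms can ingest directly from an S3 bucket. Purely as an illustration, a small forwarder could poll the log prefix and push each file to an ingestion endpoint – the endpoint URL and the use of `requests` below are assumptions for the sketch, not part of our actual setup:

```python
# Sketch: forwarding log files from S3 to a central log system.
import boto3
import requests

BUCKET = "bucket-name"                            # placeholder from the configuration above
PREFIX = "airflow-logs/"
INGEST_URL = "https://logs.example.com/ingest"    # hypothetical ingestion endpoint

s3 = boto3.client("s3")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Download each log file and ship it with its S3 key as metadata.
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        requests.post(INGEST_URL, data=body, headers={"X-Log-Key": obj["Key"]})
```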

Benefits of Using S3 for Log Storage

By storing Airflow logs in an S3 bucket, we achieved:

  • Persistence: Logs are retained beyond the lifespan of worker pods.
  • Accessibility: Logs can be accessed from anywhere, aiding in debugging and monitoring.
  • Scalability: S3’s scalability ensures that log storage can grow with our needs without any manual intervention.

At Agilno, we are committed to providing robust solutions that enhance operational efficiency. Implementing S3 for Airflow log storage is just one of the ways we ensure reliable and accessible workflows for our clients.

For more technical insights and solutions, stay tuned to our blog or contact us at hello@agilno.com.