100% Open Source

Transforming Big Data

Apache Spark. Extended. Declarative.

Flowman is a declarative ETL framework and data build tool powered by Apache Spark. It reads, processes and writes data from and to a huge variety of physical storage systems, such as relational databases, files, and object stores. It can easily join data sets from different source systems to create an integrated data model. This makes Flowman a powerful tool for building complex data transformation pipelines for the modern data stack.

For defining all data sources, sinks and the transformations between them, Flowman follows a purely declarative approach using plain YAML files. Developers can focus on the business logic, while Flowman takes care of executing the data flow and managing the data models.
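A minimal sketch of what such a YAML specification can look like. The entity names (`raw_orders`, `orders_cleaned`, etc.), file locations, and the exact set of properties are illustrative assumptions; the available kinds and their options should be checked against the Flowman documentation:

```yaml
# Source relation: where raw data lives (names and location are examples)
relations:
  raw_orders:
    kind: file
    format: csv
    location: "s3://my-bucket/raw/orders/"

# Mappings: the transformation logic, expressed declaratively
mappings:
  orders:
    kind: relation
    relation: raw_orders
  orders_cleaned:
    kind: sql
    sql: "SELECT order_id, amount FROM orders WHERE amount > 0"

# Target: writes a mapping's result into an output relation
targets:
  publish_orders:
    kind: relation
    mapping: orders_cleaned
    relation: orders_table

# Job: groups build targets into an executable unit
jobs:
  main:
    targets:
      - publish_orders
```

Flowman reads such specifications, derives the execution order from the dependencies between entities, and runs the resulting flow on Spark.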

Being built on top of Apache Spark, Flowman can process small amounts of data on a local machine or scale out to large clusters (Hadoop, Kubernetes, AWS EMR, Azure Synapse) to process terabytes of data.

Transform data with Flowman and use it for BI, ML or Analytics

How you will benefit from Flowman

Flowman Declarative Spark

01. Simple to learn

Lightweight specification of data models, transformations and build targets using declarative syntax instead of complex application code.

Flowman Interactive Shell

02. Modern

Modern development methodology following the "everything is code" approach, enabling collaboration via any version control system. Support for self-contained unit tests, automatic documentation, and data quality checks.

Flowman Execution Phases

03. Batteries included

Full lifecycle management of your data models, including creating target tables, automatically migrating them, and, when desired, finally removing them. Automatic documentation of data flows including lineage and quality checks. A job history server. Business-defined execution metrics.

Flowman removes complexity by splitting up chains of transformations into small blocks.

01. Modular Design

Split complex transformations into small, reusable blocks and implement them step by step, with self-contained unit tests at any point in the chain. Flowman still executes the whole flow efficiently by applying end-to-end query optimizations.

Flowman automatically manages your data models.

02. Data Models

Design all details of your data models, grow and evolve them over time. Add descriptions and expectations to columns. Flowman will create the physical model including descriptions and automatically migrate existing tables to the desired state.
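An illustrative sketch of a relation with an explicitly managed schema. The table, column names, and descriptions are made-up examples, and the exact schema syntax (e.g. the schema `kind`) may differ between Flowman versions, so verify against the documentation:

```yaml
relations:
  orders_table:
    kind: hiveTable        # other kinds such as jdbcTable also exist
    database: analytics    # example database/table names
    table: orders
    schema:
      kind: inline
      fields:
        - name: order_id
          type: string
          description: "Unique order identifier"
        - name: amount
          type: decimal(10,2)
          description: "Order amount; expected to be non-negative"
```

From such a definition, Flowman can create the physical table with column descriptions and later migrate it when fields are added or changed.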


03. Declarative Spark

Describe sources, sinks and transformations declaratively instead of writing Spark application code. Flowman takes care of executing the whole flow without sacrificing performance by applying end-to-end query optimizations.


04. Quality Gates

Implement self-contained unit tests for your business logic by mocking the real data sources. Add data quality checks before and after pipeline execution. Collect meaningful execution metrics including data quality.
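A sketch of what such a self-contained test can look like in YAML. It assumes the `orders` and `orders_cleaned` mappings from a project spec; the property names shown here (`overrideMappings`, the `values` mapping, the `sql` assertion) follow the general shape of Flowman's testing support but should be verified against the current documentation:

```yaml
tests:
  test_orders_cleaned:
    # Replace the real source mapping with inline mock records
    overrideMappings:
      orders:
        kind: values
        columns:
          order_id: string
          amount: double
        records:
          - ["o1", 10.0]
          - ["o2", -5.0]
    # Assert properties of the transformation result via SQL
    assertions:
      no_negative_amounts:
        kind: sql
        tests:
          - query: "SELECT COUNT(*) FROM orders_cleaned WHERE amount < 0"
            expected: 0
```

Because the real sources are mocked, such tests run without access to production systems and can be executed in CI.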

Flowman blog and change log

Flowman 1.1.0 released

We are happy to announce the release of Flowman 1.1.0. This release contains many small improvements and bugfixes. Flowman now finally supports Spark 3.4.1. Major …

Read More »

Flowman at Smartclip

smartclip is a successful and growing company specializing in online video advertising. More importantly, smartclip was one of the first companies to implement Flowman for their …

Read More »

Flowman 1.0.0 released

Flowman version 1.0.0 has finally arrived. For several years, multiple companies have been using Flowman in production as a robust and reliable solution for efficiently building …

Read More »

Flowman 1.0 has landed

We are excited and proud to announce the official release of Flowman 1.0. Flowman is a tool for performing complex data transformations in a structured …

Read More »

Projects delivered with Flowman

Online Advertising

Online advertising produces huge amounts of data on a daily basis. To provide meaningful insights, all this data needs to be integrated and aggregated along meaningful dimensions. Flowman has been implemented successfully to create multiple pre-aggregated data marts. Thanks to the declarative specification, business experts can easily be involved in reviews.

Complex ETL

Flowman has been successfully implemented in a microservice project in the financial services industry. The project uses Kafka for intra-service communication, and Flowman processes the relevant messages into a data lake fed from Kafka, without the need to connect to each service individually.

Customer-facing reporting

The art of making sense of millions of detailed records from multiple source systems by providing a high-level, holistic view is at the core of customer-facing reporting in B2B scenarios. Flowman is the right tool for integrating different data sources, applying complex business logic and storing aggregated tables into your reporting backend.