100% Open Source

Flow through data pipelines with Flowman

Flowman is a powerful open source data build tool powered by Apache Spark. It follows a declarative approach to simplify writing complex ETL, ELT and data transformation applications. Its strong focus on transformation and schema management reduces the development effort for creating robust data pipelines.

Built on top of Apache Spark, Flowman runs as a standalone application but can also scale out on compute clusters (Hadoop & Kubernetes) to process any amount of data.

Focus on business logic instead of Spark boilerplate code!

Flowman Declarative Data Flows with Spark

Transform data with Flowman and use it for BI, ML or Analytics

Flowman explained

How you will benefit from Flowman

Flowman Declarative Code

01. Simple to learn

Lightweight specification of data models, transformations and build targets using declarative syntax instead of complex application code.
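To give a flavour of that declarative syntax, here is a sketch in the style of Flowman's YAML specifications. All names (relations, mappings, the target) and file locations are hypothetical, and exact keys may vary between Flowman versions:

```yaml
# Relations model physical data sources and sinks (names and locations are made up)
relations:
  measurements_raw:
    kind: file
    format: csv
    location: "/data/raw/measurements"
  measurements_agg:
    kind: hiveTable
    database: "reporting"
    table: "measurements_agg"

# Mappings describe the transformation logic between relations
mappings:
  measurements:
    kind: relation
    relation: measurements_raw
  aggregates:
    kind: aggregate
    input: measurements
    dimensions: ["year", "station"]
    aggregations:
      avg_temperature: "AVG(temperature)"

# Targets tie a mapping's output to an output relation
targets:
  publish_aggregates:
    kind: relation
    mapping: aggregates
    relation: measurements_agg
```

Flowman derives the physical Spark execution plan, and the lifecycle of the target table, from such a specification, so no Spark application code needs to be written.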

Flowman Interactive Shell

02. Modern

Modern development methodology following the "everything is code" approach, supporting collaboration via any VCS. Support for self-contained unit tests, automatic documentation and data quality checks.
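For example, a self-contained unit test can be expressed in the same declarative style by overriding an input mapping with literal test records and asserting on the result via SQL. The sketch below is hypothetical; names and exact keys may differ between Flowman versions:

```yaml
tests:
  test_average_temperature:
    # Replace the real input mapping with literal in-memory test records
    overrideMappings:
      measurements:
        kind: values
        columns:
          - name: station
            type: string
          - name: temperature
            type: double
        records:
          - ["DE-01", 10.0]
          - ["DE-01", 20.0]
    # Assert on the output of the mapping under test
    assertions:
      average_is_correct:
        kind: sql
        tests:
          - query: "SELECT AVG(temperature) FROM measurements"
            expected: [[15.0]]
```

Because such tests need no external infrastructure, they can run in any CI pipeline alongside the project code.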

Flowman Execution Phases

03. Batteries included

Full lifecycle management of your data models, including creating target tables, automatic migration and eventual removal. Automatic documentation of data flows including lineage and quality checks. Job history server. Business-defined execution metrics.

Flowman blog and change log

The new version 0.30.0 of Flowman has been released with better control over execution plans and with a new “observe” mapping to capture data-dependent metrics as records flow through the system. Spark 3.2 is now officially supported on Cloudera CDP 7.1.

A new version of Flowman has been released containing new features such as an “iterativeSql” mapping and a Maven module “flowman-spark-dependencies”, which simplifies creating and building Flowman projects with Maven.

A new version of Flowman has been released containing minor but important improvements for working with MariaDB, MySQL and SQL Server / Azure SQL databases as sinks. Moreover, the workflow for creating a new Flowman project has been simplified by adding a new Maven archetype.

The new version 0.27.0 of Flowman has been released. Among many improvements and new features, this release takes the support for working with JDBC data sources and sinks even further. You can now execute arbitrary SQL commands as part of the build process to provide a way to handle database specific features.

This latest release of Flowman focuses strongly on improving support for JDBC targets like MariaDB/MySQL, Postgres, MS SQL Server, Azure SQL and Oracle. For example, column collations and comments are now correctly propagated into relational databases, changing the primary key is now supported, and much more.

Spark 3.3 is now officially supported as well, albeit not extensively tested so far. Moreover, many small bug fixes and enhancements make Flowman more robust and versatile.

This latest release contains a couple of smaller changes and fixes plus a new relation for creating and managing views in SQL databases accessed via JDBC. This addition strengthens Flowman's position in environments where data resides in classical relational databases instead of HDFS or object storage.

The latest Flowman release now provides a YAML/JSON schema to enable syntax highlighting and auto-completion in code editors like IntelliJ and Visual Studio Code.

Projects delivered with Flowman

Online Advertising

Online advertising produces huge amounts of data on a daily basis. In order to provide meaningful insights, all this data needs to be integrated and aggregated along meaningful dimensions. Flowman has been used successfully to create multiple pre-aggregated data marts. Thanks to the declarative specification, business experts can easily be involved in reviews.

Financial Services

Flowman has been successfully implemented in a microservice project in the financial services industry. The project uses Kafka for intra-service communication, and Flowman collects and processes relevant messages directly from Kafka without the need to connect all services individually.

Customer-facing reporting

The art of making sense of millions of detailed records from multiple source systems by providing a high-level, holistic view is at the core of customer-facing reporting in B2B scenarios. Flowman is the right tool for integrating different data sources, applying complex business logic and storing aggregated tables in your reporting backend.