100% Open Source

What is Flowman

Flowman is a declarative data build tool based on Apache Spark that simplifies writing data transformation applications. It offers a modern, declarative approach to ETL for building robust data transformation pipelines.
The main idea is that developers describe the desired data transformations in purely declarative YAML files instead of writing complex Spark jobs in Scala or Python. This keeps the focus on the “what” instead of the “how” and separates the business logic (in the YAML files) from the implementation details (in Flowman). Hiding the complexity of execution (the “how”) and putting the spotlight on the transformations (the “what”) helps developers and business experts collaborate at a new level of detail.
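To make the split between the “what” and the “how” concrete, here is a minimal sketch in plain Python. It is not Flowman's specification format or API: the `SPEC` layout, the mapping kinds `filter` and `project`, and the `run_mapping` helper are all hypothetical names invented for this illustration, and the sample rows are made up.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# The "what": a purely declarative description of two chained mappings.
# Hypothetical layout for illustration only, not Flowman's YAML schema.
SPEC: Dict[str, Dict[str, Any]] = {
    "valid_orders": {"kind": "filter", "input": "orders", "column": "amount", "min": 0},
    "order_totals": {"kind": "project", "input": "valid_orders", "columns": ["customer", "amount"]},
}


# The "how": generic implementations, written once and reused by every spec.
def _filter(rows: List[Row], cfg: Dict[str, Any]) -> List[Row]:
    return [r for r in rows if r[cfg["column"]] >= cfg["min"]]


def _project(rows: List[Row], cfg: Dict[str, Any]) -> List[Row]:
    return [{c: r[c] for c in cfg["columns"]} for r in rows]


HANDLERS: Dict[str, Callable[[List[Row], Dict[str, Any]], List[Row]]] = {
    "filter": _filter,
    "project": _project,
}


def run_mapping(name: str, spec: Dict[str, Dict[str, Any]], sources: Dict[str, List[Row]]) -> List[Row]:
    """Evaluate a mapping by name, resolving its input recursively."""
    if name in sources:
        return sources[name]
    cfg = spec[name]
    rows = run_mapping(cfg["input"], spec, sources)
    return HANDLERS[cfg["kind"]](rows, cfg)


# Made-up sample input standing in for a real data source.
orders = [
    {"id": 1, "customer": "a", "amount": 10},
    {"id": 2, "customer": "b", "amount": -5},
]
result = run_mapping("order_totals", SPEC, {"orders": orders})
```

Changing the business logic here means editing only `SPEC`. The handlers stay untouched, which is the separation the paragraph above describes.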

Data Lifecycle

In addition to writing and executing data transformations, Flowman also manages the whole lifecycle of physical data models, e.g. Hive tables and SQL tables. Flowman automatically infers the required schema of output tables, creates missing tables, and migrates existing tables to the correct schema (e.g. by adding new columns or changing data types).

This keeps all aspects of a project, such as transformations and schema information, in a single place managed by a single application.
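The idea behind schema migration can be sketched as a comparison between the schema a table currently has and the schema the transformation output requires. The following Python snippet is only an illustration under simplifying assumptions: `plan_migration` and the action names are hypothetical, types are plain strings, and it does not reflect how Flowman implements migrations internally.

```python
from typing import Dict, List, Optional, Tuple

Schema = Dict[str, str]  # column name -> data type (as a plain string)


def plan_migration(current: Optional[Schema], desired: Schema) -> List[Tuple[str, ...]]:
    """Return the actions needed to bring a table to the desired schema.

    `current` is None when the table does not exist yet.
    """
    if current is None:
        # Missing table: a single create step covers all columns.
        return [("create_table",)]

    steps: List[Tuple[str, ...]] = []
    for column, dtype in desired.items():
        if column not in current:
            steps.append(("add_column", column, dtype))
        elif current[column] != dtype:
            steps.append(("change_type", column, dtype))
    return steps


# Made-up example schemas.
current = {"id": "int", "amount": "float"}
desired = {"id": "bigint", "amount": "float", "currency": "string"}
steps = plan_migration(current, desired)
```

For the example schemas, the plan contains one type change for `id` and one added column `currency`. A table that already matches yields an empty plan.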

Built with Apache Spark

Because it is built on top of Apache Spark, Flowman can run as a standalone application or scale to almost any size by using compute clusters (Hadoop, Kubernetes, EMR, and Azure Synapse) to process huge amounts of data.

How Flowman works

Develop

First, you specify the data transformation pipeline with all inputs, outputs, and transformations. Then you create a job containing a list of build targets, which tell Flowman which target tables should be populated with which results of the data transformations.

Optional self-contained unit tests ensure that your implemented logic meets the business requirements. These tests can be executed in isolation, with all dependencies on external data sources mocked.
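The principle of testing logic in isolation, with external data sources replaced by mocks, can be shown in a few lines of Python. This is a generic sketch, not Flowman's test syntax: `daily_revenue`, `mocked_orders`, and the sample rows are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def daily_revenue(read_orders: Callable[[], List[Row]]) -> Dict[str, float]:
    """Business logic under test: sum order amounts per day.

    The data source is passed in as a function, so a test can swap the
    real reader for an in-memory stand-in.
    """
    totals: Dict[str, float] = {}
    for row in read_orders():
        day = str(row["day"])
        totals[day] = totals.get(day, 0.0) + float(row["amount"])
    return totals


def mocked_orders() -> List[Row]:
    """Mock replacing the external source with a few made-up rows."""
    return [
        {"day": "2024-01-01", "amount": 10.0},
        {"day": "2024-01-01", "amount": 5.0},
        {"day": "2024-01-02", "amount": 7.5},
    ]


revenue = daily_revenue(mocked_orders)
```

Because the test never touches a real table or file, it runs quickly and gives the same result on every machine.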

Execute

With the declarative flow description in place, consisting of relations, mappings, targets, and jobs, your project is ready to be executed.

Flowman offers a powerful command-line tool that reads the flow files and performs all required actions, such as creating target tables and transforming and writing data. Behind the scenes, Flowman uses Apache Spark for efficient and scalable data processing.

Collaborate

Collaborate with your colleagues by putting all project files under source control, such as Git. Flowman follows the everything-is-code philosophy, and all project files are simple text files (YAML). This means you can easily trace changes to the logic and go back to a previous version when required.

The same is true for deployment: you only need to check out a specific version of the Git repository and then let Flowman execute the project.

Test & Document

Test and document your project and your data by annotating mappings or relations with descriptions and with quality checks. Both can be done either on the entity level (mapping or relation) or even on the column level.

Flowman then generates full documentation of your project, which includes not only your descriptions but also the results of all specified test cases. This minimizes friction between your assumptions about the data and reality.

Notable Flowman features

100% Open Source (Apache License)

Why Flowman?

Flowman reduces development effort
The tool was born from the practical experience that most companies have very similar needs for ETL pipelines built with Apache Spark. Instead of writing similar application code from scratch for every project and customer, you can use Flowman as a powerful building block that skips this first step, so your data team can focus on business logic instead of boilerplate code.
Powered by Apache Spark
By using Apache Spark, the de facto standard for Big Data ETL, Flowman is ready to reliably wrangle your Big Data while providing a higher level of abstraction.
Flowman is simple to learn
With Flowman you can describe your data flow in a simple YAML file that business experts can understand as well. Flowman then takes care of all technical details.
Flowman is observable
The Flowman history server provides all relevant details on past runs. You can see which jobs succeeded and which failed. Additional job metrics can be pushed to external collectors like Prometheus.
Flowman is extensible
If something is missing, you can write your own Flowman plugin. New data sources and sinks, new mappings, and more can be implemented in Scala. And if you struggle with that, we can help.
Flowman is scalable
Flowman can process both Small Data (megabytes) and Big Data (terabytes). You decide whether jobs run on a single machine or in a Hadoop or Kubernetes cluster.
Flowman is proven
Not only does the Flowman source code contain extensive unit tests, which ensure correctness and guard against breaking existing features, but Flowman is also already running successfully in production at multiple companies.

Who is developing Flowman?

Flowman is actively developed by dimajix as an open source building block for services that implement data pipelines in modern data-centric organizations.

Flowman is an open source project available under the very liberal Apache 2.0 license.