
What is Flowman

Flowman is a declarative data build tool based on Apache Spark that simplifies writing data transformation applications. It offers a modern, declarative approach to ETL for creating robust data transformation pipelines.

Everything is Code

The main idea is that developers describe the desired data transformations in purely declarative YAML files instead of writing complex Spark jobs in Scala or Python. This keeps the focus on the “what” instead of the “how” and separates the business logic (in the YAML files) from the implementation details (in Flowman). Hiding the complexity of execution (the “how”) and putting the spotlight on the transformations (the “what”) helps developers and business experts collaborate at a new level of detail.

Data Lifecycle

In addition to writing and executing data transformations, Flowman also manages the whole lifecycle of the physical data models, such as Hive tables, SQL tables and files. Flowman will automatically infer the required schema of output tables and create missing tables or migrate existing tables to the correct schema (e.g. new columns are added, data types are changed).

This helps to keep all aspects (like transformations and schema information) in a single place managed by a single application.
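
As a rough illustration, the sketch below models an output table as a relation with an explicit schema; from this schema Flowman can create the table if it is missing and migrate it when columns are added or changed. The hiveTable kind with its database and table settings is an assumption made for this sketch; the relation types actually available depend on your Flowman version and installed plugins.

relations:
  # A managed output table (relation kind and settings are illustrative)
  aggregates:
    kind: hiveTable
    database: "weather"
    table: "aggregates"
    # Flowman creates or migrates the physical table based on this schema
    schema:
      kind: embedded
      fields:
        - name: country
          type: string
        - name: min_wind_speed
          type: float
        - name: max_wind_speed
          type: float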

How Flowman works

Develop

First you specify the data transformation pipeline with all its inputs, outputs and transformations. Then you create a job containing a list of build targets, telling Flowman which target tables should be populated with which results of the data transformations.

Optional self-contained unit tests ensure that your implemented logic meets the business requirements. These tests can be executed in isolation, with all dependencies on external data sources mocked.
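
As a rough sketch of what such a test could look like, the snippet below replaces the raw input mapping with a single inline record and then checks the output of the extraction mapping. The section and key names used here (tests, overrideMappings, assertions and the values and sql kinds) are assumptions about Flowman's test specification and may differ between versions, so treat this as an illustration rather than copy-paste syntax.

tests:
  test_measurements_extracted:
    # Replace the real S3 input with a single dummy record
    overrideMappings:
      measurements-raw:
        kind: values
        columns:
          raw_data: string
        records:
          - ["just a dummy raw weather record"]
    # Check the output of the mapping under test
    assertions:
      extraction_produces_one_record:
        kind: sql
        query: "SELECT COUNT(*) FROM `measurements-extracted`"
        expected:
          - [1]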

Execute

Once the declarative flow description is in place, with relations, mappings, targets and jobs, your project is ready to be executed.

Flowman offers a powerful command line tool which reads the flow files and performs all required actions, such as creating target tables, transforming data and writing results. Behind the scenes, Flowman utilizes the power of Apache Spark for efficient and scalable data processing.

Collaborate

Collaborate with your colleagues by putting all project files under version control, for example with Git. Flowman follows the everything-is-code philosophy, and all project files are simple text files (YAML). This means you can easily trace changes to the logic and go back to a previous version when required.

The same is true for deployment: you only need to check out a specific version of the Git repository and then let Flowman execute the project.

Test & Document

Test and document your project and your data by annotating mappings or relations with descriptions and quality checks. Both can be specified either at the entity level (mapping or relation) or even at the column level.

Flowman can then generate full-blown documentation of your project that includes not only your descriptions but also the results of all specified checks. This minimizes friction between your assumptions about the data and reality.
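
As a hedged sketch, such annotations could be attached directly to a relation via a documentation section containing column descriptions and checks. The exact section and check names used here (documentation, checks, notNull, expression) are assumptions about Flowman's documentation framework and may differ in your version.

relations:
  measurements:
    kind: file
    format: parquet
    location: "s3a://my-bucket/data/weather/measurements/"   # example location
    # Descriptions and checks for documentation and data quality (key names are illustrative)
    documentation:
      description: "Extracted weather measurements"
      columns:
        - name: usaf
          description: "USAF station identifier"
          checks:
            - kind: notNull
        - name: wind_speed
          description: "Wind speed of the measurement"
          checks:
            - kind: expression
              expression: "wind_speed >= 0"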

Declarative ETL with Flowman

The following code snippets are taken from a small example which extracts and transforms multiple metrics from a publicly available weather data set. These snippets will give you an impression of the YAML format Flowman expects for describing data flows.

Flowman organizes data flows in so-called projects, each of which can contain multiple data flows. Usually all data flows within a single project belong to the same logical data model and are executed as a single batch process. Each project contains the YAML entity types relations, mappings and targets, plus at least one job, as shown in the steps below.
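
A project itself is described by a small manifest file (typically project.yml) that names the project and lists the module directories in which Flowman looks for these YAML files; the directory layout below is just one possible convention:

# project.yml - the project manifest
name: "weather"
version: "1.0"
# Subdirectories containing the relation, mapping, target and job specifications
modules:
  - model
  - mapping
  - target
  - job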

Step 1: Relations

relations:
  # Define a relation called "measurements-raw"
  measurements-raw:
    # The data is stored in files
    kind: file
    # ... as a simple text format
    format: text
    # ... in AWS S3
    location: "s3a://dimajix-training/data/weather/"
    # ... with a pattern for partitions
    pattern: "${year}"
    partitions:
      - name: year
        type: integer
        granularity: 1
    # Define the schema of the data    
    schema:
      # The schema is embedded
      kind: embedded
      fields:
        - name: raw_data
          type: string
          description: "Raw measurement data"

Relations describe your data sources and sinks, including their physical location and data schema. Each relation represents a single logical table. The example above defines a data source containing text data stored in S3.
  • Many different storage types are supported, such as files, object stores (S3, Azure Blob Storage, …) and SQL databases.
  • The data schema can be omitted, in which case it will be inferred automatically, or it can be stored in an external file to simplify maintenance.
  • Custom relation types can be implemented as plugins.
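
The build target in step 3 below writes its result into a second relation called measurements, which is not shown above. A minimal sketch of that output relation could look as follows, storing the extracted records as Parquet files that are again partitioned by year (the location is a made-up example path):

relations:
  # Output relation that the "measurements" build target (step 3) writes into
  measurements:
    kind: file
    # Store the results as Parquet files
    format: parquet
    location: "s3a://my-bucket/data/weather/measurements/"
    pattern: "${year}"
    partitions:
      - name: year
        type: integer
        granularity: 1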

Step 2: Mappings

mappings:
  # Define a mapping for reading from a relation
  measurements-raw:
    kind: relation
    relation: measurements-raw
    partitions:
      year: $year

  # Define a mapping for extracting values
  measurements-extracted:
    kind: select
    # Specify the input mapping
    input: measurements-raw
    # Specify the columns to be extracted
    columns:
      usaf: "SUBSTR(raw_data,5,6)"
      wban: "SUBSTR(raw_data,11,5)"
      date: "SUBSTR(raw_data,16,8)"
      time: "SUBSTR(raw_data,24,4)"
      report_type: "SUBSTR(raw_data,42,5)"
      wind_direction: "SUBSTR(raw_data,61,3)"
      wind_direction_qual: "SUBSTR(raw_data,64,1)"
      wind_speed: "CAST(SUBSTR(raw_data,66,4) AS FLOAT)/10"
      wind_speed_qual: "SUBSTR(raw_data,70,1)"
      air_temperature: "CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10"
      air_temperature_qual: "SUBSTR(raw_data,93,1)"

Next you define the data flow, which usually starts with a read mapping and then continues with transformations. In this example we simply extract some metrics from fixed character positions in the raw weather records. Flowman provides a broad range of mappings:

  • Simple transformations such as filtering, projections and record-wise transformations, including the full power of all supported Spark SQL functions.
  • Grouped aggregations and joins are implemented and commonly used for building data marts for reporting purposes.
  • Complex message transformations simplify the handling of nested data types as commonly found in JSON and Avro.
  • Even SQL queries can be used to specify arbitrarily complex transformations (see the sketch after this list).
  • Custom mappings can be implemented as plugins.
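
As an illustration of the SQL option mentioned above, the hypothetical mapping below computes a daily temperature aggregate with a plain SQL query that refers to an upstream mapping by name. This assumes Flowman's sql mapping kind, and the backticks are needed because the mapping name contains a hyphen:

mappings:
  # Hypothetical aggregation expressed as plain SQL on top of the extracted measurements
  daily_temperature:
    kind: sql
    sql: |
      SELECT
        date,
        MIN(air_temperature) AS min_temperature,
        MAX(air_temperature) AS max_temperature
      FROM `measurements-extracted`
      -- keep only measurements flagged as valid (example filter)
      WHERE air_temperature_qual = '1'
      GROUP BY date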

Step 3: Execution Targets

targets:
  # Define a build target called "measurements"
  measurements:
    # It should build a relation
    kind: relation
    # The data is provided by the mapping
    mapping: measurements-extracted
    # ...and the result is written into a relation
    relation: measurements
    # Overwrite a specific partition
    partition:
      year: $year

Now that relations and mappings are defined, we create execution targets, which connect the output of a mapping to a target relation.
  • Typically each target refers to a target relation and an input mapping. Flowman will materialize all records from the mapping and write them to the relation.
  • Additional execution target types provide operations such as file copies and SFTP uploads.
  • Execution targets will only be built when the physical destination is outdated or if the execution is forced via a command-line flag.
  • Custom target types can be implemented as plugins.

Step 4: Jobs

jobs:
  main:
    # Define optional job parameters
    parameters:
      - name: year
        type: Integer
        default: 2013
    # Provide a list of targets in this job        
    targets:
      - measurements

Finally we create a build job, which may consist of multiple build targets. Flowman will execute these targets in the correct order by resolving any data dependencies between them within a single build job. Each job can also be parametrized; in this example we use a parameter `year` to specify which data partition to process.
  • Jobs help to bundle related build targets to be executed within a single Flowman invocation.
  • Automatic ordering of build targets ensures that all upstream dependencies are executed before a target is run.
  • Optionally, job runs can be registered in a job history database to improve observability.

Why Flowman?

Flowman reduces development effort
The tool was born from the practical experience that most companies have very similar needs for ETL pipelines built with Apache Spark. Instead of writing similar application code from scratch for every project and customer, Flowman provides a powerful building block that skips this first step and accelerates your data team by letting it focus on business logic instead of boilerplate code.

Powered by Apache Spark
By building on Apache Spark, the de-facto standard for Big Data ETL, Flowman reliably wrangles your Big Data while providing a higher level of abstraction.

Flowman is simple to learn
With Flowman you describe your data flow in simple YAML files that business experts can understand as well. Flowman then takes care of all the technical details.

Flowman is observable
The Flowman history server provides all relevant details on past runs. You can see which jobs have succeeded and which have failed. Additional job metrics can be pushed to external collectors like Prometheus.

Flowman is extensible
In case something is missing, you can write your own Flowman plugin. New data sources and sinks, new mappings and more can be implemented in Scala. And if you struggle with that, we can help you.

Flowman is scalable
Flowman can be used to process both small data (megabytes) and big data (terabytes). You decide whether jobs should run on a single machine or in a Hadoop or Kubernetes cluster.

Flowman is proven
Not only does the source code of Flowman contain extensive unit tests, which ensure correctness and avoid breaking existing features, Flowman is also already used successfully in production at multiple companies.

Who is developing Flowman?

Flowman is actively developed by dimajix as an open source building block for implementing data pipelines in modern, data-centric organizations.

Flowman is an open source project available under the very liberal Apache 2.0 license.