Kaya Kupferschmidt

January 4, 2023
7:41 am

Flowman 0.30.0 released

A little later than anticipated, a new version of Flowman has been released. It didn’t make it as a Christmas present, but it became a welcome New Year’s present instead.

This release has many technical changes under the hood in preparation for building fat jars. These Java jar files will contain both the Flowman runtime and your project in a single file, which can then be easily deployed to an AWS EMR cluster or to a Databricks environment. Some bits are still missing, so stay tuned for the next releases.

Other than that, this release contains some changes to the job execution logic, which now allows you to control which phases are to be executed for which targets, and when a target is to be considered dirty. Read more about the executions in the job documentation and about the build policy in the relation target documentation.

A new “observe” mapping allows you to capture data-dependent metrics as records flow through the system. This is useful for counting specific record types as a relevant execution metric. Read more in the documentation for the “observe” mapping.
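To make the idea of capturing metrics while records pass through more concrete, here is a minimal conceptual sketch in plain Python. It is not Flowman's implementation or its YAML syntax; the function `observe`, the measure names and the sample records are all hypothetical and only illustrate the pattern of counting matching records without altering the data.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]


def observe(
    records: Iterable[Record],
    measures: Dict[str, Callable[[Record], bool]],
    metrics: Counter,
) -> Iterator[Record]:
    """Yield every record unchanged while counting those that match each measure."""
    for record in records:
        for name, predicate in measures.items():
            if predicate(record):
                metrics[name] += 1
        yield record


# Hypothetical sample data and measure names, for illustration only.
records = [
    {"id": 1, "type": "order"},
    {"id": 2, "type": "refund"},
    {"id": 3, "type": "order"},
]
measures = {
    "orders": lambda r: r["type"] == "order",
    "refunds": lambda r: r["type"] == "refund",
}

metrics: Counter = Counter()
passed_through = list(observe(records, measures, metrics))
```

The records leave the function exactly as they entered it, and the counters are filled as a side effect of consuming the stream. For the actual configuration of the mapping, refer to the Flowman documentation.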

Moreover, a new build profile and flavor have been added to support Spark 3.2 on Cloudera CDP 7.1. Of course, you need to install the optional Spark 3.2 parcel on your Cloudera stack to be able to use Spark 3.2.

Detailed Changes

github-278: Parallelize execution of data quality checks. This also introduces a new configuration property flowman.execution.check.parallelism (default 1)
github-282: Improve implementation for counting records
github-288: Support reading local CSV files from fatjar
github-290: Simplify specifying project name in fatjar
github-291: Simplify create/destroy Relation interface
github-292: Upgrade AWS EMR to 6.9
github-289: Color log output via log4j configuration (requires log4j 2.x)
Bump postgresql from 42.4.1 to 42.4.3 in /flowman-plugins/postgresql
Bump loader-utils from 1.4.0 to 1.4.2
Bump json5 from 2.2.1 to 2.2.3
github-293: [BUG] Fatal exceptions in parallel mapping instantiation cause deadlock
github-273: Support projects contained in (fat) jar files
github-294: [BUG] Parallel execution should not execute more targets after errors
github-295: Create build profile for CDP 7.1 with Spark 3.2
github-296: Update npm dependencies (vuetify & co)
github-297: Parametrize when to execute a specific phase
github-299: Move migrationPolicy and migrationStrategy from target into relation
github-115: Implement additional build policy in relation target for forcing dirty. This also introduces a new configuration property flowman.default.target.buildPolicy (default COMPAT).
github-298: Support fine-grained control when to execute each target of a job
github-300: Implement new ‘observe’ mapping
github-301: Upgrade Spark to 3.2.3
github-302: Upgrade DeltaLake to 2.2.0
github-303: Use multi-stage build for Docker image
github-304: Upgrade Cloudera profile to CDP 7.1.8
github-312: Fix build with Spark 2.4 and Maven 3.8

This version is fully backwards compatible with earlier releases down to and including version 0.27.0.

Download

As usual, you can download the latest version from the Download section or directly from GitHub.
