A little later than anticipated, a new version of Flowman has been released. It didn’t make it as a Christmas present, but it makes a welcome New Year’s present instead.
This release has many technical changes under the hood in preparation for building fat jars. These Java jar files will contain both the Flowman runtime and your project in a single file, which can then be easily deployed to an AWS EMR cluster or to a Databricks environment. Some bits are still missing, though, so stay tuned for the next releases.
Other than that, this release contains some changes to the job execution logic, which now allow you to control which phases are executed for which targets, and when a target is considered dirty. Read more about executions in the job documentation and about the build policy in the relation target documentation.
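To make the idea of per-phase target selection more concrete, here is a minimal conceptual sketch in Python. It is not Flowman's configuration syntax or API: the names `ExecutionRule` and `targets_for_phase`, the use of regular expressions, and the example target names are all hypothetical and only illustrate how targets could be matched to phases.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRule:
    """Hypothetical rule: run `phase` only for targets whose name matches `pattern`."""
    phase: str
    pattern: str


def targets_for_phase(phase, targets, rules):
    """Return the targets (in their original order) that some rule selects for `phase`."""
    selected = []
    for target in targets:
        if any(r.phase == phase and re.fullmatch(r.pattern, target) for r in rules):
            selected.append(target)
    return selected


# Illustrative rules: validate everything, but only build the fact tables.
rules = [
    ExecutionRule("validate", r".*"),
    ExecutionRule("build", r"fact_.*"),
]
targets = ["dim_customer", "fact_sales", "fact_orders"]

validate_targets = targets_for_phase("validate", targets, rules)
build_targets = targets_for_phase("build", targets, rules)
```

In this sketch a phase without any matching rule simply selects no targets; how Flowman itself treats such cases is described in the job documentation.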
A new “observe” mapping allows you to capture data-dependent metrics as records flow through the system. This is useful for counting specific record types as a relevant execution metric. Read more in the documentation for the “observe” mapping.
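The general idea of observing records while they pass through can be sketched in plain Python. This is only a conceptual illustration, not the Flowman mapping itself: the function `observe`, the metric names, and the sample records are assumptions made for this example.

```python
from collections import Counter


def observe(records, metrics, counts):
    """Yield every record unchanged while counting those matching each predicate."""
    for record in records:
        for name, predicate in metrics.items():
            if predicate(record):
                counts[name] += 1
        yield record


# Hypothetical sample data with two record types.
records = [
    {"type": "order", "amount": 10},
    {"type": "refund", "amount": -4},
    {"type": "order", "amount": 7},
]

# Each metric is a name plus a predicate deciding whether a record is counted.
metrics = {
    "orders": lambda r: r["type"] == "order",
    "refunds": lambda r: r["type"] == "refund",
}

counts = Counter()
passed_through = list(observe(records, metrics, counts))
```

The data itself is not modified; the counters are a side product of the pass, which is what makes this pattern suitable for execution metrics.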
Moreover, a new build profile and flavor have been added to support Spark 3.2 on Cloudera CDP 7.1. Of course, you need to install the optional Spark 3.2 parcel on your Cloudera stack to be able to use Spark 3.2.
Detailed Changes
- github-278: Parallelize execution of data quality checks. This also introduces a new configuration property flowman.execution.check.parallelism (default: 1)
- github-282: Improve implementation for counting records
- github-288: Support reading local CSV files from fatjar
- github-290: Simplify specifying project name in fatjar
- github-291: Simplify create/destroy Relation interface
- github-292: Upgrade AWS EMR to 6.9
- github-289: Color log output via log4j configuration (requires log4j 2.x)
- Bump postgresql from 42.4.1 to 42.4.3 in /flowman-plugins/postgresql
- Bump loader-utils from 1.4.0 to 1.4.2
- Bump json5 from 2.2.1 to 2.2.3
- github-293: [BUG] Fatal exceptions in parallel mapping instantiation cause deadlock
- github-273: Support projects contained in (fat) jar files
- github-294: [BUG] Parallel execution should not execute more targets after errors
- github-295: Create build profile for CDP 7.1 with Spark 3.2
- github-296: Update npm dependencies (vuetify & co)
- github-297: Parametrize when to execute a specific phase
- github-299: Move migrationPolicy and migrationStrategy from target into relation
- github-115: Implement additional build policy in relation target for forcing dirty. This also introduces a new configuration property flowman.default.target.buildPolicy (default: COMPAT)
- github-298: Support fine-grained control when to execute each target of a job
- github-300: Implement new ‘observe’ mapping
- github-301: Upgrade Spark to 3.2.3
- github-302: Upgrade DeltaLake to 2.2.0
- github-303: Use multi-stage build for Docker image
- github-304: Upgrade Cloudera profile to CDP 7.1.8
- github-312: Fix build with Spark 2.4 and Maven 3.8
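As a rough illustration of what a parallelism setting such as flowman.execution.check.parallelism controls, the following Python sketch runs independent checks on a bounded thread pool. It is not Flowman's implementation: the function `run_checks`, the check names, and the sample checks are hypothetical, and only the default of 1 is taken from the change list above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(checks, parallelism=1):
    """Run named check functions with at most `parallelism` worker threads.

    Returns a dict mapping each check name to True (passed) or False (failed).
    A check that raises an exception is recorded as failed.
    """
    def run_one(item):
        name, check = item
        try:
            return name, bool(check())
        except Exception:
            return name, False

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map returns results in the order of the inputs
        return dict(pool.map(run_one, checks.items()))


# Hypothetical data quality checks on a small in-memory sample.
sample = [1, -2, 3]
checks = {
    "not_empty": lambda: len(sample) > 0,
    "no_negatives": lambda: all(x >= 0 for x in sample),
}

results = run_checks(checks, parallelism=2)
```

With a parallelism of 1 the checks run one after another, so the outcome is the same and only the wall-clock time differs.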
This version is fully backwards compatible with earlier releases, down to and including version 0.27.0.
Download
As usual, you can download the latest version from the Download section or directly from GitHub.