Flowman blog and change log

Flowman 0.30.0 released

A little later than anticipated, a new version of Flowman has been released. It didn’t make it as a Christmas present, but it became a welcome New Year’s present instead.

This release has many technical changes under the hood as a preparation for being able to build fat jars. These Java jar files will contain both the Flowman runtime and your project in a single file, which then can be easily deployed to an AWS EMR cluster or to a Databricks environment. But some bits are still missing, so stay tuned for the next releases.

Other than that, this release contains some changes to the job execution logic, which now allows you to control which phases are to be executed for which targets, and when a target is to be considered dirty. Read more about the executions in the job documentation and about the build policy in the relation target documentation.

A new “observe” mapping allows you to capture data-dependent metrics as records flow through the system. This is useful for counting specific record types as a relevant execution metric. Read more in the documentation for the “observe” mapping.
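
The idea of capturing metrics while records pass through can be illustrated with a small conceptual sketch in plain Python. This is not Flowman's implementation (in Flowman the mapping is declared in the project's YAML files and runs on Spark); the function name `observe`, the record layout and the metric names are all hypothetical and only chosen for illustration.

```python
from collections import Counter

def observe(records, metrics, counters):
    """Yield every record unchanged while counting those matching each predicate.

    `metrics` maps a (hypothetical) metric name to a predicate function,
    `counters` collects the resulting counts as a side effect.
    """
    for record in records:
        for name, predicate in metrics.items():
            if predicate(record):
                counters[name] += 1
        yield record

# Hypothetical example data: three records of two different types
records = [{"type": "order"}, {"type": "refund"}, {"type": "order"}]
metrics = {
    "orders": lambda r: r["type"] == "order",
    "refunds": lambda r: r["type"] == "refund",
}
counters = Counter()

# The records pass through untouched; the counts are a by-product
passed = list(observe(records, metrics, counters))
```

Note that in this sketch the counters are only complete once all records have been consumed, since counting happens as a side effect of the data flowing through.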

Moreover, a new build profile and flavor have been added to support Spark 3.2 on Cloudera CDP 7.1. Of course, you need to install the optional Spark 3.2 parcel on your Cloudera stack to be able to use Spark 3.2.

Detailed Changes

  • github-278: Parallelize execution of data quality checks. This also introduces a new configuration property flowman.execution.check.parallelism (default 1)
  • github-282: Improve implementation for counting records
  • github-288: Support reading local CSV files from fatjar
  • github-290: Simplify specifying project name in fatjar
  • github-291: Simplify create/destroy Relation interface
  • github-292: Upgrade AWS EMR to 6.9
  • github-289: Color log output via log4j configuration (requires log4j 2.x)
  • Bump postgresql from 42.4.1 to 42.4.3 in /flowman-plugins/postgresql
  • Bump loader-utils from 1.4.0 to 1.4.2
  • Bump json5 from 2.2.1 to 2.2.3
  • github-293: [BUG] Fatal exceptions in parallel mapping instantiation cause deadlock
  • github-273: Support projects contained in (fat) jar files
  • github-294: [BUG] Parallel execution should not execute more targets after errors
  • github-295: Create build profile for CDP 7.1 with Spark 3.2
  • github-296: Update npm dependencies (vuetify & co)
  • github-297: Parametrize when to execute a specific phase
  • github-299: Move migrationPolicy and migrationStrategy from target into relation
  • github-115: Implement additional build policy in relation target for forcing dirty. This also introduces a new configuration property flowman.default.target.buildPolicy (default COMPAT).
  • github-298: Support fine-grained control when to execute each target of a job
  • github-300: Implement new ‘observe’ mapping
  • github-301: Upgrade Spark to 3.2.3
  • github-302: Upgrade DeltaLake to 2.2.0
  • github-303: Use multi-stage build for Docker image
  • github-304: Upgrade Cloudera profile to CDP 7.1.8
  • github-312: Fix build with Spark 2.4 and Maven 3.8

This version is fully backward compatible with all previous releases back to and including version 0.27.0.

Download

As usual, you can download the latest version from the Download section or directly from GitHub.

Flowman 0.29.0 released

We are happy to announce a fresh and exciting release of Flowman. This release is mainly a maintenance release with a couple of minor improvements:

  • New distribution for Amazon EMR 6.8.0. This simplifies using Flowman in AWS.
  • New ‘iterativeSql’ mapping for solving a new set of recursive problems

Of course, this release also contains a bunch of smaller bug fixes and other improvements.

Detailed Changes

  • github-260: Remove hive-storage-api from several plugins and lib
  • github-261: Add descriptions to all pom.xml
  • github-262: Verification of “relation” targets should only check existence
  • github-263: Add filter condition to data quality checks in documentation
  • github-265: Make JDBC dialects pluggable
  • github-264: Provide “jars” for all plugins
  • github-267: Add new flowman-spark-dependencies module to simplify dependency management
  • github-269: Implement new ‘iterativeSql’ mapping
  • github-270: Upgrade Spark to 3.3.1
  • github-271: Upgrade Delta to 2.1.1
  • github-272: Create build profile for AWS EMR 6.8.0
  • github-273: Refactor file abstraction
  • github-274: Print Flowman configuration to console

Download

As usual, you can download the latest version from the Download section or directly from GitHub.

Flowman 0.28.0 released

We are happy to announce a fresh and exciting release of Flowman. This release is mainly a maintenance release with a couple of minor improvements:

  • New command line options -X and -XX for increasing log level
  • Improve compatibility with MySQL and MariaDB
  • New Maven archetype for quickly setting up a new Flowman project

Of course, this release also contains a bunch of smaller bug fixes and other improvements.

Detailed Changes

  • Improve support for MariaDB / MySQL as data sinks
  • github-245: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-studio-ui
  • github-246: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-server-ui
  • github-247: Automatically generate YAML schemas as part of build process
  • github-248: Bump scss-tokenizer and node-sass in /flowman-server-u
  • github-249: Add new options -X and -XX to increase logging
  • github-251: Support for log4j2 Configuration
  • github-252: Move sftp target into separate plugin
  • github-253: SQL Server relation should support explicit staging table
  • github-254: Use DATETIME2 for timestamps in MS SQL Server
  • github-256: Provide Maven archetype for simple Flowman projects
  • github-258: Support clustered indexes in MS SQL Server

Download

As usual, you can download the latest version from the Download section or directly from GitHub.

Flowman 0.27.0 released

We are happy to announce a fresh and exciting release of Flowman. The focus of this release is again on improving JDBC support. In particular, the following important features have been implemented:

  • New ‘jdbcCommand’ target for executing arbitrary SQL statements for JDBC sinks
  • Support direct SQL statements in JDBC relations for creating tables
  • Upgrade Delta Lake to 2.0.0/2.1.0 (for Spark 3.2 and 3.3 respectively)
  • Better error messages

Of course, this release also contains a bunch of smaller bug fixes and other improvements.

Detailed Changes

  • github-232: [BUG] Column descriptions should be propagated in UNIONs
  • github-233: [BUG] Missing Hadoop dependencies for S3, Delta, etc
  • github-235: Implement new rest hook with fine control
  • github-229: A build target should not fail if Impala “COMPUTE STATS” fails
  • github-236: ‘copy’ target should not apply output schema
  • github-237: jdbcQuery relation should use fields “sql” and “file” instead of “query”
  • github-239: Allow optional SQL statement for creating jdbcTable
  • github-238: Implement new ‘jdbcCommand’ target
  • github-240: [BUG] Data quality checks in documentation should not fail on NULL values
  • github-241: Throw an error on duplicate entity definitions
  • github-220: Upgrade Delta-Lake to 2.0 / 2.1
  • github-242: Switch to Spark 3.3 as default
  • github-243: Use alternative Spark MS SQL Connector for Spark 3.3
  • github-244: Generate project HTML documentation with optional external CSS file

Download

As usual, you can download the latest version from the Download section or directly from GitHub.

Flowman 0.26.0 released

The new version 0.26.0 of Flowman has been released. This version contains a lot of work with a strong focus on improving support for JDBC targets like MariaDB/MySQL, Postgres, MS SQL Server, Azure SQL and Oracle. For example, column collations and comments are now correctly propagated into relational databases, changing the primary key is now supported, and much more.

Spark 3.3 is now also officially supported, although it has not been tested extensively so far. Moreover, many small bug fixes and enhancements help to make Flowman more robust and versatile.

Detailed Changes

  • github-202: Add support for Spark 3.3
  • github-203: [BUG] Resource dependencies for Hive should be case-insensitive
  • github-204: [BUG] Detect indirect dependencies in a chain of Hive views
  • github-207: [BUG] Build should not directly fail if inferring dirty status fails
  • github-209: [BUG] HiveViews should not trigger cascaded refresh during CREATE phase even when nothing is changed
  • github-211: Implement new hiveQuery relation
  • github-210: [BUG] HiveTables should be migrated if partition columns change
  • github-208: Implement JDBC hook for database based semaphores
  • github-212: [BUG] Hive views should not be migrated in RELAXED mode if only comments have changed
  • github-214: Update ImpalaJDBC driver to 2.6.26.1031
  • github-144: Support changing primary key for JDBC relations
  • github-216: [BUG] Floats should be represented as FLOAT and not REAL in MySQL/MariaDB
  • github-217: Support collations for creating/migrating JDBC tables
  • github-218: [BUG] Postgres dialect should be used for Postgres JDBC URLs
  • github-219: [BUG] SchemaMapping should retain incoming comments
  • github-215: Support COLUMN STORE INDEX for MS SQL Server
  • github-182: Support column descriptions in JDBC relations (SQL Server / Azure SQL)
  • github-224: Support column descriptions for MariaDB / MySQL databases
  • github-223: Support column descriptions for Postgres database
  • github-205: Initial support Oracle DB via JDBC
  • github-225: [BUG] Staging schema should not have comments

Download

As usual you can find the latest Flowman version at https://flowman.io/download/ and older downloads on the GitHub release page at https://github.com/dimajix/flowman/releases

Flowman 0.25.0 Released

The new version 0.25.0 of Flowman has been released. This release contains a couple of smaller changes and fixes plus a new relation for creating and managing views in SQL databases, which are accessed via JDBC. This last addition strengthens Flowman’s position in environments where data resides in classical relational databases instead of HDFS or object storage.

Detailed Changes

  • github-184: Only read in *.yml / *.yaml files in module loader
  • github-183: Support storing SQL in external file in hiveView
  • github-185: Missing _SUCCESS file when writing to dynamic partitions
  • github-186: Support output mode OVERWRITE_DYNAMIC for Delta relation
  • github-149: Support creating views in JDBC with new jdbcView relation
  • github-190: Replace logo in documentation
  • github-188: Log detailed timing information when writing to JDBC relation
  • github-191: Add user provided description to quality checks
  • github-192: Provide example queries for JDBC metric sink

Download

As usual you can find the latest Flowman version at https://flowman.io/download/ and older downloads on the GitHub release page at https://github.com/dimajix/flow

Flowman 0.24.0 released

The new version 0.24.0 of Flowman has been released. This release contains a YAML schema generator to provide an appropriate YAML/JSON schema for the code editor of your choice. By using this schema, you will benefit from better syntax validation and auto-complete features, for example in VS Code or IntelliJ. Moreover, model documentation capabilities and data quality checks have been improved based on valuable input from real-world projects.

Detailed Changes

  • github-168: Support optional filters in data quality checks
  • github-169: Support sub-queries in filter conditions
  • github-171: Parallelize loading of project files
  • github-172: Update CDP7 profile to the latest patch level
  • github-153: Use non-privileged user in Docker image
  • github-174: Provide application for generating YAML schema

Breaking Changes

We take backward compatibility very seriously. But sometimes a breaking change is needed to clean up code and to enable new features. This release contains some breaking changes, which are annoying but simple to fix. In order to avoid YAML schema inconsistencies, some entities needed to be renamed, as described in the following table:

category | old kind     | new kind
---------+--------------+---------------------
mapping  | const        | values
mapping  | empty        | null
mapping  | read         | relation
mapping  | readRelation | relation
mapping  | readStream   | stream
relation | const        | values
relation | empty        | null
relation | jdbc         | jdbcTable, jdbcQuery
relation | table        | hiveTable
relation | view         | hiveView
schema   | embedded     | inline
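
When migrating a larger project, the renames listed in the table above can be expressed as a simple lookup. The following is a minimal sketch in Python, not an official migration tool; the names `RENAMED_KINDS` and `new_kind` are hypothetical. Since the old `jdbc` relation maps to either `jdbcTable` or `jdbcQuery`, that case is deliberately reported as an error and left for manual review.

```python
# (category, old kind) -> new kind, taken from the table of breaking changes.
# The "jdbc" relation is omitted on purpose: it becomes either jdbcTable or
# jdbcQuery, which has to be decided by hand for each relation.
RENAMED_KINDS = {
    ("mapping", "const"): "values",
    ("mapping", "empty"): "null",
    ("mapping", "read"): "relation",
    ("mapping", "readRelation"): "relation",
    ("mapping", "readStream"): "stream",
    ("relation", "const"): "values",
    ("relation", "empty"): "null",
    ("relation", "table"): "hiveTable",
    ("relation", "view"): "hiveView",
    ("schema", "embedded"): "inline",
}

def new_kind(category, old_kind):
    """Return the kind to use from 0.24.0 on; unchanged kinds are returned as-is."""
    if (category, old_kind) == ("relation", "jdbc"):
        raise ValueError(
            "relation kind 'jdbc' must be replaced manually by jdbcTable or jdbcQuery"
        )
    return RENAMED_KINDS.get((category, old_kind), old_kind)
```

For example, `new_kind("relation", "table")` yields `hiveTable`, while a kind that is not listed in the table is returned unchanged.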

Flowman 0.23.1 is now available!

We are happy to announce the new release 0.23.1 of Flowman. Since this is mainly a bugfix release, we encourage all users to upgrade soon.

Detailed Changes

  • github-154: Fix failing migration when PK requires change due to data type
  • github-156: Recreate indexes when data type of column changes
  • github-155: Project level configs are used outside job
  • github-157: Fix UPSERT operations for SQL Server
  • github-158: Improve non-nullability of primary key column
  • github-160: Use sensible defaults for default documenter
  • github-161: Improve schema caching during execution
  • github-162: ExpressionColumnCheck does not work when results contain NULL values
  • github-163: Implement new column length quality check

About Flowman

Flowman is a data build tool on top of Apache Spark which uses a declarative approach for specifying the full data flow including all sources, targets and transformations. As usual, you can find the latest version of Flowman prebuilt for different Spark / Hadoop versions at https://flowman.io

Flowman 0.23.0 is now available!

The new version 0.23.0 of Flowman has been released. The main feature of this version is a significant improvement of the new documentation system, which now also includes column level lineage. The automatically generated documentation is a valuable artifact for both developers and business experts to improve the understanding of the data models and transformations. Flowman projects can also specify quality checks (like NOT NULL conditions, foreign key relationships or arbitrary SQL expressions), which are not only included in the documentation but also executed on the real data.

Moreover, support for SQL databases via JDBC has been improved again with the introduction of temporary staging tables to perform updates within a transactional commit.

Detailed Changes

  • github-148: Support staging table for all JDBC relations
  • github-120: Use staging tables for UPSERT and MERGE operations in JDBC relations
  • github-147: Add support for PostgreSQL
  • github-151: Implement column level lineage in documentation
  • github-121: Correctly apply documentation, before/after and other common attributes to templates
  • github-152: Implement new ‘cast’ mapping

About Flowman

Flowman is a data build tool on top of Apache Spark which uses a declarative approach for specifying the full data flow including all sources, targets and transformations. As usual, you can find the latest version of Flowman prebuilt for different Spark / Hadoop versions at https://flowman.io

Flowman 0.22.0 has been released

The new version 0.22.0 of Flowman has been released. Among the new features is a new documentation subsystem for automatically creating reference documentation of the whole data flow and the final data model including data quality checks. Along with that feature comes the ability to publish execution and data quality metrics directly into a SQL database. Moreover, the support for Azure SQL / SQL Server databases has been improved significantly with higher write speed and transactional writes. Project loading times have also been reduced, which is especially valuable for interactive development work inside the Flowman shell.

About Flowman

Flowman is an open source data build tool on top of Apache Spark which uses a declarative approach for specifying the full data flow including all sources, targets and transformations. As usual, you can find the latest version of Flowman prebuilt for different Spark / Hadoop versions at https://flowman.io