We are excited and proud to announce the official release of Flowman 1.0.
Flowman is a tool for performing complex data transformations in a structured manner. It follows an everything-as-code approach with a purely declarative specification language for describing the transformations and data models. This helps enterprises to accelerate their workflows for building robust data pipelines because developers can focus on the business logic while Flowman takes care of the correct execution.
Under the hood, Flowman uses Apache Spark as its execution engine, a versatile and highly scalable framework for creating data-intensive applications. Although Spark already does much of the heavy lifting, Flowman adds considerable value by hiding many of Spark's low-level details, which would otherwise have to be handled in the complex boilerplate code found in classical Spark applications. Moreover, Flowman fills many gaps that are outside the scope of a generic framework like Apache Spark, such as automatic schema management, data quality tests, data documentation and much more.
Flowman is ready for production
For several years, multiple companies have been using Flowman in production as a robust and reliable solution for efficiently building data transformation pipelines. During this time, many rough edges of Flowman have been smoothed out, and the overall development workflow has been significantly improved. Therefore, we felt that the time was ripe to increase the major version to “1”, underlining the production readiness of Flowman.
We are often asked what sets Flowman apart from plain Apache Spark or from other solutions like dbt. Some of the main advantages of using Flowman are:
- In contrast to dbt, Flowman comes with Apache Spark as a powerful and scalable SQL execution engine. This provides direct connectivity to many data sources and allows simple integration of data stored in different technologies.
- The declarative approach takes much of the burden off data engineers and helps them focus on the business logic with simple YAML files instead of writing complex boilerplate code.
- Flowman can manage the whole lifecycle of the data models it writes to. It not only creates tables, but also automatically applies migrations, for instance by adding or removing columns.
- Unit tests are first-class citizens in Flowman and enable developers to test their business logic independently on their own machines by mocking data sources instead of requiring connectivity to real ones.
- Flowman provides a set of powerful command line tools. This includes a tool for batch execution of data flows in production as well as an interactive shell enabling developers to inspect intermediate results.
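To make the idea of automatic migrations from the list above more concrete, here is a minimal Python sketch of how a target schema could be compared with an existing table to derive the required changes. This is not Flowman's actual implementation or API; the function `plan_migration`, the table name `customers` and the column names are hypothetical and chosen only for illustration. The sketch handles added and removed columns only and ignores type changes.

```python
from typing import Dict, List


def plan_migration(table: str, current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Return ALTER TABLE statements that turn `current` into `desired`.

    Hypothetical helper for illustration only: both schemas are plain
    mappings from column name to type, and type changes are ignored.
    """
    statements = []
    # Columns declared in the desired model but missing in the table are added.
    for name, dtype in desired.items():
        if name not in current:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {dtype}")
    # Columns that exist in the table but are no longer declared are dropped.
    for name in current:
        if name not in desired:
            statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    return statements


# Hypothetical example: the table still has a legacy column,
# while the declared model introduces a new timestamp column.
current = {"id": "BIGINT", "name": "STRING", "legacy_flag": "BOOLEAN"}
desired = {"id": "BIGINT", "name": "STRING", "created_at": "TIMESTAMP"}

plan = plan_migration("customers", current, desired)
for statement in plan:
    print(statement)
```

In a declarative tool, the developer only maintains the desired schema; deriving and applying the difference is left to the tool.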
The road to 1.0
The jump from version 0.30 to 1.0 was more than an adjustment of a version number. We dedicated much time to the extensive preparation of this big and important release. On the one hand, some exciting new features have been implemented, like the new client/server mode and the Flowman Maven plugin to improve the overall development workflow. On the other hand, we significantly improved the documentation and enhanced our integration tests to prevent future regressions.
We also thought about a strategy to provide extended maintenance (i.e. bug fixes) for older releases. Some customers understandably prefer to stick to a long-standing and proven version of Flowman instead of always upgrading to the latest version. This reduces the risk of unforeseen side effects and thereby saves them a significant amount of development and testing effort. We finally found a good strategy for providing bug fixes for older Flowman releases, but this required switching to a new branching model in the Flowman repository.