In addition to writing and executing data transformations, Flowman also manages the whole lifecycle of physical data models, such as Hive tables or SQL tables. Flowman automatically infers the required schema of output tables and creates missing tables or migrates existing tables to the correct schema (e.g. by adding new columns or changing data types).
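As an illustration, a managed Hive table can be declared as a *relation* in a project's YAML files. The sketch below is a minimal, hypothetical example (database, table and names are made up; the exact set of supported properties depends on the Flowman version):

```yaml
relations:
  # Hypothetical output relation: Flowman infers the schema from the
  # mapping written into this table and creates or migrates the
  # physical Hive table accordingly.
  daily_sales:
    kind: hiveTable
    database: "reporting"
    table: "daily_sales"
    format: parquet
```

Because the relation carries no hand-maintained DDL, adding a column to the upstream transformation is enough for Flowman to evolve the physical table.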
This helps to keep all aspects (like transformations and schema information) in a single place managed by a single application.
Being built on top of Apache Spark, Flowman can be run as a standalone application or can scale to almost any size by using compute clusters (Hadoop, Kubernetes, EMR & Azure Synapse) to process huge amounts of data.
First, you specify the data transformation pipeline with all its inputs, outputs and transformations. Then you create a job containing a list of build targets that tells Flowman which target tables should be populated with which results of the data transformations.
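A tiny end-to-end sketch of these pieces might look as follows. All entity names are invented for illustration, and the exact YAML fields (e.g. the mapping kinds) may vary between Flowman versions:

```yaml
mappings:
  # Read raw records from an input relation (the relation "raw_sales"
  # is assumed to be defined elsewhere in the project)
  raw_sales:
    kind: relation
    relation: raw_sales

  # Transform with plain Spark SQL
  daily_sales:
    kind: sql
    sql: "
      SELECT sale_date, SUM(amount) AS total
      FROM raw_sales
      GROUP BY sale_date"

targets:
  # Build target: write the mapping's result into an output relation
  daily_sales:
    kind: relation
    mapping: daily_sales
    relation: daily_sales_table

jobs:
  # The job lists which targets to build
  main:
    targets:
      - daily_sales
```

The job is the unit of execution: running its "build" phase materializes every listed target.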
Optional self-contained unit tests ensure that the implemented logic meets the business requirements. These tests can be executed in isolation, with all dependencies on external data sources mocked.
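Such a test replaces real inputs with in-memory records and asserts on the result. The following is only a sketch, reusing the hypothetical "raw_sales"/"daily_sales" names from above; the precise assertion and override syntax may differ across Flowman versions:

```yaml
tests:
  test_daily_sales:
    # Replace the real input mapping with fixed in-memory records, so
    # the test needs no access to the external data source
    overrideMappings:
      raw_sales:
        kind: values
        columns:
          sale_date: string
          amount: double
        records:
          - ["2023-01-01", 10.0]
          - ["2023-01-01", 5.0]

    # Assert on the transformed output via SQL
    assertions:
      total_is_summed:
        kind: sql
        query: "SELECT total FROM daily_sales WHERE sale_date = '2023-01-01'"
        expected:
          - [15.0]
```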
Once the declarative flow description, with its relations, mappings, targets and jobs, is in place, your project is ready to be executed.
Flowman offers a powerful command-line tool which reads the flow files and performs all required actions, such as creating target tables, transforming data and writing the results. Behind the scenes, Flowman utilizes the power of Apache Spark for efficient and scalable data processing.
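Typical invocations look roughly like the following (the project directory name is an example, and available subcommands depend on the installed Flowman version):

```shell
# Build all targets of the "main" job
flowexec -f my-project/ job build main

# Check whether the built targets are still valid / up to date
flowexec -f my-project/ job verify main

# Run the project's self-contained unit tests
flowexec -f my-project/ test run
```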
Collaborate with your colleagues by putting all project files under version control, for example in Git. Flowman follows the everything-is-code philosophy: all project files are simple text files (YAML). This means you can easily trace changes to the logic and go back to a previous version when required.
The same is true for deployment: you only need to check out a specific version of the Git repository and let Flowman execute the project.
Test and document your project and your data by annotating mappings and relations with descriptions and quality checks. Both can be attached at the entity level (mapping or relation) or at the level of individual columns.
Flowman can then generate a full-blown documentation of your project which includes not only your descriptions but also the results of all specified checks. This minimizes friction between your assumptions about the data and reality.
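Descriptions and checks live next to the entity they describe. A hypothetical column-level annotation for the output relation from the earlier examples could look like this (check kinds and field names are assumptions and may vary by Flowman version):

```yaml
relations:
  daily_sales_table:
    kind: hiveTable
    database: "reporting"
    table: "daily_sales"
    description: "Aggregated sales per calendar day"
    documentation:
      columns:
        - name: sale_date
          description: "Calendar day of the sale"
          checks:
            # Column must never contain NULLs
            - kind: notNull
        - name: total
          description: "Total sales amount for the day"
          checks:
            # Totals are expected to be non-negative
            - kind: expression
              expression: "total >= 0"
```

Generating the documentation (for example via a `flowexec documentation` subcommand, if your version provides one) then executes these checks and embeds the outcomes in the rendered pages.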
Flowman is actively developed by dimajix as an open source building block for implementing data pipelines in modern data-centric organizations.
Flowman is an open source project available under the very liberal Apache 2.0 license.