Frequently asked questions about Flowman (FAQ)

Here you will find answers to common product-level questions about Flowman. They will help you decide how Flowman fits into your overall application landscape. For more technical questions on the developer side, please visit the cookbook section of the official Flowman documentation.

Flowman supports many different data sources, for example:
  • Files on the local file system, on distributed storage (HDFS) or on blob storage (S3, ABS, etc.)
  • Hive tables and views
  • Relational databases (MySQL, MariaDB, MS SQL Server, PostgreSQL, …)
You can find a detailed overview in the official Flowman documentation.
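
For illustration, here is a minimal sketch of how such sources are declared as relations in a Flowman project. The relation kinds follow the official documentation; all bucket, database and table names are placeholders.

    # Sketch of two relations: a Parquet file relation on S3 and a Hive table.
    # All names and locations are placeholders.
    relations:
      raw_events:
        kind: file
        format: parquet
        location: "s3a://my-bucket/raw/events/"   # could equally be an hdfs:// or local path
      customers:
        kind: hiveTable
        database: "crm"
        table: "customers"
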
Flowman supports all file formats that are also supported by Apache Spark, and plugins extend this to even more formats:
  • Plain text
  • CSV
  • Fixed width format
  • Hadoop Sequence files
  • JSON files
  • Avro files
  • Parquet
  • ORC
  • Delta Lake
You can find the full list of supported file formats in the official Flowman documentation.
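
As a sketch, switching the file format of a relation is just a matter of changing its format property (the layout follows the Flowman documentation; names and paths are placeholders):

    # Sketch: a file relation reading CSV; changing `format` to json, avro,
    # orc, parquet etc. switches the underlying Spark file format.
    relations:
      measurements:
        kind: file
        format: csv
        location: "hdfs://namenode/landing/measurements/"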

Flowman supports many relational databases via JDBC connectivity.

  • MySQL
  • MariaDB
  • PostgreSQL
  • Oracle
  • MS SQL Server
  • Azure SQL

Support for additional databases can be implemented on request without much effort.
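
As an illustration, a JDBC source is typically declared as a connection plus a relation referring to it. The following is only a sketch: the driver, URL and credentials are placeholders, and depending on the Flowman version the relation kind may be jdbc or jdbcTable.

    # Sketch of a JDBC connection and a relation referring to it.
    connections:
      my_db:
        kind: jdbc
        driver: "org.mariadb.jdbc.Driver"
        url: "jdbc:mariadb://db-host:3306/shop"
        username: "some_user"       # usually injected via environment variables or properties
        password: "some_password"

    relations:
      orders:
        kind: jdbcTable
        connection: my_db
        table: "orders"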

Absolutely. Flowman provides a generic and flexible SQL mapping and an even more powerful recursive SQL mapping. Both types support the full Spark SQL syntax and therefore let you create complex transformations containing sub-selects, common table expressions and more.
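
A minimal sketch of a SQL mapping with a common table expression might look like this; raw_orders stands for a hypothetical upstream mapping or relation, and for recursive queries you would use the recursive SQL mapping kind described in the documentation.

    # Sketch of a SQL mapping; the full Spark SQL syntax is available here.
    mappings:
      daily_revenue:
        kind: sql
        sql: |
          WITH valid_orders AS (
            SELECT * FROM raw_orders WHERE status <> 'CANCELLED'
          )
          SELECT order_date, SUM(amount) AS revenue
          FROM valid_orders
          GROUP BY order_date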

Flowman is neither a replacement for schedulers like Apache Airflow or Oozie, nor does it contain a job scheduler that automatically starts the execution of jobs at specific times. Since job scheduling is an overarching concern required by many different tools, this is not a shortcoming of Flowman itself, but a deliberate design decision to leave this feature to other excellent tools that already exist.

This means you can use any existing scheduler that can start a bash script (since this is essentially what the Flowman executables are); for example, Oozie and Airflow work just fine.

Although Apache Spark and Flowman are built to wrangle huge amounts of data, both can be run without any hassle on a single machine. In this case you still benefit from the multitude of connectors and the flexibility of Apache Spark without the operational complexity of a compute cluster.

Currently we do not offer a “Flowman Cloud” Software-as-a-Service (SaaS) offering, so essentially Flowman is an on-premises solution running on your machines. Of course you can also run Flowman on any virtual cloud resource managed by you. Although this implies that Flowman requires installation, configuration and management on your side, we are happy to support you with these maintenance tasks.

Absolutely. Flowman supports Apache Hadoop and commercial distributions like Cloudera very well, including Kerberos authentication, which is commonly used in production Hadoop clusters.

Flowman can be run in a Kubernetes cluster, both in single-process mode and in distributed mode, using the official Kubernetes support of Apache Spark.
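
As a sketch, the standard Spark-on-Kubernetes properties can be set like any other Spark setting in Flowman's configuration (for example in the namespace configuration); the master URL, container image and namespace below are placeholders.

    # Sketch: standard Spark-on-Kubernetes settings passed through Flowman's config section.
    config:
      - spark.master=k8s://https://my-k8s-apiserver:6443
      - spark.kubernetes.container.image=my-registry/flowman:latest
      - spark.kubernetes.namespace=data-pipelines
      - spark.executor.instances=4
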
Absolutely. Special build variants of Flowman are available for CDH 6 and CDP 7; see the download section for the appropriate versions.

Since Flowman version 0.30.0, AWS EMR is fully supported as a deployment target. You can either deploy Flowman to the master node and access it via ssh, or you can run Flowman as a step of your EMR cluster. Flowman also supports EMR Serverless, although no interactive console access is possible in that case.

Since Flowman version 1.0.0, Flowman supports Azure Synapse Spark as a deployment target. This means that you can create a fat jar containing all required Flowman libraries and your project, copy the jar file to Azure Blob Storage and then run Flowman as a Spark job inside Azure Synapse.

Flowman supports Hive very well. Not only does Flowman support reading from and writing to Hive tables, it can even manage (i.e. create) Hive views by transforming your mappings into valid Hive SQL.
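
As a sketch (the field names follow the hiveView relation described in the Flowman documentation; the database, view and mapping names are placeholders), a managed Hive view might be declared like this:

    # Sketch of a Hive view relation whose SQL is derived from an existing mapping.
    relations:
      daily_revenue_view:
        kind: hiveView
        database: "reporting"
        view: "daily_revenue"
        mapping: daily_revenue     # Flowman translates this mapping into Hive SQL
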
Everything in Flowman is open source; there are no closed-source parts. The source code is freely available on GitHub under the liberal Apache 2.0 license. When it makes sense, we also try to convince companies to donate custom extensions such as plugins to the official Flowman repository, which also ensures they are adapted to any internal API changes.

Related topics

  • What is Flowman? A gentle introduction to Flowman, the problem it solves and its core concepts.
  • Download & Install: How to set up Flowman locally to get started.
  • Quickstart Guide: A small quickstart guide that leads you through a simple example.
  • Development Workflow: Streamline your development workflow by making the most of all Flowman tools.

Still have a question?

If you cannot find the answer to your question in our FAQ, feel free to contact us.