Question 1

What data sources can Flowman access?

Accepted Answer

Flowman supports many different data sources, for example:

Files on a local file system, on distributed storage (HDFS), or in blob storage (S3, ABS, etc.)
Hive tables and views
Relational databases (MySQL, MariaDB, MS SQL Server, PostgreSQL, …)

You can find a detailed overview in the official Flowman documentation.

Question 2

What file formats are supported?

Accepted Answer

Flowman supports all file formats that Apache Spark supports, and plugins extend this to additional formats. Supported formats include:

Plain text
CSV
Fixed width format
Hadoop Sequence files
JSON files
Avro files
Parquet
ORC
Delta Lake

You can find the full list of supported file formats in the official Flowman documentation.

Question 3

Which databases are supported?

Accepted Answer

Flowman supports many relational databases via JDBC connectivity, including:

MySQL
MariaDB
PostgreSQL
Oracle
MS SQL Server
Azure SQL

Support for additional databases can be implemented on request without much effort.
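As a rough illustration of what JDBC connectivity means in practice, the following sketch builds connection URLs for the databases listed above. The URL patterns are the ones commonly documented for the respective JDBC drivers; the helper `jdbc_url`, the host name and the database name are hypothetical and are not part of Flowman. Azure SQL is assumed to use the same URL form as MS SQL Server.

```python
# Commonly used JDBC URL patterns, keyed by a short database name.
# These names and this helper are illustrative assumptions, not Flowman API.
JDBC_URL_TEMPLATES = {
    "mysql": "jdbc:mysql://{host}:{port}/{database}",
    "mariadb": "jdbc:mariadb://{host}:{port}/{database}",
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "oracle": "jdbc:oracle:thin:@//{host}:{port}/{database}",
    # MS SQL Server and Azure SQL (assumed to share the same driver and URL form)
    "sqlserver": "jdbc:sqlserver://{host}:{port};databaseName={database}",
}


def jdbc_url(kind: str, host: str, port: int, database: str) -> str:
    """Return a JDBC URL for one of the database kinds listed above."""
    try:
        template = JDBC_URL_TEMPLATES[kind]
    except KeyError:
        raise ValueError(f"unsupported database kind: {kind}") from None
    return template.format(host=host, port=port, database=database)


# Hypothetical host and database, for illustration only.
example_url = jdbc_url("postgresql", "db.example.com", 5432, "sales")
print(example_url)
```

How such a URL is passed to Flowman is described in the official Flowman documentation; the sketch only shows the shape of the URLs themselves.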

Question 4

Can I use SQL?

Accepted Answer

Absolutely. Flowman supports a generic and flexible SQL mapping as well as a more powerful recursive SQL mapping. Both types support the full Spark SQL syntax, so you can build complex transformations containing sub-selects, common table expressions and more.
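To give an idea of the kind of query meant here, the sketch below combines a common table expression with a sub-select. It runs the query with Python's built-in `sqlite3` module purely as a stand-in engine so that the example is self-contained; it is not how Flowman executes SQL. The table `measurements` and its columns are made up for illustration, and the query only uses constructs that are also valid in Spark SQL.

```python
import sqlite3

# In-memory stand-in database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (station TEXT, year INTEGER, temperature REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("A", 2011, 10.0), ("A", 2011, 14.0), ("B", 2011, 20.0), ("B", 2011, 22.0)],
)

# A common table expression (WITH ...) plus a sub-select in the WHERE clause:
# keep only stations whose average is above the overall average.
QUERY = """
WITH station_avg AS (
    SELECT station, AVG(temperature) AS avg_temperature
    FROM measurements
    GROUP BY station
)
SELECT station, avg_temperature
FROM station_avg
WHERE avg_temperature > (SELECT AVG(temperature) FROM measurements)
ORDER BY station
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('B', 21.0)]
```

In a Flowman project the same SQL text would be placed in a SQL mapping and evaluated by Spark; the exact mapping syntax is covered in the official Flowman documentation.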

Question 5

How does job scheduling work?

Accepted Answer

Flowman is neither a replacement for schedulers like Apache Airflow or Oozie, nor does it contain a job scheduler that automatically starts jobs at specific times. Job scheduling is an overarching concern shared by many different tools, so leaving it out is a deliberate design decision rather than a shortcoming of Flowman: excellent tools for it already exist.

This means you can use any existing scheduler that can start a bash script (which is essentially what the Flowman executables are), so Oozie or Airflow, for example, work just fine.
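What a scheduler does with such a script can be sketched as follows: it starts the executable as a child process and uses the exit code to decide whether the run succeeded. This is only a minimal sketch. The command layout `flowexec -f <project> job build <job>` is an assumption about the Flowman command line and should be checked against the documentation of your Flowman version; the helper names are hypothetical. The demonstration at the end uses the Python interpreter as a stand-in command so that the example runs without Flowman installed.

```python
import subprocess
import sys
from typing import List


def build_flowman_command(
    project_dir: str, job: str, executable: str = "flowexec"
) -> List[str]:
    """Assemble the command line a scheduler task would start.

    Assumed layout: flowexec -f <project> job build <job>
    (verify against your Flowman version).
    """
    return [executable, "-f", project_dir, "job", "build", job]


def run_and_report(command: List[str]) -> int:
    """Run a command and return its exit code, as a scheduler task would."""
    completed = subprocess.run(command, check=False)
    return completed.returncode


# Stand-in for a failing job: a child process that exits with code 3.
demo_exit = run_and_report([sys.executable, "-c", "raise SystemExit(3)"])
print(demo_exit)  # 3
```

A scheduler task would call `run_and_report(build_flowman_command(...))` with a real project directory and job name, and mark the task as failed on any non-zero exit code.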

Question 6

Do I need a Cluster?

Accepted Answer

Although Apache Spark and Flowman are meant to wrangle huge amounts of data, both can run without any hassle on a single machine. In this case you still benefit from the multitude of connectors and the flexibility of Apache Spark without the operational complexity of a compute cluster.

Question 7

Does Flowman run on premise or in the Cloud?

Accepted Answer

Currently we do not offer anything like “Flowman Cloud” as Software-as-a-Service (SaaS). Flowman is essentially an on-premise solution running on your own computers. Of course, you can also run Flowman on any virtual cloud resource that you manage.

Although this means that Flowman requires installation, configuration and management on your side, we are happy to support you with these tasks.

Question 8

Does Flowman support Hadoop YARN?

Accepted Answer

Absolutely. Flowman supports Apache Hadoop and commercial distributions like Cloudera. Flowman also supports Kerberos authentication, which is commonly used in production Hadoop clusters.

Question 9

Does Flowman support Kubernetes?

Accepted Answer

Flowman can be run in a Kubernetes Cluster, both in single-process mode and in distributed mode using the official Kubernetes support from Apache Spark.

Question 10

Does Flowman support Cloudera?

Accepted Answer

Absolutely. Special build variants of Flowman are available for CDH 6 and CDP 7; see the download section for the appropriate versions.

Question 11

Does Flowman support AWS EMR?

Accepted Answer

Since Flowman version 0.30.0, AWS EMR is fully supported as a deployment target. You can either deploy Flowman to the master node and access it via ssh, or you can deploy Flowman as a step function of your EMR cluster. Flowman also supports EMR Serverless, but no interactive console access is possible.

Question 12

Does Flowman support Azure Synapse?

Accepted Answer

Since Flowman version 1.0.0, Flowman supports Azure Synapse Spark as a deployment target. This means that you can create a fat jar containing all required Flowman libraries and your project, copy the jar file to Azure Blob Storage and then run Flowman as a Spark job inside Azure Synapse.

Question 13

Does Flowman support Hive?

Accepted Answer

Flowman supports Hive very well. Not only does Flowman support reading from and writing to Hive tables, it can even manage (i.e. create) Hive views by transforming your data flow from mappings into valid Hive SQL.

Question 14

Which parts of Flowman are Open Source?

Accepted Answer

All of Flowman is Open Source; there are no closed source parts. The source code is freely available on GitHub under the liberal Apache 2.0 License. When it makes sense, we also try to convince companies to donate custom extensions such as plugins to the official Flowman repository, which also ensures that they are adapted to any internal API changes.

Frequently asked questions about Flowman (FAQ)

Still have a question?

info@flowman.io
