Download Flowman

Download the latest version of Flowman and get started quickly. Flowman distributions are hosted on GitHub and are prebuilt for different Spark and Hadoop versions. Alternatively, you can run a Docker image, which gets you started quickly and works well as a local development tool.

In case your favorite Spark and Hadoop version is missing, feel free to contact us for support.

Run Flowman in Docker

The simplest way to run Flowman is to use a prebuilt Docker image available on Docker Hub:

				
# Start the Docker container
docker run --rm -ti dimajix/flowman:0.27.0-oss-spark3.3-hadoop3.3 bash

# Run Flowman inside the container
cd /opt/flowman
flowshell -f examples/weather
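
If you want to work on your own project inside the container rather than the bundled examples, one option is to mount a local directory. The following is only a sketch: the helper name `run_flowman_container` and the mount point `/opt/flowman/project` are our own assumptions, not something defined by Flowman. The `DOCKER` variable only exists so that the command can be previewed with `DOCKER=echo`.

```shell
# Hypothetical helper: start the Flowman container with a local project
# directory mounted into it. The mount point /opt/flowman/project is an
# assumption made for this sketch; any path inside the container would do.
run_flowman_container() {
    project_dir="$1"    # absolute path of the project directory on the host

    # ${DOCKER:-docker} falls back to the real docker binary; setting
    # DOCKER=echo prints the command instead of running it.
    ${DOCKER:-docker} run --rm -ti \
        -v "$project_dir:/opt/flowman/project" \
        dimajix/flowman:0.27.0-oss-spark3.3-hadoop3.3 bash
}

# Usage (not executed here):
#   run_flowman_container "$(pwd)/my-project"
#   # then, inside the container:
#   #   cd /opt/flowman && flowshell -f project
```

Because the container is started with `--rm`, anything written outside the mounted directory is lost when the shell exits.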
				
			

Install and run Flowman locally

You can also run Flowman directly on your local machine, preferably on Linux. Windows users might consider installing Flowman inside WSL for the best experience.

Download & Install Apache Spark

Although Flowman builds directly upon the power of Apache Spark, it does not provide a working Hadoop or Spark environment, and there is a good reason for that: in many environments (specifically in companies using Hadoop distributions), a Hadoop/Spark environment is already provided by a platform team. Flowman tries its best not to interfere with that setup and instead requires a working Spark installation.

The following steps install Apache Spark on your local machine. If you already have a working Spark installation in a version supported by Flowman, you may want to skip this section. Otherwise, we download and install Spark 3.3.0 for Hadoop 3.3, which works nicely with the latest Flowman release, 0.27.0.

				
# Create a fresh playground directory, both for Spark and for Flowman
mkdir playground
cd playground

# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz | tar xvzf -

# Create a convenient link to the unpacked directory
ln -snf spark-3.3.0-bin-hadoop3 spark
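
Since `ln -snf` happily creates a link even when its target does not exist, it can be worth checking that the `spark` link really points at an unpacked Spark distribution. Here is a minimal sketch; the function name `check_spark_dir` is hypothetical, and the check only looks for an executable `bin/spark-submit`, which is where Spark distributions place that launcher.

```shell
# Hypothetical helper: verify that a directory (or a link to one) looks like
# an unpacked Spark distribution. Defaults to the "spark" link created above.
check_spark_dir() {
    dir="${1:-spark}"

    # -d follows symbolic links, so a dangling link fails here.
    if [ ! -d "$dir" ]; then
        echo "error: $dir is not a directory (dangling link?)" >&2
        return 1
    fi

    # Spark ships its launcher scripts in bin/.
    if [ ! -x "$dir/bin/spark-submit" ]; then
        echo "error: $dir/bin/spark-submit not found or not executable" >&2
        return 1
    fi

    echo "ok: $dir looks like a Spark installation"
}

# Usage (not executed here), from inside the playground directory:
#   check_spark_dir spark && spark/bin/spark-submit --version
```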
				
			

Download & Install Flowman

For this quickstart, we choose `flowman-dist-0.27.0-oss-spark3.3-hadoop3.3-bin.tar.gz`, which matches the Spark package we just downloaded. If you use an existing Spark and Hadoop installation, please pick the download that matches its versions.

				
# Download and unpack Flowman
curl -L https://github.com/dimajix/flowman/releases/download/0.27.0/flowman-dist-0.27.0-oss-spark3.3-hadoop3.3-bin.tar.gz | tar xvzf -

# Create a convenient link to the unpacked directory
ln -snf flowman-0.27.0-oss-spark3.3-hadoop3.3 flowman
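
The archive name, the download URL and the unpacked directory all share the same version string. If you need a different combination, a small helper can keep the three in sync. This is only a sketch: the naming pattern is inferred from the single 0.27.0 example above, the function names are hypothetical, and not every Flowman/Spark/Hadoop combination is necessarily published, so please check the release page first.

```shell
# Hypothetical helpers that derive Flowman file names from version numbers.
# Arguments for all three: <flowman version> <spark major.minor> <hadoop major.minor>
# The pattern is an assumption generalized from the 0.27.0 example above.

flowman_variant() {
    printf '%s-oss-spark%s-hadoop%s\n' "$1" "$2" "$3"
}

flowman_download_url() {
    variant=$(flowman_variant "$1" "$2" "$3")
    printf 'https://github.com/dimajix/flowman/releases/download/%s/flowman-dist-%s-bin.tar.gz\n' \
        "$1" "$variant"
}

flowman_unpacked_dir() {
    printf 'flowman-%s\n' "$(flowman_variant "$1" "$2" "$3")"
}

# Usage (not executed here), reproducing the commands above:
#   curl -L "$(flowman_download_url 0.27.0 3.3 3.3)" | tar xvzf -
#   ln -snf "$(flowman_unpacked_dir 0.27.0 3.3 3.3)" flowman
```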
				
			

Flowman Configuration

Before you can use Flowman, you need to tell it where to find the Spark home directory that we created in the previous step. You can do this either by providing a valid configuration file in flowman/conf/flowman-env.sh (a template can be found at flowman/conf/flowman-env.sh.template) or by simply setting an environment variable. For the sake of simplicity, we follow the second approach:

				
# This assumes that we are still in the directory "playground"
export SPARK_HOME=$(pwd)/spark
				
			

In order to use some of the provided Flowman plugins, we also need to provide a default namespace which contains some basic configuration. We simply copy the provided templates as follows:

# Copy the default namespace and the environment template
cp flowman/conf/default-namespace.yml.template flowman/conf/default-namespace.yml
cp flowman/conf/flowman-env.sh.template flowman/conf/flowman-env.sh
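
Before starting Flowman for the first time, you may want to confirm that the configuration steps above took effect. The following is a small sketch with a hypothetical function name, `flowman_preflight`; it only checks that `SPARK_HOME` points at a directory and that the two copied configuration files exist. The location `bin/flowshell` in the usage comment is an assumption about the layout of the unpacked distribution.

```shell
# Hypothetical helper: check the prerequisites set up in the steps above.
# Argument: path to the Flowman directory (defaults to the "flowman" link).
flowman_preflight() {
    flowman_dir="${1:-flowman}"
    rc=0

    # SPARK_HOME must be set and must point at an existing directory.
    if [ -z "${SPARK_HOME:-}" ] || [ ! -d "$SPARK_HOME" ]; then
        echo "error: SPARK_HOME is unset or not a directory" >&2
        rc=1
    fi

    # Both configuration files copied from their templates must exist.
    for f in conf/default-namespace.yml conf/flowman-env.sh; do
        if [ ! -f "$flowman_dir/$f" ]; then
            echo "error: missing $flowman_dir/$f" >&2
            rc=1
        fi
    done

    return "$rc"
}

# Usage (not executed here), from inside the playground directory.
# bin/flowshell is assumed to be where the distribution keeps its launcher:
#   flowman_preflight flowman && flowman/bin/flowshell -f flowman/examples/weather
```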
				
			

Congratulations!

That’s it. You now have a working Flowman installation. Continue with the next section to learn how to use it.

Additional Resources

Extensive technical documentation

A small quickstart guide will lead you through a simple example.

Tutorial on GitHub

A comprehensive tutorial will teach you all the details.

FAQs

Some commonly asked questions

Having trouble?

A small quickstart guide will lead you through a simple example.

Looking for help for a custom project?

Get in touch and contact us.