Although Flowman builds directly upon the power of Apache Spark, it does not provide a working Hadoop or Spark environment, and there is a good reason for that: in many environments (specifically in companies using Hadoop distributions), a Hadoop/Spark environment is already provided by a platform team. Flowman tries its best not to interfere with that setup and instead requires an existing, working Spark installation.
The following step will install Apache Spark on your local machine. If you already have a working Spark installation in a version that is supported by Flowman, you can skip this step. Otherwise, we download and install Spark 3.4.1 for Hadoop 3.3, which works nicely with the latest Flowman release, 1.1.0.
    # Create a fresh playground directory, both for Spark and for Flowman
    mkdir playground
    cd playground

    # Download and unpack Spark & Hadoop
    curl -L https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz | tar xvzf -

    # Create a nice link
    ln -snf spark-3.4.1-bin-hadoop3 spark
For this quickstart, we chose `flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz`, which matches the Spark package we just downloaded. If you use your existing Spark and Hadoop installation instead, please use the appropriate download above.
    # Download and unpack Flowman
    curl -L https://github.com/dimajix/flowman/releases/download/1.1.0/flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz | tar xvzf -

    # Create a nice link
    ln -snf flowman-1.1.0-oss-spark3.4-hadoop3.3 flowman
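If you use a different Spark or Hadoop version, the archive name changes accordingly. The following is a minimal sketch of how the file name and download URL used above are composed. The helper names `flowman_dist_name` and `flowman_dist_url` are hypothetical, and the pattern is inferred only from the single file name in this guide, so other version combinations are not guaranteed to exist as releases.

```shell
# Hypothetical helper (not part of Flowman): compose the archive name.
# $1 = Flowman version, $2 = Spark version, $3 = Hadoop version
# Assumption: the pattern is inferred from the one file name in this guide.
flowman_dist_name() {
    printf 'flowman-dist-%s-oss-spark%s-hadoop%s-bin.tar.gz\n' "$1" "$2" "$3"
}

# Hypothetical helper: compose the GitHub release URL for that archive.
# Takes the same three arguments as flowman_dist_name.
flowman_dist_url() {
    printf 'https://github.com/dimajix/flowman/releases/download/%s/%s\n' \
        "$1" "$(flowman_dist_name "$1" "$2" "$3")"
}
```

Calling `flowman_dist_url 1.1.0 3.4 3.3` reproduces the URL used in the download command above. Always check the result against the list of published downloads before relying on it.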
Before you can use Flowman, you need to tell it where to find the Spark home directory that we created in the previous step. You can do this either by providing a valid configuration file at `flowman/conf/flowman-env.sh` (a template can be found at `flowman/conf/flowman-env.sh.template`), or by simply setting an environment variable. For the sake of simplicity, we follow the second approach:
    # This assumes that we are still in the directory "playground"
    export SPARK_HOME="$(pwd)/spark"
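A quick sanity check can catch a wrong path early, for example when the command was run from a different directory. The sketch below uses a hypothetical helper, `check_spark_home`, which only tests that the directory contains `bin/spark-submit`. It does not verify that the Spark version is one supported by Flowman.

```shell
# Hypothetical helper (not part of Flowman or Spark).
# Returns 0 if the directory looks like a Spark home, 1 otherwise.
# $1 = directory to check; defaults to $SPARK_HOME if omitted
check_spark_home() {
    spark_dir="${1:-${SPARK_HOME:-}}"
    if [ -z "$spark_dir" ]; then
        echo "SPARK_HOME is not set" >&2
        return 1
    fi
    # Every Spark binary distribution ships bin/spark-submit
    if [ ! -f "$spark_dir/bin/spark-submit" ]; then
        echo "No bin/spark-submit found in $spark_dir" >&2
        return 1
    fi
    echo "Spark home looks usable: $spark_dir"
}
```

After exporting the variable, running `check_spark_home` without arguments checks the current value of `SPARK_HOME`.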
In order to use some of the provided Flowman plugins, we also need to provide a default namespace, which contains some basic configuration. We simply copy the provided template; the second command also creates `flowman-env.sh` from its template:
    # Copy default namespace
    cp flowman/conf/default-namespace.yml.template flowman/conf/default-namespace.yml
    cp flowman/conf/flowman-env.sh.template flowman/conf/flowman-env.sh
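With `flowman/conf/flowman-env.sh` in place, the Spark home can also be stored there instead of being exported in every new shell. The following is a minimal sketch under two assumptions: that `flowman-env.sh` is an ordinary shell script evaluated by Flowman's launch scripts, and that an `export SPARK_HOME=...` line in it takes effect. The helper `persist_spark_home` is hypothetical; check the comments in the template for the settings it actually documents.

```shell
# Hypothetical helper (not part of Flowman).
# Appends an export line for SPARK_HOME to an env file, unless the file
# already sets SPARK_HOME on an uncommented line.
# $1 = path to the env file, $2 = Spark home directory
persist_spark_home() {
    if grep -Eq '^[[:space:]]*(export[[:space:]]+)?SPARK_HOME=' "$1" 2>/dev/null; then
        echo "SPARK_HOME is already set in $1"
        return 0
    fi
    # Assumption: the env file is evaluated as a shell script
    printf 'export SPARK_HOME="%s"\n' "$2" >> "$1"
}
```

From the `playground` directory, this could be called as `persist_spark_home flowman/conf/flowman-env.sh "$(pwd)/spark"`. Commented-out lines in the file are ignored, and a second call leaves the file unchanged.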
That's it. You now have a working Flowman installation. Continue with the next section to learn how to use it.
A small quickstart guide will lead you through a simple example.
Streamline your development workflow by making the most of all Flowman tools.