Although Flowman builds directly upon the power of Apache Spark, it does not ship a working Hadoop or Spark environment, and for good reason: in many environments (specifically in companies using Hadoop distributions), a Hadoop/Spark environment is already provided by a platform team. Flowman tries not to interfere with such a setup and instead requires an existing, working Spark installation.
The following step installs Apache Spark on your local machine. If you already have a working Spark installation in a version supported by Flowman, you may skip this step. Otherwise, we download and install Spark 3.4.1 for Hadoop 3.3, which works well with the latest Flowman release 1.1.0.
# Create a fresh playground directory, both for Spark and for Flowman
mkdir playground
cd playground
# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.4.1-bin-hadoop3 spark
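Before moving on, it can help to confirm which Spark release the `spark` link actually points to. The following is a minimal sketch and not part of the official Flowman instructions: the function name `spark_link_version` is hypothetical, and it assumes the directory naming pattern `spark-<version>-bin-<hadoop>` used by the archive downloaded above.

```shell
# Hypothetical helper (not provided by Spark or Flowman):
# print the Spark version encoded in the target of a symlink.
# Assumes the link target is named like "spark-3.4.1-bin-hadoop3".
spark_link_version() {
    slv_link="${1:-spark}"
    # readlink prints the link target and fails if the path is not a symlink
    slv_target=$(readlink "$slv_link") || return 1
    slv_name=$(basename "$slv_target")
    case "$slv_name" in
        spark-*-bin-*) ;;          # expected naming pattern
        *) return 1 ;;             # anything else is rejected
    esac
    slv_version="${slv_name#spark-}"        # drop the leading "spark-"
    slv_version="${slv_version%%-bin-*}"    # drop "-bin-..." and everything after
    printf '%s\n' "$slv_version"
}
```

Called as `spark_link_version spark` from the playground directory, the function reports the version taken from the directory name only; it does not start Spark itself.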
For this quickstart, we chose `flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz`, which matches the Spark package we just downloaded. If you use your existing Spark and Hadoop installation, please use the appropriate download above.
# Download and unpack Flowman
curl -L https://github.com/dimajix/flowman/releases/download/1.1.0/flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz | tar xvzf -
# Create a nice link
ln -snf flowman-1.1.0-oss-spark3.4-hadoop3.3 flowman
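If you pick a different Flowman package, a quick consistency check against your Spark version can catch a mismatch early. The sketch below is only an illustration: the function name `flowman_matches_spark` is hypothetical, and it assumes that package file names contain the Spark version as `-spark<major>.<minor>-`, as the file name used in this quickstart does.

```shell
# Hypothetical helper (not part of Flowman): succeed if a Flowman package
# file name targets the same major.minor Spark line as the given Spark version.
# Assumes the package name contains "-spark<major>.<minor>-".
flowman_matches_spark() {
    fms_package="$1"      # e.g. flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz
    fms_version="$2"      # e.g. 3.4.1
    fms_major="${fms_version%%.*}"      # text before the first dot
    fms_rest="${fms_version#*.}"        # text after the first dot
    fms_minor="${fms_rest%%.*}"         # text before the next dot, if any
    case "$fms_package" in
        *"-spark${fms_major}.${fms_minor}-"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `flowman_matches_spark flowman-dist-1.1.0-oss-spark3.4-hadoop3.3-bin.tar.gz 3.4.1` succeeds, while passing `3.3.2` as the Spark version fails. The check compares names only and says nothing about whether a given combination is officially supported.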
Before you can use Flowman, you need to tell it where to find the Spark home directory that we created in the previous step. This can be done either by providing a valid configuration file in flowman/conf/flowman-env.sh (a template can be found at flowman/conf/flowman-env.sh.template), or by simply setting an environment variable. For the sake of simplicity, we follow the second approach:
# This assumes that we are still in the directory "playground"
export SPARK_HOME=$(pwd)/spark
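A wrong `SPARK_HOME` tends to surface only later as a confusing startup error, so a small sanity check can be useful here. This is a minimal sketch under the assumption that a Spark home is recognisable by an executable `bin/spark-submit`, which is where the Spark binary distribution places it; the function name `check_spark_home` is hypothetical.

```shell
# Hypothetical helper (not part of Spark or Flowman): succeed if the given
# directory, or $SPARK_HOME when no argument is passed, looks like a Spark home.
check_spark_home() {
    csh_dir="${1:-${SPARK_HOME:-}}"
    if [ -z "$csh_dir" ]; then
        echo "SPARK_HOME is not set" >&2
        return 1
    fi
    # The Spark binary distribution ships bin/spark-submit
    if [ ! -x "$csh_dir/bin/spark-submit" ]; then
        echo "No executable bin/spark-submit found in $csh_dir" >&2
        return 1
    fi
    echo "Spark home looks valid: $csh_dir"
}
```

Once `check_spark_home` succeeds, running `"$SPARK_HOME/bin/spark-submit" --version` should additionally print the Spark version banner. Note that the `export` above only lasts for the current shell session.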
In order to use some of the provided Flowman plugins, we also need to provide a default namespace which contains some basic configuration. We simply copy the provided templates as follows:
# Copy the default namespace and environment templates
cp flowman/conf/default-namespace.yml.template flowman/conf/default-namespace.yml
cp flowman/conf/flowman-env.sh.template flowman/conf/flowman-env.sh
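If you repeat this step later, a plain `cp` overwrites any changes you have made to the copied files in the meantime. The following sketch shows one possible way to make the copy safe to re-run; the function name `copy_template_if_missing` is hypothetical and the approach is an assumption of this example, not something the Flowman distribution provides.

```shell
# Hypothetical helper (not part of Flowman): copy "<name>.template" to "<name>"
# only if "<name>" does not exist yet, so local edits are never overwritten.
copy_template_if_missing() {
    ctm_template="$1"
    ctm_target="${ctm_template%.template}"   # strip the ".template" suffix
    if [ "$ctm_target" = "$ctm_template" ]; then
        echo "Not a .template file: $ctm_template" >&2
        return 1
    fi
    if [ -e "$ctm_target" ]; then
        echo "Keeping existing $ctm_target"
    else
        cp "$ctm_template" "$ctm_target" && echo "Created $ctm_target"
    fi
}
```

With this helper, `copy_template_if_missing flowman/conf/default-namespace.yml.template` creates `flowman/conf/default-namespace.yml` on the first run and leaves it untouched on later runs.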
That’s it. You now have a working Flowman installation. Continue with the next section to learn how to use it.
Copyright © The Flowman Authors | Kaya Kupferschmidt | Freiherr-vom-Stein Straße 3, 60323 Frankfurt, Germany | +49 69 71588909 | info@flowman.io