
What is Flowman

Flowman is a declarative data build tool based on Apache Spark that simplifies writing data transformation applications. It offers a modern, declarative approach to ETL for creating robust data transformation pipelines.

Everything is Code

The main idea is that developers describe the desired data transformations in purely declarative YAML files instead of writing complex Spark jobs in Scala or Python. This keeps the focus on the “what” instead of the “how” and separates the business logic (in the YAML files) from the implementation details (in Flowman). Hiding the complexity of execution (the “how”) and putting the spotlight on the transformations (the “what”) helps developers and business experts collaborate at a new level of detail.

Data Lifecycle

In addition to writing and executing data transformations, Flowman also manages the whole lifecycle of the physical data models, such as Hive tables, SQL tables and files. Flowman will automatically infer the required schema of output tables and create missing tables or migrate existing tables to the correct schema (e.g. new columns are added, data types are changed).

This helps to keep all aspects (like transformations and schema information) in a single place managed by a single application.
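
As a rough illustration, the sketch below models an output table as a relation with an explicit schema; from this schema Flowman can create the table if it is missing and migrate it when columns are added or changed. The hiveTable kind with its database and table settings is an assumption made for this sketch; the relation types actually available depend on your Flowman version and installed plugins.

relations:
  # A managed output table (relation kind and settings are illustrative)
  aggregates:
    kind: hiveTable
    database: "weather"
    table: "aggregates"
    # Flowman creates or migrates the physical table based on this schema
    schema:
      kind: embedded
      fields:
        - name: country
          type: string
        - name: min_wind_speed
          type: float
        - name: max_wind_speed
          type: float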

How Flowman works

Develop

First you specify the data transformation pipeline with all its inputs, outputs and transformations. Then you create a job containing a list of build targets, telling Flowman which target tables should be populated with which results of the data transformations.

Optional self-contained unit tests ensure that your implemented logic meets the business requirements. These tests can be executed in isolation, with all dependencies on external data sources mocked.
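
As a rough sketch of what such a test could look like, the snippet below replaces the raw input mapping with a single inline record and then checks the output of the extraction mapping. The section and key names used here (tests, overrideMappings, assertions and the values and sql kinds) are assumptions about Flowman's test specification and may differ between versions, so treat this as an illustration rather than copy-paste syntax.

tests:
  test_measurements_extracted:
    # Replace the real S3 input with a single dummy record
    overrideMappings:
      measurements-raw:
        kind: values
        columns:
          raw_data: string
        records:
          - ["just a dummy raw weather record"]
    # Check the output of the mapping under test
    assertions:
      extraction_produces_one_record:
        kind: sql
        query: "SELECT COUNT(*) FROM `measurements-extracted`"
        expected:
          - [1]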

Execute

Once the declarative flow description is in place, with relations, mappings, targets and jobs, your project is ready to be executed.

Flowman offers a powerful command line tool which reads the flow files and performs all required actions, such as creating target tables, transforming data and writing results. Behind the scenes, Flowman utilizes the power of Apache Spark for efficient and scalable data processing.

Collaborate

Collaborate with your colleagues by putting all project files under version control, for example with Git. Flowman follows the everything-is-code philosophy, and all project files are simple text files (YAML). This means you can easily trace changes to the logic and go back to a previous version when required.

The same is true for deployment: you only need to check out a specific version of the Git repository and then let Flowman execute the project.

Test & Document

Test and document your project and your data by annotating mappings or relations with descriptions and quality checks. Both can be specified either at the entity level (mapping or relation) or even at the column level.

Flowman can then generate full-blown documentation of your project that includes not only your descriptions but also the results of all specified checks. This minimizes friction between your assumptions about the data and reality.
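
As a hedged sketch, such annotations could be attached directly to a relation via a documentation section containing column descriptions and checks. The exact section and check names used here (documentation, checks, notNull, expression) are assumptions about Flowman's documentation framework and may differ in your version.

relations:
  measurements:
    kind: file
    format: parquet
    location: "s3a://my-bucket/data/weather/measurements/"   # example location
    # Descriptions and checks for documentation and data quality (key names are illustrative)
    documentation:
      description: "Extracted weather measurements"
      columns:
        - name: usaf
          description: "USAF station identifier"
          checks:
            - kind: notNull
        - name: wind_speed
          description: "Wind speed of the measurement"
          checks:
            - kind: expression
              expression: "wind_speed >= 0"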

Declarative ETL with Flowman

The following code snippets are taken from a small example which extracts and transforms multiple metrics from a publicly available weather data set. These snippets will give you an impression of the YAML format Flowman expects for describing data flows.

Flowman organizes data flows in so-called projects, each of which can contain multiple data flows. Usually all data flows within a single project belong to the same logical data model and are executed as a single batch process. Each project contains the YAML entity types relations, mappings and targets, plus at least one job, as shown in the steps below.
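
A project itself is described by a small manifest file (typically project.yml) that names the project and lists the module directories in which Flowman looks for these YAML files; the directory layout below is just one possible convention:

# project.yml - the project manifest
name: "weather"
version: "1.0"
# Subdirectories containing the relation, mapping, target and job specifications
modules:
  - model
  - mapping
  - target
  - job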

Step 1: Relations

relations:
  # Define a relation called "measurements-raw"
  measurements-raw:
    # The data is stored in files
    kind: file
    # ... as a simple text format
    format: text
    # ... in AWS S3
    location: "s3a://dimajix-training/data/weather/"
    # ... with a pattern for partitions
    pattern: "${year}"
    partitions:
      - name: year
        type: integer
        granularity: 1
    # Define the schema of the data    
    schema:
      # The schema is embedded
      kind: embedded
      fields:
        - name: raw_data
          type: string
          description: "Raw measurement data"

Relations describe your data sources and sinks, including their physical location and data schema. Each relation represents a single logical table. The example above defines a data source containing text data stored in S3.
  • Many different storage types are supported, such as files, object stores (S3, Azure Blob Storage, …) and SQL databases.
  • The data schema can be omitted, in which case it will be inferred automatically, or it can be stored in an external file to simplify maintenance.
  • Custom relation types can be implemented as plugins.
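
The build target in step 3 below writes its result into a second relation called measurements, which is not shown above. A minimal sketch of that output relation could look as follows, storing the extracted records as Parquet files that are again partitioned by year (the location is a made-up example path):

relations:
  # Output relation that the "measurements" build target (step 3) writes into
  measurements:
    kind: file
    # Store the results as Parquet files
    format: parquet
    location: "s3a://my-bucket/data/weather/measurements/"
    pattern: "${year}"
    partitions:
      - name: year
        type: integer
        granularity: 1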

Step 2: Mappings

mappings:
  # Define a mapping for reading from a relation
  measurements-raw:
    kind: relation
    relation: measurements-raw
    partitions:
      year: $year

  # Define a mapping for extracting values
  measurements-extracted:
    kind: select
    # Specify the input mapping
    input: measurements-raw
    # Specify the columns to be extracted
    columns:
      usaf: "SUBSTR(raw_data,5,6)"
      wban: "SUBSTR(raw_data,11,5)"
      date: "SUBSTR(raw_data,16,8)"
      time: "SUBSTR(raw_data,24,4)"
      report_type: "SUBSTR(raw_data,42,5)"
      wind_direction: "SUBSTR(raw_data,61,3)"
      wind_direction_qual: "SUBSTR(raw_data,64,1)"
      wind_speed: "CAST(SUBSTR(raw_data,66,4) AS FLOAT)/10"
      wind_speed_qual: "SUBSTR(raw_data,70,1)"
      air_temperature: "CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10"
      air_temperature_qual: "SUBSTR(raw_data,93,1)"

Next you define the data flow, which usually starts with a read mapping and then continues with transformations. In this example we simply extract some metrics from fixed character positions in the raw weather records. Flowman provides a broad range of mappings:

  • Simple transformations such as filtering, projections and record-wise transformations, including the full power of all supported Spark SQL functions.
  • Grouped aggregations and joins are implemented and commonly used for building data marts for reporting purposes.
  • Complex message transformations simplify the handling of nested data types as commonly found in JSON and Avro.
  • Even SQL queries can be used to specify arbitrarily complex transformations (see the sketch after this list).
  • Custom mappings can be implemented as plugins.
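
As an illustration of the SQL option mentioned above, the hypothetical mapping below computes a daily temperature aggregate with a plain SQL query that refers to an upstream mapping by name. This assumes Flowman's sql mapping kind, and the backticks are needed because the mapping name contains a hyphen:

mappings:
  # Hypothetical aggregation expressed as plain SQL on top of the extracted measurements
  daily_temperature:
    kind: sql
    sql: |
      SELECT
        date,
        MIN(air_temperature) AS min_temperature,
        MAX(air_temperature) AS max_temperature
      FROM `measurements-extracted`
      -- keep only measurements flagged as valid (example filter)
      WHERE air_temperature_qual = '1'
      GROUP BY date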

Step 3: Execution Targets

targets:
  # Define a build target called "measurements"
  measurements:
    # It should build a relation
    kind: relation
    # The data is provided by the mapping
    mapping: measurements-extracted
    # ...and the result is written into a relation
    relation: measurements
    # Overwrite a specific partition
    partition:
      year: $year

Now that relations and mappings are defined, we create execution targets, which connect the output of a mapping to a target relation.
  • Typically each target refers to a target relation and an input mapping. Flowman will materialize all records from the mapping and write them to the relation.
  • Additional execution target types provide operations such as file copies and SFTP uploads.
  • Execution targets will only be built when the physical destination is outdated or if the execution is forced via a command-line flag.
  • Custom target types can be implemented as plugins.

Step 4: Jobs

jobs:
  main:
    # Define optional job parameters
    parameters:
      - name: year
        type: Integer
        default: 2013
    # Provide a list of targets in this job        
    targets:
      - measurements

Finally we create a build job, which may consist of multiple build targets. Flowman will execute these targets in the correct order by resolving any data dependencies between them within a single build job. Each job can also be parametrized; in this example we use a parameter `year` to specify which data partition to process.
  • Jobs help to bundle related build targets to be executed within a single Flowman invocation.
  • Automatic ordering of build targets ensures that all upstream dependencies are executed before a target is run.
  • Optionally, job runs can be registered in a job history database to improve observability.

Why Flowman?

Flowman reduces development effort
The tool was born from the practical experience that most companies have very similar needs for ETL pipelines built with Apache Spark. Instead of writing similar application code from scratch for every project and customer, Flowman provides a powerful building block that skips this first step and accelerates your data team by letting it focus on business logic instead of boilerplate code.

Powered by Apache Spark
By building on Apache Spark, the de-facto standard for Big Data ETL, Flowman reliably wrangles your Big Data while providing a higher level of abstraction.

Flowman is simple to learn
With Flowman you describe your data flow in simple YAML files that business experts can understand as well. Flowman then takes care of all the technical details.

Flowman is observable
The Flowman history server provides all relevant details on past runs. You can see which jobs have succeeded and which have failed. Additional job metrics can be pushed to external collectors like Prometheus.

Flowman is extensible
In case something is missing, you can write your own Flowman plugin. New data sources and sinks, new mappings and more can be implemented in Scala. And if you struggle with that, we can help you.

Flowman is scalable
Flowman can be used to process both small data (megabytes) and big data (terabytes). You decide whether jobs should run on a single machine or in a Hadoop or Kubernetes cluster.

Flowman is proven
Not only does the source code of Flowman contain extensive unit tests, which ensure correctness and avoid breaking existing features, Flowman is also already used successfully in production at multiple companies.

Who is developing Flowman?

Flowman is actively developed by dimajix as an open source building block for implementing data pipelines in modern, data-centric organizations.

Flowman is an open source project available under the very liberal Apache 2.0 license.