The following code snippets are taken from a small example, which extracts and transforms multiple metrics from a publicly available weather data set. These snippets will give you an impression for the YAML format for describing the data flows as expected by Flowman.
Flowman organizes the data flows in so called projects, which can actually contain multiple different data flows. Usually all data flows within a single project belong to the same logical data model and will be executed as a single batch process. Each project contains at least four YAML entity types relations, mappings, targets and at least one job as shown in the example:
relations:
# Define a relation called "measurements-raw"
measurements-raw:
# The data is stored in files
kind: file
# ... as a simple text format
format: text
# ... in AWS S3
location: "s3a://dimajix-training/data/weather/"
# ... with a pattern for partitions
pattern: "${year}"
partitions:
- name: year
type: integer
granularity: 1
# Define the schema of the data
schema:
# The schema is embedded
kind: embedded
fields:
- name: raw_data
type: string
description: "Raw measurement data"
mappings:
# Define a mapping for reading from a relation
measurements-raw:
kind: relation
relation: measurements-raw
partitions:
year: $year
# Define a mapping for extracting values
measurements-extracted:
kind: select
# Specify the input mapping
input: measurements-raw
# Specify the columns to be extracted
columns:
usaf: "SUBSTR(raw_data,5,6)"
wban: "SUBSTR(raw_data,11,5)"
date: "SUBSTR(raw_data,16,8)"
time: "SUBSTR(raw_data,24,4)"
report_type: "SUBSTR(raw_data,42,5)"
wind_direction: "SUBSTR(raw_data,61,3)"
wind_direction_qual: "SUBSTR(raw_data,64,1)"
wind_speed: "CAST(SUBSTR(raw_data,66,4) AS FLOAT)/10"
wind_speed_qual: "SUBSTR(raw_data,70,1)"
air_temperature: "CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10"
air_temperature_qual: "SUBSTR(raw_data,93,1)"
Next you define the data flow, which usually starts with a read mapping and then continues with transformations. In this example we will simply extract some metrics at fixed locations in the weather data. A broad range of mappings is provided by Flowman.
targets:
# Define a built target called "measurements"
measurements:
# It should build a relation
kind: relation
# The data is provided by the mapping
mapping: measurements-extracted
# ...and the result is written into a relation
relation: measurements
# Overwrite a specific partition
partition:
year: $year
Now that relations and mappings are described, we now create execution targets which plug together the output of one mapping to a target relation.
jobs:
main:
# Define optional job parameters
parameters:
- name: year
type: Integer
default: 2013
# Provide a list of targets in this job
targets:
- measurements
Finally we create a build job, which can possibly consist of multiple build targets, which Flowman will execute in the correct order, i.e. it will resolve any data dependencies between build targets within a single build job.
Each job can also be parametrized, in this example we use a parameter `year` for specifying which data partition to process.
Once all declarations are in place, the project can be executed. While the most important component of Flowman is the command line tool for normal batch execution, Flowman also provides an interactive shell for developers where they can also inspect the project, analyze intermediate results and much more.
Copyright © The Flowman Authors | Kaya Kupferschmidt | Freiherr-vom-Stein Straße 3, 60323 Frankfurt, Germany | +49 69 71588909 | info@flowman.io
Webdesign by Katharina Vennewald