serverless ETL

Serverless ETL using AWS Step Function -100,000 files under $30

ETL (Extract, Transform, Load) is a way to load data from a source that can be a file, database, transform it, and then load it into a target database or files.

serverless ETL

Designing an serverless ETL solution is not an easy task, it has many minor nuisances that we need to take care of.The idea here is to come up with a serverless ETL solution where we pay only for the resources we use during processing.

Our requirement was to load data from thousands of different CSV and XML files and persist into database tables scheduled monthly /quarterly along with the capability to

  • Know the status of file load
  • Ability to replay and reload the files in case of failure/issue
  • Ability to define the dependency, and sequence in which files should be loaded
  • Ability to schedule the load
  • Ability to run scripts and parse the files
  • Scalability (Load files in parallel if needed)
  • Error reporting capabilities at the record level

It made little sense for us to use ETL tools like airflow and keep running the server or manage the server to host because of the following reasons.

  • Server cost — Running servers 24/7 even if we are not doing anything
  • Management cost — Event if we shut down the server after usage, we need a way to get the status of files/records if some issue comes and then the capability to replay and load the files.

This left us with AWS GLUE and AWS Step function.

Why not AWS Glue?

Our use case was simple to parse the file and load it.

  • AWS Glue has a learning curve associated with it as you need to write ETL jobs in spark environment, learn GLUE catalog, etc. Being a small team we did not want to introduce this as it will become tough to maintain.
  • Setup to reload a single file in case of failure was not there.
  • It did not have the capability to know the status of each file load.

AWS Glue allows you setup, orchestrate, and monitor complex data flows.

Why Step Function?

Step function is AWS managed workflow management tool to coordinate the different components of applications.We can design serverless ETL using step function

Lets discuss the setup

Each state in step function has capabilities to execute a task in a state and move to next state. Step functions executes the task asynchronously and move to the next state.

Step function workflow setup with task token.
Step function workflow setup with task token
  • We have a parent workflow which loops through the FileType1 and invokes child workflow to load each file. If we keep it async, it will create 1000s of child workflow in parallel to load the file and If your database can handle this load, you can go this route.
  • If not, then to execute it synchronously, we can use Task Token, we can pass the token to the task and wait in the same state until we get the token in the callback. Once the task completes, we receive the token and move to the next state.
  • Lambda will process and load the files. Beware of lambda limitation if need you can use Batch, Fargate instead of lambda, or any other AWS service that suits your need.
  • Use bulk insert and data frames for memory-efficient processing and persisting of data.
  • You can control the execution name of the child workflow.To make it searchable you can use {$DATE}-{$filename}-{$uuid}. You can check the status and search in the console using date and filename as shown in image below
status of a step function workflow execution
See the status of a step function workflow execution
  • Once you click on the execution, you can see the execution details, input-output of each state, and logs for the lambda executions as below.
Execution detail of step function workflow
Execution detail of step function workflow
  • In case some file load failed, you can pick the input and re-execute only child workflow and reload the file.

You can scale your file load the way you want, process it parallely or sequentially on AWS service that suits you need.

And the most important thing cost……..

  • Step function cost
  • Lambda Cost
  • Other cots (Cloud watch logs, Network cost if any) Negligible in this context

Step function charge is 0.025$ per 1000 state transition i.e 0.000025 per state

Lambda cost $0.0000002083 per 100ms per 128 MB

In the above setup for each file, we will make 3 state transitions in the parent workflow loop and then 4 state transitions in child workflow, a total of 7 state transitions for a single file.

Let’s say we are using 256 MB and on average, it takes 20 seconds to load a single file(including all three lambdas)

Total state transition

100000(total file)*7(state transition)*0.000025(per state transition)=17.5$

Total Lambda execution

(0.0000002083*10)*20(for 20 sec)*2(for 256 MB)*100000=8.332$

Total Cost = 25.83$

Serverless ETL is cheap Isn’t it?

This is one of the use case of step functions, we can use step functions for a wide variety of use cases. Read more about it.

Please read more about serverless here.

Follow us here.

Please let me know if we can solve this in a better way will be more than happy to learn from my mistakes.

Leave a Reply

Your email address will not be published. Required fields are marked *