Hacker News | Arimbr's comments

The future is EtLT! The lowercase "t" for data privacy transformations and the uppercase "T" for the rest.


Yes, transforming the data before it hits the destination may be necessary in some cases, even in the ELT approach (hence the "t" / "tweak") - discussed here: https://airbyte.com/blog/etlt-gdpr-compliance
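A minimal sketch of what such a lowercase-"t" step could look like: hashing PII fields in flight so raw values never reach the destination, while the heavier "T" still happens later in the warehouse. The field names and record shape are hypothetical, not Airbyte's actual transform API.

```python
import hashlib

# Hypothetical "t" step: replace PII values with SHA-256 digests before records
# hit the destination; the heavy "T" transformations still run in-warehouse later.
PII_FIELDS = {"email", "phone"}

def hash_pii(record: dict, fields=PII_FIELDS) -> dict:
    """Return a copy of the record with PII fields replaced by hex digests."""
    return {
        key: hashlib.sha256(value.encode()).hexdigest() if key in fields else value
        for key, value in record.items()
    }

row = {"id": 42, "email": "jane@example.com", "plan": "pro"}
masked = hash_pii(row)
```

Hashing (rather than dropping) the columns keeps them usable as join keys downstream, which is usually the point of the "tweak" over plain redaction.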


Oh, declarative doesn't necessarily mean no-code. Airbyte data integration connectors are built with SDKs in Python and Java, and a low-code SDK that was just released...

You can then build custom connectors on top of these. Many users actually need to modify an existing connector and would rather start from a template than from scratch.

Airbyte also provides a CLI and YAML configuration language that you can use to declare sources, destinations and connections without the UI: https://github.com/airbytehq/airbyte/blob/master/octavia-cli...

I agree with you that code is here to stay and power users need to see the code and modify it. That's why Airbyte code is open-source.


Interesting to see how modern data orchestrators seem to be adding some of the features of data catalogs and data observability tools.


Sorry, wrong link, and I couldn't delete the post. I reposted with the correct link to: https://airbyte.com/blog/sql-vs-python-data-analysis


Nice article! I also tend to favor SQL for simple querying and data processing with dbt, but when I need to unit test some complex logic, I prefer Python.


I like how Prefect is positioned as an orchestrator for the modern data stack.

Airflow also started as an orchestrator, but then they tried to cover all sorts of other use cases like ETL/ELT pipelines with transfer and transformation operators...

I feel like Prefect focuses on doing one thing, orchestration, and then integrates with other data tools.


Oh, you should check Materialize. I feel Materialize is like dbt but with an ingestion layer and real-time materialized views.

To deliver that, you need to centralize your data in the Materialize database, which is my main caveat. With dbt you can use any data warehouse.


[author of the article] My main concern about using Airflow for the EL parts is that sources and destinations are highly coupled with Airflow transfer operators (e.g. PostgresToBigQueryOperator). The community needs to provide M * N operators to cover all possible transfers. Other open-source projects like Airbyte, decouple sources from destinations, so the community only needs to contribute 2 * (M + N) connectors.
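The decoupling argument can be sketched in a few lines. This is not Airbyte's actual interface, just an illustration: once sources and destinations agree on a common record format, a single generic sync works for any pairing, so the community maintains on the order of M + N connectors instead of M × N transfer operators.

```python
from typing import Iterable, Protocol

# Illustrative only (not Airbyte's real interfaces): a shared record format
# lets any source feed any destination through one generic sync function.
class Source(Protocol):
    def read(self) -> Iterable[dict]: ...

class Destination(Protocol):
    def write(self, records: Iterable[dict]) -> None: ...

class ListSource:
    """Toy source backed by an in-memory list of records."""
    def __init__(self, rows):
        self.rows = rows
    def read(self):
        return iter(self.rows)

class ListDestination:
    """Toy destination that collects whatever it is given."""
    def __init__(self):
        self.rows = []
    def write(self, records):
        self.rows.extend(records)

def sync(source: Source, destination: Destination) -> None:
    """Generic EL step: works for any source/destination pair."""
    destination.write(source.read())
```

With coupled transfer operators, by contrast, every new source forces you to write one operator per existing destination, which is where the M × N blowup comes from.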

Another concern about using Airflow for the T part is that you need to code the dependencies between models both in your SQL files and your Airflow DAG. Other open-source projects like dbt create a DAG from the model dependencies in the SQL files.
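The idea of deriving the DAG from the SQL itself can be sketched as follows. dbt's real parser is Jinja-based; the regex and model names here are only illustrative, but they show why the dependencies don't need to be declared a second time in an orchestrator.

```python
import re
from graphlib import TopologicalSorter

# Toy models: each SQL file references its upstream models via ref(),
# so the dependency graph can be derived from the files themselves.
MODELS = {
    "stg_orders":   "select * from raw.orders",
    "stg_payments": "select * from raw.payments",
    "orders_mart":  "select * from {{ ref('stg_orders') }} "
                    "join {{ ref('stg_payments') }} using (order_id)",
}

REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

def build_dag(models: dict) -> dict:
    """Map each model name to the set of models it depends on."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}

# Topological sort gives a valid execution order: staging models first.
run_order = list(TopologicalSorter(build_dag(MODELS)).static_order())
```

In Airflow, the same three-model pipeline would require the `>>` dependencies to be written out by hand in the DAG file, duplicating what the SQL already says.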

So I advocate for integrating the Airflow scheduler with Airbyte and dbt.

Curious to know how others use Airflow for ETL/ELT pipelines?


Is that really that bad of a problem? In the worst case, you use the underlying hooks to create a custom operator and in the worst worst case you do something like have a Spark/Dataflow/k8s pod/whatever other single threaded or distributed task runner act as that custom operator.

I'm running into the "incremental load" problem now trying to batch intervals of Avro files into BigQuery, but Airflow doesn't seem to complain much about 10-minute intervals (less than that and I'd be a bit worried). One workaround we're considering, to avoid having 144 dag runs/dag/day for 10-minute intervals, is to just define it as a daily dag and periodically clear it throughout the day.
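The daily-dag workaround amounts to iterating over the 10-minute windows inside a single task instead of scheduling each one. A rough, hypothetical sketch of the window arithmetic:

```python
from datetime import datetime, timedelta

# Sketch of the workaround: one daily run whose task iterates over 10-minute
# (start, end) windows and loads each batch of Avro files, rather than
# scheduling 144 separate dag runs per day.
def ten_minute_windows(day_start: datetime):
    step = timedelta(minutes=10)
    current = day_start
    end_of_day = day_start + timedelta(days=1)
    while current < end_of_day:
        yield current, current + step
        current += step

windows = list(ten_minute_windows(datetime(2021, 7, 1)))
```

Each window would then parameterize one load job (e.g. a GCS prefix filter), with the scheduler seeing only a single daily run.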

I have heard of newer companies who don't need to deal with legacy ETL going the ELT route which is interesting to me. I'm curious who owns the transformation in that case. Do you just hire more data scientists or maybe a bunch of junior devs/analysts/DSs to own data cleaning and transformation? The way our transformation works is that the engineering team that owns the application data owns the transformation so they can document exactly what the data going into the BI tool is. It adds overhead on the engineering side, but not nearly as much as throwing the data in a lake and hoping someone downstream knows what to do with it.


In a previous job we/I trained the analysts to be more technical and write the T part of ELT in DBT. They effectively became what is known as "Analytics Engineers" so they owned the T and then wrote their analysis on top of the models they had created.


That works for ELT, especially if you have documentation around the raw data being loaded in, but it sounds like it adds a bit of overhead to the analysts' jobs, which may or may not be more than just having the engineering team own it and document it well (something they already have to do for the analysts to write transformation code). I'm curious how you handle the upstream data schema changing. Loading in raw data means handling compatibility in another place outside the application.


In the end, analysts are cheaper than engineers.


Not if it's just a part of those engineers' jobs. They're already familiar with the underlying application data so owning the transformation is just understanding what the data needs to look like and documenting it. They're going to need to document the raw data anyway to avoid those analysts asking them a million questions. Might as well avoid hiring analysts who can also learn the transformation bit and just give them good data.


I think we've worked in very different jobs. In my case, the analysts had a good idea of the underlying application data and often worked closely with both data engineering and regular engineering to understand it so they could make better analyses. They were quite competent in their own right; otherwise I wouldn't have given them control over the T, which was a net benefit to my life in the form of reduced work.


Exactly. Analysts are always a step behind engineers when it comes to really understanding what the data means and what changes are coming down the line. This always results in delays, broken pipelines, etc. Modern tools like dbt make it easy for data-producing teams to also own the T part.


Really good points! I don't think that Airflow is necessarily a problem if your data engineering team knows how to best use Airflow Operators, Hooks and DAGs for incremental loads. But because Airflow is not an opinionated ETL/ELT tool, most often I see a lot of custom code that could be improved...

You know there is this "data mesh" hype now. I think the idea behind it is to empower the data consumers within the company (data analysts), who know the data best, to create and maintain the models. That's easier said than done, and most often it turns into a worse situation than when only data engineers can model data... I've only heard of Zalando successfully distributing data ownership within the company.


Yea, I wasn't familiar with Airbyte before writing that comment so now I'm seeing the value in it. We have tons of teams asking "how do I get this data into BigQuery" and the answer is usually "use this airflow operator to dump it into GCS and then use this airflow operator to load it into BigQuery" which isn't super useful for a non-technical person or even really any technical person not familiar with Airflow.

A mesh is certainly something in-between a lake and a warehouse... Something super simple that I've gotten good feedback on so far from DSs is just documenting the transformed data in place. It was really difficult to do this in our old ETL stack (data pulled from HBase, transformed to parquet + Hive in HDFS) but we've moved a lot of it over to Avro files loaded into BigQuery where we can just put decorators on our Scala transformation code that's writing the Avro files and that updates the schema with descriptions in BigQuery. Gives a nice bit of ownership to the engineering team and lets the DS using the data be a lot more autonomous. That boundary has to exist somewhere (or I guess in many places for a "mesh") so having it distinctly at data getting loaded in feels right to me.


>the answer is usually "use this airflow operator to dump it into GCS and then use this airflow operator to load it into BigQuery"

Why not hide these steps behind a YAML file and construct DAGs from those? Devs are already used to writing YAML files for CI, Kubernetes, etc.
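A rough sketch of that factory pattern. A plain dict stands in for the parsed YAML here (in practice you would load the file with `yaml.safe_load`), and the DAG name, schedule, and operator names are hypothetical placeholders, not a real spec.

```python
# Sketch: teams declare transfers in a config file and a small factory expands
# each entry into chained task specs. The operator names below are placeholders
# for real Airflow operators; the dict stands in for a parsed YAML document.
CONFIG = {
    "dags": [
        {
            "name": "orders_to_bigquery",
            "schedule": "@daily",
            "steps": [
                {"operator": "PostgresToGCSOperator", "table": "orders"},
                {"operator": "GCSToBigQueryOperator", "dataset": "analytics"},
            ],
        }
    ]
}

def build_task_specs(config: dict) -> dict:
    """Expand the declarative config into an ordered task list per DAG."""
    dags = {}
    for dag in config["dags"]:
        tasks = []
        for i, step in enumerate(dag["steps"]):
            tasks.append({
                "task_id": f"{dag['name']}_step_{i}",
                "operator": step["operator"],
                # each step runs after the previous one
                "upstream": tasks[-1]["task_id"] if tasks else None,
            })
        dags[dag["name"]] = tasks
    return dags

specs = build_task_specs(CONFIG)
```

The last step would instantiate real operators from these specs inside a DAG file, which is exactly what lets non-Airflow-experts request a new pipeline by editing config rather than Python.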


Nice work there! I also think that the next challenge for data teams is all this data documentation and discovery work.

I still think that Airflow is great for power data engineers. Airbyte and dbt are positioned to empower data analysts (or lazy data engineers like me) to own the ELTs.


Agreed. I see a lot of folks coming up with one off solutions for pulling data out of 3rd party sources like Kustomer or Lever. Giving a centralized UI for setting that up would be a great service.

Seems like I have a fun weekend project.


What about Meltano/Singer?


I mostly don't bother writing separate Operators. The only part I write are the so-called Hooks (which are basically just Airflow's way of defining a standard way of grabbing credentials and instantiating a session object).

After that you just write a short python function that grabs the data from one hook and pushes it to another. Which is basically the (M + N) solution you mention (I think the factor 2 is unnecessary if you've already split sources and sinks).

This approach works with anything you can connect to Python. Though for particularly large datasets you want to be careful that you don't accidentally store all the data in memory at once. And sure, you can sometimes specialize an operation for a particular use case (e.g. if in your example you can instruct BigQuery to connect to the Postgres application natively), but usually it works just fine to use a Python script in between.
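The hook-to-hook function described above might look something like this, with fake hooks standing in for Airflow's real ones (a real hook would wrap credentials and a session object). Streaming in fixed-size chunks is what avoids holding the whole dataset in memory.

```python
from itertools import islice

# Fake hooks standing in for Airflow hooks, just to exercise the pattern.
class FakeSourceHook:
    def __init__(self, rows):
        self.rows = rows
    def get_records(self):
        return iter(self.rows)

class FakeSinkHook:
    def __init__(self):
        self.inserted = []
    def insert_rows(self, rows):
        self.inserted.extend(rows)

def transfer(source_hook, sink_hook, chunk_size=1000):
    """Move records from source to sink in chunks; returns rows moved.

    Chunking keeps memory bounded regardless of total dataset size.
    """
    records = source_hook.get_records()
    moved = 0
    while chunk := list(islice(records, chunk_size)):
        sink_hook.insert_rows(chunk)
        moved += len(chunk)
    return moved
```

In a real DAG this function body would sit inside a PythonOperator callable, with the hooks constructed from Airflow connection IDs.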


I am curious if Airflow is the most appropriate tool for such tasks. I would imagine Apache Camel to be a good starting point for building a library of connectors and then plugging them together using EIP [1] patterns?

[1]: https://martinfowler.com/books/eip.html


My team is currently evaluating Debezium for a PG-to-S3 CDC solution, and we are quite far along with the PoC.

What would be the argument for trying AirByte? It is the first time we hear about it.


Airbyte CDC is based on Debezium, but Airbyte abstracts it away and makes it easier to CDC from Postgres, MySQL, or MSSQL to any supported destination (including S3). Here is the doc for CDC: https://docs.airbyte.io/understanding-airbyte/cdc

I guess one benefit is that you can use Airbyte for all your data syncs, CDC and non-CDC. You can give it a try with your own data, and see if it's easier for your team. You can run Airbyte locally with Docker Compose: https://docs.airbyte.io/quickstart/deploy-airbyte


Can you elaborate on how Airbyte makes things easier for a user? Would love to pick up any potential improvements in Debezium itself, so that all its users get to benefit from them, rather than only users of a specific integrator like Airbyte.

Disclaimer: I work on Debezium


1 - Hide your ETL logic in REST APIs and use common microservice patterns for monitoring, logging, etc...

2 - Use PythonOperator to call your APIs.

It's working really well for us.
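One common shape for this pattern is submit-then-poll, which also sidesteps long-request timeouts. A sketch with an in-memory stub standing in for the real REST service; the endpoint names and job lifecycle here are hypothetical.

```python
import itertools

class FakeEtlApi:
    """Stub for a service exposing POST /jobs and GET /jobs/{id}/status.

    Purely in-memory; a real caller would use an HTTP client instead.
    """
    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._remaining = {}
        self.polls_until_done = polls_until_done

    def submit_job(self, payload: dict) -> int:
        job_id = next(self._ids)
        self._remaining[job_id] = self.polls_until_done
        return job_id

    def job_status(self, job_id: int) -> str:
        self._remaining[job_id] -= 1
        return "succeeded" if self._remaining[job_id] <= 0 else "running"

def run_etl_job(api, payload, max_polls=10):
    """The callable a PythonOperator would run: submit, then poll to a terminal state.

    Short poll requests avoid holding one long HTTP call open until it times out.
    """
    job_id = api.submit_job(payload)
    for _ in range(max_polls):
        status = api.job_status(job_id)
        if status in ("succeeded", "failed"):
            return status
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

A real version would add a sleep between polls and surface the job ID in logs so a failed run can be traced back to the service.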


This is a very bad idea. You cannot hide ETL in a REST API; at most you can hide the T part, and it's a lot more work. You need to build an extra service that reads the data from a database and does some pagination. It's way more complex than a one-liner to get data from a JDBC connection, and much slower.


I used to do that, but having to deal with timeouts in long-running jobs was frustrating because I never knew if something was successful. I could bump up the timeouts, but they're there for a reason.


Is there really a 1-1 mapping between SQL and T?

What about use cases where the data lives in an object store? How does dbt deal with that?


We work with many larger businesses (Fortune 500), and the T per pipeline is, say, 60 steps with 1200 columns at 10TB scale, and uses multiple things not in SQL. They look up object stores, look up web services, use RocksDB; partitioning is important. At scale, cost becomes critical; some are even moving to their own Spark on Kubernetes. ML is done on the data after ETL into the data lake.

None of them can use DBT for core ETL, but DBT might be good later for views, some dimensional modeling. They have done a good job here.

Think of it as the modern small-scale data stack.


Have you explored Cuelang for T?


I got inspired and started this over the weekend to demonstrate what is possible.

https://github.com/hofstadter-io/cuetils


If you can write a SQL query or a set of SQL queries to do your transformation, then you can use DBT. DBT doesn't do transformation itself; rather, it helps you manage all the dependencies between your SQL models. Whether you can use SQL depends on your data and your database/warehouse functionality. For example, JSON parsing support is pretty good now in many databases and warehouses. If your objects can be represented as JSON, then you could write SQL via DBT to parse the objects into columns and tables.


My understanding of dbt is that it builds a DAG based on the interdependencies between models. The interdependencies are parsed from 'ref' functions in the SQL files. The thing with dbt is that you transform the data within a single data warehouse.

So, you would normally first load all data to the data warehouse. Then dependencies between SQL models are easier to map.


Hello HN! My name is Ari. I've worked as a data engineer for three French startups. The first didn't have an engineering blog. The second had one, but I never found the courage to write. The third didn't consider it a priority.

I believe that all tech teams have tremendous value to share. With Guriosity, I want to encourage and support more teams to write about their work.

So, I gathered 60 software engineering blogs by French companies and classified 600+ manually picked articles in 10 categories: Backend, Data, Frontend, DevOps, Product...

For candidates, it can be a great window into what it would be like to work for a company before joining.


Nice, I also think that backend engineering needs more innovation!

