
I like the YAML abstraction. This should make it easier to programmatically try and evaluate multiple configurations for the whole AI pipeline (not just the LLM) against a dataset or real users through an API deployment.

Some feedback: It would be great to see in one place all the supported fields and values for the YAML config.


Yes, we hope the YAMLs will provide a clear separation between configuration and code, allowing easier deployment of many apps from the same codebase and more principled experiments.

Thanks for the suggestion, we'll put together a comprehensive summary of all the options.


What is a hybrid index?


An index that combines multiple indexing techniques, e.g. vector search with more classical information retrieval techniques such as BM25. We found that they have different strengths: vector indexes are very good at getting synonyms and indirect matches, while classical term-based indexes are better for direct queries. Hybridization gives you the best of both worlds.
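To make the hybridization concrete, here is a toy sketch (not Pathway's actual implementation) that fuses a BM25 term ranking with a vector-style similarity ranking using reciprocal rank fusion; the character-count "embedding" is only a stand-in for a real neural embedder:

```python
import math
from collections import Counter

DOCS = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "a feline rested on the rug",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classical term-based scoring (BM25)."""
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    q_terms = query.split()
    df = {t: sum(1 for toks in tokenized if t in toks) for t in q_terms}
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def vector_scores(query, docs):
    """Stand-in for embedding similarity: bag-of-characters cosine."""
    def emb(text):
        c = Counter(text)
        norm = math.sqrt(sum(v * v for v in c.values()))
        return {ch: v / norm for ch, v in c.items()}
    q = emb(query)
    return [sum(q.get(ch, 0.0) * w for ch, w in emb(d).items()) for d in docs]

def hybrid_rank(query, docs, k=60):
    """Reciprocal rank fusion of the term-based and vector rankings."""
    rankings = []
    for scores in (bm25_scores(query, docs), vector_scores(query, docs)):
        order = sorted(range(len(docs)), key=lambda i: -scores[i])
        rankings.append({doc_i: r for r, doc_i in enumerate(order)})
    fused = {i: sum(1 / (k + r[i]) for r in rankings) for i in range(len(docs))}
    return sorted(fused, key=lambda i: -fused[i])

print(hybrid_rank("cat on the mat", DOCS))  # doc 0 ranks first
```

The term ranker nails the exact-match query while the similarity ranker would also surface the "feline ... rug" paraphrase; fusing the two rankings gives both behaviors.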


If the whole pipeline and the vector index are kept in memory... does Pathway still persist state somewhere?


(Adrian from the Pathway team here.) Indeed, everything is RAM-based, and persistence/cache relies on file backends. The precise backend to use is a code configuration parameter. S3 or local filesystem are the currently supported options. For documentation, see the user guide under Deployment -> Persistence.


Nice, thanks! I was reading https://pathway.com/developers/user-guide/deployment/persist.... If I understand correctly you persist both source data and internal state, including the intermediary state of the computational graph. And you only rely on the backend to recover from failures and upgrades. So if I want to clone a Pathway instance, I don't need to reprocess all source data, I can recover the intermediary state from the snapshot.

Is it the same logic for the VectorStoreServer? https://pathway.com/developers/user-guide/llm-xpack/vectorst...


For indexing operators, there is some flexibility in how much internal operator state is persisted. For a stream-stream join, say, it is often faster to rebuild the state from its "boundary conditions" than to persist it fully. For vector indexes, more of the internal state has to be persisted because of determinism issues: if the index were rebuilt from scratch, it could come back different and return different approximate results, which is bad. Currently, the HNSW implementation underlying VectorStoreServer is not yet fully integrated into the main Differential Dataflow organization and has its own way of persisting/caching data "on the side". All in all, this part of the codebase is relatively young, and there is a fair amount of room for improvement.
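A toy illustration of the determinism point (this is a random-projection sketch, not Pathway's HNSW): an approximate index built with randomness can come out different on every rebuild, whereas restoring a snapshot of the built structure reproduces answers exactly:

```python
import pickle
import random

class RandomProjectionIndex:
    """Toy approximate index: buckets vectors by random hyperplane signs.
    Rebuilding draws fresh hyperplanes, so results can differ run to run;
    restoring a snapshot reproduces the exact same structure."""
    def __init__(self, dim, n_planes=8):
        self.planes = [[random.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = {}

    def _key(self, v):
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0
                     for plane in self.planes)

    def add(self, doc_id, v):
        self.buckets.setdefault(self._key(v), []).append(doc_id)

    def query(self, v):
        return self.buckets.get(self._key(v), [])

vecs = {i: [random.Random(i).gauss(0, 1) for _ in range(4)] for i in range(100)}
index = RandomProjectionIndex(dim=4)
for i, v in vecs.items():
    index.add(i, v)

# Snapshot the whole structure; the restored copy answers identically.
snapshot = pickle.dumps(index)
restored = pickle.loads(snapshot)
q = vecs[0]
assert restored.query(q) == index.query(q)

# A rebuilt index draws new hyperplanes and may bucket documents differently.
rebuilt = RandomProjectionIndex(dim=4)
for i, v in vecs.items():
    rebuilt.add(i, v)
```

Persisting the full structure (rather than rebuilding) is what keeps approximate query results stable across restarts.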


The AI Connector Builder from API docs is insane! Which API doc specifications will it support? Or does it even matter?


Interesting implementation! For complex stream and text processing, I also prefer processing data in memory with Python (ETL) rather than SQL in the warehouse (ELT).


I see the ingested documents in the data folder don't have an id field, only a doc field.

{"doc": "Using Large Language Models in Pathway is simple: just call the functions from `pathway.stdlib.ml.nlp`!"}

What if I pass two contradictory statements? Is there a way to remove (or better update) a document with a new version?

For example, if I am ingesting some public docs, and I update a doc page. How do I make so that it only takes the answer from the latest document version?


This depends on the data source used. Some track updateable collections, some have a more "append-only" nature. For instance, tracking a database table using CDC+Debezium will support reacting to all document changes out of the box.

For file sources, we are working on supporting file versioning and integration with S3's native object versioning. Then simply deleting the file or uploading a new version will be enough to trigger re-indexing of the affected documents.
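As a generic sketch of the latest-version-wins behavior (illustrative only, not Pathway's actual mechanism): keying the store by a stable document id derived from the file path makes a newly uploaded version supersede the old one, so contradictory old text never reaches the retriever:

```python
import hashlib

class VersionedDocStore:
    """Documents are keyed by a stable id (here, a hash of the file path);
    upserting a new version replaces the old entry, so queries only
    ever see the latest text."""
    def __init__(self):
        self.docs = {}  # doc_id -> (version, text)

    @staticmethod
    def _doc_id(path):
        return hashlib.sha256(path.encode()).hexdigest()[:12]

    def upsert(self, path, text):
        doc_id = self._doc_id(path)
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, text)
        return doc_id, version

    def delete(self, path):
        self.docs.pop(self._doc_id(path), None)

    def latest(self, path):
        entry = self.docs.get(self._doc_id(path))
        return entry[1] if entry else None

store = VersionedDocStore()
store.upsert("docs/setup.md", "Install with pip.")
store.upsert("docs/setup.md", "Install with conda.")  # supersedes v1
print(store.latest("docs/setup.md"))  # -> Install with conda.
```

Both versions map to the same id, so re-indexing the updated page automatically retires the contradictory old statement.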


Hi, interesting!

> Then it processes and organizes these documents by building a 'vector index' using the Pathway package.

What is the Pathway package?


Pathway (https://github.com/pathwaycom/pathway) is a data processing framework we are developing that unifies stream and batch processing of large datasets. It lets developers concentrate on writing the data processing logic, without worrying about tracking changes to the data and updating the results. The same code can then be run on batch data (e.g. during testing) or on real-time data streams (i.e. online query processing).

In the LLM app, Pathway allows concentrating on prompt building and querying the LLM APIs as if the corpus of documents were static, while all updates to it are handled by the framework itself.
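A rough analogy of that separation in plain Python (illustrative only, not Pathway's API): the query logic at the bottom is written as if the corpus were static, while upserts and deletions keep the inverted index current underneath it:

```python
class LiveIndex:
    """The 'framework' half applies inserts/removals and keeps the
    inverted index up to date; the query logic below is written as
    if the corpus never changed."""
    def __init__(self):
        self.corpus = {}    # doc_id -> text
        self.inverted = {}  # word -> set of doc_ids

    def upsert(self, doc_id, text):
        self.remove(doc_id)
        self.corpus[doc_id] = text
        for word in set(text.lower().split()):
            self.inverted.setdefault(word, set()).add(doc_id)

    def remove(self, doc_id):
        old = self.corpus.pop(doc_id, None)
        if old is not None:
            for word in set(old.lower().split()):
                self.inverted[word].discard(doc_id)

    # --- "static" query logic, unaware of updates ---
    def search(self, term):
        return sorted(self.inverted.get(term.lower(), set()))

idx = LiveIndex()
idx.upsert("a", "Pathway unifies stream and batch")
idx.upsert("b", "batch processing of large datasets")
print(idx.search("batch"))  # -> ['a', 'b']
idx.remove("a")
print(idx.search("batch"))  # -> ['b']
```

The point of the design is that the search code never mentions updates at all; the maintenance layer absorbs them.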


Nice list of resources!


Note that there is also a CLI to manage configurations defined in YAML files, and a few options to deploy Airbyte in "one click". It's all in the README; sorry to hear you didn't find your way around our docs. There are a growing number of features and deployment options now.


We are using the CLI, but it leaves a lot to be desired. Honestly, if you want to get serious about configuration management you need to put out a Terraform provider ASAP.

The disaster recovery story in particular is very poor. To use this in production with confidence, we need to be able to spin up a brand new instance and have a single command ("terraform apply" or equivalent) apply the exact same configuration state (sources, destinations, connections) that was on a previously running instance.


Totally agree - in fact from the issues pages it seemed like there had been a conscious decision from Airbyte not to support a Terraform provider and instead build out Octavia, which as you say leaves a lot to be desired. Deploying Airbyte is still far too hard, I'd be happy to pay for an 'on prem' version like Gitlab where it's a bit more managed. I also think the failure modes for Airbyte on prem are too hard to debug in comparison to other data 'MDS' tools.


Thanks for the feedback! We (Airbyte) are actively working on IaC solutions. We are publishing a Public API in 2023 [0], and we are working on an official Terraform provider built on that API, as well as language-specific SDKs. We hope to have these tools ready in 2023 as well.

[0] - https://api.airbyte.com (There is a roadmap there).


Thanks for the feedback! Recently a community member created a Terraform provider: https://github.com/eabrouwer3/terraform-provider-airbyte


My bet: Data testing, data monitoring and data catalog solutions will consolidate to cover data quality all together.

