Evals is not suitable for evaluating LLM applications such as RAG, because you have to evaluate on your own data, where no golden test set exists, and the techniques commonly used correlate poorly with human judgement.
We have built the RAGAS framework for this: https://github.com/explodinggradients/ragas
Great project! We're building an open-source platform for building robust LLM apps (https://github.com/Agenta-AI/agenta), and we'd love to integrate your library into our evaluation workflow!