> Can you give some rough indications of how many pages you index in total?
I index like 300 million documents right now, though I crawl something like 1.4 billion (and could index them all). The search engine is pretty judicious about filtering out low quality results, mostly because this improves the search results.
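Not Marginalia's actual ranking logic, but a toy sketch of the kind of quality gate described above: score each crawled document with cheap heuristics and only admit it to the index when it clears a threshold. The heuristics, weights, and threshold here are all invented for illustration.

```python
# Toy quality filter: index only documents that clear a heuristic score.
# The heuristics and threshold are invented for illustration; this is
# not Marginalia's actual filtering code.

def quality_score(text: str, html_length: int) -> float:
    """Crude quality heuristic: reward visible text, penalize markup bloat."""
    if html_length == 0:
        return 0.0
    text_ratio = len(text) / html_length          # visible text vs. total page size
    length_bonus = min(len(text) / 2000, 1.0)     # favor substantial pages
    return 0.5 * text_ratio + 0.5 * length_bonus

def should_index(text: str, html_length: int, threshold: float = 0.4) -> bool:
    return quality_score(text, html_length) >= threshold

# A thin, markup-heavy page is filtered out; a text-rich page is kept.
thin = should_index("Buy now!", html_length=50_000)   # False
rich = should_index("word " * 600, html_length=6_000)  # True
```

Filtering at index time rather than query time means the rejected ~1.1 billion pages cost nothing at search time, which is consistent with indexing only a fraction of what is crawled.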
> How many pages do you crawl each day?
I don't know if I have a good answer for that. In general the crawling isn't really much of a bottleneck. I try to refresh the index completely every ~8 weeks, and also have some capabilities for discovering recent changes via RSS feeds.
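A minimal sketch of what RSS-based change discovery can look like: parse a feed and collect item links published since the last crawl, so those pages can be re-fetched ahead of the next full index refresh. The feed content and function names are hypothetical; this is an illustration, not Marginalia's implementation.

```python
# Sketch: discover recently changed pages via an RSS feed by comparing
# each item's pubDate against the time of the last crawl.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

def changed_links(rss_xml: str, last_crawl: datetime) -> list[str]:
    """Return item links whose pubDate is newer than the last crawl."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if link and pub and parsedate_to_datetime(pub) > last_crawl:
            links.append(link)
    return links

# Hypothetical feed with one item newer and one older than the last crawl.
feed = """<rss><channel>
  <item><link>https://example.com/new</link>
        <pubDate>Mon, 02 Jun 2025 12:00:00 GMT</pubDate></item>
  <item><link>https://example.com/old</link>
        <pubDate>Mon, 05 May 2025 12:00:00 GMT</pubDate></item>
</channel></rss>"""

fresh = changed_links(feed, datetime(2025, 5, 20, tzinfo=timezone.utc))
# fresh contains only the link published after the last crawl
```

The appeal of this approach is that feeds are tiny compared to the pages they describe, so recency can be tracked cheaply between the ~8-week full refreshes.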
> Size of the machine(s) in RAM and HDD?
It's an EPYC 7543 x2 SMP machine with 512 GB RAM and something like 90 TB disk space, all NVMe storage.