
I've been experimenting with techniques for figuring out the topic of a blog, and have never quite gotten anything to work well that didn't involve extreme computational costs, the kind that just don't work at search-engine scale without a serious hardware budget (zero-shot classifiers did a decent job, IIRC). An unga bunga technique like TF-IDF works well for a given document, but less well for a whole website.

Had some half-decent success finding similar blogs given a reference though.
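
For anyone unfamiliar with the per-document TF-IDF mentioned above, here is a minimal sketch using only the standard library. The function name `tf_idf`, the toy documents, and the plain `log(N / df)` weighting are assumptions picked for illustration, not what the commenter's search engine actually runs.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document; docs are lists of tokens."""
    n = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

# Hypothetical toy corpus, one token list per document.
docs = [
    "the vax ran the usenet feed".split(),
    "the blog covers search engines".split(),
    "the search engine indexes the blog".split(),
]
scores = tf_idf(docs)
```

A term that appears in every document ("the") scores zero, while terms unique to one document score highest. That is why it works per document: the scores only rank terms against the rest of the corpus, and nothing in it aggregates many pages into one topic for a site.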



Yes, I expect full-text extraction and normalization to be computationally expensive. I did some simple-minded experiments on Usenet feeds back in the '90s, and it was quite some work for the VAX it ran on.


You can get it to run fairly fast on modern hardware. Running a text extraction, tokenization and POS-tagging workflow on a quarter billion documents on PC hardware takes something like 24-36 hours. That's doable and affordable. But ML-adjacent methods are not: they require far too much GPU compute, and I have no A100s :-/
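
To put the figures above in perspective, a quick back-of-the-envelope calculation. It uses only the numbers quoted in the comment (a quarter billion documents, 24 to 36 hours); the names `DOCS` and `docs_per_second` are made up for the example.

```python
DOCS = 250_000_000  # "a quarter billion documents"

def docs_per_second(hours):
    """Sustained throughput needed to finish DOCS documents in the given hours."""
    return DOCS / (hours * 3600)

fast = docs_per_second(24)  # finishing in 24 hours
slow = docs_per_second(36)  # finishing in 36 hours

# Wall-clock time available per document, in milliseconds.
budget_ms_fast = 1000 / fast
budget_ms_slow = 1000 / slow
```

That works out to roughly 1,900 to 2,900 documents per second for the whole machine, or about a third to half a millisecond of wall-clock time per document. Any per-document step that costs noticeably more than that, summed across all cores, stretches the run well past the quoted window.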



