Hacker News
Show HN: Pipet – CLI tool for scraping and extracting data online, with pipes (github.com/bjesus)
49 points by yoavm on Oct 2, 2024 | hide | past | favorite | 7 comments
I often find myself in situations where I need to extract some data from a website, either as a one-time thing or periodically (and then watch for changes). Maybe I'm tracking stocks, or a delivery, or I want to know when tickets become available for the local sauna; maybe a friend called and asked "can you get me all the data from that website into a spreadsheet?" (this happens surprisingly often)

I used to write one-off scripts for that, often in Python or JavaScript, but I noticed I'd spend about one minute getting the right CSS selectors, and then another 10 minutes setting up the rest of the script. I decided to write Pipet so that I'd never need to write any of that boilerplate code again. By now I've probably spent more time writing Pipet than on all the scrapers I've ever written, but hey, it's been fun…

Pipet has two modes: curl and Playwright. With curl, you can either type "curl http://news.ycombinator.com" or just copy-paste the request you want to duplicate from the browser. Pipet runs curl just like your shell does, so all headers and preferences work the same. I found this super useful when trying to emulate a real browser or access something behind a login. With Playwright mode, you just use JavaScript. If it works in the devtools console, it should work with Pipet too.

Some other features I think are useful: you can output the results as text or JSON, or write a template file for the results to be rendered into. You can also run Pipet on an interval, and run a command when the data changes. Lastly, Pipet fully integrates with UNIX pipes, so you can do stuff like `div#main h1 | wc -c` and it will take the h1 from a div with id "main" and pipe its HTML to `wc -c`. It makes it extremely easy to use tools you already know to process the data before Pipet outputs it. It also works when processing JSON - you can call jq, or whatever tool you like, to help with the processing.
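To make the pipe syntax concrete, here's a hypothetical pipet file using the `div#main h1 | wc -c` selector from above (the URL is made up for the example, and the exact file layout is a sketch based on the description here - check the README for the real format):

  curl https://example.com
  div#main h1 | wc -c

Saved as something like count.pipet, running Pipet on it would extract the h1 inside the div with id "main" and pipe its HTML through `wc -c`, so the output is a character count rather than raw HTML.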

I've noticed I've written so many more little scrapers since I've had Pipet around, because it's become such an easy task - so hopefully others will find it useful too!



This is great. Thanks for sharing. I'm sitting on a cache of one-off scripts as well. Looking forward to checking this out further.


Just as an example: if you have Go installed on your laptop, you can extract all the comments from this post as JSON by creating a file with

  curl https://news.ycombinator.com/item?id=41695549
  .comment
    div > div
saving it as comments.pipet, and running `go run github.com/bjesus/pipet/cmd/pipet@latest --json comments.pipet`. Or run it with `--interval 60 --on-change "notify-send {} "` to periodically check for updates and call notify-send (on Linux) to get a notification when a new comment appears!


Well done!


That's all nonsense. If it's not imitating a real user by launching a real browser on a GPU and moving the mouse cursor with AI like a real human would, you might as well throw this project in the trash. Also, the code doesn't matter unless it's run from a network identified as a regular residential apartment.


The idea is that you run this from your personal computer, so I don't really see any issues with the network? Regarding the AI browser emulation on a GPU - do you have an example of such a website that you'd like to scrape? I'd love to see how Pipet can support that.


Said by "cynicalsecurity" :D


Because every website has such an advanced protection?




