Hacker News: bede's comments

Thanks for publicising. I recently decided not to renew my Backblaze subscription in favour of 'self hosting' encrypted backups outside the US. But I was horrified to learn that my git repos may not have been backed up, nor my Dropbox, whose subscription I also recently cancelled. Good riddance.

My experience using restic has been excellent so far: snapshots take 5 mins rather than 30 mins with Backblaze's Mac client. I just hope I can trust it…


> Probably obvious for many, but I didn't realize ACs don't transport any air into the room, but just moves it around

I had the same epiphany as you days after acquiring a CO2 monitor. Most people notice poor indoor air quality from proxies such as humidity and temperature. AC (without ventilation) eliminates these and tricks our senses very effectively, giving us cool, fresh-feeling indoor spaces that are nonetheless full of CO2.


> There’s another distance limit at work here, and that is the speed of light. It takes milliseconds for the signal in your phone to reach the hotel above ground and be handed over to the mobile network.

It takes roughly 100 µs for light to travel 30 km. Can you explain how the speed of light is relevant here?


...and in 1 ms it travels 300 km. Maybe they just want to sound technical, somewhat to match the rest of the article. They certainly didn't use ChatGPT, so maybe that's a good thing.
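
For anyone who wants to check the figures in this exchange, here is a rough back-of-the-envelope sketch. The function name and the 0.67 velocity factor for optical fibre are assumptions for illustration, not something taken from the article; signals in real cables and radio equipment also pick up processing delays that this ignores.

```python
# Speed of light in vacuum, in km/s (exact by definition of the metre).
C_KM_PER_S = 299_792.458


def one_way_delay_us(distance_km: float, velocity_factor: float = 1.0) -> float:
    """Return the one-way propagation delay in microseconds.

    velocity_factor scales the vacuum speed of light for a given medium
    (1.0 = vacuum; about 0.67 is an assumed typical value for glass fibre).
    """
    return distance_km / (C_KM_PER_S * velocity_factor) * 1e6


print(f"30 km, vacuum:  {one_way_delay_us(30):.0f} us")
print(f"300 km, vacuum: {one_way_delay_us(300) / 1000:.2f} ms")
print(f"30 km, fibre:   {one_way_delay_us(30, 0.67):.0f} us (assumed factor 0.67)")
```

Even with the slower fibre figure, tens of kilometres cost well under a millisecond, so propagation delay alone does not get you to "milliseconds" at these distances.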


For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further


Amazing, thank you!

I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.


Another format that might be worth looking at in the bioinformatics world is HDF5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the HDF5 format with the self-describing decompression routines of OpenZL.



Author of [0] here. Congratulations and well done for resisting. Eager to try it!

Edit: Have you any specific advice for training a FASTA compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?


A Zstd maintainer clarified this: https://news.ycombinator.com/item?id=45251544

> Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses


Fascinating, thank you.


Thanks for reminding me to benchmark this!


I've only tested this when writing my own parser where I could skip the record end checks, so idk if this improves perf on an existing parser. Excited to see what you find!


Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate, with gzip being as slow as it is.

> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I even confused myself about this while writing :-)


Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
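
To illustrate why this stays compatible, here is a minimal sketch of the underlying idea: compress fixed-size blocks independently (so they can be done in parallel) and concatenate the resulting gzip members, which any standard gzip reader will decode as one stream. This is not real BGZF: it uses Python's zlib-backed `gzip` module rather than libdeflate, it does not write the BGZF extra field that records each block's size, and `compress_blocks` and `BLOCK_SIZE` are hypothetical names for the example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Assumed block size for illustration; real BGZF keeps blocks under 64 KiB.
BLOCK_SIZE = 64 * 1024


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Compress data as a series of independent, concatenated gzip members."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Each block is compressed on its own, so the work can be spread
    # across workers; map() preserves the original block order.
    with ThreadPoolExecutor() as pool:
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


payload = b">seq1\nACGTACGTACGT\n" * 20_000
blob = compress_blocks(payload)

# A plain gzip decoder reads the multi-member stream back as one payload.
assert gzip.decompress(blob) == payload
print(f"{len(payload)} bytes -> {len(blob)} bytes in block-compressed form")
```

The cost mentioned above shows up here too: because each block starts with a fresh compression window, the output is usually somewhat larger than compressing the whole payload in one go.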


The BAM format is widely used, but assemblies still tend to be generated and exchanged in FASTA text. BAM is quite a big spec and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards etc.)

e.g. https://github.com/ArcInstitute/binseq

