Hacker News | eivanov89's comments

When the IOMMU is not enabled, any PCIe device capable of DMA can access arbitrary physical memory. This allows reading sensitive data, modifying memory, and fully compromising the system without CPU involvement.

There are many DMA-based attacks described in the literature. Even with IOMMU, some attacks are still possible due to misconfiguration or incomplete isolation. For example: https://www.repository.cam.ac.uk/items/13dcaac4-5a3d-4f67-82...

In our case, we didn’t dive deeply into the security aspects. Our typical deployment assumes a trusted environment where YDB runs on dedicated hardware, so performance considerations tend to dominate.


BTW, the whole situation with IRQ accounting disabled reminds me of the -fomit-frame-pointer case. For a long time there was no practical performance reason, yet the option kept being used anyway, making stacks slower and harder to build, both for perf analysis and for stack unwinding in languages like C++.

After careful reading I'm surprised how the small IRQ squares add up to 30%. I should look for interrupts when I inspect our flamegraphs next time.


I was doing over 11M IOPS during that test ;-)

Edit: I wrote about that setup and other Linux/PCIe root complex topology issues I hit back in 2021:

https://news.ycombinator.com/item?id=25956670


FYI 11M IOPS in terms of AWS EBS is 138 gp3 volumes (80K IOPS each), which costs about $56K/month or about $1.3M over 2 years. If anyone was considering using EBS for high-IOPS workloads, don't.

I think your test had ten 980 Pros, which were probably around $120 each at the time (~$1,200 total). SSDs are wildly more expensive now, but even at $500 each, it's nowhere close to EBS.

It's apples vs oranges, but sometimes you just want fruit.


That's super hot, especially the update with the 37M IOPS reference. It might be very useful for my next tasks related to a setup with 6 NVMe disks:
1. Get all disks saturated through the network (including RDMA usage).
2. Play with io_uring to share a polling thread.
Currently, no luck: if I share the kernel poller between two devices, the improvement is just +30% (at the cost of 1 core). Considering alternative schemes now.


Unfortunately, we don't have proper measurements for IOPOLL mode with and without IOMMU, because initially we didn't configure IOPOLL properly. However, I bet that mode is affected as well, because the disk still has to DMA through the IOMMU.

You suggest very interesting measurements. I will keep them in mind and try them during the next experiments. I wish I had read this earlier, so I could have applied it during the past runs :)
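If anyone wants to try this, a minimal fio sketch for that comparison might look like the following (device path and parameters are illustrative, not our actual config; IOPOLL additionally requires the driver to allocate poll queues, e.g. nvme.poll_queues=4):

```ini
[global]
filename=/dev/nvme0n1   ; illustrative device
direct=1
rw=randread
bs=4k
iodepth=32
runtime=60
time_based=1

[uring-irq]
ioengine=io_uring

[uring-iopoll]
stonewall               ; run after the previous job finishes
ioengine=io_uring
hipri=1                 ; IOPOLL: poll for completions instead of interrupts
```

Running each job with IOMMU on/off would give the four data points we are missing.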


Yeah you'd still have the IOMMU DMA translation, but would avoid the interrupt overhead...


That's a popular DBMS pattern. We chose writes over reads because on many NVMe devices writes are faster, and it is easier to measure software latency.

I guess that in the case of sequential I/O the result would be similar. However, with larger blocks and fewer IOPS the difference might be smaller.


So perhaps a mixed read+write workload would be more interesting, no? Write-only is characteristic of ingestion workloads. That said, the libaio vs io_uring difference is interesting. Did you perhaps run a perf profile to understand where the differences come from? My gut feeling is that it is not necessarily an artifact of less context switching with io_uring but something else.


There are a couple of challenges with mixed read+write workloads on NVMe.

In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.
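A sketch of what such a steady-state-first run could look like in fio (device path, block sizes, and mix are assumptions for illustration, not our actual setup):

```ini
[global]
filename=/dev/nvme0n1   ; illustrative device
direct=1
ioengine=io_uring
iodepth=32

[precondition]
rw=write
bs=128k
loops=2                 ; overwrite the device twice so the FTL approaches steady state

[steady-mixed]
stonewall               ; start only after preconditioning completes
rw=randrw
rwmixread=70
bs=4k
runtime=300
time_based=1
```

The preconditioning pass is what makes consecutive mixed runs comparable.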

For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.

As for profiling: we didn’t run perf, so the following is my educated guess:

1. With libaio, control structures are copied as part of submission/completion. io_uring avoids some of this overhead via shared rings and pre-registered resources.

2. In our experience (in YDB), AIO syscall latency tends to be less predictable, even when well-tuned.

3. Although we report throughput, the setup is effectively latency-bound (single fio job). With more concurrency, libaio might catch up.

We intentionally used a single job because we typically aim for one thread per disk (two at most if polling enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable.


Some quality thoughts here, thanks.

> In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.

I see. A provocative thought in that case would be: to what extent are io_uring improvements (over libaio) undermined by device behavior (firmware) in mixed workloads? That could range from noticeable to almost nothing, so it might very well affect the experiment's conclusion.

For example, if one is posing the question if switching to io_uring is worth it, I could definitely see different outcomes of that experiment in mixed workloads per observations that you described.

> For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.

Yeah, but in that case you would then be testing the limits of ublk performance, no? Also, it seems to be implemented on top of io_uring AFAICS.

I have personally learned to run experiments, and derive conclusions from them, in an environment that is as close as it gets to the one in production. Otherwise, there's really no guarantee that behavior observed in env1 will reproduce or correlate with behavior in env2. Env1 in this particular case could be a write-only workload while env2 would be a mixed workload.

> We intentionally used a single job because we typically aim for one thread per disk (two at most if polling enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable

This is also interesting. May I ask why that is the case? Are you able to saturate the NVMe disk with just a single thread? I assume not, but you may be using some particular workloads and/or avoiding the kernel in a way that makes this possible.


> A provocative thought in that case would be: to what extent are io_uring improvements (over libaio) undermined by device behavior (firmware) in mixed workloads? That could range from noticeable to almost nothing, so it might very well affect the experiment's conclusion.

That’s absolutely fair. Also, it would be useful to test across different devices, since their behavior can vary significantly, especially when preconditioned or under corner-case workloads.

In our case, we focused on scenarios typical for YDB deployments, so we didn’t extend the study further. That said, we believe the observed trends are fairly general.

> For example, if one is posing the question if switching to io_uring is worth it, I could definitely see different outcomes of that experiment in mixed workloads per observations that you described.

I agree that for mixed workloads the outcome may differ. However, for us the primary concern in the AIO vs io_uring comparison is syscall behavior.

It is critical that submission does not block unpredictably. Even without polling, io_uring shows consistently better latency across the full range of iodepths. If device latency dominates (as in your scenario), the relative benefit may shrink, but a faster submission path still helps drive higher effective queue depth and utilize the device better.

> This is also interesting. May I ask why that is the case? Are you able to saturate the NVMe disk with just a single thread? I assume not, but you may be using some particular workloads and/or avoiding the kernel in a way that makes this possible.

The component we are working on is designed for write-intensive workloads. Due to DWPD constraints, we intentionally limit sustained write throughput to what the device can safely handle over its lifetime. In practice, this is often on the order of ~200–300 MB/s, which a single thread can easily saturate.
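As a rough back-of-the-envelope illustration (drive size and rating here are assumptions, not our actual hardware): a 7.68 TB drive rated for 3 DWPD can absorb about 7.68 TB × 3 / 86,400 s ≈ 267 MB/s of sustained writes, which is exactly that order of magnitude.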

At the same time, we care a lot about burst behavior. With AIO, we observed poor predictability: total latency depends heavily on how requests are submitted (especially with batching), and syscall time can grow proportionally to batch size * event count.

io_uring largely eliminates this issue by decoupling submission from syscalls and providing a much more stable submission path. Additionally, for bursty workloads we can use SQPOLL + IOPOLL to further reduce latency in specialized setups.
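In fio terms, that specialized polling setup looks roughly like this (a sketch with illustrative parameters; hipri needs nvme.poll_queues > 0, and SQPOLL may require elevated privileges depending on kernel version):

```ini
[polled]
filename=/dev/nvme0n1   ; illustrative device
direct=1
ioengine=io_uring
sqthread_poll=1         ; SQPOLL: a kernel thread polls the SQ, no submit syscalls
sqthread_poll_cpu=2     ; pin the poller thread (illustrative CPU)
hipri=1                 ; IOPOLL: poll for completions instead of interrupts
rw=randwrite
bs=4k
iodepth=32
```

The trade-off is the dedicated core burned by the poller, which is why we reserve this for specialized setups.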


> Also, it would be useful to test across different devices, since their behavior can vary significantly, especially when preconditioned or under corner-case workloads.

Agreed. And from first-hand experience I know how painful this is, and how proving or disproving a hypothesis you have about a certain cog in the system can turn into a crazy rabbit hole, especially in infrastructure software, which, due to the ever-increasing volume of data (and distribution thereof), stresses the software and hardware to their limits.

I used to test my algos across a wide range of HW I had access to. It included "slow" HDDs and "fast" NVMe disks, even Optane, low and high amounts of RAM, slow and fast CPUs, different cache sizes and topologies, NUMA vs no-NUMA, etc. This was because the software I developed didn't have the luxury of running on fully controlled SW/HW, so I had to make sure it ran well across different configurations, even operating systems, microarchitectures, etc.

And it was a challenge to decouple the noise from the signal, given how many experiments one had to run and how volatile (stateful) our HW generally is, not to mention all the non-determinism imposed by the software (database kernel + operating system kernel).

> In our case, we focused on scenarios typical for YDB deployments, so we didn’t extend the study further.

Yes, that is fair enough and basically all that matters - not solving a "general" problem but solving the problem at hand has been the most successful strategy for me as well.

> It is critical that submission does not block unpredictably. Even without polling, io_uring shows consistently better latency across the full range of iodepths. If device latency dominates (as in your scenario), the relative benefit may shrink, but a faster submission path still helps drive higher effective queue depth and utilize the device better.

Yes, I would probably easily agree that io_uring is in general a better design. A C++ executors-like design, but in the kernel itself; pretty advanced from what I could tell the last time I delved into the implementation details (~2 years ago). Given that I developed an executor-like (userspace) library myself, I figure that in more extreme cases one would like to gain total control over IO scheduling and processing. This is an exercise I would like to do at some point.

> ... io_uring largely eliminates this issue by decoupling submission from syscalls and providing a much more stable submission path. Additionally, for bursty workloads we can use SQPOLL + IOPOLL to further reduce latency in specialized setups.

Thanks for sharing the details. I figured there was something peculiar about what you're doing. Quite interesting requirements.


Dear folks, I'm the author of that post.

A short summary below.

We ran fio benchmarks comparing libaio and io_uring across kernels (5.4 -> 7.0-rc3). The most surprising part wasn't the io_uring gains (~2x), but a ~30% regression caused by the IOMMU becoming enabled by default between releases.
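A minimal sketch of the kind of comparison job involved (parameters here are illustrative, not our exact config from the post):

```ini
[global]
filename=/dev/nvme0n1   ; illustrative device
direct=1
rw=randwrite
bs=4k
iodepth=32
numjobs=1
runtime=60
time_based=1

[libaio]
ioengine=libaio

[io_uring]
stonewall               ; run after the libaio job
ioengine=io_uring
```

Running the same job file on each kernel, with the IOMMU toggled, surfaces both effects.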

Happy to share more details about the setup or help reproduce the results.


Thanks for sharing this.

Was the iommu using strict or lazy invalidation? I think lazy is the default but I'm not sure how long that's been true.


We compared the IOMMU fully disabled vs enabled. When it is enabled, I expect it to be lazy (that should be the IOMMU default). Note that we recommend using passthrough to completely bypass translation for most devices, regardless of strict/lazy mode.
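For reference, passthrough and the strict/lazy choice are typically selected via kernel boot parameters; a sketch for an Intel x86 box (flags differ per platform):

```text
# Kernel command line additions (illustrative):
intel_iommu=on iommu=pt    # keep the IOMMU available, but identity-map (passthrough) DMA by default
iommu.strict=0             # lazy IOTLB invalidation (better performance)
iommu.strict=1             # strict invalidation (better isolation)
```

With iommu=pt, devices not explicitly assigned (e.g. via VFIO) skip translation entirely.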


That's indeed interesting, thank you for sharing.


To the best of my knowledge, only forks like YugabyteDB make PostgreSQL truly distributed. Otherwise you'd have to switch to another DB.


Dear friends, I’m the author of this post and I’d love to hear your thoughts and discuss it here.


Sorry, the title might be a little inaccurate. However, the post does describe multiple cases where attackers stole a lot of BTC from exchanges because of issues with weak isolation levels. Moreover, one of the exchanges was totally ruined because of that.


Well, then maybe write a blog post explaining exactly what happened here and submit that?

Because, even having re-read the article you linked, it does not support the conclusion that "[an] exchange[...] was totally ruined because of [weak isolation]" at all?


My post is secondary research on potential issues with weak isolation levels. It includes a link [0] to an in-depth description of what happened to Flexcoin. Additionally, the post references another, similar BTC attack [1] that exploited a "lost update" due to weak isolation levels.

The goal of the post is to highlight this problem, as cited research papers clearly demonstrate that such issues occur more frequently than commonly perceived.

Again, I'm sorry that the title might be misleading and that you expected different content.

[0] https://hackingdistributed.com/2014/04/06/another-one-bites-...

[1] https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole...


Hi there, I'm the author of the post. I'm happy to answer any questions and appreciate any feedback and experience shared.

