Hacker News

Does anyone know of a good description of which parts of K's implementation make it so fast? I have heard the interpreter itself is quite small, and easily fits in L1 CPU cache, which helps of course. Are the primitives further implemented using vector instructions or multi-threading? Does the K interpreter pattern-recognize compositions of constructs and dispatch to an optimized implementation, as with APL "idioms"?


> I have heard the interpreter itself is quite small which helps of course

It helps a lot more than a lot: Small is everything.

"Main memory" is something like 100x slower than "L1 CPU cache" in latency terms, so if your whole program lives in L1 you only pay to receive data, which streams in as fast as main memory can deliver it. How can you possibly go any faster than that?

The interpreter looks a lot like this[1]: scan a character, do something with it. There's no separate tokenizing pass, no AST, and not much analysis at all. Being simple helps keep it small. Writing densely makes it easy to see (without scrolling) similar code which can be refactored into less space. This is how small (dense) source code also helps make small object code. This is how small (dense) source code is fast!

[1]: nsl.com/papers/origins.htm
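That loop structure is roughly this, as a toy C sketch (hypothetical code, not K's actual source; the `eval` name, the postfix digits-only input, and the two-operator set are all invented for illustration):

```c
/* Character-dispatch evaluation: each byte of the program selects a
 * handler directly, so the whole "interpreter" is one tight loop that
 * fits in a few cache lines. No tokenizer, no AST. */

static long stack[64];
static int sp;

static void push(long v) { stack[sp++] = v; }
static long pop(void)    { return stack[--sp]; }

long eval(const char *p) {           /* postfix: "34+" evaluates to 7 */
    sp = 0;
    for (; *p; p++) {
        switch (*p) {
        case '+': { long b = pop(), a = pop(); push(a + b); } break;
        case '*': { long b = pop(), a = pop(); push(a * b); } break;
        default:  push(*p - '0');    /* single-digit operand */
        }
    }
    return pop();
}
```

The real thing dispatches through tables of function pointers and evaluates right-to-left, but the shape — one loop, one switch on the current character — is the same.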

Once you've done all that, vector instructions and multi-threading can help eke out a little more speed. Recognising a couple of characters at a time and special-casing them can sometimes help as well, but it can also cost a lot of object size quickly, so there needs to be some balance.


You can "go faster" than memory bandwidth by accessing memory more intelligently (that is, by reducing the amount of memory traffic, possibly by improving cache behaviour). The classic example is loop tiling for matrix multiplication. Here you write significantly more code, but reduce accesses to main memory by a good constant factor. Sadly, with modern architectures, just keeping things small and simple is rarely the way to peak performance. My experience from doing high-performance programming (mostly on GPUs) is that you have a very generous budget when it comes to how large your code can be, but you have an extremely tight budget when it comes to how large your data is. (Of course, in absolute terms you access significantly more data than code, but you generally mostly worry about data size, not so much code size.)
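The loop-tiling example mentioned above looks roughly like this in C (a sketch; the tile size and function name are illustrative, and production BLAS kernels do considerably more):

```c
/* Tiled matrix multiply, C = A*B, n x n, row-major. The blocked loop
 * order keeps a tile of A, B and C hot in cache, so each element of B
 * is fetched from main memory roughly n/T times instead of n times. */
enum { T = 32 };  /* tile edge; tune to cache size */

void matmul_tiled(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int ii = 0; ii < n; ii += T)
      for (int kk = 0; kk < n; kk += T)
        for (int jj = 0; jj < n; jj += T)
          for (int i = ii; i < n && i < ii + T; i++)
            for (int k = kk; k < n && k < kk + T; k++) {
              double a = A[i * n + k];
              for (int j = jj; j < n && j < jj + T; j++)
                C[i * n + j] += a * B[k * n + j];
            }
}
```

Note this is significantly more code than the naive triple loop, which is exactly the trade-off described: more instructions, much less memory traffic.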


> which parts of K's implementation make it so fast?

the missing bullshit parts :)

also simplicity and good memory access patterns

> vector instructions

if you write simple loops in c, compilers are good at generating those

> Does the K interpreter pattern-recognize

afaik some, like *| (first-of-reverse, i.e. last), but fewer than Dyalog
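On the simple-loops point above: "simple" means something like this (a generic sketch, not K's source), which gcc and clang will typically auto-vectorize at -O2/-O3 without any intrinsics:

```c
/* Unit stride, no aliasing (restrict), no branches in the body:
 * exactly the shape compilers turn into SSE/AVX code on their own. */
void add(long n, const double *restrict a,
                 const double *restrict b,
                 double *restrict r) {
    for (long i = 0; i < n; i++)
        r[i] = a[i] + b[i];
}
```

Since K primitives are whole-array operations, their C implementations naturally take this shape, so the compiler does the vectorizing for free.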


No vector instructions at present (I think an older version may have done).


As a user (IANAP), I think avoiding I/O contributes significantly.

The design paradigm of k is to memory map the data and then "forget about the filesystem"; transform data in memory, instead of reading and writing "files".
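The mmap side of that paradigm looks roughly like this in C (a sketch under stated assumptions: `map_file` is an invented helper, and k's actual implementation differs, e.g. it maps typed vectors directly):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stddef.h>

/* Map a whole file read-only. Afterwards the data is just a pointer:
 * the kernel pages it in on demand, and there is no read() loop or
 * userspace copy -- you "forget about the filesystem". */
const char *map_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : (const char *)p;
}
```

Once mapped, a column of numbers on disk can be treated as an in-memory array and transformed in place, which is the "transform data in memory" workflow described above.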

The slowest part of working in k, for me, is the process of reading files into memory. And I run kernel and filesystem entirely in RAM (mfs and tmpfs); no disk is involved.

If there are other software authors who also adopted this paradigm of avoiding I/O, I would like to know of them.


I probably used the wrong terminology here. What I meant really is avoiding (minimizing) disk I/O. (Others have already pointed out how k avoids I/O with main memory by staying resident in a CPU cache.)

Ideally I would prefer a system with sufficient RAM, without the need for "short-term" disk storage and thus without the need for virtual memory. Today I have the RAM I need, and no requirement for short-term disk storage.

But like the rest of us, I use kernels that date back to a different era. These are kernels that assume insufficient RAM, the presence of a disk and the need for a "swap device" to do work. Disk I/O is a major part of that paradigm and has been embraced by software authors.

To me, disk is slow. Even SSD. I seek a different paradigm, not one still focused on secondary storage. k is the closest thing I have found.



