Hacker News
Strings in WebAssembly (medium.com/wasm)
70 points by dmit on Feb 24, 2020 | 30 comments


The Mozilla developer working on WebAssembly wrote some good articles on types:

https://hacks.mozilla.org/2019/08/webassembly-interface-type...


Maybe it was just me, but this felt a bit overly pedantic. Understanding the internals of wasm-bindgen is important for understanding how Rust handles strings in WASM, but I was expecting a higher-level discussion of how strings are passed to WASM.


Due to the lack of native strings in WebAssembly, different Wasm compilers have different memory layouts and string encodings. For example, AssemblyScript uses UCS-2 for the sake of compatibility with JavaScript. This obliges you to work carefully with memory bounds and string length estimation, because the host's native string encoding and the guest's can differ.
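To make the length-estimation point concrete, here is a rough Python sketch (not taken from any Wasm toolchain; `encoded_sizes` is a hypothetical helper) that counts how much space the same text needs when a guest stores it as UTF-8 versus 16-bit code units:

```python
# Sketch: the same text needs a different number of bytes and code units
# depending on the encoding the guest module uses in linear memory.
# `encoded_sizes` is a hypothetical helper made up for this illustration.

def encoded_sizes(text: str) -> dict:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return {
        "code_points": len(text),
        "utf8_bytes": len(utf8),
        "utf16_code_units": len(utf16) // 2,
        "utf16_bytes": len(utf16),
    }

# Escapes are used so the samples do not depend on the source file encoding.
samples = ["wasm", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F600"]
for sample in samples:
    print(ascii(sample), encoded_sizes(sample))
```

A host that allocates a buffer from the wrong count (say, code units when the guest expects UTF-8 bytes) will under- or over-allocate, which is where the care with memory bounds comes in.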

For the specific goal of working with strings in Rust and AssemblyScript I've created this project: https://github.com/onsails/wasmer-as.


I've created an equivalent library for Python and used your project as a reference:

https://github.com/miracle2k/wasmbind


I'm wondering why AssemblyScript uses UCS-2 instead of UTF-8. Do browsers use UCS-2 as well?



AS, like JS, interprets strings as UTF-16LE in the following methods: String.p.codePointAt, String.p.toUpperCase/toLowerCase, String.p.localeCompare, String.p.normalize, String.fromCodePoint, Array.from(str). In all other cases strings are interpreted as UCS-2.
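The difference between the two views can be sketched in Python (used here only to model the behaviour; the helper names are made up for this example). A UCS-2-style view sees independent 16-bit code units, while a UTF-16-aware view joins surrogate pairs into a single code point:

```python
# Sketch: "UCS-2 view" (raw 16-bit units) vs "UTF-16 view" (surrogate pairs
# joined). Both helper functions are hypothetical names for this example.

def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units a JS-style string would hold."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def code_points_from_units(units: list) -> list:
    """Join high/low surrogate pairs, as code-point-aware methods do."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(unit)  # BMP character or lone surrogate, kept as-is
            i += 1
    return out

units = utf16_code_units("a\U0001F600")
print([hex(u) for u in units])                          # ['0x61', '0xd83d', '0xde00']
print([hex(c) for c in code_points_from_units(units)])  # ['0x61', '0x1f600']
```

So a length or index taken in the code-unit view (3 here) does not match the code-point view (2), even though the underlying storage is identical.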


They do in all the observable JS APIs, but behind the scenes there are a number of optimizations in each JS engine to deal with the fact that most JS and JSON source comes off the wire in UTF-8 or ASCII.


To quote the Medium comment from Masklinn:

> Actually, humans generally think in terms of graphemes, which may or may not be composed of multiple Unicode code points (irrespective of the normalization form being used).
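A small Python sketch of that point, using only the standard `unicodedata` module (`code_point_counts` is a hypothetical helper): each sample below is normally displayed as one user-perceived character, yet its code point count depends on the normalization form, or stays above one in every form.

```python
import unicodedata

# Sketch: one displayed character can be one or several code points.
# `code_point_counts` is a hypothetical helper for this illustration.

def code_point_counts(text: str) -> dict:
    return {form: len(unicodedata.normalize(form, text)) for form in ("NFC", "NFD")}

e_acute = "\u00e9"              # e with acute accent, precomposed form
flag = "\U0001F1EF\U0001F1F5"   # two regional indicator symbols (J, P)

print(code_point_counts(e_acute))  # {'NFC': 1, 'NFD': 2}
print(code_point_counts(flag))     # {'NFC': 2, 'NFD': 2}
```

Counting graphemes properly needs Unicode text segmentation rules, which Python's standard library does not provide, so this only shows the code point side of the mismatch.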


Medium delenda est.


"Delendum". Medium delendum est. It's neuter gender unlike Carthage which was feminine like many city names.



A joke is not meant to be accurate. This one is funnier if you only change one word of the original quote, even if the resulting one is wrong.


Medium should be "media," no?


I hate Medium myself, but before going that route it is important to recognize its success and ponder why.

- Why do so many tech users use Medium to write tech articles and not Ghost, Wix, Tumblr or a WordPress provider?

- Why do so many people post links to articles from Medium?

Only if we answer those questions and provide an alternative can we destroy it; otherwise people will just keep doing that.


For the Greek-impaired: “Medium must be destroyed.” [Edit: Latin! I meant Latin! Oy, vey!]


It is Latin, no?


Of course. For those who aren't familiar with the phrase: https://en.wikipedia.org/wiki/Carthago_delenda_est


Is it unusual for a VM not to have a string (or at least a bytes) type? I have little experience in the space, but it seems clunky. Curious why WASM went this direction.


WASM just reached MVP. The design goal of the MVP seems to be a small spec that includes the minimum set of requirements.

I guess that is the reason why all four major browsers could adopt it at almost the same time, without many rounds of proposal ping-pong between Google, Mozilla, Apple, and Microsoft.


This article, and proposals for reference types, though, make it sound like working around the lack of strings/bytes/chars might have been more work in the end. Sort of like a unicycle isn't really a bicycle MVP :)


Right, but it is better than having to wait more years until browsers reach consensus. I'm not sure, but specifying a WASM-native string representation might favor one JS engine over the others, depending on its existing JS string implementation. This might cause disagreements.


The problem is that WebAssembly exists for other languages to be compiled into.

And those other languages don't agree what a string looks like.

So do you have a Rust string? Or a C string? Or a C# string? Or one of the other representations?

By just giving you a bunch of bytes to play with, and letting different languages use them differently, WebAssembly stays out of the way and lets the compiler decide how it wants to make strings.
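As a rough illustration of two such conventions living in the same bytes, here is a Python sketch where a `bytearray` stands in for linear memory. The helper names are invented for this example; the layouts modelled are a C-style NUL-terminated string and a Rust-style (pointer, length) pair over UTF-8 bytes.

```python
# Sketch: two string layouts in one flat byte array. The bytearray is only a
# stand-in for Wasm linear memory, and all helper names are hypothetical.

def _check_bounds(mem: bytearray, ptr: int, size: int) -> None:
    if ptr < 0 or ptr + size > len(mem):
        raise IndexError("write would fall outside linear memory")

def write_c_string(mem: bytearray, ptr: int, text: str) -> int:
    """C-style: UTF-8 bytes followed by a NUL terminator; only ptr is passed."""
    data = text.encode("utf-8") + b"\x00"
    _check_bounds(mem, ptr, len(data))
    mem[ptr:ptr + len(data)] = data
    return ptr

def read_c_string(mem: bytearray, ptr: int) -> str:
    end = mem.index(0, ptr)  # scan for the terminator
    return bytes(mem[ptr:end]).decode("utf-8")

def write_slice(mem: bytearray, ptr: int, text: str) -> tuple:
    """Rust-style: UTF-8 bytes, no terminator; (ptr, len) travel together."""
    data = text.encode("utf-8")
    _check_bounds(mem, ptr, len(data))
    mem[ptr:ptr + len(data)] = data
    return ptr, len(data)

def read_slice(mem: bytearray, ptr: int, length: int) -> str:
    return bytes(mem[ptr:ptr + length]).decode("utf-8")

memory = bytearray(64)
c_ptr = write_c_string(memory, 0, "hi")
s_ptr, s_len = write_slice(memory, 16, "h\u00e9llo")
print(read_c_string(memory, c_ptr))      # hi
print(s_len, ascii(read_slice(memory, s_ptr, s_len)))
```

Neither side can read the other's string without knowing which convention was used, which is the interop cost of leaving the choice to each compiler.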


"By just giving you a bunch of bytes to play with"

It doesn't appear to do even that though. You get i32, i64, f32, and f64. The linear memory is some multiple of the 64KiB page size.


And linear memory, which is just an array of bytes.


An i32 is a byte.


The common definition of byte nowadays is an octet.


In WASM i32 can be treated like an 8-bit byte too.


Can you elaborate? I'm not well-versed in WASM internals and I might be missing something.

I mean, any integer with 8+ bits can be treated like an octet, but memory layout is very different (e.g. an array of i8 and an array of i32 look very different even if they represent the same values).


Although WASM value types are only 32 or 64 bits wide, it has 8- and 16-bit load and store operations too, so it won't matter much:

https://webassembly.org/docs/semantics/
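A minimal Python model of what those narrow operations do, assuming a `bytearray` as linear memory. The function names mirror the Wasm instructions `i32.store8`, `i32.load8_u`, `i32.load8_s`, `i32.store` and `i32.load`, but this is a simulation sketch, not an engine API.

```python
import struct

# Sketch: simulating Wasm's narrow memory accesses on a bytearray.
# Values on the stack are i32, but store8/load8 touch exactly one byte.

def i32_store8(mem: bytearray, addr: int, value: int) -> None:
    mem[addr] = value & 0xFF            # keeps only the low 8 bits

def i32_load8_u(mem: bytearray, addr: int) -> int:
    return mem[addr]                    # zero-extended to i32

def i32_load8_s(mem: bytearray, addr: int) -> int:
    byte = mem[addr]
    return byte - 256 if byte >= 128 else byte  # sign-extended to i32

def i32_store(mem: bytearray, addr: int, value: int) -> None:
    struct.pack_into("<I", mem, addr, value & 0xFFFFFFFF)  # little-endian

def i32_load(mem: bytearray, addr: int) -> int:
    return struct.unpack_from("<i", mem, addr)[0]

memory = bytearray(65536)               # one 64 KiB page

# A two-character ASCII string written byte by byte occupies 2 bytes, not 8.
for offset, ch in enumerate(b"Hi"):
    i32_store8(memory, offset, ch)
print(bytes(memory[0:4]))               # b'Hi\x00\x00'

i32_store8(memory, 8, 0x1FF)
print(i32_load8_u(memory, 8), i32_load8_s(memory, 8))  # 255 -1
```

So the array-of-i8 layout is what ends up in memory; the i32 only exists as the value on the stack while a byte is being loaded or stored.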



