Strings in WebAssembly

Benmcdonald__ · on Feb 24, 2020

The Mozilla developer working on WebAssembly wrote some good articles on types:

https://hacks.mozilla.org/2019/08/webassembly-interface-type...

phibz · on Feb 24, 2020

Maybe it was just me, but this felt a bit overly pedantic. Understanding the internals of wasm-bindgen is important for understanding how Rust handles strings in WASM, but I was expecting a higher-level discussion of how strings are passed to WASM.

brainsmith · on Feb 24, 2020

Due to the lack of native strings in WebAssembly, different Wasm compilers have different memory layouts and string encodings. For example, AssemblyScript uses UCS-2 for the sake of compatibility with JavaScript. This obliges you to work carefully with memory bounds and string length estimation, because the host's native string encoding and the guest's can differ.

For the specific goal of working with strings in Rust and AssemblyScript, I've created this project: https://github.com/onsails/wasmer-as.
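
To illustrate the length-estimation problem, here is a minimal Python sketch (not taken from wasmer-as; the helper name `encoded_sizes` is made up for this example). It assumes a host that works in UTF-8 and a guest that stores 16-bit code units, and shows how the same text needs a different number of bytes on each side:

```python
# Illustrative only: compare the size of the same text in UTF-8
# (as a Rust host would hold it) and in UTF-16LE (16-bit code units,
# as a JavaScript-compatible guest would hold it).

def encoded_sizes(text: str) -> dict:
    """Return the code point count and the encoded sizes of `text`."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    return {
        "code_points": len(text),
        "utf8_bytes": len(utf8),
        "utf16_bytes": len(utf16),
        "utf16_units": len(utf16) // 2,
    }

# Escapes keep the source ASCII-only: h-e-acute-l-l-o, three CJK
# characters, and one emoji outside the Basic Multilingual Plane.
samples = ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\U0001F600"]
for sample in samples:
    print(repr(sample), encoded_sizes(sample))
```

For the emoji, this reports 1 code point, 4 UTF-8 bytes and 2 UTF-16 code units, so a buffer sized from one side's idea of "length" can be too small on the other.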

miracle2k · on Feb 24, 2020

I've created an equivalent library for Python and used your project as a reference:

https://github.com/miracle2k/wasmbind

kbumsik · on Feb 24, 2020

I'm wondering why AssemblyScript uses UCS-2 instead of UTF-8. Do browsers use UCS-2 as well?

austincheney · on Feb 24, 2020

https://mathiasbynens.be/notes/javascript-encoding

maxgraey · on Feb 24, 2020

AS, like JS, interprets strings as UTF-16LE in the following methods: String.prototype.codePointAt, String.prototype.toUpperCase/toLowerCase, String.prototype.localeCompare, String.prototype.normalize, String.fromCodePoint, and Array.from(str). In all other cases, strings are interpreted as UCS-2.
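
The difference between the two views can be sketched in Python. This is an emulation, not AssemblyScript or JavaScript code, and both helper names are invented for the example: `utf16_code_units` mimics the UCS-2 view (independent 16-bit units), while `code_points` mimics what the UTF-16-aware methods see after surrogate pairs are combined.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Split text into 16-bit code units (the UCS-2 style view)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def code_points(text: str) -> list:
    """Return full Unicode code points (the UTF-16-aware view)."""
    return [ord(ch) for ch in text]

emoji = "\U0001F600"  # one character outside the Basic Multilingual Plane

print([hex(u) for u in utf16_code_units(emoji)])  # ['0xd83d', '0xde00']
print([hex(c) for c in code_points(emoji)])       # ['0x1f600']
```

The two-element result corresponds to what `length` and `charCodeAt` report in JavaScript, and the single code point to what `codePointAt(0)` returns.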

johncolanduoni · on Feb 24, 2020

They do in all the observable JS APIs, but behind the scenes there are a number of optimizations in each JS engine to deal with the fact that most JS and JSON source comes off the wire in UTF-8 or ASCII.

continuational · on Feb 24, 2020

To quote the Medium comment from Masklinn:

> Actually, humans generally think in terms of graphemes, which may or may not be composed of multiple Unicode code points (irrespective of the normalization form being used).
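
A small Python sketch of that point, using only the standard `unicodedata` module (the helper name `code_point_count` is made up here). Python has no built-in grapheme segmentation, so the example only counts code points under each normalization form. The flag is a single grapheme made of two regional indicator code points, and no normalization form merges them:

```python
import unicodedata

def code_point_count(text: str, form: str) -> int:
    """Count code points after normalizing `text` to the given form."""
    return len(unicodedata.normalize(form, text))

e_acute = "e\u0301"              # "e" + combining acute accent
flag = "\U0001F1EB\U0001F1F7"    # regional indicators F + R, shown as one flag

for form in ("NFC", "NFD"):
    print(form, code_point_count(e_acute, form), code_point_count(flag, form))
```

The accented letter collapses to one code point under NFC and stays at two under NFD, while the flag is two code points under both.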

schnitzelstoat · on Feb 24, 2020

Medium delenda est.

LessDmesg · on Feb 24, 2020

"Delendum". Medium delendum est. It's neuter gender unlike Carthage which was feminine like many city names.

schnitzelstoat · on Feb 24, 2020

How many Romans? https://www.youtube.com/watch?v=IIAdHEwiAy8

BiteCode_dev · on Feb 24, 2020

A joke is not meant to be accurate. This one is funnier if you only change one word of the original quote, even if the resulting one is wrong.

earthboundkid · on Feb 24, 2020

Medium should be "media," no?

BiteCode_dev · on Feb 24, 2020

I hate Medium myself, but before going that route it is important to recognize its success and ponder why.

- Why do so many tech users use Medium to write tech articles and not Ghost, Wix, Tumblr or a WordPress provider?

- Why do so many people post links to articles from Medium?

Only if we answer those questions and provide an alternative can we destroy it; otherwise people will just keep doing that.

drfuchs · on Feb 24, 2020

For the Greek-impaired: “Medium must be destroyed.” [Edit: Latin! I meant Latin! Oy, vey!]

schnitzelstoat · on Feb 24, 2020

It is Latin, no?

saagarjha · on Feb 24, 2020

Of course. For those who aren't familiar with the phrase: https://en.wikipedia.org/wiki/Carthago_delenda_est

tyingq · on Feb 24, 2020

Is it unusual for a VM not to have a string (or at least a bytes) type? I have little experience in the space, but it seems clunky. Curious why WASM went this direction.

kbumsik · on Feb 24, 2020

WASM just reached MVP. The design goal of the MVP seems to be a small spec that includes the minimum amount of requirements.

I guess that is the reason why all four major browsers could adopt it at almost the same time, without many rounds of proposal ping-pong between Google, Mozilla, Apple and Microsoft.

tyingq · on Feb 24, 2020

This article, and proposals for reference types, though, make it sound like working around the lack of strings/bytes/chars might have been more work in the end. Sort of like a unicycle isn't really a bicycle MVP :)

kbumsik · on Feb 24, 2020

Right, but it is better than having to wait more years until browsers reach consensus. I'm not sure, but a WASM-native string representation might favor one JS engine over the others, depending on its existing JS string implementation. This might cause disagreements.

AndrewDucker · on Feb 24, 2020

The problem is that WebAssembly exists for other languages to be compiled into.

And those other languages don't agree what a string looks like.

So do you have a Rust string? Or a C string? Or a C# string? Or one of the other representations?

By just giving you a bunch of bytes to play with, and letting different languages use them differently, WebAssembly stays out of the way and lets the compiler decide how it wants to make strings.
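
As a rough sketch of why no single layout fits every language, here are two simplified layouts written out in Python. Both are assumptions made for illustration, not the exact layout of any particular compiler: a C-style NUL-terminated UTF-8 buffer, and a UTF-16 buffer with a hypothetical 4-byte little-endian length prefix counted in code units.

```python
import struct

def c_style(text: str) -> bytes:
    """UTF-8 bytes followed by a terminating NUL byte."""
    return text.encode("utf-8") + b"\x00"

def length_prefixed_utf16(text: str) -> bytes:
    """4-byte little-endian code-unit count, then UTF-16LE code units."""
    body = text.encode("utf-16-le")
    return struct.pack("<I", len(body) // 2) + body

def read_c_style(buf: bytes) -> str:
    """Scan for the NUL terminator to find where the string ends."""
    end = buf.index(b"\x00")
    return buf[:end].decode("utf-8")

def read_length_prefixed_utf16(buf: bytes) -> str:
    """Read the length prefix first, then exactly that many code units."""
    (units,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + units * 2].decode("utf-16-le")

print(c_style("hi"))                # b'hi\x00'
print(length_prefixed_utf16("hi"))  # b'\x02\x00\x00\x00h\x00i\x00'
```

Each reader only works on its own layout, which is why code on either side of the boundary has to agree on one before exchanging strings.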

tyingq · on Feb 24, 2020

"By just giving you a bunch of bytes to play with"

It doesn't appear to do even that though. You get i32, i64, f32, and f64. The linear memory is some multiple of the 64KiB page size.

kaoD · on Feb 24, 2020

And linear memory, which is just an array of bytes.
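
A toy model of that idea in Python. It is a stand-in for a real Wasm memory, not an actual runtime API; the class and method names are invented, and the choice of UTF-8 and of the offset is an assumption. A string crosses the boundary as a pointer (byte offset) plus a byte length into the flat array:

```python
PAGE_SIZE = 64 * 1024  # WebAssembly page size: 64 KiB

class LinearMemory:
    """Toy stand-in for a Wasm linear memory: a flat, page-sized byte array."""

    def __init__(self, pages: int = 1):
        self.data = bytearray(pages * PAGE_SIZE)

    def write_string(self, ptr: int, text: str) -> int:
        """Copy UTF-8 bytes to offset `ptr`; return the byte length written."""
        raw = text.encode("utf-8")
        if ptr < 0 or ptr + len(raw) > len(self.data):
            raise IndexError("write out of bounds")
        self.data[ptr:ptr + len(raw)] = raw
        return len(raw)

    def read_string(self, ptr: int, length: int) -> str:
        """Decode `length` bytes starting at offset `ptr` as UTF-8."""
        if ptr < 0 or length < 0 or ptr + length > len(self.data):
            raise IndexError("read out of bounds")
        return bytes(self.data[ptr:ptr + length]).decode("utf-8")

memory = LinearMemory(pages=1)
length = memory.write_string(16, "h\u00e9llo")
print(length)                          # 6 bytes for 5 characters
print(memory.read_string(16, length))
```

The bounds checks matter because nothing in the byte array itself marks where a string starts or ends.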

AndrewDucker · on Feb 24, 2020

An i32 is a byte.

kaoD · on Feb 24, 2020

The common definition of byte nowadays is an octet.

kbumsik · on Feb 24, 2020

In WASM, an i32 can be treated like an 8-bit byte too.

kaoD · on Feb 24, 2020

Can you elaborate? I'm not well-versed in WASM internals and I might be missing something.

I mean, any integer with 8+ bits can be treated like an octet, but memory layout is very different (e.g. an array of i8 and an array of i32 look very different even if they represent the same values).
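
The layout difference is easy to see with Python's `struct` module. This is just an illustration of the byte patterns, assuming little-endian order as Wasm uses, not Wasm code:

```python
import struct

values = [72, 105, 33]  # the bytes of "Hi!"

as_i8 = struct.pack("<3B", *values)   # one byte per value
as_i32 = struct.pack("<3i", *values)  # four bytes per value, little-endian

print(as_i8)    # b'Hi!'
print(as_i32)   # b'H\x00\x00\x00i\x00\x00\x00!\x00\x00\x00'
print(len(as_i8), len(as_i32))  # 3 12
```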

kbumsik · on Feb 24, 2020

Although WASM value types are only 32 or 64 bits wide, it has 8- and 16-bit load and store operations too, so it won't matter much:

https://webassembly.org/docs/semantics/
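
A rough Python emulation of what those narrow operations do. The function names mirror the Wasm instructions `i32.store8`, `i32.load8_u` and `i32.load8_s`, but the bodies are only a sketch over a `bytearray`, not a real interpreter:

```python
def i32_store8(mem: bytearray, addr: int, value: int) -> None:
    """Store only the low 8 bits of an i32 value at `addr`."""
    mem[addr] = value & 0xFF

def i32_load8_u(mem: bytearray, addr: int) -> int:
    """Load one byte and zero-extend it to an i32."""
    return mem[addr]

def i32_load8_s(mem: bytearray, addr: int) -> int:
    """Load one byte and sign-extend it to an i32."""
    byte = mem[addr]
    return byte - 256 if byte >= 128 else byte

mem = bytearray(8)
i32_store8(mem, 0, 0x1FF)   # only 0xFF is kept
print(i32_load8_u(mem, 0))  # 255
print(i32_load8_s(mem, 0))  # -1
```

So a value lives in an i32 while it is on the stack, but it can still occupy a single byte in linear memory.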