As a fellow Mandarin learner - this is super cool! Intuitively makes a lot of sense for the "full immersion" component of language. I love to see exciting uses of AI for language learning like this instead of just more slop generation :)

I haven't dug into the GitHub repo yet, but I'm curious whether by "guided decoding" you mean logit bias (which I use) or actual token blocking? Interested to know how this works technically.

(shameless self-plug) I've actually been solving a similar problem for Mandarin learning - but from the comprehensible input side rather than the dictionary side:

https://koucai.chat - basically AI Mandarin penpals that write at your level

My approach uses logit bias to generate n+1 comprehensible input (essentially artificially raising the probability of the tokens that correspond to the user's vocabulary). Notably, I didn't add the concept of a "regeneration loop" (otherwise there would be no +1 in n+1), but I think it's a good idea.
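To make the logit bias vs. token blocking distinction concrete, here's a toy sketch in plain Python. It is not the actual koucai.chat code or anything from the repo: the function names, the example logits, and the bias value of 2.0 are all made up for illustration, and a real model works on token IDs rather than whole words.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into probabilities.
    Assumes at least one logit is finite."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_logit_bias(logits, known_vocab, bias=2.0):
    """Soft constraint: nudge known tokens upward, leave the rest untouched."""
    return {tok: (v + bias) if tok in known_vocab else v
            for tok, v in logits.items()}

def block_unknown(logits, known_vocab):
    """Hard constraint: tokens outside the vocabulary can never be sampled."""
    return {tok: v if tok in known_vocab else float("-inf")
            for tok, v in logits.items()}

# Hypothetical next-token logits; the learner knows 猫 and 狗 but not 鸬鹚.
logits = {"猫": 1.0, "狗": 0.5, "鸬鹚": 2.0}
known = {"猫", "狗"}

base = softmax(logits)
biased = softmax(apply_logit_bias(logits, known))
blocked = softmax(block_unknown(logits, known))
```

With the bias, 鸬鹚 becomes less likely but can still show up (that's where the +1 comes from); with blocking, its probability is exactly zero.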

Really curious about the grammar issues you mentioned - I also experimented with the idea of an AI-enhanced dictionary (given that the free Chinese-English dictionary I have is lacking good examples) but determined that the generated output didn't meet my quality standards. Have you found any models that handle measure words reliably?