Honestly, I think this is a problem of over-engineering. Simply allowing the user to press a button when he wants to start talking and press it again when he's done is good enough. Or even a codeword for start and finish.
We don't need to feel like we're talking to a real person yet.
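To make the codeword idea concrete, here is a rough sketch of how it might work, assuming you already have a stream of transcribed words from some speech recognizer. The function name `segment_utterances` and the codewords "begin" and "over" are made up for illustration, not taken from any real library.

```python
from typing import Iterable, List


def segment_utterances(
    words: Iterable[str],
    start_word: str = "begin",  # hypothetical start codeword
    stop_word: str = "over",    # hypothetical stop codeword
) -> List[str]:
    """Collect the words spoken between a start codeword and a stop codeword."""
    utterances: List[str] = []
    current: List[str] = []
    recording = False

    for word in words:
        token = word.lower()
        if not recording:
            # Ignore everything until the start codeword is heard.
            if token == start_word:
                recording = True
                current = []
        elif token == stop_word:
            # Stop codeword closes the utterance and hands it off.
            utterances.append(" ".join(current))
            recording = False
        else:
            current.append(word)

    # An utterance that never hears the stop codeword is dropped.
    return utterances
```

A push-to-talk button would be the same two-state toggle, just with button events flipping `recording` instead of codewords.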