I think one of the biggest advantages is the security/privacy benefit: you can see in the demo that the model can mask entities instead of tagging them. So instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed in the first place.
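To make the privacy point concrete, here is a rough sketch of the conventional transcribe-then-scrub flow. The regex patterns and the `scrub` helper are made up for illustration and stand in for a real entity tagger; none of this comes from the paper or from Whisper itself.

```python
import re

# Hypothetical patterns standing in for a real entity tagger
# (an assumption for illustration, not the paper's method).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scrub(transcript: str) -> str:
    """Post-hoc scrubbing: replace each matched entity with a placeholder."""
    for label, pattern in PATTERNS.items():
        transcript = pattern.sub(f"<{label}>", transcript)
    return transcript


# In a transcribe-then-scrub pipeline, this raw string exists
# (in memory, logs, caches) before the scrubber ever runs.
raw = "call me at 555-123-4567 or mail jane@example.com"
clean = scrub(raw)
print(clean)  # call me at <PHONE> or mail <EMAIL>
```

The thing to notice is `raw`: in this flow the unmasked text has to exist somewhere before it gets cleaned. If the model emits the mask directly, that intermediate string never exists.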
Another potential benefit is lower latency. The paper doesn't specifically mention latency, but the model seems to be on par with normal Whisper, so you save all of the time a separate entity-tagging pass would normally take. That's a big deal for real-time applications.