Yes, appending the EOU token to the end of each transcript would generally work, but you need to make sure the finetuning data contains only complete utterances/sentences; otherwise the EOU prediction will not be accurate. Also, to evaluate EOU performance you will need to run forced alignment on the finetuning data to get the start-of-utterance and end-of-utterance timestamps. Note that ASR WER will degrade if the finetuning dataset is small.
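As a rough illustration of the transcript preparation described above, here is a minimal sketch that appends an end-of-utterance marker to each entry of a NeMo-style JSON-lines manifest. The token string `<EOU>`, the helper names `append_eou` and `convert_manifest_lines`, and the punctuation check are all assumptions made for illustration; they are not taken from the PR. Use whatever token your model's tokenizer actually expects. The punctuation check is only a weak proxy for "complete utterance" and does not replace inspecting the data.

```python
import json

# Assumption: placeholder token string; replace with the token your model expects.
EOU_TOKEN = "<EOU>"
# Assumption: terminal punctuation as a crude hint that the utterance is complete.
SENTENCE_END = (".", "?", "!")


def append_eou(entry, token=EOU_TOKEN, require_punct=False):
    """Return a copy of a manifest entry with the token appended to "text".

    Returns None when the entry is skipped (empty text, or missing terminal
    punctuation while require_punct is True).
    """
    text = entry.get("text", "").strip()
    if not text:
        return None
    if text.endswith(token):
        # Already tagged; avoid appending the token twice.
        return dict(entry, text=text)
    if require_punct and not text.endswith(SENTENCE_END):
        return None
    return dict(entry, text=f"{text} {token}")


def convert_manifest_lines(lines, **kwargs):
    """Apply append_eou to each JSON line, dropping blank lines and skipped entries."""
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        new_entry = append_eou(json.loads(line), **kwargs)
        if new_entry is not None:
            converted.append(json.dumps(new_entry, ensure_ascii=False))
    return converted
```

To convert a file, you could read the manifest with `open(path, encoding="utf-8")`, pass the file object to `convert_manifest_lines`, and write the returned lines to a new manifest. Setting `require_punct=True` only makes sense if your transcripts keep punctuation.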
The finetuning scripts for EOU are still in a PR that should be merged by early next month, but you can already use them from https://github.com/NVIDIA-NeMo/NeMo/pull/14740/files#diff-e0436d26c60ad81f641827fee4ba5785ba5dd79e67f488ab5b67c762767f6977