Define Region of Interest for Dynamic Stimuli

Hello everyone,
In our experiment, we want to present a video stimulus and record the participant's eye-gaze data. For the analysis, we want to measure the viewing time within a predefined region of interest (ROI) for that particular video. Can anyone suggest how to predefine the ROI, and which tool or software would be best for it? We do not want to use the eye-tracking company's proprietary software, so any suggestions for open-source software are welcome.
I suppose it can be done in Python, but a little guidance would be appreciated.
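
To make the question concrete, here is a rough sketch of the kind of analysis we have in mind. It is not a working pipeline, and everything in it is an assumption: the ROI is a rectangle given as a few hand-annotated keyframes (time, x, y, width, height) in video pixel coordinates, it is linearly interpolated between keyframes, and the gaze samples are already mapped into the same coordinate system. The names `ROI_KEYFRAMES`, `roi_at`, and `dwell_time` are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical ROI keyframes: (time_s, x, y, width, height) in video pixels.
# In practice these would come from whatever annotation tool is used.
ROI_KEYFRAMES = [
    (0.0, 100.0, 100.0, 50.0, 50.0),
    (2.0, 300.0, 100.0, 50.0, 50.0),
]


def roi_at(t, keyframes):
    """Return the ROI (x, y, w, h) at time t, linearly interpolated
    between keyframes and clamped before the first / after the last."""
    times = [k[0] for k in keyframes]
    if t <= times[0]:
        return tuple(keyframes[0][1:])
    if t >= times[-1]:
        return tuple(keyframes[-1][1:])
    i = bisect_right(times, t)          # index of the first keyframe after t
    t0, *a = keyframes[i - 1]
    t1, *b = keyframes[i]
    w = (t - t0) / (t1 - t0)            # interpolation weight in [0, 1)
    return tuple(p + w * (q - p) for p, q in zip(a, b))


def dwell_time(gaze, keyframes):
    """Total time (s) the gaze falls inside the moving ROI.

    gaze: list of (time_s, x, y), sorted by time. Each sample is credited
    with the interval until the next sample; the last sample is not counted.
    """
    total = 0.0
    for (t, gx, gy), (t_next, _, _) in zip(gaze, gaze[1:]):
        x, y, w, h = roi_at(t, keyframes)
        if x <= gx <= x + w and y <= gy <= y + h:
            total += t_next - t
    return total


if __name__ == "__main__":
    # Made-up gaze samples, only to show the expected input shape.
    gaze = [(0.0, 120, 120), (0.5, 160, 120), (1.0, 500, 500),
            (1.5, 270, 120), (2.0, 320, 120)]
    print(dwell_time(gaze, ROI_KEYFRAMES))
```

What we are missing is the first step: a convenient open-source way to produce the keyframes (or per-frame ROI coordinates) for a given video.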

