Is OCR on PGS subtitle files always this bad?
I'm trying to streamline the process of ripping my Blu-ray collection. The biggest bottleneck has always been dealing with subtitles and converting from image-based PGS to text-based SRT. I usually use SubtitleEdit, which does okay, with occasional mistakes. My understanding is that it combines Tesseract with a decent library to correct errors.
I'm trying to find something that works on the command line and found pgs-to-srt. It also uses Tesseract, but apparently without the correction library, and the results are... not good:
Here's the first two minutes of Love, Actually:
1
00:01:13,991 --> 00:01:16,368
DAVID: Whenever | get gloomy with the state of the world,

2
00:01:16,451 --> 00:01:19,830
| think about the arrivals gate alt [Heathrow airport.

3
00:01:20,38 --> 00:01:21,415
General opinion Started {to make oul
This is just OCR of plain text on a transparent background. How is it this bad? This is using the Tesseract "best" training data.
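To illustrate what I mean by a correction library, here is a minimal sketch of a post-OCR fix-up pass. The rules are hypothetical and based only on the mistakes visible in the sample above; they are not SubtitleEdit's actual rules, and a real list would be far longer.

```python
import re

# Hypothetical fix-up rules, derived only from the errors in the sample above.
RULES = [
    # A "|" standing alone as a word is almost certainly a misread "I".
    (re.compile(r"(?<!\S)\|(?!\S)"), "I"),
    # A "{" glued to the front of a word is treated as noise.
    # "[" is deliberately left alone, since SDH subtitles use brackets legitimately.
    (re.compile(r"\{(?=[A-Za-z])"), ""),
]

def fix_line(text):
    """Apply every rule in order to one line of OCR output."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

if __name__ == "__main__":
    print(fix_line("DAVID: Whenever | get gloomy with the state of the world,"))
```

This only catches character-level noise. Something like "oul" for "out" would need a dictionary check, which is presumably where most of the real work is.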
Edit: I’ve been playing around with ocr-to-pgs, which also uses Tesseract, and discovered that black outlines on the subtitles really mess with it. I made some improvements.
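For anyone hitting the same thing, here is a rough sketch of one way to get rid of the outlines before OCR. It is not necessarily what any of these tools does. It assumes the subtitle fill is light and opaque while the outline is dark, and the threshold and padding values are guesses you would need to tune. It works on plain rows of RGBA tuples so it has no dependencies; in practice you would load and save the image with whatever imaging library you already use.

```python
def isolate_fill(pixels, min_brightness=200, min_alpha=128, pad=2):
    """Turn an RGBA subtitle bitmap into black text on a white background.

    pixels: list of rows, each row a list of (r, g, b, a) tuples.
    Returns rows of grayscale values: 0 for text fill, 255 for everything else.
    The thresholds are assumptions (light fill, dark outline), not measured values.
    """
    width = len(pixels[0]) if pixels else 0
    out = []
    for row in pixels:
        new_row = []
        for r, g, b, a in row:
            brightness = (r + g + b) // 3
            # Keep only bright, opaque pixels. The dark outline and the
            # transparent background both end up white.
            is_fill = a >= min_alpha and brightness >= min_brightness
            new_row.append(0 if is_fill else 255)
        # White margin on the left and right of each row.
        out.append([255] * pad + new_row + [255] * pad)
    # White margin above and below the text.
    blank = [255] * (width + 2 * pad)
    top = [list(blank) for _ in range(pad)]
    bottom = [list(blank) for _ in range(pad)]
    return top + out + bottom

if __name__ == "__main__":
    # One row: transparent background, black outline pixel, white fill pixel.
    sample = [[(0, 0, 0, 0), (0, 0, 0, 255), (255, 255, 255, 255)]]
    for row in isolate_fill(sample, pad=1):
        print(row)
```

The output is dark text on a light background with a small border, which as far as I can tell is the kind of input Tesseract handles best.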