SubLog Extractor Filter
Download Executable here : sublog100.zip (version 1.00 - 41KB)
Download Source Code here : sublog100src.zip (version 1.00 - 34KB)
SubLog Extractor is a VirtualDub filter designed to extract hardcoded
subtitles from a video stream.
Extracting hardcoded subtitles is really a two-part job:
1) Identifying the changes in the subtitle frames, dumping them into bitmap files, and creating a time index for them.
2) (Optionally) Applying OCR to the resulting bitmaps to extract text data.
SubLog Extractor Filter takes care of part #1, while an external application such as
SubRip can perform the OCR on the files dumped by SubLog Extractor.
More precisely, SubLog Extractor works the following way:
a) Pre-Processing
Filtering the input (optional).
b) Temporal Processing
Differential scanning & frame selection.
c) Post-Processing (VobSub mode only)
Formatting the output.
Installing SubLog Extractor
- Just copy the "SubLog.vdf" file and the "SubLog Manual" folder into your VirtualDub/plugins directory.
Using SubLog Extractor... in a nutshell
- Run VirtualDub.
- Crop the input video to the precise area containing the subtitles, using the internal Null Transform filter.
- Add the SubLog Extractor filter and enter a destination for the output files.
- Now hit "Preview" in the File Menu of VirtualDub, and make sure "show input video" and "show output video"
are not checked (if they are checked, it will still work, but processing will take much longer).
Of course, you probably won't end up with any good results without spending some serious time on the settings first :)
- Pre-Processing Settings :
- The Pre-Processing Level slider lets you discard pixels whose color is too far from the subtitle color.
At level 0 it has no effect; at MAX level it simply discards all the pixels :)
- The Red / Green / Blue values are the color of the subtitles in the input stream (usually white
or close to white).
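As a rough illustration of the two settings above, here is a minimal Python sketch of a color-distance filter. It is not the filter's actual code: the distance measure (largest per-channel difference), the slider range (MAX_LEVEL = 256), the black replacement color and the function names are all assumptions made for the example.

```python
MAX_LEVEL = 256  # assumed slider maximum; the real range is not documented here


def color_distance(pixel, target):
    """Largest per-channel difference between two (R, G, B) tuples (assumed metric)."""
    return max(abs(p - t) for p, t in zip(pixel, target))


def preprocess(frame, target=(255, 255, 255), level=0):
    """Replace every pixel whose color is too far from `target` with black.

    `frame` is a list of rows, each row a list of (R, G, B) tuples.
    Level 0 keeps everything; level MAX_LEVEL discards everything.
    """
    tolerance = MAX_LEVEL - level
    black = (0, 0, 0)
    return [
        [px if color_distance(px, target) < tolerance else black for px in row]
        for row in frame
    ]
```

With white subtitles and a level of 200, a light gray pixel such as (200, 200, 200) is kept while a dark pixel such as (10, 10, 10) is blanked out.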
- Processing Settings :
- The Temporal Noise Threshold slider is compared to the amount of change between frames to decide whether a frame has changed enough to
be considered a subtitle candidate. The default value of 10 is fine for low-noise subtitles on a black background. Note that the size of
the subtitles within the frame affects how this parameter should be set, which is why proper cropping is so important.
- The Time Threshold slider is the minimum number of consecutive frames required before a subtitle candidate is considered.
This not only avoids counting the same subtitle repeatedly when some frames are messed up, but also takes care of slowly fading subtitles.
On the other hand, if the gap between two subtitles is shorter than this threshold, the subtitles will be lost. The default value
is "2", which should be fine for low noise / low fading.
- The Index from frame #0 checkbox tells SubLog Extractor to number frames with the first processed frame as
frame #0 instead of using the real source frame number. This checkbox is checked by default.
- The Use VobSub format for output checkbox tells SubLog Extractor to activate Post-Processing and use the VobSub
.IDX & .SUB format instead of just dumping the bitmaps and the time index. The VobSub format (which is really the format used for
DVDs) allows for more advanced features in SubLog Extractor, but requires more precise settings than just dumping the
24-bit bitmaps. Usually you will want the VobSub format if you want to keep the subtitles in graphical form or if you
want to use SubRip for OCR. On the other hand, you will probably prefer simple BMP & time index dumping if you plan to use the
output files for manual work. Also note that the Preview button has no effect beyond Pre-Processing if the VobSub format
is not selected.
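To make the interplay of the two thresholds and the frame numbering more concrete, here is a small Python sketch of temporal frame selection. It only illustrates the idea and is not the filter's real algorithm: the change measure (mean absolute difference over the whole frame), the blank-frame test, and the names find_subtitles and first_frame are assumptions. Unlike the real filter, the sketch also closes a run when the stream ends.

```python
def mean_level(frame):
    """Average gray value of a frame given as a list of rows of numbers."""
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))


def frame_change(a, b):
    """Mean absolute difference between two equally sized grayscale frames."""
    diff = [[abs(x - y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
    return mean_level(diff)


def find_subtitles(frames, noise_threshold=10, time_threshold=2, first_frame=0):
    """Return (start, end) frame numbers of stable, non-blank runs of frames.

    A run ends when the change to the next frame exceeds `noise_threshold`.
    Runs shorter than `time_threshold` frames are ignored.
    `first_frame` is 0 when indexing from frame #0, or the real source frame
    number of the first processed frame otherwise.
    """
    events = []
    start = 0
    for i in range(1, len(frames) + 1):
        ended = i == len(frames) or frame_change(frames[i - 1], frames[i]) > noise_threshold
        if not ended:
            continue
        long_enough = i - start >= time_threshold
        if long_enough and mean_level(frames[start]) > noise_threshold:
            events.append((first_frame + start, first_frame + i - 1))
        start = i
    return events
```

Because the change is averaged over the whole frame in this sketch, a single bright pixel in a 100-pixel-wide frame only moves the average by about 2.5 and stays under the default threshold of 10, which is the same effect that makes tight cropping matter.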
- Post-Processing Settings (VobSub Mode):
- The Font Threshold slider is the gray-mapped level a pixel must reach to be retained as a subtitle font pixel. The default value is 127.
- The Antialiasing Threshold slider is the gray-mapped level a pixel must reach to be retained as a subtitle antialiasing pixel.
If this value is over the Font Threshold value, no antialiasing pixels will be retained. The default value is 63.
- The Add Emphasis pixels checkbox tells SubLog Extractor to add emphasis pixels on the edge of the font pixels. This checkbox is checked by default.
- The Use Auto-Cropping checkbox tells SubLog Extractor to crop the output bitmap to the subtitles only. This checkbox is checked by default.
- The Show processing checkbox is special: in this mode, no temporal subtitle extraction is actually done and no output files are dumped, but you will see
how frames are post-processed: Font, Antialiasing, Emphasis & Auto-Cropping (post-processed frames appear with a red hue). In normal mode, only captured frames will
appear like this on the output screen. Check this box to help adjust the other settings, then uncheck it for real processing.
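The four post-processing steps above can be pictured with the following Python sketch. It is a guess at the general idea rather than the filter's implementation: the pixel class codes, the "at or above the threshold" comparison, the 4-neighbour rule for emphasis pixels and all function names are assumptions.

```python
BACKGROUND, FONT, ANTIALIAS, EMPHASIS = 0, 1, 2, 3  # hypothetical class codes


def classify(gray, font_threshold=127, aa_threshold=63):
    """Map each gray value to FONT, ANTIALIAS or BACKGROUND.

    The font test comes first, so an antialiasing threshold above the font
    threshold leaves no antialiasing pixels at all.
    """
    return [
        [
            FONT if v >= font_threshold else ANTIALIAS if v >= aa_threshold else BACKGROUND
            for v in row
        ]
        for row in gray
    ]


def add_emphasis(classes):
    """Mark background pixels that touch a font pixel (4-neighbourhood) as EMPHASIS."""
    height, width = len(classes), len(classes[0])
    out = [row[:] for row in classes]
    for y in range(height):
        for x in range(width):
            if classes[y][x] != BACKGROUND:
                continue
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(
                0 <= ny < height and 0 <= nx < width and classes[ny][nx] == FONT
                for ny, nx in neighbours
            ):
                out[y][x] = EMPHASIS
    return out


def auto_crop(classes):
    """Crop to the bounding box of all non-background pixels ([] if there are none)."""
    rows = [y for y, row in enumerate(classes) if any(c != BACKGROUND for c in row)]
    if not rows:
        return []
    cols = [x for x in range(len(classes[0])) if any(row[x] != BACKGROUND for row in classes)]
    return [row[cols[0]:cols[-1] + 1] for row in classes[rows[0]:rows[-1] + 1]]
```

Chaining the three functions as auto_crop(add_emphasis(classify(gray))) mirrors the order Font, Antialiasing, Emphasis, Auto-Cropping listed for the Show processing mode.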
A Few Notes
- SubLog Extractor is, by its very nature, a temporal tool, so it will NOT work properly if you try to process frame by frame with
the forward and back buttons. I suggest you only use Preview in the File Menu of VirtualDub, except while adjusting the settings.
- Don't forget to first crop your video properly to extract the exact part containing the subtitles. THIS IS THE MOST IMPORTANT THING
TO DO, as a small subtitle within too large an area may be discarded as noise. Also remember that extracting subtitles which
are over the movie itself will require more work than if they are only in the black widescreen bars: if you REALLY want to do it,
you will need proper pre-processing settings, and I suggest that you play with various other filters to get the subtitles
as clean as possible first (try Temporal Smoother at max level); you WILL have to edit the output manually anyway if
you want something clean. I also strongly suggest that you don't use an area smoother filter if you're planning to run OCR.
- Be sure to understand what you are doing with the settings and the source material, because if you don't get it right, you may end up
with SubLog Extractor dumping hundreds of thousands of uncompressed bitmaps on your hard disk!!!
- Finally, I would say: just check the final result! A good way to do that is to use VobSub to quickly display your processed output
over the original source video to check whether the subtitles are clean. Honestly, SubLog Extractor will save you hours and hours of manual work
(which you would probably never do anyway) if used properly :)
FAQ (updated Jul 17 2003)
- What kind of results can I expect ? I don't want to spend hours correcting the output manually !
You can expect a 100% fully automated clean recovery if the subtitles are in the black bars. If they are not, results
will vary with the accuracy of your pre-processing settings.
- Some subtitles of the movie are missing in my output file.
Try to reduce your Time Threshold setting. Also note that SubLog Extractor may experience problems if there are two immediately
consecutive frames with different subtitles.
- The last subtitles of the movie are missing in my output file.
There must be at least n frames without subtitles at the end of the movie, n being your Time Threshold setting, or the
last subtitles will be lost. So if your movie ends with a subtitle, just add black frames at the end before running SubLog Extractor.
- My movie has subtitles in Dark on Light instead of Light on Dark.
SubLog Extractor won't work properly on these: just use a filter to reverse the colors.
- Why is it so slow ?
Well, it's not that slow; you have to understand that we are analyzing frames both in an area-based way and in
a temporal way. The first alpha version of this tool used the second derivative of the video stream and was a lot
slower :)
- Is the source code of SubLog Extractor available ?
Yes. Since no major bugs have been found yet, and since I'm not working on it much anymore,
I have released the source code (check my website). I understand that a lot of people are interested in specific parts of
the code, especially the MPEG-2 export code... Please feel free to use it as you wish, as long as you credit me for the
code you use :)
- Jul 17 2003 - Version 1.0 - Source Code release.
- Jun 24 2002 - Version 1.0 - First public release.