When we cannot find a digitized version on the internet, we photograph or scan a book or article ourselves. We end up with photos of pages that are warped and rotated, often with a lot of the surroundings showing, saved as files that are very large. ScanTailor is here to help: it turns those clunky photos into a neat and crisp PDF that, with some effort, can look as though it came straight from the publisher. Let us take a look at how to set things up and what to expect.
I talked about ScanTailor before, but I did not go into how to install it on a Mac. This has become a more pressing issue now that the latest version of macOS (Catalina) does not support 32-bit applications. What that means exactly is not very important; the practical consequence is that older software simply does not work anymore on the latest computers. That includes the 8-year-old ScanTailor. No time to mourn: out of the ashes has arisen a newer, better, and still very much free version called ScanTailor Advanced.
What does it do, exactly?
Allow me to demonstrate with an example from my own work. I bought a published edition of Yazjī Ughlū’s (d. 855/1451) commentary on Ibn ʿArabī’s Fuṣūṣ al-ḥikam. It’s not the most thrilling read, but nice to have nonetheless. Now, it would be much better to have a private copy in digital format. Unfortunately I did not find it on waqfeya.com, where I would normally look for a digitized edition. So I scanned it on a Xerox copy machine, which took me about an hour. This resulted in 143 images like this:
What you see here is a reduced version of the actual image that I obtained: a TIFF image weighing 21,3MB. That means 3,18GB for the entire book: a stupidly large file size. In addition, you can see the pages are a bit slanted, some sticky notes are showing at the edge and there is an annoying border from the scanner itself. Horrible to work with.
ScanTailor Advanced turned it into this:
Again, for display purposes I had to adapt the file a bit, but let me assure you: the two images produced from the original TIFF were TIFFs weighing 52KB and 61KB. That’s a reduction of more than 99%. The entire book, now a PDF of 286 pages, comes in at a very reasonable 11,7MB. On top of that, it is now in a shape that would allow me to give it to Tesseract, an OCR engine, and get usable results back. Let’s forget about the OCR part for now and focus on the editing: from a color image of a page spread to a neat and crisp black and white file of a single page.
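To put a number on that reduction, here is a small sketch that can be pasted into the Terminal. It assumes that 1 MB equals 1000 KB and 1 GB equals 1000 MB, and the function name reduction_percent is my own invention for this illustration.

```shell
#!/bin/sh
# Print how much smaller a file became, as a percentage of its original size.
# Both sizes must be given in the same unit (here: kilobytes).
reduction_percent() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (1 - after / before) * 100 }'
}

# One page spread: a 21.3 MB TIFF became two TIFFs of 52 KB and 61 KB.
reduction_percent 21300 $((52 + 61))

# The whole book: 3.18 GB of TIFFs became an 11.7 MB PDF.
reduction_percent 3180000 11700
```

Under those assumptions, both the single page spread and the book as a whole shrink by more than 99%.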
How do I get it?
Once ScanTailor is installed, it has a very intuitive interface. Windows and Linux users can follow the instructions here. I will only cover the case of macOS, that is, if you are working on a MacBook or iMac. The things you have to do might feel a bit advanced, but I hope that won’t deter you.
First you need to open Terminal. You could (not advised) do so by simply pressing Cmd+Space and typing Terminal. There is another, preferable way. This is to open Finder and go to a folder you know well, right-click and select New Terminal at Folder. This will make it easy later on to clean up some files we will download but afterwards don’t need anymore. In case you are in doubt, the logo of the application looks like this.
It will open a black window with white text, something you may associate with MS-DOS. We are used to software with beautiful graphical interfaces that cover our screens, with buttons we operate by mouse or trackpad, but the Terminal is a reminder that in the end a computer is just a glorified calculator: you punch in some numbers and it crunches them for you. For most purposes, working in the Terminal is simply harder, slower, and less practical. For example, when we want to write an article with footnotes and markup like italics and bold, a program with a GUI, a graphical user interface, is much better to use. However, there are some things for which the Terminal either does a better job or is simply the only way to get something done. The latter is the case for us here: if we want to install ScanTailor, we need to use the Terminal. On the bright side, it should be a straightforward matter of copying and pasting the commands that I give here.
We will be using two programs that only work within the Terminal. Such programs do not have a GUI but a CLI, a command line interface. These programs are Git and Homebrew. The first was secretly on your computer all along, so no need to worry about that one. The latter you will have to install first. Yes, that’s right: we are installing one program just so that we can install another program! If at this point you already feel overwhelmed or bored, maybe your need for ScanTailor is not high enough. For everyone else, please copy the following, paste it in your Terminal window, and hit Enter:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
If you are uncomfortable with this, take a look at the website of Homebrew, which explains in more detail what is about to happen and what the installer will do.
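After the installer has finished, one way to confirm that both programs can be found is the little check below. It is only a sketch: check_tool is a name I made up, and it tells you whether a program is on your computer, not whether it works correctly.

```shell
#!/bin/sh
# Report whether a command-line program can be found on this computer.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# The two programs this guide relies on.
check_tool git || true
check_tool brew || true
```

If brew is reported as missing, closing the Terminal window and opening a new one is worth a try before running the installer again.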
Once you have successfully installed Homebrew (affectionately known simply as Brew), you are ready for the next command.
git clone "https://github.com/yb85/scantailor-advanced-osx.git"
If you want to know more first, you can look here.
Once the download is finished, run the following command:
cd scantailor-advanced-osx
cd means ‘change directory’. It tells the Terminal to hop into that directory (synonym for Folder).
Now a final command:
brew install ./scantailor.rb
This will actually install ScanTailor and all its dependencies. It might take a while.
Once it’s done, you have ScanTailor. How awesome is that? The bad news: you can only launch it from the Terminal, by opening the Terminal, typing
scantailor
and hitting Enter. I have a solution for that.
If you click here you can get a ScanTailor Clickable Logo. You can download it, unzip it, and then move it into your Applications folder. There is one more step to be done: go to your Applications folder, right-click the logo (two fingers or Control+Click) and select Open. A pop-up message warns you that this is an unverified application; proceed to open it anyway. This is a one-time message. After that, ScanTailor will open like any other app and shows up in your Mission Control, as it does on my computer, second row from the top, third item from the left:
(As a bonus you get to see some of my oft-used applications)
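If you would rather not download anything extra, a rough alternative is to make your own double-clickable launcher. macOS opens files ending in .command in the Terminal when you double-click them. The sketch below is not how my Clickable Logo works; make_launcher and the file name ScanTailor.command are names I chose for the illustration, and it assumes that the scantailor command is found when the Terminal starts.

```shell
#!/bin/sh
# Write a small double-clickable launcher file into the given folder.
make_launcher() {
    target="$1/ScanTailor.command"
    # The launcher is itself a tiny shell script that starts ScanTailor.
    printf '%s\n' '#!/bin/sh' 'scantailor' > "$target"
    # The file must be marked executable before it can be run.
    chmod +x "$target"
    echo "$target"
}

# Example: make_launcher "$HOME/Desktop"
```

A launcher made this way keeps a Terminal window open for as long as ScanTailor is running, so it is less elegant than a real application icon.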
How do I use ScanTailor Advanced?
Let us run quickly through the different steps of a ScanTailor workflow, as a gentle introduction and a comparison between the new and the old version. Since all of us (Windows, Linux and Mac users) can use the graphical user interface, the process is fairly straightforward. The interface guides you through it.
Upon opening the application, this will be the window greeting you (new version on top, old version below):
Notably, the interface guides you to start a new project. The new version of ScanTailor has gotten rid of the left navigation column to make this even more clear: thou shalt clicketh New Project…and nothing else.
Clicking New Project will open a dialogue box which is identical between new and old version:
ScanTailor needs a folder with images of a text. It estimates the order of the images from their file names, so the file names need to be numbered in sequence. If you took the photos in the correct order, this is probably already so. If you actually have a PDF, you first need to convert the PDF into separate images. The option at the bottom is important for us: Right to left layout should be checked if you have photos of pagespreads (when you see two pages as an opened book) in which the first page is on the right and the second on the left.
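Should your file names be numbered without leading zeros (scan1, scan2, … scan10), an alphabetical sort will put scan10 before scan2. I am assuming here that this can confuse the page order; the sketch below sidesteps the question by renaming everything to a zero-padded number. The function name pad_page_numbers and the page_ prefix are made up for this example, it expects every file to have an extension and exactly one number in its name, and it is best tried on a copy of your folder first.

```shell
#!/bin/sh
# Rename every file in a folder to page_0001.ext, page_0002.ext, ... using
# the number already present in its name, so that sorting by name gives
# the reading order. Files without any digit in their name are left alone.
pad_page_numbers() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}      # file name without the folder part
        ext=${name##*.}    # extension, e.g. tif
        stem=${name%.*}    # name without the extension
        digits=$(printf '%s' "$stem" | tr -cd '0-9')
        [ -n "$digits" ] || continue
        # Strip leading zeros so printf does not read the number as octal.
        number=$(printf '%s' "$digits" | sed 's/^0*//')
        new=$(printf 'page_%04d.%s' "${number:-0}" "$ext")
        [ "$name" = "$new" ] || mv "$f" "$dir/$new"
    done
}

# Example: pad_page_numbers "$HOME/Desktop/copy-of-my-scans"
```

With files named scan1.tif, scan2.tif and scan10.tif, this leaves you with page_0001.tif, page_0002.tif and page_0010.tif.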
Once a folder with photos is loaded, you are presented with the next screen, being the first stage of the menu on the left. It is the same for the new and old version:
If your photos are jumbled, e.g. some of them upside down, you can correct that in this screen. It is virtually never a problem (for me) so let’s go ahead.
You can proceed by clicking on Split Pages in the menu on the left. If you click the Play button next to it, it will perform the function on all pages. I prefer to first test it on only the first page. As in this case, it usually does a superb job at finding the spine and splitting the one photo of a pagespread into two pages. Since we indicated that right pages come before left pages, it made the detected page on the right red and the one on the left blue. The new and old version are nearly identical, except for the menu that pops up when you click Change: the new version has more flexibility in applying changes to a specific range of photos.
The next stage can be entered by clicking Deskew. You do not need to run Split Pages over all photos to enter Deskew. In fact, if I do not expect many problems I immediately go into Select Content and let that run over all photos. Nonetheless, Deskew is there and it is a powerful feature. The screenshots below show the page on the left of the first pagespread, which was tilted ever so slightly. ScanTailor fixes the orientation by taking only the left part of the page spread and rotating it -0.19°. Notably, both versions behave the same and indeed automatically arrived at the same deskewing. In almost all cases, ScanTailor does a stellar job at detecting the correct deskew settings, but the graphical user interface gives you ultimate control to make minute manual changes for every image.
The stage in the very middle of ScanTailor’s workflow is the most important one. All others can easily be left to the algorithms of the software, but this one can always use some manual checking. ScanTailor has, by now, detected the left and right pages, separated them and deskewed them. It now attempts to find the actual content. In this case, it nailed it, correctly ignoring the grey noise at the right side of the image.
In the new version of ScanTailor, there is also a Page Box detection. I suspect that this is a useful feature if the original images show a lot of the surroundings, such as the support upon which the book rests and/or a machine to keep the book open: it will give ScanTailor a clue where the page is, so that it only searches for the contents within that area. Perhaps more use can be had from this feature, but since it is something that I do not need I am not able to tell.
Instead, what I do is let the Select Content function run on all images (this will take a while) and then use the dropdown box at the very bottom-right of the window to select Order by Width. I then go to the very first image (which is the smallest in terms of width) and check if the content is selected correctly. If there are pages left blank in the book, they will show up here. I check this for all images until I see that a normal width is reached. I then go all the way to the bottom, where the widest content is detected. Usually ScanTailor will have erred in a few cases, thinking that a speck was content too. I correct this manually (and then the page will jump up in the order to its natural place) and move my way up until I hit, again, a normal width. I then select Order by Height and do the same. ScanTailor typically will make some mistakes: when there are only a few lines of text on the page, it might ignore a footnote at the bottom of the page, or it will include a smudge or a speck.
Page box is indicated in orange (border only), while content is in blue. New version of ScanTailor on top, older below.
We are now getting closer to the final product. ScanTailor has a special stage for selecting the margins you wish to include in your final images. New and old versions do the same. A margin is useful to ensure that all final images will look uniform. This is great for readability and also helpful for OCR.
A surprising number of features are packed into the last stage, Output, whose interface looks like this (new on top, old below):
The DPI can easily be kept at 600: this will ensure high quality output. Only for specific use cases could you think of reducing it. You will probably know if this is the case.
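To get a feeling for what a DPI setting means for the size of an image, you can work out the pixel dimensions yourself. The sketch below assumes a page of 148 by 210 mm (A5); that is not the size of my book, just an example, and page_pixels is a name I made up.

```shell
#!/bin/sh
# Convert a page size in millimetres to pixel dimensions at a given DPI.
# One inch is 25.4 mm, so pixels = millimetres / 25.4 * DPI.
page_pixels() {
    LC_ALL=C awk -v w="$1" -v h="$2" -v dpi="$3" \
        'BEGIN { printf "%.0f x %.0f\n", w / 25.4 * dpi, h / 25.4 * dpi }'
}

# A hypothetical 148 x 210 mm page at 600 DPI and at 300 DPI.
page_pixels 148 210 600
page_pixels 148 210 300
```

Halving the DPI halves both the width and the height, so the image ends up with a quarter of the pixels.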
The mode can be left at Black and White. Again, you will know if this should be changed.
Thinner or thicker refers to the thickness of the strokes of your letters. Some poor quality scans make it necessary to make a change here, if only for specific pages. In general, it can be nice to make everything a bit thinner. For example, with Arabic this could make the difference between diacritic marks showing up as separate blobs of black pixels or attached.
Only touch dewarping if there is a noticeable wave in the pages in your images. It remains experimental at best but can fix some of it.
Despeckling is a really great feature. Tiny specks of dust or dirt may show up in your images: enabling this feature will remove such noise. Obviously, if you set the despeckling too high it will start eating up some of your letters.
ScanTailor Advanced comes with many more options, most of which I have no in-depth experience with so far. Here is a zip-file showing the difference between the thresholding algorithms Otsu, Sauvola, and Wolf. Otsu produced noticeably thinner strokes on this sample. All three created files of about 58KB. These features will come in handy when you have awkward photos that require a very specific approach to get the most out of them.
ScanTailor Advanced is an all-round workhorse for turning hastily made photos and scans into a useful format for long-term storage, for personal reading, and as an excellent preprocessing step towards OCR. Now that it can be installed on all major platforms, I can’t recommend it enough. For those seeking to become more computer-savvy and more self-sufficient, ScanTailor is an excellent first step. For those who have progressed a bit more, ScanTailor remains a useful tool in your workflow.