Converting colour scans to b&w

This is more for my own notes, but it could be of interest to some, I suppose.

One problem I had recently was converting large scanned colour PDFs into smaller, more manageable black and white ones. Naturally I don't like doing this, especially if pictures are involved, as you lose a lot of detail - but it's necessary for file size, or I'll quickly fill up the allocation I have on the web server.

It proved to be quite a bit harder than I imagined. There are various websites which will do it for you, but these give little control and were no use for the 100 MB files I had. Desktop programs generally had to be paid for, which was out of the question for a small task like this.

So after a bit of research I settled on a program called ImageMagick, which can be downloaded free. This is a very powerful piece of software, but I still had to configure it to best process the files I wanted to convert. This is where it becomes tricky, because what I didn't know before is that there are two basic ways to convert to b&w. One is a simple threshold approach (called bilevel), which looks at how bright each pixel is and marks it white if it is above the given threshold and black otherwise. The second is called dithering, which looks at a given area and mixes black and white pixels to reproduce a wider range of shades. What this means practically is that a gray area will be represented by alternating black and white pixels to make it look gray, i.e.

bwbwbw
wbwbwb
bwbwbw

The threshold approach, by contrast, would render this as either completely black or completely white. Now the problem is that dithering works great for images, but it doesn't work well at all with text, as it makes the text less legible. The threshold approach works great with text, but makes images look harshly contrasted and lacking in detail. This can be compensated for by using more pixels to retain more of the detail (a higher DPI), though this does lead to bigger file sizes, albeit still nothing compared to using colour. Generally on the website most of the PDFs will only have text, but the missionary periodicals I was doing had a lot of images too, so it was necessary to take this into account.
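To make the difference concrete, here is a toy sketch in shell and awk. It is not what ImageMagick does internally (its dithering is more sophisticated); it only applies the two ideas to a small patch of uniform gray. The function name patch and the 25/75 checkerboard cut-offs are my own assumptions for illustration; the 60 matches the threshold used in the commands in this post.

```shell
# patch MODE GRAY
# MODE is "threshold" or "dither"; GRAY is a brightness from 0 to 100.
# Prints a 3x6 patch of b (black) and w (white) pixels.
patch() {
  awk -v mode="$1" -v g="$2" 'BEGIN {
    for (y = 0; y < 3; y++) {
      row = ""
      for (x = 0; x < 6; x++) {
        # threshold: every pixel is compared with the same cut-off (60)
        # dither: the cut-off alternates 75/25 in a checkerboard pattern
        t = (mode == "dither") ? (((x + y) % 2) ? 25 : 75) : 60
        row = row ((g > t) ? "w" : "b")
      }
      print row
    }
  }'
}

patch threshold 50   # mid-gray falls below the cut-off everywhere
patch dither 50      # mid-gray becomes alternating pixels
```

With a mid-gray of 50, the threshold call prints three rows of bbbbbb, while the dither call prints the bwbwbw / wbwbwb / bwbwbw pattern shown above.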

The ideal solution would be to use a mixture of the two within the document, dithering for the images and bilevel for the text - this appears to be what Google Books do, but then Google have their own custom software which sadly we don't have access to!

So for reference, ImageMagick is used in this way:

magick convert -density 600 input.pdf -threshold 60% -type bilevel -compress fax out.pdf
- for bilevel

magick convert -density 300 input.pdf -monochrome -compress fax out.pdf
- for dithering

magick convert -density 150 input.pdf -quality 80 -compress jpeg out.pdf
- reduces quality but keeps in colour
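Since I can never remember which set of options goes with which kind of document, the three commands could be wrapped in a small shell function along these lines. This is only a sketch: the function name shrink, the mode names (text, images, colour) and the DRY_RUN variable are all made up for this example, and the ImageMagick options are exactly the ones listed above.

```shell
# shrink MODE INPUT OUTPUT
# MODE is one of: text (bilevel), images (dithering), colour (reduced-quality colour).
# Set DRY_RUN to any non-empty value to print the command instead of running it.
shrink() {
  mode=$1 src=$2 dst=$3
  case "$mode" in
    text)   set -- -density 600 "$src" -threshold 60% -type bilevel -compress fax "$dst" ;;
    images) set -- -density 300 "$src" -monochrome -compress fax "$dst" ;;
    colour) set -- -density 150 "$src" -quality 80 -compress jpeg "$dst" ;;
    *)      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  if [ -n "${DRY_RUN:-}" ]; then
    # Only show what would be run
    echo magick convert "$@"
  else
    magick convert "$@"
  fi
}
```

For example, DRY_RUN=1 shrink text input.pdf out.pdf prints the bilevel command without touching any files, which is handy for checking the options before starting on a 100 MB scan.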
