Background 

   At first, there appeared to be a simple solution to this problem: use a luminance calculation to remove the background, since in most cases the background pixels have a fairly high luminance and the ink pixels are relatively dark.  Under this approach, we would choose a luminance value to serve as a separation point between ink pixels and background pixels.  All pixels above the separation point would be treated as background and set to white, and all pixels below the separation point would be considered ink pixels.  The luminance formula used in this attempt was:

CCIR 601:

LUMINANCE = 0.299 * RED + 0.587 * GREEN + 0.114 * BLUE
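
As a rough illustration of the separation-point idea, the following Python sketch applies the formula above to an image stored as rows of (red, green, blue) tuples.  The function names, the image representation, and the default separation point of 128 are assumptions made for this example, not details of the original program.

```python
WHITE = (255, 255, 255)

def luminance(pixel):
    """CCIR 601 luminance of an (r, g, b) pixel with 0-255 channels."""
    red, green, blue = pixel
    return 0.299 * red + 0.587 * green + 0.114 * blue

def remove_background(image, separation_point=128):
    """Set every pixel brighter than the separation point to white.

    A pixel exactly at the separation point is kept as ink here;
    that tie-breaking choice is an assumption of this sketch.
    """
    return [
        [WHITE if luminance(pixel) > separation_point else pixel for pixel in row]
        for row in image
    ]
```

With a separation point of 128, a light blue pixel such as (200, 200, 255) has a luminance of about 206 and is turned white, which is the same mechanism that discards bright antialiased ink pixels.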

    Even after experimenting with many different separation points, this luminance-based approach turned out to have several fundamental flaws.  First of all, the antialiased ink pixels were almost entirely removed, since most antialiased ink pixels are fairly bright (high luminance).  This caused an apparent "thinning" of the text and made the written text appear jagged, since its antialiasing was gone.  Also, a significant portion of the blue and red lines on the paper was not removed: many pixels in the paper's lines were relatively dark, so they fell below the separation point and were kept.  There were also significant problems removing wrinkles and noise generated by the scanner.

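    Although this luminance-based approach did not completely fulfill our needs, it did highlight most of the major problems that had to be overcome to produce a high quality image.  As described above, these problems are the loss of antialiased ink pixels, the incomplete removal of the lines on the paper, and the difficulty of removing wrinkles and scanner noise.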

A More Complex Approach

    Another approach was to assign a probability to each pixel in the image.  We would then raise or lower the probability of a particular pixel based on image processing techniques that provide insight into whether that pixel is background or ink.  Several image processing techniques were used to adjust the probabilities.  They were:

  1. Using luminance values for pixels that are at the extremes of the luminance spectrum.  If a pixel has a luminance at or near 255, we consider it background.  If a pixel has a luminance at or near 0, we consider it an ink pixel.  Noise that passes this test is removed later with noise filtering.
  2. Marr-Hildreth edge detection to find where the ink pixels join with the background pixels.
  3. A line detection algorithm based on the Hough algorithm that would allow for broken lines (lines that went through written text).
  4. Noise filtering, applied after the previous algorithms.

 

Edge Detection

   The Marr-Hildreth edge detection algorithm was used in an attempt to break up the image into components that would be clearly ink or clearly background.  The algorithm was modified so that all pixels that were part of the edges found would be set to a high probability (background), and these edges would then be used to separate the image into blocks.  The initial idea was that the final blocks would contain either entirely ink pixel data or entirely background pixel data.  The algorithm worked very well at finding edges, but it did not work well for separating the image into easily separable blocks of background pixels and blocks of ink pixels.  Edges were found around noise and lines as well as ink pixels, with some loss of antialiased ink pixels.  In the end, the edge detection algorithm did not help in providing an acceptable solution to our problem.

    For the record, the Marr-Hildreth edge detection filter used is shown below, with the center weight of 178 applied to the pixel currently being examined:

0 0 0 -1 -1 -2 -1 -1 0 0 0
0 0 -2 -4 -8 -9 -8 -4 -2 0 0
0 -2 -7 -15 -22 -23 -22 -15 -7 -2 0
-1 -4 -15 -24 -14 -1 -14 -24 -15 -4 -1
-1 -8 -22 -14 52 103 52 -14 -22 -8 -1
-2 -9 -23 -1 103 178 103 -1 -23 -9 -2
-1 -8 -22 -14 52 103 52 -14 -22 -8 -1
-1 -4 -15 -24 -14 -1 -14 -24 -15 -4 -1
0 -2 -7 -15 -22 -23 -22 -15 -7 -2 0
0 0 -2 -4 -8 -9 -8 -4 -2 0 0
0 0 0 -1 -1 -2 -1 -1 0 0 0
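
To show how such a filter is typically applied, here is a minimal Python sketch that convolves a grayscale image with the weights above and marks sign changes (zero crossings) between neighbouring responses as edges.  The border handling (reusing the nearest edge pixel), the function names, and the `min_jump` threshold are assumptions of this sketch; the original program's handling of these details is not described.

```python
# The six distinct rows of the filter; the remaining rows mirror them.
_TOP = [
    [0, 0, 0, -1, -1, -2, -1, -1, 0, 0, 0],
    [0, 0, -2, -4, -8, -9, -8, -4, -2, 0, 0],
    [0, -2, -7, -15, -22, -23, -22, -15, -7, -2, 0],
    [-1, -4, -15, -24, -14, -1, -14, -24, -15, -4, -1],
    [-1, -8, -22, -14, 52, 103, 52, -14, -22, -8, -1],
    [-2, -9, -23, -1, 103, 178, 103, -1, -23, -9, -2],
]
KERNEL = _TOP + _TOP[-2::-1]  # 11 rows of 11 weights

def filter_response(gray, x, y):
    """Weighted sum of the 11x11 neighbourhood centred on (x, y).

    Neighbours outside the image reuse the nearest edge pixel
    (an assumption of this sketch).
    """
    height, width = len(gray), len(gray[0])
    total = 0
    for ky, row in enumerate(KERNEL):
        for kx, weight in enumerate(row):
            sy = min(max(y + ky - 5, 0), height - 1)
            sx = min(max(x + kx - 5, 0), width - 1)
            total += weight * gray[sy][sx]
    return total

def edge_mask(gray, min_jump=0):
    """Mark pixels where the response changes sign toward the right or lower neighbour.

    min_jump (hypothetical parameter) ignores sign changes whose
    responses differ by less than this amount.
    """
    height, width = len(gray), len(gray[0])
    resp = [[filter_response(gray, x, y) for x in range(width)] for y in range(height)]
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width:
                    a, b = resp[y][x], resp[ny][nx]
                    if a * b < 0 and abs(a - b) >= min_jump:
                        mask[y][x] = True
    return mask
```

Because the weights as listed sum to -2 rather than exactly 0, a flat bright region still produces a small negative response, so a nonzero `min_jump` is useful for suppressing weak sign changes away from real edges.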

Line Detection

    The line detection algorithm was used to adjust the probabilities of each pixel in an attempt to remove the lines on the paper from the image, along with any background that would qualify as a line under this algorithm.  The modified Hough line detection algorithm would look for lines starting along the edges of the image and scan across the image, searching for pixels within a local color space of the previously accepted pixel.  If at least the minimum number of pixels needed to qualify as a line was found, the starting point of the line was recorded along with the angle of traversal, which would later be used for removing the line.
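
The scan described above can be sketched in Python as follows.  This is a simple directional walk from a starting point, not the classic accumulator-based Hough transform, and the function names, the per-channel color tolerance, the seed color, and the default thresholds are all assumptions made for illustration.

```python
import math

def color_close(a, b, tol):
    """True if every channel of a is within tol of the same channel of b."""
    return all(abs(ca - cb) <= tol for ca, cb in zip(a, b))

def trace_line(image, start, angle_deg, seed_color, tol=30, min_pixels=20):
    """Walk from start (x, y) along angle_deg and collect line-colored pixels.

    A pixel is accepted when its color is close to the previously
    accepted pixel (initially seed_color).  Pixels that do not match,
    such as ink crossing the line, are skipped so broken lines can
    still be followed.  Returns the accepted (x, y) positions, or an
    empty list if fewer than min_pixels were found.
    """
    height, width = len(image), len(image[0])
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    x, y = float(start[0]), float(start[1])
    last_color = seed_color
    hits = []
    while 0 <= round(x) < width and 0 <= round(y) < height:
        px, py = round(x), round(y)
        if color_close(image[py][px], last_color, tol):
            last_color = image[py][px]
            if not hits or hits[-1] != (px, py):
                hits.append((px, py))
        x += dx
        y += dy
    return hits if len(hits) >= min_pixels else []
```

Comparing each pixel with the previously accepted one, rather than with a fixed color, lets the scan follow a line whose shade drifts gradually across the page.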

    Although the modified line detection algorithm did work reasonably well, many pixels were incorrectly identified.  This was mostly because, in the average scan, most of the lines are not perfectly straight (most of the lines on paper are not straight to begin with).  Even when line segments were traversed at extremely small angles across the image, a significant portion at the end of each line was left unremoved because of those small angles of traversal.  There was also a substantial loss of antialiased ink pixel data, because the line colors often blend in with the antialiased ink pixels where they meet.  Another side effect was that if enough text happened to be aligned horizontally or vertically, it would be removed as a line, although this was uncommon in testing, since typical handwriting does not line up perfectly with text on neighboring rows or columns of the sheet.

Other Techniques

     Another technique used with the previously mentioned algorithms was using luminance to find pixels that are almost definitely ink pixels or background pixels.  This is a toned-down version of our previous attempt.  In this case, pixels at or near zero luminance were considered ink pixels, and pixels at or near full luminance were considered background pixels.  This did work well, except that a significant portion of the noise was still not identified correctly.
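
A minimal Python sketch of this step is shown below, operating on a grid of 0-255 luminance values.  It follows the convention used in the edge detection section, where a high probability means background; the `margin` parameter that defines "at or near" the extremes and the neutral value of 0.5 for undecided pixels are assumptions of this example.

```python
def seed_probabilities(gray, margin=10):
    """Return a grid of background probabilities from extreme luminance values.

    1.0 = almost certainly background (luminance near 255),
    0.0 = almost certainly ink (luminance near 0),
    0.5 = undecided, left for the other techniques to adjust.
    """
    probabilities = []
    for row in gray:
        out_row = []
        for value in row:
            if value >= 255 - margin:
                out_row.append(1.0)
            elif value <= margin:
                out_row.append(0.0)
            else:
                out_row.append(0.5)
        probabilities.append(out_row)
    return probabilities
```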

    The final algorithm, applied after all of the previous approaches, was meant to remove as much noise as possible without any loss of ink pixel data.  This involved scanning a pixel region around the current pixel being examined.  If more than a specified number of pixels within this region were currently considered ink, then the current pixel would be considered ink.  If fewer than the specified number of pixels within this region were currently considered ink, then the current pixel would be considered noise.  This worked well, except that most of the dots on the "i"s in the image would be removed.
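
The neighbourhood test can be sketched in Python as follows, using a grid of booleans where True means the pixel is currently considered ink.  The region size (`radius`), the required count (`min_neighbors`), the choice to exclude the center pixel from the count, and the choice to re-examine only pixels already marked as ink are assumptions of this sketch.

```python
def filter_noise(ink, radius=1, min_neighbors=2):
    """Keep an ink pixel only if enough nearby pixels are also ink.

    ink is a grid of booleans.  Neighbours are counted in the square
    region of the given radius around each pixel, excluding the pixel
    itself.  A count equal to min_neighbors is treated as ink here.
    """
    height, width = len(ink), len(ink[0])
    result = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not ink[y][x]:
                continue
            count = 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    if (ny, nx) != (y, x) and ink[ny][nx]:
                        count += 1
            result[y][x] = count >= min_neighbors
    return result
```

A small isolated mark has no ink neighbours and is discarded by this test, which is consistent with the loss of the dots on the "i"s noted above.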
