OCR improvement


    OCR improvement

    hello

    my project involves OCR, and I am looking for ways to post-process the image that the scanner produces in order to get a better OCR success rate (an OCR error rate of 1% means about 100 mistakes per 10,000 characters, and this is hard to correct if we are talking about many pages)

    the scanner driver itself offers adjustments for contrast, brightness and gamma correction that may improve the OCR success rate, but the best values may depend on the specific document, and I haven't been able to find the optimum combination manually

    I don't know exactly how the OCR program (ABBYY FineReader) reads the characters, but I suppose something can be done to help it, using image editing software that is more sophisticated than the adjustments available through the scanner driver or the OCR program

    basically, it would be beneficial to:
    - increase the contrast between the black characters of the text and the white background of the page
    - make the white of the page true white, and the black of the characters true black
    - smooth the edges of the characters
    - remove artifacts, such as small dots that would confuse the OCR program
    - automatically find the optimum combination of contrast, brightness, gamma correction and sharpness

    is there anything that can be done?

    thanks
    Last edited by user; 27.09.2008, 09:33 AM.
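
One way to approach the "true white, true black" and "find it automatically" wishes above is automatic thresholding. The following is only a sketch using Otsu's method, a standard technique that is not mentioned in the thread and is not what the scanner driver or FineReader is known to do. It assumes the scan is available as a flat list of 8-bit grey values; the sample data and the names `otsu_threshold` and `binarize` are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels.

    Otsu's method: try every level and keep the one that maximises the
    between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    count_dark = 0  # pixels at or below the candidate level
    sum_dark = 0    # sum of their grey values
    for level in range(256):
        count_dark += hist[level]
        sum_dark += level * hist[level]
        count_light = total - count_dark
        if count_dark == 0:
            continue
        if count_light == 0:
            break
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        variance = count_dark * count_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]


# Hypothetical sample: dark ink around 30-40, paper around 200-220.
sample = [30] * 20 + [40] * 10 + [200] * 50 + [220] * 20
level = otsu_threshold(sample)
clean = binarize(sample, level)
```

A single global threshold like this tends to work on evenly lit pages; unevenly lit or stained pages may need a locally adaptive threshold instead.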

    #2
    I doubt very much there is anything that IrfanView could do that the scanner software cannot. Obviously, any scanner software is optimised for this very task. Adjusting gamma and contrast is about it. The only solution is to scan at a higher resolution, which, of course, has its own problems of taking longer to scan, longer to process, etc. You may find that saving scans as PNG instead of JPG will help, as it creates no artifacts. Changing gamma and reducing colours can remove marks on the paper, as well as greatly reducing the file size.
    Before you post ... Edit your profile • IrfanView 4.62 • Windows 10 Home 19045.2486



      #3
      Scan at a resolution of at least 600dpi, and use the scanner's settings to get the best contrast. Never save as jpg. My scanner -- a cheap one! -- does a creditable job of producing pure black and white scans from anything that is not dark and dirty. If your pages are stained, post-processing will probably be necessary. Using the IrfanPaint plugin's Color Replacer tool might help in that case. I've been trying it out on scans of old maps.

      If you are interested in trying out an alternative OCR, Softi FreeOCR is what I use -- it uses a very "smart" algorithm and the results are terrific.
      Its: Belongs to "It"
      It's: Shortened form of "It is"
      ---------------------
      Lose: Fail to keep
      Loose: Not tight

      ---------------------
      Plurals do not require apostrophes



        #4
        thanks for your reply

        I already scan to TIF at 600 dpi in greyscale, which are supposed to be the best settings

        currently I use Finereader which is imo the best

        as you can see below, the scanner somehow produces images with some marks; they look like reflections, but I don't know whether the glossy paper, the scanner or something else is responsible

        this is the crop of a scanned page:

        it looks like there is nothing wrong with it, and you may think that it's absolutely readable, but the OCR program may flag 'uncertain characters' and may produce an erroneous recognition


        now look at why it does this:
        this is the same crop, but with brightness and contrast lowered in Photoshop:


        now it is shown that the characters have marks around them and these can cause problems to the OCR program

        I wonder if there is a way to fix this by finding the optimum contrast, brightness, gamma correction combination

        performing auto-contrast in Photoshop results in a higher OCR success rate, so I suppose image editing software can 'fix' the scanned images better than the scanner software or the OCR program
        Last edited by user; 28.09.2008, 09:33 AM.
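
For readers wondering what an auto-contrast step might do to a crop like the one described above, here is a minimal sketch. It is a plain linear stretch with a small percentile clip, not Photoshop's actual algorithm (which the thread does not describe). The pixel values and the name `auto_contrast` are invented for the example.

```python
def auto_contrast(pixels, clip_percent=1.0):
    """Linearly stretch grey values so the darkest ink maps to 0 and the
    brightest paper maps to 255, ignoring `clip_percent` of outliers at
    each end of the sorted values."""
    ordered = sorted(pixels)
    count = len(ordered)
    cut = int(count * clip_percent / 100.0)
    low = ordered[cut]
    high = ordered[count - 1 - cut]
    if high <= low:
        return list(pixels)  # flat image, nothing to stretch
    scale = 255.0 / (high - low)
    return [min(255, max(0, round((value - low) * scale))) for value in pixels]


# Hypothetical crop: ink at 60-70, faint marks at 180, paper at 200.
crop = [60] * 30 + [70] * 5 + [180] * 10 + [200] * 155
stretched = auto_contrast(crop)
```

After the stretch the ink sits at 0 and the paper at 255, so the gap between the faint marks and the ink is wider than before, which may make it easier for a later black-and-white conversion to put the marks on the paper side.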



          #5
          My Epson TWAIN scanner driver has all the controls you should need. I very rarely do any OCR, so I don't know which settings work best. I am sure it will vary a lot depending on the original copy being scanned. You could save several optimal settings, so there should be no need to use separate image adjusting software.

          For your sample, just increasing the contrast substantially seems to get rid of most of the artifacts.
          Attached Files


            #6
            Never dealt with OCR, but some remarks.
            1) Scanned documents which were made with a typewriter, like the ones by E.W. Dijkstra, often caused serious problems with OCR,
            because some characters were 'connected': one or more black pixels linked two characters.
            If this occurs, it's necessary to remove these pixels, sometimes even manually.
            Every character should be 'on its own'.
            2) This is why I think that "smooth the edges of the characters" is not a good idea. Character edges should be as tight as possible.
            Smoothing would lead to gradations of grey pixels around the characters, increasing the risk of touching the neighbouring character.
            3) I did a workaround with the example (first checking the number of used colors with 'I'):
            a- Decreased the color depth to 4 colors (possible in this case; other scans probably need 16 first)
            b- Used Flood Fill, with a tolerance of 64, on the background
            c- Checked the palette to make black real black and white real white
            d- Decreased the color depth to 2 colors. It looks the same but reduces the file size.
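
Steps a to d above can be approximated in code. This is a rough sketch, not a reproduction of IrfanView's tools: how IrfanView defines flood-fill tolerance is an assumption here (absolute difference from the start pixel), the example reduces to 16 grey levels first rather than 4, and the tiny `page` image and all function names are made up.

```python
from collections import deque


def quantize(image, levels):
    """Reduce a greyscale image (rows of 0-255 ints) to evenly spaced levels."""
    step = 255 / (levels - 1)
    return [[round(round(value / step) * step) for value in row] for row in image]


def flood_fill(image, start, new_value, tolerance):
    """Fill the 4-connected region around `start` whose grey values are
    within `tolerance` of the start pixel. Returns a new image."""
    height, width = len(image), len(image[0])
    seed = image[start[0]][start[1]]
    result = [row[:] for row in image]
    seen = {start}
    queue = deque([start])
    while queue:
        y, x = queue.popleft()
        result[y][x] = new_value
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and (ny, nx) not in seen
                    and abs(image[ny][nx] - seed) <= tolerance):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return result


def to_two_colors(image, cutoff=128):
    """Snap every pixel to real black (0) or real white (255)."""
    return [[0 if value < cutoff else 255 for value in row] for row in image]


# Hypothetical crop: paper 245-250, marks 200-205, ink 25-40.
page = [
    [250, 245, 200, 250, 250],
    [250,  30,  40, 200, 250],
    [245,  25,  35, 250, 250],
    [250, 250, 205, 250, 245],
]
reduced = quantize(page, 16)                     # step a
filled = flood_fill(reduced, (0, 0), 255, 64)    # step b
cleaned = to_two_colors(filled)                  # steps c and d
```

In this toy example the marks next to the ink end up white and only the 2x2 block of ink stays black.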
            0.6180339887
            Rest In Peace, Sam!



              #7
              Flood fill won't help letters with enclosed areas, and it is too tedious to hand-fix a lot of files if a lot of steps are involved.
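
The point about enclosed areas can be shown with a toy example. The sketch below fills the background from one corner of a made-up 5x5 letter "O" and compares that with a plain global threshold; the image, the tolerance semantics and the names `fill_from` and `threshold` are assumptions for illustration only.

```python
def fill_from(image, start, new_value, tolerance):
    """Whiten every pixel reachable from `start` whose grey value is within
    `tolerance` of the start pixel (4-connected). Returns a copy."""
    height, width = len(image), len(image[0])
    seed = image[start[0]][start[1]]
    result = [row[:] for row in image]
    stack, seen = [start], {start}
    while stack:
        y, x = stack.pop()
        result[y][x] = new_value
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and (ny, nx) not in seen
                    and abs(image[ny][nx] - seed) <= tolerance):
                seen.add((ny, nx))
                stack.append((ny, nx))
    return result


def threshold(image, cutoff=128):
    """Global threshold: every pixel is judged on its own value."""
    return [[0 if value < cutoff else 255 for value in row] for row in image]


# Hypothetical letter "O": grey paper (200) outside and inside the ring of ink (20).
letter_o = [
    [200, 200, 200, 200, 200],
    [200,  20,  20,  20, 200],
    [200,  20, 200,  20, 200],
    [200,  20,  20,  20, 200],
    [200, 200, 200, 200, 200],
]
filled = fill_from(letter_o, (0, 0), 255, 64)
thresholded = threshold(letter_o)
```

The fill whitens the outside of the letter but cannot reach the pixel inside the ring, which stays grey, while the threshold treats the inside and the outside alike.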

              The first thing would be to improve the scan quality as much as possible. Then the saved images. TIF is not always the best, there are different types of compression and it may be just as lossy as any JPG (it may in fact be a JPG). It is fine if you use the right type of compression (or none) and especially if the scan is two-color, pure black and white. If a black and white "line art" setting is available and the paper is clean enough, that might give the best results, but it may leave spots. I think a higher DPI would improve your rate. Definitely try that and saving as PNG or BMP.

              Using grayscale scans, you can improve the contrast a lot, but the exact settings can only be found by trial and error. Grayscale may be the better choice if too many spots turn up using black and white, or if the scanner doesn't handle black and white with enough finesse.
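
If a black-and-white scan does leave spots, one common cleanup is to delete very small blobs of ink. The following is a hedged sketch of that idea, not a feature of any program named in this thread. It assumes a two-color image stored as rows of 0 (ink) and 255 (paper); `remove_specks`, `max_size` and the sample image are invented for the example.

```python
def remove_specks(image, max_size=2):
    """Turn ink blobs of `max_size` pixels or fewer into paper.

    Blobs are groups of touching ink pixels (8-connected, so diagonal
    neighbours count as touching). Returns a new image.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = set()
    for y in range(height):
        for x in range(width):
            if image[y][x] != 0 or (y, x) in seen:
                continue
            # Collect every ink pixel connected to this one.
            blob, stack = [], [(y, x)]
            seen.add((y, x))
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny in (cy - 1, cy, cy + 1):
                    for nx in (cx - 1, cx, cx + 1):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 0
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            if len(blob) <= max_size:
                for by, bx in blob:
                    result[by][bx] = 255
    return result


W = 255
scan = [
    [W, W, W, W, W, W],
    [W, 0, 0, W, W, W],
    [W, 0, 0, W, W, 0],   # the lone pixel at the right edge is a speck
    [W, 0, 0, W, W, W],
    [W, W, W, W, W, W],
]
despeckled = remove_specks(scan, max_size=2)
```

The size limit needs care: set too high, it would also delete periods and the dots on "i" and "j", so it has to stay well below the size of the smallest real mark at the chosen scan resolution.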

              The sample is very muddy, it looks like a JPG. My attached specimen was scanned in b&w at 600 dpi from a computer printout. I have to do this frequently because my colleagues think that printouts are easier to read and they don't save files...and I don't retype their newsletter articles if I can help it. I have bad eyes and use a computer because I can copy and paste, darn it. That is why I am getting pretty good at OCR tricks.
              Attached Files


                #8
                Hauling the thread back up because of some curious peeking and poking I did during a recent day of much scan-OCR.

                My darling FreeOCR (which uses the Tesseract engine, which is currently owned and developed by Google, so I suspect it is considered good enough for Google Books lol) is definitely not impressed by smooth, anti-aliased fonts. I took a look into its temporary folder, and found that it creates a 2-color bitmap to do its actual work from. So I ran tests on images with different color depths. Some of the temp images I looked at seemed almost all white, there was so little difference between "color 0" and "color 1". Others were inverted, white text on black. A matter of expediency, what ends up black and what is white. The common denominator was two colors -- the OCR had no use for shades of gray anywhere (as I had already supposed).

                The moral of the story is -- scan black-and-white for OCR.
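
To make the "two colors, either way round" observation concrete, here is a small sketch of reducing a greyscale image to a 1-bit bitmap and then normalising its polarity. It only illustrates the idea; it is not how FreeOCR or Tesseract actually builds its temporary image, which the post does not document. The rule "text covers less area than paper" and all names are assumptions.

```python
def to_bilevel(image, cutoff=128):
    """Reduce rows of 0-255 grey values to a 1-bit bitmap (0 = dark, 1 = light)."""
    return [[1 if value >= cutoff else 0 for value in row] for row in image]


def normalize_polarity(bitmap):
    """Return the bitmap with light background and dark text.

    Assumption: text covers less than half of the image, so whichever
    value is in the minority is treated as text.
    """
    light = sum(sum(row) for row in bitmap)
    total = sum(len(row) for row in bitmap)
    if light * 2 < total:
        return [[1 - bit for bit in row] for row in bitmap]
    return bitmap


# Hypothetical inverted scan: light text (240) on a dark background (10).
inverted = [
    [10,  10,  10, 10],
    [10, 240, 240, 10],
    [10,  10,  10, 10],
]
bits = to_bilevel(inverted)
upright = normalize_polarity(bits)
```

Either polarity carries the same information, which would fit the observation that some temporary images were white text on black.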

                The Microsoft Office Document Image Writer (*2003) does a creditable job, BTW. I'm thinking of using it to rip PDFs, because it can do it with little hand-holding and doesn't run paragraphs together. (I can't read PDFs because of the font and font size plus no line wrap -- hurts my eyes). Disadvantage: it has to send the text to Word. Don't get me started LOL

                I had to OCR all the content for a new website -- ack.


                  #9
                  Just my 2 cents: bigger isn't always better.

                  I use Caere Omnipage Pro 9.0 and have found that scanning at too high a resolution will eventually worsen the results. Scanning at 200dpi does fairly well for me. - Presumably the scanner quality comes into it a lot.

                  currently running 4.56 / 32 bit



                    #10
                    Couple of good points there. The size can make a difference if the print is very small, I have found, but doesn't matter much if the original document is clear and clean.

                    My cheap junk scanner often amazes me.
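
The resolution trade-off discussed in the last few posts comes down to simple arithmetic: how many pixels a character gets, and how much data a page produces. The sketch below works that out, assuming a US Letter page (8.5 x 11 inches) and using the nominal type size (1 point = 1/72 inch) as a rough stand-in for character height. Both are assumptions, and the function names are made up.

```python
def glyph_height_px(point_size, dpi):
    """Approximate height in pixels of type of the given nominal size.

    1 point = 1/72 inch; actual letters are somewhat smaller than the
    nominal size, so treat the result as an upper bound.
    """
    return point_size / 72 * dpi


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size for a page, ignoring file headers and row padding."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


for dpi in (200, 300, 600):
    print(
        f"{dpi} dpi: 10 pt type is about {glyph_height_px(10, dpi):.0f} px tall, "
        f"6 pt about {glyph_height_px(6, dpi):.0f} px; "
        f"greyscale page {uncompressed_bytes(8.5, 11, dpi, 8) / 1e6:.1f} MB, "
        f"black-and-white page {uncompressed_bytes(8.5, 11, dpi, 1) / 1e6:.1f} MB"
    )
```

Small print gains the most from a higher resolution because it starts with so few pixels per character, while normal-sized text may already have plenty at 200 or 300 dpi.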


                      #11
                      Version 4.25 added the KADMOS OCR plugin, which brings OCR features to IrfanView!

                      see: IrfanView plugin site

                      (Click to download the KADMOS OCR plugin!)


                      see: Older versions - History of changes




                        #12
                        The hotkey F9 is actually easy to use.

                        With ABBYY FineReader 7, a scan resolution of 600 dpi was needed. Now, with the new release, 300 dpi is enough.

                        If you use a camera, select its TEXT MODE in order to get the white … white.
                        I didn't succeed in getting that even with IrfanView corrections.
