Tuesday, November 22, 2011

How to handle .pdf files containing both B&W and color pages



Summary:

Visionary products are designed to ignore .jpg files while doing OCR work.  Here is a process to allow you to provide OCR data for both .tif and .jpg files.

This process is designed for production services rather than end users, so we will be using the Visionary Scan program.  This process focuses on converting a single document.  If you have multiple documents to convert, judicious use of IrfanView’s options and Windows Explorer will allow the conversion of multiple documents.  That process is outside the scope of this write-up, but can easily be done by a competent computer tech.

Resources:

(Here is a link to the sample file we will be using, Getting Started - Going To Trial)
(We will be using a free program called IrfanView for the image conversion.  You can find it on the internet or use your own favorite program.  This write-up will specifically reference IrfanView.)

Process:

With a Scan project already started, choose File, Import, PDF Files.

Browse to the sample .pdf file, select it, and click the Open button.  The .pdf conversion process converts the .pdf file into trial-usable image files.  In this scenario, the majority of the files are .jpg color images.

We can continue at this point as normal and finish our project.

However, if we want to produce OCR data for these files, we will find that we do not get any results for the .jpg image files.
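If you want a quick check of how many page images OCR would skip, you can count the imported files by extension. The following is only a minimal Python sketch and is not part of the Visionary tools. The function name `count_images_by_extension` is hypothetical, and it assumes the page images sit somewhere under the deponent’s .vigx directory.

```python
from collections import Counter
from pathlib import Path


def count_images_by_extension(vigx_dir):
    """Count .jpg/.jpeg and .tif/.tiff files anywhere under vigx_dir."""
    counts = Counter()
    for path in Path(vigx_dir).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in (".jpg", ".jpeg"):
            counts["jpg"] += 1  # these are the files OCR would ignore
        elif ext in (".tif", ".tiff"):
            counts["tif"] += 1
    return counts
```

A non-zero "jpg" count means the document contains color pages and the process below applies.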

The following is the process to get OCR data for both the .tif and .jpg images.
  1. In the Scan program, import the mixed .pdf file.
  2. Then select Tools, Delete Empty Folders in Project.
  3. Close the Scan program.
  4. Use Windows Explorer to append “Original” to the directory name portion of the deponent’s .vigx directory.  Leave the .vigx portion alone.
  5. Again use Windows Explorer to create a copy of the deponent’s .vigx directory.  In the copy’s name, replace “Original” with “Working”, leaving the .vigx portion alone.
  6. Launch IrfanView.
  7. Choose File, Batch Conversion/Rename.
  8. In this window,
    1. Check Batch conversion,
    2. Output format = “TIF - Tagged Image File Format”,
      1. Options = “CCITT Fax 4”,
      2. Uncheck “Use advanced options (for bulk resize...)”.
    3. Click the “Use current (‘look in’) directory” button.
    4. Use the “Look In:” drop-down to select the copied and renamed “Working” directory from above.
    5. Use the drop down for “files of type:” to choose “JPG/JPEG - JPG Files”.
    6. Click the “Add All” button.
    7. Click the “Start Batch” button.
  9. When the process completes, click “Exit batch”.
  10. Close IrfanView.
  11. Use Windows Explorer to delete all the .jpg files from the “Working” .vigx directory.
  12. Remove the “Working” portion of the .vigx directory name.
  13. Launch the Scan program and select this Scan project.
  14. Choose OCR Doc.
  15. When the OCR process finishes, close the Scan program.
  16. Use Windows Explorer to copy the “OCR” directory from the .vigx directory into the “Original” .vigx directory.
  17. Delete the non-“Original” .vigx directory.
  18. Remove “Original” from the name of the remaining .vigx directory.
  19. Launch the Scan program and select this Scan project.
  20. QC and finish the process.
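
For a tech who would rather script the Windows Explorer portions (steps 4, 5, 11, 12 and 16 through 18), here is a rough Python sketch. It is only an illustration under some assumptions: the three function names are hypothetical, “Original” and “Working” are appended directly to the directory name with no space, and the “OCR” directory sits directly inside the .vigx directory. The IrfanView conversion and the OCR Doc step still happen between these calls, exactly as in the list above.

```python
import shutil
from pathlib import Path


def make_original_and_working(vigx_dir):
    """Steps 4-5: rename NAME.vigx to NAMEOriginal.vigx, then copy it to NAMEWorking.vigx."""
    vigx_dir = Path(vigx_dir)
    stem, suffix = vigx_dir.stem, vigx_dir.suffix  # e.g. "Smith" and ".vigx"
    original = vigx_dir.with_name(stem + "Original" + suffix)
    working = vigx_dir.with_name(stem + "Working" + suffix)
    vigx_dir.rename(original)
    shutil.copytree(original, working)
    return original, working


def promote_working(working, vigx_dir):
    """Steps 11-12: delete the .jpg files (run this AFTER the batch conversion),
    then drop "Working" from the directory name."""
    working = Path(working)
    # Build the list first so we are not deleting while still walking the tree.
    for path in list(working.rglob("*")):
        if path.is_file() and path.suffix.lower() in (".jpg", ".jpeg"):
            path.unlink()
    working.rename(vigx_dir)


def restore_original(original, vigx_dir, ocr_name="OCR"):
    """Steps 16-18: copy the OCR directory into the Original copy, delete the
    converted copy, and remove "Original" from the directory name."""
    original, vigx_dir = Path(original), Path(vigx_dir)
    # Assumption: the OCR output lives in <vigx_dir>/OCR. Requires Python 3.8+.
    shutil.copytree(vigx_dir / ocr_name, original / ocr_name, dirs_exist_ok=True)
    shutil.rmtree(vigx_dir)
    original.rename(vigx_dir)
```

Call `make_original_and_working` after closing the Scan program in step 3, `promote_working` after closing IrfanView in step 10, and `restore_original` after closing the Scan program in step 15. Try it on a copy of a project first.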



There may seem to be many steps, but in reality, the process can be completed very quickly.

If there are any questions, do not hesitate to contact Visionary Support from the web site, Visionary Legal.

Here is a link to information about the Visionary Scan program.  Scan information.

thank you,
chuck
