Imaging and OCR

K.T. Anuradha
National Centre for Science Information
Indian Institute of Science
Bangalore – 560 012
(E-Mail: [email protected])

15-20 April 2002

Imaging and OCR

PI-3

1

Goals of This Presentation

  • To give an overview of the imaging and Optical Character Recognition (OCR) process


What Will You Learn?

  • You will get an overview of the imaging and OCR process
  • What you need to do in the lab:
      • Scan some specific documents and, using a few of the installed OCR packages, convert the scanned images to text


Historical Perspective

  • M. Sheppard's invention, GISMO - A Robot Reader-Writer, in 1951
  • J. Rainbow developed a prototype machine in 1954
      • Able to read uppercase typewritten output at the “fantastic” speed of one character per minute
  • IBM, Recognition Equipment, Inc., Farrington, Control Data, and Optical Scanning Corp. marketed OCR systems by 1967
  • NASA used imaging systems to enhance and manipulate satellite images


Historical Perspective

  • Several standards were developed
      • Character Set for Optical Character Recognition (OCR-A). ANSI X3.17-81
      • Character Set for Optical Character Recognition (OCR-B). ANSI X3.49-75
      • Paper Used in Optical Character Recognition Systems. ANSI X3.62-87
      • Optical Character Recognition (OCR) Inks. ANSI X3.86-80
      • Optical Character Recognition (OCR) Character Position. ANSI X3.93-81


Applications

  • Industries and institutions in which control of large amounts of paperwork is critical
      • Banking, credit cards, insurance industries
  • The medical community
      • To capture, store and transmit radiology images
  • Libraries and archives
      • For conservation and preservation of vulnerable documents and for the provision of access to source documents


Glossary

  • Glyph – the image of a character rendered in pixels.
  • Raster – the scanned image created by a kinescope (a CRT, Cathode Ray Tube, such as that used in computer displays).
  • Text image – the content of a text record, often the contents of a page of text.
  • Pixels (PICture ELements) or pels (Picture ELements) – image sample areas that are almost always square. Arranged in a grid, pixels form a raster image. Scanning a page of a paper or microform document creates a digital image that is a raster of pixels.


More about Pixels

  • All pixels are identical in size and arrangement.
  • All pixels are processed the same way.
  • All pixels are scanned, displayed, and printed the same way.
  • Each pixel has a location and a colour.
      • Both are given as numbers.
      • Location: a position in the grid, like latitude and longitude
      • Colour: amounts of red, green and blue
          • Maximum on all 3 is white, minimum on all 3 is black
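As a minimal sketch of the pixel model described here, the Python below stores a raster as rows of (red, green, blue) values addressed by row and column. The 0 to 255 range per channel and the names `WHITE`, `BLACK`, `make_raster` and `set_pixel` are assumptions made for illustration; the slides do not fix a numeric range.

```python
# Hypothetical sketch: a raster as a grid of identical pixels.
# Assumes 8 bits per channel (0..255); this range is an assumption.
WHITE = (255, 255, 255)  # maximum on all three channels
BLACK = (0, 0, 0)        # minimum on all three channels


def make_raster(rows, cols, colour=WHITE):
    """Return a rows x cols grid in which every pixel has the same colour."""
    return [[colour for _ in range(cols)] for _ in range(rows)]


def set_pixel(raster, row, col, colour):
    """Each pixel is addressed by its location (row, col) and holds a colour."""
    raster[row][col] = colour


page = make_raster(3, 4)      # a blank white "page" of 3 rows and 4 columns
set_pixel(page, 1, 2, BLACK)  # one black dot at row 1, column 2
```

Both the location and the colour are plain numbers, which is why every pixel can be scanned, stored and printed in the same way.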


Bit-Mapped Images

  • A bit-mapped image is a raster of pixels.
      • Printed as a raster.
      • Can be created by raster scanning.
      • Can be created by a RIP (Raster Image Processor) in a printer.


How many shades

  • Five main types of image shades
      • 1-bit black and white or bi-tonal: no shades between black and white
      • 4-bit gray scale: 16 shades of gray
      • 8-bit gray scale: 256 shades of gray
      • 8-bit colour: each pixel can be one of 256 colours
      • 24-bit colour: 16.8 million colours
  • 32- and 42-bit colour: not used much; chosen mainly by photographers
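The shade counts listed here follow from the bit depth: n bits per pixel can encode 2^n distinct values. A short sketch (the function name `shades` is made up for this example):

```python
def shades(bits_per_pixel):
    """Number of distinct values a pixel with this many bits can take."""
    return 2 ** bits_per_pixel


for bits in (1, 4, 8, 24):
    print(f"{bits:>2}-bit: {shades(bits):,} values")
```

For 24 bits this gives 16,777,216, which is the "16.8 million colours" figure.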


Resolution

  • The number of dots per inch (dpi) determines the resolution
  • The higher the dpi, the larger the file size
  • A 1-bit black and white image at 100 dpi requires 10 Kb of storage, and a 24-bit colour image at 400 dpi requires 475 Kb of storage
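The link between dpi, bit depth and storage can be sketched as a small calculation. The slide does not state the physical size of the image behind its figures, so the one-inch-square example below is an assumption, as is the function name `uncompressed_bytes`; the sketch covers uncompressed rasters only.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed raster size in bytes for an image of the given physical size."""
    width_px = round(width_in * dpi)    # pixels across
    height_px = round(height_in * dpi)  # pixels down
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8        # round up to whole bytes


# Assumed example: a one-inch-square sample at two settings.
small = uncompressed_bytes(1, 1, 100, 1)   # 1-bit at 100 dpi
large = uncompressed_bytes(1, 1, 400, 24)  # 24-bit colour at 400 dpi
print(small, large)
```

Under these assumptions the two cases come to 1,250 and 480,000 bytes. Size grows with the square of the dpi, so doubling the resolution quadruples the storage needed.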


Image transmission and Access

  • On the Net via standard protocols such as TCP/IP
  • Transferring a single archival image over a 56 Kbps line requires about 18 minutes; a thumbnail takes seconds.
  • The LAN should support 10 Mbps to 100 Mbps
  • A 19-inch colour monitor that supports 1024 by 768 resolution is ideal.
  • Printers range from desktop monochrome laser printers at 300 to 600 dpi to the more expensive gray-scale and colour laser printers
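The transfer times quoted here can be checked with a rough calculation: time is the file size in bits divided by the line speed. The file sizes below (7.5 MB for an archival image, 20 KB for a thumbnail) are assumed values chosen for illustration, and the sketch ignores protocol overhead and compression on the line.

```python
def transfer_seconds(size_bytes, line_bits_per_second):
    """Idealised transfer time: size in bits divided by line speed."""
    return size_bytes * 8 / line_bits_per_second


MODEM = 56_000          # 56 Kbps line
archival = 7_500_000    # assumed archival image size in bytes
thumbnail = 20_000      # assumed thumbnail size in bytes

print(f"archival:  {transfer_seconds(archival, MODEM) / 60:.1f} minutes")
print(f"thumbnail: {transfer_seconds(thumbnail, MODEM):.1f} seconds")
```

With these assumed sizes the archival image takes roughly 18 minutes and the thumbnail under 3 seconds. The same function applied to a 10 Mbps LAN shows why a faster local network is recommended.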


Types of images

  • Thumbnail
      • Lets the viewer judge whether to open the full image; requires about 10-35 Kb of storage space for each image
  • Service
      • Designed to convey information; typically compressed; requires up to 300 Kb for each image
  • Archival
      • Uncompressed image, free of the artifacts resulting from compression; these highest-quality images require several Mb each


Indexing of Images

  • Images are indexed so that they can be identified and retrieved
      • E.g. purchase order number, policy number, account number, profile number, ISSN
  • The MARC format for bibliographic records has some limitations in indexing images
  • Two alternatives to MARC are Dublin Core and EAD (Encoded Archival Description)
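The idea of indexing images by identifying fields can be sketched as a lookup table from field values to image files. This is only an illustration of the principle, not MARC, Dublin Core or EAD; the class `ImageIndex`, the field names and the file names are all hypothetical.

```python
from collections import defaultdict


class ImageIndex:
    """Hypothetical index: field name -> field value -> list of image files."""

    def __init__(self):
        self._by_field = defaultdict(lambda: defaultdict(list))

    def add(self, image_file, **fields):
        """Register an image under every identifying field supplied."""
        for name, value in fields.items():
            self._by_field[name][value].append(image_file)

    def find(self, field, value):
        """Return the image files recorded for this field value (may be empty)."""
        return list(self._by_field.get(field, {}).get(value, []))


index = ImageIndex()
# File names and identifiers below are placeholders, not real records.
index.add("scan_0001.tif", policy_number="P-001", account_number="A-17")
index.add("scan_0002.tif", policy_number="P-001")
print(index.find("policy_number", "P-001"))
```

One image can be filed under several fields, and one field value can point to several images, which is what makes retrieval by any of the listed numbers possible.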


Image formats

  • Raster
      • Bit-mapped graphics composed of coloured dots.
      • Common formats include .tiff (Tagged Image File Format: basis for all image files), .jpg (Joint Photographic Experts Group, for gray line images), .gif (for colour images), .mpg (Moving Picture Experts Group), .bmp, .pdf
      • Images are edited in paint and photo programs like Adobe Photoshop and MetaCreations Painter
  • Vector
      • Mathematically defined, with coded instructions that define the angles and relationships between every line in the image.
      • Common vector formats include .wmf and .cgm
      • Images are edited in drawing programs like Adobe Illustrator and CorelDraw.


Image formats: uses and advantages

  • Raster
      • Used for continuous-tone images, e.g. photographs, and on the web, where no vector formats are currently supported
      • The only format that will show the smooth gradients and subtle detail necessary in photographic images; allows colour correction much more easily than vector images
  • Vector
      • Used for logos that have a few solid colours and need to be shown at a variety of sizes; for creating specialized text effects; in 3D and CAD programs
      • Resolution independent; smooth curves; small file sizes
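Why vector images are resolution independent while raster images are not can be shown with a toy sketch. A vector shape is a list of coordinates that can be multiplied by any factor, whereas enlarging a raster means replicating existing pixels. The function names and the nearest-neighbour method below are choices made for this illustration only.

```python
def scale_vector(points, factor):
    """Scale a vector shape: the coordinates change, the definition stays exact."""
    return [(x * factor, y * factor) for x, y in points]


def scale_raster(raster, factor):
    """Nearest-neighbour enlargement by an integer factor.

    Each source pixel becomes a factor x factor block, so no new detail
    appears and edges look blocky when the image is enlarged.
    """
    out = []
    for row in raster:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


line = [(0, 0), (1, 2)]   # a vector line given by its two end points
dots = [[0, 1],
        [1, 0]]           # a 2 x 2 bi-tonal raster

print(scale_vector(line, 3))
print(scale_raster(dots, 2))
```

The scaled vector line is still described by two points, while the raster has grown from 4 to 16 stored pixels without gaining any detail, which also illustrates the file-size difference.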


Image capture interfaces

  • IDE
      • Widely used, low cost, poorest seek time
  • SCSI
      • Faster seek time, costs more, 40Mb-160Mb/sec
  • USB (Universal Serial Bus)
      • Ease of setup, 15Mb/sec
  • IEEE 1394
      • Initially developed by Apple, 3.2Gb/sec, not all PCs support it


Image Drivers

An image driver is required for an image capture device to communicate with software applications. Two standards are available:

  • ISIS
      • Proprietary product developed by Pixel Translations
  • TWAIN
      • Developed and designed by the TWAIN Working Group, which in 1999 adopted TWAIN 1.7 as the current standard


Selecting Imaging System

  • Imaging system selection depends on the type of application
      • Workflow or transaction processing systems: focus on processing documents and automating the process; capture and store images without alteration. E.g. purchase orders, invoices, credit card charges and insurance policies
      • Storage and retrieval systems: store and retrieve a large number of documents in a variety of types and formats; capture images and enhance them to improve readability. E.g. medical and library communities


Types of Imaging System

  • Drum scanners: high-end scanners
      • Use photomultipliers
      • Expensive and sensitive devices
  • Flatbed scanners
      • Ideal for odd-sized images
  • Sheetfed scanners
      • Can scan only loose sheets
      • Compact in size and easy to install
  • Handheld scanners
      • Provide portability and functionality at low cost


What, Why and When of OCR

  • OCR lets you scan printed, typewritten or handwritten text (numerals, letters or symbols) and/or convert the scanned image to a computer-processable format, such as plain text, a Word document or an Excel spreadsheet, which can be edited, used or reused in other documents
  • It uses raster images


What, Why and When of OCR

  • OCR is used when recreating a document in electronic form by hand would take more time
  • The converted text files take less space than the original image file and can be indexed
  • Bridges the gap between the paperless and the papered


How of OCR

  • It has three components:
      • Image scanner, OCR hardware/software, output interface


How of OCR

  • The scanner has 4 components:
      • A detector, an illumination source, a scan lens and a document transport
  • OCR hardware/software performs three operational steps:
      • Document analysis, character recognition, contextual processing
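The three operational steps can be illustrated with a toy pipeline. This is a sketch under strong assumptions, not how any particular OCR product works: the 3 x 5 glyph templates, the pixel-difference matching, the two-word dictionary and all function names are invented for the example, and `difflib` from the Python standard library stands in for contextual processing.

```python
import difflib

# Hypothetical 3x5 glyph templates ('#' = ink, '.' = paper).
TEMPLATES = {
    "C": ["###", "#..", "#..", "#..", "###"],
    "A": [".#.", "#.#", "###", "#.#", "#.#"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}


def segment(image):
    """Document analysis: split a one-line bitmap into glyphs at blank columns."""
    width = len(image[0])
    glyphs, start = [], None
    for col in range(width + 1):
        blank = col == width or all(row[col] == "." for row in image)
        if not blank and start is None:
            start = col                       # a glyph begins here
        elif blank and start is not None:
            glyphs.append([row[start:col] for row in image])
            start = None                      # the glyph has ended
    return glyphs


def distance(glyph, template):
    """Count differing pixels between two equally sized bitmaps."""
    return sum(a != b for g, t in zip(glyph, template) for a, b in zip(g, t))


def recognise(glyph):
    """Character recognition: pick the template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def contextual(word, dictionary):
    """Contextual processing: snap the raw result to the closest dictionary word."""
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word


# A noisy scan of "CAT": two stray pixels make the C look more like an O.
scan = [
    "###..#..###",
    "#.#.#.#..#.",
    "#.#.###..#.",
    "#...#.#..#.",
    "###.#.#..#.",
]
raw = "".join(recognise(g) for g in segment(scan))
text = contextual(raw, ["CAT", "DOG"])
print(raw, "->", text)
```

In this example the recognition step alone misreads the noisy first glyph, and the dictionary lookup corrects the word. That is the role contextual processing plays after character recognition.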


How of OCR

  • Output interface
      • Allows character recognition results to be electronically transferred into the domain that uses the results


Types of OCRs

  • Two types of OCRs
      • Task-specific readers
      • General-purpose readers
  • Task-specific readers
      • Read only specific documents: bank cheques, mail addresses
      • Used primarily for high-volume applications that require high system throughput: assigning ZIP Codes to letter mail; reading data entered in forms, e.g. tax forms; automatic accounting procedures used in processing utility bills


Types of OCRs

  • General-purpose page readers
      • High-end OCR (usually for offices)
          • Speed and accuracy are important
          • Format preservation
          • Good proofreading solutions
      • Low-end OCR (usually for home use)
          • Speed is not required
          • Proofreading is done manually


Factors affecting OCR quality

  • Scanner quality
  • Scan resolution
  • Type of printed documents, whether laser printer outputs or photocopied
  • Paper quality
  • Fonts used in the text
  • Linguistic complexities
  • Dictionary used


Evaluating OCRs

  • Neat interface
  • Easy-to-use wizards
  • Accurate recognition
  • Scan resolution setting (600 dpi is advisable)
  • Time taken from scanning to delivery of the final product
  • Enhanced usability of the product
  • Ability to modify the scan settings
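"Accurate recognition" can be put into numbers by comparing the OCR output with a hand-checked transcription. The slides do not name a metric; the sketch below uses character accuracy based on edit distance, a commonly used measure, and the function names are chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recovered, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


# "scanner" misread as "scamer": one substitution and one deletion.
print(f"{character_accuracy('scanner', 'scamer'):.3f}")
```

Running the same set of test pages through each OCR package and comparing this figure, together with the time taken, gives a like-for-like basis for the evaluation points listed here.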


Summarizing

  • We learnt the basics of imaging systems and images
  • Different steps involved in scanning and the OCR technique
  • Conversion of raster images to text using OCR techniques
  • Types of imaging systems and OCR software
  • Evaluation of imaging systems and OCR software




Questions? Comments? Discussions?

(Please fill in the feedback form)

Thank You!
