Imaging and OCR
K. T. Anuradha
National Centre for Science Information
Indian Institute of Science
Bangalore – 560 012
(E-Mail: [email protected])
15-20 April 2002
Imaging and OCR
PI-3
Goals of This Presentation
To give an overview of the imaging and Optical Character Recognition (OCR) process
What Will You Learn?
You will get an overview of the imaging and OCR process
What you need to do in the lab:
  Scan some specific documents and, using the OCR software installed, convert the scanned images to text
Historical Perspective
1951: M. Sheppard's invention, GISMO, a robot reader-writer
1954: J. Rainbow developed a prototype machine able to read uppercase typewritten output at the “fantastic” speed of one character per minute
By 1967: IBM, Recognition Equipment, Inc., Farrington, Control Data, and Optical Scanning Corp. marketed OCR systems
NASA used imaging systems to enhance and manipulate satellite images
Historical Perspective
Several standards were developed:
  Character Set for Optical Character Recognition (OCR-A): ANSI X3.17-81
  Character Set for Optical Character Recognition (OCR-B): ANSI X3.49-75
  Paper Used in Optical Character Recognition Systems: ANSI X3.62-87
  Optical Character Recognition (OCR) Inks: ANSI X3.86-80
  Optical Character Recognition (OCR) Character Position: ANSI X3.93-81
Applications
Industries and institutions in which control of large amounts of paperwork is critical
  Banking, credit card and insurance industries
The medical community
  To capture, store and transmit radiology images
Libraries and archives
  For conservation and preservation of vulnerable documents and for the provision of access to source documents
Glossary
Glyph – the image of a character rendered in pixels.
Raster – the scanned image created by a kinescope (a CRT, Cathode Ray Tube, such as that used in computer displays).
Text image – the content of a text record, often the contents of a page of text.
Pixel – a picture element, also called a pel; an image sample area that is almost always square. Arranged in a grid, pixels form a raster image. A scanned page of a paper or microform document creates a digital image that is a raster of pixels.
More about Pixels
All pixels are identical in size and arrangement.
All pixels are processed the same way.
All pixels are scanned, displayed, and printed the same way.
Each pixel has a location and a colour, both given as numbers.
  Location: two coordinates, like latitude and longitude
  Colour: amounts of red, green and blue
  Maximum on all three is white; minimum on all three is black
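As a rough illustration of the points above, a pixel can be modelled as a location plus a colour, both given as numbers. The `Pixel` class, its field names and the 0 to 255 range per colour channel are assumptions made for this sketch; the slides do not prescribe a representation.

```python
from dataclasses import dataclass

MAX_LEVEL = 255  # assumption: 8 bits per colour channel


@dataclass(frozen=True)
class Pixel:
    row: int    # location: vertical position in the raster
    col: int    # location: horizontal position in the raster
    red: int    # colour: amount of red, 0..MAX_LEVEL
    green: int  # colour: amount of green, 0..MAX_LEVEL
    blue: int   # colour: amount of blue, 0..MAX_LEVEL

    def is_white(self) -> bool:
        # Maximum on all three channels is white.
        return (self.red, self.green, self.blue) == (MAX_LEVEL,) * 3

    def is_black(self) -> bool:
        # Minimum on all three channels is black.
        return (self.red, self.green, self.blue) == (0, 0, 0)


white = Pixel(row=0, col=0, red=255, green=255, blue=255)
black = Pixel(row=0, col=1, red=0, green=0, blue=0)
print(white.is_white(), black.is_black())
```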
Bit-Mapped Images
A bit-mapped image is a raster of pixels.
It is printed as a raster.
It can be created by raster scanning.
It can be created by a RIP (Raster Image Processor) in a printer.
How many shades
Five main types of image shades:
  1-bit black and white (bi-tonal): no shades between black and white
  4-bit gray scale: 16 shades of gray
  8-bit gray scale: 256 shades of gray
  8-bit colour: each pixel can be one of 256 colours
  24-bit colour: 16.8 million colours
32- and 42-bit colour: not used much; chosen mainly by photographers
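The shade counts above follow from the bit depth: a pixel stored in n bits can take 2^n distinct values. A minimal sketch of that arithmetic (the function name `shades` is just a label chosen here):

```python
def shades(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel can take at a given bit depth."""
    return 2 ** bits_per_pixel


for bits in (1, 4, 8, 24):
    print(f"{bits:>2}-bit: {shades(bits):,} values")
```

For 24 bits this gives 16,777,216 values, which is the "16.8 million colours" quoted on the slide.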
Resolution
The number of dots per inch (dpi) determines the resolution
The higher the dpi, the larger the image file
A 1-bit black-and-white image at 100 dpi requires 10 Kb of storage, and a 24-bit colour image at 400 dpi requires 475 Kb of storage
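The relationship between dpi, bit depth and storage can be sketched as simple arithmetic for an uncompressed image. The slide does not state the image area behind its figures; the example below assumes a one-square-inch sample, for which the result is 10,000 bits at 1 bit and 100 dpi and 480,000 bytes at 24 bits and 400 dpi. That is roughly in line with the quoted figures, but the match is an inference, not something the source confirms.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> float:
    """Storage for an uncompressed raster image, in bytes."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8


# Assumed one-square-inch sample (the slide does not give the image size).
print(uncompressed_bytes(1, 1, 100, 1))    # 1250.0 bytes, i.e. 10,000 bits
print(uncompressed_bytes(1, 1, 400, 24))   # 480000.0 bytes

# A full 8.5 x 11 inch page, bi-tonal at 300 dpi, for comparison.
print(uncompressed_bytes(8.5, 11, 300, 1))
```

Doubling the dpi quadruples the pixel count, which is why file size grows so quickly with resolution.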
Image Transmission and Access
Over the Net via a standard protocol such as TCP/IP
Transferring a single archival image over a 56 Kbps line requires about 18 minutes; a thumbnail transfers within seconds.
A LAN should support 10 Mbps to 100 Mbps.
A 19-inch colour monitor that supports 1024 by 768 resolution is ideal.
Printers range from desktop monochrome laser printers at 300 to 600 dpi to the more expensive gray-scale and colour laser printers.
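The transfer times above can be checked with a back-of-the-envelope calculation: time equals size in bits divided by line speed. The file sizes below are assumptions chosen for illustration (an archival image of about 7.5 MB is what an 18-minute transfer at 56 Kbps implies), and the sketch ignores protocol overhead and compression.

```python
def transfer_seconds(size_bytes: float, line_bits_per_second: float) -> float:
    """Idealised transfer time, ignoring protocol overhead."""
    return size_bytes * 8 / line_bits_per_second


ARCHIVAL_BYTES = 7_560_000   # assumed archival image size
THUMBNAIL_BYTES = 20_000     # assumed thumbnail size
MODEM_BPS = 56_000           # 56 Kbps line
LAN_BPS = 10_000_000         # 10 Mbps LAN

print(transfer_seconds(ARCHIVAL_BYTES, MODEM_BPS) / 60)  # 18.0 minutes
print(transfer_seconds(THUMBNAIL_BYTES, MODEM_BPS))      # a few seconds
print(transfer_seconds(ARCHIVAL_BYTES, LAN_BPS))         # a few seconds on a LAN
```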
Types of images
Thumbnail
  Lets the user judge whether to view the full image; requires about 10-35 Kb of storage space for each image
Service
  Designed to convey information; typically compressed; requires up to 300 Kb for each image
Archival
  Uncompressed image free of the artifacts resulting from compression; the highest-quality images require several Mb each
Indexing of Images
Images are indexed so that they can be identified and retrieved
  E.g. purchase order number, policy number, account number, profile number, ISSN
The MARC format for bibliographic records has some limitations in indexing images
Two alternatives to MARC are Dublin Core and EAD (Encoded Archival Description)
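Indexing in this sense amounts to a lookup from a key (such as a policy number) to the image files that carry it. The sketch below is only an illustration: the record layout, field names and file names are hypothetical and are not taken from MARC, Dublin Core or EAD.

```python
from collections import defaultdict

# Hypothetical index records, one per scanned image.
records = [
    {"file": "scan_0001.tif", "policy_number": "P-1001", "account_number": "A-77"},
    {"file": "scan_0002.tif", "policy_number": "P-1002", "account_number": "A-77"},
    {"file": "scan_0003.tif", "policy_number": "P-1001", "account_number": "A-12"},
]


def build_index(records, field):
    """Map each value of `field` to the image files that carry it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["file"])
    return index


by_policy = build_index(records, "policy_number")
by_account = build_index(records, "account_number")
print(by_policy["P-1001"])
print(by_account["A-77"])
```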
Image formats
Raster
  Bit-mapped graphics composed of coloured dots.
  Common formats include .tiff (Tagged Image File Format, a widely used base format for scanned images), .jpg (Joint Photographic Experts Group, for continuous-tone images such as photographs), .gif (for images with up to 256 colours), .mpg (Moving Picture Experts Group), .bmp and .pdf
  Images are edited in paint and photo programs like Adobe Photoshop and MetaCreations Painter
Vector
  Mathematically defined, with coded instructions that define the angles and relationships between every line in the image.
  Common vector formats include .wmf and .cgm
  Images are edited in drawing programs like Adobe Illustrator and CorelDraw.
Image formats: uses and advantages
Raster
  Uses: continuous-tone images, e.g. photographs; the web, where vector formats are not currently supported
  Advantages: the only format that shows the smooth gradients and subtle detail necessary in photographic images; colour correction is much easier than with vector images
Vector
  Uses: logos that have a few solid colours and need to be shown at a variety of sizes; specialized text effects; 3D and CAD programs
  Advantages: resolution independent; smooth curves; small file sizes
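The "resolution independent" and "small file sizes" points can be made concrete with a toy example. Here a vector image is just four numbers describing a rectangle in a unit square, and it is turned into a raster at whatever grid size is requested. The function `rasterize_rect` and the text rendering with `#` and `.` are assumptions made for this sketch, not how any particular drawing program works.

```python
def rasterize_rect(x0, y0, x1, y1, size):
    """Render a rectangle given in unit-square coordinates onto a size x size grid."""
    rows = []
    for r in range(size):
        row = ""
        for c in range(size):
            # Sample the shape at the centre of each pixel.
            x = (c + 0.5) / size
            y = (r + 0.5) / size
            row += "#" if x0 <= x < x1 and y0 <= y < y1 else "."
        rows.append(row)
    return rows


shape = (0.25, 0.25, 0.75, 0.75)  # the whole "vector image": four numbers
for size in (4, 8):
    print("\n".join(rasterize_rect(*shape, size)), end="\n\n")
```

The vector description stays the same four numbers at any output size, while the raster needs 16 pixels at 4 x 4 and 64 pixels at 8 x 8.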
Image capture interfaces
IDE
  Widely used, low cost, poorest seek time
SCSI
  Faster seek time, costs more, 40Mb-160Mb/sec
USB (Universal Serial Bus)
  Ease of setup, 15Mb/sec
IEEE 1394
  Initially developed by Apple, 3.2Gb/sec, not supported by all PCs
Image Drivers
An image driver is required for an image capture device to communicate with software applications. Two standards are available:
ISIS
  Proprietary product developed by Pixel Translations
TWAIN
  Developed and designed by the TWAIN Working Group
  In 1999, TWAIN 1.7 was adopted as the current standard
Selecting an Imaging System
The choice of imaging system depends on the type of application
Workflow or transaction processing systems
  Focus on processing documents and automating the process
  Capture and store images without alteration
  E.g. purchase orders, invoices, credit card charges and insurance policies
Storage and retrieval systems
  Store and retrieve large numbers of documents in a variety of types and formats
  Capture and enhance images to improve readability
  E.g. the medical and library communities
Types of Imaging Systems
Drum scanners
  High-end scanners
  Use photomultipliers
  Expensive and sensitive devices
Flatbed scanners
  Ideal for odd-sized images
Sheetfed scanners
  Can scan only loose sheets
  Compact in size and easy to install
Handheld scanners
  Provide portability and functionality at low cost
What, Why and When of OCR
OCR allows printed, typewritten or handwritten text (numerals, letters or symbols) to be scanned and the scanned image to be converted to a computer-processable format: plain text, a Word document or an Excel spreadsheet, which can be edited, used or reused in other documents
It works on raster images
What, Why and When of OCR
OCR is used when recreating a document in electronic form by hand would take more time
The converted text files take less space than the original image file and can be indexed
OCR bridges the gap between the paperless and the papered
How of OCR
An OCR system has three components:
  Image scanner
  OCR hardware/software
  Output interface
How of OCR
A scanner has four components:
  A detector, an illumination source, a scan lens and a document transport
OCR hardware/software performs three operational steps:
  Document analysis, character recognition, contextual processing
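The three operational steps can be sketched as a toy pipeline. Everything here is a deliberately simplified assumption for illustration: the 3 x 5 glyph templates, the two-word dictionary, segmentation at blank columns, pixel-difference template matching and the use of Python's `difflib` for the contextual step. Real OCR software uses far more robust techniques.

```python
import difflib

# Hypothetical 3x5 glyph templates ('#' = ink, '.' = background).
TEMPLATES = {
    "C": ["###", "#..", "#..", "#..", "###"],
    "A": [".#.", "#.#", "###", "#.#", "#.#"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}
DICTIONARY = ["CAT", "DOG"]  # hypothetical word list


def analyse_document(page):
    """Document analysis: split one text line into glyphs at blank columns."""
    width = len(page[0])
    inked = [any(row[c] == "#" for row in page) for c in range(width)]
    glyphs, start = [], None
    for c, has_ink in enumerate(inked + [False]):  # sentinel closes the last glyph
        if has_ink and start is None:
            start = c
        elif not has_ink and start is not None:
            glyphs.append([row[start:c] for row in page])
            start = None
    return glyphs


def recognise_character(glyph):
    """Character recognition: pick the template with the fewest differing pixels."""
    def distance(template):
        return sum(a != b
                   for g_row, t_row in zip(glyph, template)
                   for a, b in zip(g_row, t_row))
    return min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch]))


def apply_context(word):
    """Contextual processing: snap the raw result to the closest dictionary word."""
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=0.6)
    return matches[0] if matches else word


# A "scanned" line reading CAT, where noise makes the C look more like an O.
noisy_c = ["###", "#.#", "#.#", "#..", "###"]
glyph_rows = (noisy_c, TEMPLATES["A"], TEMPLATES["T"])
page = [".".join(g[r] for g in glyph_rows) for r in range(5)]

raw = "".join(recognise_character(g) for g in analyse_document(page))
print(raw, "->", apply_context(raw))  # OAT -> CAT
```

In this sketch the recognition step alone misreads the noisy first letter, and the dictionary lookup repairs it, which is the role contextual processing plays after character recognition.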
How of OCR
Output Interface
Allows character recognition results to be electronically transferred into the domain that uses the results
Types of OCRs
Two types of OCRs:
  Task-specific readers
  General-purpose readers
Task-specific readers
  Read only specific documents: bank cheques, mail addresses
  Used primarily for high-volume applications which require high system throughput:
    Assigning ZIP Codes to letter mail
    Reading data entered in forms, e.g. tax forms
    Automatic accounting procedures used in processing utility bills
Types of OCRs
General-purpose page readers
High-end OCR (usually for offices)
  Speed and accuracy are important
  Format preservation
  Good proofreading solutions
Low-end OCR (usually for home use)
  Speed is not required
  Proofreading is done manually
Factors affecting OCR quality
Scanner quality
Scan resolution
Type of printed document, whether laser printer output or photocopy
Paper quality
Fonts used in the text
Linguistic complexities
Dictionary used
Evaluating OCRs
Neat interface
Easy-to-use wizards
Accurate recognition
Scan resolution setting (600 dpi is advisable)
Time taken from scanning to delivery of the final product
Enhanced usability of the product
Ability to modify the scan settings
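"Accurate recognition" can be put on a numerical footing when comparing OCR packages in the lab. One common approach, used here as an assumption rather than something the slides specify, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference text, divided by the length of the reference.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered correctly (0.0 to 1.0)."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr_output) / len(truth))


truth = "Optical Character Recognition"
ocr = "0ptical Character Rec0gnition"   # two letters O misread as zeros
print(f"{character_accuracy(truth, ocr):.3f}")  # 0.931
```

Running the same reference page through each OCR package and comparing the scores gives a simple, repeatable basis for the evaluation.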
Summarizing
We learnt:
  The basics of imaging systems and images
  The different steps involved in scanning and the OCR technique
  Conversion of a raster image to text using OCR techniques
  Types of imaging systems and OCR software
  Evaluation of imaging systems and OCR software
Questions? Comments? Discussions?
(Please fill in the feedback form.)
Thank you!