Automated Video Enhancement
Video Processing Tools for Amateur Filmmakers

Himanshu Madan (404034)
Sanika Mokashi (404037)
Sneha Rangdal (404042)
Siddharth Shashidharan (404048)

Guides:
Prof. Mrs. C. V. Joshi, H.O.D., Dept. of E&TC, College of Engineering, Pune
Mr. Sumeet Kataria, Oneirix Labs

COLLEGE OF ENGINEERING, PUNE
Shivajinagar, Pune - 411 005
(Formerly Government College of Engineering, Pune)

CERTIFICATE

This is to certify that Himanshu Madan [404034], Sanika Mokashi [404037], Sneha Rangdal [404042], and Siddharth Shashidharan [404048], of the final year, B.Tech, E&TC, have satisfactorily completed their B.Tech project titled "Automated Video Enhancement" under the guidance of Prof. Mrs. C. V. Joshi, as a part of the coursework for the year 2007-08.

Prof. Mrs. C. V. Joshi
Internal Guide and H.O.D., Dept. of E&TC
College of Engineering, Pune

CERTIFICATE

This is to certify that Himanshu Madan [404034], Sanika Mokashi [404037], Sneha Rangdal [404042], and Siddharth Shashidharan [404048], studying in the final year of the Electronics and Telecommunication department at COEP, worked on their B.Tech project at Oneirix Labs.

The project, titled "Automated Video Enhancement", is deemed complete. Mr. Sumeet Kataria of Oneirix Labs was assigned as their guide for the entire duration of this project. The aims decided upon at the start of the project were satisfactorily completed in good time. Oneirix Labs is pleased with the quality of work and the results obtained.

ACKNOWLEDGEMENT

We take this opportunity to express our heartfelt gratitude to everyone at Oneirix Labs and to the professors and staff of College of Engineering, Pune, for their constant inspiration and guidance. We would, in particular, like to thank our guide, Sumeet Katariya, for his untiring efforts, technical advice and guidance throughout the course of this project. We are also deeply indebted to Prof. Mrs. C. V. Joshi for her support and guidance, drawn from her vast pool of experience and knowledge. We are extremely grateful to everyone at Oneirix Labs, in particular Udayan Kanade and Sanat Ganu, for their invaluable inputs at critical junctures during the project work. We could not have conceived of undertaking a project on such a scale without the encouragement and blessings of everyone at Oneirix Labs, our professors and our parents.

Abstract

Art, it is said, merely reflects life and society. If there is one medium of expression that has seduced the imagination of the populace, it is the motion picture. From humble beginnings, the art of film-making is now nearly a century old. It comes as little surprise, then, to see a whole host of amateur filmmakers making home-made films, made possible by the surge in video and audio recording technologies. Digital video camcorders have been designed to help this market segment by adding functionality such as autofocus and autoexposure. This, however, does not guarantee the right results all the time.

In fact, a large portion of every film's production time is normally taken up by the post-production stage, with editing required for tasks such as colour correction, brightness and contrast control, and frame-by-frame editing of the video. Powerful computers and commercially available video-audio editing software such as Apple Inc.'s Final Cut Pro or Avid are expensive, sophisticated tools offering a multitude of services in the post-processing stage of a film's development. Given their cost and sheer complexity, they are often limited to use in studio houses or editing companies. Even there, they require trained professionals who rely heavily on human senses to produce error-free, professional-looking films. This takes time and training and introduces human error, in that there will always be a limit on what is perceived as the right brightness level or colour consistency over the entire duration of a scene. This is where we come into the picture. The problems with autofocus and autoexposure in modern digital cameras are discussed in depth in the chapters to follow.

If we understand exactly how images are formed in a video stream, it is possible to use the concepts of image processing to correct the irregularities caused by the camera, using the good images to correct or improve those lacking in quality, leading to an improved overall video. Additionally, we seek to make the whole process as automated as possible to eliminate the inevitable human-perception errors. For the end user, this translates to shorter hours, more automation, greater reliability and even more functionality.

Contents

A1. Certificate of Completion – COEP
A2. Certificate of Completion – Sponsors
A3. Acknowledgement
A4. Abstract
A5. Contents
A6. List of Figures

1. Need for the Project
2. Video Encoding and Decoding Concepts
   2.1 MPEG Encoding
   2.2 Software to Extract Frames
       2.2.1 FFmpeg
       2.2.2 Frame Dumper
3. Problems Faced by Amateurs
   3.1 Auto Exposure
       3.1.1 The problems with auto exposure
   3.2 Autofocus
       3.2.1 The problems with autofocus
4. Post-processing Tools
   4.1 Apple's FCP
   4.2 Avid Media Composer
   4.3 Autodesk Smoke
   4.4 Autodesk Fire
5. Our Approach
6. Block Diagram
7. Motion Estimation
   7.1 What is Correlation
   7.2 Algorithm for motion detection
   7.3 Test Results of the Motion Algorithm
   7.4 Problems Faced and Their Solution
8. Large Space Matrix
9. Exposure Correction
   9.1 A quantitative measure of the exposure in an image
   9.2 Plotting of the brightness curve
   9.3 Approach 1: The transfer curve method
   9.4 Approach 2
   9.5 Approach 3
   9.6 Algorithm
   9.7 Backslash or matrix left division
   9.8 Pseudo-inverse
   9.9 Implementation of the algorithm
10. Focus
    10.1 Quantifying focus
    10.2 Methods to measure the frequency components
         10.2.1 FFT
         10.2.2 DCT
    10.3 Practically applying the above concepts
11. Actual Correction
    11.1 Calculation of weights
12. Results
13. Conclusion and Future Scope
14. References

List of Figures

5.1 Frame 5 containing good information and frame 20 with poor information.
6.1 Block diagram.
7.1 Illustration of the idea of correlation.
7.2 Sequence of images showing the apparent motion between them.
7.3 Plot of calculated motion, using the motion estimation algorithm.
7.4 Frame with very poor information content.
7.5 Good block obtained from the entire frame.
8.1 Large space matrix with each frame as one dimension.
8.2 Combined frames after motion compensation.
9.1 Energy pattern of the frames.
9.2 Correction needed in the energy pattern of the frames.
9.3 Results of the transfer curve method.
9.4 Frames showing the variation of pixel intensity in time before the auto exposure of the camera settles.
9.5 Independent correction curves of different pixels.
9.6 Corrected image showing the banding effect.
10.1 Demonstration of FFT transforms.
10.2 Image converted to its frequency domain.
10.3 Demonstration of 2D DCT transforms.
10.4 Images with varying focus.
10.5 FFT of an image.
10.6 DCT of an image.
10.7 Focus factor using FFT.
10.8 Focus factor using DCT.
11.1 Old and new exposure curves and original pixel values.
12.1 First frame.
12.2 First frame after correction.
12.3 Second frame.
12.4 Second frame after correction.
12.5 Third frame.
12.6 Third frame after correction.
12.7 Fourth frame.
12.8 Fourth frame after correction.
12.9 Final frame.
12.10 Final frame after correction.

Chapter 1

Need For the Project


Need For the Project

Technology today has stretched out its long arms to embrace the populace and empower them to do things once deemed impossible. Nearly anyone today can shoot and edit films with the help of a digital camera and a computer at home. These digicams have spawned a generation of filmmakers of all hues and styles.

When a motion picture is made, the use of elaborate and technologically advanced cameras is not the only thing that adds a touch of professionalism. There are also elaborate lighting and staging processes and, of course, a long, detailed editing stage needed to bring the picture to the big screen. These are luxuries that an amateur armed only with a hardy digicam cannot afford. In such cases, he needs to rely on what little he may have in his toolkit by way of post-processing tools.

The problems that are done away with by controlling the lighting and other environmental factors in a full-fledged movie rear their heads when digicams are used. Factors such as lighting and ambient colour cannot really be controlled by an amateur, as he does not have the resources for expensive studio lights, reflectors and so on. The video obtained from his digicam is bound to look rather crude and unedited. The video editing software available today offers some tools that the filmmaker can use to improve the overall appearance of the footage; these are discussed in a later section. This software, however, cannot really do much to compensate for the inherent shortcomings of digital cameras and handycams. Certain problems arise from features such as auto-exposure and auto-focus that come with virtually every digital camera nowadays. They are discussed in detail in the pages that follow.

The traditional method has always been to rectify or compensate for these shortcomings at the post-processing stage rather than at the data acquisition stage itself. Thus, what one hopes to do is to use the video obtained directly from the camera, in its raw form, to rectify these problems and create aesthetically appealing, professional-looking videos. For now, an attempt is made to show the need for a new approach to fixing those very problems, given the shortcomings of the software and video tools in the present market.


Chapter 2

Video Encoding And Decoding Concepts


Video Encoding And Decoding Concepts

Historically, video was stored as an analog signal on magnetic tape. Around the time the compact disc entered the market as a digital replacement for analog audio, it became feasible to also begin storing and using video in digital form, and a variety of such technologies began to emerge.

In most modern contexts, we can think of a video format as a continuous stream of images played out at a particular rate, the 'fps' or frames per second. By keeping this rate high (usually 24 to 30 fps), one uses the persistence of human vision to generate what looks like continuous motion. Thus, improving the quality of the video boils down to improving these images. Most modern cameras store data digitally, and MPEG is the standard format used to record and store videos. This file format, too, may be thought of as frames placed one after the other, forming a continuous stream of images. Let us discuss this coding technique in greater detail.

2.1 MPEG Encoding

MPEG stands for Moving Picture Experts Group, the industry committee which created the standard. MPEG is, in fact, a whole family of standards for digital video and audio signals using DCT compression. MPEG-2, which employs DCT compression, is certain to become the dominant standard in consumer equipment for the foreseeable future. MPEG takes the DCT compression algorithm and defines how it is used to reduce the data rate and how packets of video and audio data are multiplexed together in a way that will be understood by an MPEG decoder.

The DCT, or Discrete Cosine Transform, exploits the fact that adjacent pixels in a picture, either physically close in the image (spatial) or in successive images (temporal), may have the same value. Small blocks of 8 x 8 pixels are 'transformed' mathematically in a way that tends to group the common signal elements in a block together. The DCT does not directly reduce the data, but the transform tends to concentrate the energy into the first few coefficients, and many of the higher-frequency coefficients are often close to zero. Bit-rate reduction is achieved by not transmitting the higher-frequency elements, which have a high probability of not carrying useful information.
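This energy-compaction behaviour can be seen directly on a single 8 x 8 block. The sketch below is purely illustrative and is not part of the MPEG standard or of this project's pipeline; it assumes Octave with the image package (or MATLAB with the Image Processing Toolbox) for dct2/idct2, keeps only the low-frequency corner of the coefficient block, and reports how much energy and picture content survive.

% Illustrative sketch (assumption: dct2/idct2 come from the Octave image
% package or the MATLAB Image Processing Toolbox).
pkg load image;                              % Octave only; omit in MATLAB

% A smooth 8x8 block of pixel values: a gentle horizontal ramp plus noise.
block = repmat(linspace(64, 192, 8), 8, 1) + 8 * rand(8);

C = dct2(block);                             % 8x8 block of DCT coefficients

mask = zeros(8);                             % keep only the 4x4 low-frequency corner
mask(1:4, 1:4) = 1;
C_kept = C .* mask;

recon = idct2(C_kept);                       % rebuild from 16 of the 64 coefficients

printf('energy kept : %.1f%%\n', 100 * sum(C_kept(:).^2) / sum(C(:).^2));
printf('rms error   : %.2f grey levels\n', sqrt(mean((recon(:) - block(:)).^2)));

For smooth image content, almost all of the energy survives and the reconstruction error is small, which is precisely why discarding the high-frequency coefficients is an effective way to cut the bit rate.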


MPEG's first aim was to define a video coding algorithm for 'digital storage media', in particular CD-ROM. Very rapidly, the need for audio coding was added and the scope was extended from being targeted solely at CD-ROM to defining a 'generic' algorithm capable of being used by virtually all applications, from storage-based multimedia systems and television broadcasting to communications applications such as VoD and videophones. Both the MPEG-1 and MPEG-2 standards are split into three main parts: audio coding, video coding, and system management and multiplexing. MPEG itself is split into three main sub-groups, one responsible for each part, and a number of other sub-groups to advise on implementation matters, perform subjective tests, and study the requirements that must be supported.

It is neither cost effective nor an efficient use of bandwidth to support all the features of the standard in all applications. In order to make the standard practically useful and enforce interoperability between different implementations, MPEG has defined profiles and levels of the full standard. Roughly speaking, a profile is a sub-set of the full range of algorithmic tools suitable for a particular application, and a level is a defined range of parameter values (such as picture size) that is reasonable to implement and practically useful. There are as many as six MPEG-2 profiles, though only two are currently relevant to broadcasting: Main Profile, which is essentially MPEG-1 extended to take account of interlaced scanning and which encodes chroma as 4:2:0, and the professional profile, which has 4:2:2 chrominance resolution and is designed for production and post-production.

MPEG-2 makes extensive use of motion-compensated prediction to eliminate redundancy. The prediction error remaining after motion compensation is coded using the DCT, followed by quantisation and statistical coding of the remaining data. MPEG has two types of prediction. The so-called 'P' pictures are predicted only from pictures that are displayed before the current picture. 'B' pictures, on the other hand, are predicted from two pictures, one displayed earlier and one later. In order to do this non-causal prediction, the encoder has to reorder the sequence of pictures before sending them to the decoder, and the decoder has to return them to the correct display order. B-pictures add complexity to the system but also produce a significant saving in bit-rate. An important feature of the MPEG prediction system is the use of 'I' frames, which are coded without motion compensation. These break the chain of predictive coding so that channel switching can be done with a sufficiently short latency.

The most significant extension of MPEG-2 Main Profile over MPEG-1 is an improvement in the options that can be used to do motion-compensated prediction of interlaced signals. MPEG-1 treats each picture as a collection of samples from the same moment in time (frame-based coding). MPEG-2 understands interlace, in which samples within a frame come from two fields that may represent different moments in time. Therefore, MPEG-2 has modes in which the data can be predicted either using one motion vector giving an offset to a previous frame, or two vectors giving offsets to two different fields. MPEG-2 also extends this performance to allow:

• Multiple programs with independent time-bases
• Error-prone environments
• Remultiplexing
• Support for scrambling

2.2 Software to Extract Frames

To extract images (frames) from a video file, we used the following open-source software:

2.2.1 FFmpeg – a cross-platform tool, used here on Ubuntu

FFmpeg is a collection of software libraries that can record, convert and stream digital audio and video in numerous formats. It includes libavcodec, an audio/video codec library used by several other projects, and libavformat, an audio/video container mux and demux library. The name of the project comes from the MPEG video standards group, together with "FF" for "fast forward". The project is made up of several components:

• ffmpeg is a command-line tool to convert one video file format to another (a sample invocation is sketched after this list). It also supports grabbing and encoding in real time from a TV card.
• ffserver is an HTTP (with RTSP under development) multimedia streaming server for live broadcasts. Time-shifting of live broadcasts is also supported.
• ffplay is a simple media player based on SDL and the FFmpeg libraries.
• libavcodec is a library containing all the FFmpeg audio/video encoders and decoders. Most codecs were developed from scratch to ensure best performance and high code reusability.
• libavformat is a library containing demuxers and muxers for audio/video container formats.
• libavutil is a helper library containing routines common to different parts of FFmpeg.
• libpostproc is a library containing video postprocessing routines.
• libswscale is a library containing video image scaling routines.
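As an illustration of how ffmpeg is used to rip a video into individual images, the sketch below shells out to ffmpeg from Octave. The file names are placeholders and the exact command-line options are an assumption; they may differ between ffmpeg versions.

% Sketch: dump every frame of a video as numbered JPEG images using ffmpeg.
% 'input.mpg', the output pattern and the flags shown are assumptions.
infile = 'input.mpg';
outdir = 'frames';
mkdir(outdir);
cmd = sprintf('ffmpeg -i %s %s/frame_%%05d.jpg', infile, outdir);
status = system(cmd);                    % shells out to the ffmpeg binary
if status ~= 0
  error('ffmpeg exited with status %d', status);
end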

2.2.2 Frame Dumper – a simple tool for dumping frames as .bmp images on Windows

Frame Dumper supports all types of video sources, as long as they are recognized by Windows Media Player. It runs fast and reliably, and is a helpful tool for video-related research work. Frame Dumper is a command-line tool, and the usage is very simple:

Usage: FrameDumper.exe VideoName FromFrameID ToFrameID StepSize [TargetDir]

Parameters:
• VideoName: the source video file name (e.g., video.mpg). Use '\' for the path.
• FromFrameID: the starting frame number (indexed from 1).
• ToFrameID: the ending frame number (indexed from 1).
• StepSize: the step size during dumping; 1 for continuous dumping.
• TargetDir: optional; the dumping directory. Otherwise the current directory is used.


Chapter 3

Problems Faced By Amateurs


Problems Faced By Amateurs

As discussed in an earlier section, the basic problems that all amateurs or users of digital cameras have to deal with are primarily a result of the camera's autofocus and auto-exposure features. It is therefore essential to understand fully what the problems are and why they tend to be created by those features of digital cameras.

One of the most noticeable problems affecting amateur videos is the sudden transition in brightness between subsequent frames. This is especially alarming when the camera is moved from a region which is not sufficiently illuminated to a very bright area or a light source, or vice versa. The camera's auto-exposure algorithm takes some time to adjust, and meanwhile the images in the transition phase can look extremely bright or extremely dark.

3.1 Auto Exposure

Most of the digital cameras available on the market today provide a host of features, the most common being auto-exposure, auto-focus, auto white balance, image stabilizers, and so on. Exposure, put very simply, refers to the amount of light that the camcorder's lens collects; auto exposure is a system that controls the incoming light to prevent (among other things) over- or under-exposure. An over-exposed shot looks washed out and overly bright, while an under-exposed shot looks shadowy and dark.

The camcorder's exposure system regulates two things: iris and shutter speed. The iris diaphragm in the lens controls the amount of light admitted, while the electronic circuitry referred to as the shutter governs the amount of time the chip has to respond to the light. By manipulating the lens aperture (and sometimes the shutter speed), the camcorder does its best to deliver the optimum amount of light to the image-sensing chip regardless of lighting conditions. When you turn on the camera, special circuitry analyzes the amount of light hitting the chip during the 1/60 second it takes to form an image. If that amount is greater than the optimum value, which is usually the case, the circuitry calculates how much to "stop down" (close) the iris diaphragm that sits between the sensor and the light source. It then sends the appropriate command to the circuits controlling the servo motor of the diaphragm. The motor closes the diaphragm, light transmission falls to the ideal level, and the CCD forms a correctly exposed image.

3.1.1 The problems with auto exposure

Although the auto-exposure facility in video cameras saves a lot of hassle for the amateur, it has some obvious disadvantages. The two main problems associated with it are its slow response, and the lack of intelligence in the algorithms to interpret the image that the light forms, let alone determine which part of that image to expose properly.

It is slow because exposure adjustments require electro-mechanical operations, which take time. As a result, the system often cannot react fast enough to changing light to maintain a steady exposure. This problem is especially evident when panning onto someone who walks out of a relatively dark area into a brightly lit region.

The second major problem arises mainly because of the finite contrast ratio that most camcorders support. Contrast is the difference in brightness between the lightest and darkest parts of the image. It is expressed as a ratio, such as four to one, meaning that the brightest point in the image is four times as bright as the darkest. Within the system's contrast range, details in the light areas ("highlights") and dark areas ("shadows") are distinct and readable. Above and below that range, highlights "burn out" into blobs of pure white and shadows "block up" to solid black. Four to one is about as wide a contrast ratio as a video system can maintain. The problem is that we are generally faced with much higher contrast ratios in the real world, so the camcorder properly records those parts of the scene that fall within its contrast range and lets the others burn out or block up.

Many algorithms are used, from simplistic average-brightness methods to sophisticated weighted averaging that favours the centre portion of the frame. Auto-exposure modes attempt to make whatever is being metered 18% grey (middle grey). However, even the best of them suffer from problems, especially evident in videos panned across a wide brightness range.

3.2 Autofocus

Autofocus is a feature of some optical systems that allows them to obtain (and in some systems continuously maintain) correct focus on a subject, instead of requiring the operator to adjust the focus manually. It is a great time saver found in one form or another on most cameras today, and in most cases it helps improve the quality of the pictures we take. Autofocus (AF) could really be called power-focus, as it often uses a computer to run a miniature motor that focuses the lens for you. Focusing is the moving of the lens in and out until the sharpest possible image of the subject is projected onto the film. Depending on the distance of the subject from the camera, the lens has to be a certain distance from the film to form a clear image. In most modern cameras, autofocus is one of a suite of automatic features that work together to make picture-taking as easy as possible. These features include:

• Automatic film advance
• Automatic flash
• Automatic exposure

There are two types of autofocus systems: active and passive. Some cameras may have a combination of both types, depending on the price of the camera.

3.2.1 The problems with autofocus

The two main causes of blurred pictures taken with autofocus video cameras are:

• Mistakenly focusing on the background
• Moving the camera while recording the images

The human eye has a rather fast autofocus: hold your hand up near your face and focus on it, then quickly look at something in the distance, and the distant object appears clear almost immediately. Cameras, however, are not nearly this quick or this precise. Autofocus in a video camera is a passive system that also uses the central portion of the image. Though very convenient for fast shooting, autofocus has some problems:

• It can be slow to respond.
• It may search back and forth, vainly seeking a subject to focus on.
• It has trouble in low light levels.
• It focuses incorrectly when the intended subject is not in the center of the image.
• It changes focus when something passes between the subject and the lens.

Chapter 4

Post-processing Tools


Post-processing Tools

Now that there is a fair idea of the problems that we are dealing with, let us look closely at the current market scenario and the post-processing tools available. Attention is drawn to the cost of each of the tools listed below.

4.1 Apple's FCP

Final Cut Pro is a professional non-linear editing system developed by Apple Inc. The program has the ability to edit many digital formats. It is only available for Mac OS X version 10.4.9 or later, and is a module of the Final Cut Studio product. Final Cut Pro has found acceptance among professionals and a number of broadcast facilities because of its cost-effective efficiency as an off-line editor as much as a digital on-line editor. It is also very popular with independent and semi-professional film-makers.

Features:
• Broad format support
• Incredible real-time effects
• Comprehensive editing tools
• Expanded power as the hub of Final Cut Studio
• Open, extensible architecture
• Final Cut Pro 6 also supports mixed video formats (both resolution and frame rate) in the timeline, with real-time support.
• Cost: USD $1299.

4.2 Avid Media Composer

Media Composer is a non-linear editing system, released in 1989 on the Macintosh II as an offline editing system. Its features have grown to allow film editing, uncompressed SD video, and high-definition editing and finishing. Media Composer is Avid's primary editing software solution. The current version of Media Composer (MCSoft) has the following important features:
• Animatte
• 3D Warp
• Paint
• Live Matte Key
• Tracker
• Timewarps with motion estimation (FluidMotion)
• SpectraMatte (high-quality chroma keyer)
• Cost: USD $1822.

4.3 Autodesk Smoke

Autodesk Smoke covers projects such as films, commercials, long-form episodics, bumpers, and promos on a single system. Autodesk Smoke systems software is a solution for online editing and creative finishing, providing real-time, mixed-resolution interactivity for all types of creative editorial work and client-supervised sessions.

4.4 Autodesk Fire

Autodesk Fire is the industry benchmark for uncompressed, non-linear editing and finishing systems. Autodesk Fire systems software is used for creating visual effects. It provides interactive manipulation of high-resolution film in an advanced 3D DVE and compositing environment.

While the tools listed above have tremendous capabilities, they are found severely wanting in the following respects:

• The software itself is extremely expensive, with costs running into a few lakh INR, making it inaccessible to the general populace. In fact, one rarely finds users of this software outside of big video editing studios or workshops.
• Sufficient time and effort needs to be invested in learning and operating these tools. In effect, professionals who can take advantage of the array of tools available in them are needed to make economic sense and to do justice to the video itself.
• The process of analysing every frame for colour and brightness is itself rather tedious and takes many days for a few seconds of footage.
• The most important factor, besides the cost, is that the process is prone to the deficiencies of human perception. For example, even a professional judging two or three different frames for, say, brightness might introduce a small mismatch in the settings. If these errors are allowed to accumulate, he or she may need to redo the whole process after looking at the resultant video.

Therefore, our approach, discussed next, involves basic correction of the video's brightness and an improvement in its aesthetic appeal. An attempt has been made to ensure that the end user has little to do in terms of the actual correction itself, which requires a sophisticated automation of the whole process.


Chapter 5

Our Approach


Our Approach

We have, up to this point, discussed the problems that arise from using digicams and why it is difficult to rely on the video editing software already available in the market to help with the editing. We now discuss a different method to automate and improve the overall video quality, relying on the familiar procedure of post-processing.

Post-processing works on the raw video file obtained from the digicam. Usually, the areas that need substantial improvement arise due to motion of the camera, as the user moves it to frame different views. As a result, those parts of the video (or, more precisely, the set of frames involved in the transition) appear very unappealing when viewed in raw form: a particular set of frames will appear extremely dark or extremely bright, depending on the kind of transition. At the post-processing stage, we can use 'information' present in earlier or later frames to correct the bad ones (usually those belonging to the time interval of the transition itself). This is discussed below.

More often than not, when there is a drastic change in brightness as discussed above, it is due either to fast motion of the camera or to the digicam's inability to adjust its focus or exposure quickly enough. However, as the same scene is kept in view for a moderate amount of time after the transition, the camera eventually adjusts both its exposure and focus and obtains correct images. We can think of these good frames as containing a high amount of information; the transitory frames are lacking in this same information. The method implemented here uses the information from the good frames to improve and correct the bad frames.

Think of a rather simple example of a camera initially filming an open street in sunlight. The camera is then moved to focus on an object located inside an apartment. The apartment lighting is poorer than the bright sunlit exterior, and when the camera rests upon the object inside, it takes a finite amount of time to adjust to the ambient luminosity. In this transition period, there will be a number of frames of poor quality. In this case, the transitory frames will appear dark, as the exposure of the camera is initially set for the street's luminosity and the camera's aperture is therefore open only very slightly. For human eyes there is the same transition from brightness to relative darkness, except that our eyes adjust much faster to changing luminosity and focal distances. Digicams that are unable to do this produce videos that look unaesthetic and therefore unprofessional.

Figure 5.1: Frame 5 containing good information and frame 20 with poor information.

Note that in the example above, the object inside the apartment that the camera focuses upon is present both in the bad frames and in the good ones containing the information. Thus, what is proposed is to find the exact motion between these frames and restore or fill in information on a pixel-by-pixel basis. This does not mean simply pasting good frames over bad ones; the algorithm followed is considerably more involved and is described in the sections below.


Chapter 6

Block Diagram


Block Diagram

Figure 6.1: Block diagram. The processing pipeline: Input Video File → Extracting Frames (FFmpeg) → Frames (Images) → Image Processing (Motion Estimation, Quantifying Exposure, Quantifying Focus) → Image Correction → Compiling Frames into Video (together with the stored Audio File) → Enhanced Video Output.

The block diagram shown above details the procedure followed in order to automate the process. The MPEG file from the user's camera forms the input video file, which is split into frames using ffmpeg, discussed earlier. The individual frames, in .jpg format, are then processed. Correcting these images requires that the motion between frames be identified and used; similarly, the focus and exposure values of the frames are calculated. The final stage, after correction, simply recompiles the frames into a video, again using ffmpeg. So as to preserve the sound of the original video, the audio is stored separately at the start and used again when the final video is compiled. The resultant video is of improved quality, with the few sections that appeared bad now corrected. The following pages detail the entire procedure in greater depth.
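A minimal driver for the first and last blocks of the diagram might look like the sketch below. The ffmpeg options, the frame rate and the intermediate file names are assumptions (they vary with the ffmpeg version and the desired output format); the per-frame correction itself is the subject of the following chapters.

% Sketch of the outer pipeline: split the video, set the audio aside, and
% later recompile the corrected frames together with the original audio.
% All file names, the frame rate and the ffmpeg flags are assumptions.
fps = 30;

% Extract frames and keep the audio track (copied without re-encoding).
system('ffmpeg -i input.mpg frames/frame_%05d.jpg');
system('ffmpeg -i input.mpg -vn -acodec copy audio_only.mp2');

% ... per-frame correction runs here, writing corrected/frame_%05d.jpg ...

% Recompile the corrected frames and the saved audio into the output video.
cmd = sprintf(['ffmpeg -r %d -i corrected/frame_%%05d.jpg ', ...
               '-i audio_only.mp2 -acodec copy output.mpg'], fps);
system(cmd);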


Chapter 7

Motion Estimation


Motion Estimation

It should by now be clear why the exact motion between the frames obtained using ffmpeg needs to be calculated: it underpins the handling of the problems of focus, exposure and so on faced by amateur video makers. As we are using Ubuntu, which supports the open-source package Octave (similar to MATLAB), detecting motion through a matrix approach was the obvious choice. After going through the literature on motion detection, we implemented motion detection using the built-in correlation functions.

7.1 What is Correlation

In simple terms, the correlation of two signals is a measure of the similarity between them. Mathematically, the correlation of two discrete signals x and y can be written as $r_{xy}[k] = \sum_{n} x[n]\, y[n+k]$.

If we consider two images as n-dimensional vectors (where n is the number of pixels in each), the concepts of basic geometry can be applied to obtain an intuitive explanation of correlation. For example, consider two images of size N x N, which can be represented as N^2-dimensional vectors. The degree of similarity between these two vectors is the projection of one vector on the other, which can alternatively be viewed as the dot product (scalar product) of the two vectors. The correlation coefficient is found by dividing this correlation by the vector lengths, giving a normalized measure of the similarity between the two vectors: $\rho = \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$.


7.2 Algorithm for motion detection

Consider two consecutive images from the video. As the frame rate is generally 24 images per second, the motion between the two images is restricted to a few pixels. Considering this, the algorithm crops the second image by a few pixels on each side. The correlation is found by shifting image 2 over image 1, with the constraint that the correlation always be valid, i.e. computed over a complete overlap. This results in less overhead and comparatively decent results, since the motion is restricted to about a 5-pixel range, except for extreme cases of really fast motion. The maximum value of the ratio "cross-correlation / auto-correlation" determines the movement of the frames with respect to one another: the overlap at which this ratio is maximised is the 'best match', which in turn yields the x and y co-ordinates of the camera motion.

Figure 7.1: Illustration of the idea of correlation.

With these basic results, we strove to improve the algorithm, and after considerable trial and error we implemented the same idea using the filter function instead of the lengthy procedure of shifting the smaller image over the larger one. The filter function readily calculates the correlation of the images, and does so only over the overlap area when the 'valid' attribute is specified.


The filter function performs the following operation:

Y = filter2(h, X, shape)

It returns the part of the correlation specified by the shape parameter, a string with one of these values:

• 'full': returns the full two-dimensional correlation. In this case, Y is larger than X.
• 'same' (default): returns the central part of the correlation. In this case, Y is the same size as X.
• 'valid': returns only those parts of the correlation that are computed without zero-padded edges. In this case, Y is smaller than X.
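Putting these pieces together, a minimal version of the motion estimate might look like the following sketch. The frame file names are placeholders, 'margin' (the assumed maximum motion of about 5 pixels) is taken from the discussion above, and the cross-correlation/auto-correlation ratio is formed as described; this is a sketch of the idea rather than the project's production code.

% Sketch: estimate the (dy, dx) camera motion between two consecutive frames
% using filter2 with the 'valid' option.  File names are placeholders.
pkg load image;                          % for rgb2gray in Octave

f1 = double(rgb2gray(imread('frame_0001.jpg')));
f2 = double(rgb2gray(imread('frame_0002.jpg')));

margin = 5;                              % assumed maximum motion in pixels
tpl = f2(1+margin:end-margin, 1+margin:end-margin);   % cropped second frame

% Cross-correlation of the cropped frame against every valid offset in frame 1.
% (Slow on full frames; the project restricts this to a good block, see 7.4.)
xc = filter2(tpl, f1, 'valid');          % (2*margin+1) x (2*margin+1) result

% Auto-correlation term: the energy of frame 1 under the same window, so that
% bright regions do not dominate the match.
ac = filter2(ones(size(tpl)), f1.^2, 'valid');

score = xc ./ ac;                        % cross-correlation / auto-correlation
[~, idx] = max(score(:));
[row, col] = ind2sub(size(score), idx);
motion = [row, col] - (margin + 1);      % (dy, dx); [0 0] means no motion
printf('estimated motion: dy = %d, dx = %d\n', motion(1), motion(2));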


7.3 Test Results of the Motion Algorithm

To demonstrate the motion detection algorithm, a few test images were used to simulate a camera moving over a fixed background towards the bottom right.

Figure 7.2: Sequence of images showing the apparent motion between them.

The next logical step was to use this algorithm to prepare an entire database of the motion for a complete sequence of frames. The database so formed is useful while evaluating and enhancing further performance aspects of the video such as focus, exposure, jitter, etc. The graph below plots such a database for the test images above. As can readily be seen from the images, the camera pans towards the lower right, which is clearly reflected in the motion graph.

Figure 7.3: Plot of calculated motion, using the motion estimation algorithm.


7.4 Problems Faced and Their Solution

The main problem faced during motion detection was due to images like the one shown below. Most of the image is dark, with a grayscale value of 0, which results in highly inaccurate correlation coefficients. To overcome this, we used only "good regions" of the image to estimate the motion.

Figure 7.4: Frame with very poor information content.

Each image is divided into a number of blocks, say 16 (4 x 4). For each of these blocks, certain parameters are calculated, on the basis of which the quality of the block is defined. The best of these blocks is used, depending on the quality factor assigned to each of them. Motion is estimated only for this particular block, and the same motion is assumed for the entire frame.


Figure 7.5: Good block obtained from the entire frame.

Parameters used to define the quality factor:
• The mean of the block should be in the linear range (approximately grayscale values of 50-200).
• The variance of the block should not be too high, implying that most of its pixels are in the linear range.
• These parameters are considered for the r, g and b components separately.

This approach yields the best of the 16 blocks, which is then used to find the motion between the two frames in consideration (a sketch of one possible selection rule follows). This block selection was applied to every frame before using correlation to estimate the motion. The results obtained were very consistent with the visual perception of motion.
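One possible way to turn the parameters above into a concrete selection rule is sketched below. The scoring (a fixed bonus for a channel mean in the linear range, minus the channel variance) is an illustrative assumption, not the exact quality factor used in the project.

% Sketch: pick the "best" of the 16 (4x4) blocks of a frame using the mean and
% variance of each colour channel.  The scoring weights are assumptions.
function [best_block, best_idx] = pick_good_block(frame)
  frame = double(frame);                     % HxWx3 image
  [H, W, ~] = size(frame);
  bh = floor(H / 4);  bw = floor(W / 4);
  best_score = -Inf;  best_idx = 0;  best_block = [];
  for r = 1:4
    for c = 1:4
      blk = frame((r-1)*bh+1 : r*bh, (c-1)*bw+1 : c*bw, :);
      score = 0;
      for ch = 1:3                           % r, g and b treated separately
        v = blk(:, :, ch);
        in_range = mean(v(:)) >= 50 && mean(v(:)) <= 200;
        score = score + 1000 * in_range - var(v(:));
      end
      if score > best_score
        best_score = score;
        best_idx   = (r - 1) * 4 + c;
        best_block = blk;
      end
    end
  end
endfunction

The selected block, rather than the whole frame, is then passed to the correlation step of the previous section.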


Chapter 8

Large Space Matrix


Large Space Matrix – Obtained From Motion Estimation

At this juncture, it is clear how to accurately estimate the exact motion between two consecutive frames. This motion, in the form of x-y co-ordinates, can now be obtained for the entire series of frames that require correction, a few of which contain good information. The challenge is to use this motion mathematically and computationally so that a 'database' can be created consisting of these frames with their motion accounted for. The exact nature of this database is discussed now.

The entire spatial region which the camera has panned over in the course of its motion can be termed the 'global space' as far as the camera is concerned. A 640 x 480 frame, a part of this global space, is captured by the camera every 1/30th of a second (decided by the frame rate of the camera). This can be understood by analogy with how a panoramic picture is shot: a set of photos is taken one after the other, and the result is a larger photo with information from all the pictures after they have been stitched together (after compensating for motion). Similarly, our large space is defined by frames placed so that each one corresponds to one plane in a multi-dimensional array. The location of each frame in this large space is dictated by its motion from the previous frame. Shown below are some results to better illustrate the formation of this 'matrix':


Figure 8.1: Large space matrix with each frame as one dimension.

Figure 8.2: Combined frames after motion compensation.

When the frames are compensated for the motion between them, the matrix formed is as shown in the figure above. Mathematically, each of these pixel values is a number. The motion shown above is purely horizontal; in general, however, motion can exist in both directions.

Thus, the exact x-y motion co-ordinates for the entire series of frames to be corrected are used to form a database of these values. This database is in an easily usable format and is used in the next stage, brightness correction. The following information can be obtained very easily from the database or 'large space' matrix:

• The location of each point in global space, with respect to local camera frame co-ordinates.
• Information on when a pixel was introduced into the database.
• Information on the depth of each point, i.e., the number of frames for which the point stayed in the database.

Clearly, we are now able to determine how long (in terms of the number of frames) any single object persists in the video; this is what has been referred to earlier as the depth of a point. Further manipulation of this large matrix is carried out in the exposure correction stage discussed next.
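A sketch of how such a database can be built is given below: each frame is dropped into a large NaN-filled array at the position given by its cumulative motion, so the 'depth' of a point simply becomes the number of non-NaN entries along the third dimension. The frame count, file names, the use of grayscale and the assumption of integer offsets are all simplifications for the sake of illustration.

% Sketch: build the large space matrix from the per-frame motion estimates.
% 'motion' is assumed to be an (N-1) x 2 array of integer (dy, dx) offsets
% produced by the motion estimation stage; file names are placeholders.
pkg load image;                                % for rgb2gray in Octave

N = 20;                                        % number of frames (assumption)
motion = zeros(N - 1, 2);                      % fill in from the motion estimator

pos = [0 0; cumsum(motion, 1)];                % cumulative position of every frame
pos = pos - min(pos, [], 1) + 1;               % shift so all indices start at 1

f1 = double(rgb2gray(imread('frame_0001.jpg')));
[h, w] = size(f1);
H = max(pos(:, 1)) + h - 1;
W = max(pos(:, 2)) + w - 1;

large = nan(H, W, N);                          % the large space matrix, NaN = no data
for k = 1:N
  f = double(rgb2gray(imread(sprintf('frame_%04d.jpg', k))));
  large(pos(k,1):pos(k,1)+h-1, pos(k,2):pos(k,2)+w-1, k) = f;
end

depth = sum(~isnan(large), 3);                 % how many frames saw each global point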


Chapter 9

Exposure Correction


Exposure Correction

The auto-exposure algorithm implemented in a digital camera takes some finite time to adjust the camera aperture in accordance with the changing brightness of the surroundings. This leads to a loss of data in the intermediate frames when the camera moves from a bright area to a relatively darker one, or vice versa. In other words, the brightness of the intermediate frames decreases or increases, making the video unpleasant to our eyes, which adapt to brightness changes much more quickly.

9.1 A quantitative measure of the exposure in an image

From the earlier concept of an image represented as an N-dimensional vector, we can intuitively understand that the energy contained in a signal is equal to the sum of the squares of each of its discrete values over the entire range. For an image, this yields the energy of the image, which can be considered a measure of the brightness of the image. The camera exposure is highly correlated with the total brightness observed in the image.

9.2 Plotting of the brightness curve

With the above definition of the brightness of an image, we implemented a simple algorithm to plot the brightness over a series of images in a video sequence. As expected, this curve shows drastic swings for videos where the camera is panned from dimly lit to brightly lit areas in a short time. The curve enables us to zero in on the portion of the video which needs brightness correction.
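A minimal version of this plot, assuming the frames have already been dumped as numbered JPEG files, is sketched below; the file-name pattern and frame count are placeholders.

% Sketch: plot the per-frame energy (sum of squared pixel values), used here as
% the brightness measure.  Frame count and file names are placeholders.
N = 60;
energy = zeros(N, 1);
for k = 1:N
  f = double(imread(sprintf('frame_%04d.jpg', k)));
  energy(k) = sum(f(:) .^ 2);                 % energy of frame k
end
plot(1:N, energy);
xlabel('frame number');
ylabel('frame energy');
title('Brightness curve of the sequence');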


The typically observed energy pattern of the frames is shown below.

Figure 9.1: Energy pattern of the frames.

We propose to make this transition as smooth as possible.

Figure 9.2: Correction needed in the energy pattern of the frames (old brightness curve and the corrected portion of the curve).


9.3 Approach 1: The transfer curve method

• Treat each frame as a whole.
• Increase or decrease the brightness of the frame to make the brightness curve as smooth as possible.

The simplest ways we attempted to change the brightness of an image were scaling the total brightness, applying a transfer curve, and so on. The problems with these algorithms were quickly evident: scaling resulted in a loss of information, as shown below, and though the transfer curve method showed better results, it too had many problems. We tried transfer curve functions such as the one shown below, to boost the dark regions of the image and reduce the intensity in the bright regions. However, these also resulted in a loss of information, especially since cameras treat all values above a certain threshold as 255 (the maximum brightness value). Due to this limited range, the transfer curve method tends to reduce the intensity of portions that should not really be scaled at all, e.g. the tubelight region in the figures below.


Figure 9.3: Results of the transfer curve method (transfer curve mapping input intensity I/P to output intensity O/P, which saturates at 255).

This approach assumes that the brightness curve followed by each and every pixel in a frame is the same. However, this is not true: the value (brightness) of every pixel changes individually, and hence each pixel should be treated individually.
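For completeness, the sketch below shows one simple instance of a whole-frame transfer curve of the kind criticised above: a gamma-style mapping that lifts dark pixels and compresses bright ones. The curve shape (gamma = 0.6) and the file name are illustrative assumptions, not the exact curve used in the experiments.

% Sketch of a whole-frame transfer curve: output = 255 * (input/255)^gam.
% gam < 1 boosts dark regions; values at or near 255 stay saturated, which is
% exactly the loss of information discussed above.
img = double(imread('dark_frame.jpg')) / 255;   % placeholder file name
gam = 0.6;                                      % illustrative curve shape
out = uint8(255 * img .^ gam);                  % apply the curve per pixel
imwrite(out, 'dark_frame_curve.jpg');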


9.4 Approach 2

• Treat every pixel separately.
• Smooth the brightness curve of every individual pixel.

Figure 9.4: Frames 1, 3, 5, 7, 9 and 11, showing the variation of pixel intensity in time before the auto exposure of the camera settles.

The above images demonstrate that each pixel needs to be treated individually for correction. The central tube-light region in these frames remains at the same grayscale value of 255 (saturated) from frame 1 right through frame 11. The peripheral region, however, starts out at a saturated value of 255 and ends in frame 11 at a much lower value of around 180. If we plot the brightness curves for these two regions, they will be quite different from each other, and consequently the correction to be applied must differ for these pixels. This happens because the actual brightness of the tube light is much higher than that of the peripheral region, but the camera cannot capture this difference at a high exposure setting. Thus both regions appear saturated until the auto-exposure of the camera settles to the correct value for the frame.

Figure 9.5: Independent correction curves of different pixels (e.g. pixel 1 and pixel 5).

Thus the constraints on the algorithm to obtain new exposure values are:

• The difference between the old exposure values and the new exposure values should be minimal. This constraint makes sure there is no drastic difference between the changed value and the old value. Otherwise the optimization would simply return a straight line, which is the smoothest possible curve in the time domain but bears no resemblance to the original images, yielding frames that look either too bright or too dark and also introducing colour mismatch problems.
• The difference between two consecutive values of the new exposure should be minimal. This constraint is responsible for the smoothing of the brightness curve.


9.5 Approach 3

Testing of the previous approach revealed the problem of 'banding', which we had not foreseen.

Figure 9.6: Corrected image showing the banding effect (correction curves of pixel 1 and pixel 20 plotted against global time).

The algorithm starts correcting a pixel only when that pixel is introduced into the global matrix. In the above example, pixel 1 is introduced in the first frame itself, whereas pixel 20 is not introduced until the third frame. Since the algorithm places no constraint between neighbouring pixels, the correction for a pixel and its neighbour can differ. Every time a group of pixels is introduced, their correction starts from that frame, so there may be a considerable difference in the correction of neighbouring pixels if one of them is introduced late in global time. This results in the banding structure that can be observed in the corrected frame shown above. Thus an additional constraint on the algorithm for the new exposure values:

• The new exposure values should be minimally deviant in space.


9.6 Algorithm: calculating the new exposure curve for each new pixel

As explained earlier, every new exposure value is subject to three constraints:

• The difference between the old exposure values and the new exposure values should be minimal.
• The difference between two consecutive values of the new exposure should be minimal.
• The new exposure values should be minimally deviant in space.

We now have to solve a system

A × X = B

where:

• X: the new exposure value for every pixel, obtained by solving the system (e.g. with the pseudo-inverse). Order of X: n x 1, where n is the total number of pixels in the 3D matrix.
• A: the constraint matrix. Order of A: (n * number of constraints) x n.
• B: a column matrix holding the ideal (target) values. Order of B: (n * number of constraints) x 1.

As we can see, there are several constraints on each single exposure value; that is, the number of equations is greater than the number of variables. This results in an over-determined system, which can be solved in the least-squares sense using the pseudo-inverse. In practice, the equation is solved using Octave's built-in matrix left division.


9.7 Backslash or matrix left division

If A is a square matrix, A\B is roughly the same as inv(A)*B, except that it is computed in a different way. If A is an n-by-n matrix and B is a column vector with n components (or a matrix with several such columns), then X = A\B is the solution to the equation AX = B computed by Gaussian elimination. A warning message is displayed if A is badly scaled or nearly singular.

If A is an m-by-n matrix with m ~= n and B is a column vector with m components (or a matrix with several such columns), then X = A\B is the solution, in the least-squares sense, to the under- or over-determined system of equations AX = B. The effective rank k of A is determined from the QR decomposition with pivoting. A solution X is computed that has at most k nonzero components per column. If k < n, this is usually not the same solution as pinv(A)*B, which is the least-squares solution with the smallest norm.
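A tiny over-determined example, with arbitrary numbers, illustrates the behaviour described above: for a full-rank A the two solutions coincide, and they differ only when A is rank deficient.

% Sketch: least-squares solution of an over-determined system, via backslash
% and via the pseudo-inverse.  The numbers are arbitrary.
A = [1 0; 0 1; 1 1; 1 -1];        % four equations, two unknowns
B = [1; 2; 3.1; -0.9];

x_backslash = A \ B;              % QR-based least-squares solution
x_pinv      = pinv(A) * B;        % Moore-Penrose pseudo-inverse solution

disp([x_backslash, x_pinv]);      % identical (up to rounding) since A has full rank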


9.8 Pseudo-inverse

The inverse $A^{-1}$ of a matrix A exists only if A is square and has full rank. In this case, $A X = B$ has the solution $X = A^{-1} B$. A pseudo-inverse is a matrix inverse-like object that may be defined for any complex matrix, even one that is not square; for a given matrix it is possible to define many pseudo-inverses. If A has full column rank n, we define

$A^{+} = (A^{T} A)^{-1} A^{T}$

and the solution of $A X = B$ is $X = A^{+} B$. More precisely, this is the Moore-Penrose pseudo-inverse.

Calculation

The best way to compute $A^{+}$ is to use the singular value decomposition. With

$A = U S V^{T}$,

where U (m x m) and V (n x n) are orthogonal and S (m x n) is diagonal with real, non-negative singular values, we find

$A^{+} = V (S^{T} S)^{-1} S^{T} U^{T}$.

If the rank r of A is smaller than n, the inverse of $S^{T} S$ does not exist, and one uses only the first r singular values; S then becomes an r x r matrix and U, V shrink accordingly.
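The two expressions for the pseudo-inverse above can be checked numerically on a random full-column-rank matrix, as in the sketch below; the sizes are arbitrary.

% Sketch: verify A+ = (A'A)^(-1) A'  and  A+ = V (S'S)^(-1) S' U'  against pinv
% for a random matrix with full column rank.
m = 6;  n = 3;
A = rand(m, n);                            % full column rank with probability 1

[U, S, V] = svd(A);                        % A = U * S * V'
A_plus_svd    = V * ((S' * S) \ S') * U';
A_plus_normal = (A' * A) \ A';

disp(norm(A_plus_svd    - pinv(A)));       % both differences should be ~1e-15
disp(norm(A_plus_normal - pinv(A)));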


9.9 Implementation of the algorithm

Here we form the constraint matrix A and the ideal-value matrix B.

1. For each new pixel, find how many frames it exists in and what its old exposure values are.

2. To implement the first constraint:

$e^{new}_{i,j,n} = e^{old}_{i,j,n}$

• Convert all the pixel values of the 3D matrix (obtained from the motion detection algorithm) into a column vector. This column vector forms the first [n x 1] part of matrix B.
• An [n x n] identity matrix forms the first part of matrix A, implementing the above equation for every pixel of the 3D matrix. For a 2 x 2 block in frame 0, for example:

$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} e^{new}_{1,1,0} \\ e^{new}_{1,2,0} \\ e^{new}_{2,1,0} \\ e^{new}_{2,2,0} \end{bmatrix} = \begin{bmatrix} e^{old}_{1,1,0} \\ e^{old}_{1,2,0} \\ e^{old}_{2,1,0} \\ e^{old}_{2,2,0} \end{bmatrix}$

3. To implement the second constraint:

$e^{new}_{i,j,n} = e^{new}_{i,j,n-1}$

• The part of matrix B corresponding to the second constraint is a column vector of zeroes.
• Matrix A is formed so as to implement the above equation for every possible i, j. Every row of this part of A contains a 1 and a -1 at the positions corresponding to $e^{new}_{i,j,n}$ and $e^{new}_{i,j,n-1}$, implementing $e^{new}_{i,j,n} - e^{new}_{i,j,n-1} = 0$. For a single pixel tracked over four frames:

$\begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} e^{new}_{1,1,0} \\ e^{new}_{1,1,1} \\ e^{new}_{1,1,2} \\ e^{new}_{1,1,3} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$

4. To implement the third constraint:

$e^{new}_{i,j,n} - e^{new}_{i+1,j,n} = 0$
$e^{new}_{i,j,n} - e^{new}_{i,j+1,n} = 0$

• Matrix B is again a column of zeroes, equal in number to the equations being implemented.
• Every row of this part of matrix A contains a 1 and a -1 at the locations corresponding to either $e^{new}_{i,j,n}$ and $e^{new}_{i+1,j,n}$, or $e^{new}_{i,j,n}$ and $e^{new}_{i,j+1,n}$, depending on the equation being implemented, for all possible values of i, j and n. For a 2 x 2 block over four frames this yields sixteen such rows acting on the sixteen unknowns $e^{new}_{1,1,0}, \ldots, e^{new}_{2,2,3}$, with the right-hand side a column of sixteen zeroes.

5. Once we have formed matrix A and B, X can be obtained by pseudo-inverse.
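The construction above can be made concrete on a toy example: a 2 x 2 block of pixels tracked over four frames, with a sudden brightness jump between frames 2 and 3. The old exposure values below are invented for illustration, and the dense loops would be replaced by sparse matrices for real frame sizes; this is a sketch of the constraint assembly, not the project's full implementation.

% Sketch: build A and B for a 2x2 block of pixels over T = 4 frames and solve
% A*X = B in the least-squares sense.  The old exposure values are invented.
ny = 2;  nx = 2;  T = 4;
n  = ny * nx * T;                                  % number of unknowns

% Old exposure values with a jump between frames 2 and 3.
eold = cat(3, 80*ones(ny,nx), 82*ones(ny,nx), 180*ones(ny,nx), 182*ones(ny,nx));
idx  = @(i, j, t) sub2ind([ny nx T], i, j, t);     % position of enew(i,j,t) in X

% Constraint 1: enew = eold  (identity block, old values on the right-hand side).
A1 = eye(n);
B1 = eold(:);

% Constraint 2: enew(i,j,t) - enew(i,j,t-1) = 0  (temporal smoothness).
A2 = [];
for t = 2:T
  for j = 1:nx
    for i = 1:ny
      row = zeros(1, n);
      row(idx(i, j, t))     =  1;
      row(idx(i, j, t - 1)) = -1;
      A2 = [A2; row];
    end
  end
end
B2 = zeros(size(A2, 1), 1);

% Constraint 3: differences between spatial neighbours are zero.
A3 = [];
for t = 1:T
  for j = 1:nx
    for i = 1:ny
      if i < ny
        row = zeros(1, n);  row(idx(i,j,t)) = 1;  row(idx(i+1,j,t)) = -1;
        A3 = [A3; row];
      end
      if j < nx
        row = zeros(1, n);  row(idx(i,j,t)) = 1;  row(idx(i,j+1,t)) = -1;
        A3 = [A3; row];
      end
    end
  end
end
B3 = zeros(size(A3, 1), 1);

A = [A1; A2; A3];
B = [B1; B2; B3];
X = A \ B;                                         % least-squares solution
enew = reshape(X, [ny nx T]);
disp(squeeze(enew(1, 1, :))');                     % smoothed curve of pixel (1,1)

In practice the rows of A can also be weighted to trade off fidelity against smoothness; the weighting used in this project is discussed in a later chapter.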


Chapter 10

Focus


Focus

10.1 Quantifying focus: how to quantify the overall focus quality of an image

In our effort to understand how the focus of a video varies as images are taken in rapid succession, we must be able to somehow quantify the focus of an image. Only then is it possible to decide whether an image is in good focus or not. This is a rather difficult problem, as the focus quality of an image is not a physically measurable quantity. What comes to our rescue is the fact that well-focused images have a very clearly defined set of boundaries: in good-focus images there is a very fine distinction in the details of the objects, which points to high-frequency components at the edges. A measure of these high-frequency components therefore gives a decent estimate of the image's focus quality (a sketch of one such measure follows this list). We have worked on two methods for this frequency analysis:

• The Fast Fourier Transform (FFT) for 2D images
• The Discrete Cosine Transform (DCT) for 2D images
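As a concrete illustration of the idea, the sketch below computes one possible focus measure: the fraction of the image's spectral energy lying outside a small low-frequency disc. The cutoff radius and the file name are assumptions; the focus factor actually used in the project is developed in the following sections.

% Sketch: a simple focus factor based on the share of high-frequency energy in
% the image spectrum.  The cutoff radius is an illustrative assumption.
pkg load image;                                 % for rgb2gray in Octave

img = double(rgb2gray(imread('frame_0001.jpg')));
F   = fftshift(fft2(img));                      % spectrum with DC at the centre
P   = abs(F) .^ 2;                              % power spectrum

[h, w] = size(P);
[X, Y] = meshgrid(1:w, 1:h);
r = sqrt((X - (w/2 + 1)).^2 + (Y - (h/2 + 1)).^2);
cutoff = 0.05 * min(h, w);                      % radius of the low-frequency disc

focus_factor = sum(P(r > cutoff)) / sum(P(:));  % small = blurred, larger = sharper
printf('focus factor: %.4f\n', focus_factor);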

Let us look at both these functions a little more closely. 10.2

Methods to measure the frequency components: DCT and FFT

10.2.1 FFT A fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. FFTs are of great importance to a wide variety of applications, from digital signal processing and solving partial differential equations to algorithms for quick multiplication of large integers. Let x0, ...., xN-1 be complex numbers. The DFT is defined by the formula - 51 -

Evaluating these sums directly would take O(N2) arithmetical operations. An FFT is an algorithm to compute the same result in only O(N log N) operations. In general, such algorithms depend upon the factorization of N, but (contrary to popular misconception) there are FFTs with O(N log N) complexity for all N, even for prime N.

Many FFT algorithms depend only on the fact that $e^{-2\pi i / N}$ is a primitive N-th root of unity, and thus can be applied to analogous transforms over any finite field, such as number-theoretic transforms. Since the inverse DFT is the same as the DFT, but with the opposite sign in the exponent and a 1/N factor, any FFT algorithm can easily be adapted for it as well.

The demonstration in Figure 10.1 shows the FFT of a real image and its basis functions. Note that in those images, u* and v* are the coordinates of the pixel selected with the red cross on F(u,v); the blue cross contributes to the same frequency.
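Separately from that demonstration, a minimal sketch of computing and viewing such a 2D FFT in MATLAB/Octave is shown below. The file name 'scene.png' is a placeholder assumption, and rgb2gray/im2double assume the Image Processing Toolbox (or Octave's image package):

```matlab
% Illustrative sketch: 2D FFT of a real image, magnitude shown on a log scale.
img = im2double(rgb2gray(imread('scene.png')));   % luminance of the image
F = fftshift(fft2(img));                          % shift the zero-frequency component to the centre
imagesc(log(1 + abs(F)));                         % log magnitude keeps high frequencies visible
colormap(gray); axis image;
title('log |F(u,v)|, low frequencies at the centre');
```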


Figure 10.1: Demonstration of FFT transforms.

10.2.2 DCT

A discrete cosine transform (DCT) is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are common.


The most common variant of the discrete cosine transform is the type-II DCT, which is often called simply "the DCT"; its inverse, the type-III DCT, is correspondingly often called "the inverse DCT" or "the IDCT". The DCT, and in particular the DCT-II, is often used in signal and image processing, especially for lossy data compression, because it has a strong "energy compaction" property: most of the signal information tends to be concentrated in a few low-frequency components of the DCT, approaching the Karhunen-Loève transform (which is optimal in the decorrelation sense) for signals based on certain limits of Markov processes.

In two dimensions, the DCT-II is computed on N × N blocks. In the case of JPEG compression, N is typically 8 and the DCT-II formula is applied to each row and column of the block. The result is an 8 × 8 transform coefficient array in which the (0,0) element (top-left) is the DC (zero-frequency) component, and entries with increasing vertical and horizontal index values represent higher vertical and horizontal spatial frequencies.

Like any Fourier-related transform, discrete cosine transforms express a function or a signal in terms of a sum of sinusoids with different frequencies and amplitudes. Like the DFT, a DCT operates on a function at a finite number of discrete data points. The obvious distinction between a DCT and a DFT is that the former uses only cosine functions, while the latter uses both cosines and sines (in the form of complex exponentials). However, this visible difference is merely a consequence of a deeper distinction: a DCT implies different boundary conditions than the DFT or other related transforms. The Fourier-related transforms that operate on a function over a finite domain, such as the DFT, the DCT or a Fourier series, can be thought of as implicitly defining an extension of that function outside the domain. That is, once you write a function f(x) as a sum of sinusoids, you can evaluate that sum at any x, even for x where the original f(x) was not specified. The DFT, like the Fourier series, implies a periodic extension of the original function. A DCT, like a cosine transform, implies an even extension of the original function.

The formula for the 2D DCT (type II) of an N × N block f(x, y) is

$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)\,\cos\!\left[\frac{\pi(2x+1)u}{2N}\right] \cos\!\left[\frac{\pi(2y+1)v}{2N}\right],$$

where $\alpha(0) = \sqrt{1/N}$ and $\alpha(u) = \sqrt{2/N}$ for $u > 0$.

Four example blocks in spatial and frequency domains are shown in Figure 10.2 (spatial domain on the left, frequency domain on the right).

Figure 10.2: Image converted to its frequency domain.

Inverse Discrete Cosine Transform

To rebuild an image in the spatial domain from the frequencies obtained above, we use the IDCT formula:

$$f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, C(u,v)\,\cos\!\left[\frac{\pi(2x+1)u}{2N}\right] \cos\!\left[\frac{\pi(2y+1)v}{2N}\right].$$
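As a minimal sketch of this forward/inverse pair (our own illustrative code, assuming dct2/idct2 from MATLAB's Image Processing Toolbox or Octave's image package):

```matlab
% Illustrative sketch: 2D DCT of an 8x8 block and its reconstruction, as used in JPEG.
block = magic(8);                          % any 8x8 block of sample values
C = dct2(block);                           % C(1,1) holds the DC (zero-frequency) coefficient
reconstructed = idct2(C);                  % rebuild the block in the spatial domain
max(abs(reconstructed(:) - block(:)))      % close to zero, up to round-off error
```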


This demonstration shows the DCT of an image:

Figure 10.3: Demonstration of 2D DCT transforms. u* and v* are the coordinates of the pixel selected with the red cross on C(u,v).


10.3 Practically applying the above concepts

Carrying out the DCT and FFT operations on images yielded a map of their frequency components. As expected, there was a larger concentration of signal strength in the low-frequency components. This is an obvious outcome, as these components represent most of the image's uniform regions. Thus, when quantifying the focus, we assigned a low weight to these components; we are more interested in the high-frequency components, which were accordingly assigned a higher weight. For 2D FFT calculations, the low-frequency components are shifted to the centre and weighted accordingly. The DCT's low-frequency components reside in the upper-left corner, and the distance from the origin, i.e. u = v = 0, was taken as the weight associated with each frequency component. A simple summation then yields the focus factor. In order to normalize the operation for different amplitudes and slightly varying images, the focus factor was divided by the sum of the amplitudes of all the frequency components of the respective transform. A minimal sketch of this computation follows, along with some sample images for the MATLAB code written to estimate the focus quality.
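The sketch below is our own illustration of the DCT-based weighting scheme described above, not the project's original MATLAB code; dct2 and rgb2gray assume the Image Processing Toolbox (or Octave's image package):

```matlab
% Illustrative DCT-based focus factor: weight each coefficient by its distance
% from the (0,0) DC term, then normalize by the total coefficient amplitude.
function f = focus_factor(img)
  if size(img, 3) == 3
    img = rgb2gray(img);                      % work on the luminance component
  end
  C = abs(dct2(double(img)));                 % magnitude of the 2D DCT coefficients
  [V, U] = meshgrid(0:size(C, 2) - 1, 0:size(C, 1) - 1);
  W = sqrt(U.^2 + V.^2);                      % weight = distance from u = v = 0
  f = sum(sum(W .* C)) / sum(C(:));           % weighted sum, normalized
end
```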

Figure 10.4: Images with varying focus.

The images shown above are numbered 1 to 6. The first two images represent pictures in bad focus, whereas the 4th and 6th images are of particularly good quality. Both a 2D FFT and a 2D DCT of all the images were taken, by first taking the luminance component of the RGB images and then performing the required operations. Shown below is what the DCT and the FFT look like for the last image. Please note that the absolute value of the 2D DCT or FFT is plotted on a log scale.

Figure 10.5: FFT of an image.

This is the FFT of the sixth image shown above, shifted to the centre: the low-frequency components are located in the centre. Shown below is the 2D DCT for the same image. The top-left corner has high-valued low-frequency components, whereas the other corners represent the high-frequency components in their respective directions.


Figure 10.6: DCT of an image.

Figure 10.7: Focus factor using FFT.

Shown above is the output obtained when the 2D FFT operation is performed on the six images in sequence; note the variation of the focus factor across the sequence. The output for the 2D DCT is shown below.

Figure 10.8: Focus factor using DCT.


Chapter 11

Actual Correction


Actual Correction

Figure 11.1: Old and new exposure curves and original pixel values.

Consider a series of images whose exposure values are given by the red curve in the diagram above; the new exposure curve obtained is given by the blue curve. Now, for a pixel whose values change as 10, 35, 50, 75, … along the global time axis, the ideal value of that pixel at a particular instant of time will be a function of the old exposure curve, the new exposure curve, and the future and past values of that pixel. There is a need at this juncture, then, to decide exactly how much weightage should be given to this global pixel's value in different frames, i.e. on the global time axis. The implementation of the program becomes substantially simpler if we treat this example mathematically, as follows. Let:

• the initial (actual) value of the pixel be I,

• the exposure value be E,

• the new exposure value be Enew,

• a temporary variable be temp,


• and the resultant value be R.

Mathematically, R can be obtained in two steps. First,

$$temp(t) = \frac{1}{N} \sum_{K} \frac{I(t+K) \times E(t+K)}{E(t)},$$

where K ranges over the depth of that pixel in the 3D global matrix, that is, K takes all future and past values of that particular pixel, and N is a normalizing factor,

$$N = \sum_{K} I(t+K).$$

The temp variable thus obtained corrects the pixel value under the assumption that the past and future values of the pixel are a function of the old exposure curve only, and that they are correct. However, this may not be true: the actual pixel value should be a function of the new exposure curve as well. Mathematically,

$$R(t) = \frac{1}{N} \sum_{K} \frac{W(t+K) \times temp(t) \times Enew(t+K)}{Enew(t)},$$

where K again ranges over the depth of that pixel in the 3D global matrix (all future and past values of that particular pixel), N is the same normalizing factor $N = \sum_{K} I(t+K)$, and W is a weight factor that represents how much trust we should put in the value obtained at that particular instant.
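A minimal sketch of this two-step correction for a single pixel might look as follows in MATLAB/Octave. The data layout is our own assumption (a 3D global matrix P of pixel values, with one exposure value per frame in E and Enew and per-frame weights W), not the project's actual code:

```matlab
% Illustrative per-pixel exposure correction following the two formulas above.
% P(i,j,k): 3D global matrix of pixel values; E, Enew, W: 1xK row vectors over the frames.
function R = correct_pixel(P, E, Enew, W, i, j, t)
  v = squeeze(P(i, j, :))';                        % all past and future values of this pixel
  N = sum(v);                                      % normalizing factor
  temp = sum(v .* E ./ E(t)) / N;                  % correction against the old exposure curve
  R = sum(W .* temp .* Enew ./ Enew(t)) / N;       % correction against the new exposure curve
end
```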


11.1 Calculation of weights

When we try to change the value of a pixel at a particular instant t, we correct it using the old and new exposure values of that pixel at future and past instants. In this process we cannot guarantee that the future and past pixel values will always be correct, so we have to assign some weightage to these calculated values. The weight assigned is a function of:

1. Value of the pixel

• In image processing, the pixel values that fall in the linear range are supposed to contain maximum information. If the pixel value is within the linear range, a higher weight is assigned. On the other hand, if it is too low (noise) or too high (saturated), a lower weight is assigned.

2. Distance from the pixel

• If the value of k is small, that is, if the pixel is near the original pixel on the time axis, the signal values are more likely to be similar, so we should trust these pixels more.

• If the value of k is large, that is, if the pixel is far from the original pixel on the time axis, the signal value is more likely to have changed, so we put less trust (weight) in these pixel values.

3. Focus quality

• If the focus quality of the frame containing the pixel is higher, a higher weight is assigned to that pixel.

Thus we get a corrected value for each pixel. This process is repeated for every pixel of each and every frame of the video to give the corrected video. A possible form of such a weight function is sketched below.
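The report does not specify the exact weight formulas, so the following is only an illustrative sketch; the linear-range bounds, the decay constant, and the multiplicative combination of the three factors are all our own assumptions:

```matlab
% Illustrative weight function combining the three factors described above.
function w = pixel_weight(value, k, focus)
  w_value = double(value > 20 & value < 235);   % assumed linear range: penalize noise/saturation
  w_dist  = exp(-abs(k) / 5);                   % assumed decay with temporal distance k
  w_focus = focus;                              % higher focus factor -> higher weight
  w = w_value * w_dist * w_focus;
end
```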


Chapter 12

Results


Results

Figure 12.1: First Frame

Figure 12.2: First Frame After Correction

Figure 12.3: Second Frame

Figure 12.4: Second Frame After Correction

Figure 12.5: Third Frame

Figure 12.6: Third Frame After Correction

Figure 12.7: Fourth Frame

Figure 12.8: Fourth Frame After Correction

Figure 12.9: Final Frame

Figure 12.10: Final Frame After Correction

The results for a set of frames are shown above. For the first image in particular, there is a rather stark improvement in brightness, while still not immediately jumping to the level of the last images. Subsequent images also show an improvement over the original ones. Note, however, that even the corrected images increase in brightness at a reasonable gradient, so there is no sudden peak in brightness, which might again have looked unaesthetic.


To really understand what has transpired here, attention is drawn to the first pair of original and corrected images. Rather than simply being a case of increased brightness, there is a substantial increase in information content as well: the bag and some details of the clothes now appear clearly right from the first corrected frame. Hence there has been a transplant of information itself, from later frames where the information is clearly present to the earlier frames.


Chapter 13

Conclusion and Future Scope


Conclusion and Future Scope

At this stage, the initial goals laid down have been achieved with good results. In the results section, the improvement in the frames is clear for all to see. Also, the entire process needed very little intervention from the user, and almost no error due to human perception was introduced.

What has effectively been achieved here may be explained in a different way. One may think of the video obtained after post-processing as having been taken directly by an imaginary camera. The feature of this camera is that it sets a different shutter time, i.e. allows for a different exposure value, at each and every pixel. This is in stark contrast to a normal camera, where there is only one exposure setting for the entire frame. Consider a scenario where we are panning the camera across a window looking out onto the bright exterior of the building, half covered by a curtain with designs on it. Ordinarily, a camera would simply adjust to the bright light and keep a short shutter time, which would leave the curtain details in the dark. Imagine, however, that the camera pans to reveal those details in later frames. The algorithm applied in this project would be able to rectify even those frames where the sunlit side of the window appears exceptionally bright. Hence we might even be able to see the window brightly lit on one side, and the curtain in all its glory on the other.

As this project was meant to serve as a proof of concept for a larger, more complicated system, it has served its purpose. The opportunity to improve upon this work, however, is great, and would probably involve a complete automation of the entire process, with even the process of identifying the bad frames done by the program.


At the cost of complexity and time, we can also make the motion detection more accurate by considering factors such as rotation, foreshortening and zoom. Extending the same idea to colour processing will also require reducing the time expended by this algorithm, as it would involve additional constraints.


Chapter 14

References


References

• Berthold K. P. Horn and Brian G. Schunck, "Determining Optical Flow", Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

• Andrés Bruhn, Joachim Weickert, Timo Kohlberger, and Christoph Schnörr, "Discontinuity-Preserving Computation of Variational Optic Flow in Real-Time".

• Bernd Jähne, Digital Image Processing.

• Rafael C. Gonzalez (University of Tennessee) and Richard E. Woods (MedData Interactive), Digital Image Processing.

• Al Bovik, Handbook of Image and Video Processing.

• William K. Pratt, Digital Image Processing.

• Weisstein, Eric W., "Pseudoinverse", from MathWorld -- A Wolfram Web Resource.

• www.wikipedia.com

• ffmpeg.mplayerhq.hu

• www.mathworks.com

• www.octave.org

• www.google.com

