DEVELOPMENT OF A CHILD DETECTION SYSTEM WITH ARTIFICIAL INTELLIGENCE (AI) USING OBJECT DETECTION METHOD
Lai Suk Na
Bachelor of Engineering with Honours (Mechanical and Manufacturing Engineering) 2018
FACULTY OF ENGINEERING
FYP REPORT SUBMISSION FORM
Name: Lai Suk Na
Matric No.: 47241
Title: Development of a Child Detection System with Artificial Intelligence (AI) using Object Detection Method
Supervisor: Ir Dr David Chua Sing Ngie
Program: Mechanical and Manufacturing Engineering
Please return this form to the Faculty of Engineering office at least TWO WEEKS before your hardbound report is due. Students are not allowed to print/bind the final report prior to Supervisor’s Approval (Section B). The Faculty reserves the right to reject your hardbound report should you fail to submit the completed form within the stipulated time. A. REPORT SUBMISSION (To be completed by student) I wish to submit my FYP report for review and evaluation.
Signature: ___________________________
Date: ______________
B. SUPERVISOR’S APPROVAL (To be completed by supervisor) The student has made necessary amendments and I hereby approve this thesis for binding and submission to the Faculty of Engineering, UNIMAS.
Signature: ___________________________
Date: ______________
Name: ______________________________________________________
UNIVERSITI MALAYSIA SARAWAK
Grade: __________
Please tick (√): Final Year Project Report / Masters / PhD
DECLARATION OF ORIGINAL WORK
This declaration is made on the 22nd day of June 2018.
Student’s Declaration: I, LAI SUK NA, 47241, FACULTY OF ENGINEERING hereby declare that the work entitled, DEVELOPMENT OF A CHILD DETECTION SYSTEM WITH ARTIFICIAL INTELLIGENCE (AI) USING OBJECT DETECTION METHOD is my original work. I have not copied from any other students’ work or from any other sources except where due reference or acknowledgement is made explicitly in the text, nor has any part been written for me by another person.
22nd June 2018 Date submitted
Lai Suk Na (47241)
Supervisor’s Declaration: I, IR DR DAVID CHUA SING NGIE, hereby certify that the work entitled DEVELOPMENT OF A CHILD DETECTION SYSTEM WITH ARTIFICIAL INTELLIGENCE (AI) USING OBJECT DETECTION METHOD was prepared by the above-named student, and was submitted to the FACULTY OF ENGINEERING as a partial fulfillment for the conferment of BACHELOR OF MECHANICAL AND MANUFACTURING ENGINEERING (Hons.), and the aforementioned work, to the best of my knowledge, is the said student’s work.
Received for examination by: _ (Ir Dr David Chua Sing Ngie)
Date:
22nd June 2018
I declare this Project/Thesis is classified as (Please tick (√)):
   CONFIDENTIAL (Contains confidential information under the Official Secret Act 1972)*
   RESTRICTED (Contains restricted information as specified by the organisation where research was done)*
√ OPEN ACCESS
Validation of Project/Thesis I therefore duly affirmed with free consent and willingness declared that this said Project/Thesis shall be placed officially in the Centre for Academic Information Services with the abide interest and rights as follows:
1. This Project/Thesis is the sole legal property of Universiti Malaysia Sarawak (UNIMAS).
2. The Centre for Academic Information Services has the lawful right to make copies for the purpose of academic and research only and not for other purposes.
3. The Centre for Academic Information Services has the lawful right to digitise the content for the Local Content Database.
4. The Centre for Academic Information Services has the lawful right to make copies of the Project/Thesis for academic exchange between Higher Learning Institutes.
5. No dispute or any claim shall arise from the student himself/herself nor any third party on this Project/Thesis once it becomes the sole property of UNIMAS.
6. This Project/Thesis or any material, data and information related to it shall not be distributed, published or disclosed to any party by the student except with UNIMAS permission.
Student’s signature: ______________________ (22nd June 2018)
Supervisor’s signature: ______________________ (22nd June 2018)
Current Address: NO 82 TAMAN LANDEH JALAN LANDEH 93250 KUCHING SARAWAK
Notes: * If the Project/Thesis is CONFIDENTIAL or RESTRICTED, please attach together as annexure a letter from the organisation with the period and reasons of confidentiality and restriction.
[The instrument was duly prepared by The Centre for Academic Information Services]
APPROVAL SHEET
This project report entitled “DEVELOPMENT OF A CHILD DETECTION SYSTEM WITH ARTIFICIAL INTELLIGENCE (AI) USING OBJECT DETECTION METHOD” was prepared and submitted by LAI SUK NA (47241) as a partial fulfilment of the requirement for the degree of Bachelor of Engineering with Honours in Mechanical and Manufacturing Engineering, and is hereby read and approved by:
_____________________ Ir Dr David Chua Sing Ngie (Project Supervisor)
____________________ Date
DEVELOPMENT OF A CHILD DETECTION SYSTEM WITH ARTIFICIAL INTELLIGENCE (AI) USING OBJECT DETECTION METHOD
LAI SUK NA
A dissertation submitted in partial fulfillment of the requirement for the degree of Bachelor of Engineering with Honours (Mechanical and Manufacturing Engineering)
Faculty of Engineering Universiti Malaysia Sarawak
2018
To my beloved family and friends.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest gratitude to my supervisor, Ir Dr David Chua Sing Ngie from the Mechanical and Manufacturing Engineering Department, for his guidance, advice and contribution of ideas throughout this project. I would also like to thank Google Inc for making its object detection models and the TensorFlow library open source, which enables students and researchers to work in the field of Artificial Intelligence and machine learning. I sincerely thank Google Images for allowing the usage of images for training purposes. Last but not least, I would like to express my heartiest thanks to my family and friends for their support and cooperation. With their companionship, this thesis became a reality.
ABSTRACT
The issue of children dying of vehicular heatstroke has drawn public attention. The failure of current vehicular occupant detection devices to correctly identify the occupant as a child triggered the idea of developing a child detection system using Artificial Intelligence (AI) technology. The use of Convolutional Neural Networks (CNN) has been recognised as an effective way to perform image classification. However, this approach requires a significant number of images as training data and substantial time for model training in order to achieve the desired accuracy. Because an abundant dataset was not available, transfer learning was used to accomplish the task. A modern convolutional object detector, SSD Mobilenet v1, trained on the Microsoft Common Objects in Context (MS COCO) dataset was used as the starting point of the training process. The MS COCO dataset consists of a total of 328k images divided into 91 different categories, including dog, person and kite. The pretrained model was then retrained to classify adults and children instead of persons. At the end of the training, a real-time child detection system was established. The system was able to give different responses to the detection of a child and an adult. The responses comprised visual and audio outputs. Upon detection, a bounding box was drawn on a child's or an adult's face as the visual output. At the same time, the system would trigger the speaker to speak the statement “child is detected” for a successful child detection, whereas adult detection would result in the statement “adult is detected”. Theoretically, the detection system could achieve an overall precision of 0.969. The experimental results achieved a precision of 0.883, corresponding to a small error of 8.88%.
ABSTRAK Isu kanak-kanak maut dalam kenderaan kerana strok haba telah menimbulkan perhatian orang ramai. Kegagalan alat pengesan di pasaran untuk mengenal pasti kehadiran penumpang kenderaan adalah anak dengan tepat telah mencetuskan idea untuk membangunkan sistem pengesanan kanak-kanak menggunakan teknologi Kecerdasan Buatan (AI). Penggunaan rangkaian neural konvolusi (CNN) telah diakui sebagai cara yang berkesan untuk melakukan klasifikasi imej. Walau bagaimanapun, pendekatan ini memerlukan imej yang banyak sebagai data latihan dan masa yang panjang untuk latihan supaya dapat mencapai ketepatan yang dikehendaki. Oleh sebab batasan dataset yang banyak, pembelajaran pemindahan digunakan untuk menyelesaikan tugas. Pengesan objek konvolusi moden, SSD Mobilenet v1 yang dilatih dalam dataset Microsoft Common Objects Context (MS COCO) digunakan sebagai titik permulaan proses latihan. Dataset MS COCO mengandungi sejumlah 328 ribu imej dibahagikan kepada 91 kategori yang berbeza termasuk anjing, orang, layang-layang dan sebagainya. Model ini dilatih untuk mengklasifikasikan dewasa dan kanak-kanak yang pada asalnya mengenali kanakkanak dan orang dewasa sebagai orang. Pada akhir latihan, sistem pengesanan masa nyata ditubuhkan. Sistem memberi respons yang berbeza kepada kanak-kanak dan orang dewasa. Maklum balas terdiri daripada output visual dan audio. Secara teorinya, sistem pengesanan dapat mencapai ketepatan keseluruhan 0.969 sedangkan hasil eksperimen memberikan ketepatan 0.883, memberikan kesilapan sebanyak 8.88%.
TABLE OF CONTENTS

Acknowledgements ........ i
Abstract ................ ii
Abstrak ................. iii
Table of Contents ....... iv
List of Tables .......... vi
List of Figures ......... vii
List of Abbreviations ... ix

Chapter 1 INTRODUCTION
  1.1 General Background ........ 1
  1.2 Problem Statement ......... 4
  1.3 Objectives ................ 5
  1.4 Scope of Research ......... 5

Chapter 2 LITERATURE REVIEW
  2.1 History of Artificial Intelligence (AI) ........ 6
  2.2 Machine Learning (ML) .......................... 7
    2.2.1 Supervised Learning ....................... 7
    2.2.2 Unsupervised Learning ..................... 11
    2.2.3 Reinforcement Learning .................... 12
  2.3 Deep Learning (DL) ............................ 14
    2.3.1 Perceptron ................................ 14
    2.3.2 Deep Neural Network ....................... 16
      2.3.2.1 Convolutional Neural Network .......... 17
      2.3.2.2 Recurrent Neural Network .............. 18
  2.4 Pretrained Models ............................. 18
  2.5 Related Works ................................. 19

Chapter 3 METHODOLOGY
  3.1 Introduction .................................................... 23
  3.2 Research Methodology ............................................ 23
  3.3 Hardware ........................................................ 24
  3.4 Python Programming and Python Integrated Development Environment  24
  3.5 System Design Flow Chart ........................................ 25
  3.6 Flow Chart of Methodology ....................................... 26
  3.7 Object Detection using SSD Mobilenet v1 ......................... 27
  3.8 Dataset Pre-processing .......................................... 29
    3.8.1 Collection of Dataset ....................................... 29
    3.8.2 Annotation .................................................. 30
    3.8.3 Binarization ................................................ 31
  3.9 Training ........................................................ 31
    3.9.1 Compute Loss ................................................ 31
  3.10 Testing ........................................................ 32
    3.10.1 Compute Precision and Evaluate Performance of Model Checkpoints  33
  3.11 Export Retrained Model ......................................... 34
  3.12 Completion of Child Detection System ........................... 35
    3.12.1 Types of Inputs ............................................ 35
    3.12.2 Classifier ................................................. 35
    3.12.3 Triggering System .......................................... 36
  3.13 Experimental Evaluation of Child Detection System .............. 36
    3.13.1 Response Time .............................................. 36
    3.13.2 Maximum Distance of Detection .............................. 37
    3.13.3 Precision .................................................. 37
  3.14 Project Management ............................................. 38

Chapter 4 RESULT AND DISCUSSION
  4.1 Introduction ..................................... 39
  4.2 Object Detection on Jupyter Notebook ............. 39
  4.3 Detection with SSD Mobilenet v1 .................. 40
  4.4 Loss Graph ....................................... 41
  4.5 Model Evaluation ................................. 42
  4.6 Experimental Results ............................. 46
    4.6.1 Response Time ................................ 47
    4.6.2 Maximum Distance of Detection ................ 48
    4.6.3 Precision .................................... 56
  4.7 Discussion ....................................... 61
  4.8 Sources of Error ................................. 61

Chapter 5 CONCLUSIONS AND RECOMMENDATIONS
  5.1 Conclusions ...................................... 62
  5.2 Recommendations .................................. 62

REFERENCES ... 63
APPENDIX A ... 68
APPENDIX B ... 72
APPENDIX C ... 73
APPENDIX D ... 75
APPENDIX E ... 78
APPENDIX F ... 82
APPENDIX G ... 84
APPENDIX H ... 85
LIST OF TABLES

Table 1.1   Comparison between the Advantages and Disadvantages of Current Vehicular Occupant Detection Systems in the Market ... 4
Table 2.1   Difference between Machine Learning Tasks ... 13
Table 2.2   Speed and Accuracy of Pretrained Model ... 19
Table 3.1   Pretrained Model ... 28
Table 3.2   Distribution of Training and Testing Dataset ... 30
Table 3.3   Gantt Chart for FYP1 and FYP2 ... 38
Table 4.1   Sample Ground Truth and Computed Data ... 42
Table 4.2   Result of Testing at Step 19608 ... 46
Table 4.3   Response Time for 6 Candidates ... 47
Table 4.4   Child Detection with Varying Distances ... 49
Table 4.5   Adult Detection with Varying Distances ... 54
Table 4.6   Reproducibility Test of the Child Detection ... 58
Table 4.7   Reproducibility Test of the Adult Detection ... 58
Table 4.8   Terminology in Confusion Matrix ... 59
Table 4.9   Confusion Matrix of Child ... 59
Table 4.10  Confusion Matrix of Adult ... 59
Table 4.11  Theoretical and Experimental Precisions ... 60
LIST OF FIGURES

Figure 1.1    Circumstances of Child Vehicular Heatstroke Death in the United States (1998 – 2016) ... 2
Figure 1.2    Sense-A-Life Vehicular Occupant Detection System ... 2
Figure 1.3    Hyundai Rear Occupant Alert System ... 3
Figure 2.1    Relationship between AI, ML and DL ... 7
Figure 2.2    Procedure for Building a Classification Model ... 9
Figure 2.3    Procedure in a Regression Task ... 10
Figure 2.4    Construction of Regression Line based on Samples ... 11
Figure 2.5    Characteristics of Machine Learning Models ... 13
Figure 2.6    The Processing Steps in a Perceptron ... 15
Figure 2.7    Architecture of ANN ... 16
Figure 2.8    Difference between Neural Network and DNN ... 16
Figure 2.9    Hierarchy Feature Extraction in CNN ... 17
Figure 2.10   Architecture of RNN ... 18
Figure 3.1    Python IDLE Version 3.6 ... 24
Figure 3.2    Logic Diagram of Child Detection System ... 25
Figure 3.3    Procedures in Methodology ... 26
Figure 3.4    Process Flow to Accomplish Object Detection Method ... 27
Figure 3.5(a) Sample Adult Images ... 30
Figure 3.5(b) Sample Child Images ... 30
Figure 3.6(a) Labelling Images ... 30
Figure 3.6(b) XML File ... 30
Figure 3.7    Command Window ... 31
Figure 3.8    Loss Values ... 32
Figure 3.9    Running Evaluation Command ... 33
Figure 3.10   The Computed Precision for Model Checkpoint at Step 19613 ... 33
Figure 3.11   Child Detection System ... 35
Figure 3.12   Experiment Setup Diagram ... 37
Figure 4.1    Results of Object Detection ... 39
Figure 4.2    Sample Image for Detection ... 40
Figure 4.3(a) Detection Outcome based on SSD_Mobilenet_v1 ... 40
Figure 4.3(b) Detection Outcome based on Retrained Model ... 40
Figure 4.4    Loss Graph during Training ... 41
Figure 4.5    Relationship between Predicted Bounding Box, Bp, and Ground Truth Bounding Box, Bgt ... 42
Figure 4.6    Performance of Model Checkpoints by Category (Adult) ... 44
Figure 4.7    Performance of Model Checkpoints by Category (Child) ... 44
Figure 4.8    Mean Average Precision (mAP) of Model Checkpoints ... 45
Figure 4.9    Setup of Experiment ... 47
Figure 4.10   Response Time for 6 Candidates ... 48
Figure 4.11   Results of Child Detection ... 57
Figure 4.12   Results of Adult Detection ... 57
LIST OF ABBREVIATIONS

AI   - Artificial Intelligence
ML   - Machine Learning
DL   - Deep Learning
ANN  - Artificial Neural Network
DNN  - Deep Neural Network
CNN  - Convolutional Neural Network
RNN  - Recurrent Neural Network
IDLE - Integrated Development and Learning Environment
XML  - Extensible Markup Language
HTML - Hyper Text Markup Language
WAV  - Waveform Audio
CSV  - Comma Separated Values
CHAPTER 1
INTRODUCTION
1.1 General Background
Heatstroke is defined as a condition in which the body loses the ability to cool itself due to prolonged exposure to high temperatures (Mayo Clinic, 2017). The symptoms of heatstroke include high body temperature, typically reaching 40 °C or higher, nausea, vomiting, rapid breathing, racing heart rate, headache and so on. Children die from heatstroke each year after being left unattended in vehicles. According to McLaren (2005), even under relatively cool ambient temperatures, the majority of the temperature rise in a parked vehicle occurs within the first 15 to 30 minutes. According to Kidsandcars (2017), the probability of children suffering heatstroke in cars is much higher than that of adults. The core body temperature of children rises more rapidly under high temperatures due to their greater body surface area to mass ratio compared with adults. A child can overheat around 3 to 5 times faster than an adult. Statistics show a total of 700 child fatalities due to vehicular heatstroke since 1998 in the United States, and 87% of the victims were aged under 3 years. Figure 1.1 shows the circumstances of child vehicular heatstroke (Null, 2017).
(Pie chart data: Forgotten 54%; Gained Access 28%; Left Intentionally 17%; Unknown 1%)
Figure 1.1: Circumstances of Child Vehicular Heatstroke Death in the United States (1998 – 2016)
The issue of children dying of vehicular heatstroke has raised public attention and led to the invention of various types of devices to help remind drivers or caregivers. A typical product currently available in the market is Sense-A-Life, a device that uses a pressure sensor to detect the presence of a child and immediately alerts a driver who has left a child in the car through a mobile application and speakers. It is designed for simple installation and easy transfer between vehicles.
Figure 1.2: Sense-A-Life Vehicular Occupant Detection System (Adapted from Schlosser, 2016)
However, this invention has weaknesses. The pressure sensor cannot differentiate whether the force applied to it comes from a human or a load, which causes false signals to be generated. Besides, it is designed to be installed in a baby car seat, so it cannot detect the presence of a child who gained access to the car on their own. Another invention, developed by Hyundai Motor, is the Rear Occupant Alert system. The system uses an ultrasonic sensor to detect the motion of a child in the rear seats after the driver leaves the vehicle and activates a triggering system that sounds the horn, flashes the lights and sends a text message to the driver. The system works effectively even if the child is not placed in the baby seat (Newswire, 2017).
Figure 1.3: Hyundai Rear Occupant Alert System (Adapted from Muller, 2017)
However, the system cannot generate a signal if the child is sleeping or not moving. In addition, the system will generate a signal for any motion regardless of whether the vehicular occupant is an adult or a child, which reduces the reliability of the system. Current products in the market generally comprise sensors for temperature sensing and vehicular occupant detection, together with a triggering system. The weakness of these products is their inability to correctly detect whether there is an occupant present in the car and whether he or she is a child or an adult. As such, a study should be done to develop a child detection system which can accurately detect the presence of a child in a car. The application of Artificial Intelligence (AI) technology to image recognition motivated the idea of developing a child detection system to achieve the objectives of this research.
Table 1.1: Comparison between the Advantages and Disadvantages of Current Vehicular Occupant Detection Systems in the Market

Sense-A-Life
  Advantages: Easy to install.
  Disadvantages: The pressure sensor cannot differentiate between a child and a load; the region of detection is limited to the baby car seat only.

Hyundai Rear Occupant Alert System
  Advantages: Detection is not limited to the baby car seat only.
  Disadvantages: Cannot differentiate whether the occupant is an adult or a child; fails to sense the occupant if there is no motion detected.
1.2 Problem Statement
The problem currently encountered is that the sensors available in the market cannot accurately differentiate whether a vehicle occupant is an adult or a child. These sensors can detect the presence of a vehicular occupant and generate a signal to alert the car owner, but they cannot determine whether the occupant is a child or an adult. This increases the frequency of false signal generation. An inaccurate sensor such as a pressure sensor will activate the triggering system for any load applied to it, generating false signals to users and thus reducing the reliability of the detection system. On the other hand, a problem to be faced in this project is the quality of the input, which will affect the accuracy of the results. This problem may occur due to the motion of children. Other factors such as brightness, distance and angle from the camera to the target would affect the performance of the system as well.
1.3 Objectives
The objectives of this study are:
1. To provide a framework for the development of a child detection system with AI using the object detection method.
2. To establish an image recognition system that can detect the presence of a child.
3. To develop a child detection system that is able to reduce the frequency of false signal generation.
1.4 Scope of Research
This research focuses on developing an alternative way to detect vehicular occupants instead of using the sensors available in the market. The system established will respond to the presence of a child only. This research will help to study the workability of using image recognition to differentiate a child from an adult. There are several challenges that may not be solved by using image recognition. Firstly, training an image recognition system requires a large number of images as training data. Besides, training an image recognition model requires the usage of a Graphics Processing Unit (GPU) in order to speed up the process and obtain a more accurate retrained model.
CHAPTER 2
LITERATURE REVIEW
2.1 History of Artificial Intelligence
The term Artificial Intelligence (AI) was coined by John McCarthy and two senior scientists, Claude Shannon and Nathan Rochester, in 1956 at the Dartmouth Conference, the first conference devoted to the subject (Buchanan, 2006). They proposed that every aspect of learning or any other feature of intelligence can be precisely described, and that a machine can be made to simulate it. According to Luger (2009), artificial intelligence (AI) is the science of enabling machines to accomplish things that require intelligence. It promotes the use of a computer to do reasoning. AI can refer to anything from a computer programme playing a game to pattern recognition, image recognition, and text and speech recognition. The field of AI reached a major advance in the 1980s with the emergence of Machine Learning (ML), an approach to achieving AI (Dietterich & Michalski, 1983). Machine learning works on the principle of using algorithms to learn from examples. The machine is trained to learn from a large amount of data by itself, without being explicitly programmed. However, conventional machine learning techniques were limited in their ability to process natural data in their raw form (Luppescu & Romero, n.d.). This weakness led to the blossoming of Deep Learning (DL) in the 2000s. DL is a set of methods that allow a machine to be fed with raw data and automatically discover the representations needed for detection and classification (Lecun, Bengio, & Hinton, 2015). DL architectures comprise multiple processing layers that extract useful representations of data. For example, in image processing, raw images are fed into the learning model. The initial layer performs edge detection, the second layer detects motifs, the third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and
subsequent layers would detect objects as combinations of these parts. In short, AI is any technique that enables computers to replicate human intelligence. ML is the subset of AI that gives computers the ability to learn without being explicitly programmed. Improvement of ML led to DL, which allows a model to learn representations of data with multiple levels of abstraction. Figure 2.1 shows the relationship between AI, ML, and DL.
AI - Human intelligence exhibited by machines
ML - An approach to achieve AI
DL - A technique for implementing ML
Figure 2.1: Relationship between AI, ML, and DL
2.2 Machine Learning
According to Negnevitsky (2002), machine learning consists of adaptive mechanisms that enable the computer to learn from experience or from data exposed to it. The knowledge is improved by continually making adjustments and corrections based on the error signal generated. The most famous machine learning mechanisms include artificial neural networks and genetic algorithms. Machine learning is a field of study that focuses on computer systems that can learn from data. These systems are called models. Models can learn to perform a specific task by analysing many examples of a particular problem. There are three modes of machine learning: supervised learning, unsupervised learning and reinforcement learning.
2.2.1 Supervised Learning
Supervised learning is a machine learning process whereby the model predicts provided output or target variables. In other words, all the target variables are labelled. Classification and regression are examples of supervised learning problems, in which the output variables are either categorical or numeric. Under supervised learning, machine learning can further be divided into classification tasks and regression tasks.
a. Classification
In classification, input variables or features are fed into the machine learning model; the model then goes through a series of algorithms and predicts the category of the output (Michie et al., 1994). Classification differs from regression because its output is categorical, such as different classes of dog breeds, flower species, or weather conditions (sunny, windy, cloudy or rainy). A classification problem can be binary or multi-class. For binary classification, the output variable has only two possible outcomes, either yes or no. On the other hand, a multi-class problem has more than two possible outcomes, such as predicting the weather condition, the types of products customers are going to buy, or the age group of a person (child, teenager, adult or senior). A machine learning model is a mathematical model or a parametric function over the input. The function of the model is to predict the output with respect to the input data. The parameters, typically called weights, are frequently adjusted by learning algorithms within the model with exposure to various inputs in order to match the targets or desired outputs as closely as possible. Figure 2.2 shows the procedure for constructing a classification model. The process involves 2 phases: a training phase and a testing phase.
Figure 2.2: Procedure for Building a Classification Model
The weights are continuously adjusted by algorithms throughout the training phase. Algorithms for classification tasks include:
i. Decision Tree Classifier - A classification model that uses a tree-like structure to represent multiple decision paths. Traversing each path leads to a different way to classify an input sample.
ii. K Nearest Neighbours (kNN) - Classifies samples with similar input values, called nearest neighbours, into the same class. Samples from different classes are separated by a large margin. The classifier uses simple Euclidean distances to measure the dissimilarities between samples represented as vector inputs (Weinberger, Blitzer, & Saul, 2006).
iii. Naïve Bayes - Uses a probabilistic approach to classify an input sample. It captures the relationships between the input data and the output class and predicts the probability of an event occurring for a given sample.
These algorithms aim to match the target output as closely as possible with the least prediction error. At the end of the training phase, the trained model is obtained. The aim of the testing phase is to evaluate how the trained model performs. During the testing phase, the model is exposed to test data which it has never seen before. An illustrative example of this train-and-test workflow is sketched below.
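The following short Python sketch is not part of the original thesis; it merely illustrates the training and testing phases described above using a k-nearest-neighbours classifier from scikit-learn on a standard toy dataset. The dataset, the split ratio and the choice of k = 3 are illustrative assumptions only.

# Hedged sketch of a supervised classification task (training phase, then testing phase).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Labelled data: input features and target categories.
features, targets = load_iris(return_X_y=True)

# Split into training and testing sets (9:1, mirroring the ratio used later in this project).
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.1, random_state=0)

# Training phase: the classifier learns from labelled examples.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)

# Testing phase: evaluate on data the model has never seen before.
predictions = model.predict(x_test)
print("Test accuracy:", accuracy_score(y_test, predictions))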
b. Regression
In a classification problem, the learning model predicts the category to which the input belongs. In a regression problem, on the other hand, the model has to predict a numeric value instead of a category. Examples of regression problems include predicting the price of a stock, forecasting the temperature for the next day, estimating average house prices, and predicting power usage. Since the output data are labelled with numeric values, a regression problem is a supervised task. Figure 2.3 shows the procedure in a regression task.
Input Variables -> Model -> Numeric Output
Figure 2.3: Procedure in a Regression Task
The process of building a regression model is similar to that of a classification model: it involves training and testing phases. The algorithm used to adjust the regression model is called linear regression. A linear regression model captures the relationships between the
numerical output and the input variables. Figure 2.4 shows the construction of a regression line based on the samples, separating the data into two regions. The line represents the model's prediction of the output corresponding to the input.
Figure 2.4: Construction of Regression Line based on Samples (Adapted from https://goo.gl/images/32rgD2)
In regression, the error E is measured based on the distance between the regression line and the actual value for the input variables. The square of the distance, E², is called the residual. The algorithm continuously adjusts the model to reduce the sum of squared distances in order to give an output with the least error possible.
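As a hedged illustration that is not taken from the thesis, the following Python lines fit a straight line to a few sample points with NumPy and report the residuals and their sum of squares; the data values are invented purely for demonstration.

import numpy as np

# Toy labelled data: input variable x and numeric target y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a regression line y = m*x + c by least squares.
m, c = np.polyfit(x, y, 1)

# Residuals: vertical distances between the line and the actual values.
predictions = m * x + c
residuals = y - predictions

# The quantity the algorithm tries to minimise: the sum of squared distances.
sum_of_squares = np.sum(residuals ** 2)
print("slope:", m, "intercept:", c, "sum of squared residuals:", sum_of_squares)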
2.2.2 Unsupervised Learning
In unsupervised learning, the output variable remains unknown. The purpose of unsupervised learning is to model the underlying structure of the input data. This means the data are unlabelled. Examples of unsupervised learning problems include cluster analysis and association analysis.
a. Cluster analysis
Cluster analysis is also known as clustering. The goal of clustering is to organise a dataset of similar items into groups, or clusters, so that the differences between samples within a group are minimised. Cluster analysis is an unsupervised task as each cluster has no target label. The simplest algorithm for unsupervised cluster analysis is K-means clustering (Tan, Steinbach, & Kumar, 2005a). The K-means algorithm starts by determining the centroid coordinates. Then, based on the smallest distance between the samples and the centroids, the samples are grouped to their respective centroids. When all the objects have been assigned, the positions of the K centroids are recalculated to reduce the error. A typical example of cluster analysis is the grouping of customers based on purchasing behaviour. A compact sketch of this algorithm is shown below.
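The following minimal NumPy implementation is added here for illustration and is not taken from the thesis; it follows the K-means steps just described (choose centroids, assign samples to the nearest centroid, recompute the centroids). The sample data and the choice of K = 2 are assumptions.

import numpy as np

def kmeans(samples, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen samples as the initial centroid coordinates.
    centroids = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each sample to the centroid with the smallest Euclidean distance.
        distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Recalculate the position of each centroid as the mean of its assigned samples.
        centroids = np.array([samples[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Toy 2-D data: two loose groups of points (e.g. two kinds of customers).
data = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                 [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])
labels, centroids = kmeans(data, k=2)
print(labels, centroids)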
b. Association Analysis
An association rule learning problem is to discover rules that describe large portions of the data. Examples of applications of association analysis are web mining, document analysis, telecommunication alarm diagnosis, network intrusion detection and bioinformatics. The algorithm involved in association analysis is the Apriori algorithm (Tan, Steinbach, & Kumar, 2005b).
2.2.3 Reinforcement Learning
In both supervised and unsupervised learning, the trained model is used to perform classification or detection tasks. Reinforcement learning, on the other hand, is continuously improved based on the processed data and the results (Mnih, Silver, & Riedmiller, n.d.). Reinforcement learning learns through trial-and-error interaction. The goal of reinforcement learning is to develop efficient learning algorithms (Barto & Dietterich, 2004). Figure 2.5 shows the characteristics of each category of machine learning and their respective algorithms for constructing learning models.
Figure 2.5: Characteristics of Machine Learning Models
To summarise, the classification, regression, clustering and association analysis tasks in machine learning mainly differ in the way input data are grouped and predictions are displayed. Table 2.1 summarises the differences between the 4 machine learning tasks.
Table 2.1: Difference between Machine Learning Tasks
Classification - Input: labelled data. Output: categorical. Application: predicting the weather condition (sunny, windy, cloudy or rainy).
Regression - Input: labelled data. Output: numerical. Application: predicting stock prices.
Clustering - Input: unlabelled data. Output: no target label, but groups with close similarities. Application: predicting reading materials preferred by customers.
Association Analysis - Input: unlabelled data. Output: learned rules, e.g. customers who buy tea tend to buy fruits. Application: web mining, document analysis, telecommunication alarm diagnosis, network intrusion detection and bioinformatics.
2.3 Deep Learning (DL)
In machine learning, the learning model needs to be told what to do before it can give an accurate prediction. This is achieved by feeding data to the model. In contrast, a DL model is able to learn on its own, similar to a human. To achieve this, DL uses a layered structure of algorithms called an artificial neural network. The artificial neural network is designed to simulate the biological neural network in the human brain. DL is a process which enables computational models to learn representations of data with multiple layers of abstraction. It possesses a set of methods that start from the raw input; the raw data are transformed by the machine, which automatically discovers the representations needed for detection or classification (LeCun, Yoshua & Geoffrey, 2015). This is done by a feature extractor. The extracted features are further transformed to a slightly more abstract level in the internal representation process. Finally, the features are sent to the classifier, which is composed of higher layers of representation. In the classifier, the aspects identified as useful input are amplified while irrelevant variations are suppressed. In image recognition using DL, users are required to train the machine to have the ability to classify the various images scanned into it. At the same time, users are required to expose a large collection of data to the machine and label it with the category as the desired output. Then, an objective function is computed to compare the error between the output scores and the desired pattern of scores. With this error, the machine tries to modify its internal adjustable parameters to reduce errors in the next recognition. These adjustable parameters, typically called weights, define the input-output function. The integration of the processes described above brings about the creation of Convolutional Neural Networks (CNN), a network which connects the different layers of learned features as a whole.
2.3.1 Perceptron
A perceptron is a mathematical model with a function similar to that of a biological neuron. The perceptron is the basic block of the Artificial Neural Network (ANN) and is a standard paradigm for statistical pattern recognition (Learning & Learning, 1999). There are 3 processing steps within a single perceptron, namely the input, activation and output processes. When the inputs, in the form of signals, enter the perceptron, the mathematical model computes a weighted sum of the input signals and generates a binary output. It will give an output of 1 if the weighted sum is above a certain threshold. Otherwise, the model will give a zero output (Jain & Mao, 1996). Figure 2.6 shows the three processing steps within a perceptron.
Figure 2.6: The Processing Steps in a Perceptron (Adapted from O’Riordan, 2005)
Mathematically, the output y for n inputs x_j operating under a threshold u can be expressed as:

y = θ( Σ_{j=1}^{n} w_j x_j − u )
where θ(·) is a unit step function at 0 and j = 1, 2, 3, …, n. The applications of neural networks include face identification, speech recognition, text translation, gaming, automated vehicles, robot control and so on. Multilayer perceptrons constitute the ANN. Each layer consists of more than one perceptron, which receives input values collectively called the input vector of that particular perceptron. The weights of each perceptron are collectively identified as the weight vector of that perceptron.
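The next few Python lines are added as an illustration rather than taken from the thesis; they evaluate the perceptron equation above for one example, and the weight values, inputs and threshold are arbitrary assumptions.

import numpy as np

def perceptron_output(inputs, weights, threshold):
    # Weighted sum of the inputs, compared against the threshold u.
    weighted_sum = np.dot(weights, inputs)
    # Unit step function: 1 if the weighted sum exceeds the threshold, else 0.
    return 1 if weighted_sum - threshold > 0 else 0

x = np.array([0.5, 1.0, 0.25])   # input vector
w = np.array([0.4, 0.3, 0.8])    # weight vector
u = 0.6                          # threshold
print(perceptron_output(x, w, u))  # prints 1, since 0.5*0.4 + 1.0*0.3 + 0.25*0.8 = 0.7 > 0.6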
Figure 2.7: Architecture of ANN (Adapted from Popescu, Balas, Perescu-Popescu, & Mastorakis, 2009)
The first layer of the neural network is connected to the input data and is typically called the input layer. Layers which are not directly connected to the environment are called hidden layers. The function of the hidden layers is to transmit a signal to adjacent layers, without any processing of the raw input. The final output is connected to the activation function. The most commonly used activation function currently is the sigmoid function (Popescu et al., 2009).
2.3.2 Deep Neural Network (DNN)
Deep Neural Network (DNN) is an artificial neural network composed of numerous hidden layers. Figure 2.8 shows the difference between neural network and DNN.
Figure 2.8: Difference between Neural Network and DNN (Adapted from Nielsen, 2015)
A DNN derives its name from possessing two or more hidden layers. Each hidden layer performs specific types of sorting and ordering tasks. A DNN has the ability to deal with unlabelled or unstructured data by performing hierarchical feature extraction. Based on the different types of architectures, ANNs can be grouped into two categories: feed-forward networks such as the Convolutional Neural Network (Section 2.3.2.1) and the Recurrent Neural Network (Section 2.3.2.2).
2.3.2.1 Convolutional Neural Network (CNN)
In feed-forward networks, perceptrons are organised into layers that have unidirectional connections between them. The convolutional neural network is an example of a feed-forward neural network designed to process data that come in the form of multiple arrays. An example of such data is a colour image. A computer stores an image as tiny squares: an image is composed of small squares called pixels, and each pixel has a single colour defined by a set of numbers. The set of numbers represents a combination of three colours, namely red, green and blue, called an RGB image. In an RGB image, the colour of a pixel is represented by three 8-bit numbers in the range 0-255. For example, yellow is represented by the array [255, 255, 0], as yellow is produced by a combination of red and green. If the RGB values are set to full intensity, i.e. [255, 255, 255], white is displayed, whereas if the RGB values are muted, i.e. [0, 0, 0], black is displayed. A colour image therefore contains three 2D arrays of pixel intensities, one for each of the RGB channels. As such, CNN takes advantage of these natural signal properties to perform recognition tasks. CNN exploits the natural signals in a hierarchical manner (Lecun, Bengio, & Hinton, 2015), in which higher-level features are obtained by composing lower-level features, as shown in Figure 2.9.
Figure 2.9: Hierarchy Feature Extraction in CNN (Adapted from Chen, 2016)
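As a small illustration that is not part of the thesis, the NumPy snippet below builds a 2 x 2 RGB image as a stack of three 8-bit channels and shows the yellow, white and black pixel values discussed above.

import numpy as np

# A colour image is a (height, width, 3) array of 8-bit intensities (one 2D array per RGB channel).
image = np.zeros((2, 2, 3), dtype=np.uint8)

image[0, 0] = [255, 255, 0]    # yellow: full red + full green, no blue
image[0, 1] = [255, 255, 255]  # white: all channels at full intensity
image[1, 0] = [0, 0, 0]        # black: all channels muted
image[1, 1] = [0, 0, 255]      # pure blue, for comparison

print(image.shape)   # (2, 2, 3)
print(image[0, 0])   # [255 255   0]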
2.3.2.2 Recurrent Neural Network (RNN)
RNN is a deep neural network designed to perform sequential tasks, such as speech and language. Applications of RNN include predicting the next word in a sequence or next character in a text. Figure 2.10 shows the architectures of RNN.
Figure 2.10: Architecture of RNN (Adapted from Zhong, Peniak, Tani, Ogata, & Cangelosi, 2016)
In an RNN, the architecture is not a feed-forward network. Some interconnections form loops connecting perceptrons of the same layer (Popescu, Balas, Perescu-Popescu, & Mastorakis, 2009). The feedback loops modify the inputs to each perceptron, which leads the network to enter a new state.
2.4 Pretrained Models
Pretrained models are built with a combination of object detectors and feature extractors. Examples of modern convolutional object detectors include Faster Regions with Convolutional Neural Network features (Faster R-CNN), Region-based Fully Convolutional Networks (R-FCN) and the Single Shot Multibox Detector (SSD). Feature extractors include MobileNet, VGG, ResNet-101, Inception V2, Inception V3 and Inception ResNet V2 (Huang et al., 2016). Table 2.2 shows the speed and accuracy of pretrained models.
Table 2.2: Speed and Accuracy of Pretrained Models (Adapted from Shanmugamani, n.d.)

Model Name                                     Speed    COCO mAP
ssd_mobilenet_v1_coco                          fast     21
ssd_inception_v2_coco                          fast     24
rfcn_resnet101_coco                            medium   30
faster_rcnn_resnet101_coco                     medium   32
faster_rcnn_inception_resnet_v2_atrous_coco    slow     37
In this project, real-time input is chosen for child detection in order to reduce the complexity of the design and save storage. Hence, ssd_mobilenet_v1_coco is chosen as the starting checkpoint for model retraining, as it achieves fast speed with considerable accuracy.
2.5 Related Works
The application of machine learning has advanced rapidly in various fields such as speech recognition and text and image recognition. An example of image classification using machine learning is demonstrated by Pratik Devikar (2016). He used the pretrained model Inception v3 to do transfer learning. The dataset used to retrain the model was obtained from Google Images. The aim of the research was to train a model which can recognise and differentiate 11 types of dog breeds. Hence, he prepared 11 datasets, each comprising 25 slightly different images of a particular dog breed. To ensure the uniformity of the datasets, the images were set to a resolution of 100x100 pixels. Throughout the experiment, he used the Python programming language and imported the TensorFlow library to conduct the classification task. The accuracy score was generated using the SoftMax algorithm. The resulting testing accuracy he achieved reached 96% (Devikar, 2016). Another example that utilised machine learning in image recognition is demonstrated by Tapas (2016). The aim of this experiment was to classify plant phenotypes. He used the pretrained model GoogleNet for retraining. The dataset was extracted from the Computer Vision Problems in Plant Phenotyping (CVPPP 2014) database. The dataset comprised 3 categories, 2 on Arabidopsis, with 161 images and 40 images respectively. The other category of the dataset was a Tobacco species, which
consisted of 83 images. The retraining process was conducted via the TensorFlow library with Python as the programming language. The output was displayed as probabilities using the SoftMax function. The accuracy based on the testing images reached 98%. In addition, a similar study on flower classification using Inception v3 is worth consideration. The study was based on the Inception v3 model of the TensorFlow platform. The experimental datasets were acquired from two sources: the Oxford-17 database, which consists of 17 categories of flowers, and Oxford-102, which consists of 102 categories of flowers. The results produced by the SoftMax function for the possible outputs, given the testing images as input, were compared for the two datasets. The results show the model trained on the Oxford-17 dataset reached 95% accuracy whereas the Oxford-102 dataset gave an accuracy of 94% (Xia & Nan, 2017). According to Chin et al. (2017), research on an intelligent image recognition system for marine fouling using SoftMax transfer learning and a deep convolutional neural network was carried out. They implemented transfer learning by retraining Google's Inception v3 model, with SoftMax as the output of prediction based on the image input. The images were processed by the Open Source Computer Vision Library (OpenCV) and the retraining process was done with the help of the TensorFlow library. At the beginning of the process, a Raspberry Pi 3 captured an image of the marine fouling. The image was then uploaded to the cloud to be classified by the retrained Inception V3 model and convolutional neural network. The image was then processed and the percentage of the area covered by macrofouling organisms was determined. A percentage in the range of 25-40% was considered heavy fouling, and a cleaning process must then be conducted. The datasets were obtained from images captured from the web. The model was retrained to classify 10 classes of fouling species, with dataset sizes in the range of 82-228 images. In order to enhance the accuracy of the model, the model was trained twice. Results show the lowest improvement was 10.302% while the highest reached 41.398%. Upon testing the reliability of the trained model, the highest accuracy achieved among the 10 classes of fouling species was for rock oysters, which reached 99.703% correct prediction. On the other hand, the finger sponge species had the lowest accuracy, which was 76.617%.
Tamkin & Usiri (2013) claimed that diabetic retinopathy can be detected with the application of a deep Convolutional Neural Network. They extracted a dataset from the Kaggle competition database. The database was chosen because the images were taken under various conditions, including different cameras, colours, lighting and orientations. The more varied the image sources, the higher the robustness of the trained model. A total of 35,126 images, with a size of more than 38 gigabytes, was separated in the ratio of 8:2, whereby 80% of the images were used as the training set and 20% as the testing set. All images were resized to 256 x 256 pixels. The highest accuracy achieved at the end of the experiment was 92.59%. Human age can be categorised into 4 phases: child (0-12 years old), adolescence (13-18 years old), adult (19-59 years old) and senior adult (60 years old and above). As humans age, there are minor changes in the facial features. Recently, age estimation has developed a variety of applications including internet access control, underage prevention at cigarette and alcohol vending machines and so on. With the transition into adulthood, the development of the lower parts of the face is more pronounced than that of the upper part. The eyes occupy a higher position in an adult than in an infant. This is due to an outgrowing and dropping of the chin and jaw, not a migration of the eyes (Kwon & da Vitoria Lobo, 1994). Thukral, Mitral, & Chellappa (2012) proposed a hierarchical method to estimate human age. The datasets were obtained from the FG-Net website. Upon gathering the dataset, they grouped the images into 3 major groups, in the ranges of 0-15, 15-30 and 30+ years old respectively. The experiment can be divided into 3 steps: feature extraction, regression and classification. In feature extraction, facial landmark points at the corners or extremities of the eyes, mouth and nose are extracted. Regression was conducted by determining the independent variable, x, and the dependent variable, y. Next, they used the Relevance Vector Machine (RVM) regression model to conduct machine learning according to the age groups. After that, in the classification phase, they utilised 5 types of classifiers, namely μ-SVC, Partial Least Squares (PLS), Fisher Linear Discriminant, Nearest Neighbour, and Naïve Bayes, to classify the images into the correct age group. Results showed that if the classifiers were able to classify the images into the correct age group, the age estimation task by RVM performed more accurately, reaching 70% accuracy.
Jana, Datta, & Saha (2013) claimed that facial features can be used to estimate age group. Their experiment involved 3 stages: pre-processing, feature extraction and classification. During the pre-processing phase, they prepared datasets by taking images of 50 persons using a digital camera (Nikon Coolpix L10). The face images were cropped, and the positions of the eye pair, mouth, nose and chin were detected. During feature extraction, global features such as the distances between the two eyeballs, eye to nose tip, eye to chin and eye to lip were determined. Six ratios were then computed from the distances obtained. After that, classification was carried out using the K-means clustering algorithm. Results showed that the ratio obtained using pixels (F5) was the most reliable, with an accuracy of 96% when the samples were separated into 2 age groups; 84% accuracy was obtained for 3 age groups and 62% for 4 age groups.
CHAPTER 3
METHODOLOGY
3.1 Introduction
This chapter summarises the procedures for carrying out the research on the child detection system with AI using the object detection method. The main purpose of this research is to develop a framework for a child detection system as an alternative to improve the accuracy of current vehicular occupant detection systems. This chapter explains the type of programming language used and the hardware implemented for developing the child detection system. Next, the chapter discusses the gathering of information through different sources, the process flow chart to achieve the objectives of this project, and the components (software and hardware) used to accomplish the detection task.
3.2 Research Methodology
The study began with a literature review on the development of AI and the reasons for choosing AI to develop a child detection system using image recognition technology. The literature research was carried out through Google Scholar, the Institute of Electrical and Electronics Engineers (IEEE), Elsevier and Mendeley. Then, related works and research done in recent years regarding the application of DL in image recognition were studied in order to get ideas for developing the child detection system. The Python programming language was chosen as the software platform because of its flexibility and capability to support different ML packages.
3.3 Hardware
In this research, the built-in camera of a laptop is used as the image input device. A 2.50 GHz Intel i5 Central Processing Unit (CPU), 4.00 GB of Random Access Memory (RAM) and a 64-bit operating system are used to accomplish the training process.
3.4 Python Programming and Python Integrated Development Environment
Python is a programming language with extensive supported packages and modules. It was developed by Guido van Rossum and is derived from many other languages such as ABC, Modula-3, C, C++, Algol-68, SmallTalk, Unix shell and other scripting languages. Python also provides interfaces to all major commercial databases (Swaroop, 2003). Python has a broad standard library. This feature enables the exploration of and access to various file types such as XML, HTML, WAV and CSV files. IDLE is a simple Python Integrated Development Environment (IDE) available for Windows, Linux, and Mac OS X (Lent, 2013). All commands can be typed, saved and run in the Python IDLE interactive shell. As such, Python was chosen as the programming language throughout the project.
Figure 3.1: Python IDLE Version 3.6
3.5 System Design Flow Chart
(Flow chart: the camera provides real-time visual input to the classifier. If a person is detected, the system checks whether the detected person is a child. If yes, the system speaks out “child is detected” and a bounding box labelled child with scores is drawn on the image; if no, the system speaks out “adult is detected” and a bounding box labelled adult with scores is drawn. If no person is present, the camera keeps scanning. End.)
Figure 3.2: Logic Diagram of Child Detection System
The process flow of the child detection system is shown in Figure 3.2. The camera was activated to scan whether there was a child or an adult present. The input was then fed into the classifier (retrained model) whenever a person was detected. The classifier provided a response according to the class of the person detected, either a child or an adult. A hedged sketch of this detection loop is given below.
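The following Python sketch is not the project's implementation (the actual code is referenced in the appendices); it only illustrates the logic of Figure 3.2 under stated assumptions: detect_person() stands in for the retrained SSD Mobilenet classifier and is a hypothetical placeholder, the camera capture uses OpenCV, and the audio output uses the pyttsx3 text-to-speech package.

import cv2
import pyttsx3

engine = pyttsx3.init()

def detect_person(frame):
    # Hypothetical placeholder for the retrained object detector.
    # Should return (label, score, box) with label "child" or "adult", or None if nobody is present.
    return None

capture = cv2.VideoCapture(0)          # built-in laptop camera as real-time visual input
while True:
    ok, frame = capture.read()
    if not ok:
        break
    detection = detect_person(frame)
    if detection is not None:
        label, score, (x1, y1, x2, y2) = detection
        # Visual output: bounding box and label with score drawn on the image.
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{label}: {score:.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        # Audio output: different responses for child and adult.
        engine.say("child is detected" if label == "child" else "adult is detected")
        engine.runAndWait()
    cv2.imshow("Child detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
capture.release()
cv2.destroyAllWindows()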
3.6 Flow Chart of Methodology
The methodology proceeded through the following stages (Figure 3.3):
1. Object detection using SSD Mobilenet: setup and installation.
2. Dataset pre-processing: collect images from Google Images, annotation using labelImg, binarization.
3. Training: compute the error.
4. Testing: compute the accuracy and evaluate the performance of model checkpoints.
5. Export retrained model: choose the model checkpoint with the highest precision and lowest error.
6. Complete the child detection system: input (images, video, real time), classifier, triggering system (visual and audio).
7. Evaluate the performance of the child detection system: response time, maximum distance of detection, precision.
Figure 3.3: Procedures in Methodology
The child detection system consisted of a camera as the input medium, a classifier to detect the presence of a child, and a triggering system in audio and visual forms. The construction of the model began with object detection using a pretrained model to ensure all the required software and packages were installed correctly. This was followed by dataset pre-processing. Testing was done from time to time as the training proceeded. Training was ended when the precision reached 0.95 or above and the loss fell below 1.0. The model checkpoint at that particular step was exported as the system classifier. Upon completion of the system, its performance was evaluated based on the response time, maximum distance of detection and precision.
3.7 Object Detection using SSD Mobilenet v1
Object detection consists of a series of processes, as shown in Figure 3.4. The object detection is accomplished using the TensorFlow Object Detection API. It relies on Protobuf version 3.5, Python version 3.6, Pillow, LXML, TF-Slim, Jupyter Notebook, matplotlib and TensorFlow to achieve the detection task. The dependencies were installed via pip, a package management system used to install and manage software packages written in Python. The process flow comprises: installing dependencies, Protobuf compilation, adding libraries to PYTHONPATH, testing the installation, downloading the pretrained model, and running detection on a Jupyter Notebook.
Figure 3.4: Process Flow to Accomplish Object Detection Method
The TensorFlow Object Detection API uses Protobufs to configure model and training parameters. Before the framework could be used, the Protobuf libraries had to be compiled. This was done by running the following command from the tensorflow/models/research/ directory:
protoc object_detection/protos/* --python_out=.
where * indicates the proto files in the directory models/research/object_detection/protos. Protobuf compilation was used to convert the series of proto files into Python files. After the installation was done, the following command was run in the command window:
python object_detection/builders/model_builder_test.py
The pretrained model was downloaded from the link
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/dete ction_model_zoo.md. The choice of pretrained model, as shown in Table 3.1 was based on the speed and accuracy of the model. Since the detection would be done on a real time basis, a fast model was preferable. Mobilenet model was designed for mobile application which would give considerably high accuracy in a short time (Huang et al., 2016). Higher accuracy of detection model might be used but it would take significant amount of time to accomplish the detection task which was not realistic in this study. Table 3.1: Pretrained Models (Adapted from Tensorflow, 2018) Speed
COCO
(ms)
mAP [^1]
ssd_mobilenet_v1_coco
30
21
ssd_mobilenet_v2_coco
31
22
ssd_inception_v2_coco
42
24
faster_rcnn_inception_v2_coco
58
28
faster_rcnn_resnet50_coco
89
30
faster_rcnn_resnet50_lowproporsals_coco
64
rfcn_resnet101_coco
92
30
faster_rcnn_resnet101_coco
106
32
faster_rfcn_resnet101_lowproporsals_coco
82
faster_rcnn_inception_resnet_v2_atrous_coco
620
faster_rcnn_inception_resnet_v2_atrous_lowproporsals_coco
241
faster_rcnn_nas
1833
faster_rcnn_nas_lowproporsals_coco
540
mask_rcnn_inception_resnet_v2_atrous_coco
771
36
mask_rcnn_inception_v2_coco
79
25
Model Name
28
37
43
For simplicity, the detection was done on a Jupyter Notebook to ensure all the installations had been completed. The results of the detection are shown in Section 4.2. The detailed procedures for setting up the system to accomplish the object detection are attached in Appendix A.
3.8 Dataset Pre-processing
The dataset pre-processing step included converting human-readable data, such as images, into computer-readable binary data. In this project, the images were converted to the TFRecord binary format as the input for the training and testing processes.
3.8.1 Collection of Dataset
According to Francis (2017), the ideal number of training images is in the range of 100 to 300 images. Before the training process, a total of 300 images, comprising 150 images each of adults and of children, was collected from Google Images (Figure 3.5(a) & (b)). Persons of different genders, age groups and accessories were chosen to increase the variety of the data. All the images were downloaded in JPEG format. The dataset was divided according to a ratio of 9:1 into the training and testing datasets. Table 3.2 summarises the distribution of training and testing images.
Table 3.2: Distribution of Training and Testing Dataset

Number of images | Training | Testing
Adult | 135 | 15
Child | 135 | 15
Figure 3.5(a): Sample Adult Images
Figure 3.5(b): Sample Child Images
3.8.2 Annotation
Annotation is the process of labelling the image data and saving the labels into an XML file. The software involved was labelImg, a graphical image annotation tool which supports the Python programming language. The XML file carries information about the image data such as the image name, the class name (child or adult) and the Region of Interest (ROI) of the class. Figure 3.6(a) shows the process of labelling an image with the class name (child) and Figure 3.6(b) shows the XML file corresponding to the labelling process. Once the XML files for all 300 images were successfully generated, they were divided into two folders, namely 'test' and 'train'.
Figure 3.6: Annotation using LabelImg: (a) Labelling image; (b) XML file
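The XML files produced by labelImg follow the structure that is parsed in Appendix B. As a short sketch of reading those fields back in Python (the file name child_001.xml is only an illustrative example, not an actual file from this project):

import xml.etree.ElementTree as ET

tree = ET.parse('child_001.xml')                 # hypothetical annotation file from labelImg
root = tree.getroot()
print(root.find('filename').text)                # name of the annotated image
for obj in root.findall('object'):
    print(obj.find('name').text)                 # class name: 'child' or 'adult'
    box = obj.find('bndbox')
    print(box.find('xmin').text, box.find('ymin').text,
          box.find('xmax').text, box.find('ymax').text)  # ROI coordinates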
3.8.3 Binarization
In this context, binarization refers to converting the image data and its annotations into a binary file format. The 300 XML files were first converted into a single CSV file (refer to Appendix B for the XML-to-CSV conversion code) before being further converted into binary data. Tensorflow has its own binary data format called TFRecord (refer to Appendix C for the full code). The binary data was fed as the input data during the training process of the model.
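The training configuration (Appendix E) also references a label map file, training/object-detection.pbtxt, which maps each class name to an integer id. Its exact contents are not reproduced in this report; a minimal sketch of generating it, assuming 'child' is assigned id 1 and 'adult' id 2, is shown below:

# Sketch only: the id-to-name assignment is an assumption and must match the
# ids used when the TFRecord files are created (Appendix C).
label_map = """item {
  id: 1
  name: 'child'
}
item {
  id: 2
  name: 'adult'
}
"""
with open('training/object-detection.pbtxt', 'w') as f:
    f.write(label_map)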
3.9 Training
The training process was run in the command window by setting the Python path and changing the working directory to models/research/object_detection, as shown in Figure 3.7. The training process was started by typing the following command: python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config
Figure 3.7: Command Window

3.9.1 Compute Loss

During the training phase, the module named 'train.py' was run in the command window. The details of the train module are attached in Appendix D. In the command, train_dir indicates the directory where the results of training, including the total loss, model checkpoints, weights and biases, were recorded. On the other hand, pipeline_config_path indicates the path to the configuration file, whose contents include the name and directory of the pretrained model, the directories of the train and test datasets, the batch size, the fine-tune checkpoint and so on. Details of the configuration file are attached in Appendix E. Figure 3.8 shows the loss values as the training process proceeded.
Figure 3.8: Loss Values

The performance of the retrained model was evaluated by the total loss recorded at each training step. All the data was displayed through Tensorboard by typing the following command: tensorboard --logdir=training/
3.10 Testing

The testing was done by extracting 10 images randomly from the 30 images in the 'test' folder and comparing the detections with the ground truth class to get the precision for each class (child and adult). The overall precision was computed by averaging the precisions of the adult and child classes. The testing results gave an indication of the performance of the model and determined which model checkpoint to export as the image classifier. The following command was run to evaluate the precision by category and the overall precision at the model checkpoints:
python eval.py --logtostderr --checkpoint_dir=training/ --eval_dir=data/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config
3.10.1 Compute Accuracy and Evaluate Performance of Model Checkpoints
Figure 3.9: Running Evaluation Command

Besides the loss graphs, the performance of the retrained model, which could classify adults and children, was evaluated by running the eval.py module. The module code is attached in Appendix F. In the command shown in Figure 3.9, checkpoint_dir provides the directory of the latest checkpoint of the training phase, from which the system retrieves the information needed to evaluate the accuracy of the model. eval_dir is the directory where the evaluation results are stored.
Figure 3.10: The Computed Accuracy for Model Checkpoint at Step 19613
Figure 3.10 shows the computed accuracy for the model checkpoint at step 19613. During evaluation, 10 images from the test data were picked randomly and the output for each was estimated, that is, whether the picture shows a child or an adult. The accuracy achievable by the retrained model was evaluated by Tensorflow. During the testing phase, 10 images from the test dataset were extracted randomly for detection. The Mean Average Precision (mAP) was computed automatically and shown in Tensorboard by typing the following command: tensorboard --logdir=data/
The graphs of mAP measured by category (child and adult), as well as the overall mAP, are depicted in Section 4.5.
3.11 Export Retrained Model
The retrained model was exported by running the 'export_inference_graph.py' module (attached in Appendix G) from GitHub. The command used to run the module is shown below:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path training/ssd_mobilenet_v1_pets.config --trained_checkpoint_prefix training/model.ckpt-xxxxx --output_directory child_vs_adult_inference_graph

The module first reads the commands in the configuration file, namely ssd_mobilenet_v1_pets.config. trained_checkpoint_prefix indicates the name of the model checkpoint to be exported, where xxxxx is the step number of the chosen checkpoint. output_directory creates a new folder named child_vs_adult_inference_graph containing the exported model. The precision of the retrained model increased with the number of training steps. When the computed accuracy exceeded 95% and the loss was less than 1, the model checkpoint was exported as the child detection model, which acted as the classifier of the child detection system.
3.12 Completion of child detection system
The child detection system consisted of a camera as the input, a classifier and a triggering system, as shown in Figure 3.11.
Figure 3.11: Child Detection System (camera input, classifier and triggering system in sequence)

3.12.1 Types of input
The camera acted as a media device to capture images of children and adults once the system was activated. The input could be an image, a video stream or real-time detection, with slight modifications to the programming code (Appendix H). In this project, real-time detection was chosen for simplicity of design. The Open Source Computer Vision Library (OpenCV) was imported to accomplish the real-time detection. OpenCV is designed for computational efficiency with a strong focus on real-time applications. It has C++, Python and Java interfaces and supports Windows, Linux, Mac OS and Android (OpenCV Library, 2018).
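A minimal sketch of the real-time capture loop using OpenCV (simplified from the full input code in Appendix H) is shown below; in the actual system each frame is passed to the classifier before being displayed:

import cv2

video = cv2.VideoCapture(0)          # index 0 selects the default built-in webcam
while True:
    ret, frame = video.read()        # grab one frame from the camera
    if not ret:
        break
    # in the full system the frame is fed to the detection graph here
    cv2.imshow('Child detection input', frame)
    if cv2.waitKey(1) == ord('q'):   # press 'q' to stop the stream
        break
video.release()
cv2.destroyAllWindows()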
3.12.2 Classifier
The classifier is composed of the exported retrained model, which achieved a desirable accuracy in differentiating adults and children. An exported model consists of a graph proto (graph.pbtxt), a checkpoint (model.ckpt.data-00000-of-00001, model.ckpt.index, model.ckpt.meta) and a frozen graph proto with the weights fitted into the graph as constants (frozen_inference_graph.pb), which functions to detect the desired classes (child and adult) and give bounding boxes as output for the detected classes. These files correspond directly to a config file in the samples/configs directory, often with a modified score threshold. In the case of the heavier Faster R-CNN models, a version of the model that uses a highly reduced number of proposals for speed is also provided.
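As a brief sketch of how the exported frozen graph is loaded as the classifier (this mirrors the input code in Appendix H; the folder name is the output directory chosen in Section 3.11):

import tensorflow as tf

PATH_TO_CKPT = 'child_vs_adult_inference_graph/frozen_inference_graph.pb'

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())   # read the frozen graph and weights
        tf.import_graph_def(od_graph_def, name='')
sess = tf.Session(graph=detection_graph)           # session reused for every detection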
3.12.3 Triggering System
The triggering system consisted of an audio response and a visual response. The system drew a bounding box around a detected person, labelled with the class name. In this project, there were two classes of images to be detected, namely adult and child. Along with the class name, the probability score of the detected class was also shown. The time interval needed for each detection while the program was running was printed in the Python IDLE for ease of data collection. An audio response was added to further alert the users. It was achieved by importing the speech package in Python. A conditional (if-else) statement was used to respond differently to the target detected, that is, either a child or an adult.
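A minimal sketch of the audio response logic, assuming the speech package exposes a say() function as used in this project (the score threshold of 0.8 below is illustrative only):

import speech   # text-to-speech package used for the audio alert

def announce(class_name, score, threshold=0.8):
    # only give an audio alert for sufficiently confident detections
    if score < threshold:
        return
    if class_name == 'child':
        speech.say('Child detected')
    else:
        speech.say('Adult detected')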
3.13 Experimental Evaluation of Child Detection System
The performance of the detection system was evaluated based on 3 experiments, which measured the following:
Response time
Maximum distance of detection
Precision
3.13.1 Response Time
The real-time input was fed to the system. A child was assigned to sit at a distance of 50 cm from the camera. Once the system was activated, a stopwatch was started to record the time needed for the classifier to trigger the system. The experiment was repeated for the next candidate. Three children aged 2, 5 and 8 years old and 3 adults were involved in the experiment. The results are tabulated in Table 4.3.
3.13.2 Maximum distance of detection
Figure 3.12: Experiment Setup Diagram

The purpose of this experiment was to determine the limiting distance of detection between a person and the camera. A candidate was assigned to sit at a starting distance of 30 cm from the camera, and the distance was increased in increments of 10 cm until the system failed to detect the presence of the candidate. The experiment was repeated for the next candidate. Six children aged 1, 2, 3, 5, 5 and 8 years old and 3 adults were involved in the experiment.
3.13.3 Precision
In order to evaluate the precision of the system, each candidate was assigned to sit at a constant distance of 50 cm from the webcam for detection. The candidates comprised 5 adults and 5 children. The system was run 5 times for each candidate, and the detections and scores were recorded in Section 4.6.3.
3.14 Project Management

Table 3.3: Gantt Chart for FYP 1 and FYP 2 (Weeks 1 to 14 of each semester)

Tasks scheduled: FYP Briefing/Seminar, Project Proposal, Search for Relevant Topics, Writing Introduction, Gathering Information, Writing Literature Review, Prepare Methodology, System Development, Conduct Experiments, Collect and Compile Data, Writing Results and Discussions, Writing Conclusion and Recommendations, Project Presentation, Submit Draft, Submit Final Report.
CHAPTER 4
RESULT AND DISCUSSION
4.1 Introduction
This chapter discusses the outcomes of the methodology in Chapter 3. It starts with the results of image classification using the pretrained model, SSD Mobilenet v1, and its comparison with the retrained model. Then, the loss graph recorded during training is shown and the result is discussed. This is followed by the model evaluation results, whereby the computed accuracy is depicted in the form of graphs and the improvement of accuracy with training is shown. Lastly, the experimental results of the detection system are presented in the form of images and tables.
4.2 Object Detection on Jupyter Notebook

There were 2 images (as shown in Figure 4.1) saved in the 'Test Images' folder when the master model was downloaded from the Tensorflow Object Detection API (GitHub). The model was able to detect 91 classes of objects. When object_detection_tutorial.ipynb was run, bounding boxes were drawn on the detected objects with the class name and the confidence level of each detected class.
Figure 4.1: Results of Object Detection
4.3 Detection with SSD Mobilenet v1
Figure 4.2 shows the image used for the detection. In Figure 4.3(a), SSD Mobilenet v1 was able to detect both the adult and the child as a single category, 'person'. Besides, other objects such as a dog were also detected. After SSD Mobilenet v1 was retrained and exported, the model was fine-tuned to detect two categories only, namely adult and child (Figure 4.3(b)). The size of the bounding box was limited to the face only because, during annotation, the region around the face was annotated as the desired category (child or adult).
Figure 4.2: Sample Image for Detection
Figure 4.3: Detection Outcome based on: (a) SSD Mobilenet_v1; (b) Retrained model
4.4 Loss Graph
Loss values are an indication of model performance: the lower the loss value, the better the model. The loss is the target function that the optimization algorithm tries to minimize. Figure 4.4 shows the loss graph recorded during the training process.
Figure 4.4: Loss Graph during Training

The loss value at the beginning of training reached 7.00. It dropped significantly from step 1 to step 1000. From step 1000 onwards, the loss values ranged from 1.00 to 3.00, and later decreased to less than 1.00, with occasional fluctuations reaching 3.00. Model checkpoints with loss values less than 1 were not exported when their precision had not reached 0.95 or above. The total number of iterations was 20000 steps. At step 19608, the loss was 0.9042, and the model checkpoint at this step was exported as the retrained model.
Tensorflow used the cross-entropy error to evaluate the quality of the model during training. A computer reads an image in the form of a numerical array, for example [0, 1, 1]. Cross-entropy applies the negative log-likelihood to compute the error between the ground truth data (the data labelled during annotation) and the computed data (James, 2016). The general equation for the cross-entropy (CE) function is:

$$\mathrm{CE}(\hat{y}, y) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right] \qquad \text{Eqn. 1}$$

where $y$ is the ground truth data, $\hat{y}$ is the computed probability for each class (child and adult) and $n$ is the number of classes in the training (Fortuner, 2018). Table 4.1 shows an example of applying the cross-entropy function to compute the loss.
Table 4.1: Sample Ground Truth and Computed Data

Class | Ground Truth | Computed
Child | [1, 0] | [0.1, 0.9]
Adult | [0, 1] | [0.8, 0.2]

Cross-entropy computation (using base-10 logarithms):

$$\mathrm{CE}_{child} = -\left[1\log(0.1) + 0\log(0.9)\right] = 1.000$$
$$\mathrm{CE}_{adult} = -\left[0\log(0.8) + 1\log(0.2)\right] = 0.699$$

Loss value for the model at that particular step:

$$\frac{\sum_{i=1}^{n} \mathrm{CE}_i}{n} = \frac{1.000 + 0.699}{2} = 0.8495$$
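The worked example above can be checked numerically with a few lines of Python (base-10 logarithms are used to match the figures in Table 4.1):

import math

# ground truth and computed probabilities from Table 4.1
ce_child = -(1 * math.log10(0.1) + 0 * math.log10(0.9))   # = 1.000
ce_adult = -(0 * math.log10(0.8) + 1 * math.log10(0.2))   # = 0.699
loss = (ce_child + ce_adult) / 2                           # = 0.8495
print(round(ce_child, 4), round(ce_adult, 4), round(loss, 4))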
From Equation 1, the CE value decreases when the predicted probability for the true class is high. Hence, the lower the loss value, the better the performance of the model.
4.5 Model Evaluation
During the annotation phase, the images in JPEG format were labelled with the class name and a bounding box with its dimensions (Section 3.8.2); these labels are termed the ground truth data. In computer vision, ground truth data includes a set of images and a set of labels on the images, and defines a model for object recognition (Krig, 2014). When a checkpoint at a particular step was saved, testing images were fed to the model for detection and predicted bounding boxes were drawn. The accuracy was measured by computing the area of overlap, $a_o$, also called the Intersection over Union (IoU), between the ground truth bounding box and the predicted bounding box (Everingham et al., n.d.), as shown in Figure 4.5.
Figure 4.5: Relationship between the Predicted Bounding Box, $B_p$, and the Ground Truth Bounding Box, $B_{gt}$ (illustrating the union $B_p \cup B_{gt}$ and the intersection $B_p \cap B_{gt}$)
$$a_o = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \qquad \text{Eqn. 2}$$
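As an illustration of Equation 2, the IoU between two axis-aligned boxes given as [xmin, ymin, xmax, ymax] can be computed with a short Python function; the coordinates below are made up purely for demonstration:

def iou(box_a, box_b):
    """Intersection over Union of two boxes in [xmin, ymin, xmax, ymax] format."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)               # overlapping area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# example: ground truth box vs predicted box (illustrative values only)
print(iou([50, 50, 150, 150], [60, 60, 160, 160]))          # about 0.68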
The model was evaluated in terms of accuracy by computing the mean average precision (mAP). Han, Kamber and Pei (2012, p. 366) define the following terms:

True positives (TP): positive tuples that were correctly labelled by the classifier.
True negatives (TN): negative tuples that were correctly labelled by the classifier.
False positives (FP): negative tuples that were incorrectly labelled as positive.
False negatives (FN): positive tuples that were mislabelled as negative.

The precision of a classifier was computed as the ratio of true positives to the sum of true positives and false positives, as shown in Equation 3:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Eqn. 3}$$
The accuracy of the model was measured per class, by considering the detections labelled with a particular class in the ground truth annotation. In evaluation using the Pascal Visual Object Classes (VOC) protocol, the default IoU threshold is 0.5; any detection with an IoU greater than 0.5 was considered a hit, and the Average Precision (AP) of each class over the testing images was computed (Everingham et al., 2009). The AP was further averaged over the whole testing dataset and represented as a single score, called the mean Average Precision (mAP). Figures 4.6 and 4.7 show the performance of the model checkpoints for the adult and child datasets.
Figure 4.6: Performance of Model Checkpoints by Category (Adult)
Figure 4.7: Performance of Model Checkpoints by Category (Child)
The overall performance of each model checkpoint, comprising the adult and child classes, was evaluated by averaging the mAP by category for that checkpoint. Figure 4.8 shows the mAP of the model checkpoints.
Figure 4.8: Mean Average Precision (mAP) of Model Checkpoints
The precision of the exported model checkpoint was 0.977 for adult and 0.961 for child, giving an overall precision of 0.969. In the theoretical model evaluation, two criteria (loss and precision) were used to rate the performance of the model. The loss function is a measure of the accuracy of the model during the training process, whereas the precision is a measure of the consistency of the trained model in registering the same output for the same input. Precision is expressed as a fraction (Equation 3) in the range 0 < precision < 1: the higher the precision, the higher the ability of the model to register the same output in response to the same input. Both criteria play a significant role in evaluating the quality of the trained model. The smaller the loss, the smaller the error between the ground truth data and the computed data, and the higher the accuracy of the model. The precision for adult (0.977) was higher than for child (0.961), which indicates that the model could identify adults more consistently than children. Table 4.2 below shows the results of testing at step 19608, where 2 adult and 2 child images selected from the 10 testing images at step 19608 were extracted for discussion. The accuracy of the model increased with the number of steps.
Table 4.2: Result of Testing at Step 19608 (the Results column of the original table contains the corresponding detection images)

No | Description
1 | The model correctly interpreted the person in the image as an adult with a score of 99%.
2 | The model identified the person in the image as both adult and child. It was considered a failed detection.
3 | The model identified the person in the image as a child, with a score of 99%. It was considered a correct detection.
4 | The person in the image was interpreted as a child by the model with a confidence of 99%. It was a correct detection.
4.6 Experimental results
3 simple experiments were carried out to evaluate the performance of the child detection system. The results are tabulated below.
4.6.1 Response Time
Figure 4.9: Setup of Experiment (candidate seated 50 cm from the camera)
Table 4.3 shows the duration (in seconds) needed for the detection system to give the desired output once the module was run. From the table, it can be seen that the response times for the children were relatively long compared to the adults. This was due to the real-time input: children tend to move around, which caused some delay in detection.

Table 4.3: Response Time for 6 Candidates

Candidate | Time, t (seconds)
Adult_1 | 9.46
Adult_2 | 9.30
Adult_3 | 9.64
Child_1 | 17.94
Child_2 | 13.56
Child_3 | 14.84
Figure 4.10: Response Time for 6 Candidates (bar chart of response time in seconds for each adult and child candidate)
Average response time = (9.46 + 9.30 + 9.64 + 17.94 + 13.56 + 14.84) / 6 = 12.46 s
4.6.2 Maximum Distance of Detection

The data collected from the experiments are shown in Tables 4.4 and 4.5. The maximum detectable distance differed between candidates. For child detection, the maximum distances detectable by the system were, in sequence, 130 cm, 120 cm, 100 cm, 90 cm, 140 cm and 90 cm. For adult detection, the maximum distances detectable by the system were, in sequence, 80 cm, 140 cm and 160 cm. Both the shortest and the longest detectable distances were achieved by adults, at 80 cm and 160 cm respectively. Taking the average distance:

(80 + 90 + 90 + 100 + 120 + 130 + 140 + 140 + 160) / 9 = 117 cm

Hence, the system was able to detect a target at an average distance of 117 cm. Besides, the detection scores also decreased with distance, which means the sensitivity was affected by the distance between the camera and the target. In addition, there were 3 false detections of a child as an adult, with scores of 50%, 68% and 70%. To overcome this problem, the minimum score threshold was increased from 0.5 to 0.9 in order to reduce false detections.
Table 4.4: Child Detection with Varying Distances Distance (cm)
Detection Child_1
Child_2
Child_3
Child_4
Child_5
Child_6
(Age: 8)
(Age: 5)
(Age: 5)
(Age: 3)
(Age:2)
(Age: 1)
99
83
99
99
99
92
99
98
99
99
99
82
30
Score (%)
40
Score (%)
49
50
Score (%)
99
50
99
99
99
99
99
86
99
97
83
94
99
93
97
89
71
97
60
Score (%)
70
Score (%)
50
80
Score (%)
99
88
98
93
94
89
95
63
98
52
91
72
90
Score (%)
100
Score (%)
Not Detected
96
71
70
Not Detected
99
51
110
Score (%)
Not Detected
97
70
120
Score (%)
130
Score (%)
Not Detected
68
Not Detected
Not Detected
99
Not Detected
84
Not Detected
Not Detected
94
Not Detected
63
Not Detected
Not Detected
65
52
140
Not Detected
Not Detected
Not Detected
Not Detected
Score (%) 150
Not Detected
63 All failed to detect
53
Table 4.5: Adult Detection with Varying Distances Distance (cm)
Adult_1
Adult_2
Adult_3
99
99
99
99
99
99
98
99
99
81
99
97
30
Score (%)
40
Score (%)
50
Score (%)
60
Score (%)
70
54
Score (%)
78
97
99
Score (%)
65
97
99
90
Not Detected
77
99
94
99
92
99
75
99
80
Score (%)
100
Not Detected
Score (%)
110
Not Detected
Score (%)
120
Not Detected
Score (%)
55
130
Not Detected
Score (%)
140
98
60
62
Not Detected
Score (%)
150
76
Not Detected
Not Detected
Score (%)
160
68
Not Detected
Not Detected
Score (%) 170
92 All failed to detect
4.6.3 Precision
According to National Instruments (2006), precision is defined as the degree of reproducibility of a measurement. Precision can also be expressed in terms of consistency, namely the ability of the device to register the same reading upon repeated measurements. Figures 4.11 and 4.12 show the output of the results on a laptop screen.
Figure 4.11: Results of Child Detection
Figure 4.12: Results of Adult Detection
Tables 4.6 and 4.7 show the class names (child and adult) and the scores of the detections for the 5 children and 5 adults. The results for Child_1, Child_3, Child_5, Adult_2, Adult_3 and Adult_4 were very consistent: all 5 tests on these candidates gave a confidence level of 99%. There were slight variations in the scores for Child_4 and Adult_1. Both were detected correctly by the system with a highest score of 99%, whereas the lowest scores achieved were 94% (Child_4) and 97% (Adult_1) respectively. There were 2 candidates (Child_2 and Adult_5) with false detections. Child_2 was interpreted by the system as an adult with confidence levels of 53% and 50% during the second and the fifth tests. All the tests for Adult_5 failed except the third test, which gave a score of 88%. The system generated false detections on Adult_5 with scores of 96%, 97%, 83% and 93% in sequence.
Table 4.6: Reproducibility Test of the Child Detection

No. of test | Child_1 Detect / Score (%) | Child_2 Detect / Score (%) | Child_3 Detect / Score (%) | Child_4 Detect / Score (%) | Child_5 Detect / Score (%)
1 | Yes / 99 | Yes / 95 | Yes / 99 | Yes / 99 | Yes / 99
2 | Yes / 99 | No / 53 | Yes / 99 | Yes / 97 | Yes / 99
3 | Yes / 99 | Yes / 92 | Yes / 99 | Yes / 94 | Yes / 99
4 | Yes / 99 | Yes / 75 | Yes / 99 | Yes / 99 | Yes / 99
5 | Yes / 99 | No / 50 | Yes / 99 | Yes / 98 | Yes / 99
Table 4.7: Reproducibility Test of the Adult Detection

No. of test | Adult_1 Detect / Score (%) | Adult_2 Detect / Score (%) | Adult_3 Detect / Score (%) | Adult_4 Detect / Score (%) | Adult_5 Detect / Score (%)
1 | Yes / 99 | Yes / 99 | Yes / 99 | Yes / 99 | No / 96
2 | Yes / 98 | Yes / 99 | Yes / 99 | Yes / 99 | No / 97
3 | Yes / 97 | Yes / 99 | Yes / 99 | Yes / 99 | Yes / 88
4 | Yes / 96 | Yes / 99 | Yes / 99 | Yes / 99 | No / 83
5 | Yes / 99 | Yes / 99 | Yes / 99 | Yes / 99 | No / 93
Confusion Matrix
A confusion matrix is a useful tool to measure the performance of a classification model (Han, Kamber & Pei, 2012). It is built on binary outcomes, such as yes or no, true or false, correct or wrong. Table 4.8 shows a sample confusion matrix with the relevant terminology.
Table 4.8: Terminology in Confusion Matrix

Predicted Class | Actual Class: Yes | Actual Class: No
Yes | True Positive | False Positive
No | False Negative | True Negative
The child detection system consisted of a two-class classification model. As such, two confusion matrices by category (child and adult) were created. 'True' refers to a correct prediction, whereas 'false' refers to an incorrect prediction; for example, a child interpreted as an adult by the classifier is a false detection. 'Positive' refers to predicted outcomes belonging to the class of the confusion matrix, whereas 'negative' refers to predicted outcomes that do not belong to the class of the confusion matrix. For example, in the confusion matrix for child detection, all outcomes predicted as child are positive, whereas outcomes predicted as adult are negative, regardless of the actual class of the candidate.

Table 4.9: Confusion Matrix of Child
Predicted Class | Actual Class: Child | Actual Class: Non-child
Child | 23 | 4
Non-child | 2 | 21

Precision (Child) = TP / (TP + FP) = 23 / (23 + 4) = 0.852

Table 4.10: Confusion Matrix of Adult

Predicted Class | Actual Class: Adult | Actual Class: Non-adult
Adult | 21 | 2
Non-adult | 4 | 23

Precision (Adult) = TP / (TP + FP) = 21 / (21 + 2) = 0.913

Table 4.11: Theoretical and Experimental Precision

 | Theoretical | Experimental
Performance by category (Child) | 0.961 | 0.852
Performance by category (Adult) | 0.977 | 0.913
Overall precision, (Child + Adult) / 2 | 0.969 | 0.883
From the calculation, the precision for adult was higher than the precision for child. This indicates that the model could classify adults more consistently than children, which agrees with the theoretical values, since the precision for adult was also higher than for child there. The overall precision of the detection system was computed by taking the average of the precisions of both classes. The percentage error was computed by applying Equation 4:

$$\text{Percentage error} = \frac{\text{Theoretical} - \text{Experimental}}{\text{Theoretical}} \times 100\% \qquad \text{Eqn. 4}$$

$$\frac{0.969 - 0.883}{0.969} \times 100\% = 8.88\%$$
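The experimental figures above can be reproduced with a short Python check; the counts are taken from Tables 4.9 and 4.10, and the comments show the values as reported in the text:

# counts from the confusion matrices (Tables 4.9 and 4.10)
precision_child = round(23 / (23 + 4), 3)            # 0.852
precision_adult = round(21 / (21 + 2), 3)            # 0.913
overall = (precision_child + precision_adult) / 2    # about 0.8825, reported as 0.883

theoretical = 0.969
error = (theoretical - 0.883) / theoretical * 100    # about 8.88 %
print(precision_child, precision_adult, overall, round(error, 2))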
4.7 Discussion
In this project, the training dataset was derived from Google Images and was therefore taken with many different cameras, and the resolutions of the images also varied. The performance of the model can be improved by increasing the number of training images and the number of iterations during the training phase. The model precision can be further improved if the training data and the target data are both derived from the same digital camera (Weiss, Khoshgoftaar, & Wang, 2016). During the implementation of the detection system, the input was derived from the laptop's built-in camera, which affected the output of the detection system. Besides, the high theoretical precision (0.969) achieved during the testing phase (refer to Table 4.11) arose because both the testing and training datasets were derived from the same source.
4.8 Sources of error
There were a few sources of error in the experiments. The experiments were conducted at different places, so the brightness was not constant. In addition, parallax error occurred while measuring the distance for detection, as the position of the observer's eyes was not perpendicular to the scale of the measuring tape. At the same time, the candidates' feet were not exactly on the calibration line as they moved from one distance to the next. As a result, there was a difference between the desired distance and the actual distance.
CHAPTER 5
CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusions
To summarise, a framework consisting of a webcam, a classifier and a triggering system was developed. It can act as a benchmark for the development of a heat stroke detection system in the future. Besides, an image recognition system that was able to respond to the presence of a child was also developed. The child detection system was able to reduce false detections under different expressions and orientations. Theoretically, the retrained model could achieve a precision of 0.969; however, the experimental results showed a precision of 0.883, giving an error of 8.88%. Based on the 9 candidates involved in the maximum-distance-of-detection experiment, the average detectable distance was 117 cm, with 80 cm as the shortest and 160 cm as the longest detectable distance. The longest response time among the 6 candidates involved was 17.94 seconds (child) and the shortest was 9.30 seconds (adult).
5.2 Recommendations
Further improvement is needed to adapt the system to real-life applications. The precision of the system needs to be improved in order to produce a more reliable device; this can be done by increasing the number of iteration steps and the quality of the training dataset. The use of a Graphics Processing Unit (GPU) is recommended to speed up the training process, which can save considerable time. The triggering system can be improved by adding a wireless alarm, such as a Bluetooth connection to mobile devices, so that caregivers can be alerted from a distance when a child is trapped inside a car.
REFERENCES
Barto, A., & Dietterich, T. (2004). Reinforcement learning and its relationship to supervised learning. Handbook of Learning and Approximate Dynamic Programming, 47–64. https://doi.org/10.1002/9780470544785.ch2 Buchanan, B. G. (2006). A (Very) Brief History of Artificial Intelligence. AI Magazine, 26(4), 53–60. https://doi.org/10.1609/AIMAG.V26I4.1848 Chin, C. S., Si, J., Clare, A. S., & Ma, M. (2017). Intelligent Image Recognition System for Marine Fouling Using Softmax Transfer Learning and Deep Convolutional Neural Networks. Complexity, 2017. https://doi.org/10.1155/2017/5730419 Devikar, P. (2016). Transfer Learning for Image Classification of various dog breeds. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 5(12). Dietterich, T., & Michalski, R. S. (1983). A Comparative Review of Selected Methods for Learning From Examples. Machine Learning: An Artificial Intelligence Approach. https://doi.org/10.1007/978-3-662-12405-5 Everingham, M., Gool, L., Williams, C., Winn, J., & Zisserman, A. (2009). International Journal of Computer Vision. The PASCAL Visual Object Classes (VOC) Challenge, 88, 303-338. doi:10.1007/s11263-009-0275-4 Fortuner, B. (2018). Loss Functions. Retrieved from http://ml-cheatsheet. readthedocs.io/en/latest/loss_functions.html Francis, J. (2017, October 25). Object detection with TensorFlow. Retrieved from https://www.oreilly.com/ideas/object-detection-with-tensorflow Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques. Amsterdam etc.: Elsevier
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Murphy, K. (2016). Speed/accuracy trade-offs for modern convolutional object detectors. https://doi.org/10.1109/CVPR.2017.351 Jain, A. K., & Mao, J. (1996). Artificial Neural Network: A Tutorial. Communications, 29, 31–44. https://doi.org/10.1109/2.485891 James, D. (2016, December 14). Why You Should Use Cross-Entropy Error Instead Of Classification Error Or Mean Squared Error For Neural Network Classifier Training. Retrieved from https://jamesmccaffrey.wordpress.com/2013/11/05/whyyou-should-use-cross-entropy-error-instead-of-classification-error-or-meansquared-error-for-neural-network-classifier-training/ Jana, R., Datta, D., & Saha, R. (2013). Age Group Estimation using Face Features, 3(2), 130–134. Kidsandcars. (2017). Children Left in Cars and Heat Stroke. Retrieved December 19, 2017, from http://www.kidsandcars.org/how-kids-get-hurt/heat-stroke/ Krig, S. (2014). Ground Truth Data, Content, Metrics, and Analysis. Computer Vision Metrics, 283–311. https://doi.org/10.1007/978-1-4302-5930-5_7 Kwon, Y. H., & da Vitoria Lobo, N. (1994). Age classification from facial images. Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94., 1994 IEEE Computer Society Conference on, 74(1), 762–767. https://doi.org/10.1006/cviu.1997.0549 Learning, M., & Learning, M. (1999). Large Margin Classification Using the Perceptron Algorithm. Machine Learning - The Eleventh Annual Conference on Computational Learning Theory, 37(3), 277–296. https://doi.org/10.1023/A:1007662407062 Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436– 444. https://doi.org/10.1038/nature14539 LeCun, Y., Yoshua, B., & Geoffrey, H. (2015). Deep Learning. Nature, 521, 436-444. doi:10.1038/nature14539
Lent, C. S. (2013). Learning to Program with MATLAB Building GUI Tools. John Wiley and Sons Inc., 310. https://doi.org/10.1007/s13398-014-0173-7.2 Luger, G. (2009). Artificial Intelligence (6th ed.). Boston, MA: Pearson Education, Inc. Luppescu, G., & Romero, F. (n.d.). Comparing Deep Learning and Conventional Machine Learning for Authorship Attribution and Text Generation, 1–9. Mayo Clinic. (2017). Heatstroke. Retrieved September 19, 2017, from https://www. mayoclinic.org/diseases-conditions/heat-stroke/symptoms-causes/syc20353581 McLaren, C. (2005). Heat Stress From Enclosed Vehicles: Moderate Ambient Temperatures Cause Significant Temperature Rise in Enclosed Vehicles. Pediatrics, 116(1), e109–e112. https://doi.org/10.1542/peds.2004-2368 Michie, D., Spiegelhalter, D. J., Taylor, C. C., Michie, E. D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, 37(4), xiv, 289 . https://doi.org/10.2307/1269742 Mnih, V., Silver, D., & Riedmiller, M. (n.d.). Dqn. Nips, 1–9. https://doi.org/10.1038/nature14236 Muller, D. (2017, October 3). Hyundai New Rear-Seat Reminder [Digital image]. Retrieved 2017, from https://blog.caranddriver.com/hyundais-new-rear-seatreminder-is-actually-a-little-different-from-nissans-and-gms/ National Instruments. (2006). Sensor Terminology. Retrieved from http://www.ni.com/ white-paper/14860/en/ Negnevitsky, M. (2002). Artificial Intelligence: A Guide to Intelligent Systems (2nd ed.). Harlow, Essex: Pearson Education Limited. Nielsen, M. A. (2015). Neural Networks and Deep Learning. Retrieved 2018, from http://neuralnetworksanddeeplearning.com/chap5.html 65
Newswire, M. (2017, October 4). Hyundai Motor Announces New Rear Occupant Alert Reducing Child Heat Hazards. Retrieved November 10, 2017, from https://www.multivu.com/players/English/75060517-hyundai-rear-occupantalert/ Null, J. (2017). Heatstroke Deaths of Children in Vehicles. Retrieved October 19, 2017, from http://noheatstroke.org/responsible.htm OpenCV library. (2018). Retrieved from https://opencv.org/ O’Riordan , A. P. (2005). An overview of neural computing. Retrieved 2017, from http://www.cs.ucc.ie/~adrian/cs5201/NeuralComputingI.htm Popescu, M. C., Balas, V. E., Perescu-Popescu, L., & Mastorakis, N. (2009). Multilayer Perceptron and Neural Networks. WSEAS Transactions on Circuits and Systems, 8(7), 579–588. Schlosser, K. (2016, March 23). Sense A Life [Digital image]. Retrieved 2017, from https://www.geekwire.com/2016/device-kids-hot-cars/ Shanmugamani, R. (n.d.). Deep Learning for Computer Vision Expert techniques to train advanced neural networks using TensorFlow and Keras. Swaroop, C. (2003). A Byte of Python. A Byte of Python, 92, 110. https://doi.org/10. 1016/S0043-1354(00)00471-1 Tamkin, A., & Usiri, I. (2013). Deep CNNs for Diabetic Retinopathy Detection, 1–6. Tan, P.N., Steinbach, M., & Kumar, V. (2005a). Chap 8 : Cluster Analysis: Basic Concepts and Algorithms. Introduction to Data Mining, Chapter 8. https://doi.org/10.1016/0022-4405(81)90007-8 Tan, P. N., Steinbach, M., & Kumar, V. (2005b). Association Analysis: Basic Concepts and Algorithms. Introduction to Data Mining, 327–414. https://doi.org/10.1111/j.1600-0765.2011.01426.x Tapas, A. (2016). Transfer Learning for Image Classification and Plant Phenotyping, 5(11), 2664–2669. 66
Tensorflow. (2018). Tensorflow/models. Retrieved from https://github.com/tensorflow/ models/blob/master/research/object_detection/g3doc/ detection_model_zoo.md Thukral, P., Mitra, K., & Chellappa, R. (2012). A hierarchical approach for human age estimation. Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, (March), 1529–1532. https://doi.org/10.1109/ICASSP.2012.6288182 Weinberger, K., Blitzer, J., & Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems, 18, 1473. https://doi.org/10.1007/978-3-319-13168-9_33 Weiss, K., Khoshgoftaar, T. M., & Wang, D. D. (2016). A survey of transfer learning. Journal of Big Data (Vol. 3). Springer International Publishing. https://doi.org/10.1186/s40537-016-0043-6 Xia, X., & Nan, B. (2017). Inception-v3 for Flower Classification, 783–787. Zhong, J., Peniak, M., Tani, J., Ogata, T., & Cangelosi, A. (2016). Sensorimotor Input as a Language Generalisation Tool: A Neurorobotics Model for Generation and Generalisation of Noun-Verb Combinations with Sensorimotor Inputs, (May), 1– 23. Retrieved from http://arxiv.org/abs/1605.03261
APPENDIX A: Tensorflow Object Detection API
1. Install Python 3.6.3 Download Link: https://www.python.org/downloads/release/python-363/ Version: Windows x86 executable zip file 2. Set variable named “PYTHONPATH” to operating system Directory: Advanced system setting> Environment Variables> System Variables> New… Variable name: PYTHONPATH Variable value: C:/Users/Owner/Desktop/models-master/research; C:/Users/Owner/Desktop/models-master/research/slim 3. Download dependencies of using Tensorflow Object Detection API In Command Window:
4. Download Tensorflow Object Detection API Download Link: https://github.com/tensorflow/models
Download zip file and extract to Desktop
5. Download Protocol Buffers (Protobuf) Download Link: https://github.com/google/protobuf/releases Version: protoc-3.5.1-win32.zip Extract the zip file to Desktop. 6. Running Protobuf Compilation Sample compilation of ssd.proto file
Directory: C:/Users/Owner/Desktop/models-master/research/object_detection/protos Before Compilation
After Compilation
7. Run Detection on Jupyter notebook
Click on object_detection_tutorial.ipynb on browser and run the object detection demo. The results are shown in Figure 4.1.
APPENDIX B: XML to CSV Conversion

import os
import glob
import pandas as pd
import xml.etree.ElementTree as ET

def xml_to_csv(path):
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text))
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

def main():
    for directory in ['train', 'test']:
        image_path = os.path.join(os.getcwd(), 'images/{}'.format(directory))
        xml_df = xml_to_csv(image_path)
        xml_df.to_csv('data/{}__labels.csv'.format(directory), index=None)
        print('Successfully converted xml to csv.')

main()  # run the conversion for both dataset splits
APPENDIX C: TF Record Conversion from __future__ import division from __future__ import print_function from __future__ import absolute_import import os import io import pandas as pd import tensorflow as tf from PIL import Image from object_detection.utils import dataset_util from collections import namedtuple, OrderedDict flags = tf.app.flags flags.DEFINE_string('csv_input', '', 'Path to the CSV input') flags.DEFINE_string('output_path', '', 'Path to output TFRecord') FLAGS = flags.FLAGS def split(df, group): data = namedtuple('data', ['filename', 'object']) gb = df.groupby(group) return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)] def create_tf_example(group, path): with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid: encoded_jpg = fid.read() encoded_jpg_io = io.BytesIO(encoded_jpg) image = Image.open(encoded_jpg_io) width, height = image.size filename = group.filename.encode('utf8') image_format = b'jpg' xmins = [] xmaxs = [] ymins = [] ymaxs = [] classes_text = [] classes = [] for index, row in group.object.iterrows(): xmins.append(row['xmin'] / width) xmaxs.append(row['xmax'] / width) ymins.append(row['ymin'] / height) ymaxs.append(row['ymax'] / height) classes_text.append(row['class'].encode('utf8')) classes.append(class_text_to_int(row['class'])) tf_example = tf.train.Example(features=tf.train.Features(feature={ 'image/height': dataset_util.int64_feature(height), 'image/width': dataset_util.int64_feature(width), 'image/filename': dataset_util.bytes_feature(filename), 73
'image/source_id': dataset_util.bytes_feature(filename), 'image/encoded': dataset_util.bytes_feature(encoded_jpg), 'image/format': dataset_util.bytes_feature(image_format), 'image/object/bbox/xmin': dataset_util.float_list_feature(xmins), 'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs), 'image/object/bbox/ymin': dataset_util.float_list_feature(ymins), 'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs), 'image/object/class/text': dataset_util.bytes_list_feature(classes_text), 'image/object/class/label': dataset_util.int64_list_feature(classes), })) return tf_example def main(_): writer = tf.python_io.TFRecordWriter(FLAGS.output_path) path = os.path.join(os.getcwd(), 'images') examples = pd.read_csv(FLAGS.csv_input) grouped = split(examples, 'filename') for group in grouped: tf_example = create_tf_example(group, path) writer.write(tf_example.SerializeToString()) writer.close() output_path = os.path.join(os.getcwd(), FLAGS.output_path) print('Successfully created the TFRecords: {}'.format(output_path)) if __name__ == '__main__': tf.app.run()
APPENDIX D: Training Code import functools import json import os import tensorflow as tf from google.protobuf import text_format from object_detection import trainer from object_detection.builders import input_reader_builder from object_detection.builders import model_builder from object_detection.protos import input_reader_pb2 from object_detection.protos import model_pb2 from object_detection.protos import pipeline_pb2 from object_detection.protos import train_pb2 tf.logging.set_verbosity(tf.logging.INFO) flags = tf.app.flags flags.DEFINE_string('master', '', 'BNS name of the TensorFlow master to use.') flags.DEFINE_integer('task', 0, 'task id') flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy per worker.') flags.DEFINE_boolean('clone_on_cpu', False, 'Force clones to be deployed on CPU. Note that even if ' 'set to False (allowing ops to run on gpu), some ops may ' 'still be run on the CPU if they have no GPU kernel.') flags.DEFINE_integer('worker_replicas', 1, 'Number of worker+trainer ' 'replicas.') flags.DEFINE_integer('ps_tasks', 0, 'Number of parameter server tasks. If None, does not use ' 'a parameter server.') flags.DEFINE_string('train_dir', '', 'Directory to save the checkpoints and training summaries.') flags.DEFINE_string('pipeline_config_path', '', 'Path to a pipeline_pb2.TrainEvalPipelineConfig config ' 'file. If provided, other configs are ignored') flags.DEFINE_string('train_config_path', '', 'Path to a train_pb2.TrainConfig config file.') flags.DEFINE_string('input_config_path', '', 'Path to an input_reader_pb2.InputReader config file.') flags.DEFINE_string('model_config_path', '', 'Path to a model_pb2.DetectionModel config file.') FLAGS = flags.FLAGS def get_configs_from_pipeline_file(): pipeline_config = pipeline_pb2.TrainEvalPipelineConfig() 75
with tf.gfile.GFile(FLAGS.pipeline_config_path, 'r') as f: text_format.Merge(f.read(), pipeline_config) model_config = pipeline_config.model train_config = pipeline_config.train_config input_config = pipeline_config.train_input_reader return model_config, train_config, input_config def get_configs_from_multiple_files(): train_config = train_pb2.TrainConfig() with tf.gfile.GFile(FLAGS.train_config_path, 'r') as f: text_format.Merge(f.read(), train_config) model_config = model_pb2.DetectionModel() with tf.gfile.GFile(FLAGS.model_config_path, 'r') as f: text_format.Merge(f.read(), model_config) input_config = input_reader_pb2.InputReader() with tf.gfile.GFile(FLAGS.input_config_path, 'r') as f: text_format.Merge(f.read(), input_config) return model_config, train_config, input_config def main(_): assert FLAGS.train_dir, '`train_dir` is missing.' if FLAGS.pipeline_config_path: model_config, train_config, input_config = get_configs_from_pipeline_file() else: model_config, train_config, input_config = get_configs_from_multiple_files() model_fn = functools.partial( model_builder.build, model_config=model_config, is_training=True) create_input_dict_fn = functools.partial( input_reader_builder.build, input_config) env = json.loads(os.environ.get('TF_CONFIG', '{}')) cluster_data = env.get('cluster', None) cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None task_data = env.get('task', None) or {'type': 'master', 'index': 0} task_info = type('TaskSpec', (object,), task_data) ps_tasks = 0 worker_replicas = 1 worker_job_name = 'lonely_worker' task = 0 is_chief = True master = '' if cluster_data and 'worker' in cluster_data: # Number of total worker replicas include "worker"s and the "master". worker_replicas = len(cluster_data['worker']) + 1 if cluster_data and 'ps' in cluster_data: ps_tasks = len(cluster_data['ps']) 76
if worker_replicas > 1 and ps_tasks < 1: raise ValueError('At least 1 ps task is needed for distributed training.') if worker_replicas >= 1 and ps_tasks > 0: # Set up distributed training. server = tf.train.Server(tf.train.ClusterSpec(cluster), protocol='grpc', job_name=task_info.type, task_index=task_info.index) if task_info.type == 'ps': server.join() return worker_job_name = '%s/task:%d' % (task_info.type, task_info.index) task = task_info.index is_chief = (task_info.type == 'master') master = server.target trainer.train(create_input_dict_fn, model_fn, train_config, master, task, FLAGS.num_clones, worker_replicas, FLAGS.clone_on_cpu, ps_tasks, worker_job_name, is_chief, FLAGS.train_dir) if __name__ == '__main__': tf.app.run()
APPENDIX E: Configuration File model { ssd { num_classes: 2 box_coder { faster_rcnn_box_coder { y_scale: 10.0 x_scale: 10.0 height_scale: 5.0 width_scale: 5.0 } } matcher { argmax_matcher { matched_threshold: 0.5 unmatched_threshold: 0.5 ignore_thresholds: false negatives_lower_than_unmatched: true force_match_for_each_row: true } } similarity_calculator { iou_similarity { } } anchor_generator { ssd_anchor_generator { num_layers: 6 min_scale: 0.2 max_scale: 0.95 aspect_ratios: 1.0 aspect_ratios: 2.0 aspect_ratios: 0.5 aspect_ratios: 3.0 aspect_ratios: 0.3333 } } image_resizer { fixed_shape_resizer { height: 300 width: 300 } } box_predictor { convolutional_box_predictor { min_depth: 0 78
max_depth: 0 num_layers_before_predictor: 0 use_dropout: false dropout_keep_probability: 0.8 kernel_size: 1 box_code_size: 4 apply_sigmoid_to_scores: false conv_hyperparams { activation: RELU_6, regularizer { l2_regularizer { weight: 0.00004 } } initializer { truncated_normal_initializer { stddev: 0.03 mean: 0.0 } } batch_norm { train: true, scale: true, center: true, decay: 0.9997, epsilon: 0.001, } } } } feature_extractor { type: 'ssd_mobilenet_v1' min_depth: 16 depth_multiplier: 1.0 conv_hyperparams { activation: RELU_6, regularizer { l2_regularizer { weight: 0.00004 } } initializer { truncated_normal_initializer { stddev: 0.03 mean: 0.0 } } batch_norm { train: true, scale: true, 79
center: true, decay: 0.9997, epsilon: 0.001, } } } loss { classification_loss { weighted_sigmoid { anchorwise_output: true } } localization_loss { weighted_smooth_l1 { anchorwise_output: true } } hard_example_miner { num_hard_examples: 3000 iou_threshold: 0.99 loss_type: CLASSIFICATION max_negatives_per_positive: 3 min_negatives_per_image: 0 } classification_weight: 1.0 localization_weight: 1.0 } normalize_loss_by_num_matches: true post_processing { batch_non_max_suppression { score_threshold: 1e-8 iou_threshold: 0.6 max_detections_per_class: 100 max_total_detections: 100 } score_converter: SIGMOID } } } train_config: { batch_size: 12 optimizer { rms_prop_optimizer: { learning_rate: { exponential_decay_learning_rate { initial_learning_rate: 0.004 decay_steps: 800720 decay_factor: 0.95 } } 80
momentum_optimizer_value: 0.9 decay: 0.9 epsilon: 1.0 } } fine_tune_checkpoint: "ssd_mobilenet_v1_coco_11_06_2017/model.ckpt" from_detection_checkpoint: true num_steps: 2000 data_augmentation_options { random_horizontal_flip { } } data_augmentation_options { ssd_random_crop { } } } train_input_reader: { tf_record_input_reader { input_path: "data/train.record" } label_map_path: "training/object-detection.pbtxt" } eval_config: { num_examples: 1000 # Note: The below line limits the evaluation process to 10 evaluations. # Remove the below line to evaluate indefinitely. max_evals: 30 } eval_input_reader: { tf_record_input_reader { input_path: "data/test.record" } label_map_path: "training/object-detection.pbtxt" shuffle: true num_readers: 1 }
APPENDIX F: Model Evaluation Code import functools import tensorflow as tf import logging from google.protobuf import text_format from object_detection import evaluator from object_detection.builders import input_reader_builder from object_detection.builders import model_builder from object_detection.protos import eval_pb2 from object_detection.protos import input_reader_pb2 from object_detection.protos import model_pb2 from object_detection.protos import pipeline_pb2 from object_detection.utils import label_map_util tf.logging.set_verbosity(tf.logging.INFO) flags = tf.app.flags flags.DEFINE_boolean('eval_training_data', False, 'If training data should be evaluated for this job.') flags.DEFINE_string('checkpoint_dir', '', 'Directory containing checkpoints to evaluate, typically ' 'set to `train_dir` used in the training job.') flags.DEFINE_string('eval_dir', '', 'Directory to write eval summaries to.') flags.DEFINE_string('pipeline_config_path', '', 'Path to a pipeline_pb2.TrainEvalPipelineConfig config ' 'file. If provided, other configs are ignored') flags.DEFINE_string('eval_config_path', '', 'Path to an eval_pb2.EvalConfig config file.') flags.DEFINE_string('input_config_path', '', 'Path to an input_reader_pb2.InputReader config file.') flags.DEFINE_string('model_config_path', '', 'Path to a model_pb2.DetectionModel config file.') FLAGS = flags.FLAGS def get_configs_from_pipeline_file(): pipeline_config = pipeline_pb2.TrainEvalPipelineConfig() with tf.gfile.GFile(FLAGS.pipeline_config_path, 'r') as f: text_format.Merge(f.read(), pipeline_config) model_config = pipeline_config.model if FLAGS.eval_training_data: eval_config = pipeline_config.train_config else: eval_config = pipeline_config.eval_config input_config = pipeline_config.eval_input_reader return model_config, eval_config, input_config
def get_configs_from_multiple_files() eval_config = eval_pb2.EvalConfig() with tf.gfile.GFile(FLAGS.eval_config_path, 'r') as f: text_format.Merge(f.read(), eval_config) model_config = model_pb2.DetectionModel() with tf.gfile.GFile(FLAGS.model_config_path, 'r') as f: text_format.Merge(f.read(), model_config) input_config = input_reader_pb2.InputReader() with tf.gfile.GFile(FLAGS.input_config_path, 'r') as f: text_format.Merge(f.read(), input_config) return model_config, eval_config, input_config def main(unused_argv): assert FLAGS.checkpoint_dir, '`checkpoint_dir` is missing.' assert FLAGS.eval_dir, '`eval_dir` is missing.' if FLAGS.pipeline_config_path: model_config, eval_config, input_config = get_configs_from_pipeline_file() else: model_config, eval_config, input_config = get_configs_from_multiple_files() model_fn = functools.partial( model_builder.build, model_config=model_config, is_training=False) create_input_dict_fn = functools.partial( input_reader_builder.build, input_config) label_map = label_map_util.load_labelmap(input_config.label_map_path) max_num_classes = max([item.id for item in label_map.item]) categories = label_map_util.convert_label_map_to_categories( label_map, max_num_classes) evaluator.evaluate(create_input_dict_fn, model_fn, eval_config, categories, FLAGS.checkpoint_dir, FLAGS.eval_dir) logging.basicConfig(level=logging.INFO) if __name__ == '__main__': tf.app.run()
APPENDIX G: Export Inference Graph import tensorflow as tf from google.protobuf import text_format from object_detection import exporter from object_detection.protos import pipeline_pb2 slim = tf.contrib.slim flags = tf.app.flags flags.DEFINE_string('input_type', 'image_tensor', 'Type of input node. Can be ' 'one of [`image_tensor`, `encoded_image_string_tensor`, ' '`tf_example`]') flags.DEFINE_string('pipeline_config_path', None, 'Path to a pipeline_pb2.TrainEvalPipelineConfig config ' 'file.') flags.DEFINE_string('trained_checkpoint_prefix', None, 'Path to trained checkpoint, typically of the form ' 'path/to/model.ckpt') flags.DEFINE_string('output_directory', None, 'Path to write outputs.') FLAGS = flags.FLAGS def main(_): assert FLAGS.pipeline_config_path, '`pipeline_config_path` is missing' assert FLAGS.trained_checkpoint_prefix, ( '`trained_checkpoint_prefix` is missing') assert FLAGS.output_directory, '`output_directory` is missing' pipeline_config = pipeline_pb2.TrainEvalPipelineConfig() with tf.gfile.GFile(FLAGS.pipeline_config_path, 'r') as f: text_format.Merge(f.read(), pipeline_config) exporter.export_inference_graph( FLAGS.input_type, pipeline_config, FLAGS.trained_checkpoint_prefix, FLAGS.output_directory)
APPENDIX H: Input Code import os import cv2 import numpy as np import tensorflow as tf import sys sys.path.append("..") from utils import label_map_util from utils import visualization_utils as vis_util from utils.label_map_util import load_labelmap as ll MODEL_NAME = 'child_vs_adult_inference_graph_19608' #Image Input IMAGE_NAME = 'adult_138.jpg' #Video Input VIDEO_NAME = 'xxxx.mp4' CWD_PATH = os.getcwd() PATH_TO_CKPT = os.path.join(CWD_PATH,MODEL_NAME,'frozen_inference_graph.pb') PATH_TO_LABELS = os.path.join(CWD_PATH,'training','object-detection.pbtxt') #Image Input PATH_TO_IMAGE = os.path.join(CWD_PATH,IMAGE_NAME) #Video Input PATH_TO_VIDEO = os.path.join(CWD_PATH,VIDEO_NAME) NUM_CLASSES = 2 label_map = label_map_util.load_labelmap(PATH_TO_LABELS) categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True) category_index = label_map_util.create_category_index(categories) detection_graph = tf.Graph() with detection_graph.as_default(): od_graph_def = tf.GraphDef() with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid: serialized_graph = fid.read() od_graph_def.ParseFromString(serialized_graph) tf.import_graph_def(od_graph_def, name='') sess = tf.Session(graph=detection_graph) image_tensor = detection_graph.get_tensor_by_name('image_tensor:0') detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0') detection_scores = detection_graph.get_tensor_by_name('detection_scores:0') detection_classes = detection_graph.get_tensor_by_name('detection_classes:0') num_detections = detection_graph.get_tensor_by_name('num_detections:0') #Image Input image = cv2.imread(PATH_TO_IMAGE) image_expanded = np.expand_dims(image, axis=0) (boxes, scores, classes, num) = sess.run( 85
[detection_boxes, detection_scores, detection_classes, num_detections], feed_dict={image_tensor: image_expanded}) #Video Input video = cv2.VideoCapture(PATH_TO_VIDEO) while(video.isOpened()): ret, frame = video.read() frame_expanded = np.expand_dims(frame, axis=0) (boxes, scores, classes, num) = sess.run( [detection_boxes, detection_scores, detection_classes, num_detections], feed_dict={image_tensor: frame_expanded}) #Real-time Input video = cv2.VideoCapture(0) ret = video.set(3,1000) ret = video.set(4,800) while(True): ret, frame = video.read() frame_expanded = np.expand_dims(frame, axis=0) (boxes, scores, classes, num) = sess.run( [detection_boxes, detection_scores, detection_classes, num_detections], feed_dict={image_tensor: frame_expanded}) vis_util.visualize_boxes_and_labels_on_image_array( image, np.squeeze(boxes), np.squeeze(classes).astype(np.int32), np.squeeze(scores), category_index, use_normalized_coordinates=True, line_thickness=8, min_score_thresh=0.80) #Image Input cv2.imshow('Child_vs_Adult_Detector', image) cv2.waitKey(0) cv2.destroyAllWindows() #Video Input cv2.imshow('Child_vs_Adult_Detector', frame) if cv2.waitKey(1) == ord('q'): break video.release() cv2.destroyAllWindows() #Real-time Input cv2.imshow('Object detector', frame) if cv2.waitKey(1) == ord('q'): break video.release() cv2.destroyAllWindows()