Gender Prediction of the European School’s Teachers Using Machine Learning: Preliminary Results

1st Chaman Verma
Department of Media and Educational Informatics, Eötvös Loránd University, Budapest, Hungary
[email protected]

2nd Ahmad S. Tarawneh
Department of Algorithms and Their Applications, Eötvös Loránd University, Budapest, Hungary
[email protected]

3rd Zoltán Illés
Department of Media and Educational Informatics, Eötvös Loránd University, Budapest, Hungary

4th Veronika Stoffová
Faculty of Education, Trnava University, Trnava, Slovakia

5th Sanjay Dahiya
Ch. Devi Lal State Institute of Engineering & Technology, Sirsa, India
Abstract—An experimental study is conducted to solve a binary classification problem on the large dataset of the European Survey of Schools: ICT in Education (ESSIE) using IBM SPSS Modeler version 18.1. The survey covered schools at ISCED (International Standard Classification of Education) levels 1 to 3. To predict the gender of teachers from their answers, the authors applied four supervised machine learning algorithms, filtered out of 12 candidate classifiers by the Auto Classifier, to the ISCED-1 and ISCED-2 level school datasets. Out of 158 attributes in total, self-reduction and the Auto Classifier retained 134 attributes for the Bayesian Network (BN) and Random Tree (RT) at level 1, and 134 attributes for Logistic Regression (LR) and 41 attributes for the Decision Tree (C5) at level 2. The MissingValue filter of the Weka 3.8.1 tool handled 55641 missing values at the ISCED-2 level and 19415 at the ISCED-1 level, and normalization was applied as well. The outcomes of the study reveal that the decision tree (C5) classifier outperformed logistic regression (LR) after feature extraction at ISCED-2 level schools, and that the Random Tree classifier predicted the gender of the teacher more accurately than the Bayesian Network at level-1 schools. The presented predictive models use 134 attributes with 2926 instances to predict the gender of teachers at level-1 schools and 134 attributes with 7542 instances at level-2 schools.
Index Terms—Classification, supervised machine learning, sensitivity, teacher gender prediction.
I. INTRODUCTION
In 2011, the European Commission conducted a large-scale survey that collected over 190,000 completed questionnaires from students, teachers and head teachers in 27 countries to analyze the use of Information and Communication Technology (ICT) at ISCED level-1 (primary level of education), ISCED level-2 (lower secondary level of education) and ISCED level-3 (upper secondary level of education), distinguishing level-3A (academic) from level-3B (vocational) [1]. The authors use the teacher datasets belonging to ISCED level-1 and ISCED level-2 to predict gender from the responses provided. Data mining is the process of sorting through large data sets to identify patterns and establish
978-1-5386-6678-4/18/$31.00 ©2018 IEEE
relationships that help solve a variety of problems through data analysis. Every year the data mining research community addresses new open problems and new problem areas, for many of which data mining can provide value-added answers and results [2]. Machine learning research aims to automatically learn to recognize complex patterns and make intelligent decisions based on data [3]. Supervised learning assumes that the training examples are classified (labeled with class labels), and predictive modeling forecasts a target or dependent attribute from the values of other attributes [2]. The C5.0 classifier produces a decision tree, with rules, whose splits provide the maximum information gain at each level, and it requires the response variable to be categorical [4, 5, 6]. A Bayesian Network encodes probabilistic associations among different variables [7]; it consists of several variables and a set of edges between the variables, forming a directed acyclic graph [8]. Depending on the training dataset, it can give high classification accuracy; more discussion is available in [9, 10, 11]. Logistic regression has been applied to develop a model for the early and reliable prediction of the pass or fail status of undergraduate students [12]. The random tree classifier is an improvement over classification and regression trees that comes with a bagging feature and is well suited to binary classification problems [13]. According to [14], the Auto Classifier estimates and compares models for either nominal (set) or binary (yes/no) targets, using many different methods. Classification is an important task of data mining; it is a supervised class prediction technique. The main goal is to accurately predict the class of each data item [15], provided that a sufficient number of classes is available. In the past, classification has been applied in several fields of research such as terrorism prediction, finance, weather prediction and medicine.
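As an illustration of the information-gain criterion that the text attributes to C5.0, the following minimal Python sketch computes the gain of a single categorical attribute. The toy answers and labels are hypothetical, and the sketch shows only the plain entropy-based criterion; it is not the C5.0 implementation used by IBM SPSS Modeler.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one coded survey answer per teacher and the gender label.
answers = [1, 1, 2, 2, 3, 3, 3, 1]
gender = ["F", "F", "F", "M", "M", "M", "F", "F"]
gain = information_gain(answers, gender)
print(f"information gain of the toy attribute: {gain:.3f} bits")
```

A tree builder of this kind would evaluate the gain of every candidate attribute and split on the one with the largest value.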
A classification task can be binary or multiclass; binary classification is the task of assigning instances to one of two groups based on a classification rule
[16]. Binary logistic regression has been applied to predict the gender of European school students with 62% accuracy [17]. Various supervised machine learning classifiers such as SVM, RF and decision trees have also been applied to predict the gender and residence state of students [18], [19]. IBM SPSS Modeler is a set of data mining tools that enable users to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making [5]. In the present paper, the authors solve a binary classification problem on the European school dataset to predict the gender of teachers using the supervised machine learning classifiers BN, C5 tree, Random Tree and Logistic Regression. The present study may help to predict further demographic features of teachers, such as age, locality and expertise, from the answers they provide in any ICT survey held in an educational institution. The experimental study is organized into the following sections.
II. METHODS AND TECHNIQUES FOR STABILIZATION OF ATTRIBUTES
A. Dataset
More than 2500 schools from 27 European countries participated in the survey held in 2011 (primary and lower secondary levels of education). An experimental study is conducted with IBM SPSS Modeler 18.1 to predict the gender of teachers from the answers about information and communication technology that they provided during the survey. The authors trained models on two datasets collected by ESSIE, funded by the European Commission Information Society and Media Directorate General, and available online [1]. The ISCED-1 level dataset has 3088 instances, 158 attributes and 19415 missing values; the ISCED-2 level dataset has 7897 instances, 158 attributes and 55641 missing values. The Gender attribute has three classes: 1 (Female), 2 (Male) and X (Misplaced). Hence, the 355 instances belonging to the X category were removed manually; this count matches the reduction of the ISCED-2 dataset from 7897 to 7542 instances. The authors considered only gender as the target attribute.
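The manual removal of the X category and the missing-value count described above can be sketched in a few lines of Python. The records and field names (gender, q1, q2) are hypothetical stand-ins; the actual layout of the ESSIE files is not reproduced here.

```python
# Hypothetical records: field names and values are illustrative only.
rows = [
    {"gender": "1", "q1": 3, "q2": None},
    {"gender": "2", "q1": None, "q2": 1},
    {"gender": "X", "q1": 2, "q2": 4},
    {"gender": "1", "q1": 4, "q2": 2},
]


def count_missing(records):
    """Number of missing (None) cells across all records."""
    return sum(value is None for record in records for value in record.values())


def drop_misplaced(records, target="gender", misplaced="X"):
    """Keep only records whose target class is not the misplaced category."""
    return [record for record in records if record[target] != misplaced]


kept = drop_misplaced(rows)
removed = len(rows) - len(kept)
print(f"removed {removed} record(s), {count_missing(kept)} missing value(s) remain")

# Instance counts reported in the paper for the ISCED-2 dataset.
assert 7897 - 7542 == 355
```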
The attributes of the dataset cover experience with ICT, ICT access, support for ICT use, ICT-based activities, ICT materials and obstacles to ICT use, learning activities, teacher skills, and teacher opinions and attitudes. The scales of measurement were of mixed type (nominal, ordinal, interval and categorical). The answers given by teachers were coded as numeric values.
B. Preprocessing
Before using a dataset, it is essential to improve the data quality [20]. Several techniques are used for data preprocessing [21], such as aggregation, sampling, dimension reduction, variable transformation, and dealing with missing values. To obtain a stable, good-quality dataset, the Weka version 3.8.1 tool is applied to the two files belonging to the ISCED-1 and ISCED-2 levels. The MissingValue filter counted 55641 missing values in the ISCED-2 level dataset and 19415 missing values in the ISCED-1 level dataset. According to [21, 22], missing data values need to be handled. The missing values are handled with the ReplaceMissingValues filter, which replaces missing values with the mean and mode values
of the whole dataset. The dataset is rescaled to the range 0 to 1 using the Normalize filter, which normalized all data except the target attribute gender. In both datasets, the attributes numbered from 1 to 11 and from 36 to 147 are removed using the self-reduction method, as they are index and mean values. Further, the Auto Classifier in IBM SPSS Modeler version 18.1 is used to select the significant variables and the best classifiers for training and testing. After feature reduction, 134 attributes with 2926 instances are selected for the Bayesian network and random tree at ISCED level-1, and 134 attributes with 7542 instances are selected for logistic regression, with 41 attributes retained for the decision tree, at ISCED level-2. Fig. 1 shows the schematic diagram of the research after feature extraction, with the classifiers used to predict the gender of the teacher.
C. Classifiers
The filtered dataset is trained using the Auto Classifier with 10-fold cross-validation. The discard policy of the Auto Classifier is set to less than 80% accuracy with 0.8 AUC (Area Under the Curve). Out of 12 machine learning classifiers in total, the Auto Classifier suggested the two best models, C5 and LR, for the ISCED-2 level school dataset and the two best models, BN and RT, for the ISCED-1 level school dataset. Therefore, to predict the target variable based on 134 predictors, four supervised machine learning algorithms (C5, RT, BN and LR) are selected by applying the Auto Classifier.
D. Performance Evaluation
The experimental results of the presented models are evaluated using the following six performance measures:
(a) Classification accuracy: the number of correct predictions of teacher gender divided by the number of all predictions.
(b) Error: the share of predictions in which the target class is misclassified.
(c) AUC: the area under the ROC curve, which is also an appropriate measure of the accuracy of a model.
(d) ROC: the receiver operating characteristic curve gives a graphical evaluation of a model; it plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 - specificity) on the x-axis at various thresholds.
(e) Coincidence matrix: reflects the performance of a classification model on a set of test data for which the true values are known.
(f) Gini: calculated by subtracting the sum of the squared probabilities of each class from one; it measures the inequality among the values of a frequency distribution. The Gini values reported in Table II appear consistent with the AUC-based Gini coefficient, 2 x AUC - 1.
To evaluate the results, the Analysis node of IBM SPSS Modeler is applied to obtain the accuracy, Gini, AUC and coincidence matrix, and the Evaluation node is applied to produce the ROC curves.
E. Optimal Frequency Distribution
The Auto Classifier is a powerful technique that selects the best-performing classifiers for a given dataset. The authors trained on each dataset with cross-validation, testing 12 different supervised machine learning classifiers with the Auto Classifier. A threshold of 80% is applied to the overall accuracy. It can also be seen that the Auto Classifier selected the C5 tree and logistic regression as the two best classifiers
8th International Advance Computing Conference (IACC).
out of 12 machine learning classifiers to predict the gender of the teacher at the ISCED-2 level.
Fig. 1: Stabilizing Features
Fig. 2 (a) shows the best predictive distribution of gender based on the Random tree and the Bayesian network, which reach 96.27% and 89.71% accuracy, respectively, over 134 attributes. Fig. 2 (b) shows the optimal distribution of teacher gender at the ISCED-2 level based on the C5 tree, with 82.76% accuracy over 41 attributes after feature reduction, and on logistic regression, with 81.65% accuracy over 134 attributes. On the one hand, the distribution models show that at the ISCED-1 level the Random tree achieves a higher prediction accuracy than the Bayesian network; on the other hand, the C5 tree outperforms logistic regression in gender prediction at the ISCED-2 level.
F. Experimental Results, Analysis, and Evaluation for ISCED Level-1 Schools
Based on the optimal distribution, the 2926 instances of the ISCED-1 level dataset are tested with the Bayesian network, including one target and 134 predictors. Fig. 3 shows the importance of each predictor for predicting the gender of the teacher. It can be seen from the network that the dark blue circles have the highest importance values, such as 1.0, 0.8 and 0.6. All 134 predictors contribute more than 0.4 percent in the Bayesian network. It is found that the Bayesian network performed well in classifying gender on this dataset. The graph board node is a powerful tool that takes an executable predictive model and produces classification tables. Fig. 4 shows the classification table of teacher gender based on the
Bayesian network for ISCED level-1. It is clear from the classification table that the Bayesian network accurately classified 2302 female teachers, and only a small number of females went to the male category. For male prediction, a small number of teachers (181) fall into the female category. Fig. 5 shows the classification table of the random tree for ISCED level-1 teachers. The random tree classified female teachers even more accurately (2402), and only a tiny number of males (28) are misclassified. This classifier also performed well for male gender prediction. Further, to validate the presented models and classification tables, ROC curves are built using the Evaluation node of IBM SPSS Modeler. The ROC curve plots the true positive (TP) rate, which is the sensitivity of the model, on the Y-axis against the false positive (FP) rate, which equals 1 - specificity, on the X-axis. The sensitivity of the model is the proportion of actual females that are correctly predicted as female, and the specificity is the proportion of actual males that are correctly predicted as male. It can be seen from Fig. 6 that the Bayesian network shows an increasing TP rate for females, which starts at 0.36 and reaches a maximum of 0.99 as the threshold varies. The TP rate (sensitivity) is 0.90 with a 0.19 FP rate at a threshold of 0.2, and 0.99 with a 0.7 FP rate at a threshold of 0.8, which indicates the significance of the predictive model. Further, Fig. 7 shows the validation of the random tree model using ROC, with a TP rate that starts at 0.54 and ends at 0.99 as the threshold changes. At a threshold of 0.1, the sensitivity is high (0.97) and the FP rate is 0.07, which also indicates the significance of the model. As thresholds
Fig. 2: (a) Best predictive distribution based on Random tree and Bayesian network at ISCED Level-1 using the Auto Classifier (b) Best predictive distribution based on C5 tree and logistic regression at ISCED Level-2 using the Auto Classifier
Fig. 3: Significance of predictors for teacher gender
reach 0.4, the sensitivity is 0.98 and the FP rate is 0.18. Therefore, the random tree outperformed the Bayesian network in predicting teacher gender at the ISCED-1 level.
G. Experimental Results, Analysis, and Evaluation for ISCED Level-2 Schools
To predict gender at the ISCED-2 level, the Auto Classifier performed feature extraction and selected the decision tree (C5) with 41 attributes and 7542 instances and logistic regression (LR) with 134 attributes, both with 10-fold cross-validation. Fig. 8 shows the classification table of teacher gender based on logistic regression for ISCED level-2 teachers. It can be seen from the classification table that the LR classifier correctly classified 5649 female teachers and misclassified 270 male teachers, while 1114 females were wrongly assigned to the male category. In total, 509 males are correctly classified. The accuracy of the C5 tree classifier is improved by applying feature extraction with the Auto Classifier. Fig. 9 shows the classification table of teacher gender based on the C5 tree for ISCED level-2 teachers; according to Table I, C5 correctly predicted 5752 female and 490 male teachers, whereas LR correctly predicted 5649 female and 509 male teachers. Fig. 10 (a) shows that the sensitivity of the LR model increases with the threshold: the TP rate varies from 0.19 to 0.98. LR produced a TP rate (sensitivity) of 0.97 at a threshold of 0.85, which indicates the significance of the
Fig. 4: Gender classification by Bayesian Network of ISCED-1 level teachers
predictive model. From Fig. 10 (b), the C5 model built with 41 attributes and validated using ROC shows a TP rate of 0.84 at a cutoff of 0.8. Therefore, after feature extraction, the C5 tree outperformed LR in gender prediction at the ISCED-2 level.
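The accuracy, error and Gini figures that the paper reports can be cross-checked with a short Python sketch. The cell counts and AUC values are copied from Tables I and II; the function names are illustrative, and the relation Gini = 2 x AUC - 1 is an assumption that reproduces the reported Gini values to within rounding. Only the diagonal of each matrix is used, so the result does not depend on whether rows are read as actual or predicted classes.

```python
# Coincidence-matrix cells copied from Table I, in the order they are printed.
# Only the first and last cell (the diagonal) count as correct predictions.
matrices = {
    "BN-134 ISCED-1": (2302, 181, 120, 323),
    "RT-134 ISCED-1": (2402, 81, 28, 415),
    "C5-41 ISCED-2": (5752, 167, 1133, 490),
    "LR-134 ISCED-2": (5649, 270, 1114, 509),
}

# AUC values copied from Table II.
auc = {
    "BN-134 ISCED-1": 0.937,
    "RT-134 ISCED-1": 0.988,
    "C5-41 ISCED-2": 0.697,
    "LR-134 ISCED-2": 0.804,
}


def accuracy(cells):
    """Share of correct predictions: diagonal cells over all cells."""
    return (cells[0] + cells[3]) / sum(cells)


def gini_from_auc(value):
    """Assumed relation between the Gini coefficient and the AUC."""
    return 2 * value - 1


for name, cells in matrices.items():
    acc = accuracy(cells)
    print(f"{name}: accuracy={acc:.2%} error={1 - acc:.2%} "
          f"gini={gini_from_auc(auc[name]):.3f}")
```

Under these assumptions the computed accuracies agree with Table II, and the derived Gini values agree with the reported ones to within 0.001, the difference being attributable to the rounding of the AUC values.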
Fig. 5: Gender classification by Random Tree of ISCED-1 level teachers
Fig. 7: ROC of the Random tree at ISCED level-1
Fig. 8: Gender classification by LR of ISCED level-2 teachers
Fig. 6: ROC of the Bayesian network at ISCED level-1
III. COINCIDENCE MATRICES AND EVALUATION MATRICES
To produce the coincidence matrices and evaluation matrices, the Analysis node of IBM SPSS Modeler is used to evaluate the results. The results of the experiment, based on 10-fold cross-validation, are presented in Table I
which is a combination of the four coincidence matrices generated by the respective models. At ISCED level-2 schools, the maximum number of correct predictions for female teachers (5752) is provided by the C5 tree with 41 attributes, but it fails to correctly classify 167 male teachers. Subsequently, for LR with 134 attributes, the prediction ratio (correctly predicted/actual) for female teachers is 5649/5919 and for male teachers is
Fig. 9: Gender classification by C5 tree of ISCED level-2 teachers
509/1623. After feature extraction, the C5 tree outperformed LR in the prediction of gender. At ISCED level-1 schools, RT with 134 attributes correctly predicted 2402 female teachers out of 2483 and 415 male teachers out of 443 (Table I), whereas the BN classifier with 134 attributes correctly classified 2302 female and 323 male teachers. Therefore, RT performed best in predicting the gender of teachers at ISCED level-1 schools. Table II shows that the maximum accuracy at ISCED level-1 schools, 96.27%, is achieved by the random tree classifier with 134 attributes, and that the C5 tree with 41 attributes achieves the highest accuracy, 82.76%, for gender prediction at ISCED level-2 schools. The maximum area under the curve (AUC), 0.988, is obtained by RT with 134 attributes; it supports the overall accuracy of the model and lies above the benchmark of the ROC curve. Further, LR has a higher AUC (0.804) than C5 (0.697); this does not, however, mean that LR predicts better than C5 in terms of accuracy. After feature reduction, C5 provided higher accuracy than LR at ISCED level-2 schools. RT has the highest Gini value (0.976), which reflects its separation of male and female teachers, and the lowest misclassification error (3.73%).
IV. CONCLUSION
The presented paper has stabilized the attributes of the European Commission dataset to predict the gender of teachers at two levels of education, ISCED level-1 and ISCED level-2, using supervised machine learning classifiers. The experimental study is carried out with the help of the popular data
Fig. 10: (a) ROC of Logistic regression at ISCED Level-2 (b) ROC of C5 tree at ISCED Level-2 with 10-fold
mining tools Weka and IBM SPSS Modeler on a large dataset available online [1]. The Auto Classifier of Modeler suggested the four best classifiers to apply to the two datasets. During the first phase of the experimental study, RT and BN are trained and tested on the ISCED level-1 dataset with k-fold cross-validation to predict the gender of the teacher. During the second phase of the study, C5 and LR are applied to the ISCED level-2 dataset. At ISCED level-1, RT achieves the maximum accuracy (96.27%) with 134 attributes, compared with BN (89.71%). The C5 tree classifier obtains the highest accuracy (82.76%) at ISCED level-2 with 41 attributes, after feature extraction by the Auto Classifier. Binary logistic regression also achieves good accuracy (81.65%) with 134 attributes, giving a useful predictive model for ISCED level-2 schools.
ACKNOWLEDGMENT
The authors would like to thank the European Commission for providing the dataset online. The present study is funded by Eötvös Loránd University and sponsored by the Tempus Public Foundation, Budapest, Hungary.
Models                                Actual gender   Predicted F   Predicted M
Bayesian network (ISCED-1 level)      F               2302          181
                                      M               120           323
Random tree (ISCED-1 level)           F               2402          81
                                      M               28            415
C5 tree (ISCED-2 level)               F               5752          167
                                      M               1133          490
Logistic regression (ISCED-2 level)   F               5649          270
                                      M               1114          509
TABLE I: Coincidence matrices at 10-fold cross-validation

Models                     AUC     Gini    Accuracy   Error
BN-134 ISCED Level-1       0.937   0.874   89.71%     10.29%
RT-134 ISCED Level-1       0.988   0.976   96.27%     3.73%
C5 tree-41 ISCED Level-2   0.697   0.393   82.76%     17.24%
LR-134 ISCED Level-2       0.804   0.607   81.65%     18.35%
TABLE II: Evaluation matrices

REFERENCES
[1] European Commission, https://ec.europa.eu/digital-single-market/news/ict-education-essie-survey-smart-20100039. Accessed: 2018-02-14.
[2] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of rule learning. Springer Science & Business Media, 2012.
[3] G MeeraGandhi. Machine learning approach for attack prediction and classification using supervised learning algorithms. Int. J. Comput. Sci. Commun, 1(2), 2010.
[4] IBM Knowledge center, https://www.ibm.com/support/knowledgecenter/en/ss3ra7_15.0.0/com.ibm.spss.modeler.help/nodes_treebuilding.htm. Accessed: 2018-02-14.
[5] IBM Modeler Users Guide, ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.1/en/modelerusersguide.pdf. Accessed: 2018-08-18.
[6] Kemal Polat and Salih Güneş. A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems. Expert Systems with Applications, 36(2):1587–1592, 2009.
[7] Javed Ashraf and Seemab Latif. Handling intrusion and DDoS attacks in software defined networks using machine learning techniques. In Software Engineering Conference (NSEC), 2014 National, pages 55–60. IEEE, 2014.
[8] V Muralidharan and V Sugumaran. A comparative study of naïve Bayes classifier and Bayes net classifier for fault diagnosis of monoblock centrifugal pump using wavelet analysis. Applied Soft Computing, 12(8):2023–2029, 2012.
[9] Mark A Friedl and Carla E Brodley. Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment, 61(3):399–409, 1997.
[10] Jie Cheng and Russell Greiner. Comparing Bayesian network classifiers. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 101–108. Morgan Kaufmann Publishers Inc., 1999.
[11] Finn V Jensen. An introduction to Bayesian networks, volume 210. UCL Press London, 1996.
[12] Gerard JA Baars, Theo Stijnen, and Ted AW Splinter. A model to predict student failure in the first year of the undergraduate medical curriculum. Health Professions Education, 3(1):5–14, 2017.
[13] IBM, https://www.ibm.com/support/knowledgecenter/en/ss3ra7_17.1.0/modeler_mainhelp_client_ddita/clementine/rf_general.html. Accessed: 2018-05-06.
[14] IBM Watson, https://dataplatform.ibm.com/docs/content/spss-visualization/spss-viz-auto-classifier-overview.html?audience=dr&context=refinery. Accessed: 2018-05-06.
[15] Tulips Angel Thankachan and Kumudha Raimond. A survey on classification and rule extraction techniques for datamining. IOSR Journal of Computer Engineering, 8(5):75–78, 2013.
[16] Ghada M Tolan and Omar S Soliman. An experimental study of classification algorithms for terrorism prediction. International Journal of Knowledge Engineering-IACSIT, 1(2):107–112, 2015.
[17] Chaman Verma, Veronika Stoffová, Zoltán Illés, and Sanjay Dahiya. Binary logistic regression classifying the gender of student towards computer learning in European schools. In The 11th Conference of PhD Students in Computer Science, page 45, 2018.
[18] Chaman Verma, Ahmed S. Tarawneh, Veronika Stoffová, and Zoltán Illés. Forecasting residence state of Indian student based on responses towards information and communication technology awareness: A primarily outcomes using machine learning. In International Conference on Innovations in Engineering, Technology and Sciences. IEEE, in press, 2018.
[19] Chaman Verma, Zoltán Illés, and Veronika Stoffová. An ensemble approach to identifying the student gender towards information and communication technology awareness in European schools using machine learning. International Journal of Engineering and Technology, in press, 2018.
[20] What Is Data Mining. Data mining: Concepts and techniques. Morgan Kaufmann, 2006.
[21] Hongbo Du. Data mining techniques and applications: An introduction. Cengage Learning EMEA, 2010.
[22] MDRV Gimpy. Missing value imputation in multi attribute data set. International Journal of Computer Science and Information Technologies, 5(4):1–7, 2014.