Lung Disease Prediction.edited.docx

  • Uploaded by: Mohammad Farhan
  • December 2019

Dissertation Part-I Progress Report of Master of Technology (Computer Science) 3rd Semester


Mr. Abdul Wahid Associate Professor (Dean) Department of CS&IT, MANUU, Hyderabad



Certificate

This is to certify that the Dissertation Part-1 Progress Report entitled “LUNG DISEASE PREDICTION SYSTEM USING K-MEAN CLUSTERING AND NAÏVE BAYES” submitted by MD FARHAN HAIDER bearing Roll No A160025 in partial fulfillment of the requirements for the award of Master of Technology (CS) Degree during 2017-2019 at the Department of CS&IT is an authentic work carried out by him/her under my guidance and supervision. The results presented in this report have been verified and are found to be satisfactory. The results embodied in this dissertation have not been submitted to any other University or Institute for the award of any other degree or diploma.

Supervisor’s Signature

DRC Member’s Signature

Head Department of CS&IT


I hereby declare that the thesis work presented in this Dissertation Part-1 Progress Report entitled “LUNG DISEASE PREDICTION SYSTEM USING K-MEAN CLUSTERING AND NAÏVE BAYES” towards the partial fulfillment of the requirement for the award of the degree of Master of Technology (Computer Science) submitted in the Department of CS&IT, Maulana Azad National Urdu University, Hyderabad, Telangana, India is an authentic record of my own work carried out under the guidance of MR. ABDUL WAHID, Associate Professor (Dean), Department of CS&IT, Maulana Azad National Urdu University, Hyderabad (Telangana).

I have not submitted the matter embodied in this progress report for the award of any other degree or diploma to any other University or Institute.

Date: Place:



I express my sincere gratitude towards my Supervisor MR. ABDUL WAHID, Associate Professor (Dean), Department of CS&IT, MANUU Hyderabad for consistently providing me with the guidance required for the timely and successful completion of this report. I am deeply indebted to Coordinator MR. MOHAMMAD ISLAM, Department of CS&IT, MANUU for his valuable suggestions and support. In spite of his extremely busy schedule in the Department, he was always available to share with me his deep insights, wide knowledge and extensive experience. Again I sincerely thank Professor Abdul Wahid, Dean, School of Computer Science & Information Technology, Dr. Pradeep Kumar, Head, Department of Computer Science & Information Technology, and all other faculty members of our department for their valuable feedback during internal evaluations.


Data mining techniques began gaining popularity nearly three decades ago, but until the last few years they were rarely used in health care organizations. Researchers have now started paying attention to this field. They have found that the health care sector holds a very large volume of data, but that this data is highly unorganized. If it is organized properly using data mining techniques, it can be used to predict various diseases. I will develop a hybrid approach that combines two techniques, Naïve Bayes and the k-means algorithm. Fourteen parameters are considered for predicting lung disease. The system uses the k-means algorithm to group the attributes and the Naïve Bayes algorithm to predict whether lung disease is present.






List of Figures


List of Tables


1. Introduction
1.1) Introduction
1.2) K-means clustering
1.3) Naïve Bayes
1.4) K-Means – Naïve Bayes Hybrid

2. Objectives
2.1) Objectives

3. Literature Survey
3.1) Literature Survey

4. Proposed Method
4.1) Methodologies
4.2) Proposed Method
4.3) Performance Measurement

5. Time Table (Plan of Work)
5.1) Plan of Work

6. Tools
6.1) Tools

7. Tentative Outcomes
7.1) Tentative Outcomes





Figure 4.1: K-means clustering process
Figure 4.2: Taking dataset and preprocess
Figure 4.3: Clustering using k-means
Figure 4.4: Classification using Naïve Bayes
Figure 6.1: WEKA tool




Table 3.1: Base Papers
Table 3.2: Research Papers using naïve Bayes
Table 3.3: Research Papers using k-means clustering
Table 3.4: Research Papers using WEKA
Table 5.1: Plan of Work





1.1) Introduction

Lung cancer is the leading cause of cancer-related death and is responsible for more than a quarter of all deaths due to cancer in the United States. It accounts for 13-14% of all cancer diagnoses, making it the second most commonly diagnosed malignancy in both men and women (not counting skin cancers). Until the 20th century, however, lung cancer was a relatively rare disease. That changed with the advent of wide-scale cigarette smoking, which remains the leading cause of lung cancer today.

There are two main types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). The majority of lung cancer patients have NSCLC, which usually grows and spreads more slowly and has a better 5-year overall survival rate than SCLC.

Lung cancer accounts for more deaths than any other cancer in both men and women. It has been the fifth leading cause of death in the world over the past 10 years (World Health Organization 2016). According to the World Health Organization (WHO) report, it is the leading cause of cancer death across the world, accounting for 1.58 million deaths, or about 27% of all cancer deaths. The death rate began declining in 1991 in men and in 2003 in women.

Early detection of lung cancer is essential in reducing loss of life. However, earlier treatment requires the ability to detect lung cancer in its early stages. Early diagnosis requires an accurate and reliable diagnostic procedure that allows physicians to distinguish benign lung disease from malignant disease.

Health data is increasing rapidly around the world. It is very large and complex, which makes it very difficult to process with traditional data processing techniques. To simplify this task, machine-learning techniques such as k-nearest neighbours (KNN), support vector machines (SVM) and decision trees (DT) have been used. Tools such as Python (pandas) and Weka are widely used in the data analytics field.

The two main concepts that we will come across repeatedly throughout this work are:
• K-Means Clustering
• Naïve Bayes



K-Means Clustering

K-means is one of the simplest learning algorithms for solving the clustering problem. The procedure is simple: it partitions a given data set into a certain number of clusters, k. It defines k centroids, one for each cluster. These should be placed as far away from each other as possible. Each point in the data set is then associated with its nearest centroid. When no point is pending, an early grouping is done. We then re-calculate k new centroids as the centers of the clusters resulting from the previous step. Once we have these k new centroids, a new binding is made between the same data points and the nearest new centroid. This produces a loop, in which the k centroids change their location step by step until no more changes occur.

The advantages of the k-means clustering algorithm are its simplicity and speed.

Algorithm:

1) Select k centers from the data (at random).
2) Divide the data into k clusters by grouping each point with its nearest center.
3) Calculate the mean of each of the k clusters to find the new centers.
4) Repeat steps 2 and 3 until the centers do not change.
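The four steps of the algorithm can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the WEKA implementation used in this work: the function names `k_means` and `squared_distance`, the `max_iter` cap and the fixed random `seed` are assumptions introduced for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of numeric tuples) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick k random centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: the mean of each cluster becomes its new center.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep the old center of an empty cluster
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Calling `k_means(points, 2)` on a list of numeric tuples returns the final centers together with the points grouped by cluster. The result depends on the random starting centers, so different seeds may give different groupings.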


Naïve Bayes

The Naïve Bayes classifier is based on Bayes' theorem. It makes a strong independence assumption and is therefore also known as an independent feature model. Naïve Bayes is mainly used when the dimensionality of the inputs is high. Despite its simplicity, it often gives results comparable to those of more sophisticated methods. The model gives the probability of each input attribute for the predictable state.

Bayes theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

where
P(H|X) is the posterior probability of H conditioned on X,
P(X|H) is the posterior probability of X conditioned on H,
P(H) is the prior probability of H,
P(X) is the prior probability of X.

Naïve Bayes will predict whether or not the patient is likely to develop lung disease.
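To make the use of Bayes' theorem concrete, the following is a minimal sketch of a Naïve Bayes classifier for nominal attributes. It is not the WEKA classifier used in this work. The function names, the add-one (Laplace) smoothing and the toy attributes mentioned below are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each attribute value occurs within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            values_seen[j].add(value)
    return class_counts, value_counts, values_seen


def posterior(model, row):
    """Return P(H | X) for every class H, given one record X."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(H)
        for j, value in enumerate(row):
            # P(x_j | H) with add-one smoothing (an assumption of this sketch),
            # multiplied in under the independence assumption.
            score *= (value_counts[(j, label)][value] + 1) / (count + len(values_seen[j]))
        scores[label] = score
    evidence = sum(scores.values())  # P(X), summed over all classes
    return {label: s / evidence for label, s in scores.items()}
```

For example, with invented records of the form (smoker, cough) labelled "disease" or "healthy", `posterior(model, ("yes", "yes"))` returns one probability per class, and the class with the largest probability is taken as the prediction.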


K-Means – Naïve Bayes Hybrid:

The hybrid of k-means clustering and naïve Bayes has been used for predicting some other diseases and has been shown to produce better results than the simple approaches. The dataset is first clustered with the K-Means algorithm, and the values of a record are then compared with the trained dataset. Bayes' theorem is applied to obtain the probability that the patient has lung disease.
• K-means clustering can handle massive data and cluster it efficiently and quickly.
• The Naïve Bayes algorithm will be used for classification.
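The text does not fix exactly how the two algorithms are coupled, so the sketch below shows only one possible reading: the numeric attributes are clustered with k-means, and the resulting cluster number is passed to Naïve Bayes as one more nominal attribute. All function names, the choice of the first k points as starting centers and the fixed number of iterations are assumptions of this example, not part of the proposed system.

```python
from collections import Counter


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))


def fit_centers(points, k, iters=20):
    """Plain k-means; starts from the first k points (assumption)."""
    centers = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def fit_nb(features, labels):
    """Naive Bayes counts for rows of nominal values."""
    prior = Counter(labels)
    cond = Counter((j, v, y) for row, y in zip(features, labels) for j, v in enumerate(row))
    n_values = [len({row[j] for row in features}) for j in range(len(features[0]))]
    return prior, cond, n_values


def predict_nb(model, row):
    prior, cond, n_values = model

    def score(y):
        s = prior[y] / sum(prior.values())
        for j, v in enumerate(row):
            s *= (cond[(j, v, y)] + 1) / (prior[y] + n_values[j])  # add-one smoothing
        return s

    return max(prior, key=score)


def fit_hybrid(numeric, nominal, labels, k):
    """Cluster the numeric attributes, then train NB on (cluster, nominal...)."""
    centers = fit_centers(numeric, k)
    features = [(nearest(p, centers),) + tuple(c) for p, c in zip(numeric, nominal)]
    return centers, fit_nb(features, labels)


def predict_hybrid(model, numeric_row, nominal_row):
    centers, nb = model
    return predict_nb(nb, (nearest(numeric_row, centers),) + tuple(nominal_row))
```

In this sketch the numeric attributes are used unscaled, so an attribute with a large range dominates the distance. In practice the attributes would normally be normalized first.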






The objectives of this research are as follows:
• To study different disease prediction algorithms and review the literature.
• To study and analyze existing systems for lung disease and identify issues and challenges.
• To develop a k-means – naïve Bayes hybrid system for lung disease.
• To design a system for lung disease prediction based on patient data.
• To design a system that predicts lung disease more accurately than existing systems.
• To implement a system using hybrid algorithms for increasing efficiency.
• To test and validate the proposed system.

Prediction of lung disease is a very complicated task, and at present it depends mainly on the individual medical practitioner. If the knowledge of individual medical practitioners is combined in one data set, it will be very useful for the younger generation of medical practitioners and will ultimately help patients. In this work, a hybrid approach is used for lung disease prediction: the popular clustering technique ‘K-Means' is combined with the ‘Naïve Bayes' algorithm as the classifier. Because it is a hybrid approach, this technique is suitable for complex problems and is expected to produce results with very good accuracy.





Literature Survey:

A variety of research papers were studied and analyzed during the literature survey, covering the disease prediction methods based on k-means clustering or Naïve Bayes that have been employed over the years. The methodologies used in these studies and their findings are presented below.

Table 3.1: Base Papers

Table 3.2: Research Papers using k-means clustering


Table 3.3: Research Papers using naïve Bayes

Table 3.4: Research Papers using WEKA

[1] Data mining techniques are widely used for computing and discovering patterns in large data sets. The data mining approach emerged among researchers in the mid-1990s, and it has been observed to be a very important technique for extracting unknown patterns and vital information from large data sets.


[2] Rucha Shinde et al. proposed a heart disease prediction system using naïve Bayes and k-means clustering, in which k-means clustering is used to increase the efficiency of the output. The authors describe it as the most effective model for predicting patients with heart disease. The model could answer complex queries, each with its own strength with respect to ease of model interpretation, access to detailed information and accuracy.

[3] Priyanka D proposed a system that implements the K-Means clustering algorithm. It performs a number of iterations from random starting points, assigning each observation to the nearest of the k clusters, with the aim of reducing running time and giving stable, accurate results. The work uses compactness and connectedness as complementary measures of the accuracy of the result. Using a software prototype, it finds that the method predicts heart disease more efficiently and effectively than three other techniques.

[4] The main aim of this work is to develop a prototype Health Care Prediction System using Naive Bayes. The system will discover and extract hidden knowledge related to diseases (heart attack, cancer and diabetes) from a historical heart disease database. It will answer complex queries for diagnosing disease and so assist health care practitioners in making intelligent clinical decisions, which traditional decision support systems cannot do. By providing effective treatments, it also helps to reduce treatment costs. It also aims to enhance visualization and ease of interpretation.

[5] Some implementations of K-means only allow numerical values for attributes. In that case, it may be necessary to convert the data set into the standard spreadsheet format and convert categorical attributes to binary. It may also be necessary to normalize the values of attributes that are measured on substantially different scales (e.g., "age" and "income"). While WEKA provides filters to accomplish all of these preprocessing tasks, they are not necessary for clustering in WEKA. This is because the WEKA SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and clusters.
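For K-means implementations that accept only numeric attributes, the conversion described above can be sketched as follows. This is an illustration in plain Python, not WEKA's own filter code, and the function names `nominal_to_binary` and `euclidean` are hypothetical.

```python
import math


def nominal_to_binary(values):
    """Turn one nominal attribute into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def euclidean(a, b):
    """Euclidean distance between two numeric instances of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

After this conversion every instance is a purely numeric vector, so the Euclidean distance between an instance and a cluster center is well defined.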






The methodologies used in our proposed system will be based on the combination of k-means clustering and naïve Bayes algorithm.

i) Clustering
ii) Classification

• Data related to lung diseases will be analyzed for data mining through Weka.
• K-means clustering and naïve Bayes techniques will be used.
• K-means clustering can handle massive data and cluster it efficiently and quickly.
• A simple and straightforward iterative method will be used to partition the data set into k clusters.
• The Naive Bayes algorithm will be used as the classification algorithm.

Figure 4.1: K-means clustering process



Proposed Method

First, I will preprocess the data, because real-world data is dirty: incomplete, noisy and inconsistent. It is incomplete when attribute values or attributes of interest are missing, or when it contains only aggregate values; noisy when it contains errors or outliers; and inconsistent when it contains discrepancies in names or codes. I will then apply the clustering algorithm to the dataset and, after clustering, use classification to predict lung disease.

Data preprocessing steps in Weka:

First, run the Weka software, launch the Explorer window and select the "Preprocess" tab. Then open the lung dataset and note what information is available about it (e.g. number of instances, attributes and classes). What type of attributes does this dataset contain (nominal or numeric)? What are the classes in this dataset? Which attribute has the greatest standard deviation, and what does this tell you about that attribute?

After loading the dataset, under "Filter", choose the Standardize filter and apply it to all attributes. What does it do? How does it affect the attributes' statistics? Click "Undo" to restore the data, then choose the "Normalize" filter and apply it to all the attributes. What does it do? How does it affect the attributes' statistics? How does it differ from "Standardize"? Click "Undo" again to return the data to its original state.

At the bottom right of the window there should be a graph that visualizes the dataset. Make sure "Class: class (Nom)" is selected in the drop-down box, and click "Visualize All". What can you interpret from these graphs? Which attribute(s) discriminate best between the classes in the dataset? How do the Standardize and Normalize filters affect these graphs?

Under "Filter", choose the Attribute Selection filter. What does it do? Are the attributes it selects the same as the ones you chose as discriminatory above? How does its behavior change as you alter its parameters?

Figure 4.2: Taking dataset and preprocess
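The difference between standardizing and normalizing an attribute can be seen in a short sketch. It mirrors the idea behind the two filters rather than WEKA's code: standardizing rescales an attribute to zero mean and unit standard deviation, while normalizing rescales it to the range [0, 1]. The function names and the use of the population standard deviation are assumptions of this example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit standard deviation (population form)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant attribute carries no spread
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied to the same attribute, `normalize` keeps every value between 0 and 1, whereas `standardize` produces both negative and positive values centred on 0.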

Clustering in WEKA: Clustering divides the records in a database into different groups. Records in the same group have similar properties. The differences between groups should be as large as possible, and the differences within a group should be as small as possible. There is no predefined class, which is why clustering comes under unsupervised learning.

Steps involved in WEKA:

Load the data file browsers .arff into WEKA using the same steps we used to load data into the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the columns, the attribute data, the distribution of the columns, etc. With this data set, we are looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and select the technique from the choices that appear.

Figure 4.3: Clustering by using k-means


Classification in WEKA: Classification is the process of finding a set of models that describe and distinguish data classes and concepts, so that the model can be used to predict the class of objects whose label is unknown. Classification is a two-step process.

The first step is model construction, which describes a set of predetermined classes. A classification model is built from training data in which every object is pre-classified, i.e. its class label is known. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. The model is represented as classification rules, decision trees, or mathematical formulae.

The second step is model usage: the model is first tested and then used to classify future or unknown objects. To estimate the accuracy of the model, the known label of each test sample is compared with the class the model assigns to it. The accuracy rate is the percentage of test set samples that are correctly classified by the model. The test set must be independent of the training set; otherwise over-fitting will occur. If the accuracy is acceptable, the model is used to classify data tuples whose class labels are not known.

Steps involved in WEKA: Basically, there are four steps involved in classification in WEKA:
• Preparing the data
• Choosing Classify and applying the algorithm
• Generating trees
• Analyzing the result or output

First, prepare the data: load it, making sure it is in .arff format. After loading the data, choose Classify, then choose the classification algorithm and generate the trees.

Figure 4.4: Classification by using naïve bayes



Performance Measurement

After the training process is completed, the system will be tested for its performance. The testing will be done on the Test Dataset, which is the part of the dataset that has not been used for training. The ratio of Training Dataset to Testing Dataset can be 75:25, 65:35, 60:40 and so on. The ratio is decided on the basis of the size of the available dataset, so that enough data is available both for training the system and for testing it.
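A split of this kind can be sketched as below. The function name `train_test_split`, the shuffling step and the fixed `seed` are assumptions of this example; the 0.75 default corresponds to the 75:25 ratio mentioned above.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so both parts are mixed
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Passing `train_fraction=0.65` or `0.60` gives the other ratios. No record appears in both parts, which keeps the test set independent of the training set.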

The performance of the proposed method will then be measured using a confusion matrix. As both the input data and the output of the system are discrete, a confusion matrix is a suitable choice for evaluating the final performance of our system. The performance will be measured by comparing the total number of True Positives and True Negatives with the total number of False Positives and False Negatives predicted by the system, which gives a clear idea of how well the system performs.
• True Positive (TP): Observation is positive, and is predicted to be positive.
• False Negative (FN): Observation is positive, but is predicted negative.
• True Negative (TN): Observation is negative, and is predicted to be negative.
• False Positive (FP): Observation is negative, but is predicted positive.

Classification Rate or Accuracy is given by the relation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
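The four counts and the accuracy can be computed from the actual and predicted labels as in the following sketch. The function names and the `positive` argument, which names the label treated as the positive class, are assumptions of this example.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FN, TN, FP) for the class named by `positive`."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive, predicted positive
        elif a == positive:
            fn += 1  # positive, predicted negative
        elif p == positive:
            fp += 1  # negative, predicted positive
        else:
            tn += 1  # negative, predicted negative
    return tp, fn, tn, fp


def accuracy(tp, fn, tn, fp):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fn + tn + fp)
```

The same four counts can also be used for other measures, such as sensitivity, if accuracy alone turns out to be too coarse for an unbalanced dataset.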







Plan of Work:

Table 5.1: Plan of Work

Academic Calendar

Week 1-2: Literature Searching
Week 3-12: Literature Survey and Review
Week 12-17: Start work on the first draft. Aim to complete chapter 1. (In Progress)
Week 18: Submit draft of chapter 1 to the
Week 18-28: Work on the first draft of the remaining chapters. (Pending)
Week 29: Submit the first draft to the supervisor. Receive feedback on previous work.
Week 30: Receive feedback on the first draft of the main chapters.






The WEKA tool will be used to implement K-Means clustering and Naïve Bayes:
• K-means for clustering
• Naïve Bayes for classification

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

WEKA contains “clusterers” for finding groups of similar instances in a dataset. The clustering schemes available in WEKA are k-Means, EM, Cobweb, X-means and FarthestFirst. Clusters can be visualized and compared to “true” clusters (if given). Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution. In the ‘Clusterer’ box, click on the ‘Choose' button. In the pull-down menu, select WEKA → Clusterers, and select the cluster scheme ‘SimpleKMeans’. Some implementations of K-means only allow numerical values for attributes, but WEKA's SimpleKMeans handles a mixture of categorical and numerical attributes automatically; therefore, we do not need to use a filter.

Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning schemes available in WEKA include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, and Bayes' nets. “Meta”- classifiers include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.


Figure 6.1: WEKA Tools





Tentative Outcomes

The proposed system aims to use a hybrid of K-Means and NB for lung disease prediction that is more efficient than the existing simple algorithms, can be trained more easily and in less time, and gives better results. The results from the proposed system are expected to be more precise and more accurate than those produced by the existing simple algorithms for lung disease.

This research aims to extend the capabilities of K-Means clustering with the help of NB, by taking the clusters produced by K-Means and classifying them for the lung disease problem.

The primary outcomes that are expected from the proposed system are as follows:
• A lung disease prediction system will be developed by combining the Naïve Bayes and K-Means algorithms.
• Weka tools will be used to reduce the execution time of the algorithms.
• The prediction system may be faster, less computationally expensive and more time efficient, and may produce more accurate results.
• The proposed system will help doctors to efficiently predict lung diseases in the initial stages for better treatment.




World Health Organization (2011) The top ten causes of death.
World Health Organization (2013) Deaths from coronary heart disease.

[1] P. V. Maral, “Heart Disease Prediction Using Naive Bayes and K-Means Techniques,” Novat. Publ. Int. J. Res. Publ. Eng. Technol., vol. 3, no. 6, pp. 2454–7875, 2017.
[2] R. Shinde, S. Arjun, P. Patil, and P. J. Waghmare, “An Intelligent Heart Disease Prediction System Using K-Means Clustering and Naïve Bayes Algorithm,” Int. J. Comput. Sci. Inf. Technol., vol. 6, no. 1, pp. 637–639, 2015.
[3] D. Priyanka and M. S. S. Banu, “Prediction on Lung Disease Using K means Algorithm,” vol. 1, no. 11, pp. 239–242, 2015.
[4] G. Singh, K. Bagwe, S. Shanbhag, S. Singh, and S. Devi, “Heart disease prediction using Naïve Bayes,” Int. Res. J. Eng. Technol., vol. 4, no. 3, pp. 1–3, 2017.
[5] S. Jain, M. Aalam, and M. Doja, “K-means clustering using weka interface,” Proc. 4th Natl. Conf., 2010.
[6] W. Zhang and F. Gao, “An improvement to naive bayes for text classification,” Procedia Eng., vol. 15, pp. 2160–2164, 2011.
[7] K. Vanitha and G. R. L. Rani, “Analysis of Classification and Clustering Algorithms using Weka For Banking Data,” no. 0976, pp. 104–107.
[8] S. Singhal and M. Jena, “W-06. Study on WEKA Tool for Data Preprocessing, Classification and Clustering,” India - WEKA, vol. 2, no. 6, pp. 250–253, 2013.
[9] P. Ramachandran, N. Girija, T. Bhuvaneswari, and A. Professor, “Early Detection and Prevention of Cancer using Data Mining Techniques,” Int. J. Comput. Appl., vol. 97, no. 13, pp. 975–8887, 2014.
[10] S. Vijiyarani, S. Sudha, and M. P. Research Scholar, “Disease Prediction in Data Mining Technique – A Survey,” Int. J. Comput. Appl. Inf. Technol., vol. II, no. I, pp. 2278–7720, 2013.
[11] T. Karthikeyan and P. Thangaraju, “PCA-NB Algorithm to Enhance the Predictive Accuracy,” vol. 6, no. 1, pp. 381–387, 2014.
[12] D. Kavinya, “Lung Disease Classification Using Support Vector Machine,” vol. 3, no. 3, pp. 84–86, 2015.
[13] A. Trivedi, “International Journal of Advanced Research in Computer Science and Software Engineering Evaluation of Student Classification Based On Decision Tree,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 4, no. 2, pp. 111–112, 2014.
[14] U. Sharma, “Suitability of neural network for disease prediction : a comprehensive literature review,” vol. 5, no. 6, pp. 12–20, 2017.
[15] C. H. Chen, W. T. Huang, T. H. Tan, C. C. Chang, and Y. J. Chang, “Using K-nearest neighbor classification to diagnose abnormal lung sounds,” Sensors (Switzerland), vol. 15, no. 6, pp. 13132–13158, 2015.
[16] M. Makinaci, “Support vector machine approach for classification of cancerous prostate regions,” Int. Enformatika Conf., vol. 1, no. 7, pp. 166–169, 2005.
[17] A. Kumar, M. Kamaleshwar, S. K. K, S. K. R. S, and J. Arunnehru, “An Improved Disease Prediction System Using Machine Learning,” no. 4, 2018.
[18] P. Mirajkar and A. Pradesh, “An Integrated Cancer Prediction System Using Data Mining Techniques,” vol. 3, no. 1, pp. 1497–1501, 2018.
[19] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary, “A Lung Cancer Outcome Calculator Using Ensemble Data Mining on SEER Data Categories and Subject Descriptors,” Kdd, 2011.
[20] R. Ada, & Kaur, “A Study of Detection of Lung Cancer Using Data Mining Classification Techniques,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 3, pp. 131–134, 2013.
[21] B. Sciences, B. G. Krishna, and A. Pradesh, “a Predictive Model for Heart Disease Using clustering techniques,” vol. 8, no. 3, pp. 529–534, 2017.
[22] V. Krishnaiah, G. Narsimha, N. Subhash, and C. #3, “Diagnosis of Lung Cancer Prediction System Using Data Mining Classification Techniques,” Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 1, pp. 39–45, 2013.
[23] Nur Hafieza Ismail, Fadhilah Ahmad, Azwa Abdul Aziz, “Implementing WEKA as a data mining tool to analyze students academic performance using naïve Bayes classifier”,

