WEB-BASED DATA MINING IN ACADEMIC WEBSITES
Guide: Mr. D. George Washington
Name: Prasanna Kumar Palepu
Reg No: 200536314
Abstract: The proposed system applies Web mining to discover pedagogically relevant knowledge contained in databases obtained from Web-based educational systems. The findings can be used to make more effective use of resources and to reduce web traffic and intrusions. Web mining is used to analyze and reason about the mass of information on an education website, which can uncover potential patterns, reduce risk, and support the right decisions. The intended goals are:
* To mine the web log and find drawbacks in websites
* To build an interface to analyze the web log
Previous Status of the Project:
Worked on filtering the log file, storing the filtered records in a database, and updating the database daily with new web log data.
Present Status of the Project:
* Designed the database structure for the log file.
* Collected the IP-to-country database.
* Collected the GMT-to-country database.
* Collected the USER_Agent database.
* Created the user interface design with UML diagrams.
* Created the report formats and table structures.
* Wrote code for parsing the log file; its bugs are currently being eliminated.
Guide: Mr. D. George Washington Prasanna Kumar Palepu (200536314)
17-Oct-2008 Page 1
Architecture and Design
Introduction
The following pages describe the system design in terms of packages, classes, relationships, and behavior. Several attached worksheets address specific aspects of the overall system design, such as the user interface and the database design.
The most important fact about the design: it is intended to help create a rich interface with which web administrators can analyze web log data and find anomalies in websites.
UML Structural Design
The system's structural design is described in the UML model WebLogModelStructure, through the following UML structural diagrams:
* PACKAGE WeblogModelStructure OVERVIEW DIAGRAM
* WebLogModel
  o AddLog Diagram
  o ParseLog Diagram
  o ExportLog Diagram
UML Behavioral Design
The system's behavioral design is described in the UML model WebLogModelBehavioral, through the following UML diagrams:
* Referrer Statistics Class Diagram
* Access Statistics Class Diagram
* User Agent Statistics Class Diagram
* OuterView Of Project
* UML Activity Diagram
UML Design Checklist
Correctness: The design is believed to be correct, and modifying it should not force drastic changes to the entire system.
Feasibility: According to the Gantt chart, the time spent on design matches the estimate, so the design is feasible.
Understandability: The design is drawn with the Describe UML tool, which is user-friendly, so the diagrams are easy to understand.
Implementation phase guidance: The designed modules are easy to implement.
Modularity: No existing software is used for parsing web log data; the parser is unique to this project. The design keeps all modules distinctly separated.
Extensibility: It is easy to add new code to the intended system because it is written in VB.NET, which is user-friendly.
Testability: The system is easy to test with testing tools. Manual testing is also done for verification and validation, on each module individually and on the system as a whole.
Efficiency: The system consumes an acceptable amount of time, storage space, bandwidth, and other resources.
Architecture Overview
Software architecture style being used: single web service (app-server, database).
The ranked goals of this architecture:
1. Ease of integration
2. Extensibility
3. Capacity matching
Components
The components of this system are listed below by type:
* Presentation/UI Components
  o C-00: WebLogUI
* Application Logic Components
  o C-10: WebLogLogic
* Data Storage Components
  o C-20: WebLogStorage
Deployment
The components are deployed as follows:
* All-in-one server
  o WebLogFront End
    + C-00: WebLogUI
    + C-10: WebLogLogic
  o Database process
    + C-20: WebLogStorage
Aspects/resources of their environment are shared as follows: everything is on one Oracle server, so all machine resources are shared by all components.
The database is updated constantly through the export function. The database process could be moved to a separate machine with a fairly simple change to a configuration file, and more front-end servers could be added. Otherwise, nothing can be changed about the deployment: the application logic running on the application server cannot be split or load-balanced.
Integration
The components are integrated and communicate as follows:
* All of our code uses direct procedure calls; components within the same process call each other directly.
* The database is accessed through an ODBC driver.
* Communication between the front-end and back-end servers uses ODBC.
Architectural Scenarios
The following sequence diagrams give step-by-step descriptions of how components communicate during some important usage scenarios:
* System startup
* System shutdown
* ParsingLog
* ExportingLog
Architecture Checklist
Ease of integration: The architecture provides mechanisms for all needed types of integration, and all of the new components are designed to work together. The reused components are integrated via fairly simple interfaces.
Source Code Organization and Build System
Overview
It roughly follows the standard proposed in the Visual Studio .NET documentation.
Ranked goals of this source code organization and build system:
1. Separation of files by type
2. Separation of version-controlled files from files generated by the build process
3. Compatibility with standard build processes
Key Directories and Files in Working Copies
Path                                 Description
Logs/                                Web log file directory for parsing
Src/                                 VB.NET source files
src/Model/                           VB.NET model form source file
src/Report/                          VB.NET report source file
src/VBNET/[Nested packages]/         VB.NET source code of classes in each package
src/VBNET/[Nested packages]/test/    VB.NET source code of unit tests for classes in each package
conf/                                Configuration files
data/                                Initial data to load into database and/or file system
lib/                                 Libraries reused by this project
build/                               Output of build process
help/                                Project documents
Build Targets
Target     Description
compile    Compiles the VB.NET source code and creates an executable file.
Load       Loads the intended log file into the application.
Parse      The main target of the application: the log file is parsed and stored in a temporary space.
Export     Exports the parsed data to the database and removes the temporary space used during parsing.
Analyze    Analyzes the exported data from the database.
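The Export target can be pictured with the minimal Python sketch below. It is an illustration only, not the project's VB.NET code: it uses the standard-library sqlite3 module as a stand-in for the Oracle database reached over ODBC, a plain list as the "temporary space", and a hypothetical function name, export_entries. Only four of the Main_Parsed columns are carried to keep the sketch short.

```python
import sqlite3


def export_entries(connection, staged_entries):
    """Copy staged (parsed) rows into the database, then empty the staging list.

    Assumption: each staged row is a tuple of
    (Client_IP, Req_Path, Stat_Code, Req_Bytes).
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS Main_Parsed ("
        "Client_IP TEXT, Req_Path TEXT, Stat_Code TEXT, Req_Bytes TEXT)"
    )
    # The connection context manager commits on success and rolls back on error,
    # so a failed export leaves the staged rows untouched.
    with connection:
        connection.executemany(
            "INSERT INTO Main_Parsed VALUES (?, ?, ?, ?)", staged_entries
        )
    exported = len(staged_entries)
    staged_entries.clear()  # release the temporary space only after the commit
    return exported


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    staged = [("65.55.208.12", "/academic/curri2002ft-welding.doc", "200", "52224")]
    print(export_entries(conn, staged), "row(s) exported")
```

Clearing the staging list only after the transaction commits mirrors the order given for the Export target: export first, then remove the temporary space.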
Build Configuration Options
Property          Description
WebLogAnalysis    The tool to be created for exporting the raw web log to the database for analysis.
1.0               Version number of this release.
User Interface
Overview
The ranked goals for the user interface of this system:
1. Understandability and learnability
2. Task support and efficiency
3. Safety
4. Consistency and familiarity
This UI design follows Microsoft UI guidelines.
Task Models
Only Web administrators will use this software, to find drawbacks in a website.
Technical Constraints / Operational Contextualization
Output devices: This "WebLogAnalyzer" system has a 320x200 16-color display as a model window.
Windowing systems, UI libraries, or other UI technologies that will be used: standard .NET with no extra libraries.
User Interface Checklist
Understandability and learnability
* The labels and icons cause no misunderstanding, because the system uses standard ones.
* The advanced options are clearly separated from the most commonly used options.
* There are no invisible options or commands.
Safety
* Export is a one-way process from the front end to the database, but it can still be rolled back through database administration.
Consistency and familiarity
* The UI elements in this system work the same as they do in the existing example systems I identified.
* All elements in this system that appear the same actually function the same.
Persistence
Central Database
Database access controls that will be used: A database user account has been created that has access to the needed application database tables. The username and password for this account are stored in a configuration file read by the application server.
Is this application's central database accessible to other applications? No. This database should always be accessed through this application. All relevant pieces of information are available through the application interfaces. The database itself does not protect against data corruption that could be caused by other applications.
File Storage
The application keeps its data in the database rather than in files of its own; users' documents remain in files on their own computers' hard disks.
Persistence Mechanisms Checklist
Expressiveness: The database is easy to understand.
Ease of access: The database is accessible by login ID and password only.
Reliability: The database is highly reliable.
Capacity: The database server has more than 80 GB of free space.
Security: The database is highly secure.
Performance: The system runs faster on Intel-based machines with more than 512 MB of RAM.
Physical structure of the Database:
All tables described below are deployed in Oracle and are normalized. Modifying the database during development should have little impact on the overall design of the project.
Table 1: Main_Parsed
Field Name      Data Type     Length   Description
Unique_ID       AutoNumber    50       Unique number to identify the records.
Client_IP       VARCHAR2      50       This is the address of the computer making the HTTP request. The server records the IP
RFC_Name        VARCHAR2      20       The field is designed to identify the requestor. If this information is not recorded, a hyphen (-) holds the column in the log.
LogName         VARCHAR2      20       If using local authentication and registration, the user's log name will appear; likewise, if no value is present, a "-" is substituted.
Log_Date        TIMESTAMP              The format is DD/Mon/YYYY:HH:MM:SS +GMT
Req_method      VARCHAR2      20       Request method is GET, PUT, POST, or HEAD
Req_Path        VARCHAR2      256      Path is the path and file retrieved
Req_Protocol    VARCHAR2      20       It defines the protocol used by the client
Stat_Code       VARCHAR2      3        HTTP completion code. 200: OK; 3xx: some sort of redirection; 4xx: some sort of client error; 5xx: some sort of server error
Req_Bytes       VARCHAR2      10       For GET HTTP transactions, this field is the number of bytes transferred. For other commands this field will be a hyphen (-) or a zero (0)
Referrer        VARCHAR2      50       The referrer URL indicates the page where the visitor was located when making the next request.
User_agent      VARCHAR2      200      The user agent is information about the browser, version, and operating system of the reader. The general format is:
Table 2: GMT
Field Name       Data Type    Length   Description
GMT              SMALLINT     5        Greenwich Mean Time in number format
Zone             VARCHAR2     2        Zone of the GMT
Military_Code    VARCHAR2     10       Military code for the time zone
Country          VARCHAR2     15       Country name
City             VARCHAR2     15       City name
Table 3: IP2Country
Field Name      Data Type    Length   Description
IP_From         NUMBER       12       Starting IP address (numerical representation of the IP address)
IP_To           NUMBER       12       Ending IP address (numerical representation of the IP address)
Registry        VARCHAR2     10       Registry holding the reserved address numbers. It contains "apnic, arin, lacnic, ripencc, afrinic"
Country_Code    VARCHAR2     3        Code of the country
Country         VARCHAR2     20       Full description of the country
IP example (computed from right to left): 1.2.3.4 = 4 + (3 * 256) + (2 * 256 * 256) + (1 * 256 * 256 * 256) = 16909060
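The conversion above, and the range lookup it enables against IP_From and IP_To, can be sketched in Python as follows. This is a minimal illustration rather than the project's implementation; the function names and the two sample ranges (with made-up country codes) are hypothetical.

```python
import bisect


def ip_to_number(ip):
    """Convert a dotted-quad IPv4 address to its numeric representation."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return ((a * 256 + b) * 256 + c) * 256 + d


# Hypothetical rows shaped like the IP2Country table:
# (IP_From, IP_To, Country_Code, Country), sorted by IP_From.
SAMPLE_RANGES = [
    (ip_to_number("10.0.0.0"), ip_to_number("10.0.0.255"), "AA", "Example Land"),
    (ip_to_number("10.0.1.0"), ip_to_number("10.0.1.255"), "BB", "Sample Country"),
]


def find_country(ip, ranges=SAMPLE_RANGES):
    """Return (Country_Code, Country) for the range containing ip, or None."""
    number = ip_to_number(ip)
    starts = [row[0] for row in ranges]
    # Find the last range whose IP_From is <= number, then check its IP_To.
    index = bisect.bisect_right(starts, number) - 1
    if index >= 0 and ranges[index][0] <= number <= ranges[index][1]:
        return ranges[index][2], ranges[index][3]
    return None


if __name__ == "__main__":
    print(ip_to_number("1.2.3.4"))   # 16909060, matching the worked example
    print(find_country("10.0.1.7"))
```

Storing the ranges as numbers is what makes the lookup a simple comparison: one binary search on IP_From followed by one check against IP_To.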
Table 4: User_agent
Field Name        Data Type    Length   Description
U_Agent_String    VARCHAR2     100      User agent string with all information about the client system.
U_Agent_Type      VARCHAR2     2        S-Spiders, R-Robots, C-Crawler, B-Browser
Browser           VARCHAR2     10       Browser version
Platform          VARCHAR2     10       Platform of user
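One way to fill U_Agent_Type, Browser, and Platform from a raw user agent string is sketched below in Python. The marker lists are illustrative assumptions, not the project's USER_Agent database, and the function name classify_user_agent is hypothetical. The sketch only separates automated clients (reported as "R") from browsers ("B"); it does not attempt the finer spider/robot/crawler distinction.

```python
import re

# Assumed substrings that identify automated clients (illustrative only).
ROBOT_MARKERS = ("bot", "slurp", "crawler", "spider")

# Windows version tokens as they appear in user agent strings.
WINDOWS_VERSIONS = {
    "Windows NT 5.1": "Windows XP",
    "Windows NT 5.0": "Windows 2000",
    "Windows 98": "Windows 98",
}


def classify_user_agent(agent):
    """Return a dict with the type ('R' or 'B'), browser, and platform."""
    lowered = agent.lower()
    if any(marker in lowered for marker in ROBOT_MARKERS):
        return {"type": "R", "browser": "", "platform": ""}

    browser = "Unknown"
    msie = re.search(r"MSIE (\d+)", agent)
    if msie:
        browser = "MS Internet Explorer " + msie.group(1)
    elif "Firefox" in agent:
        browser = "Firefox"

    platform = "Unknown"
    for token, name in WINDOWS_VERSIONS.items():
        if token in agent:
            platform = name
            break
    else:
        if "Linux" in agent:
            platform = "Linux"

    return {"type": "B", "browser": browser, "platform": platform}


if __name__ == "__main__":
    print(classify_user_agent("msnbot/1.0 (+http://search.msn.com/msnbot.htm)"))
    print(classify_user_agent(
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"))
```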
Table 5: Req_Resourse
Field Name    Data Type    Length   Description
Req_URL       VARCHAR2     100      Requested URL path
Req_File      VARCHAR2     50       Requested file
Req_Bytes     NUMBER       10       Requested file size in bytes
Table 6: Status_Code
Field Name     Data Type    Length   Description
Stat_Code      NUMBER       3        HTTP completion code. 200: OK; 3xx: some sort of redirection; 4xx: some sort of client error; 5xx: some sort of server error
Stat_C_Desc    VARCHAR2     25
Table 7: Host_Summary
Field Name          Data Type    Length   Description
Client_IP           VARCHAR2     50       This is the address of the computer making the HTTP request. The server records the IP
Country_Code        VARCHAR2     3        Code of the country
No_Of_Occurances    NUMBER       5        The number of times the client visited the website.
No_Of_Pages         NUMBER       5        The number of times the client visited web pages.
Bandwidth           NUMBER       10       Bandwidth in bytes
Date                DATETIME              Date the client visited the website.
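The per-host figures in Host_Summary can be derived from parsed log entries roughly as in the Python sketch below. It is illustrative only: the dictionary keys of the input entries and the rule for what counts as a page (a path ending in "/", ".htm", ".html", or ".asp") are assumptions, and the last sample entry is invented to show a page request.

```python
from collections import defaultdict

# Assumption: requests whose path ends with one of these count as page views.
PAGE_SUFFIXES = ("/", ".htm", ".html", ".asp")


def summarize_hosts(entries):
    """Aggregate occurrences, pages, and bandwidth per client IP."""
    summary = defaultdict(
        lambda: {"No_Of_Occurances": 0, "No_Of_Pages": 0, "Bandwidth": 0}
    )
    for entry in entries:
        row = summary[entry["client_ip"]]
        row["No_Of_Occurances"] += 1
        if entry["req_path"].lower().endswith(PAGE_SUFFIXES):
            row["No_Of_Pages"] += 1
        # The bytes field is "-" when nothing was transferred.
        if entry["req_bytes"].isdigit():
            row["Bandwidth"] += int(entry["req_bytes"])
    return dict(summary)


if __name__ == "__main__":
    sample = [
        {"client_ip": "65.55.208.12",
         "req_path": "/academic/curri2002ft-welding.doc", "req_bytes": "52224"},
        {"client_ip": "69.123.246.252",
         "req_path": "/images/newlogo.jpg", "req_bytes": "-"},
        {"client_ip": "69.123.246.252",
         "req_path": "/images/annatext.gif", "req_bytes": "-"},
        # Hypothetical entry, added only to demonstrate a page request.
        {"client_ip": "69.123.246.252", "req_path": "/", "req_bytes": "1024"},
    ]
    for ip, row in summarize_hosts(sample).items():
        print(ip, row)
```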
Table 8: Referrar_Code
Field Name       Data Type    Length   Description
Ref_URL          VARCHAR2     100      Referral URL
Ref_Site         VARCHAR2     100      Referring website
Key_Word1        VARCHAR2     20       Keywords used to search the content in website
Key_Word2        VARCHAR2     20       Keywords used to search the content in website
Key_Word3        VARCHAR2     20       Keywords used to search the content in website
Key_Word4        VARCHAR2     20       Keywords used to search the content in website
Key_Word5        VARCHAR2     20       Keywords used to search the content in website
Search_Engine    VARCHAR2     20       Name of the search engine
Dom_Name         VARCHAR2     5        Name of the domain
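The referrer fields above could be filled from a raw referrer URL along the lines of this Python sketch. It is a hedged illustration: the mapping from search engine to query parameter is an assumption, the function name parse_referrer is hypothetical, and the example search URL is invented for demonstration. At most five keywords are kept, matching Key_Word1 to Key_Word5.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed query parameter that carries the search terms for each engine.
SEARCH_ENGINES = {"google": "q", "yahoo": "p", "msn": "q", "live": "q"}


def parse_referrer(ref_url):
    """Split a referrer URL into referring site, search engine, and keywords."""
    if ref_url in ("", "-"):
        return {"Ref_Site": "No Referrer", "Search_Engine": "", "Keywords": []}

    parts = urlsplit(ref_url)
    host_labels = (parts.hostname or "").split(".")
    result = {
        "Ref_Site": "{}://{}/".format(parts.scheme, parts.netloc),
        "Search_Engine": "",
        "Keywords": [],
    }
    query = parse_qs(parts.query)
    for engine, parameter in SEARCH_ENGINES.items():
        if engine in host_labels:
            result["Search_Engine"] = engine
            terms = query.get(parameter, [""])[0]
            result["Keywords"] = terms.split()[:5]  # Key_Word1..Key_Word5
            break
    return result


if __name__ == "__main__":
    # Hypothetical search referrer, for illustration only.
    print(parse_referrer("http://www.google.co.in/search?q=anna+university"))
    print(parse_referrer("-"))
```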
UML Activity Diagram
The activities occur in the following order:
1. Parse log data
2. Finding country by IP address
3. Parsing time zone by splitting the date-time and GMT
4. Parsing the arguments in the request field
5. Parsing status code
6. Parsing referrer
7. Parsing user agent details
8. Update in database
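The time-zone step, splitting the date-time from the GMT offset, might look like the following Python sketch. It is an illustration under stated assumptions: the function name split_log_date and the returned keys are hypothetical, and English month abbreviations are assumed when parsing.

```python
from datetime import datetime, timezone


def split_log_date(raw):
    """Split '[09/Sep/2007:04:13:04 +0530]' into local time, GMT offset, and UTC time."""
    stamp, offset = raw.strip("[]").split(" ")
    local = datetime.strptime(stamp + " " + offset, "%d/%b/%Y:%H:%M:%S %z")
    return {
        "local_time": local.replace(tzinfo=None),
        "gmt_offset": offset,
        "utc_time": local.astimezone(timezone.utc).replace(tzinfo=None),
    }


if __name__ == "__main__":
    parts = split_log_date("[09/Sep/2007:04:13:04 +0530]")
    print(parts["local_time"], parts["gmt_offset"], parts["utc_time"])
```

Keeping the offset as a separate value is what allows it to be matched against the GMT table to find the zone and country.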
OuterView Of Project
WebAdmin
* Access_Stats
* Host_Stats
* Referrer_Stats
* User_Agent_Stats
UserAgent Class Diagram

User_Agent
  Attributes
    Private U_Agent_URL As Character
    Private Type As Character
    Private Browser As Character
    Private Platform As Character
  Operations
    Public Function Class_Initialize()
    Public Function getU_Agent_URL() As Character
    Public Sub setU_Agent_URL( val As Character )
    Public Function getType() As Character
    Public Sub setType( val As Character )
    Public Function getBrowser() As Character
    Public Sub setBrowser( val As Character )
    Public Function getPlatform() As Character
    Public Sub setPlatform( val As Character )

U_A_Browser and U_A_OS (each class declares the same members)
  Attributes
    Private NoOfHits As Integer
    Private Bandwidth As Integer
    Private NoOfPages As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNoOfHits() As Integer
    Public Sub setNoOfHits( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOfPages() As Integer
    Public Sub setNoOfPages( val As Integer )
Access Statistics Class Diagram

ClientRequests
  Attributes
    Private RequestedFile As Character
    Private ReqestedURL As Character
    Private RequestedBytes As Character
    Private ClientIP As Character
  Operations
    Public Function getRequestedFile() As Character
    Public Sub setRequestedFile( val As Character )
    Public Function getReqestedURL() As Character
    Public Sub setReqestedURL( val As Character )
    Public Function getRequestedBytes() As Character
    Public Sub setRequestedBytes( val As Character )
    Public Function getClientIP() As Character
    Public Sub setClientIP( val As Character )
    Public Function Class_Initialize()

By_Pages { From Access_Stats }
  Attributes
    Private NoOfVisitors As Integer
    Private Bandwidth As Integer
    Private NoOFHits As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNoOfVisitors() As Integer
    Public Sub setNoOfVisitors( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOFHits() As Integer
    Public Sub setNoOFHits( val As Integer )

By_Files
  Attributes
    Private NofOfVisitors As Integer
    Private Bandwidth As Integer
    Private NoOfHits As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNofOfVisitors() As Integer
    Public Sub setNofOfVisitors( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOfHits() As Integer
    Public Sub setNoOfHits( val As Integer )

By_Paths
  Attributes
    Private NoOfVisitors As Integer
    Private NoOfHits As Integer
    Private Bandwidth As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNoOfVisitors() As Integer
    Public Sub setNoOfVisitors( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOfHits() As Integer
    Public Sub setNoOfHits( val As Integer )

By_ResponseCode
  Attributes
    Private NoOfVisitors As Integer
    Private Bandwidth As Integer
    Private NoOfHits As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNoOfVisitors() As Integer
    Public Sub setNoOfVisitors( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOfHits() As Integer
    Public Sub setNoOfHits( val As Integer )
Referrer Statistics Class Diagram

ReferrerStats
  Attributes
    Private ReferrerURL As Character
    Private RefSite As Character
    Private Keyword1 As Character
    Private Keyword2 As Character
    Private Search_Engine As Character
    Private Dom_Name As Character
  Operations
    Public Function Class_Initialize()
    Public Function getReferrerURL() As Character
    Public Sub setReferrerURL( val As Character )
    Public Function getRefSite() As Character
    Public Sub setRefSite( val As Character )
    Public Function getKeyword1() As Character
    Public Sub setKeyword1( val As Character )
    Public Function getKeyword2() As Character
    Public Sub setKeyword2( val As Character )
    Public Function getSearch_Engine() As Character
    Public Sub setSearch_Engine( val As Character )
    Public Function getDom_Name() As Character
    Public Sub setDom_Name( val As Character )

ByRef_Site and By_Keyword (each class declares the same members)
  Attributes
    Private NoOfHits As Integer
    Private Bandwidth As Integer
    Private NoOfPages As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNoOfHits() As Integer
    Public Sub setNoOfHits( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOfPages() As Integer
    Public Sub setNoOfPages( val As Integer )

By_SearchEngine
  Attributes
    Private NoOfHits As Integer
    Private NoOfPages As Integer
    Private Bandwidth As Integer
  Operations
    Public Function Class_Initialize()
    Public Function getNoOfHits() As Integer
    Public Sub setNoOfHits( val As Integer )
    Public Function getBandwidth() As Integer
    Public Sub setBandwidth( val As Integer )
    Public Function getNoOfPages() As Integer
    Public Sub setNoOfPages( val As Integer )
A normal web log is a raw file such as the following (the leading 1 to 4 are line numbers, not part of the log):
1. 65.55.208.12 - - [09/Sep/2007:04:13:04 +0530] "GET /academic/curri2002ft-welding.doc HTTP/1.0" 200 52224 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
2. 74.6.28.105 - - [09/Sep/2007:04:13:17 +0530] "GET /academic/D508.doc HTTP/1.0" 304 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
3. 69.123.246.252 - - [09/Sep/2007:04:13:33 +0530] "GET /images/newlogo.jpg HTTP/1.1" 304 "http://collinfo.annauniv.edu:6060/annauniv/courseall/branchwise.asp?brname=B.E-Bio-Medical Engineering&brcode=121&degrcode=11" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
4. 69.123.246.252 - - [09/Sep/2007:04:13:33 +0530] "GET /images/annatext.gif HTTP/1.1" 304 "http://collinfo.annauniv.edu:6060/annauniv/courseall/branchwise.asp?brname=B.E-Bio-Medical Engineering&brcode=121&degrcode=11" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
Format of log file:
<method><protocol><user_agent>
Fields:
* Client IP: 128.101.228.20
* Authenticated User ID: -
* Time/Date: [10/Nov/1999:10:16:39 -0600]
* Request: "GET / HTTP/1.0" (other common methods are POST and HEAD)
* Status: 200 (200: OK; 3xx: some sort of redirection; 4xx: some sort of client error; 5xx: some sort of server error)
* Bytes:
* Referrer: "-"
* Agent: "Mozilla/4.61 [en] (WinNT; I)"
Common Log Format:
* Remotehost: browser hostname or IP number
* Remote log name of user (almost always "-", meaning "unknown")
* Authuser: authenticated username
* Date: date and time of the request
* "request": exact request line from the client
* Status: the HTTP status code returned
* Bytes: the content-length of the response
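A log line in this layout can be split into the Main_Parsed fields with a single regular expression, as in the Python sketch below. It is a minimal illustration rather than the project's parser: the pattern assumes every field is present (including the bytes field, which may be "-"), and the group names are lower-case versions of the Main_Parsed column names chosen for this example.

```python
import re

# One named group per Main_Parsed column (Unique_ID is assigned by the database).
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) (?P<rfc_name>\S+) (?P<log_name>\S+) '
    r'\[(?P<log_date>[^\]]+)\] '
    r'"(?P<req_method>\S+) (?P<req_path>\S+) (?P<req_protocol>[^"]+)" '
    r'(?P<stat_code>\d{3}) (?P<req_bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# First sample line from the log excerpt.
SAMPLE = (
    '65.55.208.12 - - [09/Sep/2007:04:13:04 +0530] '
    '"GET /academic/curri2002ft-welding.doc HTTP/1.0" 200 52224 "-" '
    '"msnbot/1.0 (+http://search.msn.com/msnbot.htm)"'
)

if __name__ == "__main__":
    for name, value in parse_line(SAMPLE).items():
        print(name, "=", value)
```

Lines that do not match return None, so a caller can count or log malformed entries instead of stopping the whole parse.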
Sample Reports

Access Statistics

Pages
     Page                           Hits       %   Visitors       %   Bandwidth        %
  1  /                               166   27.48        144   26.33     2.30 MB    30.75
  2  /coe/schedule.htm                61   10.10         53    9.69   615.87 KB     8.04
  3  /result/results_revs.html        32    5.30         31    5.67   117.11 KB     1.53
  4  /academic/                       25    4.14         23    4.20   137.72 KB     1.80

Entry Points
     Entry Point                    Hits       %   Visitors       %   Bandwidth        %
  1  /                               135   57.45        135   57.45     2.22 MB    86.84
  2  /academic/                       15    6.38         15    6.38    54.74 KB     2.09
  3  /academic                         9    3.83          9    3.83     2.85 KB     0.11
  4  /academic/lakescr.txt             8    3.40          8    3.40           8     0.00

Paths
     Path                                                 Visitors       %   Bandwidth        %
  1  No Referrer -> /                                           53   22.55   721.30 KB     9.42
  2  No Referrer -> / -> /coe/schedule.htm                      16    6.81   549.09 KB     7.17
  3  No Referrer -> / -> /result/results_revs.html              11    4.68   249.79 KB     3.26
  4  No Referrer -> /academic/                                  10    4.26     2.88 KB     0.04

File Types
     File Type    Hits       %   Visitors       %   Pages       %   Bandwidth       %
  1  .gif         1616   40.34        173   15.27     196   31.01     3.93 MB    7.08
  2  .jpg          653   16.30        177   15.62      63    9.97     4.36 MB    7.86
  3  .html         440   10.98        221   19.51      73   11.55     3.63 MB    6.54

Response Codes
     Response Code               Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  200 - OK                    2415   60.28        240   44.53     566   71.37    44.63 MB    80.48
  2  304 - Not Modified          1057   26.39        109   20.22     140   17.65           0     0.00
  3  404 - Not Found              411   10.26        120   22.26      46    5.80   119.31 KB     0.21
  5  301 - Moved Permanently       22    0.55         22    4.08       6    0.76     6.90 KB     0.01
  6  405 - Method Not Allowed      15    0.37         15    2.78       2    0.25     4.78 KB     0.01
Visitor Statistics

Hosts
     Host               Country   Hits      %   Pages      %   Bandwidth       %
  1  122.164.245.135    India      128   3.20      54   1.83   499.78 KB    0.88
  2  121.246.25.137     India      121   3.02      86   2.92   352.62 KB    0.62
  3  59.92.9.1          India      119   2.97     116   3.93   514.89 KB    0.91

Visitors
     Visitors           Country   Hits      %   Pages      %   Bandwidth       %
  1  122.164.245.135    India      128   3.20      54   1.83   499.78 KB    0.88
  2  121.246.25.137     India      121   3.02      86   2.92   352.62 KB    0.62
  4  122.164.169.105    India      113   2.82      67   2.27     1.02 MB    1.84

Countries
     Country          Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  India            3882   96.90        278   89.39     599   85.21    45.54 MB    82.11
  2  United States      74    1.85         22    7.07      59    8.39     8.36 MB    15.08
  3  Kuwait             25    0.62          1    0.32      23    3.27   162.64 KB     0.29
Referrers Statistics

Referrers
     Referrer                                       Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  http://www.annauniv.edu /                      1134   28.31        143   15.29      25    2.55     5.95 MB    10.73
  2  No Referrer                                     553   13.80        249   26.63     104   10.61    17.25 MB    31.10
  3  http://www.annauniv.edu /coe /schedule.htm      457   11.41         59    6.31      19    1.94     2.25 MB     4.05
  4  http://www.annauniv.edu /coe /circular.html     197    4.92         19    2.03      18    1.84     1.11 MB     2.00

Referring Sites
     Referring Site                          Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  http://www.annauniv.edu /               3311   82.65        195   38.09     548   76.97    33.38 MB    60.19
  2  No Referrer                              553   13.80        249   48.63     104   14.61    17.25 MB    31.10
  3  http://collinfo.annauniv.edu :6060 /      68    1.70         19    3.71       8    1.12   196.97 KB     0.35
  4  http://www.google.co.in /                 25    0.62         21    4.10      17    2.39     1.98 MB     3.57
  5  http://www.google.com /                   13    0.32          8    1.56       6    0.84   663.29 KB     1.17

Keywords
     Keyword            SE Page   Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  anna university    1           11   28.95          7   22.58       3   13.04   153.23 KB    34.60
  2  annauniversity     1            5   13.16          3    9.68       1    4.35    42.43 KB     9.58
  3  annauniv.edu       1            4   10.53          3    9.68       2    8.70    76.75 KB    17.33
  4  annauniv           1            2    5.26          2    6.45       1    4.35    42.43 KB     9.58

Search Engines
     Search Engine    SE Page   Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  Google.com       1-2         28   70.00         22   70.97      13   68.42   306.83 KB    68.96
  2  Yahoo.com        1            8   20.00          6   19.35       2   10.53    84.86 KB    19.07
  3  MSN.com          1            3    7.50          2    6.45       3   15.79    32.00 KB     7.19
  4  live.com         1            1    2.50          1    3.23       1    5.26    21.21 KB     4.77
User Agent Stats

Operating Systems
     Operating System    Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  Windows XP          3154   78.99        185   62.08     452   55.73    40.86 MB    83.90
  2  Windows 2000         449   11.24         32   10.74     207   25.52     2.74 MB     5.62
  3  Windows 98           277    6.94         12    4.03      83   10.23     4.35 MB     8.93
  4  Unknown               65    1.63         62   20.81      28    3.45   498.96 KB     1.00
  5  Linux                 16    0.40          3    1.01      14    1.73    66.85 KB     0.13

Browsers
     Browser                   Hits       %   Visitors       %   Pages       %   Bandwidth        %
  1  MS Internet Explorer 6    2707   68.85        163   68.49     453   46.89    32.22 MB    66.89
  2  Firefox                    597   15.18         30   12.61     245   25.36     4.20 MB     8.71
  3  MS Internet Explorer 7     368    9.36         19    7.98     141   14.60    10.20 MB    21.17
  4  MS Internet Explorer 5     106    2.70          5    2.10      53    5.49   253.71 KB     0.51
  5  Opera 9                     52    1.32          5    2.10      37    3.83   318.61 KB     0.65
Error Stats

Errors
     Error                                       Referrer                                          Hits       %
  1  /coe/TITLEflowers.gif                       http://www.annauniv.edu /coe /schedule.htm          97   22.77
  2  /favicon.ico                                No Referrer                                         87   20.42
  3  /coe/fd_1.jpg                               http://www.annauniv.edu /coe /top.htm               35    8.22
  4  /campustour/images/leftboxcorner_top.gif    http://www.annauniv.edu /campustour /index.htm      27    6.34
  5  /academic/                                  No Referrer                                         15    3.52

Error Codes
     Error                       Hits       %
  1  404 - Not Found              411   96.48
  2  405 - Method Not Allowed      15    3.52
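The percentage columns in these reports are each row's share of the column total. The short Python sketch below reproduces the two percentages of the error code table from its hit counts; the function name hit_shares is hypothetical, and rounding to two decimals is assumed from the report layout.

```python
def hit_shares(counts):
    """Return each label's share of the total, as a percentage rounded to 2 places."""
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: round(100.0 * hits / total, 2) for label, hits in counts.items()}


# Hit counts taken from the error code table.
error_hits = {"404 - Not Found": 411, "405 - Method Not Allowed": 15}

if __name__ == "__main__":
    for label, share in hit_shares(error_hits).items():
        print(label, share)
```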
Sample Code

/*
 * The project's own headers are not shown in this listing. They declare
 * struct log_entry_filter, struct log_file_entry, parse_line(),
 * filter_entry(), free_entry(), filter_free(), buffer_allocate(),
 * exit_with_diagnostic(), the save_*_filter() functions, case_sensitive,
 * mu_run_test(), tests_run, PACKAGE_NAME and VERSION.
 */
#include <assert.h>
#include <glob.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef _DEBUG
#define PRIVATE static
#else
#define PRIVATE
#endif

#define MAX_FILE_SPECS (10)
#define INITIAL_BUFFER_LEN (100)

PRIVATE struct log_entry_filter log_filter;
PRIVATE char* file_specs[MAX_FILE_SPECS];
PRIVATE void filter_file(FILE* log_file);
PRIVATE void parse_command_line(int argc, char** argv);
PRIVATE void execute_all_tests(void);
PRIVATE char* all_tests(void);
PRIVATE void read_file_specs_from_cl(int argc, char* argv[]);
PRIVATE void filter_files(glob_t* glob);
PRIVATE void free_file_specs(void);
PRIVATE void print_version(void);
PRIVATE void filter_file_specs(void);

int main(int argc, char** argv)
{
    parse_command_line(argc, argv);
    if (file_specs[0] != NULL) {
        filter_file_specs();
        free_file_specs();
    } else {
        /* No log files named on the command line: filter standard input. */
        filter_file(stdin);
    }
    filter_free(&log_filter);
    return EXIT_SUCCESS;
}

/* Print every line of log_file whose parsed entry passes the filter. */
PRIVATE void filter_file(FILE* log_file)
{
    struct log_file_entry* entry;
    char* line = NULL;
    size_t length = INITIAL_BUFFER_LEN;

    line = buffer_allocate(line, INITIAL_BUFFER_LEN + 1);
    while (getline(&line, &length, log_file) != -1) {
        assert(line != NULL);
        entry = parse_line(line);
        if (entry) {
            if (filter_entry(&log_filter, entry)) {
                fputs(line, stdout);
            }
            free_entry(entry);
        }
    }
    free(line);
}

PRIVATE void free_file_specs(void)
{
    int counter = 0;

    while (file_specs[counter] != NULL) {
        free(file_specs[counter]);
        counter++;
    }
}

/* Expand every file spec with glob() and filter the matching files. */
PRIVATE void filter_file_specs(void)
{
    int counter = 0;
    int flags = 0;
    int status;
    glob_t glob_buf;

    assert(file_specs[0] != NULL);
    while (file_specs[counter] != NULL) {
        status = glob(file_specs[counter], flags, NULL, &glob_buf);
        switch (status) {
        case GLOB_NOSPACE:
            /* Out of memory error */
            exit_with_diagnostic("Ran out of memory whilst globbing...\n");
            break;
        case GLOB_NOMATCH:
            /* The pattern didn't match any files */
            exit_with_diagnostic("No files match file spec\n");
            break;
        default:
            /* Everything went ok, just carry on... */
            break;
        }
        /* Later patterns append to the results of earlier ones. */
        flags |= GLOB_APPEND;
        counter++;
    }
    assert(glob_buf.gl_pathc > 0);
    filter_files(&glob_buf);
    globfree(&glob_buf);
}

PRIVATE void filter_files(glob_t* glob)
{
    size_t i; /* gl_pathc is a size_t, so the index matches its type */
    FILE* log_file;

    for (i = 0; i < glob->gl_pathc; ++i) {
        log_file = fopen(glob->gl_pathv[i], "r");
        if (!log_file) {
            exit_with_diagnostic("Unable to open log file\n");
        }
        filter_file(log_file);
        fclose(log_file);
    }
}

PRIVATE void usage(void)
{
    exit_with_diagnostic(
        "usage: " PACKAGE_NAME " [-hiTv] [-b browser] [-c client] [-f filter(s)]\n"
        " [-I identity] [-m method] [-p protocol] [-r referer] [-s status]\n"
        " [-u uri] [-U user] [-z size] logfile [logfile...]\n"
        "\n"
        " -b browser filter for user agent (browser) string\n"
        " -c client filter for client address\n"
        " -h get usage message\n"
        " -i do case-insensitive string searches\n"
        " -I identity ?? filter on second field of log file\n"
        " -m method filter on request method (e.g. GET, POST...)\n"
        " -p protocol filter on HTTP protocol version field (e.g. HTTP/1.1)\n"
        " -r referer filter on document referer string\n"
        " -s status filter on request status value (e.g. 200, 404...)\n"
        " -T run internal test suite\n"
        " -u uri filter on document URI\n"
        " -U user filter on user name used in request, if any\n"
        " -v show program's version number\n"
        " -z size filter on document size\n"
        "\n");
}

PRIVATE void parse_command_line(int argc, char** argv)
{
    int choice;

    if (argc <= 1) {
        usage();
    }
    memset(file_specs, 0, MAX_FILE_SPECS * sizeof(char*));
    while (((choice = getopt(argc, argv, "b:c:hiTI:m:p:r:s:tu:U:vz:")) != -1)) {
        switch (choice) {
        case 'b':
            save_ua_filter(&log_filter, optarg);
            break;
        case 'c':
            save_client_filter(&log_filter, optarg);
            break;
        case 'h':
            usage();
            break;
        case 'i':
            /* Perform case insensitive matches */
            case_sensitive = 0;
            break;
        case 'I':
            save_identity_filter(&log_filter, optarg);
            break;
        case 'm':
            save_method_filter(&log_filter, optarg);
            break;
        case 'p':
            save_protocol_filter(&log_filter, optarg);
            break;
        case 'r':
            save_referer_filter(&log_filter, optarg);
            break;
        case 's':
            save_status_filter(&log_filter, optarg);
            break;
        case 'T':
            execute_all_tests();
            break;
        case 'u':
            save_uri_filter(&log_filter, optarg);
            break;
        case 'U':
            save_user_id_filter(&log_filter, optarg);
            break;
        case 'v':
            print_version();
            break;
        case 'z':
            save_size_filter(&log_filter, optarg);
            break;
        default:
            usage();
            exit_with_diagnostic("\nUnknown command line option");
            break;
        }
    }
    read_file_specs_from_cl(argc, argv);
}

/* Copy the non-option arguments (the log file specs) into file_specs. */
PRIVATE void read_file_specs_from_cl(int argc, char* argv[])
{
    int cl_counter;
    int file_spec_counter = 0;
    char* file_spec;

    assert(file_specs[0] == NULL);
    for (cl_counter = optind; cl_counter < argc; ++cl_counter) {
        /* Keep the last slot NULL so the array stays NULL-terminated. */
        if (file_spec_counter >= MAX_FILE_SPECS - 1) {
            exit_with_diagnostic("Too many file specs on the command line\n");
        }
        file_spec = malloc(strlen(argv[cl_counter]) + 1);
        if (!file_spec) {
            exit_with_diagnostic("Failed to allocate buffer for file spec");
        }
        strcpy(file_spec, argv[cl_counter]);
        file_specs[file_spec_counter++] = file_spec;
    }
}

PRIVATE void print_version(void)
{
    printf("%s version %s\n", PACKAGE_NAME, VERSION);
    exit(EXIT_SUCCESS);
}

PRIVATE char* all_tests(void)
{
    mu_run_test(entry_all_tests);
    mu_run_test(filter_all_tests);
    return 0;
}

PRIVATE void execute_all_tests(void)
{
    int exit_code = EXIT_SUCCESS;
    char *result;

    result = all_tests();
    if (result != 0) {
        printf("%s\n", result);
        exit_code = EXIT_FAILURE;
    } else {
        printf("ALL TESTS PASSED\n");
    }
    printf("Tests run: %d\n", tests_run);
    exit(exit_code);
}