KHULNA UNIVERSITY OF ENGINEERING & TECHNOLOGY
COMPUTER SCIENCE & ENGINEERING
Report Submission Sheet (No assessment will be accepted without this)

Student to Complete
Actual Hand-In Date: 22.06.2008
Student Name: M. S. A. SHAHNAWAZ CHOWDHURY, ROLL-0507014 & S. M. ABU SALEH SHAWON, ROLL-0507016
This submission is the result of our own work. Primary and secondary sources of information and any contributions to the work by third parties, other than my tutors, have been fully and properly attributed. Should this statement prove to be untrue I recognise the right and duty of the University to take appropriate action in keeping with the regulations regarding candidates’ use of unfair means during assessment.

Academic Staff to Complete
Supervisor Name: RUSHDI SHAMS
Course Title: SOFTWARE DEVELOPMENT PROJECT-2
Course No: CSE-3100
Report Title: A TOOL FOR CORPUS ANALYSIS
Date Issued: 25.02.2008
Received:
Hand-In Date: 22.06.2008
On Time / Late (within 5 working days of issue)
Mark Allocated: ______________________
Comments: ______________________
Contents

1. Acknowledgements
2. Introduction
3. Analysis
3.1 Corpus
3.1.1 Corpus requirement
3.1.2 Corpus creation
3.1.3 Type of corpus
3.1.4 Why corpus is needed
3.2 Corpus Analysis Tools
3.2.1 Analysis Tools
3.2.1.1 Design a parser
3.2.1.2 Save corpus into database
3.2.1.3 Search & Show any data
3.2.1.4 Representation of corpus as TREE
3.3 System requirements
4. Snapshots
5. Future plan
6. Conclusion
7. References
Chapter-1
Acknowledgements

With due homage and honor, we wish to express our gratitude to Almighty Allah. We offer our special thanks to RUSHDI SHAMS Sir (Lecturer, Department of Computer Science & Engineering), our honorable supervisor, who gave us the idea for this project and guided us in a friendly and excellent manner. We also express our gratitude to all our teachers, senior students and our cordial batch mates for their invaluable support.
Chapter-2
Introduction

Corpus-based approaches to dialogue have become an increasingly important part of dialogue agent design. They show the scope of the real issues that must be dealt with in order to engage in natural dialogue with humans, and they provide the basic data for statistical methods of language processing. This report describes the development of a tool for a multi-modal corpus based on language interaction. Our software analyzes a corpus and represents it in several ways. Here we use an XML file as the corpus.
Chapter-3
Analysis

3.1 Corpus:
A corpus is a large body of machine-readable texts. In linguistics and lexicography, a corpus is a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database. The main purpose of a corpus is to verify a hypothesis about language. Corpora are ideal for functionally based analyses of language, and they have other uses as well. Computer corpora may now store many millions of running words, whose features can be analyzed by means of tagging and the use of concordance programs.

3.1.1 Corpus requirement:
• Corpus creation
o Import of existing data
o Support of state-of-the-art linguistic software
• Corpus analysis
o Linguistically relevant queries
o Generation of sub-corpora
• Corpus extension
o Simple corpus extension
o Revision mechanisms
• Corpus dissemination
o Within a working group and extensibility

3.1.2 Corpus creation:
• Electronic form: adaptation of material already in electronic form, scanning, keyboarding.
• Permissions: copyright, safeguard against exploitation and piracy.
• Design: types and proportions of material in it (spoken and written language, from formal to informal, from literary to ordinary, etc.).
• Characteristics:
o quantity => large
o quality => authentic
o simplicity => plain text
o documented => document
3.1.3 Type of Corpus:
There are many types of corpus, such as:
• XML
• RDF
• Topic Map
• Concept Map
3.1.4 Why Corpus is needed:
There are many reasons why a corpus is necessary. They are as follows:
• To verify a hypothesis about language.
• To make representative decisions.
• To study computational linguistics.
• To analyze a natural, knowledge-based topic.
3.2 Corpus Analysis Tools
We used the NetBeans 6.0 IDE to implement our software and MySQL as the DBMS. To connect the program to the database we needed a connector JAR file (the JDBC driver), which we copied into the JDK library.
3.2.1 Analysis Tools:
Our first job is to parse the corpus (an XML file), so we had to design a PARSER. For that we first determined the TAGS of the XML file. After getting the tags we parse the corpus and save it to the database, using MySQL as the DBMS. Then we can search and show any data within the corpus. After that we represent the whole corpus in a TREE view. In short, our implementation is divided into 4 parts:
1. Design a parser.
2. Save corpus into database.
3. Search & Show any data.
4. Representation of corpus as a TREE.
3.2.1.1 Design a parser:
To design a parser we have to implement a program that first collects the tags of the corpus. These tags are then used to parse the corpus.

Efficient parsing of XML documents becomes more critical as XML is adopted more widely. It is very important to have an efficient way to parse XML data, especially in applications that are intended to handle large volumes. Improper parsing can result in excessive memory usage and processing times that hurt scalability.

Several types of XML parsers are available. An XML parser takes a raw serialized string as input and performs certain operations on it. First it checks the syntactic well-formedness of the XML data, making sure that the start tags have matching end tags and that there are no overlapping elements. Most parsers also implement validation against the Document Type Definition (DTD) or the XML Schema to verify that the structure and content are as specified. Finally, the parsing output provides access to the content of the XML document via programmatic APIs.

There are three popular XML parsing techniques for Java:
• Document Object Model (DOM), a mature standard from W3C
• Simple API for XML (SAX), the first widely adopted API for XML in Java and a de facto standard
• Streaming API for XML (StAX), a promising new parsing model introduced in JSR-173

Each of these techniques has benefits and drawbacks. We worked with the SAX parsing technique.
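
The tag-collection step described above could be sketched with SAX roughly as follows. This is only an illustration, not the project's actual parser: the class name TagCollector, the method collectTags and the sample XML are assumptions made for the example. The handler records each element name the first time the SAX parser reports it.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the distinct tag names of an XML corpus.
public class TagCollector extends DefaultHandler {
    // Distinct element names, kept in order of first appearance.
    private final Set<String> tags = new LinkedHashSet<String>();

    // Called by the SAX parser for every start tag it encounters.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        tags.add(qName);
    }

    public Set<String> getTags() {
        return tags;
    }

    // Parses an XML string and returns the distinct tag names it contains.
    public static Set<String> collectTags(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TagCollector handler = new TagCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTags();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample corpus, for illustration only.
        String xml = "<corpus><entry><word>tree</word><pos>noun</pos></entry></corpus>";
        System.out.println(collectTags(xml)); // prints [corpus, entry, word, pos]
    }
}
```

Because SAX reports events while it streams through the input, the handler never holds the whole document in memory, which is the usual reason for choosing it over DOM for large corpora.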
3.2.1.2 Save corpus into database:
To save the corpus into the database we first have to connect Java with MySQL. Then we use SQL queries to save the corpus into the database. After the corpus has been saved, a confirmation message is shown in the window.
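
A minimal sketch of such a save step through JDBC is given below. The report does not state the table layout, so the table corpus_entries with columns tag and content is an assumption, as are the class name CorpusStore and the command-line arguments; a MySQL connector JAR is assumed to be on the classpath when it is run against a real server.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: stores (tag, content) pairs of a parsed corpus.
public class CorpusStore {
    // Assumed table layout: corpus_entries(tag, content).
    static final String INSERT_SQL =
        "INSERT INTO corpus_entries (tag, content) VALUES (?, ?)";

    // Sends all (tag, content) pairs as one batch and returns how many were submitted.
    public static int save(Connection con, List<String[]> entries) throws SQLException {
        PreparedStatement ps = con.prepareStatement(INSERT_SQL);
        try {
            for (String[] entry : entries) {
                ps.setString(1, entry[0]); // tag
                ps.setString(2, entry[1]); // content
                ps.addBatch();
            }
            ps.executeBatch();
            return entries.size();
        } finally {
            ps.close();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: CorpusStore <jdbc-url> <user> <password>");
            return;
        }
        Connection con = DriverManager.getConnection(args[0], args[1], args[2]);
        try {
            List<String[]> entries = Arrays.asList(
                new String[] {"word", "tree"},
                new String[] {"pos", "noun"});
            System.out.println("Saved " + save(con, entries) + " entries.");
        } finally {
            con.close();
        }
    }
}
```

Using a PreparedStatement with placeholders rather than concatenating corpus text into the SQL string keeps quotes and other special characters in the corpus from breaking the query.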
3.2.1.3 Search & Show any data:
To search for data within the corpus, the user selects which content of the corpus is wanted. The matching data is then shown in a new window. To see all the data in the corpus, the user pushes a button, and a window containing all the data is shown.
3.2.1.4 Representation of Corpus as a TREE:
Finally we represent the whole corpus as a TREE. Clicking a head node in the TREE shows its child nodes.
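
One way such a tree could be built from the XML corpus is sketched below, again with SAX. The class name CorpusTreeBuilder and the method build are assumptions for illustration; the report does not show how the project's tree was constructed. Each start tag becomes a child of the element currently open, and non-blank text becomes a leaf node.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: turns an XML corpus into a Swing tree model.
public class CorpusTreeBuilder extends DefaultHandler {
    // Elements that are currently open, innermost on top.
    private final Deque<DefaultMutableTreeNode> open = new ArrayDeque<DefaultMutableTreeNode>();
    private DefaultMutableTreeNode root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(qName);
        if (open.isEmpty()) {
            root = node;            // first element is the head node
        } else {
            open.peek().add(node);  // attach to the enclosing element
        }
        open.push(node);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Note: SAX may deliver long text in several chunks; each chunk becomes its own leaf here.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty() && !open.isEmpty()) {
            open.peek().add(new DefaultMutableTreeNode(text));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    // Parses the XML string and returns the root node of the resulting tree.
    public static DefaultMutableTreeNode build(String xml) throws Exception {
        CorpusTreeBuilder handler = new CorpusTreeBuilder();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.root;
    }
}
```

The returned root can be handed to a Swing tree, for example new JTree(CorpusTreeBuilder.build(xml)) placed inside a JScrollPane, which provides the expand-on-click behaviour described above.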
System Requirements

Operating Systems:
Supported operating systems are:
• Windows XP Professional or Home editions; all language versions.
• Microsoft Windows Millennium edition.
• Microsoft Windows 98, all editions.
Windows 95 or earlier is not supported.

Software Requirements:
• NetBeans IDE 6.0
• MySQL 5.0
• JDK 1.6
Snapshots
Future Plan

Our future plan is to make the “Text to Knowledge Mapping Prototype” software friendlier, easier and more comfortable for the user. We want its representation of the corpus to be not only theoretically sound but also attractive.
Conclusion

The interface of the software is user friendly. The software runs on the Windows XP operating system, which most computer users are comfortable with. Our main target was to develop software that can really help people make decisions about, or understand the subject of, a corpus. With sufficient technical help and enough time, we could have developed this software more effectively. We look forward to improving our software to make it truly platform independent.
References

• http://www.oracle.com/technology/ Parsing XML Efficiently - Dev xml.html
• http://www.corpus analysis.html
• http://www.cambridge.org
• {robinson, martinovski, stephan, traum}@ict.usc.edu
• Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts
• Rushdi Shams (Lecturer, CSE, KUET), Database Systems Lab Sheets