Narrator Ebook to Audiobook Converter
A research paper presented to the faculty of CAVITE STATE UNIVERSITY Carmona Campus
In partial fulfillment of the requirements for the degree of Bachelor of Science in Information Technology
By
Ayna A. Bast
November 2018
Chapter 1 THE PROBLEM AND THE SETTING
Introduction
Have you ever thought of listening to your books, articles, and other documents instead of reading them? Text Speaker reads your text documents aloud on your PC and converts them to audio files in MP3 or WAV format. You can listen to the audio files on your MP3 player, iPod, iPhone, or mobile phone while you do other tasks at home or at work. Text Speaker offers a great selection of high-quality, human-sounding voices. The continuous growth of people's music libraries requires more advanced ways of computing playlists through algorithms that match tracks to the user's preferences, and several approaches have been made to enhance the user's listening experience. The use of background music while reading may open up a new era of learning possibilities. For centuries, educators have used music as a learning tool that connects the concept to be acquired with a catchy song or rhythm (Beentjes, J.W.J. et al., 1996).
An electronic book (also referred to as an "e-book") is an electronic version of a traditional print book (or other printed material such as, for example, a magazine, newspaper, and so forth) that can be read by using a personal computer or an e-book reader. Unlike PCs or handheld computers, e-book readers deliver a reading experience comparable to traditional paper books, while adding powerful electronic features for note taking, fast navigation, and keyword searches. However, such actions, irrespective of whether they are performed on a PC, handheld computer, or e-book reader, generally require the user to read the text from a display. Thus, the use of an e-book generally requires the user to focus his or her visual attention on a display to read the text content (e.g., book, magazine, newspaper, and so forth) of the e-book. Moreover, reading of an e-book is generally performed without any music playing in the background, particularly without any music playing from the e-book itself. The same is true for other types of handheld devices such as personal digital assistants (PDAs) and so forth.
In order to increase the naturalness of oral communication between humans and machines, all speech aspects must be involved. Speech does not only transmit ideas and concepts but also carries information about the attitude, emotion, and individuality of the speaker (Y. Chen et al., 2003). Speech is the most used and most natural way for people to communicate. From the beginning of man-machine interface research, speech has been one of the most desired mediums for interacting with computers. Therefore, speech recognition and text-to-speech capabilities have been studied to make communication with machines more human-like. Speaker identity, the sound of a person's voice, is a key factor in oral communication.
Background of the Study
Audiobooks have been in use since e-books were released, and they have been used by parents and their children as an aid to reading. This study focuses on converting e-books in PDF, TXT, DOCX, and ZIP formats into audiobooks that the user can listen to rather than read.
It would be desirable and highly advantageous to have a handheld device that allows a user to assimilate content without having to look at a display.
Objectives
Our intention is to provide a new smartphone application that offers easier reading and a pleasant listening experience at the same time, helping users study while doing other tasks at home or in school. The study generally relates to handheld devices and, more particularly, to mixing music with text-to-speech (TTS).
Significance of the Study
Students will be the main beneficiaries of this application. They will be able to learn the proper intonation of sentences by listening to the converted audiobook, especially in pronunciation exercises, and the application can also increase the usability and productivity of Google Drive. The application improves the listening experience: users do not have to download a video, PDF, TXT, DOCX, or ZIP file in order to access it, and they can listen to long articles with a soft background track. When converting a document to MP3 format, speech can be combined with music. The file formats supported for the background music are MP3, WAV, AIFF, WMA, MPA, ASF, MPEG, MPG, and M1V. The result of this application may give users an easy and convenient reading experience. Lastly, the development of this study will also benefit future researchers, who might extend this system, which may result in the development of another system.
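To make the idea of combining speech with music concrete, below is a minimal Java sketch that mixes a synthesized-speech buffer with a background-music buffer kept at a lower volume so the voice stays intelligible. The class name, the 16-bit mono PCM format, the placeholder samples, and the 30% music gain are illustrative assumptions, not the application's actual implementation.

// Minimal sketch: mix a synthesized-speech PCM buffer with a quieter
// background-music buffer (both assumed to be 16-bit mono at the same sample rate).
public class SpeechMusicMixer {

    // musicGain is the fraction of the music volume kept in the mix (e.g., 0.3f).
    public static short[] mix(short[] speech, short[] music, float musicGain) {
        int length = Math.max(speech.length, music.length);
        short[] mixed = new short[length];
        for (int i = 0; i < length; i++) {
            int s = i < speech.length ? speech[i] : 0;
            int m = i < music.length ? (int) (music[i] * musicGain) : 0;
            int sum = s + m;
            // Clamp to the 16-bit range to avoid clipping artifacts.
            if (sum > Short.MAX_VALUE) sum = Short.MAX_VALUE;
            if (sum < Short.MIN_VALUE) sum = Short.MIN_VALUE;
            mixed[i] = (short) sum;
        }
        return mixed;
    }

    public static void main(String[] args) {
        short[] speech = {1000, -2000, 3000};   // placeholder speech samples
        short[] music  = {500, 500, 500, 500};  // placeholder music samples
        short[] out = mix(speech, music, 0.3f);
        System.out.println("Mixed " + out.length + " samples.");
    }
}

Keeping the music at a fraction of its original volume corresponds to the background-music volume slider described later in the Scope and Limitations.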
Time and Place
The study was conducted from October 2018 to December 2018 at My Value Max Inc., located at Cavite State University Carmona Campus.
Scope and Limitations
One of the functions of this application is that the user can view the document and read it while listening to the voice that reads the text aloud. Playback continues while the device is in sleep mode. The user can also modify the way the voice speaks by speeding up or slowing down the speech, changing the pitch, and changing the volume. The user can choose to play background music while the application reads the document, including free classical music from artists such as Mozart, Beethoven, Bach, and Chopin. The user can also enable the option to add background music to the output file, preview how the audio file sounds with the Test button, and adjust the volume of the background music with a slider.
Definitions of Terms
The following terms as used by the researchers are operationally defined:
Audio File refers to a computer file that contains digitized audio, either in the Compact Disc (CDDA) format or in MP3, AAC, or another compressed format.
E-Book Reader refers to handheld devices such as Amazon's Kindle, Barnes and Noble's NOOK, and Apple's iPad that make it possible for books in digital form to be viewed and read by users.
Human Sounding Voice refers to the sound produced by humans and other vertebrates using the lungs and the vocal folds in the larynx, or voice box. Voice is not always produced as speech; infants babble and coo, animals bark, moo, whinny, growl, and meow, and adult humans laugh, sing, and cry.
iPod refers to a portable music player developed by Apple that supports a wide variety of audio formats, including MP3, AAC, WAV, and AIFF.
PDA, short for personal digital assistant, refers to a handheld device that combines computing, telephone/fax, Internet, and networking features. A typical PDA can function as a cellular phone, fax sender, Web browser, and personal organizer. PDAs may also be referred to as palmtops, handheld computers, or pocket computers.
WAV refers to an audio file format created by Microsoft that has become a standard PC audio file format for everything from system and game sounds to CD-quality audio. A Wave file is identified by the file name extension WAV (rarely, Audio for Windows).
Text Speaker refers to a program that reads text documents aloud and converts them to audio files, used for speech-enabling websites, giving a voice to online documents and mobile apps, or making online/offline content more accessible through text to speech.
Text to Speech, abbreviated as TTS, is a form of speech synthesis that converts text into spoken voice output. Text to speech systems were first developed to aid the visually impaired by offering a computer-generated spoken voice that would "read" text to the user. TTS should not be confused with voice response systems.
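The Scope and Limitations above describe speeding up or slowing down the speech, changing the pitch, and saving the narration as an audio file. Below is a minimal sketch of how those controls could look if the application were built in Java on Android's TextToSpeech API; the platform choice, class name, and parameter values are assumptions for illustration rather than the actual implementation of this study.

import android.content.Context;
import android.speech.tts.TextToSpeech;
import java.io.File;
import java.util.Locale;

// Minimal sketch (assumed Android platform): read text aloud with adjustable
// pitch and speech rate, and optionally synthesize it to an audio file.
public class NarratorVoice {
    private final TextToSpeech tts;

    public NarratorVoice(Context context) {
        tts = new TextToSpeech(context, status -> {
            if (status == TextToSpeech.SUCCESS) {
                tts.setLanguage(Locale.US);
            }
        });
    }

    // A rate or pitch below 1.0 slows/lowers the voice; above 1.0 speeds/raises it.
    public void configureVoice(float rate, float pitch) {
        tts.setSpeechRate(rate);
        tts.setPitch(pitch);
    }

    public void readAloud(String text) {
        tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, "narrator-utterance");
    }

    // Save the narration so it can be listened to later, even offline.
    public void saveAsAudio(String text, File output) {
        tts.synthesizeToFile(text, null, output, "narrator-file");
    }
}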
Chapter II REVIEW OF RELATED LITERATURE
According to Jianlei Xie et al. (2002), there is provided an e-book that comprises a memory device, a text-to-speech (TTS) module, and a music module. The memory device stores files, which include text and music. The TTS module synthesizes speech corresponding to the text, the music module plays back the music, and at least one speaker outputs the speech and the music.
Clark Quinn, professor, author, and expert in computer-based education, defined mobile learning as the intersection of mobile computing (the application of small, portable, and wireless computing and communication devices) and e-learning (learning facilitated and supported through the use of information and communications technology). He predicted that mobile learning would one day provide learning that was truly independent of time and place and facilitated by portable computers capable of providing rich interactivity, total connectivity, and powerful processing. In May 2005, Ellen Wagner, senior director of Global Education Solutions at Macromedia, proclaimed that the mobile revolution had finally arrived. Wherever one looks, the evidence of mobile penetration is irrefutable: cellphones, PDAs, MP3 players, portable game devices, handhelds, tablets, and laptops abound. No demographic is immune from this phenomenon. From toddlers to seniors, people are increasingly connected and are digitally communicating with each other in ways that would have been impossible only a few years ago.
Music capabilities allow an e-book user to enjoy digital music output from the e-book, while TTS capabilities allow the user to listen to synthesized text output. The combination of music and TTS allows an e-book user to listen to the text along with background music. The majority of the evidence tends to support background music because of its positive implications. Cool, Yarbrough, Patton, Runde, and Keith (1994) conducted a study which found that radio noise was generally considered somewhat helpful to students while studying; it kept them focused and on task. Howard Gardner, a Harvard graduate, wrote Frames of Mind in the early 1980s, and it has since become one of the most influential books for education. Gardner believes that music creates a positive and relaxing environment that allows sensory integration to take place and improves concentration; sensory integration is essential for establishing long-term memory. He has also seen background music successfully used to mask outside traffic sounds, release stress before an exam, and reinforce subject matter (Campbell, 1997). Jensen (1998) reported that music can deliver as much as sixty percent more content in five percent of the time usually taken to deliver the same materials.
Based on the article written by Bossard, L. (2008), several solutions already use intelligent playlists embedded in music players installed on computers. There are also online solutions, the most popular of which is last.fm, which acts as a personalized radio station that plays preferred music; on the other hand, it does not allow playback of a particular track. There are also other solutions, like the Genius function of iTunes or the Music Explorer; both use the user's music collection to generate playlists. The biggest disadvantage of the latter solution is that the user can generate playlists only from tracks that he or she already has on his or her PC, which of course greatly limits the power of the algorithm.
Lorenzi (2007) proposes a way of representing the similarity between tracks in a 10-dimensional Euclidean space (further called the music space), where the closeness of tracks is approximately proportional to their similarity. Seven million songs currently appear in the database, but only 500,000 of them have enough user statistics to be mapped in the graph. Using this simplified and computationally efficient way of finding similar tracks, several applications can explore new ways of computing playlists. Most of them offer support in playlist generation, but none also provides the tracks to be played. This could be seen as a disadvantage because not all people possess all the tracks suggested by the space.
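To make the idea of a music space concrete, below is a minimal Java sketch that ranks tracks by their Euclidean distance from a seed track in a 10-dimensional space, so that nearby tracks are treated as similar. The track names and coordinates are invented for illustration; Lorenzi's actual features and database are not reproduced here.

import java.util.*;

// Minimal sketch: treat each track as a point in a 10-dimensional "music space"
// and rank candidate tracks by Euclidean distance from a seed track.
public class MusicSpace {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Invented 10-dimensional coordinates for a handful of tracks.
        Map<String, double[]> tracks = new HashMap<>();
        tracks.put("Track A", new double[]{0.1, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2, 0.3, 0.4, 0.5});
        tracks.put("Track B", new double[]{0.2, 0.2, 0.3, 0.5, 0.5, 0.1, 0.2, 0.2, 0.4, 0.6});
        tracks.put("Track C", new double[]{0.9, 0.8, 0.1, 0.0, 0.3, 0.7, 0.9, 0.8, 0.1, 0.2});

        double[] seed = tracks.get("Track A");

        // Sort the other tracks by closeness to the seed: a smaller distance means more similar.
        tracks.entrySet().stream()
              .filter(e -> !e.getKey().equals("Track A"))
              .sorted(Comparator.comparingDouble(
                      (Map.Entry<String, double[]> e) -> distance(seed, e.getValue())))
              .forEach(e -> System.out.printf("%s  distance=%.3f%n",
                      e.getKey(), distance(seed, e.getValue())));
    }
}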
Klusacek [59] proposed a conditional pronunciation modeling method. It uses time-aligned streams of phones and phonemes to model a speaker's specific pronunciation. The system uses phonemes drawn from a lexicon of pronunciations of words recognized by an automatic speech recognition system to generate the phoneme stream, and an open-loop phone recognizer to generate a phone stream. The phoneme and phone streams are aligned at the frame level, and conditional probabilities of a phone, given a phoneme, are estimated using co-occurrence counts. A likelihood detector is then applied to these probabilities for the speaker detection task. This approach achieves relatively high accuracy in comparison with other phonetic methods in the SuperSID project at the Johns Hopkins 2002 Workshop [114] [90].
According to H. Gish et al. (1986), the majority of speaker models, including Gaussian mixture models, are based on modeling the underlying distribution of feature vectors from a speaker. When the speech is corrupted, the spectral-based features are also corrupted and so their distributions are modified. Thus, a speaker model trained using speech from one type of corrupt environment will generally perform poorly in recognizing the same speaker using speech collected under different conditions, since the feature distributions are now different. Various studies of speaker recognition systems using degraded or distorted speech have shown a dramatic decrease in performance [47] [38]. Current speaker recognition research mainly focuses on recognition under controlled conditions, such as Switchboard telephone speech, which is close-talking speech. A large amount of effort is still needed in research on speaker recognition robustness under unrestricted conditions in open environments with distant microphones.
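As a rough illustration of the co-occurrence idea described above, the following Java sketch estimates conditional probabilities of a phone given a phoneme from frame-aligned streams and then scores a test alignment by its average log-likelihood. The symbol labels, the smoothing constants, and the toy streams are assumptions made for illustration; this is not Klusacek's actual model.

import java.util.*;

// Minimal sketch: estimate P(phone | phoneme) from frame-aligned streams using
// co-occurrence counts, then score a test alignment with an average log-likelihood.
public class PronunciationModel {

    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Each index i is one frame: phonemes[i] comes from the lexicon/recognizer side,
    // phones[i] from the open-loop phone recognizer.
    public void train(String[] phonemes, String[] phones) {
        for (int i = 0; i < phonemes.length; i++) {
            counts.computeIfAbsent(phonemes[i], k -> new HashMap<>())
                  .merge(phones[i], 1, Integer::sum);
            totals.merge(phonemes[i], 1, Integer::sum);
        }
    }

    // Add-one style smoothing so unseen phone/phoneme pairs get a small probability.
    public double probability(String phoneme, String phone) {
        int joint = counts.getOrDefault(phoneme, Map.of()).getOrDefault(phone, 0);
        int total = totals.getOrDefault(phoneme, 0);
        return (joint + 1.0) / (total + 50.0);
    }

    public double averageLogLikelihood(String[] phonemes, String[] phones) {
        double sum = 0.0;
        for (int i = 0; i < phonemes.length; i++) {
            sum += Math.log(probability(phonemes[i], phones[i]));
        }
        return sum / phonemes.length;
    }

    public static void main(String[] args) {
        PronunciationModel model = new PronunciationModel();
        model.train(new String[]{"AA", "AA", "T", "T"}, new String[]{"aa", "ah", "t", "t"});
        double score = model.averageLogLikelihood(new String[]{"AA", "T"}, new String[]{"aa", "t"});
        System.out.printf("Average log-likelihood: %.3f%n", score);
    }
}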
Chapter III RESEARCH METHODOLOGY
This chapter discusses the research design, the selection of the participants, as well as the instrumentation and validation, data gathering procedures, and treatment and analysis of data.
Materials
Various hardware and software were used for the study. A Windows-operated personal computer, a printer, and an 8 GB flash drive were the hardware utilized for the development of the study. For the software requirements, the following were used: Adobe Photoshop CC and Adobe Illustrator CS6 for the graphical user interface of the application; Java for the programming language; MySQL for the database; Sublime Text and Notepad++ for coding; Google Chrome, Torch r20, and Mozilla Firefox for the browsers of the study; and Microsoft Office 2010 to create the documentation.
Methods
The application design is about developing the NARRATOR E-book to Audiobook Converter application, with which the user can do the following things:
Read documents by just listening.
Convert e-book files to audiobook files (see the sketch after this list).
Change the GUI color scheme.
Change the background music.
Change the reader's voice personality.
Change the mode (day/night mode) in which the page is displayed.
Search for content in the document using keywords.
Automatically flag document pages and sections.
Read .PDF, .DOCX, and .TXT files from Google Drive.
Share the content of a book on a Facebook wall.
Set an alarm as a reminder to read a particular book in the future.
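As a rough sketch of the first two items, converting a document into an audiobook starts with extracting its plain text and splitting it into utterance-sized pieces that a speech engine can synthesize one at a time. The sketch below handles only .TXT files and uses an invented TtsEngine interface as a stand-in for whatever speech engine the application actually uses; .PDF and .DOCX extraction and Google Drive access are not shown.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: extract text from a .TXT document and feed it to a TTS engine
// sentence by sentence. The TtsEngine interface is a placeholder assumption.
public class EbookToAudiobook {

    interface TtsEngine {
        void synthesize(String sentence); // e.g., speak aloud or append to an audio file
    }

    static List<String> splitIntoSentences(String text) {
        List<String> sentences = new ArrayList<>();
        // Naive split on sentence-ending punctuation; a real reader would be smarter.
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isBlank()) {
                sentences.add(s.trim());
            }
        }
        return sentences;
    }

    static void convert(Path txtFile, TtsEngine engine) throws IOException {
        String text = Files.readString(txtFile);
        for (String sentence : splitIntoSentences(text)) {
            engine.synthesize(sentence);
        }
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("sample", ".txt");
        Files.writeString(sample, "Hello there. This is a sample e-book! Enjoy listening?");
        convert(sample, sentence -> System.out.println("TTS would read: " + sentence));
    }
}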
SOFTWARE DEVELOPMENT MODEL (WATERFALL MODEL)
The waterfall model is a popular version of the systems development life cycle model for software engineering. Often considered the classic approach to the systems development life cycle, the waterfall model describes a development method that is linear and sequential.
Waterfall development has distinct goals for each phase of development. Imagine a waterfall on the cliff of a steep mountain. Once the water has flowed over the edge of the cliff, gravity is in control, and the water cannot run uphill. It is the same with waterfall development: once a phase of development is completed, the development proceeds to the next phase, and there is little or no interplay between phases [12, 24] (Figure 1).
Requirements
This is the first phase of the software development life cycle. Here we gather all the requirements that have to be fulfilled by the developed software application [12].
Figure 1. Definitions of the different phases of the waterfall model. Source: CrackMBA. Waterfall Model, 2011. http://crackmba.com/waterfall-model/, accessed Nov. 2018.
Design
After gathering the requirements, we design this particular project. Here we design the system according to the requirements gathered in the first phase. We use UML to document aspects of the design of the system [12].
Construction
Here the code is implemented. This is the phase where we implement the actual system according to the design. This phase is also called the coding phase [12].
Testing
We test after the coding part is finished. In this testing phase, we test the code using different testing methods and execute it with a variety of tests until there are no errors. Once integration is done, we test the system again for proper functionality [12].
Installation
After testing the application, we deploy or install the software in the real-time environment so it can be used. The customer is involved in this deployment process: the customer sees the coding, testing, and execution, and if any changes are wanted, the application is modified again [12].
Maintenance
Any issues that arise while the software is in use are handled in the maintenance phase. After deployment, if the customers are not satisfied with the project, it is modified again. The project team maintains all these phases in consultation with the customers [12].