Sorting out the DL ingredients...
Monica Vladoiu PG University of Ploiesti,
To keep in mind while developing...
Building a library is a major undertaking that needs to be carefully planned.
Distributing any kind of information collection carries certain responsibilities:
• legal issues of copyright: being able to access the docs doesn’t necessarily mean you can give them to others
• social issues: collections should respect the community’s customs
• ethical issues: some things simply should not be made available to others
What to start with?
The technical part of building a DL is easy if
• you know what you want
• it matches what can be done with the available tools
Begin by fixing some broad parameters of your DL:
• the raw material should be in machine-readable form
• metadata (summary info about the docs) is needed
• decide what kind of access your DL will support (possibilities, pitfalls)
If you plan to digitize the material for the collection, this is an overwhelming task:
• it involves manual handling of physical material
• it involves manual correction of computer-based text recognition
Most builders outsource digitization to a specialized team.
Fundamental questions
• What is the DL’s purpose?
• What are the principles for including documents?
• When is one document different from another?
Where the DL material originates
I. an existing library to be converted to digital form
II. a collection of material to be offered as a DL
III. an organized portal into a focused subset of material from the Web
These are neither exclusive nor exhaustive; in practice one often encounters a mixture.
Ideology
What is it that you intend to achieve with your proposed DL, in terms of:
• the collection’s purpose (the objectives to achieve)
• its principles (the directives that will guide decisions on what should be included and, equally important, what should be excluded)
These decisions are difficult ones; librarians exercise wisdom when they make them.
Whenever you build a DL you should
• formulate its purpose and state it clearly as an introduction to the collection
• explain to the users the principles that govern what is included
• include a description of how the collection is organized
What to do about different manifestations of a single work?
Work vs. Document
• in traditional libraries, a work is the disembodied content of a message; it might be thought of as pure information
• a document is a particular physical object (say, a book) that embodies or manifests the work
• in DLs, a doc is a particular electronic encoding of a work
• digital representations of a work are easier to copy and change
• archival and historical records must not be allowed to change, but errors in collections of practical or educational information must be correctable
• when are two docs the same and when are they different?
• what to do with new versions? Overwrite or keep?
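The overwrite-or-keep question can be framed as a per-collection policy. The following is a minimal Python sketch, not a prescribed design: the `VersionedStore` class and its `archival` flag are hypothetical names introduced here for illustration. It keeps every version of a correctable document and refuses any change to an archival one.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Hypothetical store that keeps every version of each document."""
    archival: bool = False  # assumption: archival collections never change
    _versions: dict = field(default_factory=dict)

    def add(self, doc_id: str, content: str) -> int:
        """Store content and return its 1-based version number."""
        history = self._versions.setdefault(doc_id, [])
        if history and self.archival:
            raise PermissionError(f"{doc_id} is archival and cannot be changed")
        history.append(content)
        return len(history)

    def latest(self, doc_id: str) -> str:
        """Return the most recent version."""
        return self._versions[doc_id][-1]

    def version(self, doc_id: str, number: int) -> str:
        """Return an earlier version by its 1-based number."""
        history = self._versions[doc_id]
        if not 1 <= number <= len(history):
            raise IndexError(f"{doc_id} has no version {number}")
        return history[number - 1]
```

Keeping old versions rather than overwriting them costs storage, but it lets a corrected educational text and its earlier state coexist.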
I. Converting an existing library (1)
• this is the most ambitious and expensive kind of DL project
• it involves the digitization (scanning, OCR) of an existing paper-based collection, which takes a huge effort
• before starting, consider carefully whether there is really a need, and weigh the cost against the benefits of the project
DLs have 3 main advantages over conventional ones:
• easier to access remotely
• more powerful searching and browsing facilities
• foundation for new value-added services
The customer base of the library indicates whether a DL is the better option:
• will it expand if the info can be made available electronically?
• what new services could the DL support? (e.g. automatic notification of new docs that match clients’ interest profiles)
I. Converting an existing library (2)
• will the new DL coexist with the existing physical one, or supplant it? Maintaining two libraries is very costly
• at what rate is the collection growing or changing? In many cases digitization of material is an ongoing operational expenditure
• should you outsource the whole DL operation?
• can user needs be (partially) satisfied in alternative ways? (e.g. supplying services on subscription)
Converting an existing library is such a large and expensive task that all alternatives should be carefully explored!
I. Converting an existing library (3)
How to prioritize the material for conversion? Library materials can be divided into three classes:
• special collections and unique materials (rare books, manuscripts)
• high-use items that are used for teaching and research
• low-use items that include less frequently used research materials
Criteria for digital conversion:
• the intellectual content or scholarly value of the material, the desire to enhance access to it, and available funding opportunities
• educational value: classroom support, background reading, distance education
• the need to reduce handling of fragile originals
• institutional criteria: promoting unique collections of primary source material, or resource-sharing partnerships with other libraries
• cost and space savings
• copyright issues
I. Converting an existing library (4)
Principles for the development of library collections:
– priority of utility: usefulness is a crucial reason
– local imperative: local collections are built to support local needs, using local resources and having local benefits
– preference for novelty: only limited resources can be devoted to the collection and maintenance of older material
– implication of intertextuality: to add an item to a collection is to create a relationship between it and others
– scarcity of resources: development has to balance scarce resources (funding, staff time, shelf space, user time and attention)
– commitment to the transition: more and more information will become available in digital form. Libraries are responsible for promoting this transition and assisting users to adjust to it.
II. Building a new collection
• DLs are about new ways of dealing with knowledge, of achieving new human goals by changing the way info is used in the world, rather than destroying old libraries
• many DL projects build new collections of new material
• the natural organization to create a DL is the copyright holder; otherwise the builders need the holder’s written approval
• the burden is far greater if digitization is needed
• generating the metadata and converting it (manually) to electronic form is likely to be a major task as well
III. Virtual libraries (1)
• they provide a portal to info that is available electronically elsewhere; the library doesn’t itself hold content
• the web lacks the essential DL features of selection and organization
• information portals usually concentrate on a specific topic or focus on a particular audience
• commercial web search engines are unable to produce consistently relevant results, given their generalized approach, the immense territory they cover and the large number of audiences they serve
III. Virtual libraries (2)
Virtual libraries present new challenges. The value added by imposing a DL organization on a subset of the web is twofold: selection of content and provision of further metadata.
• Content selection: define a purpose or theme for the DL, then discover and select related material on the web; this can be done manually (seeking and filtering) or automatically (crawling). How to direct and control the search?
• Further metadata: added either manually or automatically. Categorizing and classifying web pages helps to connect researchers and students with important scholarly and educational resources.
III. Virtual libraries (3)
In general, the higher the scholarly or educational value of a resource, the greater the amount of expert time to invest in its description.
A scenario for semi-automated resource discovery, with three levels of description:
– automatically generated, with URL, author-supplied metadata, significant keywords and phrases extracted from the full text, and generalized subjects assigned automatically
– manually reviewed by a human expert who edits and enriches the automatically derived metadata, checking it for accuracy and adding annotations and subject headings
– intensively described by a human expert
Info could move from the 1st to the 2nd level if it is judged, on the basis of automatic classification information, to be sufficiently central to the collection’s focus, and from the 2nd to the 3rd level on the basis of sufficiently high usage.
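The promotion rules between the three description levels can be sketched in a few lines of Python. This is an illustrative sketch only: the `Resource` fields, the relevance score, the usage counter and both thresholds are assumptions made here, not values given in the slides.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    AUTOMATIC = 1   # metadata generated automatically
    REVIEWED = 2    # edited and enriched by a human expert
    INTENSIVE = 3   # intensively described by a human expert


@dataclass
class Resource:
    url: str
    relevance: float   # assumed score from automatic classification, 0..1
    monthly_uses: int  # assumed usage counter
    level: Level = Level.AUTOMATIC


# Hypothetical thresholds; a real collection would tune these.
RELEVANCE_THRESHOLD = 0.8
USAGE_THRESHOLD = 100


def next_level(res: Resource) -> Level:
    """Return the level a resource should be queued for (one step at most)."""
    if res.level == Level.AUTOMATIC and res.relevance >= RELEVANCE_THRESHOLD:
        return Level.REVIEWED
    if res.level == Level.REVIEWED and res.monthly_uses >= USAGE_THRESHOLD:
        return Level.INTENSIVE
    return res.level
```

The function only proposes a promotion; the expert work at levels 2 and 3 still has to be done by a person.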
Bibliographical information
• organizing information on a large scale is far more difficult than it seems at first sight
• librarians have experience in classifying and organizing docs in a way that makes relevant information easy to find
• most library users locate info by finding one relevant book on the shelves and then looking around for others in the same area
Objectives of a classical bibliographic system
• Finding: enables a person to find a book of which the author, title or subject is known
• Collocation: shows what the library has by a given author, on a given subject, or in a given kind of literature (info that is nearby in one of several info spaces)
• Choice: assists in the choice of a book either bibliographically (in terms of its edition) or topically (in terms of its character)
Objectives of a DL bibliographic system
• To locate an entity or entities in a file or DB as the result of a search using the entity’s attributes or relationships (same work, same edition, given author, given subject, other criteria)
• To identify an entity, that is, to confirm that the entity described in a record corresponds to the entity sought, or to distinguish among several similar entities
• To select an entity that is appropriate to the user’s needs wrt content, physical format, and the like
• To acquire or obtain access to the entity described, through purchase, loan, and so on, or by online access to a remote computer
• To navigate a bibliographic DB to find works related to a given one by generalization, association, and aggregation, or to find attributes related by equivalence, association, and hierarchy
Bibliographic entities: documents, works, editions, authors, and subjects (1)
• sets of these, such as document collections, the authors of a particular document, and the subjects covered by a document, are also bibliographic entities
• Documents are the basic inhabitants of the physical bibliographic universe (books, classical docs, audios, videos, digital objects etc.)
• a document is a particular physical object (say, a book) that embodies or manifests the work; in DLs, a document is a particular electronic encoding of a work
• since one document can be an integral part of another, docs have a hierarchical structure that can be more faithfully reflected in a DL than in a classical library
• digital docs have a kind of impermanence and fluidity that makes them hard to deal with, yet DLs need to present users with an image of stability and continuity
Bibliographic entities: documents, works, editions, authors, and subjects (2)
• Works are the basic inhabitants of the intellectual bibliographic universe: the disembodied content of docs
• when are 2 docs sufficiently alike to represent the same work?
• for books, revision, updating, expansion, and translation preserve the identity of a work
• migration to another medium may or may not preserve it (e.g. an audio version of a book is likely the same work, while a video would generally not be)
• if different manifestations have the same intellectual content, they may be considered the same work
Bibliographic entities: documents, works, editions, authors, and subjects (3)
• Editions are the book world’s traditional technique for dealing with two difficult problems: different presentation requirements (printing details) and managing change of content (corrected and updated content)
• e-documents generally indicate successive modifications to the same work using terms such as version, release, and revision rather than edition
• in DLs it is far easier to have access to earlier versions of a work
Bibliographic entities: documents, works, editions, authors, and subjects (4)
• Authors seem to have been the primary attribute for identifying works since medieval times
• authorship is not always straightforward: some works emanate from organizations or institutions (are they authors?); others are anthologies (can the editor be considered an author?)
• authorship has significant drawbacks: differences often arise in how names are spelled and formatted (e.g. the name of the Libyan leader Muammar Qaddafi can be written in 47 ways!); moreover, authors sometimes use pseudonyms
Bibliographic entities: documents, works, editions, authors, and subjects (5)
• one solution to the author name problem is to create standardized names for authors, with all other variants grouped under the standard name; of course, appropriate cross-references from all the variant spellings are needed
• this is a particular case of using a controlled vocabulary, or set of preferred terms, to describe entities
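The standardized-name idea above can be sketched as a small lookup table. This is a minimal Python sketch under stated assumptions: the `AuthorityFile` class, the `normalize` helper, and the choice of standard form and variants in the usage note are all illustrative, not taken from an actual authority file.

```python
import unicodedata


def normalize(name: str) -> str:
    """Fold case, accents, punctuation and spacing so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.casefold())
    return " ".join(cleaned.split())


class AuthorityFile:
    """Hypothetical table of cross-references from variants to a standard name."""

    def __init__(self):
        self._preferred = {}  # normalized variant -> standard name

    def add(self, standard: str, variants=()):
        """Register a standard name together with its known variants."""
        for form in (standard, *variants):
            self._preferred[normalize(form)] = standard

    def resolve(self, name: str) -> str:
        """Return the standard name; unknown names come back unchanged."""
        return self._preferred.get(normalize(name), name)
```

For example, after `authority.add("Qaddafi, Muammar", ["Muammar Qaddafi", "Moammar Gadhafi"])`, a lookup of `"moammar GADHAFI"` resolves to the standard form, while a name with no entry is returned as typed rather than guessed.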
Bibliographic entities: documents, works, editions, authors, and subjects (6)
• Titles are attributes of works rather than entities in their own right
• a controlled vocabulary is needed here as well
• for example, different editions of Hamlet use 15 title variants!
Bibliographic entities: documents, works, editions, authors, and subjects (7)
• Subjects rival authors as the predominant gateway to library content
• computer-based catalogs have brought new life into subject searching
• subjects are far harder to assign objectively than authorship, and involve a degree of... subjectivity
• e-docs represent what they are about by including some kind of subject descriptors
• there are 2 basic approaches to automatically ascribing subject descriptors: key-phrase extraction and key-phrase assignment
Bibliographic entities: documents, works, editions, authors, and subjects (8)
• key-phrase extraction: phrases that appear in the document are analyzed in terms of their lexical and grammatical structure, and wrt phrases from a corpus of docs in the same domain; phrases, particularly noun phrases, that appear frequently in this doc but rarely in others in the same domain are good candidates for subject descriptors
• key-phrase assignment: docs are automatically classified wrt a large corpus of docs for which subject descriptors have already been determined (usually manually), and docs inherit descriptors that have been ascribed to similar docs
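The "frequent here, rare elsewhere" criterion for key-phrase extraction can be illustrated with a simple frequency score. The sketch below is an assumption-laden simplification: it skips the grammatical analysis mentioned above, treats any one- or two-word sequence as a candidate, uses a small hand-picked stopword list, and ranks candidates with a TF-IDF style weight. The function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are",
             "for", "on", "by", "with"}


def candidates(text: str, max_words: int = 2) -> list[str]:
    """Return word n-grams that neither start nor end with a stopword."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                phrases.append(" ".join(gram))
    return phrases


def extract_keyphrases(doc: str, corpus: list[str], top: int = 3) -> list[str]:
    """Rank phrases that are frequent in doc but rare in the domain corpus."""
    tf = Counter(candidates(doc))
    corpus_sets = [set(candidates(other)) for other in corpus]

    def score(phrase: str) -> float:
        # Number of corpus docs containing the phrase.
        df = sum(phrase in seen for seen in corpus_sets)
        return tf[phrase] * math.log((1 + len(corpus)) / (1 + df))

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(tf, key=lambda p: (-score(p), p))
    return ranked[:top]
```

A phrase that occurs in every corpus document gets a weight of zero, so domain-wide vocabulary drops out of the ranking however often it appears in the document itself.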
Bibliographic entities: documents, works, editions, authors, and subjects (9)
• the Library of Congress Subject Headings are a comprehensive and widely used controlled vocabulary for assigning subject descriptors
• they currently occupy 5 large printed volumes (~6000 pages each, 2 million entries), known as the big red books
• the aim is to provide a standardized vocabulary for all categories of knowledge, so that all books on a given subject can be easily retrieved
• relationships between subject headings: equivalence, hierarchical relationships, related topics etc.
Bibliographic entities: documents, works, editions, authors, and subjects (10)
• Subject classifications: books on library shelves are usually arranged by subject; each work is assigned a single code or classification, and books that represent the work are physically placed in lexicographic code order
• a classification code is not the same as a subject heading; any particular item has several subject headings but only one classification code
• the aim is to place works into topic categories, so that volumes treating the same or similar topics fall close to one another
• subject classification systems: Library of Congress Classification, the Dewey Decimal Classification etc.
• DLs are not restricted to a single linear arrangement of books; the entire digital collection can be rearranged at the click of a mouse to reflect its hierarchical structure, in different ways depending on what the user wants
• consequently the need for classification largely vanishes, although it still holds for those who do not have access to full texts
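The difference between one classification code and several subject headings can be made concrete with a short Python sketch. The `Item` record, the function names and the codes used in the usage note are hypothetical and chosen only for illustration; they are not real Library of Congress or Dewey codes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    title: str
    class_code: str   # exactly one classification code per item
    headings: tuple   # any number of subject headings


def shelf_order(items):
    """Physical shelf: one linear arrangement, lexicographic by code."""
    return sorted(items, key=lambda it: it.class_code)


def browse_by_heading(items):
    """Digital view: the same item is listed under every heading it has."""
    view = defaultdict(list)
    for it in items:
        for heading in it.headings:
            view[heading].append(it.title)
    return dict(view)
```

With an item coded `"B20"` and carrying the headings `("Cataloging", "Digital libraries")`, `shelf_order` gives it a single position between `"A10"` and `"C30"`, whereas `browse_by_heading` lists it under both headings at once.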
Access modes (1)
• the purpose of libraries is to give people access to information
• DLs have the potential to increase access tremendously
• now DLs come to you: in your office and home, in your hotel while traveling, at the café, on the beach or on the plane
• most DLs are on the web (although many restrict access to in-house use), so one can use them whenever and wherever
• a physical, permanent, immutable copy of the library contents can be valuable
• a DL system can provide a basic end-user service over the web, as well as a more intimate connection with the DL server using an established protocol (e.g. CORBA), to access services such as search, browse, and delivery of docs and metadata
• some DL applications (video libraries and reactive interfaces involving sophisticated visualizations) need higher bandwidth
Access modes (2)
• even though the library is delivered by the web, access may have to be restricted to authorized users (software firewalls, password protection, digital watermarking ...)
• access can be restricted to the library as a whole, or to certain collections/docs/pages/paragraphs within it
• another form of access is communicating with other digital library systems by using an international standard communication protocol (Z39.50) to give users and other libraries remote access to their catalog records
• a DL system may be implemented as a distributed system
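Restrictions that apply to the whole library or to individual collections and docs can be modeled as rules attached to path prefixes. The Python sketch below is one possible design, assuming a hypothetical `rules` table and group names; it does not describe how any particular DL system enforces access.

```python
def is_allowed(user_groups, path, rules):
    """Check access to a path such as ("library", "rare", "ms7").

    rules maps a path prefix (a tuple) to the set of groups admitted at that
    level; a prefix with no rule is unrestricted. Every restricted level on
    the way down to the target must admit the user.
    """
    groups = set(user_groups)
    for depth in range(1, len(path) + 1):
        allowed = rules.get(tuple(path[:depth]))
        if allowed is not None and not (groups & allowed):
            return False
    return True
```

With `rules = {("library",): {"staff", "members"}, ("library", "rare", "ms7"): {"staff"}}`, a member can open an ordinary doc but not the restricted manuscript. Checking every level, rather than only the most specific rule, means a grant deep in the hierarchy never bypasses a restriction on the library as a whole.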
Research themes
• Subject classification systems: Library of Congress Classification, the Dewey Decimal Classification
• document scanning
• OCR (optical character recognition)