1.7 Application of Metadata in Open Access
Application guidelines for metadata encoding are needed to mitigate the detrimental effects of divergent interpretations of the metadata standards that exist in the open access landscape. National-level initiatives provide such guidelines on interpreting encoding rules, using standards in encoding, and rendering metadata elements. These guidelines may help you in different aspects of applying metadata to manage OA content: they reduce ambiguity, improve the efficiency with which metadata can be harvested, and enhance the accuracy and value of services built on metadata harvesting. This section has two components: the first covers guidelines developed by different initiatives, and the second shows how those guidelines are applied in organizing OA content.
1.7.1 Guidelines and Initiatives
Most of the guidelines (as developed in the US and the UK) advocate sorting metadata elements into four categories: Required, Required if Applicable, Recommended and Optional. The basic purpose of the categorization is to identify the elements a user needs in a shared metadata environment. The guidelines are not format-specific; rather, they identify the elements commonly needed across all formats. An analysis of existing suggestions and guidelines shows the following categorization of metadata elements:
Required
- Date Created or Date Published (dc:date)
- Identifier (dc:identifier)
- Institution Name (dc:publisher)
- Title (dc:title)
- Type of Resource (dc:type)
Required if Applicable
- Creator (dc:creator)
- Extent (dc:format)
- Language of Resource (dc:language)
- Related Item (dc:relation)
Recommended
- Description (dc:description)
- Access or Use Restrictions (dc:rights)
- Format of Resource (dc:format)
- Place of Origin (dc:coverage)
- Rights Information (dc:rights)
- Subject (dc:subject)
Optional
- Citation (dc:relation)
- Collection Name
- Contributor (dc:contributor)
- Genre (dc:type)
- Keywords or Tags (dc:subject)
- Language of Metadata Record (no dc map)
- Notes (dc:description)
- Publisher (dc:publisher)
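The categorization above lends itself to a simple completeness check at data entry time. The following is a minimal sketch, assuming a record is held as a dictionary that maps element names to lists of values; the function name `missing_required` and the sample record are hypothetical, and only the first group of elements in the list (the required ones) is checked.

```python
# Elements from the first group in the list above (treated here as required).
REQUIRED = ("dc:date", "dc:identifier", "dc:publisher", "dc:title", "dc:type")


def missing_required(record):
    """Return the required elements that are absent or empty in a record.

    `record` is a hypothetical structure: a dict mapping an element name
    to a list of string values.
    """
    missing = []
    for element in REQUIRED:
        values = record.get(element, [])
        # An element counts as present only if at least one value is non-blank.
        if not any(str(value).strip() for value in values):
            missing.append(element)
    return missing


# Hypothetical sample record: the identifier is blank, the publisher is absent.
record = {
    "dc:title": ["Annual report"],
    "dc:date": ["2013-03-01"],
    "dc:identifier": [""],
    "dc:type": ["Text"],
}
print(missing_required(record))
```

A repository could run such a check before a record is accepted, and extend it with a second tuple for the elements that are required only if applicable.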
The application of metadata to describe OA resources is guided by four principles that are independent of metadata schema: i) Content Standards for Metadata (to guide what information should be recorded when describing a particular type of resource and how that information should be recorded); ii) Data Value Standards for Metadata (to help normalize data element sets and ensure consistency between records); iii) Structural Standards for Metadata (to guide the selection of fields or elements where the data resides); and iv) Syntax Standards for Metadata (to guide the encoding of data values so that they can be processed by different systems).
Content Standards for Metadata
Content standards improve the ability to share metadata records and the discoverability of OA resources. Consistent description in metadata records helps users to understand and analyze search results efficiently. Metadata that is formatted inconsistently (e.g. names recorded both as “Last name, First name” and “First name / Last name”) impairs indexing and sorting, and users are left with confusing or incomplete results. OA content management software applies content standards to differing degrees. For example, the Greenstone digital library software enforces no content standard for encoding DC.Creator (Figure 5), whereas DSpace and EPrints provide separate fields for the creator's last name and first name. EPrints (Figure 6) also offers a help button (the ? mark) to guide submitters in encoding a particular metadata element or field. DSpace, besides maintaining content standards, provides examples and links to a help file to support resource description (Figure 7).
Beyond the practices of individual library professionals, the content standards built into software may follow established standards such as the Anglo-American Cataloguing Rules (AACR2), which cover the description of different formats and the provision of access points; Resource Description and Access (RDA), which guides content description using FRBR principles (work/expression/manifestation/item); Cataloging Cultural Objects (CCO), which covers the encoding of cultural heritage resources; and Describing Archives: A Content Standard (DACS), for single- and multi-level descriptions of archives, personal papers, and manuscripts.
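The effect of inconsistent name formatting can be reduced by normalizing names before they are indexed. The sketch below is illustrative only: it assumes that just the two conventions mentioned above occur (“Last name, First name” and “First name / Last name”), and the function name `normalize_name` is hypothetical rather than part of any repository software.

```python
def normalize_name(raw):
    """Return a personal name in 'Last name, First name' form.

    Handles only two input conventions (an assumption of this sketch):
    'Last, First' and 'First / Last'. Anything else is returned with its
    whitespace collapsed, because the name order cannot be inferred safely.
    """
    text = " ".join(raw.split())  # collapse runs of whitespace
    if "," in text:
        last, first = text.split(",", 1)
    elif "/" in text:
        first, last = text.split("/", 1)
    else:
        return text
    return f"{last.strip()}, {first.strip()}"


print(normalize_name("Sanjay / Mishra"))
print(normalize_name("Mishra,  Sanjay"))
```

Both calls yield the same string, so the two records would sort and index together.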
Data Value Standards for Metadata
Standardization of data values is important for the retrieval and sharing of OA content. Data value standards prescribe normalized lists of terms to be used for certain data elements. They advocate the use of controlled terms to ensure consistency and to achieve collocation of resources related to the same topic or person through the application of thesauri, controlled vocabularies, and authority files. The recommended data entry standardization tools are:
- Getty Art and Architecture Thesaurus (AAT) is a structured vocabulary for terms used to describe art, architecture, decorative arts, material culture, and archival materials.
- Getty Thesaurus of Geographic Names (TGN) is a structured vocabulary for names and other information about places.
- Getty Union List of Artist Names (ULAN) is a structured vocabulary for names and other information about artists.
- Library of Congress Subject Headings (LCSH) comprises a thesaurus of subject headings, maintained by the United States Library of Congress.
- Library of Congress Name Authorities (LCNA) includes Corporate Names, Geographic Names, Conference Names, and Personal Names.
- Thesaurus for Graphic Materials I: Subject Terms (TGM-I) consists of terms and numerous cross references for indexing topics shown or reflected in pictures.
- Thesaurus for Graphic Materials II: Genre and Physical Characteristic Terms (TGM-II) is a thesaurus of terms for describing genre and physical characteristics.
Many OA repository software packages support data value standards. For example, EPrints includes the entire Library of Congress subject areas to support standard encoding of the DC.Subject field, and DSpace includes a research category list (which must be activated through the DSpace configuration file) to help in populating the DC.Subject field (see Figure 8). These data value standards are available to both cataloguers/indexers and searchers.
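The way a controlled list constrains the DC.Subject field can be sketched in a few lines. This is a minimal illustration, not the mechanism used by EPrints or DSpace: the vocabulary below is a hypothetical three-term excerpt, and the function name `split_subjects` is invented for the example.

```python
# Hypothetical excerpt of a controlled subject list (illustrative terms only).
SUBJECT_TERMS = {"Open access publishing", "Metadata", "Digital libraries"}


def split_subjects(values, vocabulary=SUBJECT_TERMS):
    """Separate subject values into controlled terms and unmatched strings.

    Matching ignores case and extra whitespace; matched values are replaced
    by the preferred form from the vocabulary.
    """
    lookup = {term.casefold(): term for term in vocabulary}
    accepted, rejected = [], []
    for value in values:
        key = " ".join(value.split()).casefold()
        if key in lookup:
            accepted.append(lookup[key])
        else:
            rejected.append(value)
    return accepted, rejected


accepted, rejected = split_subjects(["metadata", "digital  Libraries", "OA stuff"])
print(accepted)
print(rejected)
```

Values that fall outside the list can then be returned to the submitter for correction instead of being stored as free text.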
Structural Standards for Metadata
Metadata structure consists of elements for the description of data. Structural standards define the fields, the scope of each field and the type of information to be stored (see Table 3 for DCMES). As a rule, it is better to apply a metadata structure with a high level of granularity, for a simple reason: it is always easier to transfer metadata from a granular structure to a simpler one than the reverse. In some cases structural standards mandate which syntax standards should be used (for example, the W3C encoding rules for dates and times [42], based on ISO 8601). Structural standards for generic and domain-specific schemas generally follow some broad principles:
- Fields/elements should be unambiguous.
- Some fields/elements may be mandatory.
- Some fields/elements may be repeatable.
- Some fields/elements may have a unique value to identify the record (e.g. use of a DOI in DC.Identifier).
- Some fields may have defined relationships with other fields, e.g. qualifiers or subfields.
The UK Metadata Guidelines for Open Access Repositories, in the document “Phase 1: Core Metadata (Version 0.9)” published in March 2013, prescribed the following minimum fields/elements as a structural standard for OA resources (M = Mandatory, R = Repeatable, O = Optional) (Figure 9):
This standard mostly recommends simple DCMES for OA repositories, with Qualified DC in two instances (dcterms:issued and dcterms:relation). The recommendations also include two new elements specific to OA resources: project ID (a unique identifier normally provided by the funder) and funder name. Most of the elements have the namespace 'dc', and the two new elements have the 'rioxxterms' namespace. This UK-specific guideline is based on the DRIVER project, the OpenAIRE Guidelines (OpenAIRE project [43]) and UKETD_DC (the metadata core set recommended by the British Library’s Electronic Theses Online Service, EThOS [44]). Please see section 1.5.3 for structural standards in different domains.
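The earlier point about granularity, that a granular structure transfers easily to a simpler one, can be illustrated with a small crosswalk. The sketch below is written under stated assumptions: records are dictionaries of element names to value lists, the function name `to_simple_dc` is hypothetical, and the element names `rioxxterms:projectid` and `rioxxterms:funder` are assumed spellings of the two OA-specific elements described above.

```python
# dcterms:issued refines the simple Dublin Core "date" element.
REFINES = {"dcterms:issued": "dc:date"}


def to_simple_dc(record):
    """Reduce a qualified record to simple Dublin Core.

    Returns the simplified record and the list of elements that had no
    simple DC equivalent and were therefore dropped.
    """
    simple, dropped = {}, []
    for element, values in record.items():
        if element.startswith("dc:"):
            target = element              # already simple DC
        elif element in REFINES:
            target = REFINES[element]     # fold the refinement into its parent
        else:
            dropped.append(element)       # no simple DC equivalent
            continue
        simple.setdefault(target, []).extend(values)
    return simple, dropped


# Hypothetical record; the rioxxterms element names are assumptions.
qualified = {
    "dc:title": ["Annual report"],
    "dcterms:issued": ["2013-03-01"],
    "rioxxterms:projectid": ["ABC/123"],
    "rioxxterms:funder": ["Example Funder"],
}
simple, dropped = to_simple_dc(qualified)
print(simple)
print(dropped)
```

The reverse direction cannot be automated in the same way: nothing in a plain date value says whether it was an issue date, which is why the finer structure is the safer one to capture at data entry.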
Syntax Standards for Metadata
These standards aim to make metadata machine-readable. Structural standards generally prescribe one or more syntax standards for fields/elements. Where a structural standard does not prescribe a syntax standard, library professionals should follow a syntax that enables sharing of OA resources. Generally HTML, XML (Extensible Markup Language) and SGML (Standard Generalized Markup Language) are used as syntax standards for OA resources. The UK Metadata Guidelines for Open Access Repositories (2013) recommend a syntax standard for each metadata element listed in the previous section. One example is given here for your understanding:
scope: The creator of a resource may be a person, organisation or service. Where there is more than one creator, use a separate dc:creator element for each one. Enter as many creators as required.
standard (data value): The dc:creator element should take an optional attribute called “id”. This will hold a machine-readable unique identifier, where available, for the creator. Ideally the element will include a machine-readable id and a text string in the body of the element.
syntax: <dc:creator id="http://identifier-for-this-creator-entity">name-of-this-creator-entity</dc:creator>
Where the creator is a person, the recommended format is Last Name, First Name(s), with an ORCID iD, if known, included in its HTTP URI form, such as:
<dc:creator id="http://orcid.org/0000-0002-1395-3092">Mishra, Sanjay</dc:creator>
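An element of this shape can be produced with a standard XML library, which also takes care of quoting the attribute value. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the helper name `creator_element` is hypothetical, and the namespace URI is the usual one for the simple Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # simple Dublin Core namespace
ET.register_namespace("dc", DC_NS)          # serialize with the "dc" prefix


def creator_element(name, identifier=None):
    """Build a dc:creator element with an optional machine-readable id."""
    element = ET.Element(f"{{{DC_NS}}}creator")
    element.text = name
    if identifier:
        element.set("id", identifier)
    return element


creator = creator_element("Mishra, Sanjay", "http://orcid.org/0000-0002-1395-3092")
xml_text = ET.tostring(creator, encoding="unicode")
print(xml_text)
```

When there is more than one creator, the helper would be called once per creator, in line with the scope note above.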
Note: If the creator is a person and you wish to record that person’s affiliation, the affiliation should be recorded using the dc:contributor element. You may consult UK Metadata Guidelines for Open Access Repositories (2013): Phase 1: Core Metadata (Version 0.9) at rioxx.net. Other related initiatives in this direction are listed below:
- CrossMark: An initiative by CrossRef to support a non-bibliographic metadata schema.
- HowOpenIsIt?: An initiative of PLOS, SPARC and OASPA to set criteria for measuring the openness (extent of rights for different stakeholders) and quality of OA resources [46].
- Vocabularies for OA (V4OA): An initiative of JISC/UKOLN to develop vocabulary control devices, category lists and authority files for OA resources.
- RIOXX: Developing Repository Metadata Guidelines: An initiative to define a standard set of bibliographic metadata for UK Institutional Repositories.
- ONIX-PL: An initiative to standardize license expression information necessary for OA publishing.
- Linked Content Coalition: An initiative to develop rights management metadata for OA resources.
- Open Discovery Initiative: A NISO initiative to develop library discovery services for non-commercial and OA resources through indexed search.
- Incentives, Integration, and Mediation: Sustainable Practices for Populating Repositories: An initiative of the Confederation of Open Access Repositories (COAR) to develop guidelines for populating OA repositories, including guidance on metadata management.
- NISO Specification for Open Access Metadata and Indicators: A NISO initiative to develop a standard metadata set specifically meant for OA resources.
- RSLP: A UKOLN initiative for Collection Level Descriptions (CLDs) as a tool for providing an overview of the content and coverage of OA collections.
1.7.2 Software-level applications
Most repository management software (such as Greenstone, DSpace and EPrints) includes predefined standard metadata schemas. For example, Greenstone includes the simple DCMES, qualified DCMES, AGLS, NZGLS and DLS schemas (see Figure 10). A collection developer may use any one of them during data entry. DSpace comes with only DCMES but allows the submission interface to be customized to include domain-specific metadata schemas. EPrints is more sophisticated in metadata handling than other OA content management software. Initiatives also support software in managing metadata in a standard manner. For example, the UK Metadata Guidelines for Open Access Repositories, supported by UKOLN, JISC and RCUK, developed a plug-in for EPrints repositories (versions 3.3.x) and a patch for DSpace repositories (version 1.8.2; version 3.x onwards) for the management of content standards, data value standards, structural standards, and syntax standards of metadata. These patches are available as open source scripts and can easily be integrated with the target repository software. DSpace has a metadata registry with all data elements of DCMES in qualified format. It allows the repository manager to add, edit, refine and delete metadata elements (Figure 11).
DSpace uses a qualified version of the Dublin Core schema based on the Dublin Core Libraries Working Group Application Profile [55] (LAP). EPrints provides six metadata sets, covering OA knowledge objects, OA metadata, users of OA, searches, imports, and bitstreams (files) (see Figure 12). As a whole, the metadata management component of EPrints is a smart solution in view of the different requirements of OA content management, such as usage data and file format data.
For example, the bitstream (file) metadata in EPrints is more comprehensive than that of other open source OA repository management software.
1.7.3 Authority Control in Gold OA and Green OA
As a library professional you know the importance of authority files such as name authority, title authority and subject authority. These authority files are required for collocation of data values entered against DC.Creator, DC.Contributor, DC.Subject, etc. In the library world, VIAF (Virtual International Authority File) is available as a huge name authority file. It aggregates name authority data from 25 national libraries. OCLC has made VIAF available as Linked Open Data (LOD), which means that this dataset can be linked dynamically with the DC.Creator metadata field in different repository software. Apart from traditional authority files like the LC Name Authority File (NAF), the LC Subject Authority File (SAF) and VIAF, there are some emerging standards for populating name fields in a controlled manner, such as AuthorClaim [56], LATTES [57], NARCIS [58], ArXiv [59] Author ID, Names Project [60], Researcher ID [61] and ORCID [62]. These controlled data value standards are discussed at length in Unit 2 (section 2.3.5). In the case of subject authority, most OA repository management software applies standard controlled vocabulary devices: EPrints uses LC subject categories, and DSpace uses research subject categories by default (Figure 13) but allows the inclusion of any standard subject category list, such as the Dewey Decimal Classification (see Unit 3 section 3.5), if it is formatted in SKOS (Simple Knowledge Organization System, a W3C standard).
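The collocation role of a name authority file can be shown with a toy example. The sketch below assumes a tiny, hypothetical authority table that maps variant name forms to one identifier; the identifiers, titles and the function name `collocate` are invented for illustration and do not come from VIAF, ORCID or any other real file.

```python
from collections import defaultdict

# Hypothetical authority table: variant name forms mapped to one identifier.
AUTHORITY = {
    "mishra, sanjay": "auth:0001",
    "mishra, s.": "auth:0001",
    "sanjay mishra": "auth:0001",
}


def collocate(records, authority=AUTHORITY):
    """Group titles by the authority identifier of their creator.

    `records` is a list of (title, creator) pairs. Creators that are not
    found in the authority table are grouped under "unmatched".
    """
    groups = defaultdict(list)
    for title, creator in records:
        key = " ".join(creator.split()).casefold()
        groups[authority.get(key, "unmatched")].append(title)
    return dict(groups)


records = [
    ("Paper A", "Mishra, Sanjay"),
    ("Paper B", "Sanjay  Mishra"),
    ("Paper C", "Mishra, S."),
    ("Paper D", "Unknown, Author"),
]
print(collocate(records))
```

Three differently entered names resolve to one identifier, so a search on that identifier retrieves all three papers; a production system would obtain the identifier from a service such as those named above instead of a local table.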