3.2 Information Representation and Retrieval in OA Context
Manning et al. (2008) define information retrieval (IR) as finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Library professionals all over the world are increasingly taking part in the dissemination of open contents by setting up open access repositories, publishing online open access journals and creating single-window web-scale discovery services for open contents (Crow, 2002; Yang & Hofmann, 2011). All of these dissemination services are essentially based on information representation, processing and retrieval. The process of Information Representation and Retrieval (IRR) involves three primary stakeholders: the users, the intermediaries (submitters, editors and content managers in an open access retrieval system) and the information retrieval system. These three intertwined elements act jointly in the development and functioning of an IRR system. In an IRR setup of any type or size (including IRR for open contents), the last of these elements, the information retrieval system, consists of four major components: the database (information represented and organized through a systematic process, covering both metadata and full-text objects); the search mechanism (which determines how information stored in the database can be retrieved); the language (a crucial component in information representation and query formulation that can be either natural or controlled, and that determines specificity, flexibility and artificiality in IRR); and the interface (which allows users to interact with the IR system and thereby represents the human dimension in IRR). Open access knowledge systems (such as open access journals and open access repositories) are essentially IRR systems in which full-text knowledge objects are stored and made available for open and free access to end users.
3.2.1 Retrieval: From Conventional to Neo-conventional System
Information representation is an essential prerequisite for information retrieval. Open knowledge objects, irrespective of form and format, need to be represented in a standard manner before they can be retrieved. The quality of information representation has a direct impact on retrieval efficiency, and this aspect has drawn the attention of experts in the field for a long time. The processes of information representation and retrieval have changed fundamentally with the advent and application of ICT, and open access contents are no exception. Before turning to the retrieval of open contents, let us briefly discuss the evolution of information representation and retrieval (IRR). The term information retrieval was coined by Calvin Mooers in 1952, but research and development on IRR started right from the time of Panizzi. The conventional processes of information representation include two major activities: a) identification and extraction of elements (concepts important for retrieval) from the documents, e.g. keywords and phrases representing the concepts; and b) assignment of terms (appropriate for retrieval) to a document, e.g. descriptors or subject headings. Information representation, in other words, is a combination of these two processes and comprises an array of activities such as indexing, abstracting, categorization and summarization.
Indexing is a widely adopted method for information representation. It involves selection and use of terms (derived or assigned) to represent important facets of the original document (bibliographical or full text). Indexing may be grouped on the basis of how the terms are obtained.
- Derivative indexing: Here, terms are extracted from the original documents. It can also be treated as similar to keyword indexing and there is no control of the terms. As a result, there is no need of any vocabulary control mechanism either at the indexing or at the retrieval stage.
- Assignment indexing: Here, terms are assigned to represent the documents. Scheme(s) of controlled vocabulary is/are used for choosing appropriate terms which can also be used at the time of search.
- Automated and automatic indexing: Automatic indexing was pioneered by H. P. Luhn with the invention of the Key Word in Context (KWIC) indexing system and the methods he subsequently developed, which rely primarily on statistical techniques. At present, the mechanical activities of indexing (such as alphabetizing, formatting and chronological sorting) can be done by computers, while the intellectual activities are still largely carried out by human beings, although various methods for selecting terms from documents without human effort are being tried out. If computers are applied only to the mechanical operations of indexing and human indexers perform the intellectual activities, we call it automated indexing. If computer systems perform both the mechanical and the intellectual operations, we call it automatic indexing (also termed machine indexing).
- Hyper structure indexing: In the Web environment, index terms are recorded as hyperlinks that embody both the index term and the locator mechanism (Chu, 1997). In other words, indexing of Web documents uses hyperlink names as index terms and helps users to locate those terms in hyperdocuments.
IRR for organizing open access resources utilizes all four major indexing methods as mentioned above. It extracts terms from the body of knowledge objects through derivative indexing, manages metadata assigned by indexer, sorts and arranges browsing keys by automated process, and highlights and hyperlinks search elements (keyword or phrase) by hyper structure indexing.
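To make the idea of derivative (keyword) indexing concrete, the following is a minimal sketch of a KWIC-style index in Python. It is a simplified, rotated-title illustration and not Luhn's original procedure; the stop word list, the function name `kwic_index` and the sample titles are assumptions made for this example.

```python
# Hypothetical stop word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def kwic_index(titles, stop_words=STOP_WORDS):
    """Return (keyword, rotated title, title number) entries sorted by keyword."""
    entries = []
    for number, title in enumerate(titles, start=1):
        words = title.split()
        for position, word in enumerate(words):
            keyword = word.lower().strip(".,;:()")
            if not keyword or keyword in stop_words:
                continue  # stop words never become index entries
            # Rotate the title so the keyword leads; "/" marks the original end.
            context = " ".join(words[position:] + ["/"] + words[:position])
            entries.append((keyword, context, number))
    return sorted(entries)

titles = ["Open access repositories", "Indexing of open contents"]
for keyword, context, number in kwic_index(titles):
    print(f"{keyword:<14}{context}  [{number}]")
```

Every significant word of a title becomes an access point without any vocabulary control, which is the defining feature of derivative indexing.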
Categorization is another widely adopted method for information representation in the library world. In simple words, it may be described as the successive and hierarchical representation of information by categories. We generally use established classification schemes (e.g. DDC, LCC, CC) for traditional information resources. But in the Web environment, where documents are of mixed quality, huge in quantity and ephemeral, applying library classification schemes for information representation becomes expensive and inappropriate. Categorization of Web documents is instead done by taxonomies based on loosely structured categories. Most institutional repository software supports the organization of open contents through subject taxonomies.
Summarization is the process of developing a condensed copy of the original document. Different types of summarization are possible, depending on the degree of condensation: abstracts (a concise and accurate representation of the contents of a document), summaries (a restatement of the main points of the original document) and extracts (one or more selected parts of a document used to represent it). The application of computers to summarization has been fairly successful for extracts; for example, Internet retrieval systems such as Google, Altavista and NorthernLight employed an auto-extraction process for information representation. Computerized summarization has been only moderately successful for auto-summaries and is not at all satisfactory for auto-abstracting.
Citations are bibliographical information about documents and can therefore be considered a source for information representation: the citing authors, through the references in their own publications, in effect represent the documents they cite. Eugene Garfield introduced this method of information representation through the publication of citation indexes. Citation indexing can be carried out entirely by computers without human intervention. Most open content search services, such as Google Scholar (which includes both open access and restricted-access documents), OAIster and OAN-Search, offer "Cited By" as a value-added retrieval feature based on citation indexing methods. Moreover, in some retrieval systems citation frequency is a major parameter for evaluating the quality of documents.
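Because citation indexing needs no human judgment, a "Cited By" feature reduces to inverting reference lists. The sketch below illustrates this under simple assumptions: the document identifiers, the `references` mapping and the function name `build_cited_by` are hypothetical, and real services must first solve the much harder problem of matching reference strings to documents.

```python
from collections import defaultdict

def build_cited_by(references):
    """Invert {citing_doc: [cited_docs]} into {cited_doc: sorted citing docs}."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return {doc: sorted(citers) for doc, citers in cited_by.items()}

# Hypothetical reference lists extracted from three documents.
references = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": [],
}
cited_by = build_cited_by(references)

# Citation frequency, usable as one (crude) ranking or quality signal.
citation_counts = {doc: len(citers) for doc, citers in cited_by.items()}
print(cited_by)
print(citation_counts)
```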
String indexing is a special kind of indexing based on the concept of representing a document by a suitable phrase or statement, or in some cases by a group of phrases. There are different types of string indexing (e.g. chain indexing, PRECIS, POPSI, NEPHIS), and each of these systems includes two basic steps: a) a human indexer creates an input string that summarizes the content or theme of a document; and b) a computer generates index entries from the input string on the basis of the rules of the string indexing system. String indexing, as an integration of manually selected input strings and computer-generated index entries, is particularly useful for generating printed indexes but is not an attractive option for representing digital open contents. Chu (2009) framed a comparison chart for conventional methods of information representation on the basis of four parameters, namely type of representation, entity of representation, framework of representation and production mode. The chart (Figure 21) is quite helpful in selecting suitable methods for different purposes.
Retrieval systems for open contents also follow the four conventional processes mentioned above, along with neo-conventional processes such as citation indexing and string indexing.
Full text information representation
The advent of ICT in general, and of storage technology in particular, has made full text representation of digitally stored objects considerably easier over the last decade. Fugmann (1993) advocated that full text representation should avoid two extremes, i.e. "every word a descriptor" and "no indexing is necessary". Most retrieval systems for open contents are full text information retrieval systems. These systems generally have two levels of information representation: the first level contains the metadata representation (see unit 2 for resource description) and the second level contains the full text representation. At present, almost all open source institutional repository software (such as DSpace, Greenstone and EPrints) supports full text representation, including generation of a thumbnail image for the format (e.g. PDF, HTML, ASCII text, MS Word) in which the full text information object is available within the system. These packages also support automatic association of the format with the appropriate software for displaying and reading the full text document. Representation of full text is a form of derivative indexing, in which the retrieval software extracts keywords automatically after excluding junk words (on the basis of a predefined list of stop words). As a consequence, full text information representation and retrieval systems are limited by low precision, high recall and cross-disciplinary semantic drift. These problems of full text retrieval are under active investigation by researchers working in the domains of AI (artificial intelligence), NLP (natural language processing) and the semantic Web.
Multimedia information representation
Open access resources do not contain only textual materials (although the percentage of textual resources is still very high in all types of digital content management systems). The domain of OA is increasingly populated by slides, MP3 files, video clips, animated pictures, photographs etc.
On the other hand, a single open digital object may contain text, images, video and audio in a linked environment. You already know from the previous unit how OAI-ORE is used in the sharing and exchange of compound digital objects. These compound digital objects are also called multimedia digital objects, and their retrieval processes differ from those of textual retrieval systems. Multimedia information retrieval systems for open contents are maturing day by day. For example, the Open-i project of the NLM (National Library of Medicine, US) aims to provide an image search service for open access biomedical resources (Figure 22). It covers biomedical articles from full text collections such as PubMed Central and retrieves both the text and the images in the articles. The service is based on extensive image analysis and indexing combined with deep text analysis and indexing.
3.2.2 Retrieval Approaches
Retrieval approaches may be categorized as structured retrieval and unstructured retrieval. As a whole, the retrieval methods as classified by Luhn (1958) are:
- Browsing: Retrieval of information by look-up in an ordered array of stored records;
- Searching: Retrieval of information by finding/locating in a non-ordered array of stored records; and
- A combination of searching and browsing.
As you know, searching is the prime retrieval approach for most IR systems. Fenichel & Hogan (1981) identified four basic search strategies, all of which are quite relevant for retrieving open contents. These are:
- Building block approach: It starts with a single-concept search. For a complex query, this strategy advises the searcher to decompose the search statement into the required number of single concepts and then to combine the retrieved result sets through appropriate search operators. This strategy is very helpful for novice users.
- Snowballing approach: This strategy advises the searcher to conduct a search first and then modify the search query on the basis of the retrieved results.
- Successive fraction approach: This strategy advises the searcher to start with a broad concept and then narrow down the search by applying different limiting techniques.
- Most specific facet approach: For a query string with multiple concepts, this approach directs the searcher to identify the most specific term or concept first and to conduct the search against it.
Convenient approach: In addition to these four strategies, users of full-text IR systems often simply enter terms separated by spaces (the space is automatically interpreted as the default Boolean operator, e.g. AND or OR) or pick filtering parameters (file type, language, year range etc.) from drop-down lists. This method is termed the Quick or Convenient search approach.
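The building block approach described above maps naturally onto set operations: each concept is searched on its own and the result sets are then combined. The following sketch assumes a toy index that maps terms to sets of document numbers; the index contents and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy index: term -> set of matching document numbers.
INDEX = {
    "open access": {1, 2, 4},
    "oa": {2, 5},
    "repository": {1, 3, 4},
    "archive": {4, 5},
}

def search_concept(synonyms, index=INDEX):
    """One building block: OR together the postings of a concept's synonyms."""
    results = set()
    for term in synonyms:
        results |= index.get(term, set())
    return results

def building_block_search(concepts, index=INDEX):
    """AND together the result sets of the individual concept searches."""
    blocks = [search_concept(synonyms, index) for synonyms in concepts]
    return set.intersection(*blocks) if blocks else set()

# Concept 1: open access (or OA); concept 2: repository (or archive).
hits = building_block_search([["open access", "oa"], ["repository", "archive"]])
print(sorted(hits))
```

Keeping each concept as a separate block lets a novice searcher inspect and adjust one block at a time before combining them.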
3.2.3 Retrieval Techniques
Retrieval techniques are search operators or devices that help users in resource discovery through searching. A typical online IR system for open contents supports different retrieval techniques. These techniques may broadly be divided into two groups: a basic set and an advanced set.
The basic retrieval techniques are supported by most information retrieval systems. These are:
- Boolean operators: The Boolean search operators combine concepts in a query. The AND operator narrows a search by requiring all of the concepts to be present, the OR operator broadens it by accepting any of the concepts, and the NOT operator excludes a concept.
- Truncation: The truncation technique supports retrieval of different forms of a term that share one part in common. For example, DSpace uses * as the wildcard operator, so the query "arch*" matches archive, archival, archiving etc.
- Proximity operators: These operators help to specify precisely the distance between two search terms. DSpace uses the tilde symbol, "~", at the end of a phrase as the proximity operator. For example, the query "library science"~3 in DSpace will retrieve records in which the words 'library' and 'science' occur within three words of each other.
- Case sensitive search: It helps searchers to specify the case of a search term, i.e. upper case or lower case.
- Range search: It helps in selecting or filtering records within a certain range of values. The search query author:[rao TO rath] in DSpace will retrieve documents whose author names fall between 'rao' and 'rath'.
- Field search: It helps searchers to limit the search in one or more fields.
- String search: It is a kind of free-text searching that allows searchers to look for terms of their own choosing, even when those terms have not been indexed.
The advanced retrieval techniques are provided selectively in some modern retrieval systems. Most of them are still at the research stage, and their efficiency is improving day by day. These are:
- Fuzzy searching: It is a search technique that can tolerate errors committed during data entry or query input. This technique can detect and correct spelling errors as well as errors introduced by OCR and text compression. DSpace uses the tilde symbol, "~", for fuzzy searching. For example, a search for Fredarick~ (a misspelled author name) will retrieve the author named "frederick" (the exact name of the author).
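Three of the basic techniques listed above (truncation, proximity and range search) can be sketched in a few lines of Python. This is an illustration of the ideas only and does not reproduce the behaviour of DSpace or its underlying search engine; in particular, the proximity rule here simply compares word positions, and all function names are assumptions made for this example.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def truncation_match(pattern, text):
    """Right truncation: 'arch*' matches any token starting with 'arch'."""
    stem = pattern.rstrip("*").lower()
    return any(token.startswith(stem) for token in tokenize(text))

def proximity_match(first, second, max_gap, text):
    """True if the two terms occur at most max_gap word positions apart."""
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)

def range_match(value, low, high):
    """Inclusive alphabetical range on lower-cased strings, e.g. author names."""
    return low <= value.lower() <= high

print(truncation_match("arch*", "Digital archiving policy"))
print(proximity_match("library", "science", 3, "library and information science"))
print(range_match("Rastogi", "rao", "rath"))
```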
- Weighted searching: The weightage technique helps to assign different weights to search terms during query formulation to indicate proportionality of their significance or the emphasis the user placed upon them. Both symbols (e.g. * in ERIC system) and numerals (e.g. 1 to 10 in GSDL) may be used to indicate relative weighting.
- Query expansion: It allows searchers to improve search results by modifying query string on the basis of retrieved result set.
- Multiple database searching: This method helps in searching two or more information retrieval systems simultaneously. It spares the searcher from dealing with the different query syntaxes (e.g. different symbols for the same operators), different encoding standards (e.g. ASCII, Unicode) and different data formats (e.g. MARC, CCF, Dublin Core) of the individual retrieval systems. Distributed searching on the basis of the Z39.50 protocol is a classic example of multiple database searching across systems that use different retrieval software and different data formats (MARC, CCF, UNIMARC etc.). For example, a search on author: Ranganathan, S. R. can be forwarded to three or more Z39.50 servers at the same time (see Figure 23).
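Fuzzy searching, as described in the list above, is commonly built on an edit distance between the query and the indexed terms. The sketch below uses the plain Levenshtein distance as an assumed stand-in; it does not claim to be the algorithm that DSpace uses internally, and the author list and the `max_edits` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with one rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def fuzzy_search(query, indexed_terms, max_edits=2):
    """Return indexed terms within max_edits of the query, closest first."""
    query = query.lower()
    scored = [(edit_distance(query, term.lower()), term) for term in indexed_terms]
    return [term for distance, term in sorted(scored) if distance <= max_edits]

# Hypothetical author index.
authors = ["frederick", "fredericks", "friedrich", "ranganathan"]
matches = fuzzy_search("Fredarick", authors)
print(matches)
```

A larger `max_edits` value tolerates more errors but also admits more false matches, so the threshold trades recall against precision.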
3.2.4 Retrieval Models
IR models are theory-based approaches that cover different aspects of information retrieval systems. Different IR models have been developed over the years, but matching mechanisms form the basis of all of them. Matching may be based either on terms or on similarity measurements (e.g. distance, term frequency).
Term matching is a direct matching of terms derived from or assigned to documents, document representation and queries. There are four types of term matching as mentioned below:
- Exact match: It means query representation exactly matches with document representation in IR system e.g. case sensitive search and phrase search;
- Partial match: In this case only a part of the term is matched with the document representation in the information retrieval system e.g. truncation;
- Positional match: It takes into consideration the positional information of what is being matched in retrieval process e.g. proximity search; and
- Range match: It takes into consideration what is being matched in a given range e.g. searching of bibliographic records by publication dates.
Similarity matching is an indirect matching process in which the final match is made on the basis of a similarity measurement. For example, in the Vector Space model, matching is based on the distance between vectors or on the angle between them. In the probabilistic model, similarity is measured on the basis of term frequency, which is used to estimate the probability that a document is relevant to a query. Baeza-Yates and Ribeiro-Neto (1999) grouped IR models into two categories: system oriented models and user oriented models. The classification may be represented as in Figure 24.
Most of the software packages that manage open access repositories use open source text retrieval engines such as Lucene, Solr and MGPP. The Vector Space model (or a modified version of it) is probably the most common model in these retrieval engines. Text retrieval engines based on the Vector Space model work in the following manner: i) extract tokens from the content or primary bit-stream; ii) transform the extracted tokens on the basis of the indexing parameters set by the indexer; iii) stem the tokens; iv) expand them with synonyms (to support query formulation); v) remove tokens that are stop words or junk; vi) add metadata elements to the index; vii) store tokens and related metadata as structured data for search optimization; and viii) create and maintain the inverted index. The process of information representation, query formulation and matching is shown in Figure 25.
A comparison of the three basic information retrieval models is presented in Figure 26.
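The vector-angle matching mentioned above is usually computed as the cosine of the angle between a query vector and a document vector. The following is a minimal sketch that uses raw term counts as weights; production engines normally use weighted schemes such as TF-IDF, and the sample documents and function names here are assumptions made for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a bag of words with raw term frequencies."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors (0.0 to 1.0)."""
    dot = sum(weight * vec_b.get(term, 0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return document ids ordered by decreasing similarity to the query."""
    query_vec = term_vector(query)
    scores = {doc_id: cosine_similarity(query_vec, term_vector(text))
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical mini-collection.
documents = {
    "d1": "open access repository software",
    "d2": "library classification schemes",
    "d3": "open access journals and open repositories",
}
print(rank("open access", documents))
```

Unlike term matching, this produces a graded score, so documents can be ranked rather than merely accepted or rejected.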
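The indexing steps outlined above can be illustrated with a very small inverted index. This sketch covers only token extraction, stop word removal, stemming and index construction; synonym expansion and metadata handling are omitted, stop words are removed before stemming for simplicity, and the crude suffix-stripping function is an assumed stand-in for a real stemmer, not what Lucene or Solr actually apply.

```python
import re
from collections import defaultdict

# Hypothetical stop word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Extract tokens, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_inverted_index(documents):
    """Map each term to {doc_id: [positions]} (positions after analysis)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, query):
    """Documents containing every analysed query term (implicit AND)."""
    postings = [set(index.get(term, {})) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

documents = {
    1: "Indexing of open access journals",
    2: "Open journals and the libraries",
    3: "Journal indexes",
}
index = build_inverted_index(documents)
print(sorted(search(index, "open journals")))
print(sorted(search(index, "indexes")))
```

Because the query passes through the same `analyze` function as the documents, "indexes" and "indexing" meet at the same index entry.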
3.2.5 Evaluating Retrieval Systems
Researchers (Keen, 1971; Large, Tedd & Hartley, 1999) formulated a common set of evaluation parameters irrespective of any type or size of IR systems. This set is equally applicable for different kinds of IRR (including IR systems related to open access resources) and includes parameters like accuracy (exact representation of original documents through surrogates), brevity (briefness of representation), consistency (uniform representation), objectivity (authentic description of original document), and other parameters (clarity, readability and usability). These parameters may be termed as Generic measures. Other evaluation measures can be grouped into two categories – measures related to retrieval performance and measures related to retrieval process. The evaluation measures that concentrate on retrieval performance are as follows:
- Recall and Precision: This measure (proposed first by Kent in 1955) is a combination of two factors. The Recall factor measures retrievability of an IR system and Precision factor measures the ability of an IR system in separating the non-relevant from the relevant items. Salton (1992) observed that these two factors, although not quite perfect, have formed the basis for many evaluation projects. There are many extensions of these two factors such as E-measure (Swets, 1969), Average recall and precision (Harman, 1995), Normalized recall and precision (Foskett, 1996; Korfhage, 1997), and Relative recall and precision (Harter & Hert, 1997).
- Fallout ratio: This measure, proposed by Swets (1963), is the ratio of non-relevant documents retrieved to all non-relevant documents in a system database. A smaller fallout value indicates a better IR system.
- Generality measure: It is defined as the proportion of documents in a system database that is relevant to a particular topic. Lancaster & Warner (1993) reported that the higher generality number is associated with the easier searching.
- Single measure: Recall and precision (including their extensions and modifications) factors are criticized for their incompleteness as evaluation measures. In view of this limitation Cooper (1973) suggested a utility measure on the basis of user’s subjective judgment about usefulness of an IR system.
- Other measures: Griffith (1986) proposed that only three numbers, namely relevant retrieved, non-relevant retrieved and the total number of documents in an IR system, should be considered in evaluation.
But retrieval performance is not the only factor needed to evaluate an IR system completely. Evaluation studies of IR systems are designed in different ways by different researchers, who consider different evaluation parameters. A summary table may be designed to list common evaluation parameters for open access IR systems (Table 10).
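The performance measures in the list above follow directly from counts of retrieved and relevant documents. The sketch below computes recall, precision, fallout and generality for a single query; the function name `evaluate` and the sample figures are assumptions chosen only to show the arithmetic.

```python
def evaluate(retrieved, relevant, collection_size):
    """Compute basic retrieval performance measures for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    non_relevant_retrieved = len(retrieved - relevant)
    non_relevant_total = collection_size - len(relevant)
    return {
        # Share of all relevant documents that were retrieved.
        "recall": relevant_retrieved / len(relevant) if relevant else 0.0,
        # Share of retrieved documents that are relevant.
        "precision": relevant_retrieved / len(retrieved) if retrieved else 0.0,
        # Share of all non-relevant documents that were retrieved.
        "fallout": (non_relevant_retrieved / non_relevant_total
                    if non_relevant_total else 0.0),
        # Share of the whole collection that is relevant to the topic.
        "generality": len(relevant) / collection_size if collection_size else 0.0,
    }

# Hypothetical query: 100 documents in the collection, 10 of them relevant,
# 8 retrieved, of which 6 are relevant.
scores = evaluate(retrieved=[1, 2, 3, 4, 5, 6, 11, 12],
                  relevant=range(1, 11),
                  collection_size=100)
print(scores)
```

In this example recall is 6/10, precision is 6/8, fallout is 2/90 and generality is 10/100.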
Many projects have been carried out to evaluate different types of IR systems, but to date there is no evaluation study dedicated specifically to OA retrieval systems. However, TREC (Text Retrieval Conference) and FIRE (Forum for Information Retrieval Evaluation) have initiated some evaluation studies related to OA retrieval systems, such as TREC TRACK-8 for Web-enabled and integrated IR, TREC TRACK 10 for video retrieval and the Federated search TREC. The major evaluation projects may be categorized under two groups: accomplished projects (Table 11) and ongoing projects. In the first group, the Cranfield tests may be considered the most influential, and in the second group, TREC is the most comprehensive evaluation project in the history of IRR.
Apart from these two major IRR evaluation projects, the SMART project conducted in 1964 (Salton, 1981), the MEDLARS project in 1967 (Lancaster, 1968) and the STAIRS study in 1985 (Blair & Maron, 1985) also evaluated IR systems.
TREC (Text Retrieval Conference) is an ongoing evaluation project jointly sponsored by NIST and DARPA. The TREC structure includes two major categories: CORE (the main activities of TREC) and TRACKS (the subsidiary activities of TREC). The CORE experiments are further divided into two groups: ad hoc (related to retrospective retrieval) and routing (related to SDI-type services). An ad hoc search is an unknown-item search in which the user is not aware of the existence of the documents and wants to retrieve them. In an ad hoc search, the IR system searches a static set of documents using new questions and produces a ranked list of items from the database. In a routing search, on the other hand, the user's interest remains stable but the document set changes. Such a search is useful for researchers who want to keep track of the latest developments in their field of interest. A routing IR system decides whether or not a particular document is relevant to the user's query and produces an unordered set of documents. The areas of the major retrieval experiments (TRACKS) of TREC are given in Table 12.
Apart from TREC, there are some other ongoing IR evaluation projects, such as:
- CLEF (Cross-Language Evaluation Forum)
- NTCIR (NII Test Collection for IR Systems) Project
- Chinese Web test collection
- FIRE (Forum for Information Retrieval Evaluation)