My Content
Information Retrieval System - Features, Components, Models and Evaluation
Models of Information Retrieval System
Models Based on input/output
  Data Retrieval Model
  Information Retrieval Model
  Knowledge Retrieval Model
Models Based on Theories and Tools
  Boolean Retrieval Model (George Boole, 1847)
  Fuzzy Logic Model (Lotfi Zadeh, 1965)
  Set Theoretic Model
  Vector Space Model
  Probabilistic Retrieval Model
  Linguistic Model
  Economic Model
  Hypertext Linkage Model
Components of Information Retrieval System
Features of an Information Retrieval System
Search Strategy : Basic Search Techniques
  Known item
  Keyword and Phrase Search
  Boolean Search
  Truncation
  Proximity Search
  Field Specific Search
  Limiting search
  Range Search
  Search Tools
  Classification Schemes
  Catalogue Codes
  Standard Bibliographic Record Formats
  Vocabulary Control Devices
Evaluation of IRS
Evaluation Methodology
Criteria for Evaluation
Recall and Precision
Precision Ratio
Noise ratio
Fall out ratio
Novelty ratio
Indexing Exhaustivity
Cost
Response Time

Information Retrieval System - Features, Components, Models and Evaluation

In 1951, Calvin Mooers coined the term 'information retrieval' and described it as "searching and retrieval of information from storage according to specific subject". The word retrieval means to discover and bring to the notice of users the documents in which information is embedded. B.C. Vickery has described it as follows: "retrieval is essentially concerned with the structure of the operation of the device to select documentary information from the store of information in response to several questions."

Objectives of Information Retrieval System : The general objective of an information retrieval system is to minimize the overhead of a user locating needed information. Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g. query generation, query execution, scanning the results of the query to select items to read, reading non-relevant items). The success of an information system is very subjective, based upon what information is needed and the willingness of a user to accept overhead.

Models of Information Retrieval System

i. Models Based on input/output (3 types)

ii. Models Based on Theories and Tools (10 types)

i. Models Based on input/output

On the basis of the input and the output, information retrieval models can be grouped into three basic categories :

a. Data Retrieval Model

b. Information Retrieval Model

c. Knowledge Retrieval Model

a. Data Retrieval Model

The data retrieval model essentially handles data. For the purpose of our understanding, data can be taken as unprocessed information or a preliminary phase of information. A datum is an unbiased fact which can be used to form information. For example, we can say that the population of the city of Jaipur is eighty (80) lakhs. This is a piece of data. Thus, a census system is a data retrieval system. Similarly, the National Sample Survey Organisation and the Central Statistical Organisation can be taken to be numerical data systems. A data retrieval model calls for an organisational structure based on various criteria such as properties, clusters and other entities. There is a need for a taxonomic presentation of these aspects. Such a taxonomic presentation must also be accessible from other types of associations. A searcher of data comes with a specific retrieval need. Therefore, the expression of the information need should be very precise. The data retrieval model is thus a simple model of information retrieval needing specific matching techniques, viz. a taxonomic structure of the various entities involved and their properties.

b. Information Retrieval Model

Information is data oriented to a purpose. It actually combines several data into a relational structure. Information retrieval is, therefore, a more complex model. It generally has to comprehend multi-dimensional relationships. It is not easily amenable to a taxonomic structure. The representation of information is to be based on a relational database structure using some associative mathematics. The expression of the information need is also complex and time consuming. It calls for a long conversational exchange, so an information retrieval model must incorporate such facilities and interfaces.

c. Knowledge Retrieval Model

Knowledge is a kind of integration of general types of information. It normally occurs in the human mind. The human mind infers and integrates several coordinates with the information received by it. So, knowledge is assimilated information. In order to facilitate decision-making and problem solving, intelligent knowledge-based information retrieval models are coming up. Such systems comprise three basic aspects:

i. The so-called knowledge base, or a store of an accumulated set of rules for converting information into knowledge. It also incorporates a knowledge acquisition system.

ii. The second aspect of the system is the inference engine. An inference engine is capable of deriving appropriate information from the combination of rules for deriving a synthesized knowledge. This process of deriving is based on inferential logic using quantitative and non-quantitative techniques.

iii. A user interface, i.e. a conversational process in the model which is capable of receiving information in the conversation mode and converting it into database signals for interaction purposes. Thus, a knowledge retrieval model is a sophisticated model of information processing, organization and retrieval.

ii. Models Based on Theories and Tools

Based on theories and methods / tools available in other disciplines, a number of models have been developed in order to find satisfactory solutions for information retrieval problems.

a. Boolean Retrieval Model (George Boole, 1847)

b. Fuzzy Logic Model (Lotfi Zadeh, 1965)

c. Set Theoretic Model

d. Vector Space Model (Dr. Gerald Salton)

e. Probabilistic Retrieval Model (Van Rijsbergen)

f. Linguistic Model

g. Mathematical Model

h. Psychological Model

i. Economic Model

j. Hypertext Linkage Model

a. Boolean Retrieval Model (George Boole, 1847)

i. Standard Boolean Retrieval Model

Boolean logic allows a user to logically relate multiple concepts together to define what information is needed.

Typically, the Boolean functions apply to processing tokens identified anywhere within an item. The typical Boolean operators are AND, OR, and NOT. These operations are implemented using set intersection, set union and set difference procedures. A few systems introduced the concept of "Exclusive OR", but it is equivalent to a slightly more complex query using the other operators and is not generally useful to users, since most users do not understand it.
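The mapping from the Boolean operators to set intersection, union and difference can be sketched in a few lines of Python. This is only a minimal illustration: the item collection, the whitespace tokeniser and the function names are assumptions made for the example, not part of any particular retrieval system.

```python
# Toy collection: item id -> text (hypothetical data for illustration only).
ITEMS = {
    1: "internet and computer networks",
    2: "computer architecture",
    3: "internet law",
    4: "library classification",
}


def build_index(items):
    """Map each processing token to the set of item ids that contain it."""
    index = {}
    for item_id, text in items.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(item_id)
    return index


INDEX = build_index(ITEMS)


def postings(term):
    """Return the set of items containing the term (empty set if unknown)."""
    return INDEX.get(term.lower(), set())


def boolean_and(a, b):
    return postings(a) & postings(b)   # set intersection


def boolean_or(a, b):
    return postings(a) | postings(b)   # set union


def boolean_not(a, b):
    return postings(a) - postings(b)   # set difference


print(sorted(boolean_and("internet", "computer")))  # [1]
print(sorted(boolean_or("internet", "computer")))   # [1, 2, 3]
print(sorted(boolean_not("internet", "computer")))  # [3]
```

Because every operator returns a plain set, the result is unranked: an item is either in the answer or not, which is the limitation the weighted variant tries to address.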

The normal Boolean operations produce the following results :

"A AND B" retrieves those items that contain both terms A and B.

"A OR B" retrieves those items that contain the term A or the term B or both.

"A NOT B" retrieves those items that contain term A but do not contain term B.

ii. Weighted Boolean Retrieval Model

The two major approaches to generating queries are Boolean and natural language. Natural language queries are easily represented within a statistical model and are usable by the similarity measures. Issues arise when Boolean queries are associated with weighted index systems. Some of the issues concern how the logic operators (AND, OR, NOT) function with weighted values and how weights are associated with the query terms. If the operators are interpreted in their normal sense, they act too restrictively or too generally. Salton (1979) showed that using the strict definition of the operators would sub-optimise the retrieval expected by the user. Closely related to the strict definition problem is the lack of ranking in the pure Boolean process. Salton provided additional insight into the issues of merging Boolean queries and weighted query terms under the assumption that there are no weights available in the indexes. The objective is to perform the normal Boolean operations and then refine the results using weighting techniques.

Weighting of index terms is not common in manual indexing systems. Weighting is the process of assigning an importance to an index term's use in an item. The weight should represent the degree to which the concept associated with the index term is represented in the item. The weight should help in discriminating the extent to which the concept is described in items of the database. The manual process of assigning weights adds overhead for the indexer and requires a more complex data structure to store the weights. In a weighted indexing system, an attempt is made to place a value on the index term's representation of its associated concept in the document. An index term's weight is based upon a function associated with the frequency of occurrence of the term in the item. Typically, values for the index terms are normalised between zero and one. The higher the weight, the more the term represents a concept discussed in the item. The weight can be adjusted to account for other information, such as the number of items in the database that contain the same concepts.

The query process uses these weights, along with any weights assigned to terms in the query, to determine a scalar value (rank value) used in predicting the likelihood that an item satisfies the query. The results are presented to the user in order of rank value, from the highest number to the lowest.
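The following sketch shows one way such a rank value could be computed. The weighting scheme is an assumption chosen for simplicity: each term's frequency is divided by the highest term frequency in the item, which keeps weights between zero and one, and the rank value is the sum of query weight times item weight. Real systems use a variety of other functions; the sample items and function names are hypothetical.

```python
from collections import Counter


def term_weights(text):
    """Frequency-based weights normalised to the range 0..1.

    Assumed scheme: term frequency divided by the highest term
    frequency found in the same item.
    """
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}


def rank_value(item_weights, query_weights):
    """Scalar rank value: sum of query weight times item weight."""
    return sum(w * item_weights.get(term, 0.0)
               for term, w in query_weights.items())


def rank(items, query_weights):
    """Return (rank value, item id) pairs, highest rank value first."""
    scored = []
    for item_id, text in items.items():
        value = rank_value(term_weights(text), query_weights)
        if value > 0:
            scored.append((value, item_id))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical items and query weights for illustration.
ITEMS = {
    "d1": "retrieval retrieval index",
    "d2": "index index retrieval storage",
    "d3": "storage storage",
}
QUERY = {"retrieval": 1.0, "index": 0.5}

print(rank(ITEMS, QUERY))  # [(1.25, 'd1'), (1.0, 'd2')]
```

Item d3 contains no query term, so its rank value is zero and it is left out of the ranked output.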

If weights between 0.0 and 1.0 are assigned to the query terms, they may be interpreted as the significance that the user places on each term. The value 1.0 is assumed to be the strict interpretation of a Boolean query. The value 0.0 is interpreted to mean that the user places little value on the term. Under these assumptions, a term assigned a value of 0.0 should have no effect on the retrieved set. Thus, writing the weight directly after each term (A1 is term A with weight 1.0, B0 is term B with weight 0.0):

"A1 OR B0" should return the set of items that contain A as a term.

"A1 AND B0" will also return the set of items that contain term A.

"A1 NOT B0" also returns set A.

This suggests that as the weight for term B goes from 0.0 to 1.0, the resultant set changes from the set of all items that contain term A to the set normally generated by the Boolean operation. The process can be visualised by use of a Venn diagram.

b. Fuzzy Logic Model (Lotfi Zadeh, 1965)

An Information Retrieval System is a software component that has the features and functions required to manipulate "information" items, as opposed to a DBMS, which is optimized to handle "structured" data. Here information is regarded as fuzzy text. The term "fuzzy" is used to imply the results of the minimal standards or controls placed on the creators of the text items. The author presents concepts, ideas and abstractions along with supporting facts. As such, there is minimal consistency in the vocabulary and styles of items. The searcher has to be omniscient to specify all search term possibilities in the query.

Fuzzy logic supports values – true and false as well as other values in between.

The concept of fuzzy logic was introduced by Professor Lotfi A. Zadeh. The basic objective of fuzzy logic is to develop a model that is close to natural language processing. It is an appropriate tool for modelling the kind of uncertainty associated with vagueness and imprecision.

Fuzzy retrieval provides the capability to locate spellings of words that are similar to the entered search term. This function is primarily used to compensate for errors in the spelling of words. Fuzzy retrieval increases recall at the expense of decreasing precision. In the process of expanding a query term, fuzzy retrieval includes other terms that have similar spellings, giving more weight to words in the database that have similar word lengths and positions of the characters as the entered term. A fuzzy search on the term "computer" would automatically include close spelling variants found in the information database, such as "compiter", "conputer" and "computter". An additional enhancement may look up the proposed alternative spelling and, if it is a valid word with a different meaning, include it in the search with a low ranking or not include it at all (e.g. "commuter"). Systems allow the specification of the maximum number of new terms that the expansion includes in the query.
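A rough sketch of this kind of query expansion is given below, using the `get_close_matches` function from Python's standard `difflib` module as a stand-in for a retrieval system's own similarity measure. The vocabulary, the similarity cutoff of 0.8 and the name `fuzzy_expand` are assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical index vocabulary, including misspellings such as those
# introduced by typing errors or OCR.
VOCABULARY = [
    "computer", "compiter", "conputer", "commuter",
    "compute", "library", "catalogue",
]


def fuzzy_expand(term, vocabulary=VOCABULARY, max_terms=5, cutoff=0.8):
    """Expand a query term with similarly spelled vocabulary words.

    max_terms caps the number of terms in the expansion; cutoff is the
    minimum similarity ratio (0..1) a word needs to be included.
    The result is ordered from most to least similar.
    """
    return get_close_matches(term.lower(), vocabulary,
                             n=max_terms, cutoff=cutoff)


print(fuzzy_expand("computer"))
print(fuzzy_expand("computer", max_terms=2))
```

Note that "commuter" passes the spelling test although it is a valid word with a different meaning. A real system would need an extra dictionary check to rank such a term low or exclude it, which is why the gain in recall comes at a cost in precision.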

Fuzzy retrieval has its maximum utilisation in a system that accepts items that have been optical character recognized. In the OCR process, a hard copy item is scanned into a binary image. The OCR process is a pattern recognition process that segments the scanned image into meaningful sub-regions, often considering a segment to be the area defining a single character. The OCR process then determines the character and translates it into an internal computer encoding. Depending on the original quality of the hard copy, this process introduces errors in recognising characters. With decent quality input, systems achieve accuracy in the 90-99 per cent range. Since these are character errors throughout the text, fuzzy retrieval allows location of items of interest by compensating for the erroneous characters.

c. Set Theoretic Model

The set theoretic view of information retrieval is based on the recognition that information requests are normally formulated by choosing collections or sets of item identifiers, or keywords. The keyword sets, in turn, lead to the retrieval of record subsets chosen from among the stored collection of records. The fundamental data of retrieval theory are provided in this view by the relations which exist between the sets of item descriptions and the corresponding record sets.

d. Vector Space Model

Often weighted systems are discussed as vectorised information systems. This association comes from the SMART system at Cornell University, created by Dr Gerald Salton (1979). The system emphasises weights as a foundation for information detection and stores these weights in vector form. In systems based upon a vector model, the semantics of every item are represented as a vector. A vector is a one-dimensional set of values, where the order/position of each value in the set is fixed and represents a particular domain. Each vector represents a document and each position in a vector represents a different unique word (processing token) in the database. There are two approaches to the domain of values in the vector: binary and weighted. Under the binary approach, the domain contains the value of one or zero, with one representing the existence of the processing token in the item. In the weighted approach, the domain is typically the set of all positive real numbers. The value of each processing token represents the relative importance of that processing token in representing the semantics of the item. The value assigned to each position is the weight of that term in the document. A value of zero indicates that the word is not in the document. The system and its associated research results have been evolving for over 30 years. Queries can be translated into vector form. Search is accomplished by calculating the distance between the query vector and the document vectors. The use of weights also provides a basis for determining the rank of an item. The vector approach allows for a mathematical and a physical representation using a vector space model.

In addition to the general problem of a dynamically changing database and its effect on weighting factors, there are problems with the vector model in assigning a weight for a particular processing token to an item. A major problem arises in the vector space model when multiple topics are discussed in a particular item. There is no way to associate correlation factors between terms, since each dimension in a vector is independent of the other dimensions.

The vector space model procedure can be divided into three stages. The first stage is document indexing, where the content-bearing terms are extracted from the document text. Many of the words in a document, such as "the", "is", "are", "in", "to" and "of", do not describe the content. These are called non-significant words or stop words. In automatic indexing, these terms are removed from the document vector, so the document is represented only by the content-bearing terms. In general, 40-50% of the total number of words in a document are stop words. These can be removed with the help of a stop word list. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure.
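The three stages can be sketched as follows. Several choices here are assumptions for illustration only: the short stop word list, raw term frequency as the weight, and cosine similarity as the similarity measure (the text above speaks only of a distance between query and document vectors). The documents and function names are hypothetical.

```python
import math
from collections import Counter

# Assumed, deliberately tiny stop word list.
STOP_WORDS = {"the", "is", "are", "in", "to", "of", "a", "and"}


def index_terms(text):
    """Stage 1: keep only the content-bearing terms."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def to_vector(text):
    """Stage 2: weight each term, here by raw term frequency.

    The vector is stored sparsely: absent terms have weight zero.
    """
    return Counter(index_terms(text))


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank(docs, query):
    """Stage 3: order documents by decreasing similarity to the query."""
    q = to_vector(query)
    scores = {doc_id: cosine(to_vector(text), q)
              for doc_id, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical documents for illustration.
DOCS = {
    "d1": "the retrieval of information is the goal",
    "d2": "information retrieval and information storage",
    "d3": "classification of books in the library",
}

print(rank(DOCS, "information retrieval"))  # ['d2', 'd1', 'd3']
```

Each term is treated as its own independent dimension in this sketch, which is exactly the limitation noted above: no correlation between terms is captured.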

e. Probabilistic Retrieval Model

Probabilistic considerations may apply if one assumes that system characteristics, such as the assignment of terms to the records or the relevance properties of the records, are probabilistic in nature. Other mathematical techniques that have been used include decision theory, information theory, pattern classification, mathematical linguistics and feature selection methods.

In addition to the vector space model, the other dominant approach uses a probabilistic retrieval model. The model that has been most successful in this area is the Bayesian approach. This approach is natural to information systems and is based upon the theories of evidential reasoning (drawing conclusions from evidence). Bayesian approaches have long been applied to information systems. The Bayesian approach could be applied as part of index term weighting, but it is usually applied as part of the retrieval process, by calculating the relationship between an item and a specific query.

The probabilistic approach is based upon direct application of the theory of probability to information retrieval systems. This has the advantage of being able to use the developed formal theory of probability to direct the algorithmic development. The use of probability theory is a natural choice because it is the basis of evidential reasoning.

According to Van Rijsbergen (1979), probability theory is an intuitively pleasing model for describing and analysing information retrieval. In this approach one can estimate the probability of retrieval. In other words, the probability of relevance of what is retrieved is measured. Van Rijsbergen proposed that a measure of the probability of relevance of a given document to a particular query be based on a vector representing that document. He postulated that the pattern of index terms in relevant documents will differ from their pattern in non-relevant documents. The probabilistic approach helps to analyse term clustering, frequency weighting, relevance weighting and ranking.

Probability theory can also be used to rank and order documents according to their probability of relevance. Robertson (1978) shows that the order of documents can be based on term values and on an 'Optimal Retrieval Function'. However, if one attempts to rank-order the documents in a Boolean environment, some difficulties arise which are inherent to Boolean logic. Bookstein (1978) suggested that the retrieved documents be ordered according to the number of Boolean expressions present in the document that are true.

The probabilistic approach of a retrieval model is based on the assumption that the distribution of the indexing features tells something about the relevance of a document. This approach adopts a retrieval model which optimizes the retrieval effectiveness according to the Probability Ranking Principle (PRP).

William Cooper has formulated the Probability Ranking Principle as: 'if a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data'. In this method the system replies to a query by presenting the beginning of a list of documents that are ranked in descending order of scores that either represent probabilities themselves or could be mapped to probabilities by means of an order-preserving transformation. These scores, often called Retrieval Status Values (RSV), depend on document descriptions consisting of appropriate statistical information about the indexing features. Such scores may also depend on domain-dependent parameters that are estimated by means of additional data, e.g. a thesaurus.

Probabilities are usually based upon a binary condition: an item is relevant or not. But in information systems the relevance of an item is a continuous function from non-relevant to absolutely useful. The output ordering by rank of items based upon probabilities, even if accurately calculated, may not be as optimal as that defined by some domain-specific heuristic. The source of the problems that arise in the application of probability theory is a lack of accurate data and the simplifying assumptions that are applied to the mathematical model.
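As an illustration of ranking by Retrieval Status Value, the sketch below scores documents with the Robertson/Sparck Jones relevance weight, using the usual 0.5 smoothing. The text above does not prescribe this particular formula; it is used here as one well-known instance of a probabilistic term weight. The documents, the relevance judgement and the function names are hypothetical.

```python
import math


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: known relevant documents containing the term
    n: all documents containing the term
    R: known relevant documents
    N: all documents in the collection
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


def retrieval_status_values(docs, query_terms, relevant_ids):
    """RSV of a document: sum of the weights of the query terms it contains."""
    N, R = len(docs), len(relevant_ids)
    weights = {}
    for term in query_terms:
        n = sum(1 for terms in docs.values() if term in terms)
        r = sum(1 for doc_id in relevant_ids if term in docs[doc_id])
        weights[term] = rsj_weight(r, n, R, N)
    return {doc_id: sum(weights[t] for t in query_terms if t in terms)
            for doc_id, terms in docs.items()}


def rank(docs, query_terms, relevant_ids):
    """Order documents by decreasing RSV (ties broken by document id)."""
    rsv = retrieval_status_values(docs, query_terms, relevant_ids)
    return sorted(rsv, key=lambda d: (-rsv[d], d))


# Hypothetical collection: each document is a set of index terms.
DOCS = {
    "d1": {"fuzzy", "retrieval"},
    "d2": {"fuzzy", "logic"},
    "d3": {"retrieval", "index"},
    "d4": {"index", "catalogue"},
}
RELEVANT = {"d1"}  # assumed relevance feedback from the user

print(rank(DOCS, ["fuzzy", "retrieval"], RELEVANT))  # ['d1', 'd2', 'd3', 'd4']
```

The scores are log-odds rather than probabilities, but ranking by them gives the same order as ranking by the probabilities they map to, in the sense of the order-preserving transformation mentioned above.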

f. Linguistic Model

The linguistic model studies information retrieval from the point of view of the properties of language. Information retrieval is supported by features of natural language as well as artificial language. The various ways of storing information are essentially based on natural language. Human communication itself is full of natural language. In short, language carries out three types of functions :

i. It represents the contents of documents and other forms of information.

ii. The information problems of users are represented in terms of language, and

iii. Language is used in computer processing and in searching and retrieving of information.

The language works on three bases :

a. The semantic basis, which conveys meaning from one human being to another;

b. The syntactic basis, which helps the formation of semantics through the use of grammar; and

c. The vocabulary, which supplies different meanings to terms for the formation of sentences, paragraphs and other structures.

The logical structure of a language and the taxonomy of the language refer to the relationship between vocabulary and concepts. The vocabulary generally refers to the logical structure. In modern times, vocabulary control also includes thesaural control and technical glossary control. The use of transformational grammar as well as parsing techniques provides processing speed of the language for information retrieval. Besides this, an indexing language with coordinative control provides a basic model for information retrieval. The use of associative mathematics in search logic and in search expression formulation provides yet another type of language control in information retrieval. This linguistic model forms an essential base for information retrieval. In the social sciences, language plays an ambiguous role because the terminology of the field is not as rigorous as in the natural sciences.

g. Mathematical Model

A mathematical model generally pre-supposes a careful formal analysis of the problem, a specification of the assumptions and an explicit formulation of the way in which the model depends on the assumptions.

Mathematical models are essentially based on representative mathematics as well as associative connections. In particular, cluster analysis and clustering techniques are used on an experimental basis in automatic abstracting and indexing. The use of set theory and Boolean logic is a very familiar method of mathematical modelling of information. The concept of similarity measures, the choice of variables and the combinational aspects of clustering try to provide a semantic structure for the information represented. Cluster analysis today involves statistical packages or clustering software.

h. Psychological Model

The psycholinguistic approaches to information retrieval led to the study of the formation of concepts in the human mind. The way in which the human thinking process arranges ideas, their presentation at the time of enquiry, and the type of retrieval cues it demands while searching have led to cognitive research linked with computer communication processes. The studies of Belkin, Brooks and Oddy (1979) on the anomalous state of knowledge provide interesting insights in relation to the information retrieval process. Further, current studies in the fields of information retrieval and artificial intelligence have thrown sufficient light to bring about a harmonious coupling of psychological theory with information retrieval.

i. Economic Model

The economic model of information retrieval centres on the measures of cost effectiveness and cost efficiency of information retrieval. These two criteria are based on the performance of information retrieval systems in relation to input cost as well as the number of successful outputs. The concept of provision of multiple access points being used gives a chance for measurement of information transfer. The field of information retrieval, which has developed several models of information measurement based on the statistical and mathematical techniques used for studies in bibliometrics and scientometrics, provides scope for correlation of economic benefits. However, due to various intangible elements in information retrieval which cannot be identified, the economic model does not yet provide a holistic approach to information retrieval.

Theoretically, the modelling of information retrieval can be looked at from the kind of output, such as data, or from the methodological approaches of different disciplines. But in practice we may not find a single model operating in that manner. For that purpose, a collective modelling of various levels can be seen. For example, the ERIC (Educational Resources Information Center) model for IR is one of the important areas of IR, and is a combination of all approaches to information retrieval (Henry and Diodate, 1991). The idea behind these theoretical models is to help analytical studies bring efficiency to different aspects of information retrieval. These multi-disciplinary approaches to information retrieval provide a better base.

j. Hypertext Linkage Model

Hypertext linkages are creating an additional information retrieval dimension. Traditional items can be viewed as two-dimensional constructs. The text of the items is one dimension, representing the information in the items. Imbedded references are a logical second dimension that has had minimal use in information search techniques. The major use of citations has been in trying to determine the concepts within an item and in clustering items. Hypertext, with its linkages to additional electronic items, can be viewed as networking between items that extends their contents. The imbedding of the linkage allows the user to go immediately to the linked item for additional information. The issue is how to use this additional dimension to locate relevant information.

Looking at the internet at the current time, there are three classes of mechanisms to help find information: manually generated indexes or directories, automatically generated indexes, and web crawlers (intelligent agents). Yahoo (http:// wwal yahoo.com) is an example of the first case, where information sources (home pages) are indexed manually into a hyperlinked hierarchy. The user can navigate through the hierarchy by expanding the hyperlink on a particular topic to see the more detailed sub-topics. At some point the user starts to see end items. Lycos (http://wwwal.lycos.com) and AltaVista (http:/ altavista.com) automatically go out to other internet sites and return the text at the sites for automatic indexing. Lycos returns home pages from each site for automatic indexing, while AltaVista indexes all of the text at a site.

Web crawlers (WebCrawler, OpenText, Pathfinder) and intelligent agents (Coriolis Groups' NetSeeker) are tools that allow a user to define items of interest, and they automatically go to various sites on the internet searching for the desired information. The Uniform Resource Locator (URL) hypertext links can map to another item or to a specific location within an item.

Components of Information Retrieval System

Information retrieval locates relevant documents on the basis of user input such as keywords or example documents, for example : find documents containing the words "database systems". The figure shows an information retrieval system block diagram. It consists of three components: Query or Documents, IR System and Ranked Results.

i. Query / collections : Only a representation of the document or query is stored, which means that the text of a document is lost once it has been processed for the purpose of generating its representation.

ii. IR System : Performs the actual retrieval function, executing the search strategy in response to a query.

iii. Ranked Results : A set of documents which improves the subsequent run after information retrieval.

Features of an Information Retrieval System

Liston and Schoene suggest that an effective information retrieval system must have provisions for :

i. Prompt dissemination of information.

ii. Filtering of information.

iii. The right amount of information at the right time.

iv. Active switching of information.

v. Receiving information in an economical way.

vi. Browsing.

vii. Getting information in an economical way.

viii. Current literature.

ix. Access to other information systems.

x. Interpersonal communications, and

xi. Personalized help.

Search Strategy : Basic Search Techniques

Known item and unknown item search (known item keys, e.g. author, title, ISBN, publisher)

In a bibliographical information retrieval environment, searches can be divided into two main classes: known item search and unknown item search. A known item search is conducted when the user knows something about the item being sought. This may be any key, such as author, title, publisher, ISBN, and so on. An unknown item search is conducted when users are not aware of the existence of any document that may solve their problems. In other words, users do not know whether or not an item exists that can meet their information requirements. There are different types of searches, which are helpful in understanding the entire process of search strategies.

Keyword and Phrase Search

A search can be conducted by entering a single search term or a phrase comprising more than one term. The keyword search is the simplest form of search facility offered by a search system. In keyword search mode, the system searches the inverted file (the index) for each keyword/term forming the search expression. The search terms can be entered through the keyboard or can be selected from an index or vocabulary control tool, such as subject headings lists or thesauri. Search expressions containing more than one keyword may require the use of Boolean or proximity operators.

In a phrase search, the system searches for the entire phrase rather than each individual keyword forming the phrase. Phrase searches can be conducted only in those fields that are phrase indexed. If the index file comprises only single terms, then a phrase search cannot be conducted unless proximity operators are used, whereby the system searches for each constituent keyword in the search expression separately and retrieves only those records where the keywords occur consecutively. A search phrase can simply be entered through the keyboard, or selected from an index file or vocabulary control tools like subject headings lists and thesauri.
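The difference between a keyword lookup in the inverted file and a phrase search built from word positions can be sketched as below. The record set, the whitespace tokeniser and the function names are assumptions for illustration; the phrase search here emulates a proximity operator that requires the keywords to occur consecutively.

```python
def build_positional_index(records):
    """Inverted file: term -> {record id -> list of word positions}."""
    index = {}
    for rec_id, text in records.items():
        for pos, token in enumerate(text.lower().split()):
            index.setdefault(token, {}).setdefault(rec_id, []).append(pos)
    return index


def keyword_search(index, term):
    """Records in which the single keyword occurs anywhere."""
    return set(index.get(term.lower(), {}))


def phrase_search(index, phrase):
    """Records in which the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for rec_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word k of the phrase must sit at position start + k.
            if all(start + k in index.get(w, {}).get(rec_id, [])
                   for k, w in enumerate(words)):
                hits.add(rec_id)
                break
    return hits


# Hypothetical records for illustration.
RECORDS = {
    1: "information retrieval systems",
    2: "retrieval of information",
    3: "digital library systems",
}
INDEX = build_positional_index(RECORDS)

print(sorted(keyword_search(INDEX, "information")))           # [1, 2]
print(sorted(phrase_search(INDEX, "information retrieval")))  # [1]
```

Record 2 contains both keywords but not next to each other in that order, so it is found by the keyword search and rejected by the phrase search.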

Different search systems provide different facilities for conducting key word and phrase searches, For example, in a dialog search one can simply enter a key word or a phrase preceded by the search command. The user can restrict the search to one or more fields.

Many bibliographical information retrieval systems provide two types of search facilities for conducting an unknown item search: keyword search and subject search.

A keyword search allows users to enter one or more keywords pertaining to their query. These keywords can be chosen by the user in any combination depending upon the requirements, and several search operators can be used to combine keywords into a search expression. The search keywords can appear anywhere in the database records, or in one or more chosen fields. A subject search allows the user to submit a subject expression that reflects his or her information requirement. Such a search is conducted on the subject field that contains the subject headings assigned by the indexers when the database was created. Thus, a record will be retrieved only when the user's subject search expression exactly matches the subject heading assigned by the indexers. To standardize the process, and also to help the user identify the appropriate subject heading, an IRS uses certain tools, called vocabulary control tools.

Boolean Search

This is a search technique that combines search terms according to Boolean logic. Three types of Boolean search are possible: AND search, OR search and NOT search.

The AND search allows the user to combine two or more search terms using the Boolean AND operator. The search will then retrieve all those items that contain all the constituent terms. For example, the search expression "Internet AND computer" will retrieve all those records where both the terms occur. The search is restricted by adding more search terms. The more search terms are ANDed, the more restricted, or specific, the search will be, and as a result the smaller the search output will be. Sometimes a search may produce a blank result if too many search terms are ANDed. By contrast, an OR search retrieves items containing any of the terms and so broadens the search, while a NOT search excludes items containing the negated term.
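The three Boolean operators correspond to set operations on the lists of record numbers held in the index. The sketch below assumes a tiny hypothetical postings table (the terms and record numbers are invented for illustration) and shows how ANDing more terms shrinks the output, possibly to nothing.

```python
# Hypothetical postings: each term maps to the set of record ids containing it.
postings = {
    "internet": {1, 2, 4},
    "computer": {2, 3, 4},
    "network": {4, 5},
}

def boolean_and(*terms):
    """Records containing ALL the terms (set intersection)."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Records containing ANY of the terms (set union)."""
    return set().union(*(postings.get(t, set()) for t in terms))

def boolean_not(term_a, term_b):
    """Records containing term_a but not term_b (set difference)."""
    return postings.get(term_a, set()) - postings.get(term_b, set())

two_terms = boolean_and("internet", "computer")
three_terms = boolean_and("internet", "computer", "network")
no_hits = boolean_and("internet", "computer", "library")
either = boolean_or("internet", "computer")
only_internet = boolean_not("internet", "computer")
```

With these postings, "internet AND computer" gives records 2 and 4, adding "network" narrows the result to record 4, and adding a term that occurs in no record gives a blank result.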

Truncation

Truncation is a facility that enables a search to be conducted for all the different forms of a word having the same common root. As an example, the truncated word COMPUT* will retrieve items like COMPUTER, COMPUTING, COMPUTATION, COMPUTE etc. A number of different options are available for truncation, viz. right truncation (as in the COMPUT* example), left truncation, and masking of letters in the middle of the word. Left truncation retrieves all words having the same characters at the right-hand part, e.g. *HYL will retrieve words like METHYL, ETHYL etc. Similarly, middle truncation retrieves all words having the same characters at the left- and right-hand parts. For example, a middle-truncated search term COL*R will retrieve both the terms COLOUR and COLOR.
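One simple way to picture truncation is as pattern matching against the index vocabulary. The sketch below assumes that `*` stands for any run of zero or more characters and turns a truncated term into a regular expression; the function names and the word list are illustrative assumptions, and real systems differ in their truncation symbols and rules.

```python
import re

def truncation_to_regex(pattern):
    """Turn a truncated term such as COMPUT*, *HYL or COL*R into a compiled regex."""
    parts = [re.escape(part) for part in pattern.upper().split("*")]
    return re.compile(".*".join(parts))

def truncated_search(pattern, vocabulary):
    """Return the vocabulary words that match the truncated term in full."""
    rx = truncation_to_regex(pattern)
    return [word for word in vocabulary if rx.fullmatch(word.upper())]

# Hypothetical index vocabulary, used only for illustration.
vocabulary = [
    "COMPUTER", "COMPUTING", "COMPUTATION", "COMPUTE",
    "METHYL", "ETHYL", "COLOUR", "COLOR", "COMMUNICATION",
]

right = truncated_search("COMPUT*", vocabulary)
left = truncated_search("*HYL", vocabulary)
middle = truncated_search("COL*R", vocabulary)
```

Because `*` is unlimited in this sketch, COL*R would also match a longer word such as COLLECTOR if it were in the vocabulary, which is one reason truncation tends to raise recall at the expense of precision.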

Proximity Search

This search facility allows the user to specify :

i. Whether two search terms should occur adjacent to each other.

ii. Whether one or more words occur in between the search terms.

iii. Whether the search terms should occur in the same paragraph irrespective of the intervening words, and so on.

The operators used for proximity searching and their meanings differ from one search system to another. Various proximity search facilities, each with its corresponding operators, are available in CD-ROM and online databases.
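The three conditions listed above can be sketched with word positions. The function names and parameters here (`within`, `max_gap`, `ordered`, `same_paragraph`) are hypothetical and do not correspond to the operators of any real search system; paragraphs are assumed to be separated by blank lines.

```python
def positions(text, term):
    """Word positions at which the term occurs in the text."""
    words = text.lower().split()
    return [i for i, word in enumerate(words) if word == term.lower()]

def within(text, term_a, term_b, max_gap=0, ordered=False):
    """True if the terms occur with at most max_gap words between them.

    max_gap=0 means the terms must be adjacent. With ordered=True,
    term_a must come before term_b.
    """
    for pa in positions(text, term_a):
        for pb in positions(text, term_b):
            distance = pb - pa if ordered else abs(pb - pa)
            if 1 <= distance <= max_gap + 1:
                return True
    return False

def same_paragraph(document, term_a, term_b):
    """True if both terms occur in one paragraph, whatever lies between them."""
    for paragraph in document.split("\n\n"):
        words = paragraph.lower().split()
        if term_a.lower() in words and term_b.lower() in words:
            return True
    return False

adjacent = within("information retrieval", "information", "retrieval")
two_between = within("information storage and retrieval", "information", "retrieval", max_gap=2)
too_far = within("information storage and retrieval", "information", "retrieval", max_gap=1)
wrong_order = within("retrieval of information", "information", "retrieval", max_gap=1, ordered=True)
```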

Field Specific Search

A search can be conducted on all the fields in a database, or it may be restricted to one or more chosen fields to produce more specific results. The specific fields and their codes vary according to the search system and database.
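As a rough illustration of how restricting a search to chosen fields makes the output more specific, the sketch below searches a few hypothetical records either in every field or only in the named ones. The field names (`title`, `abstract`) and the records are assumptions for the example.

```python
def field_search(records, term, fields=None):
    """Return ids of records containing the term, optionally only in chosen fields."""
    term = term.lower()
    hits = []
    for rec in records:
        # With no fields given, every field except the record id is searched.
        searchable = fields if fields is not None else [f for f in rec if f != "id"]
        if any(term in str(rec.get(f, "")).lower().split() for f in searchable):
            hits.append(rec["id"])
    return hits

# Hypothetical database records.
records = [
    {"id": 1, "title": "Internet security", "abstract": "computer networks and security"},
    {"id": 2, "title": "Computer architecture", "abstract": "processor design"},
    {"id": 3, "title": "Library automation", "abstract": "use of computer systems in libraries"},
]

all_fields = field_search(records, "computer")
title_only = field_search(records, "computer", fields=["title"])
```

Searching all fields retrieves all three records, whereas the title-only search retrieves just the one record whose title contains the term.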

Limiting search

Sometimes the user may want to limit a given search by using certain criteria such as language, year of publication, type of information source and so on. These are called limiting searches. The parameters that can be used to limit a search are decided by the database concerned. Below are two examples of limiting searches in Dialog.

| Limit | Qualifier | Example |
|---|---|---|
| English-language documents only | /ENG | SELECT URBAN(S)CRIME?/ENG |
| Patents only | /PAT | S TRANSISTOR?/PAT |

Range Search

The range search is very useful with numerical information. It is important in selecting records within certain data ranges. The following options are usually available for range searching, though the exact number of operators, their meanings etc. differ from one search system to another :

• Greater than (>)

• Less than (<)

• Not equal to (!= or <>)

• Greater than or equal to (>=)

• Less than or equal to (<=)
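The range operators listed above can be sketched as a lookup from operator symbol to comparison function. The records and the `year` field are hypothetical, and the operator symbols accepted by a real search system may differ.

```python
import operator

# Map each range operator symbol to the corresponding comparison.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "!=": operator.ne,
    "<>": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
}

def range_search(records, field, op, value):
    """Return ids of records whose numeric field satisfies the comparison."""
    compare = OPERATORS[op]
    return [r["id"] for r in records if field in r and compare(r[field], value)]

# Hypothetical records with a numeric year-of-publication field.
records = [
    {"id": 1, "year": 1998},
    {"id": 2, "year": 2005},
    {"id": 3, "year": 2012},
]

from_2005 = range_search(records, "year", ">=", 2005)
before_2005 = range_search(records, "year", "<", 2005)
not_2005 = range_search(records, "year", "<>", 2005)
```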

Search Tools

Library and information professionals have long been using four types of tools for organizing information. They are :

i. Classification Schemes

Classification schemes, such as the Dewey Decimal Classification (DDC), Universal Decimal Classification (UDC), Library of Congress Classification (LC), Colon Classification (CC) and so on, are used for classifying documents, for organizing files and also for the physical organization of documents in libraries.

ii. Catalogue Codes

Catalogue codes, such as the Anglo-American Cataloguing Rules (AACR), the Classified Catalogue Code (CCC), etc., are used to prepare catalogue records of documents, which inform a user about what a given library or information centre possesses.

iii. Standard Bibliographic Record Formats

Standard record formats such as ISBD and MARC formats are used to prepare machine-readable records of bibliographic and other types of documents.

iv. Vocabulary Control Devices

Vocabulary control devices such as thesauri and subject headings lists are used to standardize the terminology, which can be used both at the time of indexing and searching records.

All these tools can be used for organizing information in various types of information systems, including digital library systems. However, these are only the basic search tools, and many more search techniques are available for specific information retrieval systems.
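The role of a vocabulary control device at both indexing time and search time can be illustrated with a very small sketch. The mapping below is an invented stand-in for a thesaurus or subject headings list (the terms and function names are assumptions): every entry term is replaced by its preferred term before a record is indexed and before a query is matched, so indexer and searcher meet on the same heading.

```python
# Hypothetical controlled vocabulary: entry term -> preferred subject heading.
PREFERRED = {
    "cars": "automobiles",
    "motor cars": "automobiles",
    "automobiles": "automobiles",
    "films": "motion pictures",
    "movies": "motion pictures",
    "motion pictures": "motion pictures",
}

def normalize(term):
    """Replace a term by its preferred heading, if the vocabulary lists one."""
    key = " ".join(term.lower().split())
    return PREFERRED.get(key, key)

def index_record(subject_index, rec_id, assigned_terms):
    """Indexing: file the record under the preferred form of each assigned term."""
    for term in assigned_terms:
        subject_index.setdefault(normalize(term), set()).add(rec_id)

def subject_search(subject_index, query_term):
    """Searching: normalize the query the same way, then match exactly."""
    return subject_index.get(normalize(query_term), set())

subject_index = {}
index_record(subject_index, 1, ["Cars"])
index_record(subject_index, 2, ["Movies"])

cars_hits = subject_search(subject_index, "motor cars")
films_hits = subject_search(subject_index, "Films")
```

Without the shared vocabulary, a query for "motor cars" would not match a record indexed under "Cars", since a subject search relies on an exact match with the assigned heading.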

Evaluation of IRS

Any information system exists to provide the seeker of information with the document which bears the information or answers his or her query. Evaluation is a diagnostic activity to understand the performance of a system. It reveals the strengths as well as the weaknesses of an information system. It informs about the social benefits that accrue from the system. It also tells us about the economic aspects of the system, such as the costs of its various aspects. On the basis of a careful evaluation one can thus find ways of improving the system, if required. Evaluation is rightly called an investment for the future.

Evaluation Methodology

The evaluation programme of an information system involves a number of distinct steps. Let us understand these steps :

i. The first step is to be clear about the scope of evaluation. That is to say the purpose of evaluation should be very clearly defined. The scope should be defined precisely before the designing and execution of an evaluation programme.

ii. The second step is the designing of the evaluation programme. The design should suit the objectives and purpose defined earlier. The success of the evaluation programme depends upon the choice of an appropriate design.

iii. After deciding about the scope and design of the evaluation programme, the next step is the execution proper. The execution includes the collection of data, its organization, analysis and lastly, the drawing of conclusions.

iv. The fourth step is to analyse the conclusions and interpret the results.

v. The fifth and final step is to modify the information system on the basis of the results of the evaluation, as revealed in steps 3 and 4.

Irrespective of the methodology followed, the purpose of evaluation of any information system is to find out how well it performs and what measures need to be taken for its improvement. Sometimes, the evaluation of a particular IRS may provide a clue for the design and development of other systems.

Criteria for Evaluation

The criteria on the basis of which an IRS can be evaluated are :

1. Recall and Precision and related factors affecting retrieval efficiency.

2. Cost

3. Response time.

1. Recall and Precision

The effectiveness of information retrieval can be measured by the ability of the system to retrieve the relevant documents and hold back the irrelevant ones in a given collection in relation to a particular query. The ability to retrieve relevant documents and the ability to withhold irrelevant ones are called the recall and precision powers of the system, respectively. Though theoretically 100% recall and precision are desired, in practice this is not possible, as these two factors tend to vary inversely with each other. The system in which these two factors are at the optimum level will be regarded as the best one and would be preferred for application.

In response to a query, not all the relevant documents may be retrieved in a search. Only a part of them may be retrieved. Similarly, not all the documents retrieved may be relevant, though a number of non-relevant documents also remain not retrieved. This can be illustrated in the following format :

| Document | Retrieved | Non-retrieved | Total |
|---|---|---|---|
| Relevant | a | b | a+b |
| Non-relevant | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |

Recall is the retrieval of relevant documents by the system. Recall ratio can be defined as the ratio of the number of relevant items retrieved to the total number of relevant documents in the system. This can be mathematically represented as :

$$\text{Recall ratio} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of relevant items}} \times 100 = \frac{a}{a+b} \times 100$$

Suppose there are in all 100 relevant documents in a file and the index is able to retrieve only 75 of them and misses 25; then the recall ratio is

$$\frac{75}{75+25} \times 100 = 75\%$$

Precision Ratio

Precision is the capacity of the system to withhold non-relevant documents. Precision ratio may be defined as the ratio of the relevant retrieved documents to the total number of documents retrieved from the file. Mathematically it may be represented as :

$$\text{Precision ratio} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of items retrieved}} \times 100 = \frac{a}{a+c} \times 100$$

Suppose the total number of documents retrieved is 150 and, out of these, 75 are relevant; then the precision ratio is :

$$\frac{75 \times 100}{150} = 50\%$$
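The two worked examples can be reproduced from sets of document numbers. In the sketch below, a is the number of relevant documents retrieved, b the relevant documents not retrieved, c the non-relevant documents retrieved and d the non-relevant documents not retrieved. The collection size of 1,000 and the document numbering are assumptions chosen only so that the counts match the figures in the text (100 relevant, 150 retrieved, 75 of them relevant).

```python
def contingency(retrieved, relevant, collection):
    """Return the four cells (a, b, c, d) of the retrieval table."""
    a = len(retrieved & relevant)                  # relevant and retrieved
    b = len(relevant - retrieved)                  # relevant but missed
    c = len(retrieved - relevant)                  # retrieved but not relevant
    d = len(collection - (retrieved | relevant))   # rightly left out
    return a, b, c, d

def ratio(numerator, denominator):
    """Percentage ratio, defined as 0.0 when the denominator is zero."""
    return 100 * numerator / denominator if denominator else 0.0

# Hypothetical collection of 1,000 numbered documents.
collection = set(range(1000))
relevant = set(range(100))          # 100 relevant documents
retrieved = set(range(25, 175))     # 150 documents retrieved

a, b, c, d = contingency(retrieved, relevant, collection)
recall = ratio(a, a + b)       # a / (a + b) x 100
precision = ratio(a, a + c)    # a / (a + c) x 100
```

In practice the difficulty lies in knowing the set of relevant documents at all, which this sketch simply takes as given.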

Many a time it is difficult to know the actual number of relevant documents in the store. Nevertheless, findings of recall and precision are helpful in assessing the quality of an IRS.

Besides recall ratio and precision ratio, the other relevant measures which indicate the retrieval efficiency of a system are :

i. Noise ratio

ii. Fall out ratio

iii. Novelty ratio

i. Noise ratio

It is complementary to the precision ratio. It shows the number of non-relevant documents out of the total documents retrieved. Mathematically it can be represented as :

$$\text{Noise ratio} = \frac{\text{Total no. of non-relevant documents retrieved}}{\text{Total no. of documents retrieved}} \times 100 = \frac{c}{a+c} \times 100$$

The lower the noise ratio, the more efficient a retrieval system will be.

ii. Fall out Ratio

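It shows how many non-relevant documents, out of the total number of documents in the store, have been retrieved by the retrieval system. Mathematically it may be put as :

$$\text{Fall out ratio} = \frac{\text{Total no. of non-relevant documents retrieved}}{\text{Total no. of documents in store}} \times 100 = \frac{c}{a+b+c+d} \times 100$$

Fallout is more commonly defined against the non-relevant documents in the store only, i.e. $\frac{c}{c+d} \times 100$.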
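Noise and fallout can be computed from the same four counts used for recall and precision, where a is relevant retrieved, b relevant not retrieved, c non-relevant retrieved and d non-relevant not retrieved. The counts below are illustrative assumptions (75, 25, 75 and 825). The sketch gives fallout both against the whole store, following the wording of the definition above, and against the non-relevant documents only, which is the more common formulation.

```python
def noise_ratio(a, c):
    """Non-relevant documents retrieved as a percentage of all documents retrieved."""
    return 100 * c / (a + c)

def fallout_against_store(a, b, c, d):
    """Non-relevant documents retrieved as a percentage of all documents in the store."""
    return 100 * c / (a + b + c + d)

def fallout_against_nonrelevant(c, d):
    """Non-relevant documents retrieved as a percentage of all non-relevant documents."""
    return 100 * c / (c + d)

# Illustrative counts (assumed): 150 retrieved, 75 of them relevant, store of 1,000.
a, b, c, d = 75, 25, 75, 825

noise = noise_ratio(a, c)
precision = 100 * a / (a + c)
fallout_store = fallout_against_store(a, b, c, d)
fallout_nonrel = fallout_against_nonrelevant(c, d)
```

Since noise and precision share the same denominator, they always add up to 100%, which is what "complementary" means here.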

iii. Novelty Ratio

It is the proportion of nascent or new information items which the system is able to bring to the attention of information seekers for the first time. Out of the total number of relevant documents, a small percentage may be documents which contain nascent information. If out of 100 relevant documents there are 15 such documents, the Novelty Ratio will be 15%, i.e.

$$\text{Novelty ratio} = \frac{\text{No. of relevant documents containing nascent information}}{\text{Total no. of relevant documents}} \times 100 = \frac{15}{100} \times 100 = 15\%$$

An efficient retrieval system will bring to the attention of the user more of such documents which provide novel, new or nascent information.

Indexing Exhaustivity

The exhaustivity of a system refers to the accuracy and depth with which the various concepts contained in the system are covered. Exhaustivity is a property of the index description. Indexing exhaustivity is connected with the recall power of the system. A system having high indexing exhaustivity possesses high recall power.

2. Cost

Cost is an important factor in IR system evaluation. Costs relate to the initial expenditure required to develop a system and also to other direct charges concerned with manpower, material, tools and other initial costs. Cost is a composite factor which also includes the effort on the part of the indexer, the time involved in the preparation of the index, and the search time and search effort on the part of the user. The initial cost can easily be measured, but the cost of effort is a matter of experience and realization. If a particular system is less costly than another, it is the better of the two in this respect. The ease of use of the system by the user can be related to this aspect.

3. Response Time

Response time is another important factor for measuring the efficacy of the system. Response time should be measured while the users are interacting with the system. If a system requires less time to retrieve information, it is more economical and is better than another that takes a longer time to retrieve the same information.
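As a rough sketch of how response time might be measured, the code below times a search function over several runs and reports the best and the mean time. The linear scan over invented titles is only a stand-in for a real retrieval system, and timing a function call in this way leaves out the time a real user spends formulating the query and reading the output.

```python
import time

def timed_search(search_fn, query, repeats=5):
    """Run the search several times; return (result, best time, mean time) in seconds."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = search_fn(query)
        timings.append(time.perf_counter() - start)
    return result, min(timings), sum(timings) / len(timings)

# Hypothetical store of titles, searched by a simple linear scan.
titles = [f"document {i}" for i in range(10_000)] + ["information retrieval"]

def linear_scan(query):
    """Return the positions of titles that contain the query string."""
    return [i for i, title in enumerate(titles) if query in title]

hits, best, mean = timed_search(linear_scan, "retrieval")
print(f"{len(hits)} hit(s); best {best:.6f} s, mean {mean:.6f} s")
```

Two systems can be compared on this measure only if they are given the same queries over the same collection.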

Notes

Question

1. Persons working as classifiers, cataloguers, reference officers, indexers, abstractors etc. involved in creating tools for information storage and retrieval are known as information

A. processors B. disseminators

C. recorders D. retrievers

Ans:

2. The Boolean expression A NOT B denotes a set which has

A. elements of A

B. elements of B

C. elements of A which are not the elements of B

D. elements of B which are not the elements of A

Ans:

3. Which one is not the Boolean Operator?

A. NIL B. NOT

C. AND D. OR

Ans:

4. Who was George Boole?

A. Politician B. Mathematician

C. Information scientist D. Library science scientist

Ans:

5. The Boolean expression A OR B denotes a set which has

A. elements of both A and B B. elements of B only

C. elements of A only D. all of these

Ans:

6. An increase in the level of ‘specificity’ of indexing languages results in increase in

A. Recall B. Precision

C. Noise D. both recall and precision

Ans:

7. The Boolean operator ‘AND’ is related to

A. Productive B. Additive

C. Logical Difference D. None of the above

Ans:

8. Time and cost factors are included in

A. system approach B. programme evaluation and review technique

C. Gantt chart D. work analysis

Ans:

9. Archie, the first of the information retrieval systems, was developed for

A. BLAISE B. MEDLARS

C. Internet D. DELNET

Ans:

10. Assertion (A): With a large collection of documents, recall can be measured properly.

Reason (R): The proper estimation of maximum recall for a query requires detailed knowledge of all the documents in the collection.

Codes :

A. Both (A) and (R) are true

B. Both (A) and (R) are true and (R) is not the correct explanation

C. (A) is true but (R) is false

D. (A) is false but (R) is true

Ans:

11. Boolean Logic was propounded by

A. B.C. Vickery B. S.C. Bradford

C. J. Buckland D. George Boole

Ans:

12. Assertion (A): Libraries of tomorrow will become more information service oriented centres.

Reason (R): They would require more powerful tools for storage and retrieval of information.

Codes :

A. Both (A) and (R) are True and (R) is the correct explanation of (A)

B. Both (A) and (R) are True but (R) is not the correct explanation of (A)

C. (A) is True but (R) is False

D. (A) is False but (R) is True

Ans:

13. Which of the following search devices will lead to an increase in the Recall output?

A. Boolean ‘And’ B. Boolean ‘Not’

C. Proximity Operators D. Truncation

Ans:

14. The term ‘Precision’ to measure the performance of Information Retrieval Systems, was suggested by

A. S.R. Ranganathan B. F.W. Lancaster

C. Cyril Cleverdon D. H.P. Luhn

Ans:

15. Assertion (A): As the level of recall increases, precision tends to decrease.

Reason (R): Recall and precision tend to vary inversely.

Codes :

A. Both (A) and (R) are true

B. (A) is true but (R) is false

C. Both are partially true

D. (R) is true but (A) is false

Ans:

16. Assertion (A): Operators using ‘AND’, ‘OR’ and ‘NOT’ are mostly used in online IR.

Reason (R): User interfaces cannot transform the natural language input into a Boolean search query.

Codes :

A. Both (A) and (R) are true

B. (A) is true but (R) is false

C. (R) is true but (A) is false

D. Both (A) and (R) are false

Ans:

17.

UGC NET NTA Exam Preparation

Thursday, July 24, 2025

Information Retrieval System - Features, components, Models and Evaluation (part - 10)
