Information Retrieval System - Features, Components, Models and Evaluation
In 1951, Calvin Mooers coined the term 'information retrieval' and described it as "searching and retrieval of information from storage according to specific subject". The word retrieval means to discover and bring to the notice of the users the documents in which information is embedded. B.C. Vickery has described it as follows: "retrieval is essentially concerned with the structure of the operation of the device to select documentary information from the store of information in response to several questions."
Objectives of Information Retrieval System: The general objective of an information retrieval system is to minimize the overhead of a user locating needed information. Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g. query generation, query execution, scanning results of the query to select items to read, reading non-relevant items). The success of an information system is very subjective, based upon what information is needed and the willingness of a user to accept overhead.
Models of Information Retrieval System
i. Models Based on input/output (3 types)
ii. Models Based on Theories and Tools (10 types)
i. Models Based on input/output
On the basis of input and output, information retrieval models can be grouped into three basic categories:
a. Data Retrieval Model
b. Information Retrieval Model
c. Knowledge Retrieval Model
a. Data Retrieval Model
The data retrieval model essentially handles data. For the purpose of our understanding, data can be taken as unprocessed information or a preliminary phase of information. Data is an unbiased fact which can be used to form information. For example, we can say that the population of the city of Jaipur is eighty (80) lakhs. This is data. Thus, a census system is a data retrieval system. Similarly, the National Sample Survey Organisation and the Central Statistical Organisation can be taken to be numerical data systems. A data retrieval model calls for an organisational structure based on various criteria such as properties, clusters and other different entities. There is a need for a taxonomic presentation of these aspects. Such a taxonomic presentation must also be accessible from other types of associations. A searcher of data comes with a specific retrieval need, so the expression of the information need should be very precise. Therefore, the data retrieval model is a simple model of information retrieval needing specific matching techniques, viz. a taxonomic structure of the various entities involved and their properties.
b. Information Retrieval Model
Information is data oriented to a purpose. It actually combines several data into a relational structure. Information retrieval is therefore a more complex model. It generally has to comprehend multi-dimensional relationships. It is not easily amenable to a taxonomic structure. The representation of information is to be based on a relational database structure using some associative mathematics. The expression of information need is also complex and time consuming. It draws out a long conversational process, so the retrieval model must incorporate such facilities and interfaces.
c. Knowledge Retrieval Model
Knowledge is a kind of integration of general types of information. It normally occurs in the human mind. The human mind infers and integrates several coordinates with the information received by it. So, knowledge is assimilated information. In order to facilitate decision-making and problem solving, intelligent knowledge-based information retrieval models are coming up. Such systems comprise three basic aspects:
i. The so-called knowledge base, or a store of an accumulated set of rules for converting information into knowledge. It also incorporates a knowledge acquisition system.
ii. The second aspect of the system is the inference engine. An inference engine is capable of deriving appropriate information from the combination of rules for deriving a synthesized knowledge. This process of deriving is based on inferential logic using quantitative and non-quantitative techniques.
iii. A user interface, i.e. a conversational process in the model which is capable of receiving information in the conversation mode and converting it into database signals for interaction purposes. Thus, a knowledge retrieval model is a sophisticated model of information processing, organization and retrieval.
ii. Models Based on Theories and Tools
Based on theories and methods / tools available in other disciplines, a number of models have been developed in order to find satisfactory solutions for information retrieval problems.
a. Boolean Retrieval Model (George Boole, 1847)
b. Fuzzy Logic Model (Lotfi Zadeh, 1965)
c. Set Theoretic Model
d. Vector Space Model (Dr. Gerald Salton)
e. Probabilistic Retrieval Model (Van Rijsbergen)
f. Linguistic Model
g. Mathematical Model
h. Psychological Model
i. Economic Model
j. Hypertext Linkage Model
a. Boolean Retrieval Model (George Boole, 1847)
i. Standard Boolean Retrieval Model
Boolean logic allows a user to logically relate multiple concepts together to define what information is needed.
Typically, the Boolean functions apply to processing tokens identified anywhere within an item. The typical Boolean operators are AND, OR, and NOT. These operations are implemented using set intersection, set union and set difference procedures. A few systems introduced the concept of "Exclusive OR", but it is equivalent to a slightly more complex query using the other operators and is not generally useful to users, since most users do not understand it.
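The set procedures named above can be sketched in a few lines of Python. This is only an illustration: the inverted index, the two terms and the item numbers are made-up data, and a real system would hold far larger posting lists.

```python
# Toy inverted index (hypothetical data): processing token -> set of item numbers.
index = {
    "internet": {1, 2, 4},
    "computer": {2, 3, 4},
}

items_a = index["internet"]   # items containing term A
items_b = index["computer"]   # items containing term B

a_and_b = items_a & items_b   # AND -> set intersection
a_or_b = items_a | items_b    # OR  -> set union
a_not_b = items_a - items_b   # NOT -> set difference

# "Exclusive OR" rebuilt from the other operators: (A OR B) NOT (A AND B).
a_xor_b = (items_a | items_b) - (items_a & items_b)

print(sorted(a_and_b), sorted(a_or_b), sorted(a_not_b), sorted(a_xor_b))
```

The last expression shows why Exclusive OR adds nothing new: it is just a slightly longer query written with AND, OR and NOT.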
The normal Boolean operations produce the following results:
"A AND B" retrieves those items that contain both terms A and B.
"A OR B" retrieves those items that contain the term A or the term B or both.
"A NOT B" retrieves those items that contain term A and do not contain term B.
ii. Weighted Boolean Retrieval Model
The two major approaches to generating queries are Boolean and natural language. Natural language queries are easily represented within a statistical model and are usable by the similarity measures. Issues arise when Boolean queries are associated with weighted index systems. Some of the issues are associated with how the logic (AND, OR, NOT) operators function with weighted values and how weights are associated with the query terms. If the operators are interpreted in their normal interpretation, they act too restrictively or too generally. Salton (1979) showed that using the strict definition of the operators would sub-optimise the retrieval expected by the user. Closely related to the strict definition problem is ranking, which is missing in the pure Boolean process. Salton provided additional insight into the issues of merging Boolean queries and weighted query terms under the assumption that there are no weights available in the indexes. The objective is to perform the normal Boolean operations and then refine the results using weighting techniques.
Weighting of index terms is not common in manual indexing systems. Weighting is the process of assigning an importance to an index term's use in an item. The weight should represent the degree to which the concept associated with the index term is represented in the item. The weight should help in discriminating the extent to which the concept is described in items of the database. The manual process of assigning weights adds additional overhead on the indexer and requires a more complex data structure to store the weights. In a weighted indexing system, an attempt is made to place a value on the index term's representation of its associated concept in the document. An index term's weight is based upon a function associated with the frequency of occurrence of the term in the item. Typically, values for the index terms are normalised between zero and one. The higher the weight, the more the term represents a concept discussed in the item. The weight can be adjusted to account for other information, such as the number of items in the database that contain the same concepts.
The query process uses the weights, along with any weights assigned to terms in the query, to determine a scalar value (rank value) used in predicting the likelihood that an item satisfies the query. The results are presented to the user in order of the rank value from highest number to lowest number.
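A minimal sketch of weighted indexing and ranking follows. The text does not fix a weighting function or a rank formula, so two simple choices are assumed here: a term's weight is its frequency divided by the highest frequency in the item (which normalises it between zero and one), and the rank value is the sum of query weight times index weight. The items and the query are hypothetical.

```python
from collections import Counter

def index_weights(tokens):
    """Weight each term by its frequency in the item, normalised to 0..1 (assumed scheme)."""
    counts = Counter(tokens)
    highest = max(counts.values())
    return {term: n / highest for term, n in counts.items()}

def rank_value(item_weights, query_weights):
    """Scalar rank value: sum of (query weight x index weight) over the query terms."""
    return sum(qw * item_weights.get(term, 0.0) for term, qw in query_weights.items())

# Hypothetical items, already split into processing tokens.
items = {
    "doc1": "retrieval retrieval retrieval index".split(),
    "doc2": "index index retrieval".split(),
    "doc3": "fuzzy logic".split(),
}
weights = {name: index_weights(tokens) for name, tokens in items.items()}

query = {"retrieval": 1.0, "index": 0.5}   # user-assigned query term weights
ranked = sorted(items, key=lambda name: rank_value(weights[name], query), reverse=True)
print(ranked)   # highest rank value first
```

Other weighting functions, for example ones that also account for how many items in the database contain the term, would slot into `index_weights` without changing the ranking step.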
If weights between the values 0.0 and 1.0 are assigned to the terms, they may be interpreted as the significance that users are placing on each term. The value 1.0 is assumed to be the strict interpretation of a Boolean query. The value 0.0 is interpreted to mean that the user places little value on the term. Under these assumptions, a term assigned a value of 0.0 should have no effect on the retrieved set. Thus, writing the weight of each term in parentheses after it:
"A(1.0) OR B(0.0)" should return the set of items that contain A as a term.
"A(1.0) AND B(0.0)" will also return the set of items that contain term A.
"A(1.0) NOT B(0.0)" also returns set A.
This suggests that as the weight for term B goes from 0.0 to 1.0, the resultant set changes from the set of all items that contain term A to the set normally generated from the Boolean operation. The process can be visualised by use of a Venn diagram.
b. Fuzzy Logic Model (Lotfi Zadeh, 1965)
An Information Retrieval System has a software component with the features and functions required to manipulate "information" items, versus a DBMS that is optimized to handle "structured" data. Here information is regarded as fuzzy text. The term "fuzzy" is used to imply the results of the minimal standards or controls on the creators of the text items. The author presents concepts, ideas and abstractions along with supporting facts. As such, there is minimal consistency in the vocabulary and styles of items. The searcher has to be omniscient to specify all search term possibilities in the query.
Fuzzy logic supports the values true and false as well as other values in between.
The concept of fuzzy logic was introduced by Professor Lotfi A. Zadeh. The basic objective of fuzzy logic is to develop a model that could be close to natural language processing. It is an appropriate tool for modeling the kind of uncertainty associated with vagueness and imprecision.
Fuzzy retrieval provides the capability to locate spellings of words that are similar to the entered search term. This function is primarily used to compensate for errors in the spelling of words. Fuzzy retrieval increases recall at the expense of decreasing precision. In the process of expanding a query term, fuzzy retrieval includes other terms that have similar spellings, giving more weight to words in the database that have similar word lengths and positions of the characters as the entered term. A fuzzy search on the term "computer" would automatically include similarly spelled words from the information database, such as "compiter", "conputer" and "computter". An additional enhancement may look up the proposed alternative spelling and, if it is a valid word with a different meaning, include it in the search with a low ranking or not include it at all (e.g. "commuter"). Systems allow the specification of the maximum number of new terms that the expansion includes in the query.
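Query expansion by spelling similarity can be sketched with Python's standard `difflib` module. This is an assumption-laden illustration rather than the method of any particular system: the similarity measure (`SequenceMatcher.ratio`), the cutoff of 0.8, the `fuzzy_expand` helper and the small vocabulary are all chosen here for the example.

```python
from difflib import SequenceMatcher

def fuzzy_expand(term, vocabulary, cutoff=0.8, max_terms=3):
    """Return up to max_terms vocabulary words whose spelling is close to term."""
    scored = []
    for word in vocabulary:
        score = SequenceMatcher(None, term.lower(), word.lower()).ratio()
        if score >= cutoff:
            scored.append((score, word))
    # Best matches first; ties are broken alphabetically so the result is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:max_terms]]

# Hypothetical database vocabulary containing misspellings and an unrelated word.
vocabulary = ["computer", "compiter", "conputer", "commuter", "banana"]

expanded = fuzzy_expand("computer", vocabulary, max_terms=3)
print(expanded)
```

The `max_terms` parameter plays the role of the limit on new terms mentioned above. Note that a valid but different word such as "commuter" scores as highly as a genuine misspelling, which is the loss of precision the text warns about.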
Fuzzy retrieval has its maximum utilisation in a system that accepts items that have been optical character recognized. In the OCR process, a hard copy item is scanned into a binary image. The OCR process is a pattern recognition process that segments the scanned-in image into meaningful sub-regions, often considering a segment to be the area defining a single character. The OCR process will then determine the character and translate it into an internal computer encoding. Depending upon the original quality of the hard copy, this process introduces errors in recognising characters. With decent quality input, systems achieve accuracy in the 90-99 per cent range. Since these are character errors throughout the text, fuzzy retrieval allows location of items of interest by compensating for the erroneous characters.
c. Set Theoretic Model
The set theoretical view of information retrieval is based on the recognition that information requests are normally formulated by choosing collections or sets of item identifiers, or keywords. The keyword sets in turn lead to the retrieval of record subsets chosen from among the stored collection of records. The fundamental data of retrieval theory are provided in this view by the relations which exist between the sets of item descriptions and the corresponding record sets.
d. Vector Space Model
Often weighted systems are discussed as vectorised information systems. This association comes from the SMART system at Cornell University, created by Dr Gerald Salton (1979). The system emphasises weights as a foundation for information detection and stores these weights in a vector form. In systems based upon a vector model, the semantics of every item are represented as a vector. A vector is a one-dimensional set of values, where the order/position of each value in the set is fixed and represents a particular domain. Each vector represents a document, and each position in a vector represents a different unique word (processing token) in the database. There are two approaches to the domain of values in the vector: binary and weighted. Under the binary approach, the domain contains the value of one or zero, with one representing the existence of the processing token in the item. In the weighted approach, the domain is typically the set of all real positive numbers. The value of each processing token represents the relative importance of that processing token in representing the semantics of the item. The value assigned to each position is the weight of that term in the document. A value of zero indicates that the word is not in the document. The system and its associated research results have been evolving for over 30 years. Queries can be translated into the vector form. Search is accomplished by calculating the distance between the query vector and the document vectors. The use of weights also provides a basis for determining the rank of an item. The vector approach allows for a mathematical and a physical representation using a vector space model.
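The two domains of values can be illustrated with a small sketch. The four-word vocabulary, the use of raw term counts as weights and the choice of Euclidean distance are assumptions made for the example; the text only says that some distance between query and document vectors is calculated.

```python
import math

# One fixed position per unique processing token in the (hypothetical) database.
vocabulary = ["boolean", "fuzzy", "retrieval", "vector"]

def binary_vector(tokens):
    """Binary approach: 1 if the token occurs in the item, otherwise 0."""
    return [1 if term in tokens else 0 for term in vocabulary]

def weighted_vector(tokens):
    """Weighted approach: here the weight is simply the term count (assumed)."""
    return [float(tokens.count(term)) for term in vocabulary]

def euclidean_distance(u, v):
    """Distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

doc_tokens = ["vector", "retrieval", "vector"]
query_tokens = ["vector"]

doc_binary = binary_vector(doc_tokens)
doc_weighted = weighted_vector(doc_tokens)
distance = euclidean_distance(binary_vector(query_tokens), doc_binary)
print(doc_binary, doc_weighted, distance)
```

A zero in any position means the corresponding word does not occur in the item, in both approaches.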
In addition to the general problem of a dynamically changing database and its effect on weighting factors, there are problems with the vector model on assignment of a weight for a particular processing token to an item. A major problem comes in the vector space model when there are multiple topics being discussed in a particular item. There is no way to associate correlation factors between terms, since each dimension in a vector is independent of the other dimensions.
The vector space model procedure can be divided into three stages. The first stage is document indexing, where the content-bearing terms are extracted from the document text. Many of the words in a document do not describe the content, such as the, is, are, in, to, of, etc. These are called non-significant words or stop words. In the case of automatic indexing, these terms are removed from the document vector, so the document will only be represented by the content-bearing terms. In general, 40-50% of the total number of words in a document are stop words. These can be removed with the help of a stop word list. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure.
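The three stages can be put together in a short sketch. Several things are assumed for illustration only: the tiny stop word list, raw term frequency as the weight, cosine similarity as the similarity measure, and the three sample documents.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "are", "in", "to", "of", "a", "and"}  # tiny illustrative list

def index_terms(text):
    """Stage 1: document indexing, keeping only content-bearing terms."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def weight_terms(terms):
    """Stage 2: weight each indexed term, here by raw frequency (assumed scheme)."""
    return Counter(terms)

def cosine(u, v):
    """Stage 3: similarity between two sparse term-weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

documents = {
    "d1": "the vector space model is a model of retrieval",
    "d2": "fuzzy logic in retrieval of text",
    "d3": "the economics of libraries",
}
doc_vectors = {name: weight_terms(index_terms(text)) for name, text in documents.items()}
query_vector = weight_terms(index_terms("vector model of retrieval"))

ranking = sorted(doc_vectors, key=lambda name: cosine(query_vector, doc_vectors[name]), reverse=True)
print(ranking)   # most similar document first
```

Cosine similarity is used instead of a plain distance so that long and short documents are compared on the direction of their vectors rather than their length; this is a design choice of the sketch, not something the text prescribes.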
e. Probabilistic Retrieval Model
Probabilistic considerations may apply if one assumes that system characteristics, such as the term assignment to the records or the relevance properties of the records, are probabilistic in nature. Other mathematical techniques that have been used include decision theory, information theory, pattern classification, mathematical linguistics and feature selection methods.
In addition to the vector space model, the other dominant approach uses a probabilistic retrieval model. The model that has been most successful in this area is the Bayesian approach. This approach is natural to information systems and is based upon the theories of evidential reasoning (drawing conclusions from evidence). Bayesian approaches have long been applied to information systems. The Bayesian approach could be applied as part of index term weighting, but it is usually applied as part of the retrieval process by calculating the relationship between an item and a specific query.
The probabilistic approach is based upon direct application of the theory of probability to information retrieval systems. This has the advantage of being able to use the developed formal theory of probability to direct the algorithmic development. The use of probability theory is a natural choice because it is the basis of evidential reasoning.
According to Van Rijsbergen (1979), probability theory is an intuitively pleasing model for describing and analysing information retrieval. In this approach one can estimate the probability of retrieval. In other words, the probability of relevance of retrieval is measured. Van Rijsbergen proposed that a measure of the probability of relevance of a given document to a particular query be based on a vector representing that document. He postulated that the pattern of index terms in relevant documents will differ from their pattern in non-relevant documents. The probability approach will help to analyse term clustering, frequency weighting, relevance weighting and ranking.
Probability theory can also be used to rank and order documents according to their probability of relevance. Robertson (1978) shows that the order of documents can be based on term values and on an 'Optimal Retrieval Function'. However, if one attempts to rank order the documents in a Boolean environment, some difficulties arise which are inherent to Boolean logic. Bookstein (1978) suggested that the retrieved documents be ordered according to the number of Boolean expressions present in the document that are true.
The probabilistic approach of a retrieval model is based on the assumption that the distribution of the indexing features tells something about the relevance of a document. This approach adopts a retrieval model which optimizes the retrieval effectiveness according to the Probability Ranking Principle (PRP).
William Cooper has formulated the Probability Ranking Principle as 'if a reference retrieval system's response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data'. In this method the system replies to a query by presenting the beginning of a list of documents that are ranked in descending order of scores that either represent probabilities themselves or could be mapped to probabilities by means of an order-preserving transformation. These scores, often called Retrieval Status Values (RSV), depend on document descriptions consisting of appropriate statistical information about the indexing features. Such scores may also depend on domain-dependent parameters that are estimated by means of additional data, e.g. by a thesaurus.
Probabilities are usually based upon a binary condition: an item is relevant or not. But in information systems the relevance of an item is a continuous function from non-relevant to absolutely useful. The output ordering by rank of items based upon probabilities, even if accurately calculated, may not be as optimal as that defined by some domain-specific heuristic. The source of the problems that arise in the application of probability theory is a lack of accurate data and the simplifying assumptions that are applied to the mathematical model.
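Ranking by Retrieval Status Value can be sketched as follows. The text does not give a scoring formula, so this example borrows the well-known log-odds term weight of the binary independence model as one possible choice; the probability estimates, the documents and the query are hypothetical.

```python
import math

# Assumed estimates for each indexing feature (hypothetical numbers):
# p = probability the term occurs in a relevant document,
# q = probability the term occurs in a non-relevant document.
term_stats = {
    "retrieval": (0.8, 0.3),
    "fuzzy": (0.6, 0.2),
    "the": (0.5, 0.5),
}

def term_weight(p, q):
    """Log-odds weight: positive when the term is more typical of relevant documents."""
    return math.log((p * (1 - q)) / (q * (1 - p)))

def rsv(doc_terms, query_terms):
    """Retrieval Status Value: sum of weights of query terms present in the document."""
    return sum(
        term_weight(*term_stats[term])
        for term in query_terms
        if term in doc_terms and term in term_stats
    )

documents = {
    "d1": {"retrieval", "fuzzy", "the"},
    "d2": {"retrieval", "the"},
    "d3": {"the"},
}
query = {"retrieval", "fuzzy", "the"}

# Probability Ranking Principle: present documents in descending order of score.
ranking = sorted(documents, key=lambda name: rsv(documents[name], query), reverse=True)
print(ranking)
```

The scores are not probabilities themselves, but they order the documents the same way the underlying odds of relevance would, which is the order-preserving property described above. A term such as "the", equally likely in relevant and non-relevant documents, gets a weight of zero and has no influence on the ranking.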
f. Linguistic Model
The linguistic model studies information retrieval from the point of view of the properties of language. Information retrieval is provided by features of natural language as well as artificial language. The various ways of storing information are essentially based on natural language. Human communication itself is full of natural language. In short, language carries out three types of functions:
i. They represent the contents of documents and other forms of information.
ii. The information problems of users are represented in terms of language, and
iii. Language is used in computer processing and in searching and retrieving of information.
The language works on three bases :
a. Semantic basis which conveys meaning from one human being to another.
b. The syntactic basis which helps formation of semantics in the use of grammar; and
c. The vocabulary, which supplies different meanings to terms for the formation of sentences, paragraphs and other structures.
The logical structure of a language and the taxonomy of the language refer to the relationship between vocabulary and concepts. The vocabulary generally refers to the logical structure. In modern times, vocabulary control also includes thesaural control and technical glossary control. The use of transformational grammar as well as parsing techniques provides processing speed of the language for information retrieval. Besides this, an indexing language with coordinative control provides a basic model for information retrieval. The use of associative mathematics in search logic and in search expression formulation provides yet another type of language control in information retrieval. This linguistic model forms an essential base for information retrieval. In the social sciences, language plays an ambiguous role because the terminology of the field is not as rigorous as in the natural sciences.
g. Mathematical Model
Mathematical model generally pre-supposes a careful formal analysis of the problem and specification of the assumptions and explicit formulation of the way in which the model depends on the assumptions.
Mathematical models are essentially based on representative mathematics as well as associative connections. In particular, cluster analysis and clustering techniques are used on an experimental basis in automatic abstracting and indexing. The use of set theory and Boolean logic is a very familiar method of mathematical modeling of information. The concept of similarity measures, the choice of variables and the combinational aspects of clustering try to provide semantic structure for the information represented. Cluster analysis today involves statistical packages or clustering software.
h. Psychological Model
The psycholinguistic approaches to information retrieval led to the study of the formation of concepts in the human mind. The way in which the human thinking process arranges ideas, their presentation at the time of enquiry, and the type of retrieval cues it demands while searching have led to cognitive research linked with computer communication processes. The studies of Belkin, Brooks and Oddy (1979) on the anomalous state of knowledge provide interesting insights in relation to the information retrieval process. Further, current studies in the field of information retrieval and artificial intelligence have thrown sufficient light to bring about a harmonious coupling of psychological theory with information retrieval.
i. Economic Model
The economic model of information retrieval centres round the measures of cost effectiveness and cost efficiency of information retrieval. These two criteria are based on the performance of information retrieval systems in relation to input cost as well as the number of successful outputs. The concept of provision of multiple access points being used gives a chance for measurement of information transfer. The field of information retrieval, which has developed several models of information measurement based on statistical and mathematical techniques used for studies in bibliometrics and scientometrics, provides scope for correlation of economic benefits. However, due to various intangible elements in information retrieval which cannot be identified, the economic model does not yet provide a holistic approach to information retrieval.
Theoretically, the modelling of information retrieval can be looked at from the type of output, such as data, or from the methodological approaches of different disciplines. But in practice we may not find a single model operating in that manner. For that purpose, a collective modelling of various levels can be seen. For example, the ERIC (Educational Resources Information Center) model for IR is one of the important areas of IR, which is a combination of all approaches to information retrieval (Henry and Diodate, 1991). The idea behind these theoretical models is to help analytical studies bring efficiency to different aspects of information retrieval. These multi-disciplinary approaches to information retrieval provide a better base.
j. Hypertext Linkage Model
Hypertext linkages are creating an additional information retrieval dimension. Traditional items can be viewed as two-dimensional constructs. The text of the items is one dimension, representing the information in the items. Imbedded references are a logical second dimension that has had minimal use in information search techniques. The major use of the citations has been in trying to determine the concepts within an item and in clustering items. Hypertext, with its linkages to additional electronic items, can be viewed as networking between items that extends the contents. The imbedding of the linkage allows the user to go immediately to the linked item for additional information. The issue is how to use this additional dimension to locate relevant information.
Looking at the Internet at the current time, there are three classes of mechanisms to help find information: manually generated indexes or directories, automatically generated indexes, and web crawlers (intelligent agents). Yahoo (http://www.yahoo.com) is an example of the first case, where information sources (home pages) are indexed manually into a hyperlinked hierarchy. The user can navigate through the hierarchy by expanding the hyperlink on a particular topic to see the more detailed sub-topics. At some point the user starts to see end items. Lycos (http://www.lycos.com) and AltaVista (http://altavista.com) automatically go out to other Internet sites and return the text at the sites for automatic indexing. Lycos returns home pages from each site for automatic indexing, while AltaVista indexes all of the text at a site.
Web crawlers (WebCrawler, OpenText, Pathfinder) and intelligent agents (Coriolis Groups' NetSeeker) are tools that allow a user to define items of interest, and they automatically go to various sites on the Internet searching for the desired information. The Uniform Resource Locator (URL) hypertext links can map to another item or to a specific location within an item.
Components of Information Retrieval System
Information retrieval locates relevant documents on the basis of user input such as keywords or example documents, for example: find documents containing the words "database systems". The figure shows the information retrieval system block diagram. It consists of three components: Query or Documents, IR System and Ranked Results.
i. Query / collections : Only a representation of the document or query is stored, which means that the text of a document is lost once it has been processed for the purpose of generating its representation.
ii. IR System : Performs the actual retrieval function, executing the search strategy in response to a query.
iii. Ranked Results : a set of documents which improves the subsequent run after information retrieval.
Features of an Information Retrieval System
Liston and Schoene suggest that an effective information retrieval system must have provisions for :
i. Prompt dissemination of information.
ii. Filtering of information.
iii. The right amount of information at the right time.
iv. Active switching of information.
v. Receiving information in an economical way.
vi. Browsing.
vii. Getting information in an economical way.
viii. Current literature.
ix. Access to other information systems.
x. Interpersonal communications, and
xi. Personalized help.
Search Strategy
Basic Search Techniques
Known Item and Unknown Item Search (known item keys, e.g. author, title, ISBN, publisher)
In a bibliographical information retrieval environment, searches can be divided into two main classes: known item search and unknown item search. A known item search is what is conducted when the user knows something about the item being sought. This may be any key, such as author, title, publisher, ISBN, and so on. An unknown item search is conducted when users are not aware of the existence of any document that may solve their problems. In other words, users do not know whether or not an item exists that can meet their information requirements. There are different types of searches which are helpful in understanding the entire process of search strategies.
Keyword and Phrase Search
A search can be conducted by entering a single search term or a phrase comprising more than one term. The keyword search is the simplest form of search facility offered by a search system. In keyword search mode, the system searches the inverted file (the index) for each keyword/term forming the search expression. The search terms can be entered through the keyboard or can be selected from an index or vocabulary control tool, such as subject headings lists or thesauri. Search expressions containing more than one keyword may require the use of Boolean or proximity operators.
In a phrase search, the system searches for the entire phrase rather than each individual keyword forming the phrase. Phrase searches can be conducted only in those fields that are phrase indexed. If the index file comprises only single terms, then a phrase search cannot be conducted unless proximity operators are used, whereby the system will search for each constituent keyword in the search expression separately and retrieve only those records where the keywords occur consecutively. A search phrase can simply be entered through the keyboard, or selected from an index file or vocabulary control tools like subject headings lists and thesauri.
Different search systems provide different facilities for conducting keyword and phrase searches. For example, in a Dialog search one can simply enter a keyword or a phrase preceded by the search command. The user can restrict the search to one or more fields.
Many bibliographical information retrieval systems provide two types of search facilities for conducting an unknown item search: keyword search and subject search.
A keyword search allows users to enter one or more keywords pertaining to their query. These keywords can be chosen by the user in any combination depending upon the requirements, and there are several search operators that can be used to combine several keywords to formulate a search expression. The search keywords can appear anywhere, or in one or more chosen fields, in the database records. A subject search allows the user to submit a subject expression that reflects his or her information requirement. Such a search is conducted on the subject field that contains the subject headings assigned by the indexers when the database was created. Thus, a record will be retrieved only when the user's subject search expression exactly matches the subject heading assigned by the indexers. For standardizing the process, and also helping the user identify the appropriate subject heading, an IRS uses certain tools, called vocabulary control tools.
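Keyword search over an inverted file, and phrase search over an index of single terms, can be sketched as below. The records are made up, and storing word positions in the inverted file is an assumption of this example; it stands in for the proximity operator that checks whether the keywords occur consecutively.

```python
from collections import defaultdict

# Hypothetical bibliographic records: record number -> indexed text.
records = {
    1: "information retrieval system design",
    2: "retrieval of information from storage",
    3: "database system",
}

# Inverted file: keyword -> {record number: [positions of the keyword]}.
inverted = defaultdict(dict)
for rec_id, text in records.items():
    for position, word in enumerate(text.lower().split()):
        inverted[word].setdefault(rec_id, []).append(position)

def keyword_search(word):
    """Return the records whose index entry contains the single keyword."""
    return set(inverted.get(word.lower(), {}))

def phrase_search(phrase):
    """Return the records where the keywords of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for rec_id, starts in inverted.get(words[0], {}).items():
        for start in starts:
            # Every following keyword must sit exactly one position further on.
            if all(start + i in inverted.get(w, {}).get(rec_id, []) for i, w in enumerate(words)):
                hits.add(rec_id)
    return hits

print(keyword_search("information"), phrase_search("information retrieval"))
```

Record 2 contains both keywords but not next to each other in that order, so it is returned by the keyword search and rejected by the phrase search.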
Boolean Search
This is a search technique that combines search terms according to boolean logic. Three types of boolean search are possible: AND search, OR search and NOT search.
The AND search allows the user to combine two or more search terms using the boolean AND operator. The search will then retrieve all those items that contain all the constituent terms. For example, the search expression "Internet AND computer" will retrieve all those records where both the terms occur. The search is restricted by adding more search terms: the more search terms are ANDed, the more restricted, or specific, the search will be, and as a result the smaller the search output. Sometimes a search may produce a blank result if too many search terms are ANDed. By contrast, the OR search retrieves the items that contain at least one of the constituent terms and so broadens the search, while the NOT search excludes the items that contain the term following the operator.
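The three boolean searches map naturally onto set operations over the record ids listed for each term in the index. The sketch below assumes a small hypothetical posting table (`POSTINGS`) and illustrative function names; it is one way to picture the logic, not the mechanism of any specific system.

```python
# Hypothetical postings: term -> ids of the records that contain it.
POSTINGS = {
    "internet": {1, 2, 4},
    "computer": {2, 3, 4},
    "network": {4, 5},
}

def postings(term):
    """Ids of records containing the term (empty set for an unknown term)."""
    return POSTINGS.get(term.lower(), set())

def search_and(*terms):
    """Records containing every term (set intersection); expects at least one term."""
    result = set(postings(terms[0]))
    for term in terms[1:]:
        result &= postings(term)
    return sorted(result)

def search_or(*terms):
    """Records containing at least one of the terms (set union)."""
    result = set()
    for term in terms:
        result |= postings(term)
    return sorted(result)

def search_not(wanted, unwanted):
    """Records containing the first term but not the second (set difference)."""
    return sorted(postings(wanted) - postings(unwanted))

print(search_and("Internet", "computer"))             # [2, 4]
print(search_and("Internet", "computer", "network"))  # [4]
print(search_or("Internet", "computer"))              # [1, 2, 3, 4]
print(search_not("Internet", "computer"))             # [1]
```

Each extra ANDed term can only shrink the result, which is how an over-specified AND search ends up empty.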
Truncation
Truncation is a facility that enables a search to be conducted for all the different forms of a word having the same common root. As an example, the truncated word COMPUT* will retrieve items like COMPUTER, COMPUTING, COMPUTATION, COMPUTE, etc. A number of different options are available for truncation, viz. right truncation (as in the COMPUT* example), left truncation, and masking of letters in the middle of the word. Left truncation retrieves all words having the same characters at the right-hand part, e.g. *HYL will retrieve words like METHYL, ETHYL, etc. Similarly, middle truncation retrieves all words having the same characters at the left- and right-hand parts. For example, a middle-truncated search term COL*R will retrieve both the terms COLOUR and COLOR.
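Right, left and middle truncation can all be imitated with a wildcard match against the index vocabulary. The sketch below uses `fnmatchcase` from Python's standard `fnmatch` module and assumes that `*` stands for zero or more masked characters at whichever position is truncated; the vocabulary list and the function name `truncation_search` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary.
VOCABULARY = [
    "COMPUTER", "COMPUTING", "COMPUTATION", "COMPUTE", "COMPANY",
    "METHYL", "ETHYL", "ETHANOL",
    "COLOUR", "COLOR", "COLLAR",
]

def truncation_search(pattern, vocabulary=VOCABULARY):
    """Return the index terms matching a pattern in which * masks any characters."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]

print(truncation_search("COMPUT*"))  # right truncation
print(truncation_search("*HYL"))     # left truncation
print(truncation_search("COL*R"))    # middle truncation
```

With this vocabulary the middle-truncated pattern also matches COLLAR, since an unrestricted mask accepts any run of letters between COL and R. Truncation therefore tends to raise the number of items retrieved, wanted or not.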
Proximity Search
This search facility allows the user to specify :
i. Whether two search terms should occur adjacent to each other.
ii. Whether one or more words occur in between the search terms.
iii. Whether the search terms should occur in the same paragraph irrespective of the intervening words, and so on.
The operators used for proximity searching, and their meanings, differ from one search system to another. Various types of proximity search facilities, with their corresponding operators, are available in CD-ROM and online databases.
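The three conditions listed above can be expressed as checks on word positions. The sketch below does not reproduce the operators of any real search system; `proximity_match`, `same_paragraph`, the `max_between` and `ordered` parameters, and the blank-line paragraph convention are all assumptions for illustration.

```python
def proximity_match(text, first, second, max_between=0, ordered=True):
    """True if the two terms occur with at most max_between words between them.

    max_between=0 means the terms must be adjacent. With ordered=True the
    first term must come before the second.
    """
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    for i in first_positions:
        for j in second_positions:
            gap = j - i - 1 if ordered else abs(j - i) - 1
            if 0 <= gap <= max_between:
                return True
    return False

def same_paragraph(document, first, second):
    """True if both terms occur in one paragraph (paragraphs split on blank lines)."""
    for paragraph in document.split("\n\n"):
        words = set(paragraph.lower().split())
        if first.lower() in words and second.lower() in words:
            return True
    return False

print(proximity_match("urban crime statistics", "urban", "crime"))  # True
print(proximity_match("crime in urban areas", "urban", "crime"))    # False
print(proximity_match("crime in urban areas", "urban", "crime",
                      max_between=1, ordered=False))                # True
```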
Field Specific Search
A search can be conducted on all the fields in a database, or it may be restricted to one or more chosen fields to produce more specific results. The searchable fields and their codes vary according to the search system and database.
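A rough sketch of the difference between searching all fields and searching chosen fields is given below. The records, the field names (`title`, `author`, `subject`) and the function `field_search` are hypothetical; real systems use their own field codes.

```python
# Hypothetical bibliographic records.
RECORDS = [
    {"id": 1, "title": "Internet searching", "author": "Rao",
     "subject": "Information retrieval"},
    {"id": 2, "title": "Library automation", "author": "Internet Research Group",
     "subject": "Libraries"},
    {"id": 3, "title": "Cataloguing practice", "author": "Sen",
     "subject": "Internet resources"},
]

def field_search(records, term, fields=None):
    """Return ids of records whose chosen fields contain the term as a word.

    With fields=None every field except the id is searched.
    """
    term = term.lower()
    hits = []
    for record in records:
        search_in = fields if fields is not None else [f for f in record if f != "id"]
        if any(term in str(record.get(field, "")).lower().split()
               for field in search_in):
            hits.append(record["id"])
    return hits

print(field_search(RECORDS, "internet"))                    # [1, 2, 3]
print(field_search(RECORDS, "internet", fields=["title"]))  # [1]
```

Restricting the search to the title field drops the records in which the term appears only as an author or subject word.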
Limiting search
Sometimes the user may want to limit a given search by certain criteria such as language, year of publication, type of information source, and so on. These are called limiting searches. The parameters that can be used to limit a search are decided by the database concerned. Below are two examples of limiting searches in Dialog.
Limit                           | Qualifier | Example
English-language documents only | /ENG      | SELECT URBAN(S)CRIME?/ENG
Patents only                    | /PAT      | S TRANSISTOR?/PAT
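Conceptually, a limit is a filter applied to a result set after the terms have been matched. The sketch below illustrates that idea only; it does not reproduce Dialog's syntax, and the records, field names and the function `limit_search` are hypothetical.

```python
# Hypothetical result set from an earlier search.
RESULTS = [
    {"id": 1, "language": "English", "doc_type": "article", "year": 1998},
    {"id": 2, "language": "French", "doc_type": "article", "year": 2001},
    {"id": 3, "language": "English", "doc_type": "patent", "year": 2001},
]

def limit_search(results, **criteria):
    """Keep only the records whose fields equal every given criterion."""
    return [record["id"] for record in results
            if all(record.get(field) == value
                   for field, value in criteria.items())]

print(limit_search(RESULTS, language="English"))  # [1, 3]
print(limit_search(RESULTS, doc_type="patent"))   # [3]
```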
Range Search
The range search is very useful with numerical information. It is important in selecting records within certain data ranges. The following options are usually available for range searching, though the exact number of operators, their meanings, etc. differ from one search system to another:
• Greater than (>)
• Less than (<)
• Not equal to (!= or <>)
• Greater than or equal to (>=)
• Less than or equal to (<=)
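The operators listed above can be sketched with the comparison functions in Python's standard `operator` module. The symbol table, the sample records and the function `range_search` are assumptions for illustration; an actual system defines its own operator set.

```python
import operator

# Map each range symbol to the comparison it performs.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "!=": operator.ne,
    "<>": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
}

# Hypothetical records with a numerical field.
RECORDS = [
    {"id": 1, "year": 1995},
    {"id": 2, "year": 2000},
    {"id": 3, "year": 2005},
]

def range_search(records, field, symbol, value):
    """Return ids of records whose field satisfies the comparison with value."""
    compare = OPERATORS[symbol]
    return [record["id"] for record in records if compare(record[field], value)]

print(range_search(RECORDS, "year", ">=", 2000))  # [2, 3]
print(range_search(RECORDS, "year", "<>", 2000))  # [1, 3]
```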
Search Tools
Library and information professionals have long been using four types of tools for organizing information. They are:
i. Classification Schemes
Classification schemes such as the Dewey Decimal Classification (DDC), Universal Decimal Classification (UDC), Library of Congress Classification (LC), Colon Classification (CC), and so on, are used for classifying documents, organizing files, and physically arranging documents in libraries.
ii. Catalogue Codes
Catalogue codes, such as the Anglo-American Cataloguing Rules and the Classified Catalogue Code, are used to prepare catalogue records of documents, which tell a user what a given library / information center possesses.
iii. Standard Bibliographic Record Formats
Standard record formats such as the ISBD and MARC formats are used to prepare machine-readable records of bibliographic and other types of documents.
iv. Vocabulary Control Devices
Vocabulary control devices such as thesauri and subject headings lists are used to standardize the terminology, which can be used both at the time of indexing and searching records.
All these tools can be used for organizing information in various types of information systems, including digital library systems. However, these are only the basic tools, and many more search techniques are available in specific information retrieval systems.
Evaluation of IRS
Any information system exists to provide the seeker of information with the document that bears the information or answers the query. Evaluation is a diagnostic activity to understand the performance of a system. It reveals the strengths as well as the weaknesses of an information system. It informs about the social benefits that accrue from the system. It also tells us about the economic aspects of the system, such as its various costs. On the basis of a careful evaluation one can thus find ways of improving the system, if required. Evaluation is rightly called an investment for the future.
Evaluation Methodology
The evaluation programme of an information system involves a number of distinct steps. Let us understand these steps:
i. The first step is to be clear about the scope of evaluation. That is to say, the purpose of evaluation should be very clearly defined. The scope should be defined precisely before the evaluation programme is designed and executed.
ii. The second step is the designing of the evaluation programme. The design should suit the objectives and purpose defined earlier. The success of the evaluation programme depends upon the choice of an appropriate design.
iii. After deciding about the scope and design of the evaluation programme, the next step is the execution proper. The execution includes the collection of data, its organization and analysis, and lastly, the drawing of conclusions.
iv. The fourth step is to analyse the conclusions and interpret the results.
v. The fifth and final step is to modify the information system on the basis of the results of the evaluation, as revealed in steps 3 and 4.
Irrespective of the methodology followed, the purpose of evaluating any information system is to find out how well the system performs and what measures need to be taken for its improvement. Sometimes, the evaluation of a particular IRS may provide a clue for the design and development of other systems.
Criteria for Evaluation
The criteria on the basis of which an IRS can be evaluated are :
1. Recall and Precision and related factors affecting retrieval efficiency.
2. Cost
3. Response time.
1. Recall and Precision
The effectiveness of information retrieval can be measured by the ability of the system to retrieve the relevant documents and hold back the irrelevant ones in a given collection in relation to a particular query. The ability to retrieve relevant documents is called the recall power of the system, and the ability to withhold irrelevant ones is called its precision power. Though 100% recall and 100% precision are desirable in theory, in practice they are rarely achieved together, because the two factors tend to be inversely related: broadening a search to raise recall usually lowers precision, and vice versa. The system in which these two factors are at the optimum level will be regarded as the best one and would be preferred for application.
In response to a query, all the relevant documents may not be retrieved in a search. Only a part of them may be retrieved. Similarly, not all the documents retrieved may be relevant, while a number of non-relevant documents are correctly left unretrieved. This can be illustrated in the following format:
Document     | Retrieved | Not retrieved | Total
Relevant     | a         | b             | a + b
Non-relevant | c         | d             | c + d
Total        | a + c     | b + d         | a + b + c + d
Recall is the retrieval of relevant documents by the system. The recall ratio can be defined as the ratio of the number of relevant items retrieved to the total number of relevant documents in the system. This can be mathematically represented as:
$$\text{Recall ratio} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of relevant items}} \times 100 = \frac{a}{a+b} \times 100$$
Suppose there are in all 100 relevant documents in a file and the index is able to retrieve only 75 of them and misses 25; then the recall ratio is
$$\frac{75}{75+25} \times 100 = 75\%$$
Precision Ratio
Precision is the capacity of the system to withhold non-relevant documents. The precision ratio may be defined as the ratio of the relevant retrieved documents to the total number of documents retrieved from the file. Mathematically it may be represented as:
$$\text{Precision ratio} = \frac{\text{Number of relevant items retrieved}}{\text{Total number of items retrieved}} \times 100 = \frac{a}{a+c} \times 100$$
Suppose the total number of documents retrieved is 150 and 75 of these are relevant; then the precision ratio is
$$\frac{75}{150} \times 100 = 50\%$$
Many a time it is difficult to know the actual number of relevant documents in the store. Nevertheless, findings on recall and precision are helpful in assessing the quality of an IRS.
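The two worked examples can be checked with a few lines of code. The sketch below follows the verbal definitions given above, using the cell labels a, b, c and d of the retrieval table; the function names and the helper `cells`, which derives the four counts from sets of document ids, are assumptions made for illustration.

```python
def recall_ratio(a, b):
    """Relevant items retrieved (a) as a percentage of all relevant items (a + b)."""
    return 100 * a / (a + b)

def precision_ratio(a, c):
    """Relevant items retrieved (a) as a percentage of all items retrieved (a + c)."""
    return 100 * a / (a + c)

def cells(retrieved, relevant, collection):
    """Derive the four table cells (a, b, c, d) from sets of document ids."""
    a = len(retrieved & relevant)               # relevant and retrieved
    b = len(relevant - retrieved)               # relevant but missed
    c = len(retrieved - relevant)               # retrieved but not relevant
    d = len(collection - retrieved - relevant)  # neither retrieved nor relevant
    return a, b, c, d

# The figures from the text: 75 relevant retrieved, 25 relevant missed,
# and 150 items retrieved in all (so 75 of them are non-relevant).
print(recall_ratio(75, 25))     # 75.0
print(precision_ratio(75, 75))  # 50.0
```

In practice b is the hard number to obtain, since it requires knowing every relevant document in the store, including those the search missed.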
Besides the recall ratio and the precision ratio, the other relevant measures which indicate the retrieval efficiency of a system are:
i. Noise ratio
ii. Fall out ratio
iii. Novelty ratio
i. Noise ratio
It is complementary to the precision ratio. It shows the number of non-relevant documents out of the total number of documents retrieved. Mathematically it can be represented as:
$$\text{Noise ratio} = \frac{\text{Total number of non-relevant documents retrieved}}{\text{Total number of documents retrieved}} \times 100 = \frac{c}{a+c} \times 100$$
The lower the noise ratio, the more efficient a retrieval system will be.
ii. Fall out Ratio
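It shows how many non-relevant documents, out of the total number of documents in the store, have been retrieved by the retrieval system. Mathematically it may be put as:
$$\text{Fall out ratio} = \frac{\text{Total number of non-relevant documents retrieved}}{\text{Total number of documents in store}} \times 100 = \frac{c}{a+b+c+d} \times 100$$
Note that fallout is more commonly defined relative to the non-relevant documents in the store only, i.e. with c + d as the denominator.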
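The noise and fall out ratios can be computed from the same four table cells. The sketch below follows the verbal definitions given in the text, with the fall out denominator taken as the whole store. The store size of 1000 documents, and hence d = 825, is an assumed figure chosen only to complete the example; the function names are likewise illustrative.

```python
def noise_ratio(a, c):
    """Non-relevant items retrieved (c) as a percentage of all items retrieved (a + c)."""
    return 100 * c / (a + c)

def fallout_ratio(a, b, c, d):
    """Non-relevant items retrieved (c) as a percentage of the whole store."""
    return 100 * c / (a + b + c + d)

# a = 75 relevant retrieved, b = 25 relevant missed, c = 75 non-relevant
# retrieved; d = 825 is an assumed count that brings the store to 1000.
a, b, c, d = 75, 25, 75, 825

print(noise_ratio(a, c))          # 50.0
print(fallout_ratio(a, b, c, d))  # 7.5
```

Since the noise ratio and the precision ratio share the denominator a + c, they always add up to 100%.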
iii. Novelty Ratio
It is the proportion of nascent or new information items which the system is able to bring to the attention of information seekers for the first time. Out of the total number of relevant documents, a small percentage may contain nascent information. Mathematically it may be represented as:
$$\text{Novelty ratio} = \frac{\text{Number of relevant documents new to the user}}{\text{Total number of relevant documents}} \times 100$$
If out of 100 relevant documents there are 15 such documents, the novelty ratio will be
$$\frac{15}{100} \times 100 = 15\%$$
An efficient retrieval system will bring to the attention of the user more of the documents that provide novel or nascent information.
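The novelty example can be reproduced by comparing the relevant documents with those the user already knew. This is a minimal sketch: the document ids, the idea of recording what the user already knows as a set, and the function name `novelty_ratio` are assumptions for illustration.

```python
def novelty_ratio(relevant_ids, already_known_ids):
    """Percentage of the relevant documents that are new to the user.

    Assumes at least one relevant document; an empty set would divide by zero.
    """
    relevant = set(relevant_ids)
    new_items = relevant - set(already_known_ids)
    return 100 * len(new_items) / len(relevant)

# 100 relevant documents, of which the user already knew documents 1 to 85,
# leaving 15 that are new.
relevant = range(1, 101)
already_known = range(1, 86)

print(novelty_ratio(relevant, already_known))  # 15.0
```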
Indexing Exhaustivity
The exhaustivity of a system refers to the accuracy and depth with which the various concepts contained in the documents are covered by the index. Exhaustivity is a property of the index description. Indexing exhaustivity is connected with the recall power of the system: a system having high indexing exhaustivity possesses high recall power.
2. Cost
Cost is an important factor in IR system evaluation. Costs relate to the initial expenditure required to develop a system and also to other direct charges concerned with manpower, materials, tools and other initial costs. Cost is a composite factor which also includes the effort on the part of the indexer, the time involved in preparing the index, and the search time and search effort on the part of the user. The initial cost can easily be measured, but the cost of effort is a matter of experience and judgement. If a particular system is less costly than another, then, other things being equal, it is the better of the two. The ease of use of the system for the user can be related to this aspect.
3. Response Time
Response time is another important factor for measuring the efficacy of the system. Response time should be measured while the users are interacting with the system. If a system requires less time to retrieve information, it is more economical and is better than another system taking a longer time to retrieve the same information.
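One simple way to record the machine side of response time is to wrap each search call in a timer. The sketch below uses `time.perf_counter` from the Python standard library; `timed_search` and the stand-in `linear_search` are hypothetical names, and the measurement covers only the time the call takes, not the user's own effort in formulating the query or reading the results.

```python
import time

def timed_search(search_fn, *args, **kwargs):
    """Run search_fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = search_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def linear_search(records, term):
    """Hypothetical stand-in for a real search call: scan every record."""
    return [i for i, text in enumerate(records) if term in text]

records = ["information retrieval", "library science", "retrieval evaluation"]
hits, seconds = timed_search(linear_search, records, "retrieval")
print(hits)  # [0, 2]
print(f"response time: {seconds:.6f} s")
```

Timing the same query against two systems, ideally over many repetitions, gives a like-for-like comparison of the kind described above.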