IBM OmniFind Enterprise Edition Solution Overview

IBM WebSphere Information Integrator software provides you with real-time, integrated access to business information (structured and unstructured, mainframe and distributed, public and private) across and beyond the enterprise.

Search is a fundamental information infrastructure capability that provides crucial access to text and other unstructured data (up to 85 percent of all enterprise data). Free-form search, which uses just keywords or phrases, is a significant capability of the IBM information integration platform that helps you take advantage of your combined information assets.

WebSphere Information Integrator OmniFind Edition delivers high-quality, scalable, secure text search functionality that finds the most relevant corporate information for employees, suppliers, partners and customers. Designed to integrate seamlessly with your existing systems, the enterprise search components handle the logistics that are required to collect data from diverse sources and index the data for fast retrieval. By applying linguistic analysis and other types of analysis to the data, WebSphere Information Integrator OmniFind Edition provides highly relevant search results with sub-second response times from corporate information wherever business data resides, including Web sites, relational databases, file systems, newsgroups, portals, collaboration systems, applications and content management systems.

WebSphere Information Integrator OmniFind Edition fits easily into secure enterprise Java applications so confidential information is not inadvertently exposed to unauthorized users. In addition, OmniFind is built to scale to support millions of documents and thousands of users. In fact, OmniFind enterprise search supports 80,000 queries per day on the IBM corporate intranet, searching over 9 million unique pages.

WebSphere Information Integrator OmniFind Edition provides a wide range of enterprise search benefits, including:

  • State-of-the-art ranking: OmniFind considers many factors when searching for relevant documents and even weighs factors differently based on the type of query entered. For instance, some queries are general, such as “401K plan”, whereas others require very specific information, such as “changing intranet password”. OmniFind recognizes the different types of queries and factors them into its analysis to return relevant and refined results.
  • International linguistic support: OmniFind analyzes the text of documents being indexed, identifies the language of each document, and applies language-specific linguistic processing. It supports search in 61 languages, with advanced linguistic support for a subset of 25 languages. The indexing process further analyzes word structure, removes duplicate content and employs other techniques to improve overall search quality.
  • Parametric search: OmniFind also supports range queries on numeric values. For example, a user can search for a price between $10 and $20 or for results more recent than a specified date.
  • Classification: Classification helps users find related groups of information, further improving the search experience. OmniFind provides a rule-based classification system and support for existing investments in IBM WebSphere Portal taxonomies.
  • Dynamic summary: Users find it helpful to have the query terms highlighted in a summary that is provided with the results. OmniFind dynamically generates a summary of each document that includes all query terms. This summary enables the user to easily determine which documents are most likely to contain the desired information without actually having to look at each one.
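
To make the parametric search benefit in the list above more concrete, here is a minimal Python sketch of a range filter over numeric and date metadata. The document records and the in_range helper are hypothetical and exist only for illustration; they are not part of the OmniFind API, and the bounds here are inclusive by assumption.

```python
from datetime import date

# Hypothetical documents with numeric and date metadata (illustration only).
DOCS = [
    {"title": "Widget A", "price": 8.50, "published": date(2004, 1, 15)},
    {"title": "Widget B", "price": 12.00, "published": date(2004, 6, 1)},
    {"title": "Widget C", "price": 19.99, "published": date(2005, 2, 20)},
    {"title": "Widget D", "price": 25.00, "published": date(2005, 3, 5)},
]


def in_range(docs, field, low=None, high=None):
    """Return documents whose `field` lies in [low, high]; None means unbounded."""
    result = []
    for doc in docs:
        value = doc[field]
        if low is not None and value < low:
            continue
        if high is not None and value > high:
            continue
        result.append(doc)
    return result


# Price between $10 and $20, and documents published on or after a given date.
price_hits = in_range(DOCS, "price", low=10, high=20)
recent_hits = in_range(DOCS, "published", low=date(2005, 1, 1))
print([d["title"] for d in price_hits])
print([d["title"] for d in recent_hits])
```

The same helper serves both examples because prices and dates are both ordered values.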

WebSphere Information Integrator OmniFind Architecture

The three major components of the WebSphere Information Integrator OmniFind Edition enterprise search middleware are the crawlers, the indexer/parser and the search runtime.

Figure 1. WebSphere Information Integrator OmniFind Edition Architecture

Crawler component

This component crawls the various data sources at intervals configured by the administrator and populates an IBM DB2® data store with the content extracted from those data sources. Administrators determine which data sources are relevant for a particular application, and group those data sources into collections. The crawlers then gather data and metadata from those data sources.

Indexer/parser component

This component analyzes the documents that were collected by a crawler and prepares them for indexing. The parser component analyzes document content and document metadata. It stores the results of the analysis in a file system data store for access by the indexing component. The following can be configured for a parser:

  • Field mapping rules for XML documents: This feature enables users to search structured and unstructured content in XML documents. If you map XML elements to search fields, users can specify the field names in queries and search specific parts of XML documents.
  • Categories: This feature enables users to search documents by the categories that the documents belong to. Users can also select categories in the search results and browse only documents that belong to that particular category.
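
The idea behind field mapping rules for XML documents can be sketched in a few lines of Python. The mapping table, the sample document and the helper names below are assumptions made for this illustration; the actual OmniFind configuration format is not shown in this overview.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML element paths to search field names.
FIELD_MAP = {"./title": "title", "./author/name": "author"}

SAMPLE = (
    "<doc><title>Expense policy</title>"
    "<author><name>J. Smith</name></author>"
    "<body>Submit receipts within 30 days.</body></doc>"
)


def extract_fields(xml_text, field_map):
    """Copy the text of each mapped XML element into a named search field."""
    root = ET.fromstring(xml_text)
    fields = {}
    for path, field_name in field_map.items():
        element = root.find(path)
        if element is not None and element.text:
            fields[field_name] = element.text.strip()
    return fields


def matches(fields, field_name, term):
    """Case-insensitive check of a query term against a single field."""
    return term.lower() in fields.get(field_name, "").lower()


fields = extract_fields(SAMPLE, FIELD_MAP)
print(fields)
print(matches(fields, "author", "smith"))
```

Once elements are mapped to fields, a query can target one part of the document (here, the author) instead of the full text.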

The indexer may be scheduled at regular intervals to add information about new and changed documents to an index that is stored in a file system. The indexer analyzes the documents and builds an index. In this process, text is extracted and analyzed using highly sophisticated linguistic processing for more than 25 supported languages. Further processing in the index server analyzes the link structure for intranet content, removes duplicate content and performs other processing on the collection of documents to improve overall search quality. The indexer is designed to scale to 20 million documents.

Search runtime component

This component works on behalf of a search application to process queries, search the index and return search results to the search application. The search runtime processes search requests, finds the most relevant documents in the index and returns the results with sub-second response time. Sophisticated ranking analysis ensures that the search server returns highly relevant results first. Two search runtime servers provide redundancy to ensure that search is always available.

In addition to this core capability, the architecture is open and extensible to easily support a wide variety of industry or enterprise-specific search applications. The application interface to the search engine is a well-defined Java API for ease of deployment into your existing enterprise applications. WebSphere Information Integrator OmniFind Edition provides the foundation to plug in custom analytics to enable domain-specific search capabilities and more.

Administration Console

The administration console allows you to create and administer collections, start and stop other components (such as the crawler, parser, indexer and search), monitor system activity and log files, configure administrative users, and associate search applications with collections. Reports about crawler, index and search details can be viewed at the administration console or sent to an e-mail address to be viewed at a later time.

The administrator can configure an alert or choose an option to log messages whenever certain events occur, and can then configure options to receive e-mail automatically whenever those messages are logged. You can also specify options to receive e-mail when messages that are not triggered by events are logged.

Administrators can monitor search activity with status summaries for performance, coverage, errors, query rates, response times, top queries and recent queries as well as view and export daily and hourly reports showing top queries, response time, queries served per second and more.

Data Sources Supported

WebSphere Information Integrator OmniFind Edition integrates seamlessly with existing systems, and its components handle the logistics required to collect data from diverse sources and index the data for fast retrieval. By applying linguistic analysis and other types of analysis to the data, OmniFind delivers highly relevant search results. Users do not need to learn different interfaces to search various repository types.

The range of data sources supported by WebSphere Information Integrator OmniFind Edition includes file systems, content repositories, databases, collaboration systems, intranets, extranets, and public-facing corporate Web sites. Supported sources include:

  • File systems
  • HTTP/HTTPS
  • News groups (NNTP)
  • Lotus Notes/Domino databases
  • Microsoft® Exchange public folders
  • IBM DB2 Content Manager
  • EMC Documentum and FileNet Panagon Content services via WebSphere Information Integrator Content Edition
  • DB2 UDB for Linux, UNIX and Microsoft Windows
  • DB2 UDB for z/OS, Informix Dynamic Server, and Oracle databases via the included limited use license for WebSphere Information Integrator
  • Microsoft SQL Server 2000
  • Hummingbird Enterprise DM
  • Additional connectors can be built to other data sources via a Data Listener API.

Security

Security is an integral element for enterprise search. WebSphere Information Integrator OmniFind Edition has been engineered to work with existing authentication components and as such will not require a separate login process for end users. When OmniFind requires the identity of the logged-in user, it interacts with the host environment such as the WebSphere® Portal, WebSphere Application Server or the application to obtain the user's credentials.

This approach works in conjunction with a user registry such as an LDAP repository and permits a smooth integration between OmniFind and an enterprise's existing authentication policies without requiring a separately maintained user registry.

Security mechanisms in OmniFind enable the protection of data sources from unauthorized searching and restrict administrative functions to specific users. With OmniFind, users can search a wide range of data sources. To ensure that only users who are authorized to access content do so and that only authorized users are able to access the administration console, OmniFind coordinates and enforces security at several levels.

WebSphere Information Integrator OmniFind Edition employs four levels of access control that may be used independently or together to provide increasing levels of authorization.

  • Administrative level access control determines which users can set up and maintain collections.
  • Collection level access control determines which search applications are permitted access to all or specific collections.
  • Document level access control determines which users have access to specific documents within a collection.
  • Data encryption security encrypts sensitive data such as passwords.

The OmniFind security token plug-in API provides an entry point to facilitate deployment and integration of OmniFind in the existing security infrastructure. The goal is to achieve document level security by specifying options to associate security tokens with data when the data is being collected. By enabling security for the search applications, you can use these tokens to enforce access controls and ensure that only users with the proper credentials are able to query the data and view search results.
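
The token-based approach to document level access control described above can be illustrated with a short Python sketch. The index entries, token values and the visible_results function are hypothetical; this is not the security token plug-in API itself, only the general filtering idea under the assumption that a user may see a document when at least one token matches.

```python
# Hypothetical index entries: each document carries the security tokens
# (for example, group names) that were attached when it was crawled.
INDEXED = [
    {"uri": "hr/salaries.html", "tokens": {"hr"}},
    {"uri": "support/faq.html", "tokens": {"hr", "support", "everyone"}},
    {"uri": "finance/q3.html", "tokens": {"finance"}},
]


def visible_results(results, user_tokens):
    """Keep only results that share at least one token with the user."""
    allowed = set(user_tokens)
    return [r for r in results if r["tokens"] & allowed]


# A user who belongs only to the "support" group sees a single document.
print([r["uri"] for r in visible_results(INDEXED, {"support"})])
```

Because the tokens are stored with the indexed data, the check can run at query time without contacting each original data source.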

Portal and Customization

WebSphere Information Integrator OmniFind Edition provides a standalone HTML (Struts based) Web application for search as part of the search component. OmniFind also provides a sample search application and sample portlets that can be used as templates for creating search applications that meet the special needs of your organization. OmniFind provides a Search and Index API (SIAPI) that can be incorporated easily into enterprise Java applications to develop custom search portals.

WebSphere Information Integrator OmniFind Edition offers WebSphere Portal clients enhanced search capabilities with a broader content reach and scalability to millions of documents. WebSphere Portal Search customers will have a seamless transition to WebSphere Information Integrator OmniFind Edition, which imports and reuses existing portal taxonomies for navigation and categorization, migrates rules for rule-based classification, and provides the same user experience as the WebSphere Portal's Search Center portlet.

WebSphere Information Integrator software also enhances portal functionality. By their nature, portals are windows into multiple application and information domains. OmniFind enterprise search can replace embedded portal search functionality with broader content access, more scalable implementations and richer text analytics, resulting in better search results across more information than embedded portal search. In addition, WebSphere Portal customers can migrate their existing taxonomy and classification rules to WebSphere Information Integrator OmniFind Edition, making it the logical choice to upgrade existing implementations.

For use cases that require the rapid indexing of new content, such as news feeds or e-commerce catalogs, additional sources can be processed and indexed by OmniFind using the “data listener” API to rapidly push changed content to the search system for indexing.

A custom search application may choose to present the search results returned from the search runtime in a different format. The custom search application may be a portlet, servlet, or a Java application.

Ranking Search Results

WebSphere Information Integrator OmniFind Edition has state-of-the-art relevancy algorithms, developed by IBM Research, specifically designed to deliver highly relevant search results from your corporate intranet content. OmniFind factors in dozens of variables for each query and goes beyond mere link analysis to determine relevancy.

When a user enters a query in a search application, the search processes return the most relevant results for the terms and conditions of the query. The OmniFind search servers use several techniques to produce the most relevant search results:

  • Text-based scoring
  • Static ranking
  • Dynamically summarizing document content
  • Collapsing results from the same Web site

Text-based scoring

OmniFind dynamically calculates a score for each document that matches the terms in a query. To calculate the text score of each document that matches a query, OmniFind considers many factors, such as:

  • The frequency of each query term in the entire collection. In general, query terms that appear in most documents contribute less to a document’s score than query terms that appear in a more selective set of documents.
  • The number of appearances of each query term in the matching document. In general, the more occurrences of query terms within a document, the higher its score is.
  • The proximity with which query terms appear in each matching document. In general, query terms that appear in close proximity to each other in a document contribute more to that document’s score than the same terms with more distant occurrences.
  • The context in which query terms appear in each matching document.
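
The first three factors in the list above (collection frequency, occurrences in the document and proximity) can be illustrated with a simplified scoring sketch in Python. This is a generic TF-IDF style calculation with an invented proximity bonus; it is an assumption for illustration and is not the OmniFind ranking algorithm, whose details are not given here.

```python
import math


def tokenize(text):
    return text.lower().split()


def text_score(query, doc, collection):
    """Score `doc` for `query`: TF-IDF sum plus a small proximity bonus."""
    doc_terms = tokenize(doc)
    n_docs = len(collection)
    score = 0.0
    positions = []
    for term in set(tokenize(query)):
        tf = doc_terms.count(term)  # occurrences in this document
        if tf == 0:
            continue
        df = sum(1 for other in collection if term in tokenize(other))
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # rarer terms weigh more
        score += tf * idf
        positions.append(doc_terms.index(term))  # first occurrence only
    if len(positions) > 1:
        span = max(positions) - min(positions)
        score += 1.0 / span  # query terms that sit close together add more
    return score


COLLECTION = [
    "change your intranet password here",
    "intranet news and intranet events",
    "password rules for the intranet are listed with other rules in this long policy text",
]
QUERY = "intranet password"
ranked = sorted(COLLECTION, key=lambda d: text_score(QUERY, d, COLLECTION), reverse=True)
for doc in ranked:
    print(round(text_score(QUERY, doc, COLLECTION), 3), doc)
```

In this toy collection, the document in which both query terms appear next to each other ranks first, ahead of a document that contains both terms farther apart and a document that repeats only the more common term.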

Static ranking

When you create a collection, you specify whether you want to associate a static ranking factor with the documents in the collection. Associating a static ranking factor increases the importance of those documents in the search results. For Web content, the number of links to a document from other documents, and the origins of those links, can increase the relevance of that document in the search results. For documents that include date fields or date metadata, you can use the date of the document to increase its relevance. For example, recent articles in NNTP news groups might be more relevant than older articles.
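
One possible way to picture a static ranking factor is as a multiplier applied to a document's text score. The Python sketch below uses an invented formula (a logarithmic link boost and an exponential recency decay) purely as an assumption for illustration; the source does not specify how OmniFind combines these signals.

```python
import math
from datetime import date


def static_factor(inbound_links, doc_date, today, half_life_days=365):
    """Hypothetical query-independent boost from link count and document age."""
    link_boost = math.log1p(inbound_links)  # diminishing returns on links
    age_days = max((today - doc_date).days, 0)
    recency_boost = 0.5 ** (age_days / half_life_days)  # halves every year
    return 1.0 + link_boost + recency_boost


def final_score(text_score, inbound_links, doc_date, today):
    """Combine a query-dependent text score with the static factor."""
    return text_score * static_factor(inbound_links, doc_date, today)


today = date(2005, 6, 1)
print(final_score(1.0, 10, date(2005, 5, 1), today))  # well linked, recent
print(final_score(1.0, 0, date(2003, 5, 1), today))   # no links, older
```

With equal text scores, the well-linked recent document ends up ahead of the unlinked older one.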

Dynamic summarization

Dynamic summarization is a technique that determines which phrases of a result document best represent the concepts that the user is searching for. OmniFind dynamic summarization tries to capture sentences in documents that contain a large variety of the search terms. A few sentences, or parts of sentences, are selected and displayed in the search results. The search terms are highlighted through HTML rendering of the search results.
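
The selection and highlighting steps described above can be approximated with the following Python sketch. The sentence splitting rule, the two-sentence limit and the summarize function are simplifying assumptions for illustration and do not reproduce the actual OmniFind summarizer.

```python
import html
import re


def summarize(text, query, max_sentences=2):
    """Pick sentences with the widest variety of query terms and highlight them."""
    terms = {t.lower() for t in query.split()}
    if not terms:
        return ""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def variety(sentence):
        words = set(re.findall(r"\w+", sentence.lower()))
        return len(terms & words)

    ranked = sorted(sentences, key=variety, reverse=True)
    chosen = [s for s in ranked[:max_sentences] if variety(s) > 0]
    chosen.sort(key=sentences.index)  # restore document order
    joined = " ... ".join(chosen)

    # Split on the query terms so that matches land at odd positions,
    # then escape every piece and wrap the matches in <b> tags.
    alternatives = "|".join(re.escape(t) for t in sorted(terms))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    parts = pattern.split(joined)
    return "".join(
        "<b>" + html.escape(p) + "</b>" if i % 2 else html.escape(p)
        for i, p in enumerate(parts)
    )


TEXT = (
    "OmniFind crawls many sources. The intranet password can be changed online. "
    "Lunch is served at noon. Contact support about a password reset."
)
summary = summarize(TEXT, "intranet password")
print(summary)
```

Sentences that contain none of the query terms are dropped, so the summary stays short even for long documents.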

Collapsing results from the same Web site

You can specify options for grouping result documents from the same Web site in the search results. OmniFind can organize the search results so that individual results from the same Web site are grouped together. When results are collapsed, the top result from the Web site typically appears flush left. One or more lower ranking results are grouped and indented below the top result. In the sample search application for enterprise search, the top two search result documents from each Web site are displayed. If more than two result documents are returned from the same Web site, you can specify that you want to see the collapsed results.
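
The grouping behavior described above can be sketched as follows in Python. The function name, the result structure and the choice of the host name as the "site" are assumptions for illustration; the default of two visible results per site mirrors the sample search application mentioned in the text.

```python
from urllib.parse import urlparse


def collapse_by_site(ranked_urls, shown_per_site=2):
    """Group ranked result URLs by host, keeping the top few per site visible."""
    groups = {}
    for url in ranked_urls:  # input is assumed to be in rank order
        site = urlparse(url).netloc
        groups.setdefault(site, []).append(url)
    collapsed = []
    for site, urls in groups.items():  # dicts keep insertion order
        collapsed.append(
            {
                "site": site,
                "shown": urls[:shown_per_site],
                "hidden": urls[shown_per_site:],
            }
        )
    return collapsed


RANKED = [
    "http://a.example.com/1",
    "http://b.example.com/1",
    "http://a.example.com/2",
    "http://a.example.com/3",
]
for group in collapse_by_site(RANKED):
    print(group["site"], group["shown"], "+", len(group["hidden"]), "more")
```

Each site appears at the position of its best-ranked result, and the remaining results from that site stay available behind the collapsed group.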

Collections, Categories and Scopes

A query can be directed against a specific collection, a specific category within a collection, one or more scopes within the collection, or a combination of a category and one or more scopes. Categories can be used to group documents that share a URI (Uniform Resource Identifier) pattern or documents that contain or exclude specific words and phrases. Your end users can limit their search to specific categories by identifying those categories as the target of their search. Categories can also be used to create ‘quick links’ to specific documents. OmniFind search results can include predefined links or quick links. Quick links are documents that are returned in the search results whenever a user submits a query that includes specific words and phrases.
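
The quick link behavior described above amounts to a lookup from trigger words or phrases to predefined documents. The following Python sketch shows one way this could work; the quick link table, the URIs and the matching rule (whole words, case-insensitive, no stemming) are assumptions for illustration.

```python
# Hypothetical quick link definitions: trigger words or phrases and a target.
QUICK_LINKS = [
    {
        "triggers": ["401k", "retirement plan"],
        "title": "Benefits portal",
        "uri": "http://intranet.example.com/benefits",
    },
    {
        "triggers": ["password"],
        "title": "Password self-service",
        "uri": "http://intranet.example.com/password",
    },
]


def normalize(text):
    """Lowercase and collapse whitespace so phrases compare consistently."""
    return " ".join(text.lower().split())


def quick_links_for(query, quick_links):
    """Return quick links whose trigger word or phrase occurs in the query."""
    padded = " " + normalize(query) + " "
    hits = []
    for link in quick_links:
        if any((" " + normalize(t) + " ") in padded for t in link["triggers"]):
            hits.append(link)
    return hits


print([q["title"] for q in quick_links_for("How do I change my 401K plan", QUICK_LINKS)])
```

A search application would typically display such links separately from, or ahead of, the ranked results.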

Scopes enable you to limit what users can see in the collection. Limiting the range of documents that users can search helps ensure that documents in the search results are specific to the information users seek. For example, you might create one scope that includes the URIs for the Technical Support department and another scope that includes the URIs for the Human Resources department.
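
A scope can be thought of as a named set of URI patterns. The Python sketch below models the two departmental scopes from the example above with simple URI prefixes; the scope names, the URIs and the prefix-matching rule are assumptions for illustration only.

```python
# Hypothetical scopes, each defined by one or more URI prefixes.
SCOPES = {
    "support": ["http://intranet.example.com/support/"],
    "hr": ["http://intranet.example.com/hr/"],
}


def in_scopes(uri, scope_names, scopes=SCOPES):
    """True if the URI starts with a prefix from any of the named scopes."""
    prefixes = tuple(p for name in scope_names for p in scopes[name])
    return uri.startswith(prefixes)


def restrict(results, scope_names):
    """Drop result URIs that fall outside the requested scopes."""
    return [uri for uri in results if in_scopes(uri, scope_names)]


RESULTS = [
    "http://intranet.example.com/hr/benefits.html",
    "http://intranet.example.com/support/vpn.html",
    "http://intranet.example.com/news/today.html",
]
print(restrict(RESULTS, ["hr"]))
```

Passing several scope names widens the search to the union of their URI ranges.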

Search Techniques

When you configure the search servers for a collection, you can specify options for how the collection is to be searched and configure a search cache to hold frequently requested search results. When the search servers process search requests, they first check if results for the same query already exist in the cache. If the search servers find the appropriate result documents, they can quickly return search results to the user.
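
The cache check described above can be sketched as a small wrapper around a search function. The SearchCache class below is hypothetical, and its least-recently-used eviction policy and query normalization are assumptions; the source does not say how the OmniFind search cache chooses which entries to keep.

```python
from collections import OrderedDict


class SearchCache:
    """Hypothetical fixed-size cache that sits in front of a search function."""

    def __init__(self, capacity, search_fn):
        self._capacity = capacity
        self._search_fn = search_fn
        self._entries = OrderedDict()
        self.hits = 0

    def search(self, query):
        key = " ".join(query.lower().split())  # normalize case and spacing
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        results = self._search_fn(key)  # cache miss: run the real search
        self._entries[key] = results
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return results


calls = []


def slow_search(query):
    """Stand-in for a real index lookup; records each call it receives."""
    calls.append(query)
    return [query + " result"]


cache = SearchCache(capacity=2, search_fn=slow_search)
cache.search("Intranet  Password")
cache.search("intranet password")  # served from the cache
print(len(calls), cache.hits)
```

Only the first of the two equivalent queries reaches the underlying search function; the second is answered from the cache.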

WebSphere Information Integrator OmniFind provides an option for checking the spelling of query terms. If a user misspells a term in the query, the search server can provide suggestions for how to spell the term correctly. For example, if you specify ‘saerch’ as a query term, you would see an option to specify ‘search’ as a possible correction to the spelling of your original term.
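
The ‘saerch’ example above can be reproduced with a generic similarity match against a vocabulary of known terms. The Python sketch below uses the standard library difflib module; the vocabulary and the suggest function are assumptions for illustration, and OmniFind may use a different technique.

```python
import difflib

# Hypothetical vocabulary; a real system might draw terms from the index.
VOCABULARY = ["search", "server", "crawler", "index", "collection"]


def suggest(term, vocabulary, max_suggestions=3):
    """Return close vocabulary matches for a term that is not a known word."""
    term = term.lower()
    if term in vocabulary:
        return []  # spelled correctly, nothing to suggest
    return difflib.get_close_matches(term, vocabulary, n=max_suggestions, cutoff=0.6)


print(suggest("saerch", VOCABULARY))
```

The cutoff value controls how similar a candidate must be before it is offered as a correction.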

WebSphere Information Integrator OmniFind supports a range of query techniques: users can submit free-text queries, search for specific phrases, exclude specific words or specify more complex queries to improve the precision of the search results.

Availability and Scalability

WebSphere Information Integrator OmniFind Edition is designed to provide superior performance, scalability and high-quality results through its ability to access a broad range of data sources typically found in an enterprise. It does so by extracting the documents from their original source, parsing and analyzing the content, and then building a collection (index) that is optimized for speed and result accuracy.

OmniFind has proven its robustness and scalability on one of the most challenging intranets in the world: the IBM intranet, serving a user community of more than 300,000 people. To date, two applications with very different requirements have been implemented:

  • Very large-scale general intranet search of over 10,000 Web sites and 25 million URLs. Over 9 million unique documents are indexed with a turnaround of four hours for updating the index. This application has been in 24x7 production since September 2003.
  • An Employee Profile application for expertise location that searches over 500,000 XML records and makes up to 20,000 updates daily with two- to three-hour turnaround. This application has been in production since March 2004.