
Webserver Search Engine and Webcrawler - Getting Started
This page contains the documentation specific to setting up and using the i5/OS Webserver Search Engine and Webcrawler. The search engine is easy to set up and maintain, and it allows you to perform full-text searches. Here are some of its features:
- The search engine can be run on either an Apache or Original server.
- Documents from your system or from a remote system can be indexed for searching by using a built-in function that includes a web crawler.
- The search engine can be configured from the HTTP Server browser interface or by using search CL commands.
- Net.Data macros and an HTML insert are provided to customize search and search results.
- Both simple and advanced searches are supported.
- Search results can be sorted in various ways such as by date or title.
- Search results can be improved using thesaurus support.
- Documents in all languages including Japanese, Chinese, and Korean can be searched.
- Web Crawler
- Technical Help
About this document
This document contains information on:
- Highlights of the iSeries Webserver Search Engine
- Administration of search indexes
- Setting up a search engine on your web site
- Configuring either an Original or Apache server to run your search.
- Running the web crawler
- Technical Details and Troubleshooting
If you want to try the Webserver Search Engine, read through "How to set up a search engine" and then test the indexing and search process by following the directions in the next section called "Getting started - indexing the sample recipe files."
How to set up a search engine
If you want to allow others to search through documents on your server, you will need to set up your system to be a searchable site. Doing this is very easy with the new iSeries Webserver Search Engine. There are just a few administrative tasks you need to do. Some customers required only about an hour to accomplish all of these tasks. These tasks can be summarized as follows:
- First, collect all of the related documents into a single directory on your iSeries. You may use either the Root (/) directory of the IFS or the QSYS.LIB file system. Using the IFS system allows you to easily port your files from a PC onto the iSeries.
- Next you will need to create a search index. An index is the collection of all of the selected documents in your directory, stored in a special indexed form. In the indexing process, the search engine takes each document provided in a document list and parses it to create keys that are used in searches. The Webserver Search Engine uses very short character string keys. This indexed form allows for faster searching than could be done on documents that are not indexed.
- The documents provided to the indexing function are contained in a document list that is automatically created when you create an index. A list can also be created through administrative forms or by hand.
- Once you have created the search index, you can test it from the search administration form. This will allow you to see all of the different options available to select for a search, such as fuzzy or precise.
- Now you are ready for Setting up the Search Engine to Run on Your Web Site. A short HTML section has been supplied that can be added to your web page as well as a Net.Data macro containing all of the HTML you will need. This allows you to customize your search and search results forms. You may just use the short HTML form supplying a few values if you are not comfortable using Net.Data. However, you must still copy the sample macro to your directory to make all of this work. Detailed instructions for this are included.
- Once you have decided how you want to present your search forms, you will need to make sure the HTTP server you use contains the correct directives in the configuration to run the Net.Data macro and to make sure users can view the documents found on a search. A simple set of steps to do the necessary setup is provided for both Original and Apache servers. There is also plenty of documentation about the IBM HTTP Server for i to help you with other configurations.
- When all of this is completed, you are ready to do some searches!
- It is important to keep your index up to date. If you modify your documents from time to time, you want to make sure your users are finding the most current information. We have supplied a way for you to update your index. You can use the same document list you used when you originally created your index. We will index any changed files that were previously indexed. You can also add a new set of documents to an index that already exists as well as delete some of the documents from your index. This is just a matter of supplying different lists when you update the index.
Getting started - indexing the sample recipe files
If you are trying out the search engine for the first time, start here to find out how easy it is to set up a search. Once you have completed these steps, then see the instructions below for Setting up the search engine to run on your own web site.
A search engine searches files that have been indexed or converted into a form that makes them quick to search. These instructions will show you how to index some sample files. Once you have created the index, you can try searching them.
You do not need to do any additional setup for this exercise.
To get started using the iSeries Webserver Search Engine, we have provided a set of HTML files containing recipes that you can use to see how the iSeries Webserver Search Engine works. Once you have practiced with this set of files, you will be ready to set up your own search site.
- Use your web browser to access the iSeries Tasks page by specifying the URL http://yourserver.com:2001 where yourserver.com is the domain name of your iSeries system.
- Select the icon for IBM HTTP Server for i.
- Select the icon for Configuration and Administration.
- Click on Search Setup to display the administration tasks that can be performed. (If you are on an older release, just click on Search Administration in the left frame and follow along.)
- Select the option to Create a search index and enter the following information:
- Index Name: Recipes
- Index Directory: /QIBM/UserData/HTTPSVR/index (the default)
- Index Description: (optional)
- Press Apply.
- On this form, you will see the values you just entered at the top of the form. Now fill in the field Build a document list from this directory with /QIBM/ProdData/HTTP/Public/HTTPSVR/HTML. (Be sure to enter this directory using exactly the same upper and lower case letters.)
- Press Apply. (You should get a message indicating that the index was created successfully.)
- Your new index will appear in the list in the left frame.
- In the left frame, select Recipes from the index pull down.
- Optionally click on View status of search index. This will provide you with information about the index you just created.
- With Recipes still selected in the left hand frame, click on Search index to display the search form.
- If you are on a V5R1 or later version system, click Next and then Next again to get to the search form.
- Enter a search string such as butter or "peanut butter".
- Click on Search. You will see a list of recipes that match your search string. Clicking on the title of a recipe will display the actual HTML file associated with that recipe. If you receive Error 403, look at the URL associated with the document. This error can occur if you see, for example, /qibm/proddata/http/public/httpsvr/html/fdoc0086.html, because the upper and lower case letters in the path do not exactly match the directory as described above. If so, select Delete search index, then create the index again as in the example above.
- Experiment with different search strings, boolean operators, and other search options. Try out the advanced search link using the query butter AND eggs NOT milk. This query will search for recipes that contain both butter and eggs but do not contain milk. Feel free to delete the Recipes index and recreate it using different options until you become comfortable with the iSeries Webserver Search Engine.
General tips for using the Search Administration Forms
This section contains some useful information to help you understand some of the features and restrictions applied during search administration.
Index Name
- The index name can contain up to 8 single-byte characters.
- The index name can be re-used as long as it is associated with different index directories.
- This name is used by search administration for files that are created for several indexing functions.
- The index name is used as the default for naming a document list - /indexdir/indexname.DOCUMENT_LIST.
- The index name is used as the default for naming the mapping rules file - /indexdir/indexname.MAP_FILE
Index Directory
- The index directory must be an IFS path name such as /myindexdir.
- The maximum length of the index directory path is 117 characters.
- The index directory name must begin with a / and must contain only single-byte characters.
- The index directory cannot be a path in the QSYS.LIB file system such as /QSYS.LIB/MYINDEX.LIB.
- Search administration creates its own files in the index directory for the following:
- Indexed documents
- Mapping rules file
- Document list
- Temporary files
- The index directory must have *PUBLIC *RWX authority so files can be added, changed, and removed from the directory.
- Parent directories of the index directory must have *PUBLIC *RX authority.
- The index directory should not be deleted until any search indexes that are associated with this directory have been deleted using the Delete index function.
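The naming rules listed above for the index name and index directory can be checked before you fill in the form. The following is a minimal sketch, not part of the product: the function names are hypothetical, "single-byte" is approximated as "encodable in Latin-1", the 8-character rule is read as a maximum, and the QSYS.LIB check is assumed to be case-insensitive.

```python
def is_single_byte(text):
    """Approximate "single-byte" as: every character fits in Latin-1 (assumption)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def check_index_settings(index_name, index_dir):
    """Return a list of rule violations; an empty list means the values look valid."""
    problems = []
    # Index name: read here as 1 to 8 single-byte characters (assumption).
    if not 1 <= len(index_name) <= 8:
        problems.append("index name must be 1 to 8 characters long")
    if not is_single_byte(index_name):
        problems.append("index name must contain only single-byte characters")
    # Index directory: IFS path, begins with /, at most 117 single-byte characters.
    if not index_dir.startswith("/"):
        problems.append("index directory must begin with /")
    if len(index_dir) > 117:
        problems.append("index directory path must not exceed 117 characters")
    if not is_single_byte(index_dir):
        problems.append("index directory must contain only single-byte characters")
    # The index directory cannot be in the QSYS.LIB file system.
    upper = index_dir.upper()
    if upper == "/QSYS.LIB" or upper.startswith("/QSYS.LIB/"):
        problems.append("index directory cannot be in the QSYS.LIB file system")
    return problems


print(check_index_settings("Recipes", "/QIBM/UserData/HTTPSVR/index"))
print(check_index_settings("TooLongName", "/QSYS.LIB/MYINDEX.LIB"))
```

The sketch does not check authorities (*PUBLIC *RWX and *RX); those still need to be verified on the system itself.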
Document list
- The document list is a file containing path names to your documents.
- Starting in Version 5 Release 1, a document list can be built from files on your server or from files found by crawling web sites. See also Getting Started - Running the Web Crawler.
- For documents on your system, search administration will automatically build a document list in file /indexdir/indexname.DOCUMENT_LIST on the Create index form, using the directory path you specify. You can select to traverse sub-directories or not.
- Search administration will build a document list in either a directory of the IFS or the QSYS.LIB file system.
- The document list can be modified through the search administration forms or with an editor.
- Document lists created by crawling sites contain the path name of the downloaded file followed by the URL of the actual web page.
- All files in one document list should be tagged with the same codepage or CCSID. The CCSID of the data in the file must be the same as the CCSID of the tag. There is no error checking to verify this.
Documents
- Documents must have *PUBLIC *R authority so that they can be indexed and viewed as a search result.
- Documents can be either HTML or text files. If the documents are HTML files and you specify that they are HTML files, all HTML tags will be removed, so a search for the text of a tag itself finds nothing. However, if the word "title" is in the document, it will be found. If the documents are HTML files but you specify that they are TEXT files, no HTML tags will be removed. In this case, a search for the text of a tag can find matches.
Mapping rules file
- A mapping rules file is used to map the path name of a document found on a search to an external URL, using the rules you supply in your configuration.
- Search administration will automatically build a mapping rules file /indexdir/indexname.MAP_FILE on the Create index form if you check the box to Create a mapping rules file from this HTTP server.
- Search administration will build a mapping rules file either in a directory of the IFS or the QSYS.LIB file system if you use the Build URL mapping rules file form.
- To avoid specifying a mapping rules file in your search form, use the default mapping rules file name - /indexdir/indexname.MAP_FILE. This file is automatically used on a search. If you create a mapping rules file with another name, you will need to specify the file in your version of the search macro.
- You can update the mapping rules file using any editor.
- The mapping rules file must have at least *PUBLIC *R authority. It must have *PUBLIC *RWX authority to be appended through search administration.
Configuration and the search macro
- Notice that the Map directive /cgi-bin/db2www in our Sample HTML below is used in the ACTION parameter and in the URL used to get to the search forms. You can use whatever Map directive you like, as long as you use it consistently in all 3 places.
Samples shipped with the Search Engine
IBM ships several samples for you to use when you test and set up your own search. There are references to these samples throughout the search documentation.
| File | Description |
| --- | --- |
| /QIBM/ProdData/HTTP/Public/HTTPSVR/HTML/fdoc0xxx.html | A set of sample HTML files (recipes) that can be used to test indexing and search. |
| /QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.ndm | Sample Net.Data macro to use for search and search results (updated with additional parameters and features). |
| /QIBM/ProdData/HTTP/Public/HTTPSVR/thesaurus_sample_search.ndm | Sample Net.Data macro with thesaurus support to use for search and search results. |
| /QIBM/ProdData/HTTP/Public/HTTPSVR/sample_html.html | HTML to insert into your web page. It contains a search box and a path to sample_search.ndm. |
Understanding how the iSeries Webserver Search Engine works
The iSeries Webserver Search Engine provides high speed searching of multiple documents by working on an index file created from the documents' source. This index is far more efficient to search than processing each document separately. The index file contains a concise representation of all of the words contained in the source documents.
The Document List
In order to build a search index, the iSeries Webserver Search Engine requires a list of documents that are to be placed in this index. This list is called a document list. A document list is nothing more than a source file that contains the fully qualified path names of all the documents that are to be indexed. This list is passed to the iSeries Webserver Search Engine during index create and the specified documents are indexed one at a time. This process can take quite some time if you are indexing thousands of documents.
A document list can be built two ways. First, you can explicitly create a document list using the Build document list option of the Search Administration form. The Build document list form asks you for the directory you would like to build the list from, the filter used to select the documents from this directory, and whether sub-directories under this directory should be traversed. It also asks you for the name of the document list file to be built. There is an append option on this form that allows you to run Build document list multiple times, each time specifying a different directory thus allowing you to build a document list that contains any number of documents from any number of directories.
The second way to build a document list is to do it automatically when you create the index. On the Create search index form select the option to Build a document list from this directory. The documents from the specified directory will be extracted into a document list named /index_directory/index_name.DOCUMENT_LIST. The search index will then be created from the documents in this list.
When building a document list from source physical files in the QSYS.LIB file system, the filter used should reflect the fact that files in the QSYS.LIB file system have the extension MBR. For example, the filter could be *.mbr. Only IFS stream files (STMF) and source physical file members in the QSYS.LIB file system can be indexed.
A document list created from local files contains a list of paths to the documents. The sample below was created by specifying /qibm/proddata/http/public/httpsvr/html for the starting directory on the Create index form:
/qibm/proddata/http/public/httpsvr/html/Welcome.html
/qibm/proddata/http/public/httpsvr/html/fdoc0001.html
/qibm/proddata/http/public/httpsvr/html/fdoc0002.html
/qibm/proddata/http/public/httpsvr/html/fdoc0003.html
/qibm/proddata/http/public/httpsvr/html/fdoc0004.html
/qibm/proddata/http/public/httpsvr/html/fdoc0005.html
/qibm/proddata/http/public/httpsvr/html/fdoc0006.html
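To make the structure of a local document list concrete, here is a minimal sketch that produces the same kind of file outside the product: it walks a starting directory, applies a file-name filter, optionally traverses sub-directories, and writes one fully qualified path per line. The function names, the default filter, and the UTF-8 encoding are assumptions for illustration; this is not how the Build document list form is implemented.

```python
import fnmatch
import os


def build_document_list(start_dir, pattern="*.html", traverse_subdirs=True):
    """Collect fully qualified paths of files under start_dir whose names match pattern."""
    paths = []
    for current_dir, subdirs, files in os.walk(start_dir):
        if not traverse_subdirs:
            subdirs.clear()  # stop os.walk from descending into sub-directories
        subdirs.sort()       # visit sub-directories in a predictable order
        for name in sorted(files):
            if fnmatch.fnmatch(name, pattern):
                paths.append(os.path.join(os.path.abspath(current_dir), name))
    return paths


def write_document_list(paths, list_file, append=False):
    """Write one path per line; append=True mimics the append option on the form."""
    with open(list_file, "a" if append else "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
```

Calling write_document_list several times with append=True, each time with paths from a different directory, mirrors running Build document list repeatedly with the append option.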
A document list created by crawling contains two lines for each document. The first line is the path to the downloaded document. The next line is the actual URL. This sample was created by crawling http://www.ibm.com.
The directory specified to store downloaded documents is /QIBM/USERDATA/HTTPSVR/INDEX/DOC. The web page found, www.ibm.com/~index.html, is stored in the document directory. The second line of the file is the actual URL to the web page.
/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/~index.html
http://www.ibm.com/
/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/account/~index.html
http://www.ibm.com/account/
/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/products/us/~index.html
http://www.ibm.com/products/us/
/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www-1.ibm.com/support/~index.html
http://www.ibm.com/support/
/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/help/us/en/help/~index.html
http://www.ibm.com/help/us/en/help/
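If you process a crawled document list with your own tools, the two-line layout described above can be read as path/URL pairs. A minimal sketch follows; the function name is hypothetical, and blank lines are assumed to be ignorable.

```python
def parse_crawled_document_list(lines):
    """Pair each downloaded-file path with the URL on the line that follows it."""
    entries = [line.strip() for line in lines if line.strip()]
    if len(entries) % 2 != 0:
        raise ValueError("expected path/URL pairs, found an odd number of lines")
    # Even positions hold paths, odd positions hold the matching URLs.
    return list(zip(entries[0::2], entries[1::2]))


sample = [
    "/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/~index.html",
    "http://www.ibm.com/",
    "/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/account/~index.html",
    "http://www.ibm.com/account/",
]
for path, url in parse_crawled_document_list(sample):
    print(url, "is stored in", path)
```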
Results Ranking
Documents satisfying a search request are returned in order of their ranking. A document's ranking specifies its relevance with respect to the specified search condition. The following three factors affect a document's ranking:
- Frequency of search terms in the document - As the search words appear more frequently in the document, its ranking gets higher.
- Position of search terms in the document - As the search words appear closer to the beginning of the document, its ranking gets higher.
- Frequency of search terms in the whole set of documents - As the search words appear less frequently within the documents in the entire index, the ranking for documents having the search words gets higher.
The highest ranking a document can have is 100%. This ranking can be achieved if relatively few of the documents in an index contain the search words. The documents with the greatest number of search words appearing toward the beginning of the document would get a ranking of 100%. If, however, many documents in the index contain the search words, it is likely that none of the documents would get a 100% ranking. It is possible that a document with one search word appearing toward the beginning of the document could get a higher ranking than a document with multiple search words appearing toward the end of the document. The theory behind this is that the main subject words of a document usually appear toward the beginning of the document.
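The search engine's actual ranking formula is not documented here. The toy score below is only an illustration of how the three factors pull in the directions described above; the function name, the way the factors are multiplied together, and the lack of scaling to a percentage are all assumptions.

```python
import math


def toy_rank(query_words, documents):
    """Score each document (a list of words) on the three factors; higher is better."""
    total = len(documents)
    scores = []
    for words in documents:
        score = 0.0
        for term in query_words:
            positions = [i for i, word in enumerate(words) if word == term]
            if not positions:
                continue  # this document does not contain the term
            containing = sum(1 for other in documents if term in other)
            frequency = len(positions) / len(words)       # factor 1: more occurrences
            earliness = 1.0 - positions[0] / len(words)   # factor 2: earlier position
            rarity = math.log(1.0 + total / containing)   # factor 3: rarer in the index
            score += frequency * earliness * rarity
        scores.append(score)
    return scores


docs = [
    ["butter", "eggs", "flour"],
    ["flour", "sugar", "butter"],
    ["milk", "sugar", "salt"],
]
print(toy_rank(["butter"], docs))
```

In this example the first document scores higher than the second because butter appears at the start, and the third scores zero because it does not contain the word.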
Considerations when using the Webserver Search Engine
Creating indexes
Indexes can be created on any HTML documents or text files stored in the iSeries file system. This includes documents stored in iSeries source physical files (the /QSYS.LIB directory) and those downloaded by using the web crawler. For best performance, it is recommended that HTML documents be stored in an IFS directory. Search indexes cannot be stored in the QSYS.LIB file system; the index directory must be in the IFS.
You should only include HTML files or text files in your index. Indexing other file types such as GIFs, JPEGs or other images could adversely affect indexing and search performance.
Keeping index information up to date
Although HTML documents usually contain relatively static information, their content does change from time to time and new HTML documents are constantly being added. Your search index needs to keep pace with these changes. The Update search index option of the Search Administration form helps you do that. You will want to update your search index whenever your HTML documents change or new documents are added. New and changed documents are placed in the index directory in what is called a supplemental index. This is done so as to not disrupt searches that are currently going on in the main index. Although searches and updates can take place at the same time, it does take extra CPU cycles to update the index. For that reason, you may want to consider running updates during non-peak hours.
The index update operation is very much like the initial index creation. You provide a list of documents to be updated. The search engine processes this list and updates the index accordingly. The document list can be built for you automatically on the Update search index form or you can build it separately using the Build document list form. This list may be used to either add or delete a set of documents from the index.
The main index is created at the first indexing, and the supplemental index is created and updated by adding documents. Only the supplemental index is rewritten when documents are added. The supplemental index should be kept comparatively small by periodically merging the index using the Merge search index form. When the supplemental index is merged to the main index the whole index is rewritten. This takes time depending on the size of index. For large indexes, select to do the merge as a background task.
Document list processing
- If you are adding or changing documents in the index, each document in the document list is examined to see if it has already been indexed and if so, whether it has changed since it was last indexed. This check is done using the last modified date of the document. If the document is new, it is added to the supplemental index. If the document has changed, it will be deleted from the main index and added to the supplemental index. If the document hasn't changed, it will be ignored.
- If you are deleting documents from the index, each document in the document list is examined to see if it is currently in the index. If it is in the index, it is deleted from either the main or supplemental index. If the document is not currently in the index, it is ignored.
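The add-or-change check described above can be pictured as a comparison of each document's last modified date against the date recorded when it was last indexed. The sketch below is an assumption-laden illustration: the function name and the indexed_mtimes record are hypothetical, and the search engine keeps its own bookkeeping internally.

```python
import os


def classify_for_update(document_paths, indexed_mtimes):
    """Sort paths into new, changed, and unchanged.

    indexed_mtimes is a hypothetical record mapping each indexed path to the
    modification time (seconds since the epoch) seen when it was last indexed.
    """
    result = {"new": [], "changed": [], "unchanged": []}
    for path in document_paths:
        current = os.path.getmtime(path)
        if path not in indexed_mtimes:
            result["new"].append(path)        # would be added to the supplemental index
        elif current > indexed_mtimes[path]:
            result["changed"].append(path)    # deleted from main, added to supplemental
        else:
            result["unchanged"].append(path)  # ignored
    return result
```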
There is no automatic way of building a document list for documents to be deleted. You will have to determine what documents have been removed from the directory and build the document list by hand using SEU or the EDTF command. The document list is simply a text file with one entry per line. Each entry is the fully qualified path of a document that was indexed.
Document lists that are built by search administration can be found in /indexdirectory/indexname.DOCUMENT_LIST.
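If the original document list is available as a plain text file, one way to prepare a deletion list is to keep only the entries whose files no longer exist. This is a minimal sketch under several assumptions: the list was built from local files (one path per line, not the two-line crawler format), it can be read as UTF-8, and the function and file names are hypothetical.

```python
import os


def build_deletion_list(document_list_file, deletion_list_file):
    """Write the indexed paths that no longer exist on disk to a new list."""
    with open(document_list_file, encoding="utf-8") as src:
        indexed = [line.strip() for line in src if line.strip()]
    # Keep only entries whose file has been removed since the list was built.
    removed = [path for path in indexed if not os.path.exists(path)]
    with open(deletion_list_file, "w", encoding="utf-8") as out:
        for path in removed:
            out.write(path + "\n")
    return removed
```

The resulting file can then be supplied as the document list when deleting documents from the index.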
Searching for Documents
There are two types of searches that can be performed, a simple or advanced search. On a simple search, you just enter a word or phrase. The defaults listed on the simple search form will be used to find documents containing the terms. Note that the default is to perform an exact search. You can change this default or any others by changing the sample macro.
The sample macro contains a section, labeled Precision of search terms, that sets the default for the search precision (exact or fuzzy). To switch the default to a fuzzy search, remove CHECKED from the first radio button and add it to the second radio button.
An advanced search allows you to enter a single query with each term specifically limited by the options you attach to it. You can modify the advanced search form to make this type of search less dependent on the technical ability of the user.
When searching for words in a document you can enter one or more search words or you may enter a search phrase. A search phrase must be surrounded by double quotes. For example you may enter the phrase "internet computing" to find those two words together exactly as shown. If a phrase includes double quotes within the phrase, you must double up those quotes for a proper search string. For example to find a phrase such as 'The "ultimate" source' you would enter the search string "The ""ultimate"" source". To search for "ultimate", you would enter the search string """ultimate""". Incorrect syntax of double quotes will cause an error to occur.
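If you build search strings in a program, the quoting rule above (surround the phrase with double quotes and double any quotes inside it) is a one-line transformation. A minimal sketch, with a hypothetical function name:

```python
def quote_phrase(phrase):
    """Wrap a phrase in double quotes, doubling any double quotes it contains."""
    return '"' + phrase.replace('"', '""') + '"'


print(quote_phrase('internet computing'))
print(quote_phrase('The "ultimate" source'))
print(quote_phrase('"ultimate"'))
```

The three calls produce the search strings given in the text: "internet computing", "The ""ultimate"" source", and """ultimate""".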
Simple Search Query
A simple search query contains the words or phrase you want to find. For example, if you enter the word computer, all documents containing the word computer will be listed in the results. Wildcards are also allowed in the query. To expand the search capabilities, you can select various options on the search form.
Case sensitive: Click the check box for a case sensitive search. This will help you to receive more precise results. For example, if you want to find Internet and not internet, select a case sensitive search. If case sensitivity was disabled when the index was created, a case insensitive search is always done.
English stemming: Click the check box to include English terms that contain the "stem" or base word that you specified. For example, if you specify the word communicate, the search will find documents containing the words communicate, communication, and communicating. This value is effective when and only when the matching level is specified as exact.
Select the operator to define the relationship between terms in the search string: If multiple search words are specified, indicate the logical relationship between the words.
- AND - If all the words specified are in a document, that document is returned.
- OR - If one or more of the words are in a document, that document is returned.
- SAME SENTENCE - If all the words specified are in the same sentence within a document, that document is returned.
Precision of search terms: Click the box for a fuzzy or close match. Selecting a fuzzy search will give you more results.
- Find the exact match - Only documents containing an exact match for the search words will be found. For example, if you enter program, only documents containing the word program but not documents containing the word programs will be found unless you have selected English stemming.
- Find a fuzzy or close match - Words that match 60% of the letters in the search word will be found. For example, if you enter program, documents containing the words programs and programmed will be found.
Number of documents to return on a page: Enter the number of documents to return on a page. If the number of documents found is greater than this number, click the button to view the next set of results.
Advanced Search Query
The advanced search form provides flexibility and precision for your search query. One or more of the search attributes we described for a simple search can be attached to individual words or quoted phrases in an advanced search. Using these attributes refines your search to exactly the text you want to find in a document.
Search using multiple search strings: On this form you can specify multiple search terms. Each search term may contain multiple search words or search phrases enclosed in double quotes. A logical operator must be specified between each word or quoted phrase.
Search using Boolean operators: If you specify multiple search terms, you must specify an operator between each term. Use AND or *, OR or +, NOT or - as follows:
| Operator | Alternate operator | Search action |
| --- | --- | --- |
| A AND B | A * B | Search for documents that contain both A and B |
| A OR B | A + B | Search for documents that contain either A or B, or both |
| A NOT B | A - B | Search for documents that contain A and do not contain B |
Search using a priority of operators: Operators are interpreted in the order AND = NOT > OR. This means that AND and NOT have equal priority, and both are applied before OR. Terms are ANDed before they are ORed even if an OR is specified before an AND. You can change the order of operations by using parentheses. For example, victory * ( tennis + ping-pong ) - "domestic game" first searches for documents containing tennis or ping-pong. These results must also include victory and must not contain "domestic game".
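The priority rules above can be sketched as a small recursive-descent evaluator. This is an illustration only, not the search engine's parser: the function names are hypothetical, operators are assumed to be whitespace-separated tokens (so the hyphen inside ping-pong is not treated as NOT), operator words are assumed to be upper case, and the caller supplies has_term, a function that reports whether one document contains a word or phrase.

```python
import re

# A quoted phrase (with "" as an embedded quote), a parenthesis, or a run of other text.
TOKEN = re.compile(r'"(?:[^"]|"")*"|[()]|[^\s()]+')
OPERATORS = {"AND": "AND", "*": "AND", "OR": "OR", "+": "OR", "NOT": "NOT", "-": "NOT"}


def evaluate(query, has_term):
    """Return True if a document satisfies query; AND = NOT bind tighter than OR."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def atom():
        if peek() is None:
            raise ValueError("unexpected end of query")
        token = advance()
        if token == "(":
            value = either()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if token == ")" or token in OPERATORS:
            raise ValueError(f"unexpected token {token!r}")
        if token.startswith('"'):
            if len(token) < 2 or not token.endswith('"'):
                raise ValueError("incorrect double quote syntax")
            token = token[1:-1].replace('""', '"')
        return has_term(token)

    def both():  # AND and NOT: equal priority, applied left to right
        value = atom()
        while OPERATORS.get(peek()) in ("AND", "NOT"):
            operator = OPERATORS[advance()]
            right = atom()  # parse first so the position advances even if value is False
            value = value and (right if operator == "AND" else not right)
        return value

    def either():  # OR: lowest priority
        value = both()
        while OPERATORS.get(peek()) == "OR":
            advance()
            right = both()
            value = value or right
        return value

    result = either()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


document = {"victory", "tennis"}
print(evaluate('victory * ( tennis + ping-pong ) - "domestic game"', document.__contains__))
```

Keeping term matching outside the evaluator means the same precedence logic works whether has_term does exact, fuzzy, or stemmed matching.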
Search for a phrase containing quotes: If there is a search term that includes quotation marks ("), the search term should be enclosed by another set of quotation marks ("). The search phrase must also be enclosed by a set of quotation marks. If, for example, you want to search for so called "product", first enclose "product" with a second set of quotation marks to become ""product"". Then surround the entire phrase with quotation marks. The final search query is "so called ""product""".
Search using attributes with a word or phrase: You can specify multiple attributes for each search term. These must be attached in the following order:
word [precision or matching level][case_sensitive][weight]
Precision: You can specify a matching level from 1 to 100. 100 means exact matching. As the number decreases, the matching is fuzzier. We recommend using a value greater than 60. To activate English stemming, use %STEM. Only one of the 3 attributes may be used with a search term. The default precision is 100 with no English stemming.
| Attribute | Example | Description |
| --- | --- | --- |
| %nnn | database%70 | database is searched for with a 70 percent matching level. See note. |
| %STEM | communicate%STEM | Words such as communication and communicating are also searched for, based on English inflection rules. |
Case sensitive: You can specify whether you want a case sensitive search on a word or quoted phrase. If case sensitivity is disabled when indexing, a case insensitive search is always done. The default is for a case insensitive search.
| Attribute | Example | Description |
| --- | --- | --- |
| #N | "Webserver Search Engine"#N | Webserver Search Engine and webserver Search engine will be found on a search. (case insensitive) |
| #C | Internet#C | Internet but not internet will be found on a search. (case sensitive) |
Weight: Specify the weight value preceded by "$". The weight values are evaluated relatively, therefore you should specify distinct values between important and unimportant search terms. The default value is 100.
| Attribute | Example | Description |
| --- | --- | --- |
| $nnn | "more important term"$200 + "less important term"$100 | "more important term" is treated as twice as important as "less important term" |
Examples for advanced search
emergency%100 * "security department"%60
Documents that contain "emergency" and contain "security department" with a matching level of 60% or higher are found.
( birds + nature ) * government
Documents that contain either "birds" or "nature" and also contain "government" are found.
EC#C * "member nation"
Case sensitive search (if allowed for this index) is done for "EC" while case insensitive search is done for "member nation".
internet%70#C$100 * communicate%STEM$200
This example combines many of the options using the correct order.
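The attribute order (precision, then case, then weight) lends itself to a single regular expression. The sketch below parses one search term into its parts, using the defaults stated in this document (precision 100, no stemming, case insensitive, weight 100). The function name and the returned dictionary layout are hypothetical, and the pattern is an approximation of the syntax described here rather than the engine's own grammar.

```python
import re

TERM = re.compile(
    r'^(?P<text>"(?:[^"]|"")*"|[^\s%#$"]+)'   # quoted phrase or single word
    r'(?:%(?P<precision>STEM|\d{1,3}))?'       # %nnn or %STEM
    r'(?:#(?P<case>[CN]))?'                    # #C (sensitive) or #N (insensitive)
    r'(?:\$(?P<weight>\d+))?$'                 # $nnn
)


def parse_term(term):
    """Split one search term into text, matching level, stemming, case, and weight."""
    match = TERM.match(term)
    if match is None:
        raise ValueError(f"cannot parse search term {term!r}")
    text = match["text"]
    if text.startswith('"'):
        text = text[1:-1].replace('""', '"')  # undo the doubled quotes
    precision = match["precision"]
    level = 100 if precision in (None, "STEM") else int(precision)
    if not 1 <= level <= 100:
        raise ValueError("matching level must be from 1 to 100")
    return {
        "text": text,
        "matching_level": level,
        "stemming": precision == "STEM",
        "case_sensitive": match["case"] == "C",
        "weight": int(match["weight"] or 100),
    }


print(parse_term("internet%70#C$100"))
print(parse_term("communicate%STEM$200"))
```

A term with attributes in the wrong order, such as internet#C%70, does not match the pattern and raises ValueError.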