
HTTP Server for i


Webcrawler - Getting Started

This page documents the setup and use of the web crawler that is part of the Webserver Search Engine.

  • Highlights of the iSeries Webserver Search Engine web crawler
  • Starting the web crawler 
  • Controlling the web crawler 
  • Special features
  • Technical Details and Troubleshooting 




 About the web crawler

The web crawler is a program that you can start from the same Search Setup forms that you use to set up your search engine. It works in much the same way you do when you enter a URL on your browser and then click on various links to go to new web pages.

The crawling program starts by retrieving the URL you provide. It downloads this web page to your system and then follows the links it finds. Each linked web page is also downloaded, until there are no more links to follow or the maximum run time you set expires.

The web crawler extends the capability for building a document list. As each file is downloaded, the local path plus the original URL is added to your document list. This document list can then be used to create a search index. Search results for this type of index display the URL where the document was originally found rather than the local copy. When you select one of these documents in your search results, you are taken to the actual page that was found during crawling.
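The pairing of local copy and original URL can be sketched in a few lines of Python. This is only an illustration, not the crawler's implementation: the function names are hypothetical, the storage layout follows the /mydocs example given under Definitions on this page, and the real on-disk format of a document list is not described here.

```python
import posixpath
from urllib.parse import urlsplit


def local_path_for(url: str, store_dir: str) -> str:
    """Return the path of the local copy of `url` under `store_dir`.

    Assumes the layout from the Definitions section: the host name and the
    URL path are appended to the storage directory.
    """
    parts = urlsplit(url)
    relative = parts.netloc + parts.path  # e.g. "www.ibm.com" + "/index.html"
    return posixpath.join(store_dir, relative)


def document_list_entry(url: str, store_dir: str) -> tuple:
    """Pair the local copy with the URL it came from (hypothetical structure)."""
    return (local_path_for(url, store_dir), url)


print(document_list_entry("http://www.ibm.com/index.html", "/mydocs"))
```

A search index built from such entries can index the local file while showing the original URL in the results.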

When you choose to build a document list by crawling web sites, the session always runs as a background task, whether it is initiated from the browser or from one of the search CL commands. A session takes at least several minutes; the actual duration depends on the maximum run time you selected and on the other attributes you specified.

The web crawler has some special features. It can go to any web site, English or non-English, and process the downloaded files correctly for indexing and searching. If a site requires authentication, you can provide the necessary setup. Because a web crawler can run for a long time and consume a large amount of system storage, you can limit the time the crawler runs, the size of the files it can download, and the amount of storage it can consume. You can also stop, pause, and resume a crawling session.

All of these features are on the Search Setup forms that are part of the HTTP Server Configuration and Administration.

 How to start the web crawler

You can start a web crawling session by using the browser forms under Search Setup or by running the CL command Start HTTP Crawler (STRHTTPCRL). In each case, you enter the starting URL such as http://www.ibm.com, a path to your document list, and a directory to store the downloaded documents.

Once you have created the document list, you can create a new search index.

To use the browser forms, do the following:

  1. From the set of forms under Search Setup, select the Build document list option.
  2. On the next form, select Build a document list by crawling web sites. You will notice that you can select to crawl one URL or a list. Using a list will be discussed later. Make sure you have selected Build the document list by crawling a URL.
  3. On the next form, give the name of the document list you want created. Change the * to the name of the list, such as /QIBM/UserData/HTTPSVR/index/crawl.DOCUMENT_LIST.
  4. Under Start crawling from this URL enter, for example, http://www.ibm.com. Be sure to enter the URL exactly as you would on a browser.
  5. Under Crawling options enter the Directory to store documents such as /QIBM/UserData/HTTPSVR/index/doc.
  6. You can use the default for all other fields unless your system is behind a firewall. In this case, enter your Proxy server for HTTP and your Proxy port for HTTP.
  7. Press Apply.
  8. A message will appear saying that the document list will be built in a background task.
  9. To see if crawling is completed, select option Work with document list status.
  10. On the drop down box, select /QIBM/UserData/HTTPSVR/index/crawl.DOCUMENT_LIST.
  11. Under Document list activity, there is a field Active request. If it says there is no active request for this document list, you know the crawling is completed. This form also shows you other information about the document list, such as how many documents were actually downloaded and the Message received from the last active request.
  12. Once crawling is no longer active, you can set up the search for your web site using this document list to create a search index. Select option Create search index and find your new document list in the drop down box. See also Setting up the search engine to run on your own web site for additional information.

To perform the same tasks with the CL commands, do the following:

  1. Enter STRHTTPCRL and press F4.
  2. Enter the Option *CRTDOCL. Press Enter.
  3. Enter the name of the document list file you want created, such as /QIBM/UserData/HTTPSVR/index/crawl.DOCUMENT_LIST.
  4. Use the default Document storage directory /QIBM/USERDATA/HTTPSVR/INDEX/DOC.
  5. Enter a URL, for example, http://www.ibm.com. Be sure to enter the URL exactly as you would on a browser.
  6. You can use the default for all other fields unless your system is behind a firewall. In this case, enter your Proxy server for HTTP and your Proxy port for HTTP.
  7. Press Enter.
  8. No message will appear unless an error has occurred.
  9. To see if crawling is completed, enter CFGHTTPSCH and press F4.
  10. Enter the Option *PRTDOCLSTS. Press Enter.
  11. Enter document list /QIBM/UserData/HTTPSVR/index/crawl.DOCUMENT_LIST.
  12. Press Enter.
  13. Enter WRKSPLF.
  14. Display the file called QPZHASRCH which contains the status of your document list.
  15. Under Document list activity, there is a field Active request. If it says there is no active request for this document list, you know the crawling is completed. This file also shows you other information about the document list, such as how many documents were actually downloaded and the Message received from the last active request.
  16. Once crawling is no longer active, you can set up the search for your web site using this document list to create a search index. Use the CL command CFGHTTPSCH and enter option *CRTIDX.
  17. Enter your new document list and any other attributes you want to use. See also Setting up the search engine to run on your own web site for additional information.

Note: Equivalent function is available through either the browser forms or the CL commands for all search administrative functions. See Browser and CL command interface for iSeries Webserver Search Engine and Crawler.

Objects used for crawling

For a one-time crawl, you can enter a single URL and some session details. However, for crawling multiple times or crawling multiple URLs in one session, you can create a URL object that can be specified when you start the crawling session. You will also need to create an options object.

  • URL object: The URL object contains the language of the pages you are crawling, the directory for storage of downloaded files, and a list of URLs to crawl.
  • Options object: The options object contains proxy servers and ports as well as values that restrict the storage and run time of a crawling session.
  • Validation list object: The validation list contains a list of URL domain filters with the associated userid and password for site authentication.
  • Activity log file: The activity log file contains all the attributes of the crawling session.

 How to crawl a list of URLs (Create a URL object)

If you want to crawl several URLs in one session, create a URL object either from the browser form Build URL object or using command CFGHTTPSCH OPTION(*CRTURLOBJ). A URL object contains a list of URLs plus a few other web crawling attributes.

A URL object contains the following:

  • The directory to use to store the downloaded documents
  • The language of the web sites you are crawling
  • A URL list in which each entry contains the following:
    • URL
    • URL domain filter
    • Maximum crawling depth
    • Whether robot exclusion is supported
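To illustrate what supporting robot exclusion means in general, here is a minimal sketch that uses Python's standard urllib.robotparser module. It is not how the IBM crawler is implemented; the robots.txt content, the user agent name, and the URLs are made up for the example, and robot exclusion META tags are not covered.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list) -> list:
    """Return only the URLs that the given robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
]
print(allowed_urls(ROBOTS_TXT, "example-crawler", candidates))
```

A crawler that honors robot exclusion would skip the second URL because it falls under the disallowed /private/ path.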

If you want to start a crawling session using a URL object, you must also create an options object which contains general crawling attributes.

 How to create an options object

Create an options object that can be re-used for crawling sessions. It contains general system related values that are not specific to the URLs you are crawling. If you want to use an options object for a crawling session, you must also create a URL object.

An options object contains the following:

  • Proxy server for HTTP
  • Proxy port for HTTP
  • Proxy server for HTTPS
  • Proxy port for HTTPS
  • Maximum file size to download
  • Maximum storage for files
  • Maximum threads for the crawler to run
  • Maximum run time for the crawling session
  • An optional activity log file for crawling information
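The way the size, storage, and run-time limits interact can be sketched as follows. This is an illustration under stated assumptions rather than the crawler's actual logic: the class and field names are hypothetical, the units (KB for file size, MB for total storage) come from the Definitions section on this page, and 1 KB is assumed to be 1024 bytes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CrawlLimits:
    """Hypothetical container for the restricting values of an options object."""
    max_file_kb: int        # maximum size of a single downloaded file, in KB
    max_storage_mb: int     # maximum total storage for all files, in MB
    max_run_seconds: float  # maximum run time of the session, in seconds


@dataclass
class CrawlBudget:
    limits: CrawlLimits
    bytes_stored: int = 0
    started: float = field(default_factory=time.monotonic)

    def may_download(self, size_bytes: int) -> bool:
        # Files greater than the specified size are not downloaded.
        return size_bytes <= self.limits.max_file_kb * 1024

    def record(self, size_bytes: int) -> None:
        # Keep a running total of the bytes downloaded for all files.
        self.bytes_stored += size_bytes

    def should_stop(self) -> bool:
        # Crawling ends once the storage limit is met or the run time is used up.
        storage_met = self.bytes_stored >= self.limits.max_storage_mb * 1024 * 1024
        time_up = time.monotonic() - self.started >= self.limits.max_run_seconds
        return storage_met or time_up


budget = CrawlBudget(CrawlLimits(max_file_kb=100, max_storage_mb=1, max_run_seconds=3600))
print(budget.may_download(50 * 1024))   # a 50 KB file is within the 100 KB limit
print(budget.may_download(200 * 1024))  # a 200 KB file exceeds it
```

Keeping these values in one reusable object mirrors the reason an options object exists: the limits describe the system, not the URLs being crawled.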

 How to create a validation list object

Create a validation list object if you intend to crawl sites that require a userid and password. To create a validation list object, use the Build validation list form or the command CFGHTTPSCH OPTION(*CRTVLDL). Enter the URL domain filter along with the authenticating userid and password. The userid and password will be used for any URLs within the domain specified. Multiple entries can be added to the list. Passwords are not displayed on the form. The validation list object is created in library QUSRSYS and is owned by the signed-on user. Public use is excluded.

The URL domain filter must be entered in the form www.ibm.com. Do not begin or end the domain with a slash, and do not include port numbers. No error message is sent for an invalid entry; authentication simply fails.

Invalid entry: mysystem:2001 (do not add a port)
Valid entry  : mysystem
Invalid entry: www.ibm.com/server
Valid entry  : www.ibm.com
Invalid entry: http://www.ibm.com
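Because invalid entries fail silently, it can help to check an entry before adding it to the validation list. The following Python sketch encodes the rules as they appear in the examples above. The function name is hypothetical, and the check is an assumption that is slightly stricter than the stated rule: it rejects any slash or colon, which also rules out paths and protocol prefixes.

```python
def is_valid_domain_filter(entry: str) -> bool:
    """Check a URL domain filter entry against the rules shown in the examples."""
    if not entry:
        return False
    if "/" in entry:   # no leading or trailing slash, no path, no "http://"
        return False
    if ":" in entry:   # no port number, no protocol prefix
        return False
    return True


for candidate in ("mysystem:2001", "mysystem", "www.ibm.com/server",
                  "www.ibm.com", "http://www.ibm.com"):
    status = "Valid" if is_valid_domain_filter(candidate) else "Invalid"
    print(f"{status} entry: {candidate}")
```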

When you are ready to start your crawling session, be sure to enter the name of the validation list you have created. See also Validation list object.

 Additional details

  • Running and controlling the crawler

    The programs that run - The crawler can be started, stopped, paused or resumed. When the crawler is started from the browser or from the STRHTTPCRL command, two jobs will be started in the QBATCH subsystem, QJVACMDSRV and QZHASTRCRL. When the crawler is ended or paused, that is, stopped temporarily, two jobs will be started in the QBATCH subsystem, QJVACMDSRV and QZHAENDCRL. When crawling is resumed for a paused session, two jobs will be started in the QBATCH subsystem, QJVACMDSRV and QZHARSMCRL.

    Start crawling - Crawling can be started using the STRHTTPCRL command or from the browser using the Build document list form. It takes a while for the crawler to be initiated. The crawler attempts to connect to a URL, based on the properties that were set up (such as proxy servers), then determines whether a file can be downloaded, based on robot files, the type of file (only text and HTML are downloaded), the size of the file, and the amount of storage allocated for the downloaded files. Duplicate files are not stored. Some information about the session is written to the activity log file.

    Stopping the crawl - Crawling can be stopped using the ENDHTTPCRL command or from the browser using the Work with document list status form. Select the document list that is being built. Buttons at the end of the form let you either stop or pause the crawler. If no buttons are displayed, the crawling is not in a valid state for ending.

    Pausing the crawl - Crawling can be paused using the ENDHTTPCRL command or from the browser using the Work with document list status form. Select the document list that is being built. Buttons at the end of the form let you either stop or pause the crawler. If no buttons are displayed, the crawling is not in a valid state for pausing. When crawling is paused, the URLs currently being crawled are stored.

    Resuming the crawl - Crawling can be resumed using the RSMHTTPCRL command or from the browser using the Work with document list status form. Select the document list that is being built. A button at the end of the form lets you resume crawling. If a Resume button is not displayed, the crawling is not in a valid state for resuming. When crawling is resumed, it begins with the URL where it was paused.

  • Browser forms for crawling

    This is a list of the main browser forms specific to using the web crawler. For a complete list, see Browser and CL command interface for iSeries Webserver Search Engine and Crawler.

    • Build document list - select to build a document list by crawling remote web sites. You can then choose one of two forms: either specify all the details to crawl one web site, or crawl using your pre-defined URL and options objects.
    • Build URL object - create an object containing a list of URLs to crawl.
    • Build options object - create an options object to re-use for crawling sessions. This contains crawling attributes such as proxy server for HTTP and proxy port for HTTP.
    • Build validation list - create a validation list object that contains a URL and associated userid and password for authentication.
    • Work with document list status - Use this form to find out if the crawling is complete, the number of documents in the document list once the crawl is complete, and the final message sent for the crawling session. Buttons displayed at the end of the form allow you to pause, resume, or end a crawling session, depending on the current status of the crawl.
  • CL commands for crawling

    This is a list of the CL commands related to the web crawler. See also Browser and CL command interface for iSeries Webserver Search Engine and Crawler.

    • STRHTTPCRL - Start HTTP Crawl
    • ENDHTTPCRL - End or Pause HTTP Crawl
    • RSMHTTPCRL - Resume HTTP Crawl
    • CFGHTTPSCH - Configure HTTP Search - Use this command to create a URL, options, or validation list object. Also use the command to print the status of the crawling session to a spool file. The information printed is the same as what you see when you select Work with document list status from the browser.
  • Definitions

    This section contains more details about values set for your crawling sessions.

    • URL object:  A URL object contains a list of URLs to crawl plus other crawling values related to the URLs. It is useful when you want to crawl multiple URLs in one session. Use in combination with an options object.
    • Options object:  An options object contains general crawling attributes. Use in combination with a URL object.
    • Validation list object: This validation list object can only be used by the web crawler. It contains a list of URL domain filters with the associated userid and password. A validation list should be used if you are crawling sites that require a userid and password. The validation list will be created in library QUSRSYS with the name QZHAxxxxxx where xxxxxx is the name you entered on the form or CFGHTTPSCH command. Passwords will be stored in the validation list object in encrypted form. In order to store and decrypt the passwords for authentication, the system value QRETSVRSEC (Retain Server Security) must be set to 1 before the validation list is created. If the system value is changed from 1 to 0 once the validation list exists, the encrypted passwords will be removed and authentication will fail. In this case, the system value will need to be reset to 1 and the validation list deleted and created again.
    • Directory to store documents: This is the directory where files downloaded during crawling will be stored. For example, if the directory /mydocs is specified, file www.ibm.com/index.html found during crawling will be stored as /mydocs/www.ibm.com/index.html. The directory will be created if it does not exist.
    • Document language: This specifies the language of the documents that are downloaded. If you are crawling a Japanese web site, for example, the document language selected should be Japanese. The language selection is similar to the character set or encoding selections found in browsers. Make sure the language selected matches the language of the files that will be downloaded so that documents are indexed in the correct CCSID.
    • URL list: This list is a collection of specific information about each of the URLs that you want to crawl. Each entry contains a URL, a domain filter, the maximum crawling depth, and whether robot exclusion should be supported.
    • URL: A URL should be in the form, for example, http://www.ibm.com. The crawler will start running at this URL.
    • URL domain filter: You can limit the domain of the crawler to a specific domain such as ibm.com. This will prevent the crawler from going to linked sites in other domains.
    • Maximum crawling depth: This refers to the depth of links from the starting URL. The starting URL is at depth 0. The links on that page are at depth 1. The links on the depth 1 pages are at depth 2, and so on.
    • Support robot exclusion: To control crawlers, sites will attach a robot file or META tag that will tell crawlers that they do not want certain links to be followed. It is considered good crawling manners to support robots. If you select to support robot exclusion, any site or pages that are referenced in robot exclusion META tags or files will not be crawled.
    • Proxy server for HTTP: This is the proxy server for HTTP requests.
    • Proxy port for HTTP: This is the port number for the above proxy server. A port is required if a proxy server for HTTP is specified.
    • Proxy server for HTTPS: This is the proxy server for HTTPS requests.
    • Proxy port for HTTPS: This is the port number for the above proxy server. A port is required if a proxy server for HTTPS is specified.
    • Maximum file size to download: This is the maximum size for a downloaded file (in KB). Files greater than the specified size will not be downloaded to your system.
    • Maximum storage for files: This is the maximum storage space for all downloaded files (in MB). The crawler keeps track of the total bytes downloaded for all files. Once the storage limit is met, crawling ends.
    • Maximum threads: This is the maximum number of threads used during web crawling. You should set this value based on the system resources that are available.
    • Maximum run time: This is the maximum amount of time the crawling session remains active, in hours and minutes. When the crawling session has run for this length of time, it will stop.
    • Activity log file: This file contains information about the crawling session plus any errors that occur. An activity log file can be used by only one crawling session at a time.
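The depth numbering and the domain filter described above can be illustrated with a small breadth-first traversal. This Python sketch works on a made-up, in-memory link graph instead of real web pages. The URLs, the function names, and the suffix-based host matching are assumptions for the example and do not describe the crawler's internals.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "http://www.ibm.com/": ["http://www.ibm.com/a", "http://www.example.org/x"],
    "http://www.ibm.com/a": ["http://www.ibm.com/b"],
    "http://www.ibm.com/b": ["http://www.ibm.com/c"],
}


def in_domain(url: str, domain_filter: str) -> bool:
    """Assumed rule: the host equals the filter or is a subdomain of it."""
    host = urlsplit(url).hostname or ""
    return host == domain_filter or host.endswith("." + domain_filter)


def crawl_depths(start: str, domain_filter: str, max_depth: int) -> dict:
    """Return each reachable URL with its depth, honoring filter and depth limit."""
    depths = {start: 0}  # the starting URL is at depth 0
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not follow links beyond the maximum depth
        for link in LINKS.get(url, []):
            if link not in depths and in_domain(link, domain_filter):
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths


print(crawl_depths("http://www.ibm.com/", "ibm.com", max_depth=2))
```

With a maximum depth of 2 and the filter ibm.com, the page at /c (depth 3) and the link to www.example.org are never visited.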
