
HTTP Server for i


iSeries Webserver Search Engine - National Language Support

This page contains the documentation that is specific to national language support (NLS). Be sure to also read iSeries Webserver Search Engine - Getting Started.


 NLS Considerations for Documents in the Index

Document CCSID Considerations

Documents that you index can be encoded in most ASCII code pages and EBCDIC CCSIDs (Coded Character Set Identifiers). Because the search engine supports only a limited number of CCSIDs, your documents might be converted to one of the supported CCSIDs during indexing. Your original documents are not changed. During indexing, each document is read, converted if necessary, and put into a form that can be used for searches. This final form is stored in various files created by the administration functions. The CCSID or code page of the first file in the document list is used for the conversion to the supported index CCSID. Therefore, it is important that all the documents in your directory or document list be encoded in the same CCSID.

If you want to include, for example, some ASCII files and some EBCDIC files in the same index, put the ASCII files in one directory and the EBCDIC files in another. You do not need to know whether a given file is ASCII or EBCDIC encoded; as long as one set of files has a different CCSID from another set, put the two sets in different directories. To determine the CCSID of an IFS file, use WRKLNK and then option 8 next to the file. To determine the CCSID of a QSYS.LIB file, enter DSPFD, supplying the library/file name that contains the members to index.
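The separation step can be sketched in Python. This is only an illustration: the file paths and CCSID values below are hypothetical, and it assumes you have already looked up each file's CCSID (for example with WRKLNK option 8) and recorded it by hand.

```python
from collections import defaultdict


def group_by_ccsid(tagged_files):
    """Group (path, ccsid) pairs so that each group holds a single CCSID.

    Each resulting group corresponds to one directory (or document list)
    that is indexed in its own create or update step.
    """
    groups = defaultdict(list)
    for path, ccsid in tagged_files:
        groups[ccsid].append(path)
    return dict(groups)


# Hypothetical inventory: paths and CCSIDs are made up for illustration.
inventory = [
    ("/docs/readme.html", 819),
    ("/docs/guide.html", 819),
    ("/docs/legacy.txt", 37),
]

for ccsid, paths in group_by_ccsid(inventory).items():
    print(ccsid, paths)
```

Each group printed here would go into its own directory before indexing.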

When you create the index, specify only the directory with the first set of files, for example those with the ASCII CCSID (819). Next, select your new index and then select Update search index. Now specify the directory or document list containing the documents encoded with an EBCDIC CCSID (37). Because documents in CCSID 819 and CCSID 37 are both indexed in CCSID 500, they can share one index.

You will not see a failure message if one directory contains documents with varying CCSID tags. However, searches will not find documents that are tagged with a different CCSID, because their characters are converted differently.
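The silent failure described above can be modeled in a few lines of Python. This is a simplified sketch, not the search engine's actual implementation: it assumes Python's `latin-1` and `cp037` codecs stand in for CCSIDs 819 and 37, and the helper names `searchable_text` and `search` are invented for the illustration.

```python
# Assumed stand-ins: Python codec names for the two CCSIDs in the example.
CODEC_BY_CCSID = {819: "latin-1", 37: "cp037"}


def searchable_text(documents):
    """Decode every document using the CCSID of the FIRST document only,
    mirroring the rule that the first file determines the conversion."""
    codec = CODEC_BY_CCSID[documents[0][1]]
    return {name: raw.decode(codec) for name, _ccsid, raw in documents}


def search(texts, term):
    """Return the names of documents whose converted text contains term."""
    return sorted(name for name, text in texts.items() if term in text)


ascii_doc = ("a.txt", 819, "search engine".encode("latin-1"))
ebcdic_doc = ("b.txt", 37, "search engine".encode("cp037"))

# Mixed in one pass: no error is raised, but b.txt is converted wrongly.
print(search(searchable_text([ascii_doc, ebcdic_doc]), "search"))

# Converted in its own pass, b.txt becomes searchable.
print(search(searchable_text([ebcdic_doc]), "search"))
```

In the mixed pass only `a.txt` is found; when `b.txt` is converted on its own, it is found as well. This is why each CCSID is indexed in a separate step.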

The following table shows all of the CCSIDs supported by the Webserver Search Engine. When you view the status of the search index, you see the CCSID used to index your documents. Documents in languages that share an index CCSID (the same table row) can all be contained in the same index, provided they are indexed separately. For example, an index can contain English, French, and German documents. When you create the index, include just the English documents. Then update with the French documents. Update again with the German documents.

If you attempt to index, say, Italian and Russian documents in the same index, an error occurs because the two languages cannot be converted to a common index CCSID. In this case, you have to create two separate indexes. See the table below.
 

Index CCSID | Code page name | Included character sets (CCSIDs)
500 | Latin 1 International | Albanian, Belgian English, Belgian French, Canadian French MNCS, Danish, Dutch, Dutch MNCS, English International, English US, Finnish, French (France), French MNCS, German (Germany), German MNCS, Icelandic, Italian, Latin 1/Open Systems, Norwegian, Portuguese (Brazil), Portuguese (Portugal), Swedish
838 | Thai | Thai
870 | Latin 2 | Croatian, Czech, Hungarian, Polish, Romanian, Serbian (Latin), Slovak, Slovenian
1025 | Cyrillic | Bulgarian, Macedonian, Russian, Serbian (Cyrillic)
1026 | Latin 5 | Turkish
875 | Greek | Greek
424 | Hebrew | Hebrew
420 | Arabic | Arabic
1112 | Baltic | Latvian, Lithuanian
1122 | Estonian | Estonian
935 | Simplified Chinese (GB) | Simplified Chinese (GB)
1388 | Simplified Chinese (GBK) | Simplified Chinese (GBK)
937 | Traditional Chinese | Traditional Chinese
5026 (930) | Japanese Katakana | Japanese Katakana
5035 (939) | Japanese Latin | Japanese Latin
1364 (933) | Korean | Korean
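
To check ahead of time whether a set of languages can share one index, the table can be turned into a lookup. The sketch below is an illustration only: it copies a subset of the rows above into a dictionary, and the function name `common_index_ccsid` is hypothetical, not part of the product.

```python
# Subset of the table above: language -> index CCSID.
INDEX_CCSID_BY_LANGUAGE = {
    "English US": 500,
    "French (France)": 500,
    "German (Germany)": 500,
    "Italian": 500,
    "Polish": 870,
    "Czech": 870,
    "Russian": 1025,
    "Bulgarian": 1025,
    "Turkish": 1026,
    "Greek": 875,
    "Thai": 838,
}


def common_index_ccsid(languages):
    """Return the shared index CCSID, or None if the languages need
    separate indexes because they map to different index CCSIDs."""
    ccsids = {INDEX_CCSID_BY_LANGUAGE[lang] for lang in languages}
    return ccsids.pop() if len(ccsids) == 1 else None


print(common_index_ccsid(["English US", "French (France)", "German (Germany)"]))
print(common_index_ccsid(["Italian", "Russian"]))
```

The first call returns 500, so those documents can share an index (added in separate create and update steps). The second returns None, which corresponds to the Italian and Russian case that requires two indexes.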

DBCS Considerations 

  • Wildcard characters in search strings are not allowed for double-byte languages, because a wildcard search is already implied for them. Strings within double-byte words can be found without wildcard characters. For example, "AccessLog" in Chinese can be found by searching for the string "Access".
  • Both the name of the index and the index directory must be specified in single-byte characters.

 More about CCSIDs and character encoding

There are several character encodings, such as ASCII and EBCDIC. PCs support only ASCII encodings; the iSeries supports mainly EBCDIC encodings but can also handle ASCII. Every character that a computer displays or prints is stored internally not as the letter we see but as a hexadecimal value. Which letter is displayed for a given hex value depends on the CCSID or code page in use.

For example, the letter A in all EBCDIC CCSIDs is hex value C1; a lowercase a is hex value 81. In ASCII CCSID 819, A has hex value 41 and a has hex value 61. Most ASCII CCSIDs or code pages assign these same values to these characters.

However, for a national character such as ü, the hex value in ASCII CCSID 819 is FC, while the hex value in ASCII CCSID 850 is 81. With this in mind, let's see why incorrect characters can be displayed when the CCSID tag of a document does not match the encoding of its contents.
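
These hex values can be reproduced with Python's built-in codecs. The mapping of CCSIDs to codec names is an assumption made for this sketch: `cp037` for EBCDIC CCSID 37, `latin-1` (ISO 8859-1) for CCSID 819, and `cp850` for CCSID 850. The helper `hex_of` is hypothetical.

```python
def hex_of(char, codec):
    """Return the hex value of a character in the given encoding."""
    return char.encode(codec).hex().upper()


for char, codec in [
    ("A", "cp037"), ("a", "cp037"),      # EBCDIC
    ("A", "latin-1"), ("a", "latin-1"),  # ASCII CCSID 819
    ("ü", "latin-1"), ("ü", "cp850"),    # national character
]:
    print(f"{char!r} in {codec}: {hex_of(char, codec)}")
```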

Let's say that your document is encoded in CCSID 819 but is tagged with CCSID 850. If you look at the document in a hexadecimal view, you see FC for the ü. However, because the document is tagged with CCSID 850, this value is interpreted using the 850 code page, in which hex value FC is a ³ (superscript 3), NOT an ü. You see a ³ for every ü. For example, the word für is displayed as f³r. The incorrect tagging can occur when files are transferred by FTP from the PC to the iSeries. If you see this type of problem, specify the correct CCSID to use with the FTP command.
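
The mistagging effect is easy to reproduce off the system. The following is a minimal sketch, again assuming that Python's `latin-1` and `cp850` codecs correspond to CCSIDs 819 and 850; the variable names are illustrative.

```python
# Bytes as stored on disk: the word encoded in CCSID 819 (ISO 8859-1).
raw = "für".encode("latin-1")
print(raw.hex().upper())  # hex view of the stored bytes

# Interpreted through the wrong tag (CCSID 850), FC becomes superscript 3.
mistagged = raw.decode("cp850")
print(mistagged)

# Interpreted through the correct tag, the original word is recovered.
correct = raw.decode("latin-1")
print(correct)
```

The stored bytes are identical in both cases. Only the code page used to interpret them changes, which is why correcting the tag fixes the display without touching the file contents.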

To see the hex values for a file in an IFS directory, use the DSPF command to display the file, then press F10 to see the hex values of the characters. See Appendix F of the International Application Development Guide for the code page tables associated with particular CCSIDs.

Typically, the invalid characters show up in the title of a document displayed in the search results. During indexing, the title is extracted and stored separately from the indexed document, and the document CCSID is used to store and retrieve the title. This is where most problems are seen, even when the document itself displays correctly.


 