ANET.at HomepageSearchEngine (HSE) 3.4+ (released on November 9, 2002)
(c) 1999-2002, ANET.at

Core software bundle required for all available version types, including the free time-limited Trial version of the Pro edition (expires on January 31, 2003).

Homepage: http://www.HomepageSearchEngine.com/ (English)
       or http://www.HomepageSearchEngine.de/ (German/Deutsch)


R E F E R E N C E   M A N U A L


In the current package, support for the following 24 languages is included:
Arabic, simplified Chinese, traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, Romanian, Russian, (Latin) Serbian, Spanish, Swedish, Thai and Turkish

________
Contents

 1. About this software
 2. System requirements
 3. Which package do I need?
 4. File structure of this package
 5. Quick install guide (for the advanced or impatient user)
 6. Installation Manual
     1. Executable file ("HomepageSearchEngine.exe" on Windows or "HomepageSearchEngine.cgi" on Unix) and libraries
     2. Configuration file ("hse.ini")
     3. IMPORTANT - Admin Area, Creating an Admin account and the Users file
     4. Central Style Sheet definition file ("/hse/HomepageSearchEngine.css")
     5. Static HTML template file ("hse_template.html")
     6. Dynamic HTML template file
     7. Language files ("hse_lang.txt" and "hse_help.txt")
     8. Language and Configuration sub directories / delivery parameters ("lang" and "conf")
     9. Setting the Language Locale - the locale-enabled HomepageSearchEngine Executable
    10. Shell Executable: creating file-lists, indexes and using other tools (Pro edition only)
    11. IMPORTANT - Testing your installation: search for "list:files"
    12. Excluding specific sections within HTML files from being searched
    13. Options to call the search engine
    14. Optional turn from the Trial to the Freeware version with the public key ("hse_key.cgi")
 7. Special Shell Executable features and using the cronjob script (Pro edition only)
     1. Spidering and URL Grabbing: Searching dynamic sites or the content of any URLs
     2. Searching different websites hosted on the same computer
 8. Updating from a previous version
     1. Updating from v3.1
     2. Updating from v3.2
     3. Updating from v3.21
     4. Updating from v3.3 or 3.31
     5. Updating from v3.32
     6. Updating from v3.33
     7. Updating from v3.34
     8. Updating from v3.35
     9. Updating from v3.36
    10. Updating from v3.37
    11. Updating from v3.4
    12. Clean installation and updating from versions earlier than 3.1
 9. Debugging
10. Known issues
11. To-Do's
     1. Internationalisation
     2. New features
12. Support
13. Credits
14. History of version changes ("change log")
15. License agreement

______________________
1. About this software

This software is intended to search the real content of HTML pages (both static and dynamically generated ones) and all other text files, including special formats such as RTF. The resulting output is pure, valid XHTML 1.0. The found files, or the content of any other URL, can be viewed with all matches highlighted in a desired style.

The main purpose is to search medium sized websites on the inter- or intranet, but it may also be used to search documentation or other content written in HTML on your local hard disk or even on a CD-ROM.

______________________
2. System requirements

Webspace on a Win32 or supported Unix system with the right to run your own CGI programs. If the webspace is remotely hosted, access via FTP/SFTP is required for installation. The webspace can also be on the local hard disk or on a CD-ROM, provided that webserver software and a webbrowser are or will be installed. On large websites, and for optimal use of the indexing functionality, shell access (usually via Telnet/SSH) is recommended.

HSE needs about 3-4 MB of memory for basic actions. Once HSE is installed, you can determine a comparable value of its memory usage by executing it with the 'memory' command.
Executing

    ./HomepageSearchEngine.cgi memory

via the web based Admin Console, running on a GNU/Linux 2.4 (i686) system, resulted in 2740 KB for the current version.

___________________________
3. Which package do I need?

Be sure to download the package containing the latest available version supporting your target platform from
http://www.HomepageSearchEngine.com/download_en.phtml

Packages for both Windows and Unix are available. They differ only in the executable file and its associated libraries (within the "cgi-bin/hse" sub directory).

If you want to use the search engine on Unix, it is strongly recommended to first run the "platform.cgi" script found in the "cgi-bin/platform" sub directory of this package or at
http://www.HomepageSearchEngine.com/_download/platform.tgz

The distributed package is called "HSEn.n_Platform.ext"

  HSE............."HomepageSearchEngine"
  n.n.............version number (eg. "3.4+")
  Platform........platform the webserver is running on (target platform).
                  Currently, the following 10 platforms are supported:

                  Windows platforms:
                  ------------------
                  "Win32".......Windows 32bit on Intel x86 processors
                                (Microsoft Windows XP, 2000, NT, Me, 98, 95)

                  Unix and compatible platforms:
                  ------------------------------
                  "Linux".......GNU/Linux (aka "Linux") v2.x on Intel x86 processors (i386/i586/i686)
                                (all current glibc 2 based distributions like Caldera OpenLinux, Debian, Mandrake, Red Hat,
                                Slackware, Sun RaQ3 or higher boxes, SuSE, TurboLinux, XandrOS)
                                Some older Linux distributions may need the additional "Linux-old" package
                                (including all libc 5 based distributions - the platform.cgi script will determine that).
                  "SunOS".......Sun Solaris
                                Sun SunOS v5.5 or higher including Solaris v2.5 or higher on Sun SPARC (sun4x series) processors
                  "FreeBSD".....FreeBSD v3.x or higher on Intel x86 processors
                  "BSD-OS"......BSDi BSD/OS v3.x or higher on Intel x86 processors
                  "HP-UX".......HP HP-UX (Hewlett Packard) v10.x or higher on HP PA-RISC (9000) processors
                  "AIX".........IBM AIX (International Business Machines) v4.x or higher on IBM processors Axx
                  "DEC-OSF1"....DEC OSF/1 (Digital Equipment Corporation) including Digital UNIX and
                                (currently called) Compaq Tru64 UNIX v4.0 or higher on DEC alpha processors
                  "IRIX"........SGI IRIX64 (Silicon Graphics, Inc.) v6.5 or higher on SGI IP2x (IP27 and compatible) processors
                                (eg. Rapidsite systems)
                  "MacOSX"......Apple MacOS X v10.x (with its Darwin core system) on Power Macintosh (PPC) processors

  ext.............Filename extension:

                  for Windows target platforms:
                  -----------------------------
                  "zip".....ZIP-compressed file
                            You can unpack it using WinZip or a similar common program (WinRAR, 7-Zip etc.).

                  for Unix target platforms:
                  --------------------------
                  "tgz".....TapeArchive (tar) format GNU-Zip (gzip) compressed file
                            If you work on a Windows machine you can also unpack it using WinZip.
                            You can also unpack it directly on the Unix machine by typing the following commands:

                                gzip -d HSEn.n_Platform.tgz
                                tar -xvf HSEn.n_Platform.tar

                            (where "n.n" and "Platform" have to be replaced by the real strings).
                            Under MacOS, StuffIt Expander can be used to unpack the package.

Make sure to unpack the package including sub directories, without truncating long file names and preserving the case of the filenames.
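For readers who prefer a script over typing the commands by hand, the naming scheme and the unpacking step above can be sketched in Python. This is only a minimal sketch under stated assumptions: `package_filename` and `unpack` are hypothetical helper names that are not part of HSE, and Python 3 with its standard `tarfile` and `zipfile` modules is assumed to be available on the machine where you unpack.

```python
import tarfile
import zipfile
from pathlib import Path


def package_filename(version: str, platform: str) -> str:
    """Build "HSEn.n_Platform.ext" following the naming scheme described above."""
    # Only the Windows package is a ZIP file; all Unix packages are .tgz files.
    ext = "zip" if platform == "Win32" else "tgz"
    return f"HSE{version}_{platform}.{ext}"


def unpack(archive: str, target_dir: str) -> None:
    """Unpack a package into target_dir, keeping sub directories and filename case."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    if archive.endswith(".zip"):
        with zipfile.ZipFile(archive) as package:
            package.extractall(target_dir)
    else:
        # Mode "r:gz" combines "gzip -d" and "tar -xvf" in a single step.
        with tarfile.open(archive, "r:gz") as package:
            package.extractall(target_dir)
```

For example, `package_filename("3.4+", "Linux")` yields "HSE3.4+_Linux.tgz", which could then be passed to `unpack` together with a target directory of your choice.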
Currently, the following packages are available:

  "HSE3.4+_Win32.zip".......HomepageSearchEngine version 3.4+ for Windows 32bit platforms
  "HSE3.4+_Linux.tgz".......HomepageSearchEngine version 3.4+ for GNU/Linux platforms
  "HSE3.4+_Linux-old.tgz"...HomepageSearchEngine version 3.4+ add-on for old GNU/Linux platforms
  "HSE3.4+_SunOS.tgz".......HomepageSearchEngine version 3.4+ for Sun Solaris platforms
  "HSE3.4+_FreeBSD.tgz".....HomepageSearchEngine version 3.4+ for FreeBSD platforms
  "HSE3.4+_BSD-OS.tgz"......HomepageSearchEngine version 3.4+ for BSDi BSD/OS platforms
  "HSE3.4+_HP-UX.tgz".......HomepageSearchEngine version 3.4+ for HP HP-UX platforms
  "HSE3.4+_AIX.tgz".........HomepageSearchEngine version 3.4+ for IBM AIX platforms
  "HSE3.4+_DEC-OSF1.tgz"....HomepageSearchEngine version 3.4+ for DEC OSF/1 platforms
  "HSE3.4+_IRIX.tgz"........HomepageSearchEngine version 3.4+ for SGI IRIX64 platforms
  "HSE3.4+_MacOSX.tgz"......HomepageSearchEngine version 3.4+ for Apple MacOS X platforms

Note that support for the platforms "GNU/Linux-mips", "OpenBSD" and "NetBSD" has been discontinued as of version 3.4+, since these platforms are rarely used for running a web server. If you need a package for one of these platforms, you have to use version 3.4.

The latest packages for the most common platforms are also always available at the following direct download URLs:

  http://www.HomepageSearchEngine.com/_download/HSE_Win32.zip     for Windows 32bit
  http://www.HomepageSearchEngine.com/_download/HSE_Linux.tgz     for GNU/Linux
  http://www.HomepageSearchEngine.com/_download/HSE_SunOS.tgz     for Sun Solaris
  http://www.HomepageSearchEngine.com/_download/HSE_FreeBSD.tgz   for FreeBSD

_________________________________
4. File structure of this package

The package contains 3 main directories whose contents go into different locations on your server machine, according to their nature:

  1. the webserver's script (cgi-bin) directory
  2. the webserver's document root directory
  3. your home directory (outside any directory accessible by the webserver)

  1. + cgi-bin          CGI applications. To be put into the webserver's script (cgi-bin) directory.
     |  |
     |  + hse           HSE's program (main) directory - corresponds to the URL "/cgi-bin/hse"
     |  |
     |  + platform      Platform Detector; an optional tool to find out which package you need on a Unix platform
     |
  2. + htdocs           HTML (web) documents. To be put into the webserver's document root directory.
     |  |
     |  + hse           HSE's web documents directory - corresponds to the URL "/hse"
     |
  3. + tools            Tools. Can be put into the user's home directory (outside any directory accessible by the webserver)
        |
        + hse           HSE's non-web directory

HSE's program directory ("cgi-bin/hse") contains the platform specific executable file and some associated libraries, as well as a bundle of platform independent files. The filename extension of the executable file is ".exe" in the Windows package and ".cgi.bin" in the Unix package (to be renamed to ".cgi" once it resides on the server). For Windows, the libraries have the ".dll" extension. For a Unix platform, they have one of the following extensions:

  .so      for AIX, DEC-OSF1, FreeBSD, IRIX, Linux, SunOS
  .o       for BSD-OS
  .sl      for HP-UX
  .bundle  for MacOSX

___________________________________________________________
5. Quick install guide (for the advanced or impatient user)

If you host your site on a Unix platform and have shell access to this machine, you can follow these quick instructions. If you don't understand them, read through the Installation Manual (section 6) instead.

We assume that you have uploaded the matching package into your home directory, that your web document root directory is "~/httpd/htdocs" and that your script directory is "~/httpd/cgi-bin".

(1) On the shell, go into your home directory.
    Unpack and install the package by entering

        gzip -d HSEn.n_Platform.tgz
        tar -xvf HSEn.n_Platform.tar
        cd HSEn.n_Platform
        mv cgi-bin/hse ~/httpd/cgi-bin/hse
        mv htdocs/hse ~/httpd/htdocs/hse
        mv tools/hse ~/hse
        cd ~/httpd/cgi-bin/hse
        chmod 755 HomepageSearchEngine.cgi.bin
        mv HomepageSearchEngine.cgi.bin HomepageSearchEngine.cgi

    After installing, you may want to remove the "HSEn.n_Platform.tar" file and the "HSEn.n_Platform" directory.

(2) Open "hse.ini" and configure the 2 directives in its sections 1.1 and 1.2.

(3) Call the URL to HSE,
    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.cgi
    and follow the online instructions (regarding the Admin account and HTML template).

(4) "Fine tune" the .ini file (descriptions are included in the file itself).

(5) Test your configuration by calling HSE's URL again and searching for "list:files".

______________________
6. Installation Manual

6.1 Executable file ("HomepageSearchEngine.exe" on Windows or "HomepageSearchEngine.cgi" on Unix) and libraries
---------------------------------------------------------------------------------------------------------------

Win32 users on IIS, please read http://www.HomepageSearchEngine.com/iis_en.phtml first!

Determine a target installation directory on your server. This must be your script directory (usually called "cgi-bin") or a sub directory of it (it is recommended to create a new sub directory called "hse"). This program (main) directory of HSE on the server (usually "/cgi-bin/hse") will be referred to as the "installation directory".

All required files reside in the "cgi-bin/hse" sub directory of the distributed package. Upload the (executable) file "HomepageSearchEngine.exe" and all .dll files (libraries found in a Windows package) or "HomepageSearchEngine.cgi.bin" and all .so / .o / .sl / .bundle files (libraries found in a Unix package), respectively, into the target installation directory.
Make sure that all these files are uploaded in binary mode (normally you don't need to care about the mode, since the file extensions should force the correct one). On Unix, rename "HomepageSearchEngine.cgi.bin" to "HomepageSearchEngine.cgi" afterwards and chmod it to 755 (rwx r-x r-x).

General note on file permissions on Unix: Normally, you only need to set the correct permissions for the executable file. All other files should be readable by the executable without any change. On some server configurations, however, it is required to chmod all these files to 644 (rw- r-- r--).

Point your webbrowser to the URL of HomepageSearchEngine's executable (the file "HomepageSearchEngine.exe" or "HomepageSearchEngine.cgi", respectively), eg. ("tld" stands for a top level domain such as "com")

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe

and you should see a message asking you to upload the configuration file.

6.2 Configuration file ("hse.ini")
----------------------------------

Edit the file "hse.ini" by setting the proper values for its directives. Here is an overview of all its directives with the sections they are assigned to and their default values:

      directive:                      default value:

The values in the following section are the only ones that *must* be checked and probably edited:

(1) Base directory
      (1.1) basepath                  ../../
      (1.2) baseurl                   /

All following settings are optional. You may want to keep the default values at first.
      (1.3) cgiurl

(2) Files ex-/including
      (2.1) exclude_dirs              _* cgi-bin hse
      (2.2) ban_list                  /*private*/ /robots.txt /Thumbs.db .log .BAK .css .js .cgi
      (2.3) search_always             /a_private_directory/a_public_file.html

(3) Outfit of the input form
      (3.1) formtable_width           475
      (3.2) formtable_border-color    black
      (3.3) formtable_background-color  #dadada
      (3.4) formtable_background-image
      (3.5) formtable_alignment       left
      (3.6) formtable_input-size      38
      (3.7) helpwindow_width          620
      (3.8) helpwindow_height         690

(4) International settings
      (4.1) charset                   iso-8859-1
      (4.2) date_format               M D, Y
      (4.3) decimal_sep               .
      (4.4) dir                       ltr
      (4.5) locale                    (system's default Locale string - only recognized by HomepageSearchEngine-lc)

(5) Security tuning
      (5.1) debug_level               0
      (5.2) max_found_files           1000
      (5.3) cgi_timeout               25
      (5.4) allowed_referer_sites

All following settings are for advanced users and are not effective in the (free) Light edition.

(7) Categories
      (7.1) categories_nr             1 (in the Light edition: "none")
      (7.2) categories_nameNR
      (7.3) categories_dirNR
      (7.4) categories_sourceNR

(8) Results pages customizing
      (8.1) template_url
      (8.2) results_global            search_string + options + time + summary + engine-links: 'Query the entire web using ' Google.com
      (8.3) results_details           icon:custom16x16 + url + size + matches + update
      (8.4) results_descriptions      250 characters + 1 matches (40)
      (8.5) highlight-style           background-color:yellow
      (8.6) target
      (8.7) results_href              highlightmatches + gotofirstmatch + maxsize:100 (in the Light edition: "none")
      (8.8) results_previous_img      black
      (8.9) results_next_img          black

Each setting must follow the syntax "directive = value" and stand on a line of its *own*. Descriptions of all directives, including possible values and examples, are in the .ini file itself. Comments are allowed in lines beginning with the semicolon character (";").

Edit the configuration file with a text editor and save it. It can be saved in DOS, Unix or Mac text format.
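The "directive = value" syntax described above is simple enough to check with a few lines of Python before uploading the file. The following is a minimal sketch, not HSE's own parser: the function name `parse_hse_ini` is hypothetical, and the handling of blank lines, of leading whitespace before ";" and of lines without "=" are assumptions made for illustration only.

```python
def parse_hse_ini(text: str) -> dict:
    """Collect "directive = value" pairs; lines beginning with ";" are comments."""
    settings = {}
    # splitlines() accepts DOS (CR LF), Unix (LF) and Mac (CR) line endings alike.
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        name, separator, value = line.partition("=")
        if not separator:
            continue  # assumption: lines without "=" are simply skipped here
        # Split only at the first "=", so values containing "=" stay intact.
        settings[name.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    sample = "; base directory\nbasepath = ../../\nbaseurl = /\n"
    print(parse_hse_ini(sample))
```

Such a check can catch two common mistakes: a directive that shares its line with another one, and a comment that does not start with ";".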
Then upload it into the installation directory (the same directory where the executable file resides) or into another configuration directory (described in section 6.8 below). If you point your webbrowser to the URL of HomepageSearchEngine's executable again, you should see the first graphical screen, guiding you through the next steps.

First, only configure section 1 - Base directory. Later, you should "fine tune" the .ini file. Especially on larger websites, it is recommended to use the categories feature (defined in section (7) of the .ini file) to split your site into several categories, each containing no more than a few MB of text. Test the categories setup by searching for "list:files" in each category without *and with* the "Search text of Non-HTML files" checkbox switched on. This will also show you whether you want to exclude some directories and files by modifying the "exclude_dirs" and "ban_list" values.

6.3 IMPORTANT - Admin Area, Creating an Admin account and the Users file
------------------------------------------------------------------------

Now a link to the Admin Area appears. This is

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe?admin

(or equivalent). You should follow this link now, because the first time this URL is accessed you will be asked to create a username/password pair for an Admin account. Once the first account has been created, you can log in with it in the future to access administration tools such as a Console for running the HomepageSearchEngine Shell Executable. You may also create additional Admin accounts. So make sure not to forget your Admin login data!

The created username/password pairs are stored in a text file called "hse_users.cgi". Although this users file has the extension ".cgi", it is not executable. The purpose of its "false" extension is to prevent the file from being read, for higher security.
The passwords are encrypted one-way with the DES algorithm (they cannot be decrypted) and work on both Windows and Unix platforms. The users file has the same format as the authUserFile that the .htaccess method uses for protecting directories. Therefore, you can also use the Admin Area to create such authUserFiles.

If you want to disable access to the Admin Area for security or any other reasons, just copy an *empty* file called "hse_users.cgi" into the installation directory.

6.4 Central Style Sheet definition file ("/hse/HomepageSearchEngine.css")
-------------------------------------------------------------------------

Upload the .css file found in the "htdocs/hse" directory into HSE's web documents directory (/hse). This central Style Sheet definition file is required by the HTML template files described in sections 6.5 and 6.6 below.

6.5 Static HTML template file ("hse_template.html")
---------------------------------------------------

You should then upload "hse_template.html", found in "cgi-bin/hse", into the installation directory. After you refresh your webbrowser at the search engine's URL, the upper and lower parts of the page will have changed. Edit this file to fit your desired design and upload it again.

In its head there is a reference to the central Style Sheet definition file mentioned above. Be sure that its URL ("/hse/HomepageSearchEngine.css" by default) points to the location where you have uploaded the file. You will then see the styles that affect all elements which HSE creates on the results pages. You may want to edit the style sheet.

Note that the design always stays the same, since this template produces static HTML. This is the easiest way and may be sufficient for most webdesigners. If you are a more demanding webmaster, you may want to use a dynamic HTML template instead (see next section).

The border between the upper and lower part is marked by a line consisting of

Never remove that line!

6.6 Dynamic HTML template file
------------------------------

As an alternative to the static HTML template, some sites require a dynamic one in order to use SSI, PHP or any other server parsed script language. For this purpose, put a template file into HSE's web documents directory (/hse). You can name it as you like, as long as it has the extension required by your server (eg. .shtml for SSI or .phtml or .php for PHP). Make sure that the border line described for the static HTML template remains present.

There are sample SSI enabled and PHP enabled dynamic HTML template files, called "hse_template.shtml" and "hse_template.phtml", respectively, in the directory "htdocs/hse" of this package.

Once you have edited and uploaded your custom dynamic HTML template, you must specify its absolute URL in section (8.1) - template_url - of your .ini file, eg.

    template_url = http://www.yourdomain.tld/hse/hse_template.shtml

For best compatibility between different servers, you may drop the "http://" prefix and the host name and use something like

    template_url = /hse/hse_template.shtml

instead. The full URL will then be constructed from your server's environment variable HTTP_HOST by prefixing "http://HTTP_HOST".

For example, if you have installed HSE at "http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe", HTTP_HOST equals "www.yourdomain.tld" and the example above would resolve to the URL "http://www.yourdomain.tld/hse/hse_template.shtml". If you have installed HSE at "http://www.yourdomain.tld:81/cgi-bin/hse/HomepageSearchEngine.exe", HTTP_HOST equals "www.yourdomain.tld:81" and the example above would resolve to the URL "http://www.yourdomain.tld:81/hse/hse_template.shtml".
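The way a shortened template_url is completed with HTTP_HOST can be illustrated by a small Python sketch. It only mirrors the rule described above; the function name `resolve_template_url` is hypothetical, and the sketch assumes that a value without the "http://" prefix always starts with "/".

```python
def resolve_template_url(template_url: str, http_host: str) -> str:
    """Return the full URL of the dynamic HTML template."""
    if template_url.startswith("http://"):
        # Already an absolute URL: used as given.
        return template_url
    # Shortened form: prefix "http://" and the value of HTTP_HOST.
    return "http://" + http_host + template_url


if __name__ == "__main__":
    # The two examples from the text above.
    print(resolve_template_url("/hse/hse_template.shtml", "www.yourdomain.tld"))
    print(resolve_template_url("/hse/hse_template.shtml", "www.yourdomain.tld:81"))
```

Because HTTP_HOST already carries a non-standard port such as ":81", the shortened form keeps working when the same .ini file is moved to a server that runs on another host name or port.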

If the directory the dynamic HTML template resides in is password protected, you must specify the login information (username and password) HSE should use to authenticate, using the syntax

    template_url = http://username:password@www.yourdomain.tld/hse/hse_template.shtml

When your dynamic HTML template uses the "HTTP_ACCEPT_LANGUAGE" environment variable, you can set its value by delivering "lang=LANG" to the .cgi action. A more detailed description of this option follows in section 6.8.

6.7 Language files ("hse_lang.txt" and "hse_help.txt")
------------------------------------------------------

If you want the program's output in a default language other than English, also upload "hse_lang.txt" and "hse_help.txt", found in the matching language directory of the "lang" sub directory, into the installation directory.

The name of each language directory LANG is the 2 letter ISO 639-1 language code (optionally followed by a "-" character and a 2 letter regional code) of the language it holds. These codes and their associated international settings for the currently 24 supported languages are:

language code | language            | charset      | date_format | decimal_sep | dir
------------------------------------------------------------------------------------
ar            | Arabic              | windows-1256 | DD/M/Y      | .           | rtl
cs            | Czech               | iso-8859-2   | D. M. Y     | ,           | ltr
da            | Danish              | iso-8859-1   | D. M Y      | ,           | ltr
de            | German              | iso-8859-1   | D. M Y      | ,           | ltr
el            | Greek               | iso-8859-7   | D M Y       | ,           | ltr
en            | English             | iso-8859-1   | M D, Y      | .           | ltr
es            | Spanish             | iso-8859-1   | M D, Y      | ,           | ltr
fi            | Finnish             | iso-8859-1   | D. M Y      | ,           | ltr
fr            | French              | iso-8859-1   | D M Y       | ,           | ltr
hu            | Hungarian           | iso-8859-2   | Y. M D.     | ,           | ltr
it            | Italian             | iso-8859-1   | M D, Y      | ,           | ltr
ja            | Japanese            | shift_jis    | Y.M.DD      | .           | ltr
nl            | Dutch               | iso-8859-1   | D M Y       | ,           | ltr
no            | Norwegian           | iso-8859-1   | D. M Y      | ,           | ltr
pl            | Polish              | iso-8859-2   | D. M Y      | ,           | ltr
pt            | Portuguese          | iso-8859-1   | M D, Y      | ,           | ltr
ro            | Romanian            | iso-8859-2   | D M, Y      | .           | ltr
ru            | Russian             | windows-1251 | D M Y       | ,           | ltr
sr            | (Latin) Serbian     | iso-8859-2   | D. M Y      | ,           | ltr
sv            | Swedish             | iso-8859-1   | D M Y       | ,           | ltr
th            | Thai                | tis-620      | DD/M/Y      | .           | ltr
tr            | Turkish             | iso-8859-9   | D. M Y      | ,           | ltr
zh-cn         | simplified Chinese  | gb2312       | Y.M.DD      | .           | ltr
zh-tw         | traditional Chinese | big5         | Y.M.D       | .           | ltr

If there are no language files for your preferred language, or if you want to change the current wording to fit your needs, you can edit the distributed language files. If you would like to receive a full version of our search engine for free, please contact us before creating a new language file set.

6.8 Language and Configuration sub directories / delivery parameters ("lang" and "conf")
----------------------------------------------------------------------------------------

The "cgi-bin" directory of the distributed package includes two sub directories: "lang" holds all available language directories containing the language files, and "conf" is the container for configuration directories named "1", "2", .. to "9" that can be filled with additional configuration sets. Upload them all into the installation directory to get the option to switch between languages and configuration sets. On Unix, make sure all directories are chmod'ed 755.

You can then change the language and its associated international settings - as stated in the table above - by delivering the "lang" parameter, with the name of the language directory (the language code) as its value, to the executable. The thousands separator is always " " (eg. "1 679 matches") unless you deliver the lang=en parameter, which changes it to "," (eg. "1,679 matches").
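The thousands separator rule can be expressed in a few lines of Python. This is only an illustration of the behaviour described above, not code taken from HSE; the function name `format_count` is hypothetical.

```python
from typing import Optional


def format_count(number: int, lang: Optional[str] = None) -> str:
    """Group digits in blocks of three, as described for the results pages."""
    # Only lang=en switches the separator to ","; everything else uses a space.
    separator = "," if lang == "en" else " "
    return f"{number:,}".replace(",", separator)


if __name__ == "__main__":
    print(format_count(1679), "matches")        # 1 679 matches
    print(format_count(1679, "en"), "matches")  # 1,679 matches
```

Note that this separator is independent of the decimal_sep setting listed in the table above.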
The value of the "lang" delivery parameter is also sent to the server as the accepted language (as the "Accept-Language" HTTP header), which results in the environment variable "HTTP_ACCEPT_LANGUAGE" being set to this language code. This may be useful when a dynamic SSI enabled HTML template (see section 6.6) is used that calls a script to automatically display the date in the user's language format. You can try the included dynamic HTML template file called "hse_template.shtml" to see how this works.

For instance, calling

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe?lang=de

changes the language and all its associated international settings to German and sets the "HTTP_ACCEPT_LANGUAGE" environment variable to the value "de".

Similarly, you can deliver a "conf" parameter with the name of a configuration directory (a number from 1 to 9) as its value to the .cgi URL. For instance, calling

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe?conf=1

forces the search engine not to use any of the default configuration files (residing in the main installation directory), but to use those found in the directory "1" instead. So you can use one installation of HomepageSearchEngine with up to 10 different configuration sets. Please see the WhatsThis.txt file residing in the conf/1 directory for details.

NOTE: Each uploaded configuration directory must at least contain the .ini file ("hse.ini"). The distributed package contains only one configuration directory, namely "1". If you create or upload additional ones ("2" to "9"), make sure that all of these directories contain an .ini file!

You can disable access to a configuration set by setting

    allowed_referer_sites = -

in the .ini file residing in the corresponding configuration directory. That .ini file does not need to contain anything else. This may be especially useful if you only use configuration sub directories, but don't want the main configuration set to be used.
To do so, place an .ini file as described above into your main installation directory. If the URL

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe

is then called, nothing appears but a message like "ERROR: Sorry, this CGI application is set not to be callable from the site 'www.yourdomain.tld'."

6.9 Setting the Language Locale - the locale-enabled HomepageSearchEngine Executable
------------------------------------------------------------------------------------

You can skip this section unless your site is in a language other than English.

If your site is hosted on a Unix server and contains words with characters other than the English A-Z characters (characters outside US-ASCII), eg. the German "Umlaute" (Ä, Ö, Ü), you may observe the following problem (known as the "always-case-sensitive" bug):

When a search string includes such characters, the search will always be performed case-sensitively, even if the 'Match case' checkbox remains disabled (matchcase=off). Also, restricting the search to whole words only (noparts=on) will not work properly with words containing such characters: if the character is the first or last one of the word, the word will not be found. Searches with matchcase=on and noparts=off will always work properly.

The reason for this behaviour is that the system's default "Locale" (language environment) is not the one the characters in question belong to. This could be fixed by switching the system's default Locale to the corresponding one. If this setting cannot be changed system-wide, you can let HomepageSearchEngine use its own custom Locale. Only use this feature if you are affected and case-insensitive searches for such special characters are important to you, because this option requires more system resources.

To test whether you are affected, include a word in both lowercase and uppercase letters in one of your searchable documents, eg. "schönbrunn" and "SCHÖNBRUNN" (which contains the German "Ö" Umlaut). Then search for this word ("schönbrunn"), keeping the search options at their defaults (case-insensitive). The result must then include the file with both occurrences found. If it finds only one occurrence instead of two, your system is affected.

To solve this issue, use the "locale-enabled HomepageSearchEngine Executable" instead of the default one. It is included as a file called "HomepageSearchEngine-lc.exe" (for Windows) or "HomepageSearchEngine-lc.cgi.bin" (for Unix). The latter should be renamed to "HomepageSearchEngine-lc.cgi" or "HomepageSearchEngine.cgi" once it resides on the server. The Windows version is usually not needed, since the "always-case-sensitive" bug does not seem to affect Windows platforms. Nevertheless, the Windows package also contains a locale-enabled HomepageSearchEngine Executable for testing purposes.

It is a good idea to first determine your system's default Locale by calling HomepageSearchEngine-lc in Enhanced Debug mode (see section 9 for details):

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine-lc.exe?debug

(or equivalent). The Locale shown can be changed in section (4.5) of your .ini file. To support German characters, the following setting may work:

    locale = German

If it doesn't, find out which Locale strings are supported by the current system configuration of your host machine.

6.10 Shell Executable: creating file-lists, indexes and using other tools (Pro edition only)
--------------------------------------------------------------------------------------------

Especially on large websites, you may want to reduce the search time by searching in an index instead of searching the files directly. The content of all matching HTML files will be stored in a tab-separated text file called "hse_indexNR_html.txt". The file "hse_indexNR_nonhtml.txt" holds the content of all matching Non-HTML files. Both files together represent the index file pair for category NR.
If the index file *pair* for the current category is present, it will be used; otherwise the flat or the on-the-fly search method will be applied.

To create the index files, go into the installation directory on the command prompt (shell) and run the executable file. To do this, shell access (via Telnet, SSH or direct access) is required. On Windows, you have to type something like

    cd F:\InetPub\www.yourdomain.tld\cgi-bin\hse
    HomepageSearchEngine

(with or without its ".exe" extension) while on Unix, you have to type something like

    cd /web/www.yourdomain.tld/cgi-bin/hse
    ./HomepageSearchEngine.cgi

If you do not have shell access, you can use the web based Console, which is part of the Admin Area, to run the executable file on the shell (the executable file then behaves as the "Shell Executable"). Just point your webbrowser to

    http://www.yourdomain.tld/cgi-bin/hse/HomepageSearchEngine.exe?admin

(or equivalent). Remember that you need to log in with a username/password pair created in step 6.3 above.

Executing the Shell Executable with the '-help' argument will show how it can be used:

  Usage: HomepageSearchEngine
             spider       [-conf=DIR] [-cat=NR] [-lang=LANG] [-depth=LEVEL] [-max=NR_URLS] -url=URL
                          [-debug] [-nobackup] [-batchmode] [-help]
           | geturls      [-conf=DIR] [-cat=NR] [-lang=LANG] [-nobackup] [-batchmode] [-help]
           | makelist     [-conf=DIR] [-cat=NR] [-nononhtml | -nohtml] [-nobackup] [-batchmode] [-help]
           | index        [-conf=DIR] [-cat=NR] [-nononhtml | -nohtml] [-nobackup] [-nocheck] [-batchmode] [-help]
           | changetitles [-conf=DIR] [-cat=NR] [-nobackup] [-batchmode] [-help]
           | changeurls   [-conf=DIR] [-cat=NR] [-nobackup] [-batchmode] [-help]
           | memory       [-wait=SEC]

  Commands:
    spider        Spiders a remote site recursively, beginning at URL down to a given LEVEL.
    geturls       Get remote URLs and stores their content in files on your site.
    makelist      Makes the file-list(s) required for indexing all or specific categories.
    index         Indexes all or specific categories.
    changetitles  Changes titles in specific index file pairs.
    changeurls    Changes URLs in specific index file pairs.
    memory        Shows information about the memory used by this application.

  Options available for the commands 'spider', 'geturls', 'makelist', 'index', 'changetitles' and 'changeurls':
    -conf=DIR     Specifies DIR (1..9) to be used as configuration directory ./conf/DIR
                  If not set, the main directory (the one you are currently in) will be used.
    -cat=NR       Specifies the category number NR (1..25) to be used.
                  With the 'spider' command, it tells which URL-list file should be created.
                  With the 'geturls' command, it tells which URL-list file should be read.
                  If not set, the main URL-list file (hse_urllist.csv) will be used.
                  With the 'makelist' command, it tells which file-list file pairs should be created.
                  With the 'index' command, it tells which index file pairs should be created.
                  If not set, all file pairs will be created.
                  With the 'changetitles' or 'changeurls' command, it tells which index file pair should be modified.
                  If not set, the main index file pair (hse_index_html.txt and hse_index_nonhtml.txt) will be modified.
    -nobackup     Does not backup a file before overwriting it.
                  Useful when available disk space is limited to a small size.
    -batchmode    Turns on batch mode (Does not ask any questions).
    -help         Displays help for the given command.
                  Without a command, displays general help (this screen).

  Additional Options available for the commands 'spider' and 'geturls':
    -lang=LANG    Sends the LANG value (ISO 639 language code; eg. 'en') to the server as accepted language.

  Additional Options available for the command 'spider':
    -url=URL      (Mandatory) The URL the spider should begin collecting (internal) links from.
    -depth=LEVEL  The hierarchical LEVEL (0..10) how deep links should be followed recursively. Defaults to 3.
    -max=NR_URLS  The maximum number of URLs NR_URLS (1..1000) to be got and checked. Defaults to 100.
    -debug        Prints additional information (verbose mode) useful for debugging.
Additional Options available for the commands 'makelist' and 'index': -nononhtml Does not create the file-list or index for Non-HTML files. -nohtml Does not create the file-list or index for HTML files. Additional Option available for the command 'index': -nocheck Does not check the index file for correct content after creating it to save system resources May be useful on systems with little available resources to avoid terminating prematurely. Options available for the command 'memory': -wait=SEC Specifies a time SEC (1..300) in seconds that the process waits before being terminated. Useful to find out memory consumption with other tools during that time. As you can see, the the Shell Executable can be called together with a command (a single word) and a number of options (always begins with the "-" character). To index a site, a file-list must first be made (using the 'makelist' command) which is then used to create the index files (using the 'index' command). Detailed information can be obtained by executing "HomepageSearchEngine index -help". The most powerful way to index your site would be if you let the index files to be created automatically every day. This could be done on Unix using the shell script "hse_cronjob.sh" or on Windows using the the batch script "hse_cronjob.bat" found in the "tools/hse" directory. Details are available in the "ReadMe.txt" file residing there. Instead of creating the index files directly on the production server, you can also create them on your local hard drive where you have mirrored the site, regardless of the platform. Just be sure to use the correct executable on your development platform. No webserver is required to be installed. Finally upload the index files via FTP onto the production server. If you cannot or don't want to create/use the index files for some reasons like limited resources, you can still improve the search speed that would result from an on-the-fly search by applying the flat search method. 
To do so, make only the file-list pair by executing the 'makelist' command
and skip the 'index' command.


6.11 IMPORTANT - Testing your installation: search for "list:files"
-------------------------------------------------------------------

The best thing to do when installation is finished is to call the search
engine in the "advanced search" form and then search for the term
"list:files". You will then see which search method will be applied and which
files will be searched. If the resulting list has been collected on-the-fly,
you will also see how many files and directories had to be inspected, as well
as the required CPU time. This may reveal unnecessary items and the need to
add some directory names to the "exclude_dirs" directive of the .ini file.

REPEAT this step with all 4 checkboxes regarding the parts of the web pages
*disabled* and the (last) checkbox "Search text of Non-HTML files" enabled.
You will then see which Non-HTML files will also be searched. You may find
that unwanted files are included that have made the index file very large.
In this case, reconfigure the .ini file and re-index your site. If you have
set up more than one category, check all categories.

Note that the "list:files" output does not work when the "debug_level"
directive in your .ini file is set to a value higher than 2. So be sure that
this directive stays at its default value of 0 or is set to 1 or 2 when you
want to view this list.


6.12 Excluding specific sections within HTML files from being searched
----------------------------------------------------------------------

You may have reasons to exclude certain areas within HTML files from being
searched. Enclose such areas in a span or div tag assigned to the
"HSE-nosearch" class to keep HomepageSearchEngine from looking inside these
sections:

  <span class="HSE-nosearch">This text will never be looked up by HomepageSearchEngine</span>
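
The effect of the "HSE-nosearch" class can be pictured with a small sketch.
This is only an illustration of the rule described above, written in Python
with the standard library; it is not how HomepageSearchEngine itself is
implemented. The names NoSearchTextExtractor and searchable_text are made up
for this example, and the markup is assumed to be well-formed (as XHTML is).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoSearchTextExtractor(HTMLParser):
    """Collects page text, skipping elements marked with class "HSE-nosearch".

    Hypothetical helper for illustration; assumes every non-void start tag
    has a matching end tag.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []     # text pieces that remain searchable

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Entering an excluded element, or any element nested inside one.
        if self.skip_depth or "HSE-nosearch" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def searchable_text(html):
    """Returns the text outside all "HSE-nosearch" areas, whitespace-normalised."""
    parser = NoSearchTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


sample = ('<p>Opening hours <span class="HSE-nosearch">internal note</span>'
          ' are listed below.</p>')
print(searchable_text(sample))  # prints: Opening hours are listed below.
```

In this sketch, a search for "internal" would not match the sample page,
while a search for "hours" would.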