#!/usr/bin/perl =pod =head1 NAME search_engine -- Search engine for web sites =head1 AUDIENCE This search engine is intended for web masters with small static sites (up to 1000 pages) under a single directory tree. It is installed as a cgi script, so your ISP must allow you to install your own cgi scripts in order to use it. =head1 INSTALLATION Before installing the script, you should find out from your ISP what extension should be used for a Perl cgi script and what is the location of Perl on your ISP's system. This information will probably be available in the FAQ. If the top line of this file does not match the location of Perl on your ISP's system, edit it so it does. If your web pages do not end with the extension '.html', edit the value of DO_SEARCH to have the correct extension. Ftp this file into the top level html directory on your system. If your ISP does not allow you to place cgi scripts in this directory, place this script in the directory your ISP indicates and follow the instructions on setting BASE_URL and BASE_DIRECTORY in the customization section. Use the ftp client to give this file executable permission. Once all that is done, test the script by typing its url, something like http://www.yoursite.com/search_engine.cgi You should see a search form with instructions on how to use it. Enter a search term and press return. You should now have a working search engine on your site. You can simply add a link to the search engine from your web site or you can add a form which calls the search engine. Here is an example of the html code to add to your site for the form.
=head1 CUSTOMIZATION Although you should now have a functioning seach engine, you probably will want to change how it looks and acts. You can do this by editing the scripts configuration variablesand by creating html template files for the search engine. The constants DO_SEARCH and DONT_SEARCH control which files the search engine looks at. They contain filename wildcard patterns. If you want to use more than one pattern you can create an array of patterns, which is a comma separated list surrounded by square brackets. For example ['*.htm', '*.html'] By default the search engine searches the directory containing the script and all its subdirectories. If your ISP does not allow you to place scripts in the top html directory, set BASE_DIRECTORY and BASE_URL to the complete path and the url of the top html directory. The look of the search page is controlled by an html template file. By default the search engine uses the template afterthe __DATA__ statement at the end of the program. You can create your own template, put its name in the constant SEARCH_TEMPLATE and ftp it to your web site in the same directory as this script. One reason you may wish to use your own template is to give the search results the look of your web site. To do this, take a typical web page on your site, strip out the content unique to the page, and replace it with the portion of the template below between the body tags. The template can be customized further to make it look the way you wish. The template commands and variables are described in the next section. This script allows you to create a site wide template so that you can have a common look for all the web pages and cgi scripts on your site. If you wish to use this optional feature, set the value of SITE_TEMPLATE to the name of your site wide template. Blocks in the site template are wrapped in html comments that look like where name is any identifier string. Block delimeted in the same comments in the search template replace the blocks in the site template. By default the search engine checks all text between the opening and closing body tags. You can restrict the search to a region of each page by setting START_MARK and END_MARK to strings that bound the region to be searched. I suggest using html comments, but you can use any string. The search engine divides its results into separate pages with links to the next and previous pages of results, where appropriate. The number of results displayed on each page is controlled by NUMBER_RESULTS. The search results display an extract from the matched page containing the first search term so you can see the term in its context. The length of this extract is set by CONTEXT_LENGTH. The parameters QUERY_PARAM and START_PARAM set the names of the cgi parameters used by this script. You will probably not need to change them. =head1 TEMPLATE CUSTOMIZATION This script builds the web page it outputs by filling in values in a template. The default template is at the end of the file. Whenever a string like $(name) occurs in the template, it is replaced by the corresponding value generated by this script. The template also uses simple control structures: the #for and #endfor statements that loop over the results returned by the search engine and the #if, #else, and #endif statement that include code conditionally in the output, depending on the value of the variable on the #if statement. All control structures must have the # in the first column of a template line. The statement continues up to and includes the newline character at the end of the line. This script produces the results array, whose contents are looped over to display the search results. The following variables can be used in the template lines between the "#for results" and "#endfor" lines/ =over 4 =item url The url of the web page matched in the search =item title The title of the web page =item context A string displaying the search term in context =item count The number of times the search terms occured in the page =back The other variables calculated by the search script can be used outside the loop. These variables are: =over 4 =item query The search string =item query_param The name of the parameter containing the search string =item base_url The url of the topmost directory to search in =item search url The url of this script =item start The index of the first result to display, counting from one =item finsih The index of the last result to display =item total The total number of results returned =item previous_url The url of the revious page of search results. Empty string if none =item next_url The url of the next page of results. Empty string if none =back =head1 AUTHOR Bernie Simon (http://carelesshand.net) =head1 LICENSE Copyright Bernard Simon, 2005. You may use this file as you wish as long as this copyright notice is maintained. =cut use strict; use lib '.'; use CGI qw(:standard); use CGI::Carp 'fatalsToBrowser'; use Text::ParseWords; use FileHandle; use File::Find; use Cwd; #---------------------------------------------------------------------- # Configuration variables # Filename pattern of files to include in search. Can be an array use constant DO_SEARCH => '*.html'; # Filename pattern of files to exclude from search. Can be an array use constant DONT_SEARCH => ''; # Base directory of documents searched use constant BASE_DIRECTORY => ''; # Base URL of documents searched use constant BASE_URL => ''; # Template used to display results use constant SEARCH_TEMPLATE => '../lib/search_template.htm'; # Template used to give page the site's look use constant SITE_TEMPLATE => 'template.htm'; # String that marks the start of the page text use constant START_MARK => ''; # String that marks the end of the page text use constant END_MARK => ''; # Number of search results displayed on a page use constant NUMBER_RESULTS => 20; # Length of displayed context use constant CONTEXT_LENGTH => 80; # Name of parameter containing query use constant QUERY_PARAM => 'query'; # Name of parameter containing start position use constant START_PARAM => 'start'; #---------------------------------------------------------------------- # Main procedure $| = 1; my $hash = {}; # Read and untaint CGI parameters my $cgi = CGI->new (); my $query = scrub_parameter ($cgi->param (QUERY_PARAM)); my $start = scrub_parameter ($cgi->param (START_PARAM)); # Set configuration variables if left empty my $base_directory = BASE_DIRECTORY || cwd(); my $base_url = BASE_URL || $cgi->url(); $base_url =~ s!/[^/]*$!!; # Perform the search and put results into an array my @term = map ('\b'.quotemeta($_).'\b', shellwords ($query)); my $results = do_search ($base_url, $base_directory, @term); # Add extra info used in the output page $hash->{query} = $query; $hash->{base_url} = "$base_url/"; $hash->{script_url} = $cgi->url(); $hash->{query_param} = QUERY_PARAM; # Build navigation links $hash->{results} = restrict_page ($results, $start); navlinks ($hash); # Generate output page and print it my $template = slurp (SEARCH_TEMPLATE) || join ('', ); my $output = yeti ($template, $hash); my $site_template = slurp (SITE_TEMPLATE); if ($site_template) { $output = substitute_blocks ($site_template, $output); } print "Content-type: text/html\n\n"; print $output; #---------------------------------------------------------------------- # Build the page url from the base url and filname sub build_url { my ($base_url, $base_directory, $filename) = @_; $base_directory =~ s!([^/])$!$1/!; $filename = substr ($filename, length ($base_directory)); return "$base_url/$filename"; } #---------------------------------------------------------------------- # Do the search and build the output array sub do_search { my ($base_url, $base_directory, @term) = @_; # Create the closure used to search the files my $results = []; my $do_pattern = globbify (DO_SEARCH); my $dont_pattern = globbify (DONT_SEARCH); my $searcher = sub { return if $do_pattern && ! /$do_pattern/o; return if $dont_pattern && /$dont_pattern/o; my ($title, $text) = parse_htmldoc ($_); return unless length ($text); my ($count, @pos); foreach my $term (@term) { my $pos = 0; while ($text =~ /$term/gi) { $pos ||= pos ($text); $count ++; } if ($pos) { push (@pos, $pos); } else { return; } } my $modtime = (stat $_)[9]; my $result = {title => $title, count => $count, modtime => $modtime}; $result->{url} = build_url ($base_url, $base_directory, $File::Find::name); $result->{context} = get_context ($text, $term[0], $pos[0]),; push (@$results, $result); }; # Search the directory tree find ($searcher, $base_directory) if @term; return $results; } #---------------------------------------------------------------------- # Add parameters to a url sub encode_url { my $url = shift (@_); my (%param) = @_; my ($key, $value, $arglist); while (($key, $value) = each (%param)) { $arglist .= '&' if $arglist; $value =~ s/([&\+\"\'])/sprintf ('%%%02x', ord($1))/ge; $value =~ tr/ /+/; $arglist .= "$key=$value"; } return "$url?$arglist"; } #---------------------------------------------------------------------- # Get the context of a search term match sub get_context { my ($text, $term, $pos) = @_; my $start = $pos - CONTEXT_LENGTH / 2; $start = 0 if $start < 0; my $end = $pos + CONTEXT_LENGTH / 2; my $len = ($end - $start) + 1; $len = length ($text) - $start if $len > length ($text) - $start; my $context = substr ($text, $start, $len); $context =~ s/^\S*\s+//g; $context =~ s/\s+\S*$//g; $context =~ s!($term)!$1!gi; return $context; } #---------------------------------------------------------------------- # Convert filename wildcards into regexp wildcards sub globbify { my ($pattern) = @_; my @pattern; if (ref $pattern) { @pattern = @$pattern; } else { push (@pattern, $pattern); } my %patmap = ( '*' => '.*', '?' => '.', '[' => '[', ']' => ']', ); my @regexp; foreach my $pattern (@pattern) { next unless length ($pattern); $pattern =~ s/(.)/$patmap{$1} || "\Q$1"/ge; $pattern = '(^' . $pattern . '$)'; push (@regexp, $pattern); } return join ('|', @regexp); } #---------------------------------------------------------------------- # Create urls for previous and next queries sub navlinks { my ($hash) = @_; if ($hash->{start} > 1) { my $first = $hash->{start} - NUMBER_RESULTS; $first = 1 if $first < 1; $hash->{previous_url} = encode_url ($hash->{script_url}, START_PARAM, $first, QUERY_PARAM, $hash->{query}); } if ($hash->{finish} < $hash->{total}) { $hash->{next_url} = encode_url ($hash->{script_url}, START_PARAM, $hash->{finish} + 1, QUERY_PARAM, $hash->{query}); } } #---------------------------------------------------------------------- # Get title and remove html from document sub parse_htmldoc { my ($filename) = @_; my $text = slurp ($filename); return $text unless length ($text); my ($title) = $text =~ m!(.*)!i; $title =~ tr/\t\r\n / /s; $title = '(No Title)' unless length ($title); my $start_mark = quotemeta (START_MARK || ''); my $end_mark = quotemeta (END_MARK || ''); my ($top, $rest) = split (/$start_mark/i, $text, 2); $rest ||= $top; my ($mid, $bottom) = split (/$end_mark/i, $rest, 2); $mid =~ s// /gs; $mid =~ s/<[^>]*>/ /gs; $mid =~ s/ / /g; $mid =~ tr/\t\r\n / /s; return ($title, $mid); } #---------------------------------------------------------------------- # Sort and restrict the set of results sub restrict_page { my ($results, $start) = @_; $hash->{total} = @$results; $hash->{start} = $start || 1; $hash->{finish} = NUMBER_RESULTS + $hash->{start} - 1; $hash->{finish} = $hash->{total} if $hash->{finish} > $hash->{total}; my $sorter = sub { $b->{count} <=> $a->{count} || $b->{modtime} <=> $a->{modtime} }; @$results = sort $sorter @$results; my @restricted = @$results[$hash->{start}-1 .. $hash->{finish}-1]; return \@restricted; } #---------------------------------------------------------------------- # Make sure there are no nasty characters in the query sub scrub_parameter { my ($oldvalue) = @_; $oldvalue =~ /^([^\&\<\>]*)$/; my $newvalue = $1 || ''; return $newvalue; } #---------------------------------------------------------------------- # Read a file into a string sub slurp { my ($input) = @_; local $/; my $in = FileHandle->new ($input); return '' unless defined $in; my $text = <$in>; $in->close; return $text; } #---------------------------------------------------------------------- # Substitue comment delimeted blocks for same blacks in template sub substitute_blocks { my ($template, $input) = @_; my $name; my %block; # Extract blocks from input my @tokens = split (/()/, $input); foreach my $token (@tokens) { if ($token =~ //) { if (defined $name) { die "Nested blocks in input: $token\n"; } $name = $1; } elsif ($token =~ //) { if ($name ne $1) { die "Nested blocks in input: $token\n"; } undef $name; } elsif (defined $name) { $block{$name} = $token; } } # Substitute blocks into output my $output; @tokens = split (/()/, $template); foreach my $token (@tokens) { if ($token =~ //) { if (defined $name) { die "Nested blocks in template: $token\n"; } $name = $1; $output .= $token; } elsif ($token =~ //) { if ($name ne $1) { die "Nested blocks in template: $token\n"; } undef $name; $output .= $token; } elsif (defined $name) { $output .= $block{$name}; delete $block{$name}; } else { $output .= $token; } } # Add leftover blocks so we don't lose them for $name (sort keys %block) { $output .= "\n$block{$name}\n"; } return $output; } #---------------------------------------------------------------------- # (Yet) Another html interpolator sub yeti { my ($template, $hash) = @_; my @tokens = split (/^(\#[^\n]*)\n/m, $template); my $yeti = {index => 0, tokens => \@tokens, output => ''}; yeti_backend ($yeti, $hash, 1); return $yeti->{output}; } #---------------------------------------------------------------------- # Backend to interpolation procedure sub yeti_backend { my ($yeti, $hash, $visible) = @_; my $saved_test; while ($yeti->{index} < @{$yeti->{tokens}}) { my $token = $yeti->{tokens}->[$yeti->{index}++]; if ($token =~ /^\#(\w+)\s*(\w*)/) { my $cmd = $1; my $name = $2; if ($cmd eq 'for') { my $subarray = $hash->{$name}; my $saved_index = $yeti->{index}; if ($subarray && @$subarray) { foreach my $subhash (@$subarray) { $yeti->{index} = $saved_index; my %hash = (%$hash, %$subhash); yeti_backend ($yeti, \%hash, $visible); } } else { yeti_backend ($yeti, {}, 0); } } elsif ($cmd eq 'endfor') { return; } elsif ($cmd eq 'if') { my $value = $hash->{$name}; if (ref $value eq 'ARRAY') { $saved_test = @$value ? 1 : 0; } elsif (ref $value eq 'HASH') { $saved_test = %$value ? 1: 0; } else { $saved_test = $value ? 1 : 0; } yeti_backend ($yeti, $hash, $visible && $saved_test); } elsif ($cmd eq 'endif') { if (defined $saved_test) { undef $saved_test; } else { $yeti->{index} --; return; } } elsif ($cmd eq 'else') { if (defined $saved_test) { yeti_backend ($yeti, $hash, $visible && ! $saved_test); } else { $yeti->{index} --; return; } } } elsif ($visible) { $token =~ s/\$\(([^\)]*)\)/$hash->{$1}/ge; $yeti->{output} .= $token; } } return; } __DATA__ Site Search

Site Search

#if query #if total

Documents $(start) to $(finish) of $(total)

#else

No documents matched

#endif #else

Enter one or more words to search for. The results will list pages containing all the search terms. The match is case insensitive and only matches entire words. To search for a phrase, enclose it "in quotes".

#endif #for results

$(title)
$(context)

#endfor

#if previous_url Previous #endif #if next_url Next #endif