TBGsearch 1.00
coded by Apachez (c) apachez@home.se

TBGsearch is a vector-based search engine written in Perl that uses MySQL for storage. The bit vectors are stored as LZF-compressed, chunked signature files. Chunks are currently 10 kbit (1250 bytes) each and take on average approximately 33 bytes per chunk once compressed (instead of 1250 bytes uncompressed).

A TBGsearch index consists of two tables, Tbl_Search_INDEXNAME_Word and Tbl_Search_INDEXNAME_Vector.

Tbl_Search_INDEXNAME_Word contains the unique search words identified by the indexer, each with a word id and a chunk vector. The chunk vector tells TBGsearch which chunks the search word occurs in.

Tbl_Search_INDEXNAME_Vector contains the actual search vectors, each with the word id and chunk id it belongs to.

CREATE TABLE Tbl_Search_INDEXNAME_Word (
  `SearchWord` varchar(32) NOT NULL DEFAULT '',
  `SearchWordId` mediumint UNSIGNED NOT NULL DEFAULT '0',
  `SearchFrequency` int UNSIGNED NOT NULL DEFAULT '0',
  `SearchChunkVector` blob NOT NULL,
  PRIMARY KEY (`SearchWord`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;

CREATE TABLE Tbl_Search_INDEXNAME_Vector (
  `SearchWordId` mediumint UNSIGNED NOT NULL DEFAULT '0',
  `SearchChunkId` smallint UNSIGNED NOT NULL DEFAULT '0',
  `SearchFrequency` smallint UNSIGNED NOT NULL DEFAULT '0',
  `SearchVector` blob NOT NULL,
  PRIMARY KEY (`SearchWordId`, `SearchChunkId`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;

The scripts currently use the following packages:

  strict
  warnings
  DBI
  DBD::mysql
  Compress::LZF
  Time::HiRes

The packages strict and warnings are usually installed by default with Perl.
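Before running the scripts it can help to confirm that the packages listed above can be loaded. The following is a minimal sketch and not part of TBGsearch; the helper name module_available is made up for illustration, and it only checks that each module loads, not that its version is suitable.

```perl
use strict;
use warnings;

# Modules named in the package list above (strict and warnings ship with Perl).
my @modules = qw(DBI DBD::mysql Compress::LZF Time::HiRes);

# Hypothetical helper: returns 1 if the named module can be loaded, else 0.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Compress::LZF -> Compress/LZF.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (@modules) {
    printf "%-15s %s\n", $module, module_available($module) ? 'ok' : 'MISSING';
}
```

Any module reported as MISSING needs to be installed before the scripts will run.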
To install a missing package, such as Compress::LZF, in a *nix environment the recommended way is to use the CPAN module:

  perl -MCPAN -e 'install Compress::LZF'

Description of each script and what might need to be changed for your needs:

* tbgsearch_frequency.pl
  This aggregates the word frequency of each search word from the vector table into the word table. No argument is needed.

  Things that might need to be changed:
  - my %sqlconfig contains login information for the destination (the search database where TBGsearch stores its vectors).
  - my @table = qw/Message Subject User/; needs to be changed to the index names you wish to use (for more information read about tbgsearch_index.pl regarding the three (3) indexes).

* tbgsearch_index.pl
  This is the TBGsearch indexer. It takes a numeric value as its argument, which is the number of the chunk you wish to index. To index 1 million posts from the forum you can create a bat file with the following contents:

  FOR /L %%A IN (0,1,99) DO perl tbgsearch_index.pl %%A

  which will index chunks 0 to 99.

  Things that might need to be changed:
  - my %sqlconfig contains login information for the source (for example the forum you wish to index).
  - my %sqlconfig2 contains login information for the destination (the search database where TBGsearch stores its vectors).
  - At row 141 there is a prepared statement that reads the data from the source you wish to index. This will most likely need to be changed in your case.
  - Currently tbgsearch_index.pl creates three (3) indexes: one for messages, one for subjects and one for usernames. If you only need one index, remove the code for the other two in order to gain performance during the indexing phase.
  - For each processed post there is a filter regarding allowed text.
    Specifically, the lines:

    $sql_NewsMessage =~ s/^>.*\n?//mg;
    $sql_NewsMessage =~ s/^.*\s+wrote\s+.*\s+at\s+\d+-\d+-\d+\s+\d+:\d+\n?//mg;

    might need to be removed if you wish to index the whole post (TBGsearch is used at http://www.tbg.nu, where it excludes quoted text from being indexed).

* tbgsearch_install.pl
  This creates the MySQL tables that TBGsearch needs in order to function properly. No argument is needed.

  Things that might need to be changed:
  - my %sqlconfig contains login information for the destination (the search database where TBGsearch stores its vectors).
  - my @table = qw/Message Subject User/; needs to be changed to the index names you wish to use (for more information read about tbgsearch_index.pl regarding the three (3) indexes).

* tbgsearch_optimize.pl
  This runs the MySQL command OPTIMIZE TABLE on each table in the TBGsearch database. No argument is needed.

  Things that might need to be changed:
  - my %sqlconfig contains login information for the destination (the search database where TBGsearch stores its vectors).
  - my @table = qw/Message Subject User/; needs to be changed to the index names you wish to use (for more information read about tbgsearch_index.pl regarding the three (3) indexes).

* tbgsearch_optimize_reorder.pl
  This is the same as tbgsearch_optimize.pl, except that it also reorders the tables before running the MySQL command OPTIMIZE TABLE, in order to optimize each table further. No argument is needed.

  Things that might need to be changed:
  - my %sqlconfig contains login information for the destination (the search database where TBGsearch stores its vectors).
  - my @table = qw/Message Subject User/; needs to be changed to the index names you wish to use (for more information read about tbgsearch_index.pl regarding the three (3) indexes).

* tbgsearch_search.pl
  This is the actual search script, made for console testing.
  All arguments are treated as the search words the client asks for.

  Examples of usage:

  Search for abc AND def:
    perl tbgsearch_search.pl abc def

  Search for words starting with abc:
    perl tbgsearch_search.pl abc*

  Search for words ending with def:
    perl tbgsearch_search.pl *def

  Search for abc AND words which contain def:
    perl tbgsearch_search.pl abc *def*

  Note: Wildcard searches are more time-consuming than regular searches, since a wildcard can match many search words. On *nix shells, quote wildcard arguments (for example 'abc*') so that the shell does not expand them.

  Things that might need to be changed:
  - my %sqlconfig contains login information for the destination (the search database where TBGsearch stores its vectors).
  - my $searchtable = 'Message'; defines which index, Message or Subject, the search operates on. The User index is included by default.
  - my $sortorder = 1; defines the sort order: 1 means descending, 0 means ascending.
  - At row 139 there is a "# FIXME" note. Currently there is no collection table which tells TBGsearch which post was last indexed, when the indexer was last run and how many chunks have been processed. You therefore need to change "for my $i(0..106)" to match the number of chunks your vector table contains.
  - At row 213 there is a commented-out print command which can be used for debugging to see which chunks the search engine is processing. During a benchmark it should stay commented out or be removed from the script in order to gain some performance.
  - At the end of the script there is a commented-out code segment with an example of how the script can be used in a CGI environment, where you also get the actual hits (not just the raw docids).

If you find this useful for production use, or have comments, ideas or improvements, don't hesitate to send me an email at apachez@home.se

/Apachez