Back to Home Page

An Implementation of a Boolean Expression Evaluator for Web Searches

by Bob Day (bobday.nh@verizon.net)

Copyright (C) November, 2002 by Bob Day. All rights reserved.


1. The Problem

Web mining software programs often work something like this: You enter a base URL from which to start the search, and a Boolean expression, or search string, that specifies what you are interested in finding.  An example of a search string might be: "vacations and (california or florida)".  As the program scans Web pages, it adds links that occur on each page it scans to the list of links it is scheduled to scan.  When the program encounters a page for which the Boolean expression is satisfied, it saves that page, or perhaps some particular piece of information, such as an email address or phone number, on that page.

During its search of the Internet, a web mining program, rather than reading an entire Web page into memory at once, may instead perform a series of reads of successive portions of a page into a fixed size buffer.  This method conserves memory and its use of memory is predictable.  However, this method also places constraints on the function that is used to evaluate the Boolean search expression, which brings us to the heart of the subject of this article.  In order to be efficient the evaluation function must satisfy the following requirements.

1. It must be fast, since it will be called whenever we scan the data buffer for one of the variables in the search expression.

2. It must be able to detect tautologies, expressions that are always true despite the truth or falsity of their constituent variables.  Conversely, it must also be able to detect when an expression is always false (an "antitautology" if that's not already a word).  Moreover, the expression evaluator must be able to detect when an expression becomes tautological or antitautological due to the determination of the truth (occurrence) or falsity (nonoccurrence) of some of its variables.  This latter problem does not occur when the values of all the variables are available simultaneously.

The next section describes the solution I came up with to satisfy these requirements, and the final section lists the source code.


2. Solution

The first requirement in the previous section, making the Boolean expression evaluator fast, was addressed in two ways:

1. Compiling the Boolean search expression entered by the user into "Reverse Polish" notation, which consists of an array in which the variables or operands to which an operator applies precede the operator.  This compilation step is only needed once, and when represented in this notation, a Boolean expression can be evaluated very efficiently.

2. Scanning search strings for duplicate variables, and constructing, for use by the evaluator function, an array containing only the unique variables.  This avoids the possibility of having to scan the data buffer multiple times for the occurrence of same variable, which would obviously be very inefficient.

The second requirement in the previous section is where it got interesting to me.  Compiling a Boolean expression into Reverse Polish notation is kind of standard.  But how to detect when an expression becomes tautological, and code that efficiently?  That was what was new to me, and figuring it out was fun.

Here are the details:

1. The Boolean expression is stored in two arrays: RPArray contains the format of the expression and varArray contains the variables.  The RPArray contains the Reverse Polish format of the Boolean expression.  Each array location is a structure that stores the type of expression element it represents, either an operator or a variable, and then, either the operator itself or the index of the variable in varArray.  If the same variable occurs multiple times in an expression, then multiple expression elements have the same index in varArray.  This is all the special handling that the occurrence of a variable more than once in an expression requires.

varArray is also an array of structures. Each structure represents a unique variable in the Boolean expression and contains the following information:

a) pVar, a pointer to the variable string.

b) detFlag, whose value is changed from 0 to 1 when the truth or falsity of the variable has been determined.

c) val, the value of the variable when its value has been determined: 1 for true and 0 for false.

d) test, which contains the current true or false (1 or 0) value that is being tested if the value of the variable has not yet been determined.

2. The evaluation of the Boolean expression proceeds as follows: The test value of all the undetermined variables is initialized to 0.  Then we enter a loop that cycles through all the possible values of the undetermined variables.  Here are the steps in the loop:

a) For each variable that is undetermined, we copy its current test value into val.

b) Then we evaluate the expression.  The result must be true or false (1 or 0) since we have assigned values to all of the variables.  The result is "anded" with a variable named "alwaysTrue", and the negation of the result is anded with "alwaysFalse".  If, at the end of the loop, one of these variables is true, then the Boolean expression has become either tautological or antitautological, in which case we need do no further calls to the evaluation function for the Web page we are in the process of reading.

c) We then increment the test values in a way that I think is pretty elegant.  We treat the sequence of test values for the undetermined variables as an odometer.  We start with the test value for the first undetermined variable and add 1 to it, but retain only the low order bit as the new test value for that variable.  If there was a carry when we added 1, we propagate it to the next undetermined variable, adding 1 to it, and retaining only the low order bit as its new test value.  And so on, until either there is no carry or all the undetermined variables have been processed.

d) The loop cycles back to step a, above, until all possible combinations of the undetermined variables have been tried.

e) The evaluation function then returns one of: "true", "false", or "undetermined".


3. Source Code

//
// Evaluate
//
// Function to evaluate the truth or falsity of the boolean expression.
//
// Author: Bob Day (bobday.nh@verizon.net)
//
// Copyright (C) November, 2002 by Bob Day.  All Rights reserved.
//
// This program may be used and distributed, with
// acknowledgment to the author, under the terms
// of the GNU General Public License.
// long CBoolExpEval::Evaluate (long* expVal, BOOLEXP_TOKEN* RPArray, VAR_STRUCT* varArray) { long alwaysTrue; long alwaysFalse; long carry; long evalArray[(C_maxTokens + 3) & ~3]; long idx; long jdx; long kdx; long keepLoopingFlag; long mdx; long rc; rc = SUCCESS; if (m_numVariables == 0) // If the user did not enter a search { // string, *expVal = 1; // declare it to be "true", and goto exit; // go return to caller. } if ( (m_numVariables == 1) // If there's just one variable and just && // one item in the RPArray... (m_RPArrayIdx == 1) ) // (m_RPArrayIdx is the number of operands // in RPArray.) { if (varArray[0].detFlag == 1) // If the value has been determined, *expVal = varArray[0].val; // that's what we return. else // Otherwise, *expVal = -1; // return "undetermined". goto exit; // go return to caller. } *expVal = -1; // The default result is "undetermined". alwaysTrue = 1; alwaysFalse = 1; keepLoopingFlag = 1; for (idx = 0; idx < m_numVariables; ++idx) { // Initialize the test values to zero. varArray[idx].test = 0; } while (keepLoopingFlag) // Loop to try all combinations of the { // undetermined variables... for (idx = 0; idx < m_numVariables; ++idx) { if (varArray[idx].detFlag == 0) // If the value is "undetermined", { // copy the test values into the value. varArray[idx].val = varArray[idx].test; } } for (kdx = -1, jdx = 0; jdx < m_RPArrayIdx; ++jdx) { // Loop to evaluate the boolean // expression for a particular // combination of variables... if (RPArray[jdx].type == 1) // If the current symbol is a variable, { evalArray[++kdx] = varArray[RPArray[jdx].varArrayIdx].val; // put the value into the evaluation // array. } else // Otherwise, if the symbol is an { // operator... switch (RPArray[jdx].oper) { case '~': // If it's "not"... if (kdx >= 0) // If there's at least one symbol in { // the evalArray, evalArray[kdx] ^= 1; // complement the preceding value. } else { error (&rc, 'F', "EVA 300", "F - Invalid value array."); goto exit; } break; case '&': // If it's "and"... if (kdx >= 1) // If there are at lease two symbols in { // the evalArray, evalArray[kdx - 1] &= evalArray[kdx]; --kdx; // replace the the two preceding values // with the value of their "and"; } else { error (&rc, 'F', "EVA 320", "F - Invalid value array."); goto exit; } break; case '|': // If it's "or"... if (kdx >= 1) // If there are at lease two symbols in { // the avalArray, evalArray[kdx - 1] |= evalArray[kdx]; --kdx; // replace the the two preceding values // with the value of their "or"; } else { error (&rc, 'F', "EVA 340", "F - Invalid value array."); goto exit; } break; } // End of switch. } // End of else if the symbol is an // operator conditional. } // End of loop to evaluate the boolean // expression. if (kdx != 0) // If the evalArray index is not at { // zero, it's an error. error (&rc, 'F', "EVA 360", "F - Inconsistent value array index."); goto exit; } alwaysTrue &= evalArray[kdx]; alwaysFalse &= (evalArray[kdx] ^ 1); keepLoopingFlag = 0; carry = 1; // Initialize "carry" to 1 so that we // add 1 to the lowest order position // and then propagate any carries. for (mdx = 0; mdx < m_numVariables; ++mdx) { // Loop to step to the next combination // of undetermined variables. if (varArray[mdx].detFlag == 0) // If the variable is undetermined... { // Increment to the next test varArray[mdx].test += carry; // combination. carry = varArray[mdx].test >> 1; varArray[mdx].test &= 1; keepLoopingFlag |= varArray[mdx].test; // When we wrap back around to zero, // we're done. if (carry == 0) break; } } } // End of loop to try all combinations // of the undetermined variables. if (alwaysTrue == 1) // Set up the return value of the *expVal = 1; // boolean expression. else if (alwaysFalse == 1) *expVal = 0; else *expVal = -1; exit: return rc; } // End of Evaluate().

Back to Home Page