Summary of Tom M. Mitchell's "Generalization as Search" (by Christian Shelton)

This paper begins by giving examples of the wide range of abilities all grouped as learning. Mitchell then limits the scope of the paper to generalization, highlighting it as a central capability in many aspects of learning. Generalization is described here as the ability to take a set of events or descriptions and their classifications (binary in this case) and predict the classification of other events or descriptions that follow the same set of "rules" as the first set.

Definitions

A few definitions are required for this general discussion of generalization:

* instance language: the language used to represent the observations to be generalized.
* generalization language: the language used to represent the learned generalization. Note that a generalization language often will not allow all possible generalizations to be expressed.
* matching predicate: a test to determine whether a given generalization matches (classifies as positive) a given instance.
* training instance: an instance from the instance language, together with its classification, used for learning the generalization.

Generalization Problem

"Given:
1. A language in which to describe instances
2. A language in which to describe generalizations
3. A matching predicate that matches generalizations to instances
4. A set of positive and negative training instances of a target generalization to be learned
Determine: Generalizations within the provided language that are consistent with the presented training instances (i.e., plausible descriptions of the target generalization)."

Ordering and Search

"The above generalization problem is essentially a search problem. The generalization language corresponds to an hypothesis space (search space) of possible solutions, and the learning task is to examine this hypothesis space, subject to constraints imposed by the training instances, to determine plausible generalizations." An important feature of this hypothesis space is that it is partially ordered by the "more-specific-than" relation. "This partial ordering is important because it provides a powerful basis for organizing the search through the hypothesis space." There are thus three major ways of searching this space (illustrative sketches of depth-first search and the version space strategy appear after this list):

* Depth-first search: One begins with a very specific definition consistent with the first observation. Then, for each observation, the definition is made either more restrictive (for negative observations) or more relaxed (for positive observations) so that the current "working" generalization satisfies that observation. In doing so, one must look back at all of the previous observations to ensure that consistency is maintained. There are also many different ways of changing the "working" generalization; depth-first search explores these combinations through traditional backtracking, with all of its problems and advantages.

* Breadth-first search (specific to general): Here, one constructs a set S such that each s in S is consistent with the observations so far and is as specific as possible. That is, S = { s | s is a generalization that is consistent with the observed instances, and there is no generalization which is both more specific than s and consistent with the observed instances }. In such a search, S is initialized from the first instance; then, for each negative instance, inconsistent members of S are removed, and for each positive instance, the members of S are generalized and checked for consistency.

* Version space strategy: This runs one specific-to-general and one general-to-specific breadth-first search at the same time. One defines a set G similar to S, but whose members are as general as possible (p. 100). The S set is then advanced as before, and the G set is advanced in the complementary manner. The whole space of possible generalizations is thus "represented" by these two "bounding" sets, in the sense that all consistent generalizations can be recovered from them. Specifically, a generalization is represented if it is more specific than or equal to some member of G and more general than or equal to some member of S. A few constraints should be noted for this technique:
  + As with the breadth-first search, this only works when the more-specific-than and more-general-than relations can be computed by direct examination of the generalizations.
  + The technique also assumes the language has a most general and a most specific generalization, which might not always exist.
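To make these strategies concrete, here is a minimal Python sketch of the depth-first strategy over a simple conjunctive attribute language (each hypothesis is a tuple of attribute values, with '?' as a wildcard). The language, the helper names, and the assumption that the first training instance is positive are all illustrative choices, not taken from the paper:

```python
def matches(h, x):
    """Matching predicate: hypothesis h covers instance x ('?' matches anything)."""
    return all(a == '?' or a == b for a, b in zip(h, x))

def more_general_or_equal(h1, h2):
    """The partial ordering: h1 covers every instance that h2 covers."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def min_generalize(h, x):
    """Least relaxation of h that covers positive instance x (unique here)."""
    return tuple(a if a == b else '?' for a, b in zip(h, x))

def min_specialize(h, x, domains):
    """Minimal restrictions of h that exclude negative instance x."""
    for i, a in enumerate(h):
        if a == '?':
            for v in domains[i]:
                if v != x[i]:
                    yield h[:i] + (v,) + h[i + 1:]

def dfs_generalize(examples, domains):
    """Depth-first search with backtracking over minimal revisions.
    Assumes examples[0] is positive; returns one consistent hypothesis or None."""
    (x0, _), rest = examples[0], examples[1:]

    def consistent(h, seen):
        return all(matches(h, x) == label for x, label in seen)

    def search(h, i, seen):
        if i == len(rest):
            return h
        x, label = rest[i]
        seen = seen + [(x, label)]
        if matches(h, x) == label:          # working hypothesis already fits
            return search(h, i + 1, seen)
        revisions = ([min_generalize(h, x)] if label
                     else list(min_specialize(h, x, domains)))
        for h2 in revisions:                # look back at all previous instances,
            if consistent(h2, seen):        # backtracking on any dead end
                result = search(h2, i + 1, seen)
                if result is not None:
                    return result
        return None

    return search(tuple(x0), 0, [examples[0]])
```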
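The version space strategy can be sketched in the same setting, reusing the helpers above. S and G are maintained as the two boundary sets; this is again a sketch under the same illustrative assumptions (conjunctive language, first instance positive), not Mitchell's exact formulation:

```python
def candidate_elimination(examples, domains):
    """Maintain the specific boundary S and the general boundary G.
    Reuses matches(), more_general_or_equal(), min_generalize(), and
    min_specialize() from the sketch above; assumes examples[0] is positive."""
    (x0, _), rest = examples[0], examples[1:]
    S = {tuple(x0)}                       # most specific consistent hypotheses
    G = {('?',) * len(domains)}           # most general consistent hypotheses
    pos, neg = [tuple(x0)], []
    for x, label in rest:
        x = tuple(x)
        if label:
            pos.append(x)
            G = {g for g in G if matches(g, x)}        # drop inconsistent g
            S = {min_generalize(s, x) for s in S}      # relax S minimally
            S = {s for s in S
                 if not any(matches(s, n) for n in neg)
                 and any(more_general_or_equal(g, s) for g in G)}
        else:
            neg.append(x)
            S = {s for s in S if not matches(s, x)}    # drop inconsistent s
            G = {g2
                 for g in G
                 for g2 in ([g] if not matches(g, x)   # restrict G minimally
                            else min_specialize(g, x, domains))
                 if all(matches(g2, p) for p in pos)
                 and any(more_general_or_equal(g2, s) for s in S)}
            G = {g for g in G                          # keep only maximal members
                 if not any(h != g and more_general_or_equal(h, g) for h in G)}
    return S, G
```

When S and G converge to the same single hypothesis, the target generalization is completely determined, which is the first comparison point discussed in the next section.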
Considerations when Comparing Search Methods

The interesting comparisons here are not about raw efficiency but about the following capabilities:

* The "ability to detect the point at which the target generalization is completely determined by the training instances" (i.e., when only one consistent generalization remains and the target has been completely learned). This is very simple with the version space search: the generalization has been completely learned when S and G are equal and contain only one generalization.

* The ability to "use incompletely determined generalizations in a reasonable manner." The version space search also works well here. An incompletely determined generalization consists of an S and a G set that "bound" all plausible generalizations. A new instance can therefore be classified with certainty as positive if it matches every element of S, and as negative if it matches no element of G, even though the exact generalization has not (yet) been learned (a sketch follows this list). This is correct under the assumptions that all training instances were correct and that the generalization language allows a complete description of the desired generalization. Note that the breadth-first strategy can make all such unambiguous classifications for positive instances, but not for negative instances (as the G set is not computed).

* The "ability to direct the presentation of training instances to obtain informative instances." Consider the case where the program can control the training instances presented to it. It is most advantageous for the program to pick an instance that matches half of the plausible generalizations, since either answer then rules out half of them. In the version space search, this corresponds to an instance that matches half of the members of S and G; for the breadth-first search, it is one that matches half of S (a query-selection sketch also follows this list).
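A minimal sketch of classification with a partially learned version space, using matches() and the S, G boundary sets from the earlier sketches; the three-valued return convention is an illustrative choice:

```python
def classify(x, S, G):
    """Classify instance x using a partially learned version space.
    True/False are certain answers; None means the instance is ambiguous
    and more training data is needed."""
    if all(matches(s, x) for s in S):
        return True        # every plausible generalization matches x
    if not any(matches(g, x) for g in G):
        return False       # no plausible generalization matches x
    return None
```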
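And a sketch of the instance-selection heuristic: among a pool of candidate instances (a hypothetical pool supplied by the caller), pick the one whose matches split the boundary-set members most evenly:

```python
def choose_query(candidates, S, G):
    """Pick the candidate instance that comes closest to matching half of
    the members of S and G, so that either classification eliminates
    about half of the plausible generalizations."""
    members = list(S) + list(G)

    def imbalance(x):
        k = sum(matches(h, x) for h in members)
        return abs(k / len(members) - 0.5)

    return min(candidates, key=imbalance)
```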
Complexity

(Here p is the number of positive instances, n the number of negative instances, s the maximum size of the set S, and g the maximum size of the set G.)

Processing time:
* Depth-first: O(pn)
* Breadth-first: O(spn + s²p)
* Version space: O(sg(p+n) + s²p + g²n)

Storage space:
* Depth-first: O(p + n)
* Breadth-first: O(s + n)
* Version space: O(s + g)

Other Generalization Strategies

These are non-data-driven methods (the above three were all data-driven):

* Generate-and-test: a class of strategies that produce hypotheses independently of the given data and then test them against all of the data (a brute-force sketch appears at the end of these notes).
* Statistical pattern recognition: a class of strategies that represent instances as points in an n-dimensional space and learned concepts as decision surfaces in that space. These methods tend to do better than the search methods when there are errors in the training set.

Further Issues

The following are open problems in machine learning:

* The generalization language: a significant problem is choosing a generalization language in which the target generalization is both possible to describe and easy to find. The choice also influences the resources required to search the space: if the language is chosen poorly, matching generalizations to instances can take a lot of computing time, and the parameters of the search space can change drastically. The choice of generalization language can dictate whether the space is narrow and deep or wide and shallow.
* Using prior knowledge/expectations: one way is to impose constraints on the generalization language; another is to build them into the hypothesis generator in generate-and-test strategies. This can be important in limiting the search space.
* Inconsistency: if "(1) the generalization language is insufficient to describe the target generalization, or (2) the training instances contain errors," then no target generalization can be found, and the search methods given cannot tell which of the two cases caused the failure (a small demonstration follows this list). Dealing gracefully with this kind of problem can be very important for practical applications.
* Partially learned generalizations: it is important to be able to use partially learned generalizations, as it is "unlikely in realistic applications that sufficient training data will be available to fully determine every needed generalization."
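The two failure cases really are indistinguishable at the point of failure, as a small demonstration with the candidate-elimination sketch above shows. The single-attribute language and the disjunctive target concept ("a or b") are illustrative choices: the conjunctive language cannot express the disjunction, so even error-free data empties both boundary sets, exactly as noisy data would:

```python
# Target concept: "the attribute is a or b" -- a disjunction that the
# conjunctive language above cannot express.
domains = [['a', 'b', 'c']]
examples = [(('a',), True),    # forces S to generalize to ('?',)
            (('b',), True),
            (('c',), False)]   # ('?',) matches this negative, so S empties

S, G = candidate_elimination(examples, domains)
print(S, G)   # set() set(): the version space has collapsed, but nothing
              # here reveals whether the language or the data was at fault
```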
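Finally, returning to the generate-and-test strategy mentioned above: in the same illustrative conjunctive language (reusing matches() from the first sketch), a brute-force version simply enumerates the whole hypothesis space up front and keeps whatever survives the data:

```python
from itertools import product

def generate_and_test(examples, domains):
    """Enumerate every conjunctive hypothesis and keep those consistent
    with every training instance; hypotheses are produced independently
    of the data and only then tested against all of it."""
    space = product(*[list(d) + ['?'] for d in domains])
    return [h for h in space
            if all(matches(h, tuple(x)) == label for x, label in examples)]
```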