perl-Algorithm-DecisionTree

Perl module for decision-tree based classification of

*Algorithm::DecisionTree* is a _perl5_ module for constructing a decision tree from a training data file containing multidimensional data. In one form or another, decision trees have been around for about fifty years. From a statistical perspective, they are closely related to classification and regression by recursive partitioning of multidimensional data. Early work that demonstrated the usefulness of such partitioning for classification and regression can be traced to Terry Therneau in the early 1980s in the statistics community, and to Ross Quinlan in the mid 1990s in the machine learning community.

For those not familiar with decision tree ideas, the traditional way to classify multidimensional data is to start with a feature space whose dimensionality is the same as that of the data. Each feature in this space corresponds to the attribute that each dimension of the data measures. You then use the training data to carve up the feature space into different regions, each corresponding to a different class. Subsequently, when you try to classify a new data sample, you locate it in the feature space and find the class label of the region to which it belongs. One can also give the new data point the same class label as that of the nearest training sample. This is referred to as nearest-neighbor classification. There exist hundreds of variations of varying power on these two basic approaches to the classification of multidimensional data.

A decision tree classifier works differently. When you construct a decision tree, you select for the root node a feature test that partitions the training data in a way that causes maximal disambiguation of the class labels associated with the data. In terms of information content as measured by entropy, such a feature test causes the maximum reduction in class entropy in going from all of the training data taken together to the data as partitioned by the feature test.
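
The entropy-based choice of a root feature test described above can be illustrated with a small sketch. It is written in Python for brevity and is not the module's own code: the function names, the toy samples, and the feature names `outlook` and `windy` are all hypothetical, and only symbolic features are handled.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def entropy_after_split(samples, labels, feature):
    """Weighted average class entropy after partitioning on a symbolic feature."""
    partitions = {}
    for sample, label in zip(samples, labels):
        # One partition per distinct value of the feature.
        partitions.setdefault(sample[feature], []).append(label)
    total = len(labels)
    return sum(len(part) / total * class_entropy(part) for part in partitions.values())

def best_symbolic_feature(samples, labels):
    """Return the feature with the largest entropy reduction, plus all reductions."""
    base = class_entropy(labels)
    gains = {f: base - entropy_after_split(samples, labels, f) for f in samples[0]}
    return max(gains, key=gains.get), gains

# Hypothetical toy training data: the class depends only on "outlook".
samples = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

best, gains = best_symbolic_feature(samples, labels)
print(best, gains)
```

In this toy data, splitting on `outlook` separates the two classes completely, while splitting on `windy` leaves the class entropy unchanged, so `outlook` would be chosen for the root.
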
You then drop from the root node a set of child nodes, one for each partition of the training data created by the feature test at the root node. When your features are purely symbolic, you'll have one child node for each value of the feature chosen for the feature test at the root. When the test at the root involves a numeric feature, you find the decision threshold for the feature that best bipartitions the data and you drop from the root node two child nodes, one for each partition. At each child node you then pose the same question that you posed when you found the best feature to use at the root: which feature at the child node in question would maximally disambiguate the class labels associated with the training data corresponding to that child node?

As the reader would expect, the two key steps in any approach to decision-tree based classification are constructing the decision tree from a file containing the training data, and then using the resulting tree to classify new data. What is cool about decision tree classification is that it gives you soft classification, meaning it may associate more than one class label with a given data vector. When this happens, it may mean that your classes are indeed overlapping in the underlying feature space. It could also mean that you simply have not supplied sufficient training data to the decision tree classifier.

For a tutorial introduction to how a decision tree is constructed and used, visit <a href="https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf">https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf</a>.

This module also allows you to generate your own synthetic training and test data. Generating your own training data, using it to construct a decision-tree classifier, and then testing the classifier on a synthetically generated test set is a good way to develop greater proficiency with decision trees.
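
The numeric-feature case mentioned above, where a single decision threshold bipartitions the data, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the module's implementation: candidate thresholds are assumed to be the midpoints between consecutive distinct feature values, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) for the best two-way split.

    Assumption: candidates are midpoints between consecutive distinct values.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        # Weighted average of the class entropies of the two partitions.
        score = (len(left) * class_entropy(left)
                 + len(right) * class_entropy(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: small values are "benign", large ones "malignant".
values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

threshold, score = best_threshold(values, labels)
print(threshold, score)
```

Here the threshold halfway between 3.0 and 8.0 separates the two classes completely, so the two child nodes dropped from this test would each hold samples of a single class. With overlapping classes, no threshold would reach zero entropy, and the leaf nodes would carry more than one class label, which is the soft classification described above.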

There is no official package available for openSUSE Leap 15.3

Distributions

openSUSE Tumbleweed