What domains are applicable?
Class imbalance is a pervasive problem in many domains, especially in statistical relational learning, where the number of ground substitutions for a logical predicate grows exponentially with the number of instances its argument logical variables can take (e.g., a binary predicate over n constants has n² possible groundings), and only a few of those substitutions are true. Click here for details of the performance of this approach in relational domains.
Recall is preferred
Domains where the cost of a false negative prediction is much higher than that of a false positive. For example, in medical diagnosis a false positive may only lead to a few extra clinical tests, while a false negative could cost the patient's life. Click here for details of the performance of this approach in medical domains.
Precision is preferred
Domains where false positive predictions are more costly. For example, in recommendation systems one would rather overlook a few candidate items that could match a user (false negatives) than flood users with numerous inappropriate recommendations (false positives). Click here for details of the performance of this approach in recommendation systems.
How to use this package?
The whole package can be downloaded here.
The package includes pre-processing code for standard machine learning input data, the Soft-Margin RFGB code, and code for computing the evaluation metrics used to assess learning algorithms on class-imbalance problems.
For standard machine learning problems, use the Python function ConvertData_standard to convert a flat table into the input files that Soft-Margin RFGB takes; for relational data sets, please refer to the Mode Guide for more sophisticated designs of logical predicates.
A sample usage is shown below:
$ python ConvertData_standard.py filename=PATH/TO/YOUR/DATA/DATA.csv target=TargetVariable \
> Discretize='feature1':[threshold list],'feature2':['value', Nclass],'feature3':['quantile',Nclass]
The optional arguments are Discretize and TestRatio.
Use Discretize if you want to discretize continuous-valued variables. There are three options: (i) assign categorical values based on a given list of thresholds; (ii) categorize into N classes based on value by specifying ['value', Nclass]; (iii) discretize into N bins based on sample quantiles by specifying ['quantile', Nclass].
Use TestRatio to specify what fraction of the data should be held out as the test set. If it is not given, all samples are written into the training data files. A complete invocation combining these options is sketched below.
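For instance, an invocation might look like the following; the feature names, thresholds, and split ratio are purely illustrative placeholders, and the TestRatio=0.2 syntax assumes the same key=value form as the other arguments:

$ python ConvertData_standard.py filename=PATH/TO/YOUR/DATA/DATA.csv target=Label \
> Discretize='age':[30,50,70],'income':['value',4],'score':['quantile',5] \
> TestRatio=0.2

Here age would be cut at the given thresholds, income split into 4 value-based classes, and score split into 5 quantile bins, with 20% of the samples held out for testing.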
Run Soft-Margin RFGB
Here is a simple example of how to use SoftBoosting.
$ java -cp SoftBoosting.jar edu.wisc.cs.Boosting.RDN.RunBoostedRDN \
> -target num \
> -l -train SampleData/OutputDataForSoft-RFGB/HD/train/ \
> -i -test SampleData/OutputDataForSoft-RFGB/HD/test/ \
> -alpha 2 \
> -beta -1
The parameter alpha controls the cost of false negative examples, while beta controls the cost of false positive examples. When the parameter (alpha or beta) is set positive, it puts more weight on the misclassified positive or negative examples, respectively; when it is negative, it allows the model to be more tolerant of them. When both are zero, the method is equivalent to standard RFGB, i.e., false positives and false negatives have uniform cost.
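For instance (the paths and parameter values below are illustrative, not tuned recommendations), a recall-preferred domain such as medical diagnosis would penalize false negatives and tolerate false positives, as in the example above (-alpha 2 -beta -1), whereas a precision-preferred domain such as recommendation would swap the emphasis:

$ java -cp SoftBoosting.jar edu.wisc.cs.Boosting.RDN.RunBoostedRDN \
> -target num \
> -l -train PATH/TO/YOUR/train/ \
> -i -test PATH/TO/YOUR/test/ \
> -alpha -1 \
> -beta 2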
Standard evaluation metrics such as accuracy, area under the ROC or PR curve (AUC-ROC or AUC-PR), and F1 score weight positive and negative examples equally. In cost-sensitive learning, however, the model should identify as many cases of the important class as possible, as long as accuracy on the less important class stays within a reasonable range. To better evaluate algorithms for learning with class-imbalanced data, we employ the Fβ measure and weighted AUC-ROC.
For the Fβ measure, β controls the relative importance of precision and recall: when β > 1, Fβ is recall-dominated, while for 0 < β < 1 it is precision-dominated.
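For reference, Fβ has the standard closed form in terms of precision P and recall R:

F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}

As β grows, the β²·P term dominates the denominator and Fβ approaches R; as β approaches 0, Fβ approaches P, matching the recall- and precision-dominated regimes described above.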
For weighted AUC-ROC, we follow C. G. Weng and J. Poon, "A new evaluation measure for imbalanced datasets," in Proceedings of the Seventh Australasian Data Mining Conference (AusDM 2008), with a correction that makes the weighted AUC-ROC measure theoretically sound.