Class imbalance is a common problem in many domains, especially in statistical relational learning, where the number of ground substitutions for a logical predicate is exponential in the number of instances of its logical variables and only a few of those substitutions are true.

In some domains, a false negative costs far more than a false positive. For example, in medical diagnosis, a false positive may just lead to a few more clinical tests, while a false negative could cost the patient’s life.

In other domains, a false positive is the more costly error. For example, in recommendation systems, one would rather overlook some candidate items that could match the users (false negatives) than send users numerous spam emails with inappropriate recommendations (false positives).

The package includes pre-processing code for standard machine learning input data, the Soft-Margin RFGB code, and code that computes evaluation metrics suited to class-imbalance problems.

For standard machine learning problems, use the Python script **ConvertData_standard** to convert a flat table into the input files that Soft-Margin RFGB accepts; for relational data sets, please refer to the __Mode Guide__ for more sophisticated designs of logic predicates.

$ python ConvertData_standard.py filename=PATH/TO/YOUR/DATA/DATA.csv target=TargetVariable \

> Discretize='feature1':[threshold list],'feature2':['value', Nclass],'feature3':['quantile',Nclass] \

> TestRatio=0.1

The optional arguments are **Discretize** and **TestRatio**.

Use **Discretize** to discretize continuous-valued variables. There are three options: **i.** assign categorical values based on thresholds given as a list; **ii.** categorize into N classes based on values by specifying ['value', Nclass]; **iii.** discretize into N bins based on sample quantiles by specifying ['quantile', Nclass].
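The three discretization options can be illustrated with a minimal sketch. This is not the code inside ConvertData_standard; the function names are hypothetical, and option ii is assumed here to mean equal-width bins over the observed value range.

```python
from bisect import bisect_right


def by_thresholds(values, thresholds):
    """Option i: the bin index is the number of thresholds <= value."""
    cuts = sorted(thresholds)
    return [bisect_right(cuts, v) for v in values]


def by_value(values, n_class):
    """Option ii (assumption): n_class equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_class
    # The maximum value would land in bin n_class, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_class - 1) for v in values]


def by_quantile(values, n_class):
    """Option iii: cut points at sample quantiles, giving roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(k * n) // n_class] for k in range(1, n_class)]
    return [bisect_right(cuts, v) for v in values]


ages = [22, 35, 41, 58, 63, 70]
print(by_thresholds(ages, [30, 60]))  # [0, 1, 1, 1, 2, 2]
print(by_value(ages, 3))              # [0, 0, 1, 2, 2, 2]
print(by_quantile(ages, 3))           # [0, 0, 1, 1, 2, 2]
```

Threshold and value-based bins can be very uneven in size on skewed features, whereas quantile bins keep the counts roughly balanced.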

Use **TestRatio** to specify how the data are split into training and test sets. If it is not set, all samples are written to the training data files.

Here is a simple example of how to use SoftBoosting.

$ java -cp SoftBoosting.jar edu.wisc.cs.Boosting.RDN.RunBoostedRDN \

> -target num \

> -l -train SampleData/OutputDataForSoft-RFGB/HD/train/ \

> -i -test SampleData/OutputDataForSoft-RFGB/HD/test/ \

> -alpha 2 \

> -beta -1

The parameter **alpha** controls the cost of false negatives, while **beta** controls the cost of false positives. When a parameter is positive, the model puts more weight on the misclassified samples of the corresponding class (positive samples for alpha, negative samples for beta), whereas when it is negative, the model is more tolerant of those misclassifications. When both are zero, the method is equivalent to standard RFGB, i.e. false positives and false negatives have uniform cost.
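To make the effect of the two parameters concrete, here is a hedged sketch of one way such costs can enter a functional-gradient update. The formula is an assumption (a cost-augmented log-likelihood gradient), not code taken from SoftBoosting.jar, and the function name is hypothetical. It reduces to the standard gradient I(y=1) - P(y=1) when both parameters are zero.

```python
import math


def soft_margin_gradient(is_positive, p_pos, alpha=0.0, beta=0.0):
    """Per-example gradient under an assumed cost-augmented objective.

    p_pos is the model's current probability that the example is positive.
    alpha scales the cost of missing a positive (false negative);
    beta scales the cost of accepting a negative (false positive).
    """
    p_neg = 1.0 - p_pos
    if is_positive:
        # The cost is attached to predicting 0 for a true positive.
        return 1.0 - p_pos / (p_pos + p_neg * math.exp(alpha))
    # The cost is attached to predicting 1 for a true negative.
    weighted = p_pos * math.exp(beta)
    return -weighted / (weighted + p_neg)


# alpha = beta = 0 recovers the standard gradient I(y=1) - P(y=1).
print(soft_margin_gradient(True, 0.5), soft_margin_gradient(False, 0.5))
# alpha = 2 enlarges the push on positives; beta = -1 softens it on negatives.
print(soft_margin_gradient(True, 0.5, alpha=2.0))
print(soft_margin_gradient(False, 0.5, beta=-1.0))
```

Under this assumed form, a positive alpha increases the gradient magnitude for positive examples the model is unsure about, and a negative beta shrinks it for negative examples, matching the behavior described above.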

Standard evaluation metrics for prediction performance include accuracy, the area under the ROC or PR curve (AUC-ROC or AUC-PR), the F1 score, etc., which weight positive and negative examples equally. In cost-sensitive learning, however, the model should identify as many important cases as possible as long as the accuracy on the less important class stays within a reasonable range. To better evaluate the performance of different algorithms for learning with class-imbalanced data, we employ the F_{β} measure and weighted AUC-ROC.

For the F_{β} measure, β controls the relative importance of precision and recall. When β > 1, F_{β} is recall dominated; when 0 < β < 1, F_{β} is precision dominated.
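As an illustration, the standard definition F_{β} = (1 + β²)·P·R / (β²·P + R), with P the precision and R the recall, can be computed as below. The function name is hypothetical and this is not the package's own evaluation code.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall (standard definition)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# A classifier with low precision but perfect recall:
p, r = 0.5, 1.0
print(round(f_beta(p, r, beta=2.0), 4))  # 0.8333, recall dominated
print(round(f_beta(p, r, beta=1.0), 4))  # 0.6667, the usual F1
print(round(f_beta(p, r, beta=0.5), 4))  # 0.5556, precision dominated
```

The same predictions score higher as β grows because this classifier's strength is recall, which is the behavior wanted when false negatives are the costly error.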

For weighted AUC-ROC, we follow C. G. Weng and J. Poon, “A new evaluation measure for imbalanced datasets,” in Seventh Australasian Data Mining Conference (AusDM 2008), with a correction that makes the weighted AUC-ROC measure theoretically sound.