Partitioning
Description
Partitioning is used when the analysis task requires dividing the source data set into training and test samples. The size of each sample is configurable, and records are assigned to the samples using the selected sampling method. The training sample records are selected first, and the remaining records are used for the test sample (this order can be changed in the handler wizard).
Ports
Input
- Input data source (data table).
Output
- Common output data set (data table). It contains all records from both samples, with an added "test set" field. A "true" value in this field means that the record was placed into the test sample, whereas a "false" value means that it was placed into the training sample.
- Training output data set (data table).
- Test output data set (data table).
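The relationship between the three outputs can be illustrated with a minimal Python sketch. It is not the component's implementation: the `partition` function, its parameters, and the use of a plain random split are assumptions made for illustration. Only the "test set" field name and the order of selection (training first, test from the remainder) come from the description above.

```python
import random


def partition(records, train_size, test_size, seed=0):
    """Split records into (common, training, test) tables.

    Hypothetical helper: training rows are drawn first, test rows are
    taken from what remains, and rows in neither sample are left out.
    """
    if train_size + test_size > len(records):
        raise ValueError("sample sizes exceed the number of source records")

    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)

    train_idx = set(order[:train_size])
    test_idx = set(order[train_size:train_size + test_size])

    # Common output: every sampled record plus the boolean "test set" field.
    common = [
        {**rec, "test set": i in test_idx}
        for i, rec in enumerate(records)
        if i in train_idx or i in test_idx
    ]
    training = [rec for i, rec in enumerate(records) if i in train_idx]
    test = [rec for i, rec in enumerate(records) if i in test_idx]
    return common, training, test


records = [{"id": i} for i in range(10)]
common, training, test = partition(records, train_size=7, test_size=3)
```

In this sketch the common table keeps the source order of the records, and the training and test tables are simply the rows whose "test set" value is false and true respectively.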
Wizard
- Login status: when the status is active, the input data can be used in the wizard. For example, data from the input data set is required for "Biased sampling".
- Total records: the number of records in the input data source table.
- Area of the row count configuration for the training and test samples.
The size of each sample can be customized. The "Method" button selects how the size is specified: as a row count or as a percentage of the source table size. The sum of the sample sizes cannot exceed the row count of the source table. If the requested training and test sizes together exceed the number of rows in the input data set, the set that is selected first (as determined by the Test set priority checkbox) is generated first, and the other set receives the remaining records.
- Sampling method:
- Random: records are selected at random from the source data set and placed into the resulting sample.
- Random uniform: all records of the source data set are divided into groups, and records are then randomly selected from each group and placed into the resulting sample. The group size is set in the method parameters.
- Stratified: all records of the source data set are divided into uniform groups (strata), and records are then randomly selected from each group and placed into the resulting sample. The fields that define the strata are selected with checkboxes in the method parameters.
- Sequence: records are selected sequentially from the source data set and placed into the resulting sample. The sizes of the sampled and unused sets are configured in the method parameters.
- Biased sampling: before processing, the number of records with the selected unique values is decreased or increased in the source data set. The coefficient is set in the "factor" field of the method parameters, next to each unique value of the selected column of the source table. The number of records for each unique value can also be entered manually.
- Test set priority (optional checkbox).
Selecting this checkbox causes the records for the test sample to be selected first; the remaining records are used for the training sample.
There are three selection modes, defined by the "Priority test set position" parameter:
- Defined by algorithm: records are selected according to the chosen sampling method.
- Start of set: the first rows of the set, taken in the same order as in the source table, are used as the test sample.
- End of set: the last rows of the set, taken in the same order as in the source table, are used as the test sample.
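Three of the wizard settings described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the component's actual logic: the function names `resolve_sizes`, `stratified_sample`, and `take_priority_test_set` are hypothetical; the rule that the set selected first keeps its requested size is one reading of the size conflict description; and drawing the same fraction from every stratum is an assumed allocation, since the text only says that records are randomly selected from each stratum.

```python
import random
from collections import defaultdict


def resolve_sizes(total, train_size, test_size, test_priority=False):
    """Cap requested sizes so that together they fit into `total` rows.

    Assumption: the set selected first keeps its requested size and the
    other set is limited to the rows that remain.
    """
    if test_priority:
        test = min(test_size, total)
        train = min(train_size, total - test)
    else:
        train = min(train_size, total)
        test = min(test_size, total - train)
    return train, test


def stratified_sample(records, strata_fields, fraction, seed=0):
    """Randomly draw the same fraction of records from every stratum."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        # Records with equal values in all strata fields share a stratum.
        key = tuple(rec[field] for field in strata_fields)
        strata[key].append(rec)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


def take_priority_test_set(records, test_size, position):
    """Return (test, remaining) for the "start" or "end" position mode."""
    if test_size > len(records):
        raise ValueError("test size exceeds the number of source records")
    if position == "start":
        return records[:test_size], records[test_size:]
    if position == "end":
        split = len(records) - test_size
        return records[split:], records[:split]
    raise ValueError("position must be 'start' or 'end'")


sizes_default = resolve_sizes(10, train_size=8, test_size=5)
sizes_test_first = resolve_sizes(10, train_size=8, test_size=5, test_priority=True)
```

With 10 source rows and requested sizes of 8 and 5, the sketch yields 8 training and 2 test rows by default, and 5 test and 5 training rows when the test set has priority. In the "start" and "end" modes the test rows keep their source order, and the training sample would then be drawn from the remaining rows by the chosen sampling method.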