subgroups.algorithms.subgroup_sets package

Submodules

subgroups.algorithms.subgroup_sets.bsd module

This file contains the implementation of the BSD algorithm.

class subgroups.algorithms.subgroup_sets.bsd.BSD(min_support, quality_measure, optimistic_estimate, num_subgroups, max_depth, additional_parameters_for_the_quality_measure={}, additional_parameters_for_the_optimistic_estimate={}, write_results_in_file=False, file_path=None)[source]

Bases: Algorithm

This class represents the BSD algorithm.

Parameters:
  • min_support (typing.Union[int, float]) – Minimum support threshold, given as an absolute count (NUMBER OF TIMES, NOT A PROPORTION).

  • quality_measure (subgroups.quality_measures.quality_measure.QualityMeasure) – Specific quality measure to use for the final subgroups.

  • optimistic_estimate (subgroups.quality_measures.quality_measure.QualityMeasure) – Optimistic estimate of the quality measure.

  • num_subgroups (int) – maximum number of subgroups used to calculate the pruning threshold.

  • max_depth (int) – maximum depth of the search.

  • additional_parameters_for_the_quality_measure (dict[str, typing.Union[int, float]]) – if the quality measure passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.

  • additional_parameters_for_the_optimistic_estimate (dict[str, typing.Union[int, float]]) – if the optimistic estimate passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.

  • write_results_in_file (bool) – whether the results obtained will be written in a file. By default, False.

  • file_path (typing.Optional[str]) – if ‘write_results_in_file’ is True, path of the file in which the results will be written.
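
The role of num_subgroups in pruning can be pictured with a generic top-k scheme. The sketch below is only an illustration under assumptions, not the library's code: the helper names (update_top_k, prune_threshold, can_prune) are hypothetical, and the k best qualities seen so far are kept in a min-heap whose smallest element acts as the pruning threshold.

```python
import heapq


def update_top_k(top_k, quality, description, k):
    """Keep the k best (quality, description) pairs in a min-heap (hypothetical helper)."""
    if len(top_k) < k:
        heapq.heappush(top_k, (quality, description))
    elif quality > top_k[0][0]:
        # The new subgroup beats the worst of the current top k, so it replaces it.
        heapq.heapreplace(top_k, (quality, description))


def prune_threshold(top_k, k):
    """Lowest quality among the k best so far, or -inf while fewer than k are known."""
    return top_k[0][0] if len(top_k) >= k else float("-inf")


def can_prune(optimistic_estimate_value, top_k, k):
    """A candidate is pruned when even its optimistic estimate is below the threshold."""
    return optimistic_estimate_value < prune_threshold(top_k, k)


top_k = []
for quality, description in [(0.10, "a"), (0.30, "b"), (0.20, "c"), (0.05, "d")]:
    update_top_k(top_k, quality, description, k=2)

threshold = prune_threshold(top_k, k=2)  # 0.20: the worse of the two best subgroups
```

With k=2, a candidate whose optimistic estimate is 0.15 would be pruned here, while one with 0.25 would still be explored.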

fit(pandas_dataframe, tuple_target_attribute_value)[source]

Method to run the BSD algorithm and generate subgroups.

Parameters:
  • pandas_dataframe (pandas.DataFrame) – Input dataset. It is VERY IMPORTANT to respect the following conditions: (1) the dataset must be a pandas dataframe, (2) the dataset must not contain missing values, (3) for each attribute, all its values must be of the same type.

  • tuple_target_attribute_value (tuple) – Tuple with the name of the target attribute (first element) and with the value of this attribute (second element). EXAMPLE1: (“age”, 25). EXAMPLE2: (“class”, “Setosa”). It is VERY IMPORTANT to respect the following conditions: (1) the name of the target attribute MUST be a string, (2) the name of the target attribute MUST exist in the dataset, (3) the value in the tuple (second element) MUST BE comparable with the values of the corresponding attribute in the dataset, (4) the value of the target attribute MUST exist in the dataset.

Return type:

list

Returns:

a list of tuples with the best subgroups and their quality measures.
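
The conditions listed for fit can be checked before the call. The following is a minimal sketch under assumptions: a list of dicts stands in for the pandas DataFrame so that the example runs without pandas, and check_dataset and is_missing are hypothetical helpers that are not part of the library.

```python
def is_missing(value):
    """Treat None and float NaN as missing (an assumption made for this sketch)."""
    return value is None or (isinstance(value, float) and value != value)


def check_dataset(rows, target):
    """Return the list of violated conditions (empty when the input looks valid)."""
    attribute, value = target
    problems = []
    if not isinstance(attribute, str):
        problems.append("target attribute name must be a string")
    if not rows or attribute not in rows[0]:
        problems.append("target attribute not found in the dataset")
        return problems
    for column in rows[0]:
        values = [row[column] for row in rows]
        if any(is_missing(v) for v in values):
            problems.append(f"column {column!r} has missing values")
        if len({type(v) for v in values if not is_missing(v)}) > 1:
            problems.append(f"column {column!r} mixes value types")
    if value not in [row[attribute] for row in rows]:
        problems.append("target value does not occur in the dataset")
    return problems


rows = [{"age": 25, "class": "Setosa"}, {"age": 31, "class": "Versicolor"}]
ok = check_dataset(rows, ("class", "Setosa"))            # no problems
missing_value = check_dataset(rows, ("class", "Virginica"))

bad_rows = [{"age": 25, "class": "Setosa"}, {"age": "31", "class": None}]
bad = check_dataset(bad_rows, ("class", "Setosa"))
```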

property max_depth: int

The maximum depth of the search.

property minimum_support: int | float

The minimum support threshold.
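
Because the threshold is an absolute count rather than a proportion, a relative support has to be converted before it is passed in. A possible conversion is sketched below; support_count is a hypothetical helper, and rounding up is an assumption about the intended semantics.

```python
import math


def support_count(relative_support, num_rows):
    """Smallest absolute count covering at least the given fraction of the rows."""
    if not 0 <= relative_support <= 1:
        raise ValueError("relative_support must lie in [0, 1]")
    # round() absorbs floating-point noise before the value is rounded up.
    return math.ceil(round(relative_support * num_rows, 9))


count_a = support_count(0.25, 200)  # 50
count_b = support_count(0.5, 7)     # 4, because 3.5 is rounded up
```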

property num_subgroups: int

The maximum number of subgroups used to calculate the pruning threshold.

property quality_measure: QualityMeasure

The quality measure used to evaluate the subgroups.

property selected_subgroups: int

The number of selected subgroups.

property unselected_subgroups: int

The number of pruned subgroups.

property visited_subgroups: int

The number of visited subgroups.

subgroups.algorithms.subgroup_sets.cbsd module

This file contains the implementation of the CBSD algorithm.

class subgroups.algorithms.subgroup_sets.cbsd.CBSD(min_support, quality_measure, optimistic_estimate, num_subgroups, max_depth, additional_parameters_for_the_quality_measure={}, additional_parameters_for_the_optimistic_estimate={}, write_results_in_file=False, file_path=None)[source]

Bases: BSD

subgroups.algorithms.subgroup_sets.cpbsd module

This file contains the implementation of the CPBSD algorithm.

class subgroups.algorithms.subgroup_sets.cpbsd.CPBSD(min_support, quality_measure, optimistic_estimate, num_subgroups, max_depth, additional_parameters_for_the_quality_measure={}, additional_parameters_for_the_optimistic_estimate={}, write_results_in_file=False, file_path=None)[source]

Bases: BSD

subgroups.algorithms.subgroup_sets.qfinder module

This file contains the implementation of the QFinder algorithm.

class subgroups.algorithms.subgroup_sets.qfinder.QFinder(num_subgroups, cats=-1, max_complexity=-1, coverage_thld=0.1, or_thld=1.2, p_val_thld=0.05, abs_contribution_thld=0.2, contribution_thld=5, delta=0.2, write_results_in_file=False, file_path=None)[source]

Bases: Algorithm

This class represents the QFinder algorithm.

Parameters:
  • cats (int) – the maximum number of values for each column. If there are more values, the most frequent ones are taken. If this value is -1, all the values are taken.

  • max_complexity (int) – the maximum complexity (length) of the patterns.

  • coverage_thld (float) – the minimum coverage threshold.

  • or_thld (float) – the minimum odds ratio and adjusted odds ratio threshold.

  • p_val_thld (float) – the maximum p-value threshold. This threshold is used for p-values corrected for confounders and adjusted p-values.

  • abs_contribution_thld (float) – the minimum absolute contribution threshold.

  • contribution_thld (float) – the minimum contribution ratio threshold.

  • write_results_in_file (bool) – if True, the results will be written in a file.

  • file_path (typing.Optional[str]) – the path of the file where the results will be written.

  • delta (float) – minimum delta to consider that a subgroup has a higher effect size.

  • num_subgroups (int) – the number of top subgroups to return.
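
The effect of cats on a single column can be sketched as follows. This is only an illustration: keep_most_frequent is a hypothetical helper, and how the library breaks ties between equally frequent values is not stated in this documentation.

```python
from collections import Counter


def keep_most_frequent(values, cats=-1):
    """Return the set of values kept for one column."""
    counts = Counter(values)
    if cats == -1:
        return set(counts)  # -1 keeps every value
    return {value for value, _ in counts.most_common(cats)}


column = ["a", "a", "a", "b", "b", "c"]
kept_two = keep_most_frequent(column, cats=2)  # {"a", "b"}
kept_all = keep_most_frequent(column)          # {"a", "b", "c"}
```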

fit(pandas_dataframe, tuple_target_attribute_value)[source]

Main method to run the QFinder algorithm. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

Parameters:
  • pandas_dataframe – the DataFrame which is scanned. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

  • tuple_target_attribute_value – a tuple with 2 elements: the target attribute name and the target value.

Return type:

None
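
The coverage_thld and or_thld parameters filter patterns by coverage and odds ratio. The sketch below uses the textbook definitions on a 2x2 contingency table, which may differ in detail from what the library computes (for instance, it applies no adjustment for confounders). The function names are hypothetical, and tp, fp, TP and FP are read as the positives and negatives inside the pattern and in the whole dataset.

```python
def coverage(tp, fp, TP, FP):
    """Fraction of all rows covered by the pattern."""
    return (tp + fp) / (TP + FP)


def odds_ratio(tp, fp, TP, FP):
    """Odds of the target inside the pattern divided by the odds outside it."""
    numerator = tp * (FP - fp)
    denominator = fp * (TP - tp)
    if denominator == 0:
        return float("inf") if numerator > 0 else float("nan")
    return numerator / denominator


def passes_thresholds(tp, fp, TP, FP, coverage_thld=0.1, or_thld=1.2):
    """Both thresholds are treated as inclusive minimums (an assumption)."""
    return (coverage(tp, fp, TP, FP) >= coverage_thld
            and odds_ratio(tp, fp, TP, FP) >= or_thld)


cov = coverage(30, 10, 50, 150)            # 40 / 200
ratio = odds_ratio(30, 10, 50, 150)        # (30 * 140) / (10 * 20)
kept = passes_thresholds(30, 10, 50, 150)
dropped = passes_thresholds(5, 5, 50, 150)  # coverage is only 0.05
```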

property selected_subgroups: int

The number of selected subgroups.

test_subgroups(test_dataframe, tuple_target_attribute_value, write_to_file=False, file_path=None)[source]

Method to test the best subgroups on a different dataset. This method can only be called after the fit method.

Parameters:
  • test_dataframe (pandas.core.frame.DataFrame) – the DataFrame which is scanned. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

  • tuple_target_attribute_value – a tuple with 2 elements: the target attribute name and the target value.

Returns:

a dictionary with the credibility measures for each subgroup.

property top_patterns: list[Pattern]

The list of the selected patterns.

property unselected_subgroups: int

The number of unselected subgroups.

property visited_subgroups: int

The number of visited subgroups.

subgroups.algorithms.subgroup_sets.sdmap module

This file contains the implementation of the SDMap algorithm.

class subgroups.algorithms.subgroup_sets.sdmap.SDMap(quality_measure, minimum_quality_measure_value, minimum_tp=None, minimum_fp=None, minimum_n=None, additional_parameters_for_the_quality_measure={}, write_results_in_file=False, file_path=None)[source]

Bases: Algorithm

This class represents the SDMap algorithm. Two threshold types can be used: (1) the true positives tp and the false positives fp separately, or (2) the subgroup description size n (n = tp + fp). The two types are mutually exclusive: (1) if ‘minimum_tp’ and ‘minimum_fp’ have a value of type ‘int’, ‘minimum_n’ must be None; and (2) if ‘minimum_n’ has a value of type ‘int’, ‘minimum_tp’ and ‘minimum_fp’ must be None.
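
The rule for combining the thresholds can be written as a small check. This is a sketch of the rule as stated above, not the validation performed by the class itself; check_thresholds is a hypothetical helper.

```python
def check_thresholds(minimum_tp=None, minimum_fp=None, minimum_n=None):
    """Return which threshold type is in use: 'tp_fp' or 'n'."""
    uses_tp_fp = isinstance(minimum_tp, int) and isinstance(minimum_fp, int)
    uses_n = isinstance(minimum_n, int)
    if uses_tp_fp and minimum_n is None:
        return "tp_fp"
    if uses_n and minimum_tp is None and minimum_fp is None:
        return "n"
    raise ValueError(
        "use either minimum_tp together with minimum_fp, or minimum_n alone"
    )


kind_a = check_thresholds(minimum_tp=5, minimum_fp=0)  # "tp_fp"
kind_b = check_thresholds(minimum_n=10)                # "n"
```

Mixing both types, for example minimum_tp=5 together with minimum_n=10, raises ValueError in this sketch.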

Parameters:
  • quality_measure (subgroups.quality_measures.quality_measure.QualityMeasure) – the quality measure which is used.

  • minimum_quality_measure_value (typing.Union[int, float]) – the minimum quality measure value threshold.

  • minimum_tp (typing.Optional[int]) – the minimum true positives (tp) threshold.

  • minimum_fp (typing.Optional[int]) – the minimum false positives (fp) threshold.

  • minimum_n (typing.Optional[int]) – the minimum subgroup description size (n) threshold.

  • additional_parameters_for_the_quality_measure (dict[str, typing.Union[int, float]]) – if the quality measure passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.

  • write_results_in_file (bool) – whether the results obtained will be written in a file. By default, False.

  • file_path (typing.Optional[str]) – if ‘write_results_in_file’ is True, path of the file in which the results will be written.
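
To make the roles of tp, fp, TP, FP and the additional parameters concrete, the sketch below computes WRAcc and a generalized variant with one extra parameter. It assumes the usual convention that tp and fp count the positives and negatives covered by a subgroup, while TP and FP count them in the whole dataset. The functions and the parameter name a are illustrative and are not the library's QualityMeasure classes.

```python
def wracc(tp, fp, TP, FP):
    """Weighted Relative Accuracy: (n / N) * (tp / n - TP / N)."""
    n, N = tp + fp, TP + FP
    if n == 0:
        return 0.0
    return (n / N) * (tp / n - TP / N)


def generalized_measure(tp, fp, TP, FP, a):
    """Same shape as WRAcc, but the size term is raised to an extra parameter a."""
    n, N = tp + fp, TP + FP
    if n == 0:
        return 0.0
    return (n / N) ** a * (tp / n - TP / N)


# Extra parameters travel in a dict, in the spirit of
# additional_parameters_for_the_quality_measure.
additional_parameters = {"a": 1.0}

q = wracc(30, 10, 50, 150)  # 0.2 * (0.75 - 0.25)
q_general = generalized_measure(30, 10, 50, 150, **additional_parameters)
```

With a = 1.0 the generalized measure coincides with WRAcc; other values of a weight the subgroup size differently.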

property additional_parameters_for_the_quality_measure: dict[str, int | float]

The additional needed parameters with which to compute the quality measure.

fit(pandas_dataframe, target)[source]

Main method to run the SDMap algorithm. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

Parameters:
  • pandas_dataframe (pandas.core.frame.DataFrame) – the DataFrame which is scanned. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

  • target (tuple[str, str]) – a tuple with 2 elements: the target attribute name and the target value.

Return type:

None

property minimum_fp: int | None

The minimum false positives (fp) threshold.

property minimum_n: int | None

The minimum subgroup description size (n) threshold.

property minimum_quality_measure_value: int | float

The minimum quality measure value threshold.

property minimum_tp: int | None

The minimum true positives (tp) threshold.

property quality_measure: QualityMeasure

The quality measure which is used.

property selected_subgroups: int

Number of selected subgroups after executing the SDMap algorithm (before executing the ‘fit’ method, this attribute is 0).

property unselected_subgroups: int

Number of unselected subgroups after executing the SDMap algorithm (before executing the ‘fit’ method, this attribute is 0).

property visited_nodes: int

Number of visited nodes after executing the SDMap algorithm (before executing the ‘fit’ method, this attribute is 0).

subgroups.algorithms.subgroup_sets.sdmapstar module

This file contains the implementation of the SDMapStar algorithm.

class subgroups.algorithms.subgroup_sets.sdmapstar.SDMapStar(quality_measure, optimistic_estimate, minimum_quality_measure_value, minimum_tp=None, minimum_fp=None, minimum_n=None, additional_parameters_for_the_quality_measure={}, additional_parameters_for_the_optimistic_estimate={}, write_results_in_file=False, file_path=None, num_subgroups=0)[source]

Bases: Algorithm

This class represents the SDMapStar algorithm.

Parameters:
  • quality_measure (subgroups.quality_measures.quality_measure.QualityMeasure) – the quality measure which is used.

  • optimistic_estimate (subgroups.quality_measures.quality_measure.QualityMeasure) – the optimistic estimate of the quality measure which is used.

  • minimum_quality_measure_value (typing.Union[int, float]) – the minimum quality measure value threshold.

  • minimum_tp (typing.Optional[int]) – the minimum true positives (tp) threshold.

  • minimum_fp (typing.Optional[int]) – the minimum false positives (fp) threshold.

  • minimum_n (typing.Optional[int]) – the minimum subgroup description size (n) threshold.

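  • additional_parameters_for_the_quality_measure (dict[str, typing.Union[int, float]]) – if the quality measure passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.

  • additional_parameters_for_the_optimistic_estimate (dict[str, typing.Union[int, float]]) – if the optimistic estimate passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.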

  • write_results_in_file (bool) – whether the results obtained will be written in a file. By default, False.

  • file_path (typing.Optional[str]) – if ‘write_results_in_file’ is True, path of the file in which the results will be written.

  • num_subgroups (int) – the number of subgroups used to prune the search space. By default, 0, which is equivalent to using the SDMap algorithm.
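
How an optimistic estimate allows pruning can be illustrated with WRAcc. The bound below follows from the WRAcc formula: a refinement can at best keep all tp positives and drop every negative. It is a generic sketch, not necessarily the optimistic estimate shipped with the library, and the function names are hypothetical.

```python
def wracc(tp, fp, TP, FP):
    """Weighted Relative Accuracy: (n / N) * (tp / n - TP / N)."""
    n, N = tp + fp, TP + FP
    if n == 0:
        return 0.0
    return (n / N) * (tp / n - TP / N)


def wracc_upper_bound(tp, fp, TP, FP):
    """Best WRAcc any refinement could reach: all tp positives, no negatives."""
    N = TP + FP
    return (tp / N) * (1 - TP / N)


def should_expand(tp, fp, TP, FP, minimum_quality):
    """Refine a subgroup only if its bound can still reach the quality threshold."""
    return wracc_upper_bound(tp, fp, TP, FP) >= minimum_quality


bound = wracc_upper_bound(30, 10, 50, 150)          # 0.15 * 0.75
expand = should_expand(30, 10, 50, 150, 0.05)       # bound is above 0.05
skip = not should_expand(4, 40, 50, 150, 0.05)      # bound is 0.02 * 0.75
```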

property additional_parameters_for_the_quality_measure: dict[str, int | float]

The additional needed parameters with which to compute the quality measure.

property conditional_pruned_branches: int

The number of conditional pruned branches.

fit(pandas_dataframe, target)[source]

Main method to run the SDMapStar algorithm. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

Parameters:
  • pandas_dataframe (pandas.core.frame.DataFrame) – the DataFrame which is scanned. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported yet.

  • target (tuple[str, str]) – a tuple with 2 elements: the target attribute name and the target value.

Return type:

None

property k_subgroups: list

The list of the k subgroups used to prune.

property minimum_fp: int | None

The minimum false positives (fp) threshold.

property minimum_n: int | None

The minimum subgroup description size (n) threshold.

property minimum_quality_measure_value: int | float

The minimum quality measure value threshold.

property minimum_tp: int | None

The minimum true positives (tp) threshold.

property num_subgroups: int

The maximum number of subgroups in ‘k_subgroups’.

property optimistic_estimate: QualityMeasure

The optimistic estimate of the quality measure which is used.

property pruned_subgroups: int

The number of pruned subgroups because of the top k threshold.

property quality_measure: QualityMeasure

The quality measure which is used.

property selected_subgroups: int

Number of selected subgroups after executing the SDMapStar algorithm (before executing the ‘fit’ method, this attribute is 0).

property unselected_subgroups: int

Number of unselected subgroups after executing the SDMapStar algorithm (before executing the ‘fit’ method, this attribute is 0).

property visited_nodes: int

Number of visited nodes after executing the SDMapStar algorithm (before executing the ‘fit’ method, this attribute is 0).

subgroups.algorithms.subgroup_sets.vlsd module

This file contains the implementation of the VLSD algorithm.

class subgroups.algorithms.subgroup_sets.vlsd.VLSD(quality_measure, q_minimum_threshold, optimistic_estimate, oe_minimum_threshold, additional_parameters_for_the_quality_measure={}, additional_parameters_for_the_optimistic_estimate={}, sort_criterion_in_s1='no-order', sort_criterion_in_other_sizes='no-order', vertical_lists_implementation='bitsets', write_results_in_file=False, file_path=None)[source]

Bases: Algorithm

This class represents the VLSD algorithm.

Parameters:
  • quality_measure (subgroups.quality_measures.quality_measure.QualityMeasure) – the quality measure which is used.

  • q_minimum_threshold (typing.Union[int, float]) – the minimum quality threshold for the quality measure.

  • optimistic_estimate (subgroups.quality_measures.quality_measure.QualityMeasure) – the optimistic estimate of the quality measure which is used.

  • oe_minimum_threshold (typing.Union[int, float]) – the minimum quality threshold for the optimistic estimate.

  • additional_parameters_for_the_quality_measure (dict[str, typing.Union[int, float]]) – if the quality measure passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.

  • additional_parameters_for_the_optimistic_estimate (dict[str, typing.Union[int, float]]) – if the optimistic estimate passed by parameter needs more parameters apart from tp, fp, TP and FP to be computed, they need to be specified here.

  • sort_criterion_in_s1 (str) – the criterion to use in order to sort the Vertical Lists with only one selector. Three values are possible: “quality-ascending” (sort ascending by quality value), “quality-descending” (sort descending by quality value), and “no-order” (do not sort and maintain the generation order). By default, “no-order”.

  • sort_criterion_in_other_sizes (str) – the criterion to use in order to sort the Vertical Lists with more than one selector. Three values are possible: “quality-ascending” (sort ascending by quality value), “quality-descending” (sort descending by quality value), and “no-order” (do not sort and maintain the generation order). By default, “no-order”.

  • vertical_lists_implementation (str) – the implementation to use for the Vertical Lists. Two values are possible: “bitsets” and “sets”. By default, “bitsets”.

  • write_results_in_file (bool) – whether the results obtained will be written in a file. By default, False.

  • file_path (typing.Optional[str]) – if ‘write_results_in_file’ is True, path of the file in which the results will be written.
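
The difference between the 'bitsets' and 'sets' values of vertical_lists_implementation can be pictured as follows. This is a conceptual sketch, not the library's data structure: each selector (attribute, value) maps to the rows it covers, stored either as the bits of a Python int or as a set of row indices. All function names are hypothetical.

```python
def build_vertical_lists(rows, as_bitsets=True):
    """Map each (attribute, value) selector to the rows it covers."""
    lists = {}
    for index, row in enumerate(rows):
        for attribute, value in row.items():
            key = (attribute, value)
            if as_bitsets:
                lists[key] = lists.get(key, 0) | (1 << index)  # set bit `index`
            else:
                lists.setdefault(key, set()).add(index)
    return lists


def intersect(a, b):
    """Rows covered by both selectors; `&` works for ints and for sets."""
    return a & b


def count(covered):
    """Number of covered rows for either representation."""
    return bin(covered).count("1") if isinstance(covered, int) else len(covered)


rows = [
    {"colour": "red", "size": "small"},
    {"colour": "red", "size": "large"},
    {"colour": "blue", "size": "small"},
    {"colour": "red", "size": "small"},
]

bits = build_vertical_lists(rows, as_bitsets=True)
sets = build_vertical_lists(rows, as_bitsets=False)

both_bits = intersect(bits[("colour", "red")], bits[("size", "small")])
both_sets = intersect(sets[("colour", "red")], sets[("size", "small")])
```

Both representations describe the same rows (0 and 3 here); they differ only in how the intersection is stored and computed.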

SORT_CRITERION: typing.ClassVar[list[str]] = ['quality-ascending', 'quality-descending', 'no-order']

SORT_CRITERION_NO_ORDER: typing.ClassVar[str] = 'no-order'

SORT_CRITERION_QUALITY_ASCENDING: typing.ClassVar[str] = 'quality-ascending'

SORT_CRITERION_QUALITY_DESCENDING: typing.ClassVar[str] = 'quality-descending'

VERTICAL_LISTS_IMPLEMENTATION: typing.ClassVar[list[str]] = ['bitsets', 'sets']

VERTICAL_LISTS_WITH_BITSETS: typing.ClassVar[str] = 'bitsets'

VERTICAL_LISTS_WITH_SETS: typing.ClassVar[str] = 'sets'

property additional_parameters_for_the_optimistic_estimate: dict[str, int | float]

The additional needed parameters with which to compute the optimistic estimate.

property additional_parameters_for_the_quality_measure: dict[str, int | float]

The additional needed parameters with which to compute the quality measure.

fit(pandas_dataframe, target)[source]

Main method to run the VLSD algorithm. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported.

Parameters:
  • pandas_dataframe (pandas.core.frame.DataFrame) – the DataFrame which is scanned. This algorithm only supports nominal attributes (i.e., type ‘str’). IMPORTANT: missing values are not supported.

  • target (tuple[str, str]) – a tuple with 2 elements: the target attribute name and the target value.

Return type:

None

property oe_minimum_threshold: int | float

The minimum quality threshold for the optimistic estimate.

property optimistic_estimate: QualityMeasure

The optimistic estimate of the quality measure which is used.

property q_minimum_threshold: int | float

The minimum quality threshold for the quality measure.

property quality_measure: QualityMeasure

The quality measure which is used.

property selected_subgroups: int

Number of selected subgroups after executing the VLSD algorithm (before executing the ‘fit’ method, this attribute is 0).

property sort_criterion_in_other_sizes: str

The criterion to use in order to sort the Vertical Lists with more than one selector.

property sort_criterion_in_s1: str

The criterion to use in order to sort the Vertical Lists with only one selector.
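
The three sort criteria can be sketched on a list of (description, quality) pairs. The function below is a hypothetical illustration of what each criterion means and does not reproduce how the class sorts its Vertical Lists internally.

```python
def sort_vertical_lists(items, criterion="no-order"):
    """Order (description, quality) pairs according to one of the three criteria."""
    if criterion == "no-order":
        return list(items)  # keep the generation order
    if criterion == "quality-ascending":
        return sorted(items, key=lambda item: item[1])
    if criterion == "quality-descending":
        return sorted(items, key=lambda item: item[1], reverse=True)
    raise ValueError(f"unknown sort criterion: {criterion!r}")


items = [("a", 0.2), ("b", 0.5), ("c", 0.1)]
ascending = sort_vertical_lists(items, "quality-ascending")
descending = sort_vertical_lists(items, "quality-descending")
unchanged = sort_vertical_lists(items)
```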

property unselected_subgroups: int

Number of unselected subgroups after executing the VLSD algorithm (before executing the ‘fit’ method, this attribute is 0).

property visited_nodes: int

Number of visited nodes after executing the VLSD algorithm (before executing the ‘fit’ method, this attribute is 0).