phenotypic.analysis.TukeyOutlierRemover

class phenotypic.analysis.TukeyOutlierRemover(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]

Bases: SetAnalyzer

Analyzer for removing outliers using Tukey’s fence method.

This class removes outliers from measurement data by applying Tukey’s fence test within each group. For every group it computes the first and third quartiles (Q1, Q3) and the interquartile range (IQR = Q3 - Q1), then removes values below Q1 - k*IQR or above Q3 + k*IQR, where k is a tunable multiplier (typically 1.5 for outliers, or 3.0 to remove only extreme outliers).
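The fence arithmetic described above can be checked by hand on a small sample. The following is a minimal sketch using only the Python standard library, not the library's implementation. The helper name `tukey_fences` and the sample values are hypothetical. The source does not say which quartile convention is used or whether values exactly on a fence are kept, so the sketch assumes linear interpolation between data points and treats the fences as inclusive.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a sequence of numbers."""
    # Assumption: quartiles use linear interpolation between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical measurements: 26 lies outside the k=1.5 fences
# but inside the wider k=3.0 fences.
area = [10, 12, 13, 14, 15, 16, 17, 18, 26]

for k in (1.5, 3.0):
    lower, upper = tukey_fences(area, k)
    # Assumption: values exactly on a fence are kept.
    kept = [v for v in area if lower <= v <= upper]
    print(f"k={k}: fences=({lower}, {upper}), kept {len(kept)} of {len(area)}")
```

A larger k widens both fences, so fewer values are removed.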

Parameters:

on (str) – Name of measurement column to test for outliers (e.g., ‘Shape_Area’, ‘Intensity_IntegratedIntensity’).
groupby (list[str]) – List of column names to group by (e.g., [‘StrainID’, ‘Time’]).
k (float) – IQR multiplier for fence calculation. Default is 1.5 (standard outliers). Use 3.0 for extreme outliers only.
num_workers (int) – Number of parallel workers. Default is 1.

groupby: List of column names to group by.

on: Column to test for outliers.

k: IQR multiplier used for fence calculation.

num_workers: Number of parallel workers. Default is 1.

Examples

Remove outliers and visualize results:

>>> import pandas as pd
>>> import numpy as np
>>> from phenotypic.analysis import TukeyOutlierRemover
>>> # Create sample data with some outliers
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'ImageName': ['img1'] * 50 + ['img2'] * 50,
...     'Area': np.concatenate([
...         np.random.normal(200, 30, 48),
...         [500, 550],  # outliers in img1
...         np.random.normal(180, 25, 48),
...         [50, 600]  # outliers in img2
...     ])
... })
>>> # Initialize detector
>>> detector = TukeyOutlierRemover(
...     on='Area',
...     groupby=['ImageName'],
...     k=1.5
... )
>>> # Remove outliers
>>> filtered_data = detector.analyze(data)
>>> # Check how many were removed
>>> print(f"Original: {len(data)}, Filtered: {len(filtered_data)}")  
>>> # Visualize removed outliers
>>> fig, axes = detector.show()

Methods

`__init__`	Initialize TukeyOutlierRemover with test parameters.
`analyze`	Remove outliers from data using Tukey's fence method.
`dash`	Interactive Plotly visualization of analysis results.
`results`	Return the filtered results (outliers removed).
`show`	Visualize outlier detection results.

__init__(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]

Initialize TukeyOutlierRemover with test parameters.

Parameters:

on (str) – Name of measurement column to test for outliers.
groupby (list[str]) – List of column names to group by.
k (float) – IQR multiplier for fence calculation. Default is 1.5.
num_workers (int) – Number of parallel workers. Default is 1.

Raises:

ValueError – If k is not positive.

analyze(data: pandas.DataFrame) → pandas.DataFrame[source]

Remove outliers from data using Tukey’s fence method.

This method processes the input DataFrame by grouping according to specified columns and removing outliers within each group independently. Outliers are identified using the IQR method and filtered out. The original data is stored internally for visualization purposes.

Parameters:

data (pandas.DataFrame) – DataFrame containing measurement data. Must include all columns specified in self.groupby and self.on.

Returns:

DataFrame with outliers removed. Contains only the original columns (no additional outlier flag columns).

Raises:

KeyError – If required columns are missing from input DataFrame.
ValueError – If data is empty or malformed.

Return type:

pandas.DataFrame

Examples

Analyze and filter outliers from measurement data:

>>> import pandas as pd
>>> import numpy as np
>>> from phenotypic.analysis import TukeyOutlierRemover
>>> # Create sample data
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'ImageName': ['img1'] * 100,
...     'Area': np.concatenate([
...         np.random.normal(200, 30, 98),
...         [500, 50]  # outliers
...     ])
... })
>>> # Remove outliers
>>> detector = TukeyOutlierRemover(
...     on='Area',
...     groupby=['ImageName'],
...     k=1.5
... )
>>> filtered_data = detector.analyze(data)
>>> # Check results
>>> print(f"Original: {len(data)} rows, Filtered: {len(filtered_data)} rows")  
>>> print(f"Removed {len(data) - len(filtered_data)} outliers")  

Notes

Stores original data in self._original_data for visualization
Stores filtered results in self._latest_measurements for retrieval
Groups are processed independently with their own fences
NaN values in measurement column are preserved in output
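The behaviour listed in these notes (independent fences per group, NaN rows preserved, no extra columns) can be sketched with plain pandas. This is an illustration under stated assumptions, not the code `analyze()` actually runs. The function name `remove_tukey_outliers` and the sample data are hypothetical, and the sketch assumes pandas' default quantile interpolation and inclusive fences.

```python
import numpy as np
import pandas as pd


def remove_tukey_outliers(df, on, groupby, k=1.5):
    """Drop rows whose `on` value lies outside its own group's fences."""
    grouped = df.groupby(groupby)[on]
    # Broadcast each group's quartiles back onto that group's rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    inside = (df[on] >= q1 - k * iqr) & (df[on] <= q3 + k * iqr)
    # NaN measurements cannot be tested, so they are kept rather than dropped.
    return df[inside | df[on].isna()]


# Hypothetical data: 300 is extreme for strain A but typical for strain B.
data = pd.DataFrame({
    "StrainID": ["A"] * 6 + ["B"] * 6,
    "Area": [100, 102, 98, 101, 99, 300,
             300, 305, 295, np.nan, 302, 298],
})

filtered = remove_tukey_outliers(data, on="Area", groupby=["StrainID"])
print(f"Original: {len(data)} rows, Filtered: {len(filtered)} rows")
```

Because each group gets its own fences, the value 300 is removed from strain A but kept in strain B, and the row with a missing Area survives the filter.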

show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs)[source]

Visualize outlier detection results.

Creates a visualization showing the distribution of values with outliers highlighted and fence boundaries displayed. Can display as individual subplots or as a collapsed stacked view with all groups in a single plot. Outlier flags are computed dynamically for visualization only.

Parameters:

figsize (tuple[int, int] | None) – Figure size as (width, height). If None, automatically determined based on number of groups and mode.
max_groups (int) – Maximum number of groups to display. If there are more groups, only the first max_groups will be shown. Default is 20.
collapsed (bool) – If True, show all groups stacked vertically in a single plot. If False, show each group in its own subplot. Default is True.
criteria (dict[str, any] | None) – Optional dictionary specifying filtering criteria for data selection. When provided, only groups matching the criteria will be displayed. Format: {‘column_name’: value} or {‘column_name’: [value1, value2]}. Default is None (show all groups).
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- grid_alpha: Alpha value for grid lines (default 0.3)
- grid_axis: Which axis to apply grid to (‘both’, ‘x’, ‘y’)
- legend_loc: Legend location (default ‘best’)
- legend_fontsize: Font size for legend (default 8)
- marker_alpha: Alpha value for scatter plot markers
- line_width: Line width for box plots and fence lines

Returns:

Tuple of (Figure, Axes) containing the visualization.

Raises:

ValueError – If analyze() has not been called yet (no results to display).
KeyError – If criteria references columns not present in the data.

Return type:

(plt.Figure, plt.Axes)

Examples

Visualize outlier detection with multiple grouping options:

>>> import pandas as pd
>>> import numpy as np
>>> from phenotypic.analysis import TukeyOutlierRemover
>>> # Create sample data with multiple grouping columns
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'ImageName': ['img1', 'img2'] * 50,
...     'Plate': ['P1'] * 50 + ['P2'] * 50,
...     'Area': np.concatenate([
...         np.random.normal(200, 30, 48), [500, 550],
...         np.random.normal(180, 25, 48), [50, 600]
...     ])
... })
>>> # Remove outliers and visualize all groups
>>> detector = TukeyOutlierRemover(
...     on='Area',
...     groupby=['Plate', 'ImageName'],
...     k=1.5
... )
>>> results = detector.analyze(data)  
>>> fig, axes = detector.show(figsize=(12, 5))  
>>> # Visualize only specific plate
>>> fig, axes = detector.show(criteria={'Plate': 'P1'})  
>>> # Visualize specific images across plates using collapsed view
>>> fig, ax = detector.show(criteria={'ImageName': 'img1'}, collapsed=True)  

Notes

Individual mode (collapsed=False):
- Each group gets its own subplot with box plot
- Outliers shown in red, normal values in blue
- Horizontal lines show fence boundaries

Collapsed mode (collapsed=True):
- All groups stacked vertically in single plot
- Each group shown as horizontal line with median marker
- Vertical bars show fence boundaries
- Normal points as circles, outliers as diamonds
- More compact for comparing many groups

Filtering with criteria:
- Only groups matching all criteria are displayed
- Useful for focusing on specific plates, conditions, or subsets
- Can be combined with both individual and collapsed modes
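The criteria format accepted by show(), {‘column_name’: value} or {‘column_name’: [value1, value2]}, can be read as a row filter in which every entry must match. The sketch below shows one way such a filter could be written with pandas. It is an assumption about the semantics, not the library's code, and the helper name `select_by_criteria` and the sample data are hypothetical.

```python
import pandas as pd


def select_by_criteria(df, criteria=None):
    """Keep only the rows that match every entry in `criteria`."""
    if not criteria:
        return df
    missing = [col for col in criteria if col not in df.columns]
    if missing:
        raise KeyError(f"criteria columns not in data: {missing}")
    mask = pd.Series(True, index=df.index)
    for col, wanted in criteria.items():
        # A scalar is treated as a one-element list of allowed values.
        if isinstance(wanted, (list, tuple, set)):
            allowed = list(wanted)
        else:
            allowed = [wanted]
        mask &= df[col].isin(allowed)
    return df[mask]


# Hypothetical measurements across three plates.
data = pd.DataFrame({
    "Plate": ["P1", "P1", "P2", "P2", "P3"],
    "ImageName": ["img1", "img2", "img1", "img2", "img1"],
    "Area": [200.0, 210.0, 180.0, 175.0, 190.0],
})

print(select_by_criteria(data, {"Plate": "P1"}))
print(select_by_criteria(data, {"Plate": ["P1", "P2"], "ImageName": "img1"}))
```

Combining entries with a logical AND matches the note that only groups matching all criteria are displayed.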

results() → pandas.DataFrame[source]

Return the filtered results (outliers removed).

Returns the DataFrame with outliers removed from the most recent call to analyze().

Returns:

DataFrame with outliers filtered out. Contains only the original columns without additional outlier flag columns. If analyze() has not been called, returns an empty DataFrame.

Return type:

pandas.DataFrame

Examples

Retrieve filtered results after analysis:

>>> from phenotypic.analysis import TukeyOutlierRemover
>>> # `data` is a DataFrame with 'ImageName' and 'Area' columns,
>>> # built as in the analyze() example
>>> detector = TukeyOutlierRemover(
...     on='Area',
...     groupby=['ImageName']
... )
>>> filtered_data = detector.analyze(data)
>>> results_copy = detector.results()  # Same as filtered_data
>>> assert results_copy.equals(filtered_data)
>>> # Check how many rows were removed
>>> num_removed = len(data) - len(filtered_data)
>>> print(f"Removed {num_removed} outliers")

Notes

Returns the DataFrame stored in self._latest_measurements
Contains only inliers (outliers have been removed)
Use this method to retrieve results after calling analyze()

dash(**kwargs)

Interactive Plotly visualization of analysis results.

Subclasses may override this method to provide an interactive Plotly figure equivalent to show().

Raises:

NotImplementedError – Unless overridden by a subclass.
