phenotypic.analysis#
Analytics for quantified fungal colony plates.
Provides post-measurement tools that adjust colony statistics for plate layout artifacts, fit growth curves, and prune outliers so downstream comparisons reflect biology rather than imaging geometry. Includes edge correction for grid layouts, log-phase growth modeling across time courses, and Tukey-style outlier removal for colony metrics.
Classes

EdgeCorrector
Analyzer for detecting and correcting edge effects in colony detection.

LogGrowthModel
Represents a log growth model fitter.

TukeyOutlierRemover
Analyzer for removing outliers using Tukey's fence method.
- class phenotypic.analysis.EdgeCorrector(on: str, groupby: list[str], time_label: str = 'Metadata_Time', nrows: int = 8, ncols: int = 12, top_n: int = 3, pvalue: float = 0.05, connectivity: int = 4, agg_func: str = 'mean', num_workers: int = 1)[source]#
Bases: SetAnalyzer

Analyzer for detecting and correcting edge effects in colony detection.
This class identifies colonies at grid edges (missing orthogonal neighbors) and caps their measurement values to prevent edge effects in growth assays. Edge colonies often show artificially inflated measurements due to lack of competition for resources.
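The edge criterion above (a colony lacking a full set of orthogonal neighbors) can be sketched for a grid layout. This is an illustrative reconstruction, not the EdgeCorrector internals; the function name is hypothetical:

```python
import numpy as np

def edge_mask(nrows: int = 8, ncols: int = 12) -> np.ndarray:
    """Mark grid positions missing at least one orthogonal neighbor.

    With 4-connectivity, exactly the border ring of the grid lacks a
    neighbor above, below, left, or right; interior positions have all four.
    """
    mask = np.zeros((nrows, ncols), dtype=bool)
    mask[[0, -1], :] = True   # top and bottom rows
    mask[:, [0, -1]] = True   # left and right columns
    return mask

# On a 96-well (8 x 12) layout, 36 of the 96 positions are edge colonies:
# 2*12 + 2*8 - 4 corners counted once.
print(edge_mask().sum())  # prints 36
```

With 8-connectivity the same border ring applies, since a border cell also lacks at least one diagonal neighbor.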
Category: EDGE_CORRECTION#

Name: CorrectedCap
Description: The carrying capacity for the target measurement
- __init__(on: str, groupby: list[str], time_label: str = 'Metadata_Time', nrows: int = 8, ncols: int = 12, top_n: int = 3, pvalue: float = 0.05, connectivity: int = 4, agg_func: str = 'mean', num_workers: int = 1)[source]#
Initialize the edge corrector. The parameters configure the grid geometry, grouping, aggregation, and parallelism; inputs are validated on construction.
- Parameters:
on (str) – The dataset column to analyze or process.
groupby (list[str]) – List of column names for grouping the data.
time_label (str) – Specific time reference column, defaulting to “Metadata_Time”.
nrows (int) – Number of rows in the plate grid. Must be positive. Defaults to 8.
ncols (int) – Number of columns in the plate grid. Must be positive. Defaults to 12.
top_n (int) – Number of top results to analyze. Must be a positive integer.
pvalue (float) – Statistical threshold for significance testing between the surrounded and edge colonies. Defaults to 0.05. Set to 0.0 to apply correction to all plates.
connectivity (int) – The connectivity mode to use. Must be either 4 or 8.
agg_func (str) – Aggregation function to apply, defaulting to ‘mean’.
num_workers (int) – Number of workers for parallel processing.
- Raises:
ValueError – If connectivity is not 4 or 8.
ValueError – If nrows or ncols are not positive integers.
ValueError – If top_n is not a positive integer.
- analyze(data: pandas.DataFrame) pandas.DataFrame[source]#
Analyze and apply edge correction to grid-based colony measurements.
This method processes the input DataFrame by grouping according to specified columns and applying edge correction to each group independently. Edge colonies (those missing orthogonal neighbors) have their measurements capped to prevent artificially inflated values.
- Parameters:
data (pandas.DataFrame) – DataFrame containing grid section numbers (GRID.SECTION_NUM) and measurement data. Must include all columns specified in self.groupby and self.on.
- Returns:
DataFrame with corrected measurement values. Original structure is preserved with only the measurement column modified for edge-affected rows.
- Raises:
KeyError – If required columns are missing from input DataFrame.
ValueError – If data is empty or malformed.
- Return type:
pandas.DataFrame
Examples
Applying edge correction to a 96-well plate dataset
>>> import pandas as pd
>>> import numpy as np
>>> from phenotypic.analysis import EdgeCorrector
>>> from phenotypic.tools.constants_ import GRID
>>>
>>> # Create sample grid data with measurements
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'ImageName': ['img1'] * 96,
...     GRID.SECTION_NUM: range(96),
...     'Area': np.random.uniform(100, 500, 96)
... })
>>>
>>> # Apply edge correction
>>> corrector = EdgeCorrector(
...     on='Area',
...     groupby=['ImageName'],
...     nrows=8,
...     ncols=12,
...     top_n=10
... )
>>> corrected = corrector.analyze(data)
>>>
>>> # Check results
>>> results = corrector.results()
Notes
Stores original data in self._original_data for comparison
Stores corrected data in self._latest_measurements for retrieval
Groups are processed independently with their own thresholds
- results() pandas.DataFrame[source]#
Return the corrected measurement DataFrame.
Returns the DataFrame with edge-corrected measurements from the most recent call to analyze(). This allows retrieval of results after processing.
- Returns:
DataFrame with corrected measurements. If analyze() has not been called, returns an empty DataFrame.
- Return type:
Examples
Retrieving corrected measurements after analysis
>>> corrector = EdgeCorrector(
...     on='Area',
...     groupby=['ImageName']
... )
>>> corrected = corrector.analyze(data)
>>> results = corrector.results()  # Same as corrected
>>> assert results.equals(corrected)
Notes
Returns the DataFrame stored in self._latest_measurements
Contains the same structure as input but with corrected values
Use this method to retrieve results after calling analyze()
- show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs) tuple[Figure, matplotlib.axes.Axes][source]#
Visualize edge correction results.
Displays the distribution of measurements for the last time point, highlighting surrounded vs. edge colonies and the calculated correction threshold.
- Parameters:
figsize (tuple[int, int] | None) – Figure size (width, height).
max_groups (int) – Maximum number of groups to display.
collapsed (bool) – If True, show groups stacked vertically.
criteria (dict[str, any] | None) – Optional filtering criteria for data selection. When provided, only groups matching the criteria are displayed. Default is None (show all groups).
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- grid_alpha: Alpha value for grid lines
- legend_loc: Legend location (default 'best')
- legend_fontsize: Font size for legend (default 8 or 9)
- marker_alpha: Alpha value for scatter plot markers
- line_width: Line width for box plots and fence lines
- Returns:
Tuple of (Figure, Axes).
- Return type:
tuple[Figure, matplotlib.axes.Axes]
- class phenotypic.analysis.LogGrowthModel(on: str, groupby: List[str], time_label: str = 'Metadata_Time', agg_func: Callable | str | list | dict | None = 'mean', lam=1.2, alpha=2, Kmax_label: str | None = None, loss: Literal['linear'] = 'linear', verbose: bool = False, n_jobs: int = 1)[source]#
Bases: ModelFitter

Represents a log growth model fitter.
This class defines methods and attributes to configure and fit logarithmic growth models to grouped data. It provides functionality for analyzing and visualizing the fitted models as well as exposing the results for further processing.
Logistic Kinetics Model:
\[N(t) = \frac{K}{1 + \frac{K - N_0}{N_0} e^{-rt}}\]

\(N(t)\): population size at time \(t\)
\(N_0\): initial population size at time \(t = 0\)
\(r\): growth rate
\(K\): carrying capacity (maximum population size)

From this we derive:

\[\mu_{\max} = \frac{K r}{4}\]

\(\mu_{\max}\): maximum specific growth rate
Loss Function:
To solve for the parameters, we minimize the following loss with the SciPy least-squares solver (linear loss):

\[J(K, N_0, r) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(f_{K,N_0,r}(t^{(i)}) - N_t^{(i)}\right)^2 + \lambda\left(\left(\frac{dN}{dt}\right)^2 + N_0^2\right) + \alpha \frac{\lvert K - \max(N_t) \rvert}{\max(N_t)}\]

\(\lambda\): regularization term for growth rate and initial population size
\(\alpha\): penalty term for deviations in carrying capacity relative to the largest measurement
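A minimal sketch of this fitting scheme with `scipy.optimize.least_squares`, on synthetic data. This is not the LogGrowthModel implementation: the \(\lambda\) term here penalizes \(r\) and a normalized \(N_0\) directly rather than \(dN/dt\), and the \(\alpha\) term enters as a squared residual:

```python
import numpy as np
from scipy.optimize import least_squares

def logistic(t, r, K, N0):
    # N(t) = K / (1 + (K - N0)/N0 * exp(-r t))
    return K / (1.0 + (K - N0) / N0 * np.exp(-r * t))

# Synthetic colony time course: r=0.5, K=400, N0=10, plus measurement noise
rng = np.random.default_rng(0)
t = np.linspace(0, 24, 25)
y = logistic(t, 0.5, 400.0, 10.0) + rng.normal(0.0, 5.0, t.size)

lam, alpha = 1.2, 2.0  # defaults of the lam and alpha parameters

def residuals(params):
    r, K, N0 = params
    fit = logistic(t, r, K, N0) - y                            # data term
    reg = np.sqrt(lam) * np.array([r, N0 / y.max()])           # simplified lambda term
    cap = np.sqrt(alpha) * np.array([(K - y.max()) / y.max()]) # alpha term
    return np.concatenate([fit, reg, cap])

sol = least_squares(residuals, x0=[0.1, y.max(), max(y[0], 1.0)],
                    bounds=([1e-6, 1e-6, 1e-6], np.inf))
r_hat, K_hat, N0_hat = sol.x
mu_max = K_hat * r_hat / 4  # maximum growth rate, as derived above
```

The \(\alpha\) term pulls the fitted carrying capacity toward the largest observed measurement, which stabilizes \(K\) when the time course ends before the plateau.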
- Parameters:
- loss#
The loss calculation method used for fitting.
- Type:
Literal[“linear”]
- __init__(on: str, groupby: List[str], time_label: str = 'Metadata_Time', agg_func: Callable | str | list | dict | None = 'mean', lam=1.2, alpha=2, Kmax_label: str | None = None, loss: Literal['linear'] = 'linear', verbose: bool = False, n_jobs: int = 1)[source]#
Initialize the log growth model fitter. Configuration arguments control data grouping, the time column, aggregation, regularization penalties, loss calculation, and verbosity.
- Parameters:
on (str) – The target variable or column to process.
groupby (List[str]) – The columns that define the grouping structure.
time_label (str) – Column name that represents time in the data. Defaults to ‘Metadata_Time’.
agg_func (Callable | str | list | dict | None) –
Aggregation function(s) to apply to grouped data. Parameter is fed to
pandas.DataFrame.groupby.agg(). Defaults to ‘mean’.
lam – Regularization weight (\(\lambda\)) applied to the growth rate and initial population size. Defaults to 1.2.
alpha – Penalty weight (\(\alpha\)) for carrying-capacity deviations from the largest measurement. Defaults to 2.
Kmax_label (str | None) – Column name that provides maximum K value for processing. Defaults to None.
loss (Literal["linear"]) – Loss calculation method to apply. Defaults to “linear”.
verbose (bool) – If True, enables detailed logging for process execution. Defaults to False.
n_jobs (int) – Number of parallel jobs to execute. Defaults to 1.
- static model_func(t: ndarray[float] | float, r: float, K: float, N0: float)[source]#
Computes the value of the logistic growth model for a given time point or array of time points and parameters. The logistic model describes growth that initially increases exponentially but levels off as the population reaches a carrying capacity.
- This static method uses the formula:
N(t) = K / (1 + [(K - N0) / N0] * exp(-r * t))
- Where:
t: time (scalar or array)
r: growth rate
K: carrying capacity (maximum population size)
N0: initial population size
- Parameters:
t (ndarray[float] | float) – Time point(s) at which to evaluate the model.
r (float) – Growth rate.
K (float) – Carrying capacity (maximum population size).
N0 (float) – Initial population size.
- Returns:
The computed population size at the given time or array of times based on the logistic growth model.
- Return type:
ndarray[float] | float
- show(tmax: int | float | None = None, criteria: Dict[str, Any | List[Any]] | None = None, figsize=(6, 4), cmap: str | None = 'tab20', legend=True, ax: Axes | None = None, **kwargs) Tuple[Figure, Axes][source]#
Visualizes model predictions alongside measurements, allowing optional filtering by specified criteria and plotting configuration.
- Parameters:
tmax (int | float | None, optional) – The maximum time value for plotting. If set to None, the maximum time value will be determined from the data automatically.
criteria (Dict[str, Union[Any, List[Any]]] | None, optional) – A dictionary specifying filtering criteria for data selection. When provided, only data matching the criteria will be used for plotting.
figsize (tuple, optional) – A tuple specifying the size of the figure. Defaults to (6, 4).
cmap (str | None, optional) – A string representing either a matplotlib colormap name or a single color (e.g., ‘red’, ‘#FF0000’). If a matplotlib colormap is provided, colors will be cycled through it. If a single color is provided, all lines will use that color. Defaults to ‘tab20’.
legend (bool, optional) – A boolean that controls whether a legend is displayed on the plot. Defaults to True.
ax (plt.Axes, optional) – A matplotlib Axes object on which to plot. If not provided, a new figure and axes object will be created.
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- line_width: Line width for prediction lines
- marker_size: Size of data point markers
- elinewidth: Error bar line width
- capsize: Error bar cap size
- title: Custom figure title
- xlabel: Custom x-axis label
- ylabel: Custom y-axis label
- legend_loc: Legend location (default 'best')
- legend_fontsize: Font size for legend
- Returns:
A tuple containing the matplotlib Figure and Axes objects used for plotting.
- Return type:
Tuple[plt.Figure, plt.Axes]
- Raises:
KeyError – If the group keys for model results and measurements do not align, or if specified columns are missing from the input data.
- class phenotypic.analysis.TukeyOutlierRemover(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]#
Bases: SetAnalyzer

Analyzer for removing outliers using Tukey's fence method.
This class removes outliers from measurement data by applying Tukey’s fence test within groups. The method calculates the interquartile range (IQR) and removes values that fall outside Q1 - k*IQR or Q3 + k*IQR, where k is a tunable multiplier (typically 1.5 for outliers or 3.0 for extreme outliers).
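The fence computation can be sketched in pandas. This is an illustrative helper (hypothetical name, not the class internals); matching the documented behavior, NaN measurements are kept:

```python
import numpy as np
import pandas as pd

def tukey_filter(df: pd.DataFrame, on: str, groupby: list[str],
                 k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose `on` value lies outside [Q1 - k*IQR, Q3 + k*IQR],
    with fences computed independently within each group."""
    grouped = df.groupby(groupby)[on]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = df[on].between(q1 - k * iqr, q3 + k * iqr) | df[on].isna()
    return df[keep]

data = pd.DataFrame({
    "ImageName": ["img1"] * 10,
    "Area": [200, 210, 190, 205, 195, 198, 202, 207, 193, 900.0],
})
filtered = tukey_filter(data, on="Area", groupby=["ImageName"], k=1.5)
print(len(data), len(filtered))  # the 900 outlier is removed
```

Using `GroupBy.transform` broadcasts each group's quartiles back to row level, so the filter is a single vectorized mask rather than a per-group loop.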
- Parameters:
on (str) – Name of measurement column to test for outliers (e.g., ‘Shape_Area’, ‘Intensity_IntegratedIntensity’).
groupby (list[str]) – List of column names to group by (e.g., [‘ImageName’, ‘Metadata_Plate’]).
k (float) – IQR multiplier for fence calculation. Default is 1.5 (standard outliers). Use 3.0 for extreme outliers only.
num_workers (int) – Number of parallel workers. Default is 1.
- groupby#
List of column names to group by.
- on#
Column to test for outliers.
- k#
IQR multiplier used for fence calculation.
- num_workers#
Number of parallel workers. Default is 1.
Examples
Remove outliers and visualize results
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data with some outliers
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1'] * 50 + ['img2'] * 50,
    'Area': np.concatenate([
        np.random.normal(200, 30, 48), [500, 550],  # outliers in img1
        np.random.normal(180, 25, 48), [50, 600]    # outliers in img2
    ])
})

# Initialize detector
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName'],
    k=1.5
)

# Remove outliers
filtered_data = detector.analyze(data)

# Check how many were removed
print(f"Original: {len(data)}, Filtered: {len(filtered_data)}")

# Visualize removed outliers
fig = detector.show()
- __init__(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]#
Initialize TukeyOutlierRemover with test parameters.
- Parameters:
- Raises:
ValueError – If k is not positive.
- analyze(data: pandas.DataFrame) pandas.DataFrame[source]#
Remove outliers from data using Tukey’s fence method.
This method processes the input DataFrame by grouping according to specified columns and removing outliers within each group independently. Outliers are identified using the IQR method and filtered out. The original data is stored internally for visualization purposes.
- Parameters:
data (pandas.DataFrame) – DataFrame containing measurement data. Must include all columns specified in self.groupby and self.on.
- Returns:
DataFrame with outliers removed. Contains only the original columns (no additional outlier flag columns).
- Raises:
KeyError – If required columns are missing from input DataFrame.
ValueError – If data is empty or malformed.
- Return type:
pandas.DataFrame
Examples
Analyze and filter outliers from measurement data
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1'] * 100,
    'Area': np.concatenate([
        np.random.normal(200, 30, 98),
        [500, 50]  # outliers
    ])
})

# Remove outliers
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName'],
    k=1.5
)
filtered_data = detector.analyze(data)

# Check results
print(f"Original: {len(data)} rows, Filtered: {len(filtered_data)} rows")
print(f"Removed {len(data) - len(filtered_data)} outliers")
Notes
Stores original data in self._original_data for visualization
Stores filtered results in self._latest_measurements for retrieval
Groups are processed independently with their own fences
NaN values in measurement column are preserved in output
- results() pandas.DataFrame[source]#
Return the filtered results (outliers removed).
Returns the DataFrame with outliers removed from the most recent call to analyze().
- Returns:
DataFrame with outliers filtered out. Contains only the original columns without additional outlier flag columns. If analyze() has not been called, returns an empty DataFrame.
- Return type:
pandas.DataFrame
Examples
Retrieve filtered results after analysis
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName']
)
filtered_data = detector.analyze(data)
results_copy = detector.results()  # Same as filtered_data
assert results_copy.equals(filtered_data)

# Check how many rows were removed
num_removed = len(data) - len(filtered_data)
print(f"Removed {num_removed} outliers")
Notes
Returns the DataFrame stored in self._latest_measurements
Contains only inliers (outliers have been removed)
Use this method to retrieve results after calling analyze()
- show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs)[source]#
Visualize outlier detection results.
Creates a visualization showing the distribution of values with outliers highlighted and fence boundaries displayed. Can display as individual subplots or as a collapsed stacked view with all groups in a single plot. Outlier flags are computed dynamically for visualization only.
- Parameters:
figsize (tuple[int, int] | None) – Figure size as (width, height). If None, automatically determined based on number of groups and mode.
max_groups (int) – Maximum number of groups to display. If there are more groups, only the first max_groups will be shown. Default is 20.
collapsed (bool) – If True, show all groups stacked vertically in a single plot. If False, show each group in its own subplot. Default is True.
criteria (dict[str, any] | None) – Optional dictionary specifying filtering criteria for data selection. When provided, only groups matching the criteria will be displayed. Format: {‘column_name’: value} or {‘column_name’: [value1, value2]}. Default is None (show all groups).
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- grid_alpha: Alpha value for grid lines (default 0.3)
- grid_axis: Which axis to apply grid to ('both', 'x', 'y')
- legend_loc: Legend location (default 'best')
- legend_fontsize: Font size for legend (default 8)
- marker_alpha: Alpha value for scatter plot markers
- line_width: Line width for box plots and fence lines
- Returns:
Tuple of (Figure, Axes) containing the visualization.
- Raises:
ValueError – If analyze() has not been called yet (no results to display).
KeyError – If criteria references columns not present in the data.
- Return type:
(plt.Figure, plt.Axes)
Examples
Visualize outlier detection with multiple grouping options
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data with multiple grouping columns
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1', 'img2'] * 50,
    'Plate': ['P1'] * 50 + ['P2'] * 50,
    'Area': np.concatenate([
        np.random.normal(200, 30, 48), [500, 550],
        np.random.normal(180, 25, 48), [50, 600]
    ])
})

# Remove outliers and visualize all groups
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['Plate', 'ImageName'],
    k=1.5
)
results = detector.analyze(data)
fig, axes = detector.show(figsize=(12, 5))

# Visualize only specific plate
fig, axes = detector.show(criteria={'Plate': 'P1'})

# Visualize specific images across plates using collapsed view
fig, ax = detector.show(criteria={'ImageName': 'img1'}, collapsed=True)
Notes
Individual mode (collapsed=False):
- Each group gets its own subplot with box plot
- Outliers shown in red, normal values in blue
- Horizontal lines show fence boundaries

Collapsed mode (collapsed=True):
- All groups stacked vertically in single plot
- Each group shown as horizontal line with median marker
- Vertical bars show fence boundaries
- Normal points as circles, outliers as diamonds
- More compact for comparing many groups

Filtering with criteria:
- Only groups matching all criteria are displayed
- Useful for focusing on specific plates, conditions, or subsets
- Can be combined with both individual and collapsed modes