phenotypic.analysis#

Analytics for quantified fungal colony plates.

Provides post-measurement tools that adjust colony statistics for plate layout artifacts, fit growth curves, and prune outliers so downstream comparisons reflect biology rather than imaging geometry. Includes edge correction for grid layouts, log-phase growth modeling across time courses, and Tukey-style outlier removal for colony metrics.

class phenotypic.analysis.EdgeCorrector(on: str, groupby: list[str], time_label: str = 'Metadata_Time', nrows: int = 8, ncols: int = 12, top_n: int = 3, pvalue: float = 0.05, connectivity: int = 4, agg_func: str = 'mean', num_workers: int = 1)[source]#

Bases: SetAnalyzer

Analyzer for detecting and correcting edge effects in colony detection.

This class identifies colonies at grid edges (missing orthogonal neighbors) and caps their measurement values to prevent edge effects in growth assays. Edge colonies often show artificially inflated measurements due to lack of competition for resources.
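The edge-detection idea can be sketched directly: assuming GRID.SECTION_NUM numbers positions row-major, a position lacks an orthogonal neighbor exactly when it lies on the plate border (a sketch, not the library's internals):

```python
import numpy as np

# Sketch: in a row-major nrows x ncols grid, a colony is an "edge" colony
# when at least one orthogonal (4-connected) neighbor falls outside the plate.
def edge_mask(nrows=8, ncols=12):
    rows, cols = np.divmod(np.arange(nrows * ncols), ncols)
    return (rows == 0) | (rows == nrows - 1) | (cols == 0) | (cols == ncols - 1)

mask = edge_mask()
print(int(mask.sum()))  # 36 of 96 positions sit on the border of an 8x12 plate
```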

Category: EDGE_CORRECTION#

  • CorrectedCap – The carrying capacity for the target measurement

__init__(on: str, groupby: list[str], time_label: str = 'Metadata_Time', nrows: int = 8, ncols: int = 12, top_n: int = 3, pvalue: float = 0.05, connectivity: int = 4, agg_func: str = 'mean', num_workers: int = 1)[source]#

Initialize the edge corrector with the grid layout, grouping, and statistical-test configuration. Inputs are validated on construction.

Parameters:
  • on (str) – Name of the measurement column to analyze and correct.

  • groupby (list[str]) – List of column names for grouping the data.

  • time_label (str) – Specific time reference column, defaulting to “Metadata_Time”.

  • nrows (int) – Number of rows in the plate grid. Must be positive.

  • ncols (int) – Number of columns in the plate grid. Must be positive.

  • top_n (int) – Number of top results to analyze. Must be a positive integer.

  • pvalue (float) – Statistical significance threshold for the test between surrounded and edge colonies. Defaults to 0.05. Set to 0.0 to apply correction to all plates.

  • connectivity (int) – The connectivity mode to use. Must be either 4 or 8.

  • agg_func (str) – Aggregation function to apply, defaulting to ‘mean’.

  • num_workers (int) – Number of workers for parallel processing.

Raises:
  • ValueError – If connectivity is not 4 or 8.

  • ValueError – If nrows or ncols are not positive integers.

  • ValueError – If top_n is not a positive integer.
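The two connectivity modes correspond to the following neighbor offsets (a sketch; the library's internal representation may differ):

```python
# 4-connectivity uses orthogonal neighbors only; 8-connectivity also counts
# diagonals. A colony with any neighbor outside the grid is an edge colony.
NEIGHBORS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
NEIGHBORS_8 = NEIGHBORS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def is_surrounded(row, col, nrows, ncols, offsets):
    # True when every offset lands inside the nrows x ncols grid.
    return all(0 <= row + dr < nrows and 0 <= col + dc < ncols
               for dr, dc in offsets)

print(is_surrounded(0, 0, 8, 12, NEIGHBORS_4))  # False: corner colony
print(is_surrounded(3, 5, 8, 12, NEIGHBORS_8))  # True: interior colony
```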

analyze(data: pandas.DataFrame) pandas.DataFrame[source]#

Analyze and apply edge correction to grid-based colony measurements.

This method processes the input DataFrame by grouping according to specified columns and applying edge correction to each group independently. Edge colonies (those missing orthogonal neighbors) have their measurements capped to prevent artificially inflated values.
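The capping step can be pictured with a minimal sketch; the is_edge flags and the cap value here are hypothetical stand-ins for the statistics the corrector presumably derives from surrounded colonies:

```python
import pandas as pd

# Hypothetical data: 'is_edge' flags and 'cap' stand in for what the
# corrector computes internally from surrounded-colony statistics.
df = pd.DataFrame({'is_edge': [False, False, True, True],
                   'Area': [210.0, 205.0, 480.0, 190.0]})
cap = 220.0  # hypothetical cap value

# Only edge rows are clipped; surrounded colonies keep their measurements.
df.loc[df['is_edge'], 'Area'] = df.loc[df['is_edge'], 'Area'].clip(upper=cap)
print(df['Area'].tolist())  # [210.0, 205.0, 220.0, 190.0]
```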

Parameters:

data (pandas.DataFrame) – DataFrame containing grid section numbers (GRID.SECTION_NUM) and measurement data. Must include all columns specified in self.groupby and self.on.

Returns:

DataFrame with corrected measurement values. Original structure is preserved with only the measurement column modified for edge-affected rows.

Raises:
  • KeyError – If required columns are missing from input DataFrame.

  • ValueError – If data is empty or malformed.

Return type:

pandas.DataFrame

Examples

Applying edge correction to a 96-well plate dataset
>>> import pandas as pd
>>> import numpy as np
>>> from phenotypic.analysis import EdgeCorrector
>>> from phenotypic.tools.constants_ import GRID
>>>
>>> # Create sample grid data with measurements
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'ImageName': ['img1'] * 96,
...     GRID.SECTION_NUM: range(96),
...     'Area': np.random.uniform(100, 500, 96)
... })
>>>
>>> # Apply edge correction
>>> corrector = EdgeCorrector(
...     on='Area',
...     groupby=['ImageName'],
...     nrows=8,
...     ncols=12,
...     top_n=10
... )
>>> corrected = corrector.analyze(data)
>>>
>>> # Check results
>>> results = corrector.results()

Notes

  • Stores original data in self._original_data for comparison

  • Stores corrected data in self._latest_measurements for retrieval

  • Groups are processed independently with their own thresholds

results() pandas.DataFrame[source]#

Return the corrected measurement DataFrame.

Returns the DataFrame with edge-corrected measurements from the most recent call to analyze(). This allows retrieval of results after processing.

Returns:

DataFrame with corrected measurements. If analyze() has not been called, returns an empty DataFrame.

Return type:

pandas.DataFrame

Examples

Retrieving corrected measurements after analysis
>>> corrector = EdgeCorrector(
...     on='Area',
...     groupby=['ImageName']
... )
>>> corrected = corrector.analyze(data)
>>> results = corrector.results()  # Same as corrected
>>> assert results.equals(corrected)

Notes

  • Returns the DataFrame stored in self._latest_measurements

  • Contains the same structure as input but with corrected values

  • Use this method to retrieve results after calling analyze()

show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs) tuple[Figure, matplotlib.axes.Axes][source]#

Visualize edge correction results.

Displays the distribution of measurements for the last time point, highlighting surrounded vs. edge colonies and the calculated correction threshold.

Parameters:
  • figsize (tuple[int, int] | None) – Figure size (width, height).

  • max_groups (int) – Maximum number of groups to display.

  • collapsed (bool) – If True, show groups stacked vertically.

  • criteria (dict[str, any] | None) – Filtering criteria.

  • **kwargs – Additional matplotlib parameters to customize the plot. Common options include:

      • dpi: Figure resolution (default 100)

      • facecolor: Figure background color

      • edgecolor: Figure edge color

      • grid_alpha: Alpha value for grid lines

      • legend_loc: Legend location (default ‘best’)

      • legend_fontsize: Font size for legend (default 8 or 9)

      • marker_alpha: Alpha value for scatter plot markers

      • line_width: Line width for box plots and fence lines

Returns:

Tuple of (Figure, Axes).

Return type:

tuple[Figure, matplotlib.axes.Axes]

class phenotypic.analysis.LogGrowthModel(on: str, groupby: List[str], time_label: str = 'Metadata_Time', agg_func: Callable | str | list | dict | None = 'mean', lam=1.2, alpha=2, Kmax_label: str | None = None, loss: Literal['linear'] = 'linear', verbose: bool = False, n_jobs: int = 1)[source]#

Bases: ModelFitter

Represents a log growth model fitter.

This class defines methods and attributes to configure and fit logarithmic growth models to grouped data. It provides functionality for analyzing and visualizing the fitted models as well as exposing the results for further processing.

Logistic Kinetics Model:

\[N(t) = \frac{K}{1 + \frac{K - N_0}{N_0} e^{-rt}}\]

\(N(t)\): population size at time \(t\)

\(N_0\): initial population size at time \(t = 0\)

\(r\): growth rate

\(K\): carrying capacity (maximum population size)

From this we derive:

\[\mu_{\max} = \frac{K r}{4}\]

\(\mu_{\max}\): maximum specific growth rate
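The derived quantity \(rK/4\) is the maximum slope of the logistic curve, attained at the inflection point \(N = K/2\); a quick numerical check (a sketch):

```python
import numpy as np

# Numerical check: the slope of the logistic curve peaks at the inflection
# point N = K/2, where dN/dt = r*K/4 (here 0.5 * 400 / 4 = 50).
r, K, N0 = 0.5, 400.0, 10.0
t = np.linspace(0.0, 40.0, 200001)
N = K / (1.0 + (K - N0) / N0 * np.exp(-r * t))
max_slope = float(np.max(np.gradient(N, t)))
print(round(max_slope, 2))  # ≈ 50.0
```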

Loss Function:

To solve for the parameters, we use the following loss function with the SciPy least-squares solver (linear loss):

\[J(K, N_0, r) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(f_{K,N_0,r}(t^{(i)}) - N_t^{(i)}\right)^2 + \lambda\left(\left(\frac{dN}{dt}\right)^2 + N_0^2\right) + \alpha \frac{\lvert K - \max(N_t) \rvert}{N_t}\]

\(\lambda\): regularization term for growth rate and initial population size

\(\alpha\): penalty term for deviations in carrying capacity relative to the largest measurement
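A sketch of how such a fit can be set up with SciPy: plain least squares on a noiseless synthetic curve, using the same linear loss mode the class exposes. The regularization and carrying-capacity penalty terms from \(J\) are omitted here.

```python
import numpy as np
from scipy.optimize import least_squares

def logistic(t, r, K, N0):
    # Closed-form logistic model from the documentation above.
    return K / (1.0 + (K - N0) / N0 * np.exp(-r * t))

# Noiseless synthetic growth curve with known parameters.
t = np.linspace(0.0, 30.0, 60)
y = logistic(t, r=0.6, K=400.0, N0=8.0)

def residuals(p):
    return logistic(t, *p) - y

# loss='linear' is ordinary least squares; penalties from J are not included.
fit = least_squares(residuals, x0=[0.3, 300.0, 5.0], loss='linear')
print(np.round(fit.x, 2))  # ≈ [0.6, 400.0, 8.0]
```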

Attributes:
lam#

The penalty factor applied to growth rates.

Type:

float

alpha#

The maximum penalty factor applied to the carrying capacity.

Type:

float

loss#

The loss calculation method used for fitting.

Type:

Literal[“linear”]

verbose#

A flag to enable or disable detailed logging.

Type:

bool

time_label#

The column name representing the time dimension in the input data.

Type:

str

Kmax_label#

The column name for the maximum carrying capacity values, if provided.

Type:

str | None

__init__(on: str, groupby: List[str], time_label: str = 'Metadata_Time', agg_func: Callable | str | list | dict | None = 'mean', lam=1.2, alpha=2, Kmax_label: str | None = None, loss: Literal['linear'] = 'linear', verbose: bool = False, n_jobs: int = 1)[source]#

Initialize the model fitter with grouping, time, aggregation, penalty, loss, and verbosity settings.

Parameters:
  • on (str) – The target variable or column to process.

  • groupby (List[str]) – The columns that define the grouping structure.

  • time_label (str) – Column name that represents time in the data. Defaults to ‘Metadata_Time’.

  • agg_func (Callable | str | list | dict | None) – Aggregation function(s) to apply to grouped data. Passed to pandas.DataFrame.groupby().agg(). Defaults to ‘mean’.

  • lam – The penalty factor applied to growth rates. Defaults to 1.2.

  • alpha – The maximum penalty factor applied to the carrying capacity. Defaults to 2.

  • Kmax_label (str | None) – Column name that provides maximum K value for processing. Defaults to None.

  • loss (Literal["linear"]) – Loss calculation method to apply. Defaults to “linear”.

  • verbose (bool) – If True, enables detailed logging for process execution. Defaults to False.

  • n_jobs (int) – Number of parallel jobs to execute. Defaults to 1.

analyze(data: DataFrame) DataFrame[source]#

Fit the logistic growth model to each group in the input data.

Parameters:

data (DataFrame)

Return type:

DataFrame

static model_func(t: ndarray[float] | float, r: float, K: float, N0: float)[source]#

Computes the value of the logistic growth model for a given time point or array of time points and parameters. The logistic model describes growth that initially increases exponentially but levels off as the population reaches a carrying capacity.

This static method uses the formula:

N(t) = K / (1 + [(K - N0) / N0] * exp(-r * t))

Where:

  • t: Time (independent variable; scalar or array).

  • r: Growth rate.

  • K: Carrying capacity (maximum population size).

  • N0: Initial population size.

Parameters:
  • t (np.ndarray[float] | float) – Time at which the population is calculated. Can be a single value or an array of values.

  • r (float) – Growth rate of the population.

  • K (float) – Carrying capacity or the maximum population size.

  • N0 (float) – Initial population size at time t=0.

Returns:

The computed population size at the given time or array of times based on the logistic growth model.

Return type:

float | np.ndarray[float]

results() DataFrame[source]#

Return the fitted model results from the most recent call to analyze().

Return type:

DataFrame

show(tmax: int | float | None = None, criteria: Dict[str, Any | List[Any]] | None = None, figsize=(6, 4), cmap: str | None = 'tab20', legend=True, ax: Axes | None = None, **kwargs) Tuple[Figure, Axes][source]#

Visualizes model predictions alongside measurements, allowing optional filtering by specified criteria and plotting configuration.

Parameters:
  • tmax (int | float | None, optional) – The maximum time value for plotting. If set to None, the maximum time value will be determined from the data automatically.

  • criteria (Dict[str, Union[Any, List[Any]]] | None, optional) – A dictionary specifying filtering criteria for data selection. When provided, only data matching the criteria will be used for plotting.

  • figsize (tuple, optional) – A tuple specifying the size of the figure. Defaults to (6, 4).

  • cmap (str | None, optional) – A string representing either a matplotlib colormap name or a single color (e.g., ‘red’, ‘#FF0000’). If a matplotlib colormap is provided, colors will be cycled through it. If a single color is provided, all lines will use that color. Defaults to ‘tab20’.

  • legend (bool, optional) – A boolean that controls whether a legend is displayed on the plot. Defaults to True.

  • ax (plt.Axes, optional) – A matplotlib Axes object on which to plot. If not provided, a new figure and axes object will be created.

  • **kwargs – Additional matplotlib parameters to customize the plot. Common options include:

      • dpi: Figure resolution (default 100)

      • facecolor: Figure background color

      • edgecolor: Figure edge color

      • line_width: Line width for prediction lines

      • marker_size: Size of data point markers

      • elinewidth: Error bar line width

      • capsize: Error bar cap size

      • title: Custom figure title

      • xlabel: Custom x-axis label

      • ylabel: Custom y-axis label

      • legend_loc: Legend location (default ‘best’)

      • legend_fontsize: Font size for legend

Returns:

A tuple containing the matplotlib Figure and Axes objects used for plotting.

Return type:

Tuple[plt.Figure, plt.Axes]

Raises:

KeyError – If the group keys for model results and measurements do not align, or if specified columns are missing from the input data.

class phenotypic.analysis.TukeyOutlierRemover(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]#

Bases: SetAnalyzer

Analyzer for removing outliers using Tukey’s fence method.

This class removes outliers from measurement data by applying Tukey’s fence test within groups. The method calculates the interquartile range (IQR) and removes values that fall outside Q1 - k*IQR or Q3 + k*IQR, where k is a tunable multiplier (typically 1.5 for outliers or 3.0 for extreme outliers).
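The fence computation can be sketched in plain pandas (helper names here are illustrative, not the library's internals):

```python
import pandas as pd

# Sketch of Tukey's fence test applied per group: keep rows whose value
# lies inside [Q1 - k*IQR, Q3 + k*IQR].
def tukey_filter(df, on, groupby, k=1.5):
    def keep(g):
        q1, q3 = g[on].quantile([0.25, 0.75])
        fence = k * (q3 - q1)
        return g[g[on].between(q1 - fence, q3 + fence)]
    return df.groupby(groupby, group_keys=False).apply(keep)

df = pd.DataFrame({'ImageName': ['img1'] * 7,
                   'Area': [200.0, 210.0, 190.0, 205.0, 195.0, 500.0, 198.0]})
out = tukey_filter(df, 'Area', ['ImageName'])
print(len(out))  # 6: the 500.0 outlier is removed
```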

Parameters:
  • on (str) – Name of measurement column to test for outliers (e.g., ‘Shape_Area’, ‘Intensity_IntegratedIntensity’).

  • groupby (list[str]) – List of column names to group by (e.g., [‘ImageName’, ‘Metadata_Plate’]).

  • k (float) – IQR multiplier for fence calculation. Default is 1.5 (standard outliers). Use 3.0 for extreme outliers only.

  • num_workers (int) – Number of parallel workers. Default is 1.

groupby#

List of column names to group by.

on#

Column to test for outliers.

k#

IQR multiplier used for fence calculation.

num_workers#

Number of parallel workers. Default is 1.

Examples

Remove outliers and visualize results
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data with some outliers
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1'] * 50 + ['img2'] * 50,
    'Area': np.concatenate([
        np.random.normal(200, 30, 48),
        [500, 550],  # outliers in img1
        np.random.normal(180, 25, 48),
        [50, 600]  # outliers in img2
    ])
})

# Initialize detector
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName'],
    k=1.5
)

# Remove outliers
filtered_data = detector.analyze(data)

# Check how many were removed
print(f"Original: {len(data)}, Filtered: {len(filtered_data)}")

# Visualize removed outliers
fig, ax = detector.show()
__init__(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]#

Initialize TukeyOutlierRemover with test parameters.

Parameters:
  • on (str) – Name of measurement column to test for outliers.

  • groupby (list[str]) – List of column names to group by.

  • k (float) – IQR multiplier for fence calculation. Default is 1.5.

  • num_workers (int) – Number of workers. Default is 1.

Raises:

ValueError – If k is not positive.

analyze(data: pandas.DataFrame) pandas.DataFrame[source]#

Remove outliers from data using Tukey’s fence method.

This method processes the input DataFrame by grouping according to specified columns and removing outliers within each group independently. Outliers are identified using the IQR method and filtered out. The original data is stored internally for visualization purposes.

Parameters:

data (pandas.DataFrame) – DataFrame containing measurement data. Must include all columns specified in self.groupby and self.on.

Returns:

DataFrame with outliers removed. Contains only the original columns (no additional outlier flag columns).

Raises:
  • KeyError – If required columns are missing from input DataFrame.

  • ValueError – If data is empty or malformed.

Return type:

pandas.DataFrame

Examples

Analyze and filter outliers from measurement data
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1'] * 100,
    'Area': np.concatenate([
        np.random.normal(200, 30, 98),
        [500, 50]  # outliers
    ])
})

# Remove outliers
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName'],
    k=1.5
)
filtered_data = detector.analyze(data)

# Check results
print(f"Original: {len(data)} rows, Filtered: {len(filtered_data)} rows")
print(f"Removed {len(data) - len(filtered_data)} outliers")

Notes

  • Stores original data in self._original_data for visualization

  • Stores filtered results in self._latest_measurements for retrieval

  • Groups are processed independently with their own fences

  • NaN values in measurement column are preserved in output

results() pandas.DataFrame[source]#

Return the filtered results (outliers removed).

Returns the DataFrame with outliers removed from the most recent call to analyze().

Returns:

DataFrame with outliers filtered out. Contains only the original columns without additional outlier flag columns. If analyze() has not been called, returns an empty DataFrame.

Return type:

pandas.DataFrame

Examples

Retrieve filtered results after analysis
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName']
)
filtered_data = detector.analyze(data)
results_copy = detector.results()  # Same as filtered_data
assert results_copy.equals(filtered_data)

# Check how many rows were removed
num_removed = len(data) - len(filtered_data)
print(f"Removed {num_removed} outliers")

Notes

  • Returns the DataFrame stored in self._latest_measurements

  • Contains only inliers (outliers have been removed)

  • Use this method to retrieve results after calling analyze()

show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs)[source]#

Visualize outlier detection results.

Creates a visualization showing the distribution of values with outliers highlighted and fence boundaries displayed. Can display as individual subplots or as a collapsed stacked view with all groups in a single plot. Outlier flags are computed dynamically for visualization only.

Parameters:
  • figsize (tuple[int, int] | None) – Figure size as (width, height). If None, automatically determined based on number of groups and mode.

  • max_groups (int) – Maximum number of groups to display. If there are more groups, only the first max_groups will be shown. Default is 20.

  • collapsed (bool) – If True, show all groups stacked vertically in a single plot. If False, show each group in its own subplot. Default is True.

  • criteria (dict[str, any] | None) – Optional dictionary specifying filtering criteria for data selection. When provided, only groups matching the criteria will be displayed. Format: {‘column_name’: value} or {‘column_name’: [value1, value2]}. Default is None (show all groups).

  • **kwargs – Additional matplotlib parameters to customize the plot. Common options include:

      • dpi: Figure resolution (default 100)

      • facecolor: Figure background color

      • edgecolor: Figure edge color

      • grid_alpha: Alpha value for grid lines (default 0.3)

      • grid_axis: Which axis to apply grid to (‘both’, ‘x’, ‘y’)

      • legend_loc: Legend location (default ‘best’)

      • legend_fontsize: Font size for legend (default 8)

      • marker_alpha: Alpha value for scatter plot markers

      • line_width: Line width for box plots and fence lines

Returns:

Tuple of (Figure, Axes) containing the visualization.

Raises:
  • ValueError – If analyze() has not been called yet (no results to display).

  • KeyError – If criteria references columns not present in the data.

Return type:

(plt.Figure, plt.Axes)

Examples

Visualize outlier detection with multiple grouping options
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data with multiple grouping columns
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1', 'img2'] * 50,
    'Plate': ['P1'] * 50 + ['P2'] * 50,
    'Area': np.concatenate([
        np.random.normal(200, 30, 48), [500, 550],
        np.random.normal(180, 25, 48), [50, 600]
    ])
})

# Remove outliers and visualize all groups
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['Plate', 'ImageName'],
    k=1.5
)
results = detector.analyze(data)
fig, axes = detector.show(figsize=(12, 5))

# Visualize only specific plate
fig, axes = detector.show(criteria={'Plate': 'P1'})

# Visualize specific images across plates using collapsed view
fig, ax = detector.show(criteria={'ImageName': 'img1'}, collapsed=True)

Notes

Individual mode (collapsed=False):

  • Each group gets its own subplot with box plot

  • Outliers shown in red, normal values in blue

  • Horizontal lines show fence boundaries

Collapsed mode (collapsed=True):

  • All groups stacked vertically in a single plot

  • Each group shown as a horizontal line with a median marker

  • Vertical bars show fence boundaries

  • Normal points as circles, outliers as diamonds

  • More compact for comparing many groups

Filtering with criteria:

  • Only groups matching all criteria are displayed

  • Useful for focusing on specific plates, conditions, or subsets

  • Can be combined with both individual and collapsed modes

Subpackages#