phenotypic.analysis#
Analytics for quantified fungal colony plates.
Provides post-measurement tools that adjust colony statistics for plate layout artifacts, fit growth curves, and prune outliers so downstream comparisons reflect biology rather than imaging geometry. Includes edge correction for grid layouts, log-phase growth modeling across time courses, and Tukey-style outlier removal for colony metrics.
Classes

EdgeCorrector
Analyzer for detecting and correcting edge effects in colony detection.

LogGrowthModel
Represents a log growth model fitter.

TukeyOutlierRemover
Analyzer for removing outliers using Tukey's fence method.
- class phenotypic.analysis.EdgeCorrector(on: str, groupby: list[str], time_label: str = 'Metadata_Time', nrows: int = 8, ncols: int = 12, top_n: int = 3, pvalue: float = 0.05, connectivity: int = 4, agg_func: str = 'mean', num_workers: int = 1)[source]#
Bases: SetAnalyzer

Analyzer for detecting and correcting edge effects in colony detection.
This class identifies colonies at grid edges (missing orthogonal neighbors) and caps their measurement values to prevent edge effects in growth assays. Edge colonies often show artificially inflated measurements due to lack of competition for resources.
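The edge criterion above (a colony lacking a full set of orthogonal neighbors) can be sketched for a grid layout. This is an illustrative reconstruction, not the EdgeCorrector internals; the function name is hypothetical:

```python
import numpy as np

def edge_mask(nrows: int = 8, ncols: int = 12) -> np.ndarray:
    """Mark grid positions missing at least one orthogonal neighbor.

    With 4-connectivity, exactly the border ring of the grid lacks a
    neighbor above, below, left, or right; interior positions have all four.
    """
    mask = np.zeros((nrows, ncols), dtype=bool)
    mask[[0, -1], :] = True   # top and bottom rows
    mask[:, [0, -1]] = True   # left and right columns
    return mask

# On a 96-well (8 x 12) layout, 36 of the 96 positions are edge colonies:
# 2*12 + 2*8 - 4 corners counted once.
print(edge_mask().sum())  # prints 36
```

With 8-connectivity the same border ring applies, since a border cell also lacks at least one diagonal neighbor.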
Category: EDGE_CORRECTION#

Name: CorrectedCap
Description: The carrying capacity for the target measurement
- __init__(on: str, groupby: list[str], time_label: str = 'Metadata_Time', nrows: int = 8, ncols: int = 12, top_n: int = 3, pvalue: float = 0.05, connectivity: int = 4, agg_func: str = 'mean', num_workers: int = 1)[source]#
Initialize the edge corrector. The parameters configure the grid geometry, grouping, aggregation, and parallelism; inputs are validated on construction.
- Parameters:
on (str) – The dataset column to analyze or process.
groupby (list[str]) – List of column names for grouping the data.
time_label (str) – Specific time reference column, defaulting to “Metadata_Time”.
nrows (int) – Number of rows in the plate grid. Must be positive. Defaults to 8.
ncols (int) – Number of columns in the plate grid. Must be positive. Defaults to 12.
top_n (int) – Number of top results to analyze. Must be a positive integer.
pvalue (float) – Statistical threshold for significance testing between the surrounded and edge colonies. Defaults to 0.05. Set to 0.0 to apply correction to all plates.
connectivity (int) – The connectivity mode to use. Must be either 4 or 8.
agg_func (str) – Aggregation function to apply, defaulting to ‘mean’.
num_workers (int) – Number of workers for parallel processing.
- Raises:
ValueError – If connectivity is not 4 or 8.
ValueError – If nrows or ncols are not positive integers.
ValueError – If top_n is not a positive integer.
- analyze(data: pandas.DataFrame) pandas.DataFrame[source]#
Analyze and apply edge correction to grid-based colony measurements.
This method processes the input DataFrame by grouping according to specified columns and applying edge correction to each group independently. Edge colonies (those missing orthogonal neighbors) have their measurements capped to prevent artificially inflated values.
- Parameters:
data (pandas.DataFrame) – DataFrame containing grid section numbers (GRID.SECTION_NUM) and measurement data. Must include all columns specified in self.groupby and self.on.
- Returns:
DataFrame with corrected measurement values. Original structure is preserved with only the measurement column modified for edge-affected rows.
- Raises:
KeyError – If required columns are missing from input DataFrame.
ValueError – If data is empty or malformed.
- Return type:
pandas.DataFrame
Examples
Applying edge correction to a 96-well plate dataset
>>> import pandas as pd
>>> import numpy as np
>>> from phenotypic.analysis import EdgeCorrector
>>> from phenotypic.tools.constants_ import GRID
>>>
>>> # Create sample grid data with measurements
>>> np.random.seed(42)
>>> data = pd.DataFrame({
...     'ImageName': ['img1'] * 96,
...     GRID.SECTION_NUM: range(96),
...     'Area': np.random.uniform(100, 500, 96)
... })
>>>
>>> # Apply edge correction
>>> corrector = EdgeCorrector(
...     on='Area',
...     groupby=['ImageName'],
...     nrows=8,
...     ncols=12,
...     top_n=10
... )
>>> corrected = corrector.analyze(data)
>>>
>>> # Check results
>>> results = corrector.results()
Notes
Stores original data in self._original_data for comparison
Stores corrected data in self._latest_measurements for retrieval
Groups are processed independently with their own thresholds
- results() pandas.DataFrame[source]#
Return the corrected measurement DataFrame.
Returns the DataFrame with edge-corrected measurements from the most recent call to analyze(). This allows retrieval of results after processing.
- Returns:
DataFrame with corrected measurements. If analyze() has not been called, returns an empty DataFrame.
- Return type:
Examples
Retrieving corrected measurements after analysis
>>> corrector = EdgeCorrector(
...     on='Area',
...     groupby=['ImageName']
... )
>>> corrected = corrector.analyze(data)
>>> results = corrector.results()  # Same as corrected
>>> assert results.equals(corrected)
Notes
Returns the DataFrame stored in self._latest_measurements
Contains the same structure as input but with corrected values
Use this method to retrieve results after calling analyze()
- show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs) tuple[Figure, matplotlib.axes.Axes][source]#
Visualize edge correction results.
Displays the distribution of measurements for the last time point, highlighting surrounded vs. edge colonies and the calculated correction threshold.
- Parameters:
figsize (tuple[int, int] | None) – Figure size (width, height).
max_groups (int) – Maximum number of groups to display.
collapsed (bool) – If True, show groups stacked vertically.
criteria (dict[str, any] | None) – Optional filtering criteria for data selection. When provided, only groups matching the criteria are displayed. Default is None (show all groups).
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- grid_alpha: Alpha value for grid lines
- legend_loc: Legend location (default 'best')
- legend_fontsize: Font size for legend (default 8 or 9)
- marker_alpha: Alpha value for scatter plot markers
- line_width: Line width for box plots and fence lines
- Returns:
Tuple of (Figure, Axes).
- Return type:
tuple[Figure, matplotlib.axes.Axes]
- class phenotypic.analysis.LogGrowthModel(on: str, groupby: List[str], time_label: str = 'Metadata_Time', agg_func: Callable | str | list | dict | None = 'mean', lam=1.2, alpha=2, Kmax_label: str | None = None, loss: Literal['linear'] = 'linear', verbose: bool = False, n_jobs: int = 1)[source]#
Bases: ModelFitter

Represents a log growth model fitter.
This class defines methods and attributes to configure and fit logarithmic growth models to grouped data. It provides functionality for analyzing and visualizing the fitted models as well as exposing the results for further processing.
Logistic Kinetics Model:
\[N(t) = \frac{K}{1 + \frac{K - N_0}{N_0} e^{-rt}}\]

\(N(t)\): population size at time \(t\)
\(N_0\): initial population size at time \(t = 0\)
\(r\): growth rate
\(K\): carrying capacity (maximum population size)

From this we derive:

\[\mu_{\max} = \frac{K r}{4}\]

\(\mu_{\max}\): maximum specific growth rate
Loss Function:
To solve for the parameters, we minimize the following loss with the SciPy least-squares solver (linear loss):

\[J(K, N_0, r) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(f_{K,N_0,r}(t^{(i)}) - N_t^{(i)}\right)^2 + \lambda\left(\left(\frac{dN}{dt}\right)^2 + N_0^2\right) + \alpha \frac{\lvert K - \max(N_t) \rvert}{\max(N_t)}\]

\(\lambda\): regularization term for growth rate and initial population size
\(\alpha\): penalty term for deviations in carrying capacity relative to the largest measurement
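A minimal sketch of this fitting scheme with `scipy.optimize.least_squares`, on synthetic data. This is not the LogGrowthModel implementation: the \(\lambda\) term here penalizes \(r\) and a normalized \(N_0\) directly rather than \(dN/dt\), and the \(\alpha\) term enters as a squared residual:

```python
import numpy as np
from scipy.optimize import least_squares

def logistic(t, r, K, N0):
    # N(t) = K / (1 + (K - N0)/N0 * exp(-r t))
    return K / (1.0 + (K - N0) / N0 * np.exp(-r * t))

# Synthetic colony time course: r=0.5, K=400, N0=10, plus measurement noise
rng = np.random.default_rng(0)
t = np.linspace(0, 24, 25)
y = logistic(t, 0.5, 400.0, 10.0) + rng.normal(0.0, 5.0, t.size)

lam, alpha = 1.2, 2.0  # defaults of the lam and alpha parameters

def residuals(params):
    r, K, N0 = params
    fit = logistic(t, r, K, N0) - y                            # data term
    reg = np.sqrt(lam) * np.array([r, N0 / y.max()])           # simplified lambda term
    cap = np.sqrt(alpha) * np.array([(K - y.max()) / y.max()]) # alpha term
    return np.concatenate([fit, reg, cap])

sol = least_squares(residuals, x0=[0.1, y.max(), max(y[0], 1.0)],
                    bounds=([1e-6, 1e-6, 1e-6], np.inf))
r_hat, K_hat, N0_hat = sol.x
mu_max = K_hat * r_hat / 4  # maximum growth rate, as derived above
```

The \(\alpha\) term pulls the fitted carrying capacity toward the largest observed measurement, which stabilizes \(K\) when the time course ends before the plateau.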
- Parameters:
- loss#
The loss calculation method used for fitting.
- Type:
Literal[“linear”]
- __init__(on: str, groupby: List[str], time_label: str = 'Metadata_Time', agg_func: Callable | str | list | dict | None = 'mean', lam=1.2, alpha=2, Kmax_label: str | None = None, loss: Literal['linear'] = 'linear', verbose: bool = False, n_jobs: int = 1)[source]#
Initialize the log growth model fitter. Configuration arguments control data grouping, the time column, aggregation, regularization penalties, loss calculation, and verbosity.
- Parameters:
on (str) – The target variable or column to process.
groupby (List[str]) – The columns that define the grouping structure.
time_label (str) – Column name that represents time in the data. Defaults to ‘Metadata_Time’.
agg_func (Callable | str | list | dict | None) –
Aggregation function(s) to apply to grouped data. Parameter is fed to
pandas.DataFrame.groupby.agg(). Defaults to ‘mean’.
lam – Regularization weight (\(\lambda\)) applied to the growth rate and initial population size. Defaults to 1.2.
alpha – Penalty weight (\(\alpha\)) for carrying-capacity deviations from the largest measurement. Defaults to 2.
Kmax_label (str | None) – Column name that provides maximum K value for processing. Defaults to None.
loss (Literal["linear"]) – Loss calculation method to apply. Defaults to “linear”.
verbose (bool) – If True, enables detailed logging for process execution. Defaults to False.
n_jobs (int) – Number of parallel jobs to execute. Defaults to 1.
- static model_func(t: ndarray[float] | float, r: float, K: float, N0: float)[source]#
Computes the value of the logistic growth model for a given time point or array of time points and parameters. The logistic model describes growth that initially increases exponentially but levels off as the population reaches a carrying capacity.
- This static method uses the formula:
N(t) = K / (1 + [(K - N0) / N0] * exp(-r * t))
- Where:
t: time (scalar or array)
r: growth rate
K: carrying capacity (maximum population size)
N0: initial population size
- Parameters:
t (ndarray[float] | float) – Time point(s) at which to evaluate the model.
r (float) – Growth rate.
K (float) – Carrying capacity (maximum population size).
N0 (float) – Initial population size.
- Returns:
The computed population size at the given time or array of times based on the logistic growth model.
- Return type:
ndarray[float] | float
- show(tmax: int | float | None = None, criteria: Dict[str, Any | List[Any]] | None = None, figsize=(6, 4), cmap: str | None = 'tab20', legend=True, ax: Axes | None = None, **kwargs) Tuple[Figure, Axes][source]#
Visualizes model predictions alongside measurements, allowing optional filtering by specified criteria and plotting configuration.
- Parameters:
tmax (int | float | None, optional) – The maximum time value for plotting. If set to None, the maximum time value will be determined from the data automatically.
criteria (Dict[str, Union[Any, List[Any]]] | None, optional) – A dictionary specifying filtering criteria for data selection. When provided, only data matching the criteria will be used for plotting.
figsize (tuple, optional) – A tuple specifying the size of the figure. Defaults to (6, 4).
cmap (str | None, optional) – A string representing either a matplotlib colormap name or a single color (e.g., ‘red’, ‘#FF0000’). If a matplotlib colormap is provided, colors will be cycled through it. If a single color is provided, all lines will use that color. Defaults to ‘tab20’.
legend (bool, optional) – A boolean that controls whether a legend is displayed on the plot. Defaults to True.
ax (plt.Axes, optional) – A matplotlib Axes object on which to plot. If not provided, a new figure and axes object will be created.
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- line_width: Line width for prediction lines
- marker_size: Size of data point markers
- elinewidth: Error bar line width
- capsize: Error bar cap size
- title: Custom figure title
- xlabel: Custom x-axis label
- ylabel: Custom y-axis label
- legend_loc: Legend location (default 'best')
- legend_fontsize: Font size for legend
- Returns:
A tuple containing the matplotlib Figure and Axes objects used for plotting.
- Return type:
Tuple[plt.Figure, plt.Axes]
- Raises:
KeyError – If the group keys for model results and measurements do not align, or if specified columns are missing from the input data.
- class phenotypic.analysis.TukeyOutlierRemover(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]#
Bases: SetAnalyzer

Analyzer for removing outliers using Tukey's fence method.
This class removes outliers from measurement data by applying Tukey’s fence test within groups. The method calculates the interquartile range (IQR) and removes values that fall outside Q1 - k*IQR or Q3 + k*IQR, where k is a tunable multiplier (typically 1.5 for outliers or 3.0 for extreme outliers).
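The fence computation can be sketched in pandas. This is an illustrative helper (hypothetical name, not the class internals); matching the documented behavior, NaN measurements are kept:

```python
import numpy as np
import pandas as pd

def tukey_filter(df: pd.DataFrame, on: str, groupby: list[str],
                 k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose `on` value lies outside [Q1 - k*IQR, Q3 + k*IQR],
    with fences computed independently within each group."""
    grouped = df.groupby(groupby)[on]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = df[on].between(q1 - k * iqr, q3 + k * iqr) | df[on].isna()
    return df[keep]

data = pd.DataFrame({
    "ImageName": ["img1"] * 10,
    "Area": [200, 210, 190, 205, 195, 198, 202, 207, 193, 900.0],
})
filtered = tukey_filter(data, on="Area", groupby=["ImageName"], k=1.5)
print(len(data), len(filtered))  # the 900 outlier is removed
```

Using `GroupBy.transform` broadcasts each group's quartiles back to row level, so the filter is a single vectorized mask rather than a per-group loop.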
- Parameters:
on (str) – Name of measurement column to test for outliers (e.g., ‘Shape_Area’, ‘Intensity_IntegratedIntensity’).
groupby (list[str]) – List of column names to group by (e.g., [‘ImageName’, ‘Metadata_Plate’]).
k (float) – IQR multiplier for fence calculation. Default is 1.5 (standard outliers). Use 3.0 for extreme outliers only.
num_workers (int) – Number of parallel workers. Default is 1.
- groupby#
List of column names to group by.
- on#
Column to test for outliers.
- k#
IQR multiplier used for fence calculation.
- num_workers#
Number of parallel workers. Default is 1.
Examples
Remove outliers and visualize results
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data with some outliers
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1'] * 50 + ['img2'] * 50,
    'Area': np.concatenate([
        np.random.normal(200, 30, 48), [500, 550],  # outliers in img1
        np.random.normal(180, 25, 48), [50, 600]    # outliers in img2
    ])
})

# Initialize detector
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName'],
    k=1.5
)

# Remove outliers
filtered_data = detector.analyze(data)

# Check how many were removed
print(f"Original: {len(data)}, Filtered: {len(filtered_data)}")

# Visualize removed outliers
fig = detector.show()
- __init__(on: str, groupby: list[str], k: float = 1.5, num_workers: int = 1)[source]#
Initialize TukeyOutlierRemover with test parameters.
- Parameters:
- Raises:
ValueError – If k is not positive.
- analyze(data: pandas.DataFrame) pandas.DataFrame[source]#
Remove outliers from data using Tukey’s fence method.
This method processes the input DataFrame by grouping according to specified columns and removing outliers within each group independently. Outliers are identified using the IQR method and filtered out. The original data is stored internally for visualization purposes.
- Parameters:
data (pandas.DataFrame) – DataFrame containing measurement data. Must include all columns specified in self.groupby and self.on.
- Returns:
DataFrame with outliers removed. Contains only the original columns (no additional outlier flag columns).
- Raises:
KeyError – If required columns are missing from input DataFrame.
ValueError – If data is empty or malformed.
- Return type:
pandas.DataFrame
Examples
Analyze and filter outliers from measurement data
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1'] * 100,
    'Area': np.concatenate([
        np.random.normal(200, 30, 98),
        [500, 50]  # outliers
    ])
})

# Remove outliers
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName'],
    k=1.5
)
filtered_data = detector.analyze(data)

# Check results
print(f"Original: {len(data)} rows, Filtered: {len(filtered_data)} rows")
print(f"Removed {len(data) - len(filtered_data)} outliers")
Notes
Stores original data in self._original_data for visualization
Stores filtered results in self._latest_measurements for retrieval
Groups are processed independently with their own fences
NaN values in measurement column are preserved in output
- results() pandas.DataFrame[source]#
Return the filtered results (outliers removed).
Returns the DataFrame with outliers removed from the most recent call to analyze().
- Returns:
DataFrame with outliers filtered out. Contains only the original columns without additional outlier flag columns. If analyze() has not been called, returns an empty DataFrame.
- Return type:
pandas.DataFrame
Examples
Retrieve filtered results after analysis
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['ImageName']
)
filtered_data = detector.analyze(data)
results_copy = detector.results()  # Same as filtered_data
assert results_copy.equals(filtered_data)

# Check how many rows were removed
num_removed = len(data) - len(filtered_data)
print(f"Removed {num_removed} outliers")
Notes
Returns the DataFrame stored in self._latest_measurements
Contains only inliers (outliers have been removed)
Use this method to retrieve results after calling analyze()
- show(figsize: tuple[int, int] | None = None, max_groups: int = 20, collapsed: bool = True, criteria: dict[str, any] | None = None, **kwargs)[source]#
Visualize outlier detection results.
Creates a visualization showing the distribution of values with outliers highlighted and fence boundaries displayed. Can display as individual subplots or as a collapsed stacked view with all groups in a single plot. Outlier flags are computed dynamically for visualization only.
- Parameters:
figsize (tuple[int, int] | None) – Figure size as (width, height). If None, automatically determined based on number of groups and mode.
max_groups (int) – Maximum number of groups to display. If there are more groups, only the first max_groups will be shown. Default is 20.
collapsed (bool) – If True, show all groups stacked vertically in a single plot. If False, show each group in its own subplot. Default is True.
criteria (dict[str, any] | None) – Optional dictionary specifying filtering criteria for data selection. When provided, only groups matching the criteria will be displayed. Format: {‘column_name’: value} or {‘column_name’: [value1, value2]}. Default is None (show all groups).
**kwargs – Additional matplotlib parameters to customize the plot. Common options include:
- dpi: Figure resolution (default 100)
- facecolor: Figure background color
- edgecolor: Figure edge color
- grid_alpha: Alpha value for grid lines (default 0.3)
- grid_axis: Which axis to apply grid to ('both', 'x', 'y')
- legend_loc: Legend location (default 'best')
- legend_fontsize: Font size for legend (default 8)
- marker_alpha: Alpha value for scatter plot markers
- line_width: Line width for box plots and fence lines
- Returns:
Tuple of (Figure, Axes) containing the visualization.
- Raises:
ValueError – If analyze() has not been called yet (no results to display).
KeyError – If criteria references columns not present in the data.
- Return type:
(plt.Figure, plt.Axes)
Examples
Visualize outlier detection with multiple grouping options
import pandas as pd
import numpy as np
from phenotypic.analysis import TukeyOutlierRemover

# Create sample data with multiple grouping columns
np.random.seed(42)
data = pd.DataFrame({
    'ImageName': ['img1', 'img2'] * 50,
    'Plate': ['P1'] * 50 + ['P2'] * 50,
    'Area': np.concatenate([
        np.random.normal(200, 30, 48), [500, 550],
        np.random.normal(180, 25, 48), [50, 600]
    ])
})

# Remove outliers and visualize all groups
detector = TukeyOutlierRemover(
    on='Area',
    groupby=['Plate', 'ImageName'],
    k=1.5
)
results = detector.analyze(data)
fig, axes = detector.show(figsize=(12, 5))

# Visualize only specific plate
fig, axes = detector.show(criteria={'Plate': 'P1'})

# Visualize specific images across plates using collapsed view
fig, ax = detector.show(criteria={'ImageName': 'img1'}, collapsed=True)
Notes
Individual mode (collapsed=False):
- Each group gets its own subplot with box plot
- Outliers shown in red, normal values in blue
- Horizontal lines show fence boundaries

Collapsed mode (collapsed=True):
- All groups stacked vertically in single plot
- Each group shown as horizontal line with median marker
- Vertical bars show fence boundaries
- Normal points as circles, outliers as diamonds
- More compact for comparing many groups

Filtering with criteria:
- Only groups matching all criteria are displayed
- Useful for focusing on specific plates, conditions, or subsets
- Can be combined with both individual and collapsed modes