PyCO2stats Functions

This section documents the public functions and classes of PyCO2stats, grouped by module.

Gaussian mixtures

class pyco2stats.gaussian_mixtures.GMM[source]

Bases: object

static gaussian_mixture_em(data, n_components, max_iter=10000, tol=1e-07, random_state=None)[source]

Fit a Gaussian Mixture Model (GMM) to the given data using the Expectation-Maximization (EM) algorithm.

The method was taken from Elío et al., 2016 (DOI: 10.1016/j.ijggc.2016.02.012).

Parameters:

data (array) – The input data to fit the GMM to.
n_components (int) – The number of Gaussian components in the mixture.
max_iter (int) – The maximum number of iterations for the EM algorithm. Default is 10000.
tol (float) – The tolerance for convergence. Default is 1e-7.
random_state (int, Generator, or None) – Controls the randomness for initialization. If int, uses it as a seed. If Generator, uses it directly. If None, uses the global random state (non-deterministic).

Returns:

means (array) – The means of the Gaussian components.
std_devs (array) – The standard deviations of the Gaussian components.
weights (array) – The weights (mixing proportions) of the Gaussian components.
log_likelihoods (list) – The log-likelihood values over the iterations.
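
The EM iteration described above can be sketched in plain Python. This is an illustrative univariate version, not the PyCO2stats implementation: the function name em_gmm_1d, the initialisation scheme (random data points as means, pooled standard deviation, equal weights) and the variance floor are assumptions made for the example.

```python
import math
import random

def em_gmm_1d(data, n_components, max_iter=200, tol=1e-7, seed=0):
    """Illustrative 1D EM fit; not the PyCO2stats implementation."""
    rng = random.Random(seed)
    n = len(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # Assumed initialisation: random data points as means, pooled std, equal weights.
    means = rng.sample(list(data), n_components)
    stds = [math.sqrt(var_all)] * n_components
    weights = [1.0 / n_components] * n_components
    log_likelihoods = []
    for _ in range(max_iter):
        # E-step: responsibilities proportional to w_k * N(x | mu_k, sigma_k).
        resp, ll = [], 0.0
        for x in data:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        log_likelihoods.append(ll)
        # M-step: responsibility-weighted updates of weights, means and stds.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = math.sqrt(max(var_k, 1e-12))  # floor avoids a zero std
        # Stop once the log-likelihood gain falls below tol.
        if len(log_likelihoods) > 1 and abs(ll - log_likelihoods[-2]) < tol:
            break
    return means, stds, weights, log_likelihoods

# Synthetic two-population data set for demonstration only.
gen = random.Random(1)
data = [gen.gauss(0.0, 1.0) for _ in range(200)] + [gen.gauss(6.0, 1.0) for _ in range(200)]
means, std_devs, weights, log_likelihoods = em_gmm_1d(data, n_components=2)
```

The returned log_likelihoods list should be non-decreasing, which is a convenient sanity check on any EM implementation.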

static gaussian_mixture_sklearn(X, n_components=3, max_iter=10, tol=1e-10, n_init=20, suppress_warnings=True, covariance_type='spherical')[source]

Fit a Gaussian Mixture Model (GMM) using the implementation adapted from scikit-learn.

Parameters:

X (array) – The input data to fit the GMM to.
n_components (int) – The number of Gaussian components in the mixture.
max_iter (int) – The maximum number of iterations for the EM algorithm. Default is 10.
tol (float) – The tolerance for convergence. Default is 1e-10.
n_init (int) – The number of initializations to perform. The best results are kept. Default is 20.
suppress_warnings (bool) – If True, suppresses the generation of warnings. Default is True.
covariance_type (string) – Can be ‘full’, ‘tied’, ‘diag’ or ‘spherical’. Describes the type of covariance parameters to use. Default is ‘spherical’.

Returns:

means (array) – The means of the Gaussian components.
std_devs (array) – The standard deviations of the Gaussian components.
weights (array) – The weights (mixing proportions) of the Gaussian components.
max_iter (int) – The maximum number of iterations (as given in input).
log_likelihoods (list) – The log-likelihood values over the iterations.

static constrained_gaussian_mixture(X, mean_bounds, std_bounds, n_components, n_epochs=5000, lr=0.001, verbose=True)[source]

Optimize a Gaussian Mixture Model (GMM) using PyTorch with specified constraints on means and standard deviations. Uses Softmax for stable weight optimization and LogSumExp for numerical stability.

Parameters:

X (array) – Input data to fit the GMM. Should be 1D for this implementation (univariate GMM).
mean_bounds (list of tuples) – List of tuples specifying (min, max) constraints for each component’s mean. Length must equal n_components.
std_bounds (list of tuples) – List of tuples specifying (min, max) constraints for each component’s standard deviation. Lower bound should be > 0 for numerical stability. Length must equal n_components.
n_components (int) – Number of Gaussian components in the mixture.
n_epochs (int) – Number of iterations for optimization. Default is 5000.
lr (float) – Learning rate for the optimizer. Default is 0.001.
verbose (bool) – If True, prints progress every 200 epochs. Default is True.

Returns:

optimized_means (array) – Optimized means of the Gaussian components.
optimized_stds (array) – Optimized standard deviations of the Gaussian components.
optimized_weights (array) – Optimized weights (mixing proportions) of the Gaussian components.
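
The two numerical devices named above, Softmax for the weights and LogSumExp for the likelihood, can be sketched without PyTorch. The helper names softmax, bounded and gmm_nll are hypothetical, and the sigmoid mapping used to keep a parameter inside its (min, max) bounds is an assumed scheme, not necessarily the one used by the library.

```python
import math

def softmax(logits):
    """Map unconstrained logits to positive weights that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bounded(raw, low, high):
    """Squash an unconstrained value into (low, high) with a sigmoid (assumed scheme)."""
    return low + (high - low) / (1.0 + math.exp(-raw))

def gmm_nll(data, means, stds, weights):
    """Negative log-likelihood of a 1D mixture, using log-sum-exp per data point."""
    nll = 0.0
    for x in data:
        # Log of each weighted component density.
        terms = [math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
                 - 0.5 * ((x - m) / s) ** 2
                 for w, m, s in zip(weights, means, stds)]
        top = max(terms)
        # Subtracting the largest term keeps exp() from underflowing to zero.
        nll -= top + math.log(sum(math.exp(t - top) for t in terms))
    return nll

weights = softmax([0.3, -1.2, 2.0])
mean_0 = bounded(0.7, low=1.0, high=2.0)
# The point at 1000 would give log(0) if densities were summed directly.
loss = gmm_nll([0.0, 1000.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

An optimizer would then minimise gmm_nll with respect to the unconstrained logits and raw parameters.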

static gaussian_mixture_pdf(x, meds, stds, weights)[source]

Compute the PDF of a Gaussian Mixture Model.

Parameters:

x (array) – X values at which to compute the PDF.
meds (list or array) – Means for each Gaussian component.
stds (list or array) – Standard deviations for each Gaussian component.
weights (list or array) – Weights (relative importance that must sum to 1) for each Gaussian component.

Returns:

pdf – The computed PDF values for the Gaussian Mixture Model at each x.

Return type:

array
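
As a minimal sketch of the quantity being computed, the mixture PDF is the weighted sum of the component normal densities. The function name mixture_pdf is hypothetical and the example uses only the Python standard library rather than PyCO2stats itself.

```python
from statistics import NormalDist

def mixture_pdf(x_values, means, stds, weights):
    """Evaluate sum_k w_k * N(x | mu_k, sigma_k) at each x."""
    components = [NormalDist(m, s) for m, s in zip(means, stds)]
    return [sum(w * c.pdf(x) for w, c in zip(weights, components)) for x in x_values]

# Two equally weighted unit-variance components centred on 0 and 2.
pdf = mixture_pdf([0.0, 1.0, 2.0], means=[0.0, 2.0], stds=[1.0, 1.0], weights=[0.5, 0.5])
```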

static sample_from_gmm(n_samples, means, stds, weights, random_state=None)[source]

Samples a finite number of observations from a Gaussian Mixture Model (GMM).

Parameters:

n_samples (int) – The number of observations to sample.
means (array) – The means for each Gaussian component.
stds (array) – The standard deviations for each Gaussian component.
weights (array) – The weights (mixing proportions) for each Gaussian component. Weights should sum to 1.
random_state (int, Generator, or None) – Controls the randomness for sampling. If int, uses it as a seed. If Generator, uses it directly. If None, uses the global random state (non-deterministic).

Returns:

samples – An array of sampled observations from the GMM.

Return type:

array
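
Sampling from a mixture is commonly done in two stages: draw a component index with probability equal to its weight, then draw from that component's normal distribution. The sketch below assumes this scheme; sample_mixture and its seed argument are hypothetical names, not the library's API.

```python
import random

def sample_mixture(n_samples, means, stds, weights, seed=None):
    rng = random.Random(seed)
    # Stage 1: pick a component index per draw, weighted by the mixing proportions.
    picks = rng.choices(range(len(weights)), weights=weights, k=n_samples)
    # Stage 2: draw from the selected component's normal distribution.
    return [rng.gauss(means[k], stds[k]) for k in picks]

samples = sample_mixture(1000, [0.0, 10.0], [1.0, 1.0], [0.3, 0.7], seed=42)
```

Passing the same seed reproduces the same draws, which mirrors the role of random_state described above.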

Sinclair

class pyco2stats.sinclair.Sinclair[source]

Bases: object

Implements transformations between cumulative probabilities and sigma-values (standard normal quantiles) for probability plots.

static cumulative_to_sigma(p: ndarray) → ndarray[source]

Converts cumulative probabilities to sigma-values (z-scores).

Parameters:

p (array) – Array of cumulative probabilities in the range [0, 1].

Returns:

sigma_values – Array of sigma-values corresponding to the input probabilities.

Return type:

array

static sigma_to_cumulative(sigma: ndarray) → ndarray[source]

Converts sigma-values (z-scores) to cumulative probabilities.

Parameters:

sigma (array) – Array of sigma-values.

Returns:

cumulative_probs – Array of cumulative probabilities corresponding to the input sigma-values.

Return type:

array
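
The two conversions are the standard normal quantile function and its inverse, the standard normal CDF. A minimal standard-library sketch follows; to_sigma and to_cumulative are hypothetical names. Note that statistics.NormalDist.inv_cdf only accepts probabilities strictly between 0 and 1, so the endpoints would need separate handling.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def to_sigma(probabilities):
    # inv_cdf is only defined on the open interval (0, 1).
    return [_STD_NORMAL.inv_cdf(p) for p in probabilities]

def to_cumulative(sigmas):
    return [_STD_NORMAL.cdf(z) for z in sigmas]

sigmas = to_sigma([0.025, 0.5, 0.975])
back = to_cumulative(sigmas)  # round trip recovers the probabilities
```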

static get_raw_data(raw_data: ndarray) → tuple[ndarray, ndarray][source]

Converts raw data into sorted values and their corresponding sigma-values.

Parameters:

raw_data (array) – Array of raw data values.

Returns:

sigma_values (array) – Sigma-values corresponding to the empirical cumulative probabilities.
sorted_data (array) – Raw data sorted in ascending order.
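
One way to obtain the points of a probability plot is sketched below. The plotting-position formula (i - 0.5) / n is an assumption chosen for the example; the formula actually used by get_raw_data is not stated here and may differ. The name probability_plot_points is hypothetical.

```python
from statistics import NormalDist

def probability_plot_points(raw_data):
    sorted_data = sorted(raw_data)
    n = len(sorted_data)
    # Assumed plotting position: (i - 0.5) / n stays strictly inside (0, 1).
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    std_normal = NormalDist()
    sigma_values = [std_normal.inv_cdf(p) for p in probs]
    return sigma_values, sorted_data

sigma_values, sorted_data = probability_plot_points([3.2, 1.1, 7.8, 2.4, 5.0])
```

Plotting sorted_data against sigma_values gives a straight line for a single normal population and inflections where populations mix.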

static calculate_combined_population(x: ndarray, means: ndarray, stds: ndarray, weights: ndarray) → ndarray[source]

Computes the cumulative distribution of a weighted mixture of Gaussian distributions.

Parameters:

x (array) – Points at which to evaluate the combined cumulative distribution.
means (array) – Means of the individual Gaussian components.
stds (array) – Standard deviations of the Gaussian components.
weights (array) – Weights for each Gaussian component.

Returns:

y_comb – The combined cumulative distribution evaluated at the points x.

Return type:

array
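
The combined cumulative distribution of a Gaussian mixture is the weighted sum of the component CDFs. A minimal sketch, with the hypothetical name mixture_cdf and standard-library tools only:

```python
from statistics import NormalDist

def mixture_cdf(x_values, means, stds, weights):
    """Evaluate sum_k w_k * Phi((x - mu_k) / sigma_k) at each x."""
    components = [NormalDist(m, s) for m, s in zip(means, stds)]
    return [sum(w * c.cdf(x) for w, c in zip(weights, components)) for x in x_values]

# Symmetric two-component mixture: the CDF at the midpoint x = 1 is 0.5.
y_comb = mixture_cdf([-50.0, 1.0, 50.0], [0.0, 2.0], [1.0, 1.0], [0.5, 0.5])
```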

Statistics

class pyco2stats.stats.Stats[source]

Bases: object

static lognormal_median_ci(data, confidence_level=0.95)[source]

Estimates the median and its confidence interval for data assumed to be log-normally distributed.

Parameters:

data (array-like) – A list, numpy array, or pandas Series of positive numerical data points.
confidence_level (float) – The desired confidence level (e.g., 0.95 for 95%). Must be between 0 and 1.

Returns:

A dictionary containing:

‘median_estimate’: The point estimate of the median.
‘confidence_interval’: A tuple (lower_bound, upper_bound) for the median.

Returns None if input data is invalid (e.g., non-positive values, not enough data points).

Return type:

dict
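
For log-normal data the median equals exp(mu), where mu is the mean of the log-transformed values, so an interval for mu maps directly to an interval for the median. The sketch below uses a normal quantile for simplicity; a Student-t quantile is the usual choice for small samples and may be what the library uses. The function name lognormal_median_interval is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def lognormal_median_interval(data, confidence_level=0.95):
    # Reject inputs that cannot be log-transformed or are too short.
    if len(data) < 2 or any(x <= 0 for x in data):
        return None
    logs = [math.log(x) for x in data]
    mu, s, n = mean(logs), stdev(logs), len(logs)
    # Normal approximation (assumption); see the note above about Student-t.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * s / math.sqrt(n)
    return {
        "median_estimate": math.exp(mu),
        "confidence_interval": (math.exp(mu - half_width), math.exp(mu + half_width)),
    }

result = lognormal_median_interval([1.0, 10.0, 100.0])
```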

static bootstrap_mean_ci(data, n_bootstraps=1000, confidence_level=0.95)[source]

Estimates the mean and confidence interval for a log-normal distribution using the bootstrapping method.

Parameters:

data (array-like) – A 1D array or list containing the log-normally distributed data.
n_bootstraps (int) – The number of bootstrap samples to generate. Defaults to 1000.
confidence_level (float) – The desired confidence level for the interval. Must be between 0 and 1. Defaults to 0.95.

Returns:

A tuple containing:

estimated_mean (float): The estimated mean of the log-normal distribution (mean of bootstrap means).
ci_lower (float): The lower bound of the confidence interval.
ci_upper (float): The upper bound of the confidence interval.

Return type:

tuple
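
A percentile bootstrap is one common way to build such an interval: resample the data with replacement, record the mean of each resample, and read the bounds off the sorted bootstrap means. The sketch assumes this percentile variant; bootstrap_mean_interval and its seed argument are hypothetical.

```python
import random
from statistics import mean

def bootstrap_mean_interval(data, n_bootstraps=1000, confidence_level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(mean(rng.choices(data, k=n)) for _ in range(n_bootstraps))
    alpha = 1.0 - confidence_level
    lower_idx = int((alpha / 2) * n_bootstraps)
    upper_idx = min(int((1 - alpha / 2) * n_bootstraps), n_bootstraps - 1)
    return mean(boot_means), boot_means[lower_idx], boot_means[upper_idx]

# Synthetic log-normal sample for demonstration only.
gen = random.Random(1)
data = [gen.lognormvariate(0.0, 0.5) for _ in range(50)]
estimated_mean, ci_lower, ci_upper = bootstrap_mean_interval(data)
```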

static median(a, axis=None, out=None, overwrite_input=False, keepdims=False)[source]

Compute the median along the specified axis. Adapted from NumPy.

Returns the median of the array elements.

Parameters:

a (array_like) – Input array or object that can be converted to an array.
axis ({int, sequence of int, None}, optional) – Axis or axes along which the medians are computed. The default, axis=None, will compute the median along a flattened version of the array.
out (ndarray, optional) – Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output, but the type (of the output) will be cast if necessary.
overwrite_input (bool, optional) – If True, then allow use of memory of input array a for calculations. The input array will be modified by the call to median. This will save memory when you do not need to preserve the contents of the input array. Treat the input as undefined, but it will probably be fully or partially sorted. Default is False. If overwrite_input is True and a is not already an ndarray, an error will be raised.
keepdims (bool, optional) – If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.

Returns:

median – A new array holding the result. If the input contains integers or floats smaller than float64, then the output data-type is np.float64. Otherwise, the data-type of the output is the same as that of the input. If out is specified, that array is returned instead.

Return type:

ndarray

static mad(data, axis=None, func=None, ignore_nan=False)[source]

Calculate the median absolute deviation (MAD). Adapted from astropy.

The MAD is defined as median(abs(a - median(a))).

Parameters:

data (array-like) – Input array or object that can be converted to an array.
axis (None, int, or tuple of int, optional) – The axis or axes along which the MADs are computed. The default (None) is to compute the MAD of the flattened array.
func (callable, optional) – The function used to compute the median. Defaults to numpy.ma.median for masked arrays, otherwise to numpy.median.
ignore_nan (bool) – Ignore NaN values (treat them as if they are not in the array) when computing the median. This will use numpy.ma.median if axis is specified, or numpy.nanmedian if axis==None and numpy’s version is >1.10 because nanmedian is slightly faster in this case.

Returns:

mad – The median absolute deviation of the input array. If axis is None then a scalar will be returned, otherwise a numpy.ndarray will be returned.

Return type:

float or numpy.ndarray
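
The definition translates directly into a few lines of Python. This flattened, NaN-free sketch uses the hypothetical name median_absolute_deviation and the standard library only.

```python
from statistics import median

def median_absolute_deviation(values):
    center = median(values)
    # Median of the absolute deviations from the median.
    return median(abs(v - center) for v in values)

# The outlier 100 barely affects the result.
mad_value = median_absolute_deviation([1, 2, 3, 4, 100])
```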

static mad_std(data, axis=None, func=None, ignore_nan=False)[source]

Calculate a robust standard deviation using the median absolute deviation (MAD). Adapted from astropy.

The standard deviation estimator is given by:

\[\sigma \approx \frac{\textrm{MAD}}{\Phi^{-1}(3/4)} \approx 1.4826 \cdot \textrm{MAD}\]

where \(\Phi^{-1}(P)\) is the normal inverse cumulative distribution function evaluated at probability \(P = 3/4\).

Parameters:

data (array-like) – Data array or object that can be converted to an array.
axis (None, int, or tuple of int, optional) – The axis or axes along which the robust standard deviations are computed. The default (None) is to compute the robust standard deviation of the flattened array.
func (callable, optional) – The function used to compute the median. Defaults to numpy.ma.median for masked arrays, otherwise to numpy.median.
ignore_nan (bool) – Ignore NaN values (treat them as if they are not in the array) when computing the median. This will use numpy.ma.median if axis is specified, or numpy.nanmedian if axis=None and numpy’s version is >1.10 because nanmedian is slightly faster in this case.

Returns:

mad_std – The robust standard deviation of the input data. If axis is None then a scalar will be returned, otherwise a numpy.ndarray will be returned.

Return type:

float or numpy.ndarray
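
The scale factor in the formula above can be checked numerically from the standard normal quantile at 3/4. The names SCALE and robust_std below are hypothetical, and the sketch handles only flat, NaN-free input.

```python
from statistics import NormalDist, median

# 1 / Phi^{-1}(3/4), approximately 1.4826
SCALE = 1.0 / NormalDist().inv_cdf(0.75)

def robust_std(values):
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return SCALE * mad

sigma_robust = robust_std([1, 2, 3, 4, 100])
```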

static sigma_clip(data, sigma=3, sigma_lower=None, sigma_upper=None, maxiters=5, cenfunc='median', stdfunc='std', axis=None, masked=True, return_bounds=False, copy=True, grow=False)[source]

Perform sigma-clipping on the provided data. Adapted from astropy.

The data will be iterated over, each time rejecting values that are less or more than a specified number of standard deviations from a center value.

Clipped (rejected) pixels are those where:

\[data < center - (\sigma_{lower} * std)\]

\[data > center + (\sigma_{upper} * std)\]

where:

center = cenfunc(data [, axis=])
std = stdfunc(data [, axis=])

Invalid data values (i.e., NaN or inf) are automatically clipped.

For an object-oriented interface to sigma clipping, see SigmaClip.

Parameters:

data (array-like or numpy.ma.MaskedArray) – The data to be sigma clipped.
sigma (float, optional) – The number of standard deviations to use for both the lower and upper clipping limit. These limits are overridden by sigma_lower and sigma_upper, if input. The default is 3.
sigma_lower (float or None, optional) – The number of standard deviations to use as the lower bound for the clipping limit. If None then the value of sigma is used. The default is None.
sigma_upper (float or None, optional) – The number of standard deviations to use as the upper bound for the clipping limit. If None then the value of sigma is used. The default is None.
maxiters (int or None, optional) – The maximum number of sigma-clipping iterations to perform or None to clip until convergence is achieved (i.e., iterate until the last iteration clips nothing). If convergence is achieved prior to maxiters iterations, the clipping iterations will stop. The default is 5.
cenfunc ({'median', 'mean'} or callable, optional) – The statistic or callable function/object used to compute the center value for the clipping. If using a callable function/object and the axis keyword is used, then it must be able to ignore NaNs (e.g., numpy.nanmean) and it must have an axis keyword to return an array with axis dimension(s) removed. The default is 'median'.
stdfunc ({'std', 'mad_std'} or callable, optional) – The statistic or callable function/object used to compute the standard deviation about the center value. If using a callable function/object and the axis keyword is used, then it must be able to ignore NaNs (e.g., numpy.nanstd) and it must have an axis keyword to return an array with axis dimension(s) removed. The default is 'std'.
axis (None or int or tuple of int, optional) – The axis or axes along which to sigma clip the data. If None, then the flattened data will be used. axis is passed to the cenfunc and stdfunc. The default is None.
masked (bool, optional) – If True, then a numpy.ma.MaskedArray is returned, where the mask is True for clipped values. If False, then a numpy.ndarray is returned. The default is True.
return_bounds (bool, optional) – If True, then the minimum and maximum clipping bounds are also returned.
copy (bool, optional) – If True, then the data array will be copied. If False and masked=True, then the returned masked array data will contain the same array as the input data (if data is a numpy.ndarray or numpy.ma.MaskedArray). If False and masked=False, the input data is modified in-place. The default is True.
grow (float or False, optional) – Radius within which to mask the neighbouring pixels of those that fall outwith the clipping limits (only applied along axis, if specified). As an example, for a 2D image a value of 1 will mask the nearest pixels in a cross pattern around each deviant pixel, while 1.5 will also reject the nearest diagonal neighbours and so on.

Returns:

result – If masked=True, then a numpy.ma.MaskedArray is returned, where the mask is True for clipped values and where the input mask was True.

If masked=False, then a numpy.ndarray is returned.

If return_bounds=True, then in addition to the masked array or array above, the minimum and maximum clipping bounds are returned.

If masked=False and axis=None, then the output array is a flattened 1D numpy.ndarray where the clipped values have been removed. If return_bounds=True then the returned minimum and maximum thresholds are scalars.

If masked=False and axis is specified, then the output numpy.ndarray will have the same shape as the input data and contain np.nan where values were clipped. If the input data was a masked array, then the output numpy.ndarray will also contain np.nan where the input mask was True. If return_bounds=True then the returned minimum and maximum clipping thresholds will be numpy.ndarrays.

Return type:

array-like
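
The iteration can be sketched for flat, finite data as follows. This simplified version uses a single sigma for both bounds, the median as centre and the population standard deviation as spread; it ignores masks, axes and the grow option. The name sigma_clip_1d is hypothetical.

```python
from statistics import median, pstdev

def sigma_clip_1d(values, sigma=3.0, maxiters=5):
    kept = list(values)
    lower, upper = float("-inf"), float("inf")
    iteration = 0
    while maxiters is None or iteration < maxiters:
        center, spread = median(kept), pstdev(kept)
        lower, upper = center - sigma * spread, center + sigma * spread
        survivors = [v for v in kept if lower <= v <= upper]
        if len(survivors) == len(kept):
            break  # converged: the last pass clipped nothing
        kept = survivors
        iteration += 1
    return kept, lower, upper

data = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 50.0]
kept, lower, upper = sigma_clip_1d(data, sigma=3.0)
```

In this example the first pass rejects the value 50.0 and the second pass clips nothing, so the loop stops before reaching maxiters.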

static sigma_clipped_stats(data, mask=None, mask_value=None, sigma=3.0, sigma_lower=None, sigma_upper=None, maxiters=5, cenfunc='median', stdfunc='std', std_ddof=0, axis=None, grow=False)[source]

Calculate sigma-clipped statistics on the provided data. Adapted from astropy.

Parameters:

data (array-like or numpy.ma.MaskedArray) – Data array or object that can be converted to an array.
mask (numpy.ndarray (bool), optional) – A boolean mask with the same shape as data, where a True value indicates the corresponding element of data is masked. Masked pixels are excluded when computing the statistics.
mask_value (float, optional) – A data value (e.g., 0.0) that is ignored when computing the statistics. mask_value will be masked in addition to any input mask.
sigma (float, optional) – The number of standard deviations to use for both the lower and upper clipping limit. These limits are overridden by sigma_lower and sigma_upper, if input. The default is 3.
sigma_lower (float or None, optional) – The number of standard deviations to use as the lower bound for the clipping limit. If None then the value of sigma is used. The default is None.
sigma_upper (float or None, optional) – The number of standard deviations to use as the upper bound for the clipping limit. If None then the value of sigma is used. The default is None.
maxiters (int or None, optional) – The maximum number of sigma-clipping iterations to perform or None to clip until convergence is achieved (i.e., iterate until the last iteration clips nothing). If convergence is achieved prior to maxiters iterations, the clipping iterations will stop. The default is 5.
cenfunc ({'median', 'mean'} or callable, optional) – The statistic or callable function/object used to compute the center value for the clipping. If using a callable function/object and the axis keyword is used, then it must be able to ignore NaNs (e.g., numpy.nanmean) and it must have an axis keyword to return an array with axis dimension(s) removed. The default is 'median'.
stdfunc ({'std', 'mad_std'} or callable, optional) – The statistic or callable function/object used to compute the standard deviation about the center value. If using a callable function/object and the axis keyword is used, then it must be able to ignore NaNs (e.g., numpy.nanstd) and it must have an axis keyword to return an array with axis dimension(s) removed. The default is 'std'.
std_ddof (int, optional) – The delta degrees of freedom for the standard deviation calculation. The divisor used in the calculation is N - std_ddof, where N represents the number of elements. The default is 0.
axis (None or int or tuple of int, optional) – The axis or axes along which to sigma clip the data. If None, then the flattened data will be used. axis is passed to the cenfunc and stdfunc. The default is None.
grow (float or False, optional) – Radius within which to mask the neighbouring pixels of those that fall outwith the clipping limits (only applied along axis, if specified). As an example, for a 2D image a value of 1 will mask the nearest pixels in a cross pattern around each deviant pixel, while 1.5 will also reject the nearest diagonal neighbours and so on.

Notes

The best performance will typically be obtained by setting cenfunc and stdfunc to one of the built-in functions specified as a string. If one of the options is set to a string while the other has a custom callable, you may in some cases see better performance if you have the bottleneck package installed.

Returns:

mean, median, stddev – The mean, median, and standard deviation of the sigma-clipped data.

Return type:

float

static biweight_location(data, c=6.0, M=None, axis=None, ignore_nan=False)[source]

Compute the biweight location.

The biweight location is a robust statistic for determining the central location of a distribution. It is given by:

\[\zeta_{biloc}= M + \frac{\sum_{|u_i|<1}(x_i - M) (1 - u_i^2)^2}{\sum_{|u_i|<1}(1 - u_i^2)^2}\]

where \(x\) is the input data, \(M\) is the sample median (or the input initial location guess) and \(u_i\) is given by:

\[u_{i} = \frac{(x_i - M)}{c \cdot MAD}\]

where \(c\) is the tuning constant and \(MAD\) is the median absolute deviation. The biweight location tuning constant c is typically 6.0 (the default).

If \(MAD\) is zero, then the median will be returned.

Parameters:

data (array-like) – Input array or object that can be converted to an array. data can be a numpy.ma.MaskedArray.
c (float, optional) – Tuning constant for the biweight estimator (default = 6.0).
M (float or array-like, optional) – Initial guess for the location. If M is a scalar value, then its value will be used for the entire array (or along each axis, if specified). If M is an array, then it must be an array containing the initial location estimate along each axis of the input array. If None (default), then the median of the input array will be used (or along each axis, if specified).
axis (None, int, or tuple of int, optional) – The axis or axes along which the biweight locations are computed. If None (default), then the biweight location of the flattened input array will be computed.
ignore_nan (bool, optional) – Whether to ignore NaN values in the input data.

Returns:

biweight_location – The biweight location of the input data. If axis is None then a scalar will be returned, otherwise a numpy.ndarray will be returned.

Return type:

float or numpy.ndarray
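
The formula for the biweight location can be followed step by step in a flat, NaN-free sketch. The name biweight_location_1d is hypothetical, and the initial location M is always taken as the sample median.

```python
from statistics import median

def biweight_location_1d(values, c=6.0):
    m = median(values)
    mad = median(abs(v - m) for v in values)
    if mad == 0:
        return m  # degenerate case: fall back to the median
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad)
        if abs(u) < 1:  # points with |u| >= 1 get zero weight
            w = (1 - u * u) ** 2
            num += (v - m) * w
            den += w
    return m + num / den

symmetric = biweight_location_1d([1, 2, 3, 4, 5])
with_outlier = biweight_location_1d([1, 2, 3, 4, 100])
```

For the second data set the value 100 has |u| far above 1 and is excluded, so the estimate stays near the bulk of the data rather than near the arithmetic mean of 22.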

static biweight_scale(data, c=9.0, M=None, axis=None, modify_sample_size=False, ignore_nan=False)[source]

Compute the biweight scale.

The biweight scale is a robust statistic for determining the standard deviation of a distribution. It is the square root of the biweight midvariance.

It is given by:

\[\zeta_{biscl} = \sqrt{n}\frac{\sqrt{\sum_{|u_i| < 1}(x_i - M)^2 (1 - u_i^2)^4}} {|(\sum_{|u_i| < 1}(1 - u_i^2) (1 - 5u_i^2))|}\]

where \(x\) is the input data, \(M\) is the sample median (or the input location) and \(u_i\) is given by:

\[u_{i} = \frac{x_i - M}{c * MAD}\]

where \(c\) is the tuning constant and \(MAD\) is the median absolute deviation. The biweight midvariance tuning constant c is typically 9.0 (the default).

If \(MAD\) is zero, then zero will be returned.

For the standard definition of biweight scale, \(n\) is the total number of points in the array (or along the input axis, if specified). That definition is used if modify_sample_size is False, which is the default.

However, if modify_sample_size = True, then \(n\) is the number of points for which \(|u_i| < 1\) (i.e. the total number of non-rejected values), i.e.

\[n = \sum_{|u_i| < 1} 1\]

which results in a value closer to the true standard deviation for small sample sizes or for a large number of rejected values.

Parameters:

data (array-like) – Input array or object that can be converted to an array. data can be a numpy.ma.MaskedArray.
c (float, optional) – Tuning constant for the biweight estimator (default = 9.0).
M (float or array-like, optional) – The location estimate. If M is a scalar value, then its value will be used for the entire array (or along each axis, if specified). If M is an array, then it must be an array containing the location estimate along each axis of the input array. If None (default), then the median of the input array will be used (or along each axis, if specified).
axis (None, int, or tuple of int, optional) – The axis or axes along which the biweight scales are computed. If None (default), then the biweight scale of the flattened input array will be computed.
modify_sample_size (bool, optional) – If False (default), then the sample size used is the total number of elements in the array (or along the input axis, if specified), which follows the standard definition of biweight scale. If True, then the sample size is reduced to correct for any rejected values (i.e. the sample size used includes only the non-rejected values), which results in a value closer to the true standard deviation for small sample sizes or for a large number of rejected values.
ignore_nan (bool, optional) – Whether to ignore NaN values in the input data.

Returns:

biweight_scale – The biweight scale of the input data. If axis is None then a scalar will be returned, otherwise a numpy.ndarray will be returned.

Return type:

float or numpy.ndarray
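
A flat, NaN-free sketch of the biweight scale formula, including the modify_sample_size switch, is given below. The name biweight_scale_1d is hypothetical and M is always taken as the sample median.

```python
import math
from statistics import median

def biweight_scale_1d(values, c=9.0, modify_sample_size=False):
    m = median(values)
    mad = median(abs(v - m) for v in values)
    if mad == 0:
        return 0.0
    num = den = 0.0
    n_kept = 0
    for v in values:
        u = (v - m) / (c * mad)
        if abs(u) < 1:  # only non-rejected points enter the sums
            num += (v - m) ** 2 * (1 - u * u) ** 4
            den += (1 - u * u) * (1 - 5 * u * u)
            n_kept += 1
    # Standard definition uses all points; the modified one counts kept points only.
    n = n_kept if modify_sample_size else len(values)
    return math.sqrt(n) * math.sqrt(num) / abs(den)

scale = biweight_scale_1d([1, 2, 3, 4, 5])
```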

static trim(a, limits=None, inclusive=(True, True), axis=None)[source]

Trims an array by masking the data outside some given limits. Adapted from SciPy.

Returns a masked version of the input array.

Parameters:

a (array_like) – Input array.
limits ({None, tuple of float}, optional) – Tuple of the percentages to cut on each side of the array, with respect to the number of unmasked data, as floats between 0. and 1. Noting n the number of unmasked data before trimming, the (n*limits[0])th smallest data and the (n*limits[1])th largest data are masked, and the total number of unmasked data after trimming is n*(1.-sum(limits)). The value of one limit can be set to None to indicate an open interval.
inclusive ({(True, True) tuple}, optional) – Tuple indicating whether the number of data being masked on each side should be truncated (True) or rounded (False).
axis ({None, int}, optional) – Axis along which to trim. If None, the whole array is trimmed, but its shape is maintained.

static trimmed_mean(a, limits=None, inclusive=(True, True), axis=None)[source]

Compute the trimmed mean, given a lower and an upper limit. Adapted from SciPy stats.

This function finds the arithmetic mean of given values, ignoring values outside the given limits.

Parameters:

a (array_like) – Array of values.
limits (None or (lower limit, upper limit), optional) – Values in the input array less than the lower limit or greater than the upper limit will be ignored. When limits is None (default), then all values are used. Either of the limit values in the tuple can also be None representing a half-open interval.
inclusive ((bool, bool), optional) – A tuple consisting of the (lower flag, upper flag). These flags determine whether values exactly equal to the lower or upper limits are included. The default value is (True, True).
axis (int or None, optional) – Axis along which to compute the mean. Default is None.

Returns:

tmean – Trimmed mean.

Return type:

ndarray
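
The limit-based trimming described by the parameters above can be sketched for a flat list. The name limited_mean is hypothetical; the sketch follows the parameter descriptions (value limits with inclusive flags) and ignores the axis argument.

```python
def limited_mean(values, limits=None, inclusive=(True, True)):
    lower, upper = limits if limits is not None else (None, None)

    def keep(v):
        # A None limit leaves that side of the interval open.
        if lower is not None and (v < lower or (v == lower and not inclusive[0])):
            return False
        if upper is not None and (v > upper or (v == upper and not inclusive[1])):
            return False
        return True

    kept = [v for v in values if keep(v)]
    return sum(kept) / len(kept)

tmean = limited_mean([1, 2, 3, 4, 100], limits=(0, 10))
```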

static trimmed_std(a, limits=(0.1, 0.1), inclusive=(1, 1), relative=True, axis=None, ddof=0)[source]

Returns the trimmed standard deviation of the data along the given axis. Adapted from SciPy stats.

Parameters:

a (array_like) – Input array.
limits (tuple of float, optional) – The lower and upper fraction of elements to trim. These fractions should be between 0 and 1.
inclusive (tuple of {0, 1}, optional) – Tuple indicating whether the number of data being masked on each side should be truncated (1) or rounded (0).
relative (bool, optional) – Whether to treat the limits as relative or absolute positions.
axis (int, optional) – Axis along which to perform the trimming.
ddof (int, optional) – Delta degrees of freedom. The denominator used in the calculations is n - ddof, where n represents the number of elements.

static trimboth(a, proportiontocut=0.2, axis=0)[source]

Slice off a proportion of items from both ends of an array.

Slice off the passed proportion of items from both ends of the passed array (i.e., with proportiontocut = 0.1, slices leftmost 10% and rightmost 10% of scores). The trimmed values are the lowest and highest ones. Slice off less if proportion results in a non-integer slice index (i.e. conservatively slices off proportiontocut).

Parameters:

a (array_like) – Data to trim.
proportiontocut (float) – Proportion (in range 0-1) of total data set to trim off each end.
axis (int or None, optional) – Axis along which to trim data. Default is 0. If None, compute over the whole array a.

Returns:

out – Trimmed version of array a. The order of the trimmed content is undefined.

Return type:

ndarray
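
The slicing rule can be illustrated on a flat list. This sketch sorts the data fully, so the output order is defined here even though the function above leaves it undefined; trim_both_ends is a hypothetical name.

```python
def trim_both_ends(values, proportiontocut=0.2):
    ordered = sorted(values)
    n = len(ordered)
    # Truncating to an integer trims conservatively (never more than requested).
    cut = int(proportiontocut * n)
    if 2 * cut >= n:
        raise ValueError("proportion too large: nothing would remain")
    return ordered[cut:n - cut]

trimmed = trim_both_ends(range(10), proportiontocut=0.2)
```

With ten values and proportiontocut = 0.25 the cut index is still 2 per side, because 2.5 is truncated rather than rounded.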

Error propagation

class pyco2stats.propagate_errors.Propagate_Errors[source]

Bases: object

A class to perform Monte Carlo error propagation for Gaussian Mixture Models (GMMs) fitted using different methods. Assumes input data is log-transformed. Propagates errors by adding fixed-standard-deviation additive noise to log-transformed data, corresponding to relative error on the original scale. Includes parameter alignment.

static propagate_em_error(original_log_data, percentage_relative_error: float, n_simulations: int, n_components: int, random_state: None, max_iter: int = 100, tol: float = 1e-06, show_progress: bool = False) → dict[source]

Propagates error through the EM-based GMM fitting by simulating additive noise on the log-transformed sample data using Monte Carlo. Noise std dev on log scale is fixed, equal to percentage_relative_error / 100. Aligns components by sorting means before storing results. Returns perturbed data statistics for diagnostics.

Parameters:

original_log_data (array) – The original 1D log-transformed data.
percentage_relative_error (float) – The relative uncertainty in percent (e.g., 5 for 5%). This sets the std dev of additive noise on log scale.
n_simulations (int) – The number of Monte Carlo simulations to run.
n_components (int) – The number of Gaussian components in the mixture.
max_iter (int) – Max iterations for the EM algorithm.
tol (float) – Tolerance for EM convergence.

Returns:

result – A dictionary containing lists of results from each simulation: {‘means’: list[np.ndarray], ‘std_devs’: list[np.ndarray], ‘weights’: list[np.ndarray], ‘perturbed_data_means’: list[float], ‘perturbed_data_stds’: list[float]} Each np.ndarray in the GMM parameter lists has shape (n_components,). The lists of data statistics have length n_simulations.

Return type:

dict
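
The Monte Carlo loop described above can be sketched as follows. Additive noise of standard deviation percentage_relative_error / 100 on the log scale corresponds to that relative error on the original scale to first order, assuming natural logarithms, since d(ln x) = dx / x. All function names here are hypothetical, and the GMM fit itself is left out: only the perturbation, the diagnostic statistics and the alignment step are shown.

```python
import random
from statistics import mean, pstdev

def perturb_log_data(log_data, percentage_relative_error, rng):
    """Add zero-mean Gaussian noise with std = percentage / 100 to each log value."""
    noise_std = percentage_relative_error / 100.0
    return [x + rng.gauss(0.0, noise_std) for x in log_data]

def align_by_mean(means, stds, weights):
    """Reorder component parameters so that the means are ascending."""
    order = sorted(range(len(means)), key=lambda k: means[k])
    return ([means[k] for k in order], [stds[k] for k in order],
            [weights[k] for k in order])

def monte_carlo_data_stats(log_data, percentage_relative_error, n_simulations, seed=0):
    rng = random.Random(seed)
    results = {"perturbed_data_means": [], "perturbed_data_stds": []}
    for _ in range(n_simulations):
        perturbed = perturb_log_data(log_data, percentage_relative_error, rng)
        # A GMM fit of `perturbed` would go here, followed by align_by_mean.
        results["perturbed_data_means"].append(mean(perturbed))
        results["perturbed_data_stds"].append(pstdev(perturbed))
    return results

log_data = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
results = monte_carlo_data_stats(log_data, percentage_relative_error=5, n_simulations=100)
aligned = align_by_mean([3.0, 1.0], [0.3, 0.1], [0.6, 0.4])
```

Sorting by mean before storing each simulation's parameters keeps component k comparable across simulations, which is what makes per-component summary statistics meaningful.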

static propagate_sklearn_error(original_log_data, percentage_relative_error: float, n_simulations: int, n_components: int, max_iter: int = 10, tol: float = 1e-10, n_init: int = 20, suppress_warnings: bool = True, covariance_type: str = 'spherical', show_progress: bool = False) → dict[source]

Propagates error through the sklearn-based GMM fitting by simulating additive noise on the log-transformed sample data using Monte Carlo. Noise std dev on log scale is fixed, equal to percentage_relative_error / 100. Aligns components by sorting means before storing results.

Parameters:

original_log_data (array) – The original 1D log-transformed data.
percentage_relative_error (float) – The relative uncertainty in percent (e.g., 5 for 5%). This sets the std dev of additive noise on log scale.
n_simulations (int) – The number of Monte Carlo simulations to run.
n_components (int) – The number of Gaussian components in the mixture.
max_iter (int) – Max iterations for the sklearn EM algorithm.
tol (float) – Tolerance for sklearn EM convergence.
n_init (int) – Number of initializations for sklearn GMM.
suppress_warnings (bool) – Whether to suppress sklearn warnings.
covariance_type (string) – The type of covariance (‘spherical’, ‘diag’, ‘full’, ‘tied’).

Returns:

result – A dictionary containing lists of results from each simulation: {‘means’: list[np.ndarray], ‘std_devs’: list[np.ndarray], ‘weights’: list[np.ndarray]} Each np.ndarray in the lists has shape (n_components,).

Return type:

dict
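
The Monte Carlo procedure described for this method can be sketched in plain Python. The helper names perturb_log_data and align_by_mean are hypothetical and are not part of PyCO2stats. The sketch only illustrates the noise model (additive Gaussian noise with standard deviation percentage_relative_error / 100 on the log scale) and the alignment of components by sorted means; the GMM fit itself is left out, and the component values are made up.

```python
import random


def perturb_log_data(log_data, percentage_relative_error, rng):
    """Add zero-mean Gaussian noise to log-transformed data.

    Hypothetical helper: the noise std dev is percentage_relative_error / 100.
    """
    sigma = percentage_relative_error / 100.0
    return [x + rng.gauss(0.0, sigma) for x in log_data]


def align_by_mean(means, std_devs, weights):
    """Reorder the components so that their means are ascending."""
    order = sorted(range(len(means)), key=lambda i: means[i])
    return (
        [means[i] for i in order],
        [std_devs[i] for i in order],
        [weights[i] for i in order],
    )


rng = random.Random(0)  # seeded for reproducibility
log_data = [0.5, 0.7, 1.9, 2.1, 2.0]
perturbed = perturb_log_data(log_data, 5, rng)  # 5% -> sigma = 0.05

# Components as a fit might return them, in arbitrary order (made-up values).
aligned = align_by_mean([2.0, 0.6], [0.1, 0.15], [0.6, 0.4])
```

Sorting by mean keeps "component 1" referring to the same population in every simulation, so that per-component statistics can be computed across runs.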

static propagate_constrained_error(original_log_data, percentage_relative_error: float, n_simulations: int, mean_constraints: list, std_constraints: list, n_components: int, n_epochs: int = 5000, lr: float = 0.001, verbose: bool = False, show_progress: bool = False) → dict[source]

Propagates error through the constrained PyTorch-based GMM fitting with a Monte Carlo simulation that adds noise to the log-transformed sample data. The noise standard deviation on the log scale is fixed at percentage_relative_error / 100. Components are aligned across simulations by sorting them by mean before the results are stored.

Parameters:

original_log_data (array) – The original 1D log-transformed data.
percentage_relative_error (float) – The relative uncertainty in percent (e.g., 5 for 5%). This sets the std dev of additive noise on log scale.
n_simulations (int) – The number of Monte Carlo simulations to run.
mean_constraints (list) – List of tuples specifying (min, max) constraints for each component’s mean on the log scale.
std_constraints (list) – List of tuples specifying (min, max) constraints for each component’s std dev on the log scale.
n_components (int) – Number of Gaussian components.
n_epochs (int) – Number of optimization epochs for constrained GMM.
lr (float) – Learning rate for optimization.
verbose (bool) – Whether to print progress during constrained GMM fitting (False recommended for MC).
show_progress (bool) – Whether to display progress while the simulations run. Default is False.

Returns:

result – A dictionary containing lists of results from each simulation: {‘means’: list[np.ndarray], ‘std_devs’: list[np.ndarray], ‘weights’: list[np.ndarray]} Each np.ndarray in the lists has shape (n_components,).

Return type:

dict
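
As an illustration of the constraint format, the sketch below builds mean_constraints and std_constraints for a two-component model and checks them for consistency. The helper check_constraints and the numeric bounds are assumptions made for this example; PyCO2stats may validate its inputs differently, or not at all.

```python
def check_constraints(constraints, n_components, name):
    """Hypothetical check: one (min, max) tuple per component, with min < max."""
    if len(constraints) != n_components:
        raise ValueError(f"{name}: expected {n_components} tuples, got {len(constraints)}")
    for lower, upper in constraints:
        if not lower < upper:
            raise ValueError(f"{name}: invalid bounds ({lower}, {upper})")
    return True


n_components = 2

# Bounds on the log scale, one (min, max) tuple per component (made-up values).
mean_constraints = [(0.0, 1.5), (1.5, 3.0)]
std_constraints = [(0.05, 0.5), (0.05, 0.5)]

constraints_ok = (
    check_constraints(mean_constraints, n_components, "mean_constraints")
    and check_constraints(std_constraints, n_components, "std_constraints")
)
```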

static elaborate_results(propagation_results: dict, single_fit_means: ndarray = None, single_fit_std_devs: ndarray = None, single_fit_weights: ndarray = None, original_data_mean: float = None, original_data_std: float = None, method_name: str = 'GMM')[source]

Elaborates and prints the results from a Monte Carlo error propagation run.

Parameters:

propagation_results (dict) – The dictionary returned by a propagate_*_error method. Expected keys: ‘means’, ‘std_devs’, ‘weights’. May also contain ‘perturbed_data_means’, ‘perturbed_data_stds’ for certain methods (e.g., EM).
single_fit_means (array, optional) – Means from a single fit on original data.
single_fit_std_devs (array, optional) – Std devs from a single fit on original data.
single_fit_weights (array, optional) – Weights from a single fit on original data.
original_data_mean (float, optional) – Mean of the original log-transformed data.
original_data_std (float, optional) – Std dev of the original log-transformed data.
method_name (string) – The name of the GMM method used for reporting.

Return type:

None
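
A minimal sketch of the kind of summary that can be derived from a propagation result is shown below. It is not the implementation of elaborate_results: the function summarize_propagation is hypothetical, the input uses plain lists instead of np.ndarray, and the numbers are made up. It computes, for each component, the mean and sample standard deviation of each parameter across simulations.

```python
import statistics


def summarize_propagation(propagation_results):
    """Return {key: [(mean, std), ...]} with one tuple per component."""
    summary = {}
    for key in ("means", "std_devs", "weights"):
        per_simulation = propagation_results[key]
        # Transpose: one sequence of simulated values per component.
        per_component = zip(*per_simulation)
        summary[key] = [
            (statistics.mean(values), statistics.stdev(values))
            for values in per_component
        ]
    return summary


# Three simulations of a two-component model (made-up values).
results = {
    "means": [[1.0, 3.0], [1.2, 3.2], [0.8, 2.8]],
    "std_devs": [[0.2, 0.3], [0.25, 0.35], [0.15, 0.25]],
    "weights": [[0.4, 0.6], [0.5, 0.5], [0.3, 0.7]],
}
summary = summarize_propagation(results)
```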

Matplotlib visualization

class pyco2stats.visualize_mpl.Visualize_Mpl[source]

Bases: object

Class for plotting Sinclair-style probability plots for raw data and GMMs.

static pp_raw_data(raw_data, ax=None, **scatter_kwargs)[source]

Plot a probability plot of raw data using Sinclair transformation.

Parameters:

raw_data (array-like) – Array of raw data values.
ax (matplotlib.axes.Axes, optional) – Matplotlib Axes object to plot on. Creates new one if None.
**scatter_kwargs (dict) – Additional keyword arguments passed to ax.scatter().

Returns:

ax – The Axes object with the plot.

Return type:

matplotlib.axes.Axes
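
The idea behind a probability plot of raw data can be sketched with the standard library alone. The function probability_plot_coords is hypothetical, and the plotting-position formula (i - 0.5) / n is an assumption; the library may use a different formula. Each sorted value is paired with the standard-normal quantile of its empirical cumulative probability.

```python
from statistics import NormalDist


def probability_plot_coords(raw_data):
    """Return (z_scores, sorted_values) for a normal probability plot."""
    values = sorted(raw_data)
    n = len(values)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    z_scores = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return z_scores, values


z_scores, values = probability_plot_coords([12.0, 3.5, 7.1, 150.0, 22.4])
```

If the data follow a single Gaussian population, the points fall close to a straight line; bends in the curve suggest a mixture of populations.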

static pp_combined_population(means, stds, weights, x_range=(-3.5, 3.5), ax=None, **line_kwargs)[source]

Plot the cumulative distribution of a Gaussian mixture model on a probability plot.

Parameters:

means (array-like) – Means of Gaussian components.
stds (array-like) – Standard deviations of Gaussian components.
weights (array-like) – Weights of each Gaussian component.
x_range (tuple, optional) – Range of sigma-values (x-axis) to display.
ax (matplotlib.axes.Axes, optional) – Axes to plot on. Creates new one if None.
**line_kwargs (dict) – Additional arguments passed to ax.plot().

Returns:

ax – The Axes object with the plot.

Return type:

matplotlib.axes.Axes
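
The curve of a combined population can be sketched as follows, assuming the mixture CDF is the weighted sum of the component CDFs and that each cumulative probability is mapped to a sigma-value through the inverse standard-normal CDF. The function names and the parameter values are illustrative and not part of PyCO2stats.

```python
from statistics import NormalDist


def mixture_cdf(x, means, stds, weights):
    """Weighted sum of the component Gaussian CDFs evaluated at x."""
    return sum(
        w * NormalDist(m, s).cdf(x) for m, s, w in zip(means, stds, weights)
    )


def mixture_sigma_value(x, means, stds, weights):
    """Map the mixture's cumulative probability at x to a z-score."""
    return NormalDist().inv_cdf(mixture_cdf(x, means, stds, weights))


means, stds, weights = [1.0, 2.5], [0.2, 0.3], [0.4, 0.6]
xs = [0.8, 1.0, 1.5, 2.5, 3.0]
curve = [(mixture_sigma_value(x, means, stds, weights), x) for x in xs]
```

For a single component with weight 1 the curve reduces to a straight line, since the sigma-value is then (x - mean) / std.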

static pp_single_populations(means, stds, z_range=(-3.5, 3.5), ax=None, **line_kwargs)[source]

Plot individual Gaussian distributions on a probability plot.

Parameters:

means (array-like) – Means of the Gaussian components.
stds (array-like) – Standard deviations of the Gaussian components.
z_range (tuple, optional) – Range of z-values to use for plotting.
ax (matplotlib.axes.Axes, optional) – Axes to plot on. Creates new one if None.
**line_kwargs (dict) – Additional arguments passed to ax.plot().

Returns:

ax – The Axes object with the plots.

Return type:

matplotlib.axes.Axes

pp_one_population(mean, std, z_range=(-3.5, 3.5), ax=None, **line_kwargs)[source]

Plot a single Gaussian distribution on a probability plot.

Parameters:

mean (float) – Mean of the Gaussian distribution.
std (float) – Standard deviation of the Gaussian distribution.
z_range (tuple, optional) – Range of z-values to use for plotting.
ax (matplotlib.axes.Axes, optional) – Axes to plot on. Creates new one if None.
**line_kwargs (dict) – Additional arguments passed to ax.plot().

Returns:

ax – The Axes object with the plot.

Return type:

matplotlib.axes.Axes
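
On a probability plot a single Gaussian population appears as a straight line, because the value at z-score z is mean + std * z. A minimal sketch under that assumption follows; the function population_line and its n_points parameter are hypothetical and not part of the library.

```python
def population_line(mean, std, z_range=(-3.5, 3.5), n_points=50):
    """Return (z, values) for a Gaussian drawn on a probability plot."""
    z_min, z_max = z_range
    step = (z_max - z_min) / (n_points - 1)
    z = [z_min + i * step for i in range(n_points)]
    values = [mean + std * zi for zi in z]
    return z, values


z, values = population_line(mean=2.0, std=0.5)
```

The intercept at z = 0 is the mean and the slope is the standard deviation, which is why both can be read directly off such a plot.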

static pp_add_sigma_grid(ax=None, sigma_ticks=array([-3, -2, -1, 0, 1, 2, 3]))[source]

Add vertical grid lines at specified sigma (z-score) positions.

Parameters:

ax (matplotlib.axes.Axes, optional) – Axes to add the grid to. Creates new one if None.
sigma_ticks (array-like) – Positions (z-scores) where grid lines should be added.

Returns:

ax – The Axes object with the updated grid.

Return type:

matplotlib.axes.Axes

static pp_add_percentiles(ax=None, percentiles='standard', linestyle='-.', linewidth=1, color='green', label_size=10, **plot_kwargs)[source]

Add percentile reference lines and labels to the top axis.

Parameters:

ax (matplotlib.axes.Axes, optional) – Axes to annotate. Creates new one if None.
percentiles (str or list, optional) – Which percentiles to use: ‘standard’, ‘full’, or a custom list.
linestyle (str) – Line style for vertical lines.
linewidth (float) – Width of the percentile lines.
color (str) – Color of percentile lines.
label_size (int) – Font size of percentile labels.
**plot_kwargs (dict) – Additional keyword arguments for ax.axvline().

Returns:

ax – The Axes object with added percentile lines and labels.

Return type:

matplotlib.axes.Axes
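
Percentile reference lines sit at the z-scores that correspond to the chosen cumulative probabilities. The sketch below shows that conversion with the standard library. The list of percentiles is illustrative only; the contents of the ‘standard’ and ‘full’ presets are not reproduced here, and percentile_positions is a hypothetical helper.

```python
from statistics import NormalDist


def percentile_positions(percentiles):
    """Convert percentiles (0-100, exclusive) to standard-normal z-scores."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(p / 100.0) for p in percentiles]


# Illustrative custom list, not one of the library presets.
example_percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]
positions = percentile_positions(example_percentiles)
```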

static qq_plot(raw_data, model_data, ax=None, line_kwargs=None, marker_kwargs=None)[source]

Create a Q-Q plot comparing raw data to model-simulated data.

Parameters:

raw_data (array-like) – Observed dataset.
model_data (array-like) – Simulated or reference dataset.
ax (matplotlib.axes.Axes, optional) – Axes object on which to draw the plot.
line_kwargs (dict, optional) – Keyword arguments for the reference line.
marker_kwargs (dict, optional) – Keyword arguments for the scatter points.

Return type:

None
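
The points of a Q-Q plot can be sketched as matching quantiles of the two samples. How PyCO2stats pairs the quantiles is not stated here, so the approach below (equally spaced quantiles from statistics.quantiles) is an assumption, and qq_points and n_quantiles are hypothetical names.

```python
from statistics import quantiles


def qq_points(raw_data, model_data, n_quantiles=20):
    """Return (model_quantiles, raw_quantiles) at matching probabilities."""
    q_raw = quantiles(raw_data, n=n_quantiles, method="inclusive")
    q_model = quantiles(model_data, n=n_quantiles, method="inclusive")
    return q_model, q_raw


raw = [1, 2, 3, 4, 5]
model = [1, 2, 3, 4, 5]
q_model, q_raw = qq_points(raw, model, n_quantiles=4)
```

Because quantiles are compared rather than individual observations, the two samples do not need to have the same size. Points close to the line y = x indicate that the model reproduces the observed distribution.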

plot_gmm_pdf(x, meds, stds, weights, ax=None, data=None, pdf_plot_kwargs=None, component_plot_kwargs=None, hist_plot_kwargs=None)[source]

Plot the Gaussian Mixture Model PDF and its components.

Parameters:

x (array) – x values at which to evaluate the PDFs.
meds (list or array) – Means of the Gaussian components.
stds (list or array) – Standard deviations of the Gaussian components.
weights (list or array) – Weights of the Gaussian components.
ax (matplotlib.axes.Axes, optional) – Axes object to plot on.
data (list or array, optional) – Raw data to plot as a histogram.
pdf_plot_kwargs (dict, optional) – Keyword arguments for the main GMM PDF plot.
component_plot_kwargs (dict, optional) – Keyword arguments for the individual component plots.
hist_plot_kwargs (dict, optional) – Keyword arguments for the histogram plot.

Return type:

None
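
The quantities drawn by such a plot can be sketched without any plotting library: each component curve is its Gaussian PDF scaled by its weight, and the mixture PDF is the sum of those curves. The function gmm_pdf and the parameter values are illustrative assumptions, not the library's implementation.

```python
from statistics import NormalDist


def gmm_pdf(x, meds, stds, weights):
    """Return (weighted component PDFs, total mixture PDF) evaluated on x."""
    components = [
        [w * NormalDist(m, s).pdf(xi) for xi in x]
        for m, s, w in zip(meds, stds, weights)
    ]
    # Sum the weighted component curves point by point.
    total = [sum(vals) for vals in zip(*components)]
    return components, total


dx = 0.01
x = [i * dx for i in range(401)]  # grid from 0 to 4
components, total = gmm_pdf(x, [1.0, 2.5], [0.2, 0.3], [0.4, 0.6])

# Riemann-sum check: a properly weighted mixture integrates to about 1.
area = sum(total) * dx
```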

Plotly visualization

class pyco2stats.visualize_plotly.Visualize_Plotly[source]

Bases: object

Plotly-based Sinclair-style probability plots for raw data and GMMs.

static pp_raw_data(raw_data, fig=None, marker_kwargs=None)[source]

Plot raw data on log-normal probability paper.

Parameters:

raw_data (array-like) – The raw data values to plot.
fig (plotly.graph_objects.Figure, optional) – Existing figure to add the trace to. If None, the trace is returned without adding to any figure.
marker_kwargs (dict, optional) – Marker style options, either as top-level keys (size, color, etc.) or nested under ‘marker’.

Returns:

trace – The Scatter trace representing the raw data.

Return type:

plotly.graph_objects.Scatter
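
The two accepted shapes of marker_kwargs can be illustrated with a small normalization helper. This is a sketch of one way to handle both forms, not the library's code; normalize_marker_kwargs is a hypothetical name.

```python
def normalize_marker_kwargs(marker_kwargs=None):
    """Return a flat dict of marker options from either accepted form."""
    if not marker_kwargs:
        return {}
    if "marker" in marker_kwargs:
        # Nested form: {"marker": {"size": 8, "color": "black"}}
        return dict(marker_kwargs["marker"])
    # Flat form: {"size": 8, "color": "black"}
    return dict(marker_kwargs)


flat = normalize_marker_kwargs({"size": 8, "color": "black"})
nested = normalize_marker_kwargs({"marker": {"size": 8, "color": "black"}})
```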

static pp_one_population(mean, std, fig=None, z_range=(-3.5, 3.5), line_kwargs=None)[source]

Plot a single Gaussian population line on probability paper.

Parameters:

mean (float) – Mean of the Gaussian.
std (float) – Standard deviation of the Gaussian.
fig (plotly.graph_objects.Figure, optional) – Existing figure to add the trace to.
z_range (tuple, optional) – Z-score range over which to compute the line.
line_kwargs (dict, optional) – Line styling arguments.

Returns:

trace – Line trace of the Gaussian population.

Return type:

plotly.graph_objects.Scatter

static pp_single_populations(means, stds, fig=None, z_range=(-3.5, 3.5), line_kwargs=None)[source]

Plot each Gaussian component as a separate line.

Parameters:

means (array-like) – Means of the Gaussian components.
stds (array-like) – Standard deviations of the components.
fig (plotly.graph_objects.Figure, optional) – Existing figure to add the traces to.
z_range (tuple, optional) – Z-score range over which to plot.
line_kwargs (dict, optional) – Line styling options.

Returns:

traces – List of traces, one per component.

Return type:

list of plotly.graph_objects.Scatter

static pp_combined_population(means, stds, weights, fig=None, z_range=(-3.5, 3.5), line_kwargs=None)[source]

Plot the combined Gaussian mixture CDF as a line on probability paper.

Parameters:

means (array-like) – Means of the Gaussian components.
stds (array-like) – Standard deviations of the components.
weights (array-like) – Mixture weights of the components.
fig (plotly.graph_objects.Figure, optional) – Existing figure to add the trace to.
z_range (tuple, optional) – Z-value range for evaluation.
line_kwargs (dict, optional) – Line styling options.

Returns:

trace – Trace representing the combined population CDF.

Return type:

plotly.graph_objects.Scatter

static pp_add_percentiles(fig, percentiles='full', line_kwargs: dict = None, label_kwargs: dict = None, y_min: float = None, y_max: float = None)[source]

Add vertical percentile lines and labels to a Plotly figure.

Parameters:

fig (plotly.graph_objects.Figure) – The figure to which the percentiles are added.
percentiles (str or list of float) – Either ‘full’ for the default percentiles or a custom list.
line_kwargs (dict, optional) – Styling for vertical percentile lines.
label_kwargs (dict, optional) – Styling for percentile text annotations.
y_min (float, optional) – Minimum y-coordinate for the vertical lines.
y_max (float, optional) – Maximum y-coordinate for the vertical lines.

Return type:

None

static plot_gmm_pdf(x_values, meds, stds, weights, data=None, pdf_plot_kwargs=None, component_plot_kwargs=None, hist_plot_kwargs=None)[source]

Generate Plotly traces for a Gaussian mixture PDF:

Histogram of raw data (probability density)
Individual component PDFs
Total mixture PDF

Parameters:

x_values (np.ndarray) – Points at which to evaluate the PDFs.
meds (array-like) – Means of the Gaussian components.
stds (array-like) – Standard deviations of the components.
weights (array-like) – Mixture weights for each component.
data (array-like, optional) – Raw data to include as a histogram.
pdf_plot_kwargs (dict, optional) – Style arguments for the total PDF line.
component_plot_kwargs (dict, optional) – Style arguments for the component lines.
hist_plot_kwargs (dict, optional) – Style arguments for the histogram, including ‘bins’.

Returns:

hist_trace (plotly.graph_objects.Histogram or None) – Histogram trace if data is provided.
comp_traces (list of plotly.graph_objects.Scatter) – List of individual component PDF traces.
pdf_trace (plotly.graph_objects.Scatter) – Trace for the full GMM PDF.

static qq_plot(raw_data, model_data, fig=None, marker_kwargs=None, line_kwargs=None)[source]

Draw a Q–Q plot comparing two samples.

Parameters:

raw_data (array-like) – Observed dataset.
model_data (array-like) – Simulated or modeled dataset.
fig (plotly.graph_objects.Figure, optional) – Figure to which the traces will be added.
marker_kwargs (dict, optional) – Styling options for the Q–Q points.
line_kwargs (dict, optional) – Styling options for the y = x reference line.

Returns:

pts (plotly.graph_objects.Scatter) – Q–Q scatter trace.
line (plotly.graph_objects.Scatter) – Identity line trace (y = x).
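
The identity line only needs two points. A minimal sketch, assuming the line spans the combined range of both samples (the library may choose its endpoints differently), with identity_line_endpoints as a hypothetical helper:

```python
def identity_line_endpoints(raw_data, model_data):
    """Return (x, y) endpoints of the y = x line covering both samples."""
    lower = min(min(raw_data), min(model_data))
    upper = max(max(raw_data), max(model_data))
    return [lower, upper], [lower, upper]


line_x, line_y = identity_line_endpoints([0.4, 1.2, 2.9], [0.5, 1.0, 3.1])
```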