maxframe.learn.preprocessing.scale

maxframe.learn.preprocessing.scale(X, *, axis=0, with_mean=True, with_std=True, copy=True, validate=True)

Standardize a dataset along any axis.

Center to the mean and component-wise scale to unit variance.

Read more in the User Guide.

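Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) -- The data to center and scale.
axis ({0, 1}, default=0) -- Axis along which the means and standard deviations are computed. If 0, each feature is standardized independently; otherwise (if 1), each sample is standardized.
with_mean (bool, default=True) -- If True, center the data before scaling.
with_std (bool, default=True) -- If True, scale the data to unit variance (or equivalently, unit standard deviation).
copy (bool, default=True) -- If False, try to avoid a copy and scale in place. This is not guaranteed to always work in place; for example, if the data is a numpy array with an int dtype, a copy is returned even with copy=False.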

Returns:

X_tr -- The transformed data.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

See also

StandardScaler: Performs scaling to unit variance using the Transformer API (e.g. as part of a preprocessing Pipeline).

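Notes

This implementation refuses to center scipy.sparse matrices, because centering would make them non-sparse and could crash the program by exhausting memory.

The caller should either set `with_mean=False` explicitly (in which case only variance scaling is performed on the features of the CSC matrix) or call `X.toarray()` if the materialized dense array is expected to fit in memory.

To avoid a memory copy, the caller should pass a CSC matrix.

NaNs are treated as missing values: they are ignored when computing the statistics and preserved during the data transformation.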
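
The NaN behaviour described above can be illustrated with plain NumPy. This is a minimal sketch of the semantics only, not MaxFrame's implementation; the variable names are chosen for illustration.

```python
import numpy as np

# One feature column with a missing value.
x = np.array([1.0, np.nan, 3.0])

# Statistics skip NaN entries (nanmean / nanstd ignore them).
mean = np.nanmean(x)  # mean of the non-NaN values [1, 3]
std = np.nanstd(x)    # biased std (ddof=0) of the non-NaN values

# The transform is elementwise, so the NaN stays where it was.
z = (x - mean) / std
print(z)
```

The missing entry neither shifts the mean nor inflates the standard deviation, and it is still missing in the result.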

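The standard deviation is computed with a biased estimator, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.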
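
One way to see why the choice of ddof rarely matters: switching from ddof=0 to ddof=1 multiplies every standardized value by the same constant, sqrt((n - 1) / n), which approaches 1 as the number of samples n grows. The following NumPy sketch checks this on a small, made-up vector.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size

std_ddof0 = np.std(x, ddof=0)  # divides by n (the estimator described above)
std_ddof1 = np.std(x, ddof=1)  # divides by n - 1

z_ddof0 = (x - x.mean()) / std_ddof0
z_ddof1 = (x - x.mean()) / std_ddof1

# The two results differ only by the constant factor sqrt((n - 1) / n).
factor = np.sqrt((n - 1) / n)
print(np.allclose(z_ddof1, z_ddof0 * factor))
```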

For a comparison of the different scalers, transformers, and normalizers, see: sphx_glr_auto_examples_preprocessing_plot_all_scaling.py.

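Warning

Risk of data leakage

Do not use scale() unless you know what you are doing. A common mistake is to apply it to the entire dataset before splitting it into training and test sets. This biases the model evaluation, because information leaks from the test set into the training set. In general, we recommend using StandardScaler within a Pipeline to prevent most risks of data leakage: pipe = make_pipeline(StandardScaler(), LogisticRegression()).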
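
The leakage mechanism can be made concrete with a NumPy-only sketch. It assumes a simple 80/20 row split of synthetic data and does not use MaxFrame or a Pipeline; it only contrasts statistics taken from the training rows with statistics taken from all rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic data

# Split first, then derive the statistics from the training rows only.
X_train, X_test = X[:80], X[80:]
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, the biased estimator

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics

# Leaky variant for contrast: statistics computed on all rows, test included.
X_leaky = (X - X.mean(axis=0)) / X.std(axis=0)

# The test rows come out differently, because the leaky statistics saw them.
print(np.abs(X_test_scaled - X_leaky[80:]).max())
```

A transformer fitted inside a pipeline follows the first pattern automatically, which is why the warning above recommends it.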

Examples

>>> from maxframe.learn.preprocessing import scale
>>> X = [[-2, 1, 2], [-1, 0, 1]]
>>> scale(X, axis=0).execute()  # scaling each column independently
array([[-1.,  1.,  1.],
       [ 1., -1., -1.]])
>>> scale(X, axis=1).execute()  # scaling each row independently
array([[-1.37...,  0.39...,  0.98...],
       [-1.22...,  0.     ,  1.22...]])
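
The two outputs above can be reproduced with plain NumPy, which makes the roles of axis, with_mean and with_std explicit. The helper scale_sketch below is hypothetical and written for illustration only; it is not MaxFrame's implementation, handles dense input only, and ignores the copy and validate parameters.

```python
import numpy as np

def scale_sketch(X, axis=0, with_mean=True, with_std=True):
    """Hypothetical NumPy-only illustration of the documented behaviour."""
    X = np.asarray(X, dtype=float)
    if with_mean:
        # Subtract the mean taken along `axis`, ignoring NaNs.
        X = X - np.nanmean(X, axis=axis, keepdims=True)
    if with_std:
        # Biased standard deviation (ddof=0) along the same axis.
        std = np.nanstd(X, axis=axis, keepdims=True)
        std[std == 0] = 1.0  # sketch-only guard against division by zero
        X = X / std
    return X

X = [[-2, 1, 2], [-1, 0, 1]]
print(scale_sketch(X, axis=0))  # each column standardized independently
print(scale_sketch(X, axis=1))  # each row standardized independently
```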