Preprocessing
PCA (Principal Component Analysis)
- class cuvs.preprocessing.pca.Params(n_components=None, *, copy=None, whiten=None, algorithm=None, tol=None, n_iterations=None)
Parameters for PCA decomposition.
- Parameters:
- n_components: int
Number of principal components to keep (default: 1).
- copy: bool
If False, data passed to fit are overwritten and running fit(X) then transform(X) will not yield the expected results; use fit_transform(X) instead (default: True).
- whiten: bool
When True, the component vectors are multiplied by the square root of n_samples and divided by the singular values to ensure uncorrelated outputs with unit component-wise variances (default: False).
- algorithm: str
Solver algorithm. One of "cov_eig_dq" (divide-and-conquer) or "cov_eig_jacobi" (Jacobi) (default: "cov_eig_dq").
- tol: float
Tolerance for singular values, used by the Jacobi solver (default: 0.0).
- n_iterations: int
Number of iterations for the Jacobi solver (default: 15).
- Attributes:
- algorithm
- copy
- n_components
- n_iterations
- tol
- whiten
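The tol and n_iterations parameters only affect the "cov_eig_jacobi" solver. As a rough illustration of what an iterative Jacobi eigensolver does, the sketch below runs textbook cyclic Jacobi sweeps on a small symmetric (covariance-like) matrix in plain Python. It is not the cuVS implementation: the function name jacobi_eigh is hypothetical, and treating tol as a threshold on the remaining off-diagonal mass and n_iterations as a cap on the number of sweeps is an assumption made for illustration.

```python
import math


def jacobi_eigh(matrix, tol=0.0, n_iterations=15):
    """Textbook cyclic Jacobi eigensolver for a small symmetric matrix.

    Hypothetical helper for illustration only; not the cuVS solver.
    Returns (eigenvalues, eigenvectors), with eigenvectors stored as columns.
    """
    n = len(matrix)
    a = [list(row) for row in matrix]  # work on a copy
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    for _ in range(n_iterations):  # assumed meaning: maximum number of sweeps
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off <= tol:  # assumed meaning: stop once off-diagonal mass is small
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q]; |theta| <= pi/4.
                if a[q][q] == a[p][p]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # a <- a @ J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # a <- J^T @ a (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: v <- v @ J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p] = c * vkp - s * vkq
                    v[k][q] = s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


cov = [[2.0, 1.0], [1.0, 2.0]]
eigenvalues, eigenvectors = jacobi_eigh(cov)
```

For this 2x2 matrix the eigenvalues come out as approximately 1.0 and 3.0. In this sketch, a larger n_iterations or a smaller tol only matters when the matrix has not yet become numerically diagonal.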
- cuvs.preprocessing.pca.fit(Params params, X, resources=None)
Compute PCA (fit only).
Computes the principal components, explained variances, singular values, and column means from the input data.
- Parameters:
- params: Params
PCA parameters. params.copy should be True if you intend to reuse X after this call.
- X: device array-like, shape (n_samples, n_features), float32
Input data (will be converted to col-major device memory).
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- FitOutput
Named tuple with fields: components, explained_var, explained_var_ratio, singular_vals, mu, noise_vars.
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing import pca
>>> X = cp.random.random_sample((500, 32), dtype=cp.float32)
>>> params = pca.Params(n_components=8, copy=True)
>>> result = pca.fit(params, X)
>>> result.components.shape
(8, 32)
- cuvs.preprocessing.pca.fit_transform(Params params, X, resources=None)
Compute PCA and transform the input data in a single operation.
- Parameters:
- params: Params
PCA parameters.
- X: device array-like, shape (n_samples, n_features), float32
Input data (will be converted to col-major device memory).
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- FitTransformOutput
Named tuple with fields: trans_input, components, explained_var, explained_var_ratio, singular_vals, mu, noise_vars.
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing import pca
>>> X = cp.random.random_sample((500, 32), dtype=cp.float32)
>>> params = pca.Params(n_components=8)
>>> result = pca.fit_transform(params, X)
>>> result.trans_input.shape
(500, 8)
- cuvs.preprocessing.pca.transform(Params params, X, components, singular_vals, mu, trans_input=None, resources=None)
Transform data into the PCA eigenspace.
Uses previously computed principal components from fit() or fit_transform().
- Parameters:
- params: Params
PCA parameters (must match those used during fit).
- X: device array-like, shape (n_samples, n_features), float32
Data to transform.
- components: device array-like, shape (n_components, n_features)
Principal components from a prior fit.
- singular_vals: device array-like, shape (n_components,)
Singular values from a prior fit.
- mu: device array-like, shape (n_features,)
Column means from a prior fit.
- trans_input: optional device array, shape (n_samples, n_components)
Pre-allocated output buffer (col-major, float32).
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- trans_input: device array, shape (n_samples, n_components)
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing import pca
>>> X = cp.random.random_sample((500, 32), dtype=cp.float32)
>>> params = pca.Params(n_components=8, copy=True)
>>> result = pca.fit(params, X)
>>> transformed = pca.transform(params, X, result.components,
...                             result.singular_vals, result.mu)
- cuvs.preprocessing.pca.inverse_transform(Params params, trans_input, components, singular_vals, mu, output=None, resources=None)
Transform data from the PCA eigenspace back to the original space.
- Parameters:
- params: Params
PCA parameters (must match those used during fit).
- trans_input: device array-like, shape (n_samples, n_components)
Transformed data from transform() or fit_transform().
- components: device array-like, shape (n_components, n_features)
Principal components from a prior fit.
- singular_vals: device array-like, shape (n_components,)
Singular values from a prior fit.
- mu: device array-like, shape (n_features,)
Column means from a prior fit.
- output: optional device array, shape (n_samples, n_features)
Pre-allocated output buffer (col-major, float32).
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- output: device array, shape (n_samples, n_features)
Reconstructed data.
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing import pca
>>> X = cp.random.random_sample((500, 32), dtype=cp.float32)
>>> params = pca.Params(n_components=8)
>>> result = pca.fit_transform(params, X)
>>> reconstructed = pca.inverse_transform(
...     params, result.trans_input, result.components,
...     result.singular_vals, result.mu)
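Ignoring whitening, the relationship between transform() and inverse_transform() can be sketched in plain Python as a centered projection and its reverse. The helper names project and reconstruct and the hand-picked toy components are hypothetical. The sketch assumes orthonormal component rows and says nothing about how cuVS lays out or computes the data on the GPU.

```python
import math


def project(X, components, mu):
    """Center each row with mu, then take its dot product with each component."""
    return [
        [sum((x - m) * c for x, m, c in zip(row, mu, comp)) for comp in components]
        for row in X
    ]


def reconstruct(T, components, mu):
    """Map projected rows back: weighted sum of components plus the column means."""
    n_features = len(mu)
    return [
        [mu[j] + sum(t * comp[j] for t, comp in zip(row, components))
         for j in range(n_features)]
        for row in T
    ]


# Toy data lying exactly on the line y = x, so one component captures everything.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mu = [2.0, 2.0]                                       # column means of X
components = [[1 / math.sqrt(2), 1 / math.sqrt(2)]]   # shape (n_components, n_features)

T = project(X, components, mu)            # shape (n_samples, n_components)
X_back = reconstruct(T, components, mu)   # shape (n_samples, n_features)
```

When n_components is smaller than the number of features, the reconstruction is generally only an approximation. It is exact (up to rounding) here because the toy data is one-dimensional.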
Binary Quantizer
- cuvs.preprocessing.quantize.binary.transform(dataset, output=None, resources=None)
Applies binary quantization transform to the given dataset.
This applies binary quantization to a dataset, changing any positive values to a bitwise 1. This is useful for searching with the BitwiseHamming distance type.
- Parameters:
- dataset: row-major host or device dataset to transform
- output: optional preallocated output memory, on host or device memory
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- output: transformed dataset quantized into a uint8
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing.quantize import binary
>>> from cuvs.neighbors import cagra
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.standard_normal((n_samples, n_features),
...                                     dtype=cp.float32)
>>> transformed = binary.transform(dataset)
>>>
>>> # build a cagra index on the binarized data
>>> params = cagra.IndexParams(metric="bitwise_hamming",
...                            build_algo="iterative_cagra_search")
>>> idx = cagra.build(params, transformed)
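Conceptually, the transform maps each positive value to a 1 bit and everything else to a 0 bit, then stores the bits compactly so that a Hamming distance reduces to XOR plus a bit count. The plain-Python sketch below illustrates that idea on the host. The helper names are hypothetical, and the packing order (eight consecutive features per byte, first feature in the least significant bit) is an assumption for illustration, not a statement about the cuVS memory layout.

```python
def binarize(row):
    """Pack one row into bytes: bit = 1 where the value is positive."""
    packed = []
    for start in range(0, len(row), 8):
        byte = 0
        for offset, value in enumerate(row[start:start + 8]):
            if value > 0:
                byte |= 1 << offset  # assumed bit order: first feature -> lowest bit
        packed.append(byte)
    return packed


def hamming(a, b):
    """Count differing bits between two packed rows."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


u = binarize([0.5, -1.2, 3.0, 0.0, -0.1, 2.2, 0.7, -4.0, 1.5, -0.3])
v = binarize([0.4, 1.1, -3.0, 0.0, -0.2, 2.0, 0.9, -1.0, -1.5, -0.3])
distance = hamming(u, v)  # 3: the signs differ at features 1, 2 and 8
```

Ten features fit into two bytes here, so in this sketch the quantized row is about 8x smaller than a uint8-per-feature encoding and 32x smaller than float32.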
Product Quantizer
- class cuvs.preprocessing.quantize.pq.Quantizer
Defines and stores the product quantizer produced by training.
The quantization splits each vector into sub-vectors and replaces each sub-vector with the index of its nearest centroid in a trained PQ codebook.
- Attributes:
- encoded_dim: Returns the encoded dimension of the quantized dataset
- pq_bits
- pq_codebook: Returns the PQ codebook
- pq_dim
- use_vq
- vq_codebook: Returns the VQ codebook
- encoded_dim
Returns the encoded dimension of the quantized dataset
- pq_codebook
Returns the PQ codebook
- vq_codebook
Returns the VQ codebook
- class cuvs.preprocessing.quantize.pq.QuantizerParams(pq_bits=8, *, pq_dim=0, use_subspaces=True, use_vq=False, vq_n_centers=0, kmeans_n_iters=25, pq_kmeans_type='kmeans_balanced', max_train_points_per_pq_code=256, max_train_points_per_vq_cluster=1024)
Parameters for product quantization
- Parameters:
- pq_bits: int
specifies the bit length of the vector element after compression by PQ. Possible values: within [4, 16].
- pq_dim: int
specifies the dimensionality of the vector after compression by PQ
- use_subspaces: bool
specifies whether to use subspaces for product quantization (PQ). When true, one PQ codebook is used for each subspace. Otherwise, a single PQ codebook is used.
- use_vq: bool
specifies whether to use Vector Quantization (KMeans) before product quantization (PQ).
- vq_n_centers: int
specifies the number of centers for the vector quantizer. When zero, an optimal value is selected using a heuristic. When one, only product quantization is used.
- kmeans_n_iters: int
specifies the number of iterations searching for kmeans centers
- pq_kmeans_type: str
specifies the type of kmeans algorithm to use for PQ training. Possible values: "kmeans", "kmeans_balanced".
- max_train_points_per_pq_code: int
specifies the max number of data points to use per PQ code during PQ codebook training. Using more data points per PQ code may increase the quality of PQ codebook but may also increase the build time.
- max_train_points_per_vq_cluster: int
specifies the max number of data points to use per VQ cluster.
- Attributes:
- kmeans_n_iters
- max_train_points_per_pq_code
- max_train_points_per_vq_cluster
- pq_bits
- pq_dim
- pq_kmeans_type
- use_subspaces
- use_vq
- vq_n_centers
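To make pq_dim and pq_bits concrete: product quantization splits each vector into pq_dim sub-vectors and stores, for each one, the index of the nearest centroid in a codebook with 2**pq_bits entries. The plain-Python sketch below shows encoding and decoding with one codebook per subspace and no VQ step. The helper names pq_encode and pq_decode are hypothetical, and the codebooks are written by hand rather than trained with k-means as a real quantizer would do.

```python
def pq_encode(vector, codebooks):
    """Encode a vector as one centroid index per subspace."""
    pq_dim = len(codebooks)
    sub_len = len(vector) // pq_dim
    codes = []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * sub_len:(s + 1) * sub_len]
        # Squared Euclidean distance from the sub-vector to every centroid.
        dists = [sum((x - c) ** 2 for x, c in zip(sub, centroid))
                 for centroid in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def pq_decode(codes, codebooks):
    """Approximate the original vector by concatenating the chosen centroids."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out


# Hypothetical toy setup: 4 features, pq_dim = 2 subspaces, pq_bits = 1
# (2**1 = 2 centroids per codebook). Real codebooks come from training.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for features 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # codebook for features 2-3
]
vector = [0.9, 0.8, 0.1, 0.7]
codes = pq_encode(vector, codebooks)     # [1, 0]
approx = pq_decode(codes, codebooks)     # [1.0, 1.0, 0.0, 1.0]
```

Each code needs pq_bits bits, so an encoded vector carries roughly pq_dim * pq_bits bits of information. How those codes are packed into the uint8 output is not covered by this sketch.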
- cuvs.preprocessing.quantize.pq.build(QuantizerParams params, dataset, resources=None)
Builds a Product Quantizer to be used later for quantizing the dataset.
- Parameters:
- params: QuantizerParams object
- dataset: row-major dataset on host or device memory. FP32
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- quantizer: cuvs.preprocessing.quantize.pq.Quantizer
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing.quantize import pq
>>> n_samples = 5000
>>> n_features = 64
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> params = pq.QuantizerParams(pq_bits=8, pq_dim=16)
>>> quantizer = pq.build(params, dataset)
>>> transformed, _ = pq.transform(quantizer, dataset)
- cuvs.preprocessing.quantize.pq.transform(Quantizer quantizer, dataset, codes_output=None, vq_labels=None, resources=None)
Applies Product Quantization transform to the given dataset.
- Parameters:
- quantizer: trained Quantizer object
- dataset: row-major dataset on host or device memory. FP32
- codes_output: optional preallocated output memory, on device memory
- vq_labels: optional preallocated output memory for VQ labels, on device memory
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- codes_output: transformed dataset quantized into a uint8
- vq_labels: VQ labels when VQ is used, None otherwise
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing.quantize import pq
>>> n_samples = 5000
>>> n_features = 64
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> params = pq.QuantizerParams(pq_bits=8, pq_dim=16)
>>> quantizer = pq.build(params, dataset)
>>> transformed, _ = pq.transform(quantizer, dataset)
- cuvs.preprocessing.quantize.pq.inverse_transform(Quantizer quantizer, codes, output=None, vq_labels=None, resources=None)
Applies Product Quantization inverse transform to the given codes.
- Parameters:
- quantizer: trained Quantizer object
- codes: row-major device codes to inverse transform. uint8
- output: optional preallocated output memory, on device memory
- vq_labels: optional VQ labels when VQ is used, on device memory
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- output: Original dataset reconstructed from quantized codes
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing.quantize import pq
>>> n_samples = 5000
>>> n_features = 64
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> params = pq.QuantizerParams(pq_bits=8, pq_dim=16, use_vq=True)
>>> quantizer = pq.build(params, dataset)
>>> transformed, vq_labels = pq.transform(quantizer, dataset)
>>> reconstructed = pq.inverse_transform(quantizer, transformed, vq_labels=vq_labels)
Scalar Quantizer
- class cuvs.preprocessing.quantize.scalar.Quantizer
Defines and stores the scalars (min and max) used for quantization, as determined by training.
The quantization is performed by a linear mapping of an interval in the float data type to the full range of the quantized int type.
- Attributes:
- max
- min
- class cuvs.preprocessing.quantize.scalar.QuantizerParams(quantile=None, *)
Parameters for scalar quantization
- Parameters:
- quantile: float
specifies how many outliers at the top and bottom will be ignored. Needs to be within the range (0, 1].
- Attributes:
- quantile
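One plausible reading of quantile is that a quantizer trained with quantile=q keeps the central fraction q of the observed values and discards the remainder evenly from both tails before recording min and max. The sketch below implements that reading in plain Python. The symmetric split, the index rounding, and the helper name train_min_max are all assumptions for illustration and may not match the cuVS implementation exactly.

```python
def train_min_max(values, quantile):
    """Return (min, max) after trimming (1 - quantile) / 2 of the values per tail."""
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    trim = int(n * (1.0 - quantile) / 2.0)  # assumed rounding: floor
    return ordered[trim], ordered[n - 1 - trim]


values = [float(i) for i in range(100)] + [1e6, -1e6]  # two extreme outliers
min_all, max_all = train_min_max(values, 1.0)      # outliers define the range
min_trim, max_trim = train_min_max(values, 0.98)   # outliers are ignored
```

With quantile=1.0 the two outliers stretch the interval to [-1e6, 1e6], which would leave almost no integer resolution for the bulk of the data. With quantile=0.98 the interval shrinks to [0.0, 99.0] in this sketch.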
- cuvs.preprocessing.quantize.scalar.train(QuantizerParams params, dataset, resources=None)
Initializes a scalar quantizer to be used later for quantizing the dataset.
- Parameters:
- params: QuantizerParams object
- dataset: row-major host or device dataset
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- quantizer: cuvs.preprocessing.quantize.scalar.Quantizer
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing.quantize import scalar
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> params = scalar.QuantizerParams(quantile=0.99)
>>> quantizer = scalar.train(params, dataset)
>>> transformed = scalar.transform(quantizer, dataset)
- cuvs.preprocessing.quantize.scalar.transform(Quantizer quantizer, dataset, output=None, resources=None)
Applies quantization transform to the given dataset.
- Parameters:
- quantizer: trained Quantizer object
- dataset: row-major host or device dataset to transform
- output: optional preallocated output memory, on host or device memory
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- output: transformed dataset quantized into an int8
Examples
>>> import cupy as cp
>>> from cuvs.preprocessing.quantize import scalar
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> params = scalar.QuantizerParams(quantile=0.99)
>>> quantizer = scalar.train(params, dataset)
>>> transformed = scalar.transform(quantizer, dataset)
- cuvs.preprocessing.quantize.scalar.inverse_transform(Quantizer quantizer, dataset, output=None, resources=None)
Performs the inverse quantization step on a previously quantized dataset.
Note that, depending on the chosen data types and the training dataset, the conversion is not lossless.
- Parameters:
- quantizer: trained Quantizer object
- dataset: row-major host or device dataset to transform
- output: optional preallocated output memory, on host or device
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If Resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- output: transformed dataset with scalar quantization reversed
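The linear mapping behind transform() and inverse_transform() can be sketched in plain Python. Given a trained interval [min, max], values are scaled onto the int8 range and rounded, and the inverse maps the integer codes back onto the interval. The helper names, the clamping of out-of-range values, and the exact rounding and scaling are assumptions for illustration; the cuVS implementation may differ in detail.

```python
INT8_MIN, INT8_MAX = -128, 127


def quantize(x, lo, hi):
    """Map a float in [lo, hi] linearly onto the int8 range, clamping outliers."""
    scale = (INT8_MAX - INT8_MIN) / (hi - lo)
    q = round((x - lo) * scale) + INT8_MIN
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, lo, hi):
    """Map an int8 code back to a float in [lo, hi]."""
    scale = (hi - lo) / (INT8_MAX - INT8_MIN)
    return (q - INT8_MIN) * scale + lo


lo, hi = 0.0, 1.0  # stand-ins for the trained min and max
codes = [quantize(x, lo, hi) for x in (0.0, 0.25, 1.0, 2.0)]   # [-128, -64, 127, 127]
restored = [dequantize(q, lo, hi) for q in codes]
```

The round trip shows why the conversion is not lossless: 0.25 comes back as roughly 0.251 because of rounding to one of 256 levels, and the out-of-range input 2.0 comes back as roughly 1.0 because it was clamped to the trained interval.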