cugraph_pyg.loader.dask_node_loader.BulkSampleLoader

class cugraph_pyg.loader.dask_node_loader.BulkSampleLoader(feature_store: DaskGraphStore, graph_store: DaskGraphStore, input_nodes: Tensor | None | str | Tuple[str, Tensor | None] = None, batch_size: int = 0, *, shuffle: bool = False, drop_last: bool = True, edge_types: Sequence[Tuple[str]] = None, directory: str | TemporaryDirectory = None, input_files: List[str] = None, starting_batch_id: int = 0, batches_per_partition: int = 100, num_neighbors: List[int] | Dict[Tuple[str, str, str], List[int]] = None, replace: bool = True, compression: str = 'COO', **kwargs)

Iterator that executes sampling using Dask and cuGraph and loads sampled minibatches from disk.

__init__(feature_store: DaskGraphStore, graph_store: DaskGraphStore, input_nodes: Tensor | None | str | Tuple[str, Tensor | None] = None, batch_size: int = 0, *, shuffle: bool = False, drop_last: bool = True, edge_types: Sequence[Tuple[str]] = None, directory: str | TemporaryDirectory = None, input_files: List[str] = None, starting_batch_id: int = 0, batches_per_partition: int = 100, num_neighbors: List[int] | Dict[Tuple[str, str, str], List[int]] = None, replace: bool = True, compression: str = 'COO', **kwargs)

Executes a bulk sampling job immediately upon creation. Allows iteration over the returned results.
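Because sampling runs when the object is constructed, it can help to keep the settings separate from the constructor call. The following is a minimal sketch, not an official example: build_loader and LOADER_KWARGS are hypothetical names, the values are illustrative, and the feature store, graph store, and seed nodes are assumed to have been created elsewhere. Only the import path and keyword names come from the signature above.

```python
# Illustrative settings; the keyword names come from the documented signature,
# the values are arbitrary choices for this sketch.
LOADER_KWARGS = {
    "batch_size": 1024,
    "shuffle": True,
    "num_neighbors": [10, 25],
    "batches_per_partition": 100,
}


def build_loader(feature_store, graph_store, seed_nodes):
    """Construct the loader; per the docs, sampling starts at construction."""
    # Imported lazily so the settings above can be inspected on a machine
    # without cugraph_pyg installed.
    from cugraph_pyg.loader.dask_node_loader import BulkSampleLoader

    return BulkSampleLoader(
        feature_store,
        graph_store,
        input_nodes=seed_nodes,
        **LOADER_KWARGS,
    )


# Hypothetical usage, assuming the stores and seed nodes already exist:
#   for minibatch in build_loader(feature_store, graph_store, seed_nodes):
#       ...  # train on one sampled minibatch
```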

Parameters:
feature_store: DaskGraphStore

The feature store containing features for the graph.

graph_store: DaskGraphStore

The graph store containing the graph structure.

input_nodes: InputNodes

The input nodes associated with this sampler. If None, this loader will load batches from disk rather than performing sampling in memory.

batch_size: int

The number of input nodes per sampling batch. Generally required unless loading already-sampled data from disk.

shuffle: bool (optional, default=False)

Whether to shuffle the input indices. If False, batches are created in the original order.

edge_types: Sequence[Tuple[str]] (optional, default=None)

The desired edge types for the subgraph. Defaults to all edges in the graph.

directory: str (optional, default=new tempdir)

The path of the directory to write samples to. Defaults to a newly generated temporary directory.

input_files: List[str] (optional, default=None)

The input files to read from the directory containing samples. This argument is only used when loading already-sampled batches from disk.

starting_batch_id: int (optional, default=0)

The ID of the first batch. Defaults to 0.

batches_per_partition: int (optional, default=100)

The number of batches in each output partition. Defaults to 100. Passed to the bulk sampler if there is one; otherwise, used to determine which files to read.

num_neighbors: Union[List[int], Dict[Tuple[str, str, str], List[int]]] (required)

The number of neighbors to sample for each node in each iteration. If an entry is set to -1, all neighbors will be included. In heterogeneous graphs, may also take in a dictionary denoting the number of neighbors to sample for each individual edge type.

Note: in cuGraph, only one value of num_neighbors is currently supported. Passing in a dictionary will result in an exception.
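The num_neighbors semantics can be pictured with a toy sampler. This is a sketch under stated assumptions, not cuGraph's implementation: it reads "iteration" as one hop, runs on a plain Python adjacency dict, and samples without replacement for simplicity (the loader's own replace argument is not modeled). The function name sample_neighbors is hypothetical.

```python
import random


def sample_neighbors(adjacency, seeds, num_neighbors, rng):
    """Toy multi-hop neighbor sampling over an adjacency dict."""
    if isinstance(num_neighbors, dict):
        # Mirrors the documented limitation: a per-edge-type dictionary
        # is rejected with an exception.
        raise ValueError("per-edge-type num_neighbors is not supported")

    frontier = list(seeds)
    sampled_edges = []
    for fanout in num_neighbors:  # one list entry per hop
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            if fanout == -1 or fanout >= len(neighbors):
                chosen = list(neighbors)  # -1 keeps every neighbor
            else:
                chosen = rng.sample(neighbors, fanout)
            sampled_edges.extend((node, nbr) for nbr in chosen)
            next_frontier.extend(chosen)
        frontier = next_frontier
    return sampled_edges


adjacency = {0: [1, 2, 3], 1: [4], 2: [4, 5], 3: []}
# Hop 1 keeps all neighbors of node 0; hop 2 keeps at most one per node.
edges = sample_neighbors(adjacency, [0], [-1, 1], random.Random(0))
```

With these inputs the first hop always yields the three edges out of node 0, and the second hop adds one edge each from nodes 1 and 2.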

Methods

__init__(feature_store, graph_store[, ...])

Executes a bulk sampling job immediately upon creation.
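The interaction of batch_size, starting_batch_id, and batches_per_partition documented above can be sketched as simple arithmetic. This is an illustration only: plan_batches is a hypothetical helper, the assumption that batch IDs are consecutive from starting_batch_id is not confirmed by this page, and drop_last is assumed to discard a trailing partial batch as in PyTorch's DataLoader.

```python
import math


def plan_batches(num_input_nodes, batch_size, starting_batch_id=0,
                 batches_per_partition=100, drop_last=True):
    """Return an inclusive (first_batch_id, last_batch_id) pair per partition."""
    if drop_last:
        num_batches = num_input_nodes // batch_size
    else:
        num_batches = math.ceil(num_input_nodes / batch_size)

    partitions = []
    for offset in range(0, num_batches, batches_per_partition):
        first = starting_batch_id + offset
        last = starting_batch_id + min(offset + batches_per_partition, num_batches) - 1
        partitions.append((first, last))
    return partitions


# 250,000 seeds in batches of 1,000 give 250 batches across three partitions.
print(plan_batches(250_000, 1000))  # [(0, 99), (100, 199), (200, 249)]
```

Under these assumptions, a reader that knows batches_per_partition can work out which partition holds a given batch ID without scanning every file.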