cugraph.dask.sampling.uniform_neighbor_sample.uniform_neighbor_sample
- cugraph.dask.sampling.uniform_neighbor_sample.uniform_neighbor_sample(input_graph: Graph, start_list: Sequence, fanout_vals: List[int], *, with_replacement: bool = True, with_batch_ids: bool = False, keep_batches_together=False, min_batch_id=None, max_batch_id=None, random_state: int = None, return_offsets: bool = False, return_hops: bool = True, prior_sources_behavior: str = None, deduplicate_sources: bool = False, renumber: bool = False, compress_per_hop=False, compression='COO', _multiple_clients: bool = False) -> dask_cudf.DataFrame | Tuple[dask_cudf.DataFrame, dask_cudf.DataFrame]
Performs neighborhood sampling: starting from the given vertices, it samples neighbors of each current node, with the number of neighbors sampled per node at each hop given by the corresponding fanout value.
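To make the hop-by-hop behavior concrete, here is a minimal single-process sketch in plain Python. It is not the cuGraph implementation: the `sample_neighbors` helper, the `adj` dictionary, and the `(source, destination, hop_id)` tuple layout are all hypothetical names chosen for illustration, and the real function runs distributed on GPUs.

```python
import random


def sample_neighbors(adj, start_list, fanout_vals, with_replacement=True, seed=None):
    """Toy, single-process sketch of fanout-based neighbor sampling.

    adj is a hypothetical dict mapping each vertex to a list of its neighbors.
    Returns a list of (source, destination, hop_id) tuples.
    """
    rng = random.Random(seed)
    frontier = list(start_list)
    edges = []
    for hop_id, fanout in enumerate(fanout_vals):
        next_frontier = []
        for src in frontier:
            neighbors = adj.get(src, [])
            if not neighbors:
                continue  # nothing to sample from this vertex
            if with_replacement:
                # The same neighbor may be drawn more than once.
                picked = [rng.choice(neighbors) for _ in range(fanout)]
            else:
                # At most `fanout` distinct neighbors.
                picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            for dst in picked:
                edges.append((src, dst, hop_id))
                next_frontier.append(dst)
        # Destinations of this hop become the sources of the next hop.
        frontier = next_frontier
    return edges


adj = {0: [1, 2, 3], 1: [4], 2: [4, 5], 3: [], 4: [0], 5: []}
edges = sample_neighbors(adj, [0], [2, 1], with_replacement=False, seed=42)
for src, dst, hop_id in edges:
    print(src, dst, hop_id)
```

With `fanout_vals=[2, 1]`, the sketch draws two neighbors of vertex 0 at hop 0 and then at most one neighbor of each of those vertices at hop 1.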
- Parameters:
- input_graph: cugraph.Graph
cuGraph graph, which contains connectivity information as a dask_cudf edge list dataframe
- start_list: int, list, cudf.Series, or dask_cudf.Series (int32 or int64)
A list of starting vertices for sampling
- fanout_vals: list
List of branching out (fan-out) degrees per starting vertex, one for each hop level.
- with_replacement: bool, optional (default=True)
Flag to specify if the random sampling is done with replacement
- with_batch_ids: bool, optional (default=False)
Flag to specify whether batch ids are present in the start_list
- keep_batches_together: bool, optional (default=False)
If True, will ensure that the returned samples for each batch are on the same partition.
- min_batch_id: int, optional (default=None)
The minimum batch id. Required when keep_batches_together is True.
- max_batch_id: int, optional (default=None)
The maximum batch id. Required when keep_batches_together is True.
- random_state: int, optional (default=None)
Random seed to use when making sampling calls.
- return_offsets: bool, optional (default=False)
If False, returns the sampling results with batch ids included as one dataframe. If True, returns two dataframes instead: one with the sampling results and one with the batch ids and their start offsets per rank.
- return_hops: bool, optional (default=True)
Whether to return the sampling results with hop ids corresponding to the hop where the edge appeared.
- prior_sources_behavior: str, optional (default=None)
Options are “carryover” and “exclude”. The default (None) leaves the source list as-is. “carryover” carries over sources from previous hops to the current hop. “exclude” prevents sources from previous hops from reappearing as sources in future hops.
- deduplicate_sources: bool, optional (default=False)
Whether to first deduplicate the list of possible sources from the previous destinations before performing the next hop.
- renumber: bool, optional (default=False)
Whether to renumber on a per-batch basis. If True, will return the renumber map and renumber map offsets as an additional dataframe.
- compress_per_hop: bool, optional (default=False)
Whether to compress globally (default), or to produce a separate compressed edgelist per hop.
- compression: str, optional (default='COO')
Sets the compression type for the output minibatches. Valid options are COO (default), CSR, CSC, DCSR, and DCSC.
- _multiple_clients: bool, optional (default=False)
Internal flag to ensure sampling works with multiple Dask clients. Set to True to prevent hangs in a multi-client environment.
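The prior_sources_behavior and deduplicate_sources descriptions above can be read as rules for building the source list of the next hop. The sketch below is one plain-Python interpretation of those descriptions, not the library's code: `next_hop_sources` and its arguments are hypothetical, and the ordering of the result is an assumption made only so the output is easy to inspect.

```python
def next_hop_sources(destinations, sources_so_far,
                     prior_sources_behavior=None, deduplicate_sources=False):
    """Hypothetical helper: build the source list for the next hop.

    destinations: vertices reached in the hop that just finished.
    sources_so_far: vertices already used as sources in earlier hops.
    """
    candidates = list(destinations)
    if prior_sources_behavior == "carryover":
        # Earlier sources are sampled from again in the next hop.
        candidates = list(sources_so_far) + candidates
    elif prior_sources_behavior == "exclude":
        # Earlier sources may not reappear as sources.
        seen = set(sources_so_far)
        candidates = [v for v in candidates if v not in seen]
    elif prior_sources_behavior is not None:
        raise ValueError("expected 'carryover', 'exclude', or None")
    if deduplicate_sources:
        # Drop repeats while keeping first-seen order (assumed ordering).
        candidates = list(dict.fromkeys(candidates))
    return candidates


destinations = [1, 2, 1, 0]
sources_so_far = [0]
as_is = next_hop_sources(destinations, sources_so_far)
carried = next_hop_sources(destinations, sources_so_far, "carryover")
excluded = next_hop_sources(destinations, sources_so_far, "exclude")
carried_unique = next_hop_sources(destinations, sources_so_far, "carryover", True)
print(as_is, carried, excluded, carried_unique)
```

In this toy case the default keeps the destinations untouched, "carryover" puts vertex 0 back in front of them, "exclude" removes vertex 0, and deduplication collapses the repeated entries.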
- Returns:
- result: dask_cudf.DataFrame or Tuple[dask_cudf.DataFrame, dask_cudf.DataFrame]
GPU-distributed dataframe containing several dask_cudf.Series
- If return_offsets=False:
- df[‘majors’]: dask_cudf.Series
Contains the source vertices from the sampling result
- df[‘minors’]: dask_cudf.Series
Contains the destination vertices from the sampling result
- df[‘weight’]: dask_cudf.Series
Contains the edge weights from the sampling result
- df[‘edge_id’]: dask_cudf.Series
Contains the edge ids from the sampling result
- df[‘edge_type’]: dask_cudf.Series
Contains the edge types from the sampling result
- df[‘batch_id’]: dask_cudf.Series
Contains the batch ids from the sampling result
- df[‘hop_id’]: dask_cudf.Series
Contains the hop ids from the sampling result
- If renumber=True (adds the following dataframe):
- renumber_df[‘map’]: dask_cudf.Series
Contains the renumber maps for each batch
- renumber_df[‘offsets’]: dask_cudf.Series
Contains the batch offsets for the renumber maps
- If return_offsets=True:
- df[‘majors’]: dask_cudf.Series
Contains the source vertices from the sampling result
- df[‘minors’]: dask_cudf.Series
Contains the destination vertices from the sampling result
- df[‘weight’]: dask_cudf.Series
Contains the edge weights from the sampling result
- df[‘edge_id’]: dask_cudf.Series
Contains the edge ids from the sampling result
- df[‘edge_type’]: dask_cudf.Series
Contains the edge types from the sampling result
- df[‘hop_id’]: dask_cudf.Series
Contains the hop ids from the sampling result
- offsets_df[‘batch_id’]: dask_cudf.Series
Contains the batch ids from the sampling result
- offsets_df[‘offsets’]: dask_cudf.Series
Contains the offsets of each batch in the sampling result
- If renumber=True (adds the following dataframe):
- renumber_df[‘map’]: dask_cudf.Series
Contains the renumber maps for each batch
- renumber_df[‘offsets’]: dask_cudf.Series
Contains the batch offsets for the renumber maps
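As a rough illustration of how the offsets and renumber map columns might be consumed, the sketch below works on plain Python lists standing in for data already gathered from a single partition. It rests on two assumptions that the text above does not spell out: that each offset is the start position of its batch, with the last batch running to the end of the data, and that a renumbered vertex id is a position within its batch's slice of the map. The helper `split_by_batch` and all sample values are hypothetical.

```python
def split_by_batch(rows, batch_ids, start_offsets):
    """Hypothetical helper: slice rows into per-batch chunks.

    Assumes start_offsets[i] is where batch_ids[i] begins and that the
    last batch extends to the end of rows.
    """
    bounds = list(start_offsets) + [len(rows)]
    return {
        batch_id: rows[bounds[i]:bounds[i + 1]]
        for i, batch_id in enumerate(batch_ids)
    }


# Stand-ins for (majors, minors) pairs, renumbered per batch.
sampled_edges = [(0, 1), (0, 2), (1, 2), (0, 1), (1, 0)]
batch_ids = [0, 1]
offsets = [0, 3]              # stand-in for offsets_df['offsets']
renumber_map = [10, 11, 12, 20, 21]  # stand-in for renumber_df['map']
renumber_offsets = [0, 3]     # stand-in for renumber_df['offsets']

edges_per_batch = split_by_batch(sampled_edges, batch_ids, offsets)
maps_per_batch = split_by_batch(renumber_map, batch_ids, renumber_offsets)

# Assumed convention: local id i in batch b refers to maps_per_batch[b][i].
original_edges = {
    b: [(maps_per_batch[b][src], maps_per_batch[b][dst]) for src, dst in batch_edges]
    for b, batch_edges in edges_per_batch.items()
}
print(original_edges)
```

If the actual offsets include a trailing end marker, the `bounds` construction would need adjusting, so check the returned data before relying on this layout.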