cuGraph API Reference¶
Structure¶
Graph¶
- class cugraph.structure.graph_classes.Graph(m_graph=None, directed=False)[source]¶
Methods
add_internal_vertex_id
(df, …[, drop, …])Given a DataFrame containing external vertex ids in the identified columns, return a DataFrame containing the internal vertex ids as the specified column name.
add_nodes_from
(nodes)Add nodes information to the Graph.
clear
()Empty the graph.
from_cudf_adjlist
(offset_col, index_col[, …])Initialize a graph from the adjacency list.
from_cudf_edgelist
(input_df[, source, …])Initialize a graph from the edge list.
from_dask_cudf_edgelist
(input_ddf[, source, …])Initializes the distributed graph from the dask_cudf.DataFrame edgelist.
from_numpy_array
(np_array[, nodes])Initializes the graph from numpy array containing adjacency matrix.
from_numpy_matrix
(np_matrix)Initializes the graph from numpy matrix containing adjacency matrix.
Initializes the graph from pandas adjacency matrix
from_pandas_edgelist
(pdf[, source, …])Initialize a graph from the edge list.
Returns True if the graph has isolated vertices.
Checks if Graph is bipartite.
Returns True if the graph is a directed graph.
Returns True if the graph is a multigraph.
Checks if Graph is multipartite.
Returns True if the graph is renumbered.
Returns True if the graph has edge weights.
lookup_internal_vertex_id
(df[, column_name])Given a DataFrame containing external vertex ids in the identified columns, or a Series containing external vertex ids, return a Series with the internal vertex ids.
Return a directed representation of the graph.
Return an undirected copy of the graph.
unrenumber
(df, column_name[, …])Given a DataFrame containing internal vertex ids in the identified column, replace this with external vertex ids.
Properties
- add_internal_vertex_id(df, internal_column_name, external_column_name, drop=True, preserve_order=False)[source]¶
Given a DataFrame containing external vertex ids in the identified columns, return a DataFrame containing the internal vertex ids as the specified column name. Optionally drop the external vertex id columns. Optionally preserve the order of the original DataFrame. Parameters ———- df: cudf.DataFrame or dask_cudf.DataFrame A DataFrame containing external vertex identifiers that will be converted into internal vertex identifiers.
internal_column_name: string Name of column to contain the internal vertex id
external_column_name: string or list of strings Name of the column(s) containing the external vertex ids
drop: (optional) bool, defaults to True Drop the external columns from the returned DataFrame
preserve_order: (optional) bool, defaults to False Preserve the order of the data frame (requires an extra sort)
- Returns
- dfcudf.DataFrame or dask_cudf.DataFrame
Original DataFrame with new column containing internal vertex id
- add_nodes_from(nodes)[source]¶
Add nodes information to the Graph. Parameters ———- nodes : list or cudf.Series The nodes of the graph to be stored.
- from_cudf_adjlist(offset_col, index_col, value_col=None)[source]¶
Initialize a graph from the adjacency list. It is an error to call this method on an initialized Graph object. The passed offset_col and index_col arguments wrap gdf_column objects that represent a graph using the adjacency list format. If value_col is None, an unweighted graph is created. If value_col is not None, a weighted graph is created. Undirected edges must be stored as directed edges in both directions. Parameters ———- offset_col : cudf.Series This cudf.Series wraps a gdf_column of size V + 1 (V: number of vertices). The gdf column contains the offsets for the vertices in this graph. Offsets must be in the range [0, E] (E: number of edges).
index_col : cudf.Series This cudf.Series wraps a gdf_column of size E (E: number of edges). The gdf column contains the destination index for each edge. Destination indices must be in the range [0, V) (V: number of vertices).
value_col : cudf.Series, optional This pointer can be
None
. If not, this cudf.Series wraps a gdf_column of size E (E: number of edges). The gdf column contains the weight value for each edge. The expected type of the gdf_column element is floating point number.Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> M = gdf.to_pandas() >>> M = scipy.sparse.coo_matrix((M['2'],(M['0'],M['1']))) >>> M = M.tocsr() >>> offsets = cudf.Series(M.indptr) >>> indices = cudf.Series(M.indices) >>> G = cugraph.Graph() >>> G.from_cudf_adjlist(offsets, indices, None)
- from_cudf_edgelist(input_df, source='source', destination='destination', edge_attr=None, renumber=True)[source]¶
Initialize a graph from the edge list. It is an error to call this method on an initialized Graph object. The passed input_df argument wraps gdf_column objects that represent a graph using the edge list format. source argument is source column name and destination argument is destination column name. By default, renumbering is enabled to map the source and destination vertices into an index in the range [0, V) where V is the number of vertices. If the input vertices are a single column of integers in the range [0, V), renumbering can be disabled and the original external vertex ids will be used. If weights are present, edge_attr argument is the weights column name.
- Parameters
- input_dfcudf.DataFrame or dask_cudf.DataFrame
- A DataFrame that contains edge information If a dask_cudf.DataFrame is
- passed it will be reinterpreted as a cudf.DataFrame. For the
- distributed path please use from_dask_cudf_edgelist.
- sourcestr or array-like
- source column name or array of column names
- destinationstr or array-like
- destination column name or array of column names
- edge_attrstr or None
- the weights column name. Default is None
- renumberbool
- Indicate whether or not to renumber the source and destination vertex
- IDs. Default is True.
Examples
>>> df = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(df, source='0', destination='1', edge_attr='2', renumber=False)
- from_dask_cudf_edgelist(input_ddf, source='source', destination='destination', edge_attr=None, renumber=True)[source]¶
Initializes the distributed graph from the dask_cudf.DataFrame edgelist. Undirected Graphs are not currently supported. By default, renumbering is enabled to map the source and destination vertices into an index in the range [0, V) where V is the number of vertices. If the input vertices are a single column of integers in the range [0, V), renumbering can be disabled and the original external vertex ids will be used. Note that the graph object will store a reference to the dask_cudf.DataFrame provided.
- Parameters
- input_ddfdask_cudf.DataFrame
- The edgelist as a dask_cudf.DataFrame
- sourcestr or array-like
- source column name or array of column names
- destinationstr
- destination column name or array of column names
- edge_attrstr
- weights column name.
- renumberbool
- If source and destination indices are not in range 0 to V where V is
- number of vertices, renumber argument should be True.
- from_numpy_array(np_array, nodes=None)[source]¶
Initializes the graph from numpy array containing adjacency matrix.
- from_numpy_matrix(np_matrix)[source]¶
Initializes the graph from numpy matrix containing adjacency matrix.
- from_pandas_edgelist(pdf, source='source', destination='destination', edge_attr=None, renumber=True)[source]¶
Initialize a graph from the edge list. It is an error to call this method on an initialized Graph object. Source argument is source column name and destination argument is destination column name. By default, renumbering is enabled to map the source and destination vertices into an index in the range [0, V) where V is the number of vertices. If the input vertices are a single column of integers in the range [0, V), renumbering can be disabled and the original external vertex ids will be used. If weights are present, edge_attr argument is the weights column name.
- Parameters
- input_dfpandas.DataFrame
- A DataFrame that contains edge information
- sourcestr or array-like
- source column name or array of column names
- destinationstr or array-like
- destination column name or array of column names
- edge_attrstr or None
- the weights column name. Default is None
- renumberbool
- Indicate whether or not to renumber the source and destination vertex
- IDs. Default is True.
Examples
>>> df = pandas.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_pandas_edgelist(df, source='0', destination='1', edge_attr='2', renumber=False)
- is_bipartite()[source]¶
Checks if Graph is bipartite. This solely relies on the user call of add_nodes_from with the bipartite parameter. This does not parse the graph to check if it is bipartite.
- is_directed()[source]¶
Returns True if the graph is a directed graph. Returns False if the graph is an undirected graph.
- is_multipartite()[source]¶
Checks if Graph is multipartite. This solely relies on the user call of add_nodes_from with the partition parameter. This does not parse the graph to check if it is multipartite.
- lookup_internal_vertex_id(df, column_name=None)[source]¶
Given a DataFrame containing external vertex ids in the identified columns, or a Series containing external vertex ids, return a Series with the internal vertex ids. Note that this function does not guarantee order in single GPU mode, and does not guarantee order or partitioning in multi-GPU mode.
- Parameters
- df: cudf.DataFrame, cudf.Series, dask_cudf.DataFrame, dask_cudf.Series
- A DataFrame containing external vertex identifiers that will be
- converted into internal vertex identifiers.
- column_name: (optional) string
- Name of the column containing the external vertex ids
- Returns
- seriescudf.Series or dask_cudf.Series
The internal vertex identifiers
- to_directed()[source]¶
Return a directed representation of the graph. This function sets the type of graph as DiGraph() and returns the directed view.
- Returns
- GDiGraph
- A directed graph with the same nodes, and each edge (u,v,weights)
- replaced by two directed edges (u,v,weights) and (v,u,weights).
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, '0', '1') >>> DiG = G.to_directed()
- to_undirected()[source]¶
Return an undirected copy of the graph.
- Returns
- GGraph
- A undirected graph with the same nodes, and each directed edge
- (u,v,weights) replaced by an undirected edge (u,v,weights).
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> DiG = cugraph.DiGraph() >>> DiG.from_cudf_edgelist(M, '0', '1') >>> G = DiG.to_undirected()
- unrenumber(df, column_name, preserve_order=False, get_column_names=False)[source]¶
Given a DataFrame containing internal vertex ids in the identified column, replace this with external vertex ids. If the renumbering is from a single column, the output dataframe will use the same name for the external vertex identifiers. If the renumbering is from a multi-column input, the output columns will be labeled 0 through n-1 with a suffix of _column_name. Note that this function does not guarantee order in single GPU mode, and does not guarantee order or partitioning in multi-GPU mode. If you wish to preserve ordering, add an index column to df and sort the return by that index column.
- Parameters
- df: cudf.DataFrame or dask_cudf.DataFrame
- A DataFrame containing internal vertex identifiers that will be
- converted into external vertex identifiers.
- column_name: string
- Name of the column containing the internal vertex id.
- preserve_order: (optional) bool
- If True, preserve the order of the rows in the output DataFrame to
- match the input DataFrame
- Returns
- dfcudf.DataFrame or dask_cudf.DataFrame
- The original DataFrame columns exist unmodified. The external vertex
- identifiers are added to the DataFrame, the internal vertex identifier
- column is removed from the dataframe.
Symmetrize¶
- cugraph.structure.symmetrize.symmetrize(source_col, dest_col, value_col=None, multi=False, symmetrize=True)[source]¶
Take a COO set of source destination pairs along with associated values stored in a single GPU or distributed create a new COO set of source destination pairs along with values where all edges exist in both directions.
Return from this call will be a COO stored as two cudf Series or dask_cudf.Series -the symmetrized source column and the symmetrized dest column, along with an optional cudf Series containing the associated values (only if the values are passed in).
- Parameters
- source_colcudf.Series or dask_cudf.Series
This cudf.Series wraps a gdf_column of size E (E: number of edges). The gdf column contains the source index for each edge. Source indices must be an integer type.
- dest_colcudf.Series or dask_cudf.Series
This cudf.Series wraps a gdf_column of size E (E: number of edges). The gdf column contains the destination index for each edge. Destination indices must be an integer type.
- value_colcudf.Series or dask_cudf.Series (optional)
This cudf.Series wraps a gdf_column of size E (E: number of edges). The gdf column contains values associated with this edge. For this function the values can be any type, they are not examined, just copied.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> sources = cudf.Series(M['0']) >>> destinations = cudf.Series(M['1']) >>> values = cudf.Series(M['2']) >>> src, dst, val = cugraph.symmetrize(sources, destinations, values)
- cugraph.structure.symmetrize.symmetrize_ddf(df, src_name, dst_name, weight_name=None)[source]¶
Take a COO stored in a distributed DataFrame, and the column names of the source and destination columns and create a new data frame using the same column names that symmetrize the graph so that all edges appear in both directions.
Note that if other columns exist in the data frame (e.g. edge weights) the other columns will also be replicated. That is, if (u,v,data) represents the source value (u), destination value (v) and some set of other columns (data) in the input data, then the output data will contain both (u,v,data) and (v,u,data) with matching data.
If (u,v,data1) and (v,u,data2) exist in the input data where data1 != data2 then this code will arbitrarily pick the smaller data element to keep, if this is not desired then the caller should should correct the data prior to calling symmetrize.
- Parameters
- dfdask_cudf.DataFrame
Input data frame containing COO. Columns should contain source ids, destination ids and any properties associated with the edges.
- src_namestring
Name of the column in the data frame containing the source ids
- dst_namestring
Name of the column in the data frame containing the destination ids
- multibool
Set to True if graph is a Multi(Di)Graph. This allows multiple edges instead of dropping them.
- symmetrizebool
Default is True to perform symmetrization. If False only duplicate edges are dropped.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> sym_df = cugraph.symmetrize(M, '0', '1')
- cugraph.structure.symmetrize.symmetrize_df(df, src_name, dst_name, multi=False, symmetrize=True)[source]¶
Take a COO stored in a DataFrame, along with the column names of the source and destination columns and create a new data frame using the same column names that symmetrize the graph so that all edges appear in both directions. Note that if other columns exist in the data frame (e.g. edge weights) the other columns will also be replicated. That is, if (u,v,data) represents the source value (u), destination value (v) and some set of other columns (data) in the input data, then the output data will contain both (u,v,data) and (v,u,data) with matching data. If (u,v,data1) and (v,u,data2) exist in the input data where data1 != data2 then this code will arbitrarily pick the smaller data element to keep, if this is not desired then the caller should should correct the data prior to calling symmetrize.
- Parameters
- dfcudf.DataFrame
Input data frame containing COO. Columns should contain source ids, destination ids and any properties associated with the edges.
- src_namestring
Name of the column in the data frame containing the source ids
- dst_namestring
Name of the column in the data frame containing the destination ids
- multibool
Set to True if graph is a Multi(Di)Graph. This allows multiple edges instead of dropping them.
- symmetrizebool
Default is True to perform symmetrization. If False only duplicate edges are dropped.
Examples
>>> import cugraph.dask as dcg >>> Comms.initialize() >>> chunksize = dcg.get_chunksize(input_data_path) >>> ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize, delimiter=' ', names=['src', 'dst', 'weight'], dtype=['int32', 'int32', 'float32']) >>> sym_ddf = cugraph.symmetrize_ddf(ddf, "src", "dst", "weight") >>> Comms.destroy()
Conversion from Other Formats¶
- cugraph.structure.convert_matrix.from_adjlist(offsets, indices, values=None, create_using=<class 'cugraph.structure.graph_classes.Graph'>)[source]¶
Initializes the graph from cuDF or Pandas Series representing adjacency matrix CSR data and returns a new cugraph.Graph object if ‘create_using’ is set to cugraph.Graph (the default), or cugraph.DiGraph if ‘create_using’ is set to cugraph.DiGraph.
- Parameters
- offsetscudf.Series, pandas.Series
The offsets of a CSR adjacency matrix.
- indicescudf.Series, pandas.Series
The indices of a CSR adjacency matrix.
- valuescudf.Series, pandas.Series, or None (default), optional
The values in a CSR adjacency matrix, which represent edge weights in a graph. If not provided, the resulting graph is considered unweighted.
- create_usingcuGraph.Graph
Specify the type of Graph to create. Default is cugraph.Graph
Examples
>>> pdf = pd.read_csv('datasets/karate.csv', delimiter=' ', ... dtype={0:'int32', 1:'int32', 2:'float32'}, ... header=None) >>> M = scipy.sparse.coo_matrix((pdf[2],(pdf[0],pdf[1]))) >>> M = M.tocsr() >>> offsets = pd.Series(M.indptr) >>> indices = pd.Series(M.indices) >>> G = cugraph.from_adjlist(offsets, indices, None)
- cugraph.structure.convert_matrix.from_cudf_edgelist(df, source='source', destination='destination', edge_attr=None, create_using=<class 'cugraph.structure.graph_classes.Graph'>, renumber=True)[source]¶
Return a new graph created from the edge list representaion. This function is added for NetworkX compatibility (this function is a RAPIDS version of NetworkX’s from_pandas_edge_list()). This function does not support multiple source or destination columns. But does support renumbering
- Parameters
- dfcudf.DataFrame
This cudf.DataFrame contains columns storing edge source vertices, destination (or target following NetworkX’s terminology) vertices, and (optional) weights.
- sourcestring or integer
This is used to index the source column.
- destinationstring or integer
This is used to index the destination (or target following NetworkX’s terminology) column.
- edge_attrstring or integer, optional
This pointer can be
None
. If not, this is used to index the weight column.- create_usingcuGraph.Graph
Specify the type of Graph to create. Default is cugraph.Graph
- renumberbool
If source and destination indices are not in range 0 to V where V is number of vertices, renumber argument should be True.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G = cugraph.from_cudf_edgelist(M, source='0', target='1', weight='2')
- cugraph.structure.convert_matrix.from_edgelist(df, source='source', destination='destination', edge_attr=None, create_using=<class 'cugraph.structure.graph_classes.Graph'>, renumber=True)[source]¶
Return a new graph created from the edge list representaion.
- Parameters
- dfcudf.DataFrame, pandas.DataFrame, dask_cudf.core.DataFrame
This DataFrame contains columns storing edge source vertices, destination (or target following NetworkX’s terminology) vertices, and (optional) weights.
- sourcestring or integer
This is used to index the source column.
- destinationstring or integer
This is used to index the destination (or target following NetworkX’s terminology) column.
- edge_attrstring or integer, optional
This pointer can be
None
. If not, this is used to index the weight column.- create_usingcuGraph.Graph
Specify the type of Graph to create. Default is cugraph.Graph
- renumberbool
If source and destination indices are not in range 0 to V where V is number of vertices, renumber argument should be True.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G = cugraph.from_edgelist(M, source='0', destination='1', edge_attr='2')
- cugraph.structure.convert_matrix.from_numpy_array(A, create_using=<class 'cugraph.structure.graph_classes.Graph'>)[source]¶
Initializes the graph from numpy array containing adjacency matrix. Set create_using to cugraph.DiGraph for directed graph and cugraph.Graph for undirected Graph.
- cugraph.structure.convert_matrix.from_numpy_matrix(A, create_using=<class 'cugraph.structure.graph_classes.Graph'>)[source]¶
Initializes the graph from numpy matrix containing adjacency matrix. Set create_using to cugraph.DiGraph for directed graph and cugraph.Graph for undirected Graph.
- cugraph.structure.convert_matrix.from_pandas_adjacency(df, create_using=<class 'cugraph.structure.graph_classes.Graph'>)[source]¶
Initializes the graph from pandas adjacency matrix. Set create_using to cugraph.DiGraph for directed graph and cugraph.Graph for undirected Graph.
- cugraph.structure.convert_matrix.from_pandas_edgelist(df, source='source', destination='destination', edge_attr=None, create_using=<class 'cugraph.structure.graph_classes.Graph'>, renumber=True)[source]¶
Initialize a graph from the edge list. It is an error to call this method on an initialized Graph object. Source argument is source column name and destination argument is destination column name.
By default, renumbering is enabled to map the source and destination vertices into an index in the range [0, V) where V is the number of vertices. If the input vertices are a single column of integers in the range [0, V), renumbering can be disabled and the original external vertex ids will be used.
If weights are present, edge_attr argument is the weights column name.
- Parameters
- input_dfpandas.DataFrame
A DataFrame that contains edge information
- sourcestr or array-like
source column name or array of column names
- destinationstr or array-like
destination column name or array of column names
- edge_attrstr or None
the weights column name. Default is None
- renumberbool
Indicate whether or not to renumber the source and destination vertex IDs. Default is True.
- create_using: cugraph.DiGraph or cugraph.Graph
Indicate whether to create a directed or undirected graph
- Returns
- Gcugraph.DiGraph or cugraph.Graph
graph containing edges from the pandas edgelist
Examples
>>> df = pandas.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_pandas_edgelist(df, source='0', destination='1', edge_attr='2', renumber=False)
- cugraph.structure.convert_matrix.to_numpy_array(G)[source]¶
Returns the graph adjacency matrix as a NumPy array.
- cugraph.structure.convert_matrix.to_numpy_matrix(G)[source]¶
Returns the graph adjacency matrix as a NumPy matrix.
- cugraph.structure.convert_matrix.to_pandas_adjacency(G)[source]¶
Returns the graph adjacency matrix as a Pandas DataFrame. The row indices denote source and column names denote destination.
- cugraph.structure.convert_matrix.to_pandas_edgelist(G, source='source', destination='destination')[source]¶
Returns the graph edge list as a Pandas DataFrame.
- Parameters
- Gcugraph.Graph or cugraph.DiGraph
Graph containg the edgelist.
- sourcestr or array-like
source column name or array of column names
- destinationstr or array-like
destination column name or array of column names
- Returns
- ——
- dfpandas.DataFrame
pandas dataframe containing the edgelist as source and destination columns.
Centrality¶
Betweenness Centrality¶
- cugraph.centrality.betweenness_centrality.betweenness_centrality(G, k=None, normalized=True, weight=None, endpoints=False, seed=None, result_dtype=<class 'numpy.float64'>)[source]¶
Compute the betweenness centrality for all vertices of the graph G. Betweenness centrality is a measure of the number of shortest paths that pass through a vertex. A vertex with a high betweenness centrality score has more paths passing through it and is therefore believed to be more important.
To improve performance. rather than doing an all-pair shortest path, a sample of k starting vertices can be used.
CuGraph does not currently support the ‘endpoints’ and ‘weight’ parameters as seen in the corresponding networkX call.
- Parameters
- GcuGraph.Graph or networkx.Graph
The graph can be either directed (DiGraph) or undirected (Graph). Weights in the graph are ignored, the current implementation uses BFS traversals. Use weight parameter if weights need to be considered (currently not supported)
- kint or list or None, optional, default=None
If k is not None, use k node samples to estimate betweenness. Higher values give better approximation. If k is a list, use the content of the list for estimation: the list should contain vertex identifiers. If k is None (the default), all the vertices are used to estimate betweenness. Vertices obtained through sampling or defined as a list will be used assources for traversals inside the algorithm.
- normalizedbool, optional
Default is True. If true, the betweenness values are normalized by __2 / ((n - 1) * (n - 2))__ for Graphs (undirected), and __1 / ((n - 1) * (n - 2))__ for DiGraphs (directed graphs) where n is the number of nodes in G. Normalization will ensure that values are in [0, 1], this normalization scales for the highest possible value where one node is crossed by every single shortest path.
- weightcudf.DataFrame, optional, default=None
Specifies the weights to be used for each edge. Should contain a mapping between edges and weights. (Not Supported)
- endpointsbool, optional, default=False
If true, include the endpoints in the shortest path counts. (Not Supported)
- seedoptional
if k is specified and k is an integer, use seed to initialize the random number generator. Using None as seed relies on random.seed() behavior: using current system time If k is either None or list: seed parameter is ignored
- result_dtypenp.float32 or np.float64, optional, default=np.float64
Indicate the data type of the betweenness centrality scores
- Returns
- dfcudf.DataFrame or Dictionary if using NetworkX
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding betweenness centrality values. Please note that the resulting the ‘vertex’ column might not be in ascending order. The Dictionary conatains the same two columns
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘betweenness_centrality’]cudf.Series
Contains the betweenness centrality of vertices
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> bc = cugraph.betweenness_centrality(G)
- cugraph.centrality.betweenness_centrality.edge_betweenness_centrality(G, k=None, normalized=True, weight=None, seed=None, result_dtype=<class 'numpy.float64'>)[source]¶
Compute the edge betweenness centrality for all edges of the graph G. Betweenness centrality is a measure of the number of shortest paths that pass over an edge. An edge with a high betweenness centrality score has more paths passing over it and is therefore believed to be more important.
To improve performance, rather than doing an all-pair shortest path, a sample of k starting vertices can be used.
CuGraph does not currently support the ‘weight’ parameter as seen in the corresponding networkX call.
- Parameters
- GcuGraph.Graph or networkx.Graph
The graph can be either directed (DiGraph) or undirected (Graph). Weights in the graph are ignored, the current implementation uses BFS traversals. Use weight parameter if weights need to be considered (currently not supported)
- kint or list or None, optional, default=None
If k is not None, use k node samples to estimate betweenness. Higher values give better approximation. If k is a list, use the content of the list for estimation: the list should contain vertices identifiers. Vertices obtained through sampling or defined as a list will be used as sources for traversals inside the algorithm.
- normalizedbool, optional
Default is True. If true, the betweenness values are normalized by 2 / (n * (n - 1)) for Graphs (undirected), and 1 / (n * (n - 1)) for DiGraphs (directed graphs) where n is the number of nodes in G. Normalization will ensure that values are in [0, 1], this normalization scales for the highest possible value where one edge is crossed by every single shortest path.
- weightcudf.DataFrame, optional, default=None
Specifies the weights to be used for each edge. Should contain a mapping between edges and weights. (Not Supported)
- seedoptional
if k is specified and k is an integer, use seed to initialize the random number generator. Using None as seed relies on random.seed() behavior: using current system time If k is either None or list: seed parameter is ignored
- result_dtypenp.float32 or np.float64, optional, default=np.float64
Indicate the data type of the betweenness centrality scores Using double automatically switch implementation to “default”
- Returns
- dfcudf.DataFrame or Dictionary if using NetworkX
GPU data frame containing three cudf.Series of size E: the vertex identifiers of the sources, the vertex identifies of the destinations and the corresponding betweenness centrality values. Please note that the resulting the ‘src’, ‘dst’ column might not be in ascending order.
- df[‘src’]cudf.Series
Contains the vertex identifiers of the source of each edge
- df[‘dst’]cudf.Series
Contains the vertex identifiers of the destination of each edge
- df[‘edge_betweenness_centrality’]cudf.Series
Contains the betweenness centrality of edges
When using undirected graphs, ‘src’ and ‘dst’ only contains elements such that ‘src’ < ‘dst’, which might differ from networkx and user’s input. Namely edge (1 -> 0) is transformed into (0 -> 1) but contains the betweenness centrality of edge (1 -> 0).
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> ebc = cugraph.edge_betweenness_centrality(G)
Katz Centrality¶
- cugraph.centrality.katz_centrality.katz_centrality(G, alpha=None, beta=None, max_iter=100, tol=1e-06, nstart=None, normalized=True)[source]¶
Compute the Katz centrality for the nodes of the graph G. cuGraph does not currently support the ‘beta’ and ‘weight’ parameters as seen in the corresponding networkX call. This implementation is based on a relaxed version of Katz defined by Foster with a reduced computational complexity of O(n+m)
Foster, K.C., Muth, S.Q., Potterat, J.J. et al. Computational & Mathematical Organization Theory (2001) 7: 275. https://doi.org/10.1023/A:1013470632383
- Parameters
- GcuGraph.Graph or networkx.Graph
cuGraph graph descriptor with connectivity information. The graph can contain either directed (DiGraph) or undirected edges (Graph).
- alphafloat
Attenuation factor defaulted to None. If alpha is not specified then it is internally calculated as 1/(degree_max) where degree_max is the maximum out degree.
- NOTE
The maximum acceptable value of alpha for convergence alpha_max = 1/(lambda_max) where lambda_max is the largest eigenvalue of the graph. Since lambda_max is always lesser than or equal to degree_max for a graph, alpha_max will always be greater than or equal to (1/degree_max). Therefore, setting alpha to (1/degree_max) will guarantee that it will never exceed alpha_max thus in turn fulfilling the requirement for convergence.
- betaNone
A weight scalar - currently Not Supported
- max_iterint
The maximum number of iterations before an answer is returned. This can be used to limit the execution time and do an early exit before the solver reaches the convergence tolerance. If this value is lower or equal to 0 cuGraph will use the default value, which is 100.
- tolerancefloat
Set the tolerance the approximation, this parameter should be a small magnitude value. The lower the tolerance the better the approximation. If this value is 0.0f, cuGraph will use the default value which is 1.0e-6. Setting too small a tolerance can lead to non-convergence due to numerical roundoff. Usually values between 1e-2 and 1e-6 are acceptable.
- nstartcudf.Dataframe
GPU Dataframe containing the initial guess for katz centrality.
- nstart[‘vertex’]cudf.Series
Contains the vertex identifiers
- nstart[‘values’]cudf.Series
Contains the katz centrality values of vertices
- normalizedbool
If True normalize the resulting katz centrality values
- Returns
- dfcudf.DataFrame or Dictionary if using NetworkX
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding katz centrality values.
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘katz_centrality’]cudf.Series
Contains the katz centrality of vertices
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> kc = cugraph.katz_centrality(G)
Katz Centrality (MG)¶
- cugraph.dask.centrality.katz_centrality.call_katz_centrality(sID, data, num_verts, num_edges, vertex_partition_offsets, aggregate_segment_offsets, alpha, beta, max_iter, tol, nstart, normalized)[source]¶
- cugraph.dask.centrality.katz_centrality.katz_centrality(input_graph, alpha=None, beta=None, max_iter=100, tol=1e-05, nstart=None, normalized=True)[source]¶
Compute the Katz centrality for the nodes of the graph G.
- Parameters
- input_graphcuGraph.Graph
cuGraph graph descriptor with connectivity information. The graph can contain either directed (DiGraph) or undirected edges (Graph).
- alphafloat
Attenuation factor defaulted to None. If alpha is not specified then it is internally calculated as 1/(degree_max) where degree_max is the maximum out degree.
- NOTE
The maximum acceptable value of alpha for convergence alpha_max = 1/(lambda_max) where lambda_max is the largest eigenvalue of the graph. Since lambda_max is always lesser than or equal to degree_max for a graph, alpha_max will always be greater than or equal to (1/degree_max). Therefore, setting alpha to (1/degree_max) will guarantee that it will never exceed alpha_max thus in turn fulfilling the requirement for convergence.
- betaNone
A weight scalar - currently Not Supported
- max_iterint
The maximum number of iterations before an answer is returned. This can be used to limit the execution time and do an early exit before the solver reaches the convergence tolerance. If this value is lower or equal to 0 cuGraph will use the default value, which is 100.
- tolerancefloat
Set the tolerance the approximation, this parameter should be a small magnitude value. The lower the tolerance the better the approximation. If this value is 0.0f, cuGraph will use the default value which is 1.0e-6. Setting too small a tolerance can lead to non-convergence due to numerical roundoff. Usually values between 1e-2 and 1e-6 are acceptable.
- nstartdask_cudf.Dataframe
GPU Dataframe containing the initial guess for katz centrality
- nstart[‘vertex’]dask_cudf.Series
Contains the vertex identifiers
- nstart[‘values’]dask_cudf.Series
Contains the katz centrality values of vertices
- normalizedbool
If True normalize the resulting katz centrality values
- Returns
- katz_centralitydask_cudf.DataFrame
GPU data frame containing two dask_cudf.Series of size V: the vertex identifiers and the corresponding katz centrality values.
- ddf[‘vertex’]dask_cudf.Series
Contains the vertex identifiers
- ddf[‘katz_centrality’]dask_cudf.Series
Contains the katz centrality of vertices
Examples
>>> import cugraph.dask as dcg >>> ... Init a DASK Cluster >> see https://docs.rapids.ai/api/cugraph/stable/dask-cugraph.html >>> chunksize = dcg.get_chunksize(input_data_path) >>> ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize, delimiter=' ', names=['src', 'dst', 'value'], dtype=['int32', 'int32', 'float32']) >>> dg = cugraph.DiGraph() >>> dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst', edge_attr='value') >>> pr = dcg.katz_centrality(dg)
Community¶
EgoNet¶
- cugraph.community.egonet.batched_ego_graphs(G, seeds, radius=1, center=True, undirected=False, distance=None)[source]¶
Compute the induced subgraph of neighbors for each node in seeds within a given radius.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information. Edge weights, if present, should be single or double precision floating point values.
- seedscudf.Series or list or cudf.DataFrame
Specifies the seeds of the induced egonet subgraphs.
- radius: integer, optional
Include all neighbors of distance<=radius from n.
- center: bool, optional
Defaults to True. False is not supported
- undirected: bool, optional
Defaults to False. True is not supported
- distance: key, optional
Distances are counted in hops from n. Other cases are not supported.
- Returns
- ego_edge_listscudf.DataFrame or pandas.DataFrame
GPU data frame containing all induced sources identifiers, destination identifiers, edge weights
- seeds_offsets: cudf.Series
Series containing the starting offset in the returned edge list for each seed.
- cugraph.community.egonet.ego_graph(G, n, radius=1, center=True, undirected=False, distance=None)[source]¶
Compute the induced subgraph of neighbors centered at node n, within a given radius.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information. Edge weights, if present, should be single or double precision floating point values.
- ninteger or cudf.DataFrame
A single node as integer or a cudf.DataFrame if nodes are represented with multiple columns. If a cudf.DataFrame is provided, only the first row is taken as the node input.
- radius: integer, optional
Include all neighbors of distance<=radius from n.
- center: bool, optional
Defaults to True. False is not supported
- undirected: bool, optional
Defaults to False. True is not supported
- distance: key, optional
Distances are counted in hops from n. Other cases are not supported.
- Returns
- G_egocuGraph.Graph or networkx.Graph
A graph descriptor with a minimum spanning tree or forest. The networkx graph will not have all attributes copied over
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> ego_graph = cugraph.ego_graph(G, seed, radius=2)
Ensemble clustering for graphs (ECG)¶
- cugraph.community.ecg.ecg(input_graph, min_weight=0.05, ensemble_size=16, weight=None)[source]¶
Compute the Ensemble Clustering for Graphs (ECG) partition of the input graph. ECG runs truncated Louvain on an ensemble of permutations of the input graph, then uses the ensemble partitions to determine weights for the input graph. The final result is found by running full Louvain on the input graph using the determined weights.
See https://arxiv.org/abs/1809.05578 for further information.
- Parameters
- input_graphcugraph.Graph or NetworkX Graph
The graph descriptor should contain the connectivity information and weights. The adjacency list will be computed if not already present.
- min_weightfloating point
The minimum value to assign as an edgeweight in the ECG algorithm. It should be a value in the range [0,1] usually left as the default value of .05
- ensemble_sizeinteger
The number of graph permutations to use for the ensemble. The default value is 16, larger values may produce higher quality partitions for some graphs.
- weightstr
This parameter is here for NetworkX compatibility and represents which NetworkX data column represents Edge weights. Default is None
- Returns
- partscudf.DataFrame or python dictionary
GPU data frame of size V containing two columns, the vertex id and the partition id it is assigned to.
- df[vertex]cudf.Series
Contains the vertex identifiers
- df[partition]cudf.Series
Contains the partition assigned to the vertices
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2') >>> parts = cugraph.ecg(G)
K-Truss¶
- cugraph.community.ktruss_subgraph.k_truss(G, k)[source]¶
Returns the K-Truss subgraph of a graph for a specific k.
The k-truss of a graph is a subgraph where each edge is part of at least (k−2) triangles. K-trusses are used for finding tighlty knit groups of vertices in a graph. A k-truss is a relaxation of a k-clique in the graph and was define in [1]. Finding cliques is computationally demanding and finding the maximal k-clique is known to be NP-Hard.
- Parameters
- GcuGraph.Graph or networkx.Graph
cuGraph graph descriptor with connectivity information. k-Trusses are defined for only undirected graphs as they are defined for undirected triangle in a graph.
- kint
The desired k to be used for extracting the k-truss subgraph.
- Returns
- G_trusscuGraph.Graph or networkx.Graph
A cugraph graph descriptor with the k-truss subgraph for the given k. The networkx graph will NOT have all attributes copied over
- cugraph.community.ktruss_subgraph.ktruss_subgraph(G, k, use_weights=True)[source]¶
Returns the K-Truss subgraph of a graph for a specific k.
The k-truss of a graph is a subgraph where each edge is part of at least (k−2) triangles. K-trusses are used for finding tighlty knit groups of vertices in a graph. A k-truss is a relaxation of a k-clique in the graph and was define in [1]. Finding cliques is computationally demanding and finding the maximal k-clique is known to be NP-Hard.
In contrast, finding a k-truss is computationally tractable as its key building block, namely triangle counting counting, can be executed in polnymomial time.Typically, it takes many iterations of triangle counting to find the k-truss of a graph. Yet these iterations operate on a weakly monotonically shrinking graph. Therefore, finding the k-truss of a graph can be done in a fairly reasonable amount of time. The solution in cuGraph is based on a GPU algorithm first shown in [2] and uses the triangle counting algorithm from [3].
[1] Cohen, J., “Trusses: Cohesive subgraphs for social network analysis” National security agency technical report, 2008
[2] O. Green, J. Fox, E. Kim, F. Busato, et al. “Quickly Finding a Truss in a Haystack” IEEE High Performance Extreme Computing Conference (HPEC), 2017 https://doi.org/10.1109/HPEC.2017.8091038
[3] O. Green, P. Yalamanchili, L.M. Munguia, “Fast Triangle Counting on GPU” Irregular Applications: Architectures and Algorithms (IA3), 2014
- Parameters
- GcuGraph.Graph
cuGraph graph descriptor with connectivity information. k-Trusses are defined for only undirected graphs as they are defined for undirected triangle in a graph.
- kint
The desired k to be used for extracting the k-truss subgraph.
- use_weightsBool
whether the output should contain the edge weights if G has them
- Returns
- G_trusscuGraph.Graph
A cugraph graph descriptor with the k-truss subgraph for the given k.
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> k_subgraph = cugraph.ktruss_subgraph(G, 3)
Leiden¶
- cugraph.community.leiden.leiden(G, max_iter=100, resolution=1.0)[source]¶
Compute the modularity optimizing partition of the input graph using the Leiden algorithm
It uses the Louvain method described in:
Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific reports, 9(1), 5233. doi: 10.1038/s41598-019-41695-z
- Parameters
- Gcugraph.Graph
cuGraph graph descriptor of type Graph
The adjacency list will be computed if not already present.
- max_iterinteger
This controls the maximum number of levels/iterations of the Leiden algorithm. When specified the algorithm will terminate after no more than the specified number of iterations. No error occurs when the algorithm terminates early in this manner.
- resolution: float/double, optional
Called gamma in the modularity formula, this changes the size of the communities. Higher resolutions lead to more smaller communities, lower resolutions lead to fewer larger communities. Defaults to 1.
- Returns
- partscudf.DataFrame
GPU data frame of size V containing two columns the vertex id and the partition id it is assigned to.
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘partition’]cudf.Series
Contains the partition assigned to the vertices
- modularity_scorefloat
a floating point number containing the global modularity score of the partitioning.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> parts, modularity_score = cugraph.leiden(G)
Louvain¶
- cugraph.community.louvain.louvain(G, max_iter=100, resolution=1.0)[source]¶
Compute the modularity optimizing partition of the input graph using the Louvain method
It uses the Louvain method described in:
VD Blondel, J-L Guillaume, R Lambiotte and E Lefebvre: Fast unfolding of community hierarchies in large networks, J Stat Mech P10008 (2008), http://arxiv.org/abs/0803.0476
- Parameters
- Gcugraph.Graph or NetworkX Graph
The graph descriptor should contain the connectivity information and weights. The adjacency list will be computed if not already present.
- max_iterinteger
This controls the maximum number of levels/iterations of the Louvain algorithm. When specified the algorithm will terminate after no more than the specified number of iterations. No error occurs when the algorithm terminates early in this manner.
- resolution: float/double, optional
Called gamma in the modularity formula, this changes the size of the communities. Higher resolutions lead to more smaller communities, lower resolutions lead to fewer larger communities. Defaults to 1.
- Returns
- partscudf.DataFrame
GPU data frame of size V containing two columns the vertex id and the partition id it is assigned to.
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘partition’]cudf.Series
Contains the partition assigned to the vertices
- modularity_scorefloat
a floating point number containing the global modularity score of the partitioning.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> parts, modularity_score = cugraph.louvain(G)
Louvain (MG)¶
- cugraph.dask.community.louvain.call_louvain(sID, data, num_verts, num_edges, vertex_partition_offsets, aggregate_segment_offsets, max_level, resolution)[source]¶
- cugraph.dask.community.louvain.louvain(input_graph, max_iter=100, resolution=1.0)[source]¶
Compute the modularity optimizing partition of the input graph using the Louvain method on multiple GPUs
Examples
>>> import cugraph.dask as dcg >>> ... Init a DASK Cluster >> see https://docs.rapids.ai/api/cugraph/stable/dask-cugraph.html >>> chunksize = dcg.get_chunksize(input_data_path) >>> ddf = dask_cudf.read_csv('datasets/karate.csv', chunksize=chunksize, delimiter=' ', names=['src', 'dst', 'value'], dtype=['int32', 'int32', 'float32']) >>> dg = cugraph.Graph() >>> dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst', edge_attr='value') >>> parts, modularity_score = dcg.louvain(dg)
Spectral Clustering¶
- cugraph.community.spectral_clustering.analyzeClustering_edge_cut(G, n_clusters, clustering, vertex_col_name='vertex', cluster_col_name='cluster')[source]¶
Compute the edge cut score for a partitioning/clustering The assumption is that “clustering” is the results from a call from a special clustering algorithm and contains columns named “vertex” and “cluster”.
- Parameters
- Gcugraph.Graph
cuGraph graph descriptor
- n_clustersinteger
Specifies the number of clusters in the given clustering
- clusteringcudf.DataFrame
The cluster assignment to analyze.
- vertex_col_namestr
The name of the column in the clustering dataframe identifying the external vertex id
- cluster_col_namestr
The name of the column in the clustering dataframe identifying the cluster id
- Returns
- scorefloat
The computed edge cut score
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr=None) >>> df = cugraph.spectralBalancedCutClustering(G, 5) >>> score = cugraph.analyzeClustering_edge_cut(G, 5, df)
- cugraph.community.spectral_clustering.analyzeClustering_modularity(G, n_clusters, clustering, vertex_col_name='vertex', cluster_col_name='cluster')[source]¶
Compute the modularity score for a given partitioning/clustering. The assumption is that “clustering” is the results from a call from a special clustering algorithm and contains columns named “vertex” and “cluster”.
- Parameters
- Gcugraph.Graph or networkx.Graph
graph descriptor. This graph should have edge weights.
- n_clustersinteger
Specifies the number of clusters in the given clustering
- clusteringcudf.DataFrame
The cluster assignment to analyze.
- vertex_col_namestr or list of str
The names of the column in the clustering dataframe identifying the external vertex id
- cluster_col_namestr
The name of the column in the clustering dataframe identifying the cluster id
- Returns
- scorefloat
The computed modularity score
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2') >>> df = cugraph.spectralBalancedCutClustering(G, 5) >>> score = cugraph.analyzeClustering_modularity(G, 5, df)
- cugraph.community.spectral_clustering.analyzeClustering_ratio_cut(G, n_clusters, clustering, vertex_col_name='vertex', cluster_col_name='cluster')[source]¶
Compute the ratio cut score for a partitioning/clustering
- Parameters
- Gcugraph.Graph
cuGraph graph descriptor. This graph should have edge weights.
- n_clustersinteger
Specifies the number of clusters in the given clustering
- clusteringcudf.DataFrame
The cluster assignment to analyze.
- vertex_col_namestr
The name of the column in the clustering dataframe identifying the external vertex id
- cluster_col_namestr
The name of the column in the clustering dataframe identifying the cluster id
- Returns
- scorefloat
The computed ratio cut score
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2') >>> df = cugraph.spectralBalancedCutClustering(G, 5) >>> score = cugraph.analyzeClustering_ratio_cut(G, 5, df, >>> 'vertex', 'cluster')
- cugraph.community.spectral_clustering.spectralBalancedCutClustering(G, num_clusters, num_eigen_vects=2, evs_tolerance=1e-05, evs_max_iter=100, kmean_tolerance=1e-05, kmean_max_iter=100)[source]¶
Compute a clustering/partitioning of the given graph using the spectral balanced cut method.
- Parameters
- Gcugraph.Graph or networkx.Graph
graph descriptor
- num_clustersinteger
Specifies the number of clusters to find, must be greater than 1
- num_eigen_vectsinteger
Specifies the number of eigenvectors to use. Must be lower or equal to num_clusters. Default is 2
- evs_tolerance: float
Specifies the tolerance to use in the eigensolver. Default is 0.00001
- evs_max_iter: integer
Specifies the maximum number of iterations for the eigensolver. Default is 100
- kmean_tolerance: float
Specifies the tolerance to use in the k-means solver. Default is 0.00001
- kmean_max_iter: integer
Specifies the maximum number of iterations for the k-means solver. Default is 100
- Returns
- dfcudf.DataFrame
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding cluster assignments.
- df[‘vertex’]cudf.Series
contains the vertex identifiers
- df[‘cluster’]cudf.Series
contains the cluster assignments
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> df = cugraph.spectralBalancedCutClustering(G, 5)
- cugraph.community.spectral_clustering.spectralModularityMaximizationClustering(G, num_clusters, num_eigen_vects=2, evs_tolerance=1e-05, evs_max_iter=100, kmean_tolerance=1e-05, kmean_max_iter=100)[source]¶
Compute a clustering/partitioning of the given graph using the spectral modularity maximization method.
- Parameters
- Gcugraph.Graph or networkx.Graph
cuGraph graph descriptor. This graph should have edge weights.
- num_clustersinteger
Specifies the number of clusters to find
- num_eigen_vectsinteger
Specifies the number of eigenvectors to use. Must be lower or equal to num_clusters. Default is 2
- evs_tolerance: float
Specifies the tolerance to use in the eigensolver. Default is 0.00001
- evs_max_iter: integer
Specifies the maximum number of iterations for the eigensolver. Default is 100
- kmean_tolerance: float
Specifies the tolerance to use in the k-means solver. Default is 0.00001
- kmean_max_iter: integer
Specifies the maximum number of iterations for the k-means solver. Default is 100
- Returns
- dfcudf.DataFrame
- df[‘vertex’]cudf.Series
contains the vertex identifiers
- df[‘cluster’]cudf.Series
contains the cluster assignments
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2') >>> df = cugraph.spectralModularityMaximizationClustering(G, 5)
Subgraph Extraction¶
- cugraph.community.subgraph_extraction.subgraph(G, vertices)[source]¶
Compute a subgraph of the existing graph including only the specified vertices. This algorithm works for both directed and undirected graphs, it does not actually traverse the edges, simply pulls out any edges that are incident on vertices that are both contained in the vertices list.
- Parameters
- Gcugraph.Graph
cuGraph graph descriptor
- verticescudf.Series or cudf.DataFrame
Specifies the vertices of the induced subgraph. For multi-column vertices, vertices should be provided as a cudf.DataFrame
- Returns
- Sgcugraph.Graph
A graph object containing the subgraph induced by the given vertex set.
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> verts = numpy.zeros(3, dtype=numpy.int32) >>> verts[0] = 0 >>> verts[1] = 1 >>> verts[2] = 2 >>> sverts = cudf.Series(verts) >>> Sg = cugraph.subgraph(G, sverts)
Triangle Counting¶
- cugraph.community.triangle_count.triangles(G)[source]¶
Compute the number of triangles (cycles of length three) in the input graph.
Unlike NetworkX, this algorithm simply returns the total number of triangle and not the number per vertex.
- Parameters
- Gcugraph.graph or networkx.Graph
cuGraph graph descriptor, should contain the connectivity information, (edge weights are not used in this algorithm)
- Returns
- countint64
A 64 bit integer whose value gives the number of triangles in the graph.
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> count = cugraph.triangles(G)
Components¶
Connected Components¶
- cugraph.components.connectivity.connected_components(G, directed=None, connection='weak', return_labels=None)[source]¶
Generate either the stronlgly or weakly connected components and attach a component label to each vertex.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information (edge weights are not used for this algorithm). If using a graph object, the graph can be either directed or undirected where an undirected edge is represented by a directed edge in both directions. The adjacency list will be computed if not already present. The number of vertices should fit into a 32b int.
- directedbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then convert the input matrix to a cugraph.DiGraph and only move from point i to point j along paths csgraph[i, j]. If False, then find the shortest path on an undirected graph: the algorithm can progress from point i to j along csgraph[i, j] or csgraph[j, i].
- connectionstr, optional
[‘weak’|’strong’]. Return either weakly or strongly connected components.
- return_labelsbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then return the labels for each of the connected components.
- Returns
- Return value type is based on the input type. If G is a cugraph.Graph,
- returns:
- cudf.DataFrame
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding component identifier.
- df[‘vertex’]
Contains the vertex identifier
- df[‘labels’]
The component identifier
- If G is a networkx.Graph, returns:
python dictionary, where keys are vertices and values are the component identifiers.
- If G is a CuPy or SciPy matrix, returns:
CuPy ndarray (if CuPy matrix input) or Numpy ndarray (if SciPy matrix input) of shape (<num vertices>, 2), where column 0 contains component identifiers and column 1 contains vertices.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr=None) >>> df = cugraph.connected_components(G, connection="weak")
- cugraph.components.connectivity.strongly_connected_components(G, directed=None, connection=None, return_labels=None)[source]¶
Generate the Strongly Connected Components and attach a component label to each vertex.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information (edge weights are not used for this algorithm). If using a graph object, the graph can be either directed or undirected where an undirected edge is represented by a directed edge in both directions. The adjacency list will be computed if not already present. The number of vertices should fit into a 32b int.
- directedbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then convert the input matrix to a cugraph.DiGraph and only move from point i to point j along paths csgraph[i, j]. If False, then find the shortest path on an undirected graph: the algorithm can progress from point i to j along csgraph[i, j] or csgraph[j, i].
- connectionstr, optional
Added for SciPy compatibility, can only be specified for non-Graph-type (eg. sparse matrix) values of G only (raises TypeError if used with a Graph object), and can only be set to “strong” for this API.
- return_labelsbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then return the labels for each of the connected components.
- Returns
- Return value type is based on the input type. If G is a cugraph.Graph,
- returns:
- cudf.DataFrame
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding component identifier.
- df[‘vertex’]
Contains the vertex identifier
- df[‘labels’]
The component identifier
- If G is a networkx.Graph, returns:
python dictionary, where keys are vertices and values are the component identifiers.
- If G is a CuPy or SciPy matrix, returns:
CuPy ndarray (if CuPy matrix input) or Numpy ndarray (if SciPy matrix input) of shape (<num vertices>, 2), where column 0 contains component identifiers and column 1 contains vertices.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr=None) >>> df = cugraph.strongly_connected_components(G)
- cugraph.components.connectivity.weakly_connected_components(G, directed=None, connection=None, return_labels=None)[source]¶
Generate the Weakly Connected Components and attach a component label to each vertex.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information (edge weights are not used for this algorithm). If using a graph object, the graph can be either directed or undirected where an undirected edge is represented by a directed edge in both directions. The adjacency list will be computed if not already present. The number of vertices should fit into a 32b int.
- directedbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then convert the input matrix to a cugraph.DiGraph and only move from point i to point j along paths csgraph[i, j]. If False, then find the shortest path on an undirected graph: the algorithm can progress from point i to j along csgraph[i, j] or csgraph[j, i].
- connectionstr, optional
Added for SciPy compatibility, can only be specified for non-Graph-type (eg. sparse matrix) values of G only (raises TypeError if used with a Graph object), and can only be set to “weak” for this API.
- return_labelsbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then return the labels for each of the connected components.
- Returns
- Return value type is based on the input type. If G is a cugraph.Graph,
- returns:
- cudf.DataFrame
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding component identifier.
- df[‘vertex’]
Contains the vertex identifier
- df[‘labels’]
The component identifier
- If G is a networkx.Graph, returns:
python dictionary, where keys are vertices and values are the component identifiers.
- If G is a CuPy or SciPy matrix, returns:
CuPy ndarray (if CuPy matrix input) or Numpy ndarray (if SciPy matrix input) of shape (<num vertices>, 2), where column 0 contains component identifiers and column 1 contains vertices.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter = ' ', dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr=None) >>> df = cugraph.weakly_connected_components(G)
Connected Components (MG)¶
Cores¶
Core Number¶
- cugraph.cores.core_number.core_number(G)[source]¶
Compute the core numbers for the nodes of the graph G. A k-core of a graph is a maximal subgraph that contains nodes of degree k or more. A node has a core number of k if it belongs a k-core but not to k+1-core. This call does not support a graph with self-loops and parallel edges.
- Parameters
- graphcuGraph.Graph or networkx.Graph
The graph should contain undirected edges where undirected edges are represented as directed edges in both directions. While this graph can contain edge weights, they don’t participate in the calculation of the core numbers.
- Returns
- dfcudf.DataFrame or python dictionary (in NetworkX input)
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding core number values.
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘core_number’]cudf.Series
Contains the core number of vertices
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> cn = cugraph.core_number(G)
K-Core¶
- cugraph.cores.k_core.k_core(G, k=None, core_number=None)[source]¶
Compute the k-core of the graph G based on the out degree of its nodes. A k-core of a graph is a maximal subgraph that contains nodes of degree k or more. This call does not support a graph with self-loops and parallel edges.
- Parameters
- GcuGraph.Graph or networkx.Graph
cuGraph graph descriptor with connectivity information. The graph should contain undirected edges where undirected edges are represented as directed edges in both directions. While this graph can contain edge weights, they don’t participate in the calculation of the k-core.
- kint, optional
Order of the core. This value must not be negative. If set to None, the main core is returned.
- core_numbercudf.DataFrame, optional
Precomputed core number of the nodes of the graph G containing two cudf.Series of size V: the vertex identifiers and the corresponding core number values. If set to None, the core numbers of the nodes are calculated internally.
- core_number[‘vertex’]cudf.Series
Contains the vertex identifiers
- core_number[‘values’]cudf.Series
Contains the core number of vertices
- Returns
- KCoreGraphcuGraph.Graph
K Core of the input graph
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> KCoreGraph = cugraph.k_core(G)
Layout¶
Force Atlas 2¶
- cugraph.layout.force_atlas2.force_atlas2(input_graph, max_iter=500, pos_list=None, outbound_attraction_distribution=True, lin_log_mode=False, prevent_overlapping=False, edge_weight_influence=1.0, jitter_tolerance=1.0, barnes_hut_optimize=True, barnes_hut_theta=0.5, scaling_ratio=2.0, strong_gravity_mode=False, gravity=1.0, verbose=False, callback=None)[source]¶
ForceAtlas2 is a continuous graph layout algorithm for handy network visualization.
NOTE: Peak memory allocation occurs at 30*V.
- Parameters
- input_graphcugraph.Graph
cuGraph graph descriptor with connectivity information. Edge weights, if present, should be single or double precision floating point values.
- max_iterinteger
This controls the maximum number of levels/iterations of the Force Atlas algorithm. When specified the algorithm will terminate after no more than the specified number of iterations. No error occurs when the algorithm terminates in this manner. Good short-term quality can be achieved with 50-100 iterations. Above 1000 iterations is discouraged.
- pos_list: cudf.DataFrame
Data frame with initial vertex positions containing two columns: ‘x’ and ‘y’ positions.
- outbound_attraction_distribution: bool
Distributes attraction along outbound edges. Hubs attract less and thus are pushed to the borders.
- lin_log_mode: bool
Switch Force Atlas model from lin-lin to lin-log. Makes clusters more tight.
- prevent_overlapping: bool
Prevent nodes to overlap.
- edge_weight_influence: float
How much influence you give to the edges weight. 0 is “no influence” and 1 is “normal”.
- jitter_tolerance: float
How much swinging you allow. Above 1 discouraged. Lower gives less speed and more precision.
- barnes_hut_optimize: bool
Whether to use the Barnes Hut approximation or the slower exact version.
- barnes_hut_theta: float
Float between 0 and 1. Tradeoff for speed (1) vs accuracy (0) for Barnes Hut only.
- scaling_ratio: float
How much repulsion you want. More makes a more sparse graph. Switching from regular mode to LinLog mode needs a readjustment of the scaling parameter.
- strong_gravity_mode: bool
Sets a force that attracts the nodes that are distant from the center more. It is so strong that it can sometimes dominate other forces.
- gravityfloat
Attracts nodes to the center. Prevents islands from drifting away.
- verbose: bool
Output convergence info at each interation.
- callback: GraphBasedDimRedCallback
An instance of GraphBasedDimRedCallback class to intercept the internal state of positions while they are being trained.
- Example of callback usage:
- from cugraph.internals import GraphBasedDimRedCallback
- class CustomCallback(GraphBasedDimRedCallback):
- def on_preprocess_end(self, positions):
print(positions.copy_to_host())
- def on_epoch_end(self, positions):
print(positions.copy_to_host())
- def on_train_end(self, positions):
print(positions.copy_to_host())
- Returns
- poscudf.DataFrame
GPU data frame of size V containing three columns: the vertex identifiers and the x and y positions.
Linear Assignment¶
Hungarian¶
Execute the Hungarian algorithm against a symmetric, weighted, bipartite graph.
As a bipartite graph, the vertex set of the graph can be partitioned into two disjoint sets such that all edges connect a vertex from one set to a vertex of the other set. The workers variable identifies one of the sets of vertices, the other set is all vertices not in the workers set (V - workers).
The edge weights reflect the cost of assigning a particular job to a worker.
The Hungarian algorithm identifies the lowest cost matching of vertices such that all workers that can be assigned work are assigned exactly on job.
Parameters¶
- Gcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an an edge list. Edge weights are required. If an edge list is not provided then it will be computed.
- workerscudf.Series or cudf.DataFrame
A series or column that identifies the vertex ids of the vertices in the workers set. In case of multi-column vertices, it should be a cudf.DataFrame. All vertices in G that are not in the workers set are implicitly assigned to the jobs set.
- epsilonfloat or double (matching weight type in graph)
Used for determining when value is close enough to zero to consider 0. Defaults (if not specified) to 1e-6 in the C++ code. Unused for integer weight types.
Returns¶
- costmatches costs.dtype
The cost of the overall assignment
- dfcudf.DataFrame
- df[‘vertex’][i] gives the vertex id of the i’th vertex. Only vertices
in the workers list are defined in this column.
- df[‘assignment’][i] gives the vertex id of the “job” assigned to the
corresponding vertex.
FIXME: Update this with a real example…
Examples¶
>>> M = cudf.read_csv('datasets/bipartite.csv', delimiter=' ',
>>> dtype=['int32', 'int32', 'float32'], header=None)
>>> G = cugraph.Graph()
>>> G.from_cudf_edgelist(M, source='0', destination='1', edge_attr='2')
>>> cost, df = cugraph.hungarian(G, workers)
Link Analysis¶
HITS¶
- cugraph.link_analysis.hits.hits(G, max_iter=100, tol=1e-05, nstart=None, normalized=True)[source]¶
Compute HITS hubs and authorities values for each vertex
The HITS algorithm computes two numbers for a node. Authorities estimates the node value based on the incoming links. Hubs estimates the node value based on outgoing links.
The cuGraph implementation of HITS is a wrapper around the gunrock implementation of HITS.
Note that the gunrock implementation uses a 2-norm, while networkx uses a 1-norm. The raw scores will be different, but the rank ordering should be comparable with networkx.
- Parameters
- graphcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list (edge weights are not used for this algorithm). The adjacency list will be computed if not already present.
- max_iterint
The maximum number of iterations before an answer is returned. The gunrock implementation does not currently support tolerance, so this will in fact be the number of iterations the HITS algorithm executes.
- tolerancefloat
Set the tolerance the approximation, this parameter should be a small magnitude value. This parameter is not currently supported.
- nstartcudf.Dataframe
Not currently supported
- normalizedbool
Not currently supported, always used as True
- Returns
- HubsAndAuthoritiescudf.DataFrame
GPU data frame containing three cudf.Series of size V: the vertex identifiers and the corresponding hubs values and the corresponding authorities values.
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘hubs’]cudf.Series
Contains the hubs score
- df[‘authorities’]cudf.Series
Contains the authorities score
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> hits = cugraph.hits(G, max_iter = 50)
Pagerank¶
- cugraph.link_analysis.pagerank.pagerank(G, alpha=0.85, personalization=None, max_iter=100, tol=1e-05, nstart=None, weight=None, dangling=None)[source]¶
Find the PageRank score for every vertex in a graph. cuGraph computes an approximation of the Pagerank eigenvector using the power method. The number of iterations depends on the properties of the network itself; it increases when the tolerance descreases and/or alpha increases toward the limiting value of 1. The user is free to use default values or to provide inputs for the initial guess, tolerance and maximum number of iterations.
- Parameters
- graphcugraph.Graph or networkx.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list. The transposed adjacency list will be computed if not already present.
- alphafloat
The damping factor alpha represents the probability to follow an outgoing edge, standard value is 0.85. Thus, 1.0-alpha is the probability to “teleport” to a random vertex. Alpha should be greater than 0.0 and strictly lower than 1.0.
- personalizationcudf.Dataframe
GPU Dataframe containing the personalization information.
- personalization[‘vertex’]cudf.Series
Subset of vertices of graph for personalization
- personalization[‘values’]cudf.Series
Personalization values for vertices
- max_iterint
The maximum number of iterations before an answer is returned. This can be used to limit the execution time and do an early exit before the solver reaches the convergence tolerance. If this value is lower or equal to 0 cuGraph will use the default value, which is 100.
- tolerancefloat
Set the tolerance the approximation, this parameter should be a small magnitude value. The lower the tolerance the better the approximation. If this value is 0.0f, cuGraph will use the default value which is 1.0E-5. Setting too small a tolerance can lead to non-convergence due to numerical roundoff. Usually values between 0.01 and 0.00001 are acceptable.
- nstartcudf.Dataframe
GPU Dataframe containing the initial guess for pagerank.
- nstart[‘vertex’]cudf.Series
Subset of vertices of graph for initial guess for pagerank values
- nstart[‘values’]cudf.Series
Pagerank values for vertices
- weight: str
The attribute column to be used as edge weights if Graph is a NetworkX Graph. This parameter is here for NetworkX compatibility and is ignored in case of a cugraph.Graph
- danglingdict
This parameter is here for NetworkX compatibility and ignored
- Returns
- PageRankcudf.DataFrame
GPU data frame containing two cudf.Series of size V: the vertex identifiers and the corresponding PageRank values.
- df[‘vertex’]cudf.Series
Contains the vertex identifiers
- df[‘pagerank’]cudf.Series
Contains the PageRank score
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> pr = cugraph.pagerank(G, alpha = 0.85, max_iter = 500, tol = 1.0e-05)
Pagerank (MG)¶
- cugraph.dask.link_analysis.pagerank.pagerank(input_graph, alpha=0.85, personalization=None, max_iter=100, tol=1e-05, nstart=None)[source]¶
Find the PageRank values for each vertex in a graph using multiple GPUs. cuGraph computes an approximation of the Pagerank using the power method. The input graph must contain edge list as dask-cudf dataframe with one partition per GPU.
- Parameters
- graphcugraph.DiGraph
cuGraph graph descriptor, should contain the connectivity information as dask cudf edge list dataframe(edge weights are not used for this algorithm). Undirected Graph not currently supported.
- alphafloat
The damping factor alpha represents the probability to follow an outgoing edge, standard value is 0.85. Thus, 1.0-alpha is the probability to “teleport” to a random vertex. Alpha should be greater than 0.0 and strictly lower than 1.0.
- personalizationcudf.Dataframe
GPU Dataframe containing the personalization information. Currently not supported.
- personalization[‘vertex’]cudf.Series
Subset of vertices of graph for personalization
- personalization[‘values’]cudf.Series
Personalization values for vertices
- max_iterint
The maximum number of iterations before an answer is returned. If this value is lower or equal to 0 cuGraph will use the default value, which is 30.
- tolerancefloat
Set the tolerance the approximation, this parameter should be a small magnitude value. The lower the tolerance the better the approximation. If this value is 0.0f, cuGraph will use the default value which is 1.0E-5. Setting too small a tolerance can lead to non-convergence due to numerical roundoff. Usually values between 0.01 and 0.00001 are acceptable.
- nstartnot supported
initial guess for pagerank
- Returns
- PageRankdask_cudf.DataFrame
GPU data frame containing two dask_cudf.Series of size V: the vertex identifiers and the corresponding PageRank values.
- ddf[‘vertex’]dask_cudf.Series
Contains the vertex identifiers
- ddf[‘pagerank’]dask_cudf.Series
Contains the PageRank score
Examples
>>> import cugraph.dask as dcg >>> ... Init a DASK Cluster >> see https://docs.rapids.ai/api/cugraph/stable/dask-cugraph.html >>> chunksize = dcg.get_chunksize(input_data_path) >>> ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize, delimiter=' ', names=['src', 'dst', 'value'], dtype=['int32', 'int32', 'float32']) >>> dg = cugraph.DiGraph() >>> dg.from_dask_cudf_edgelist(ddf, source='src', destination='dst', edge_attr='value') >>> pr = dcg.pagerank(dg)
Link Prediction¶
Jaccard Coefficient¶
- cugraph.link_prediction.jaccard.jaccard(input_graph, vertex_pair=None)[source]¶
Compute the Jaccard similarity between each pair of vertices connected by an edge, or between arbitrary pairs of vertices specified by the user. Jaccard similarity is defined between two sets as the ratio of the volume of their intersection divided by the volume of their union. In the context of graphs, the neighborhood of a vertex is seen as a set. The Jaccard similarity weight of each edge represents the strength of connection between vertices based on the relative similarity of their neighbors. If first is specified but second is not, or vice versa, an exception will be thrown.
NOTE: If the vertex_pair parameter is not specified then the behavior of cugraph.jaccard is different from the behavior of networkx.jaccard_coefficient.
cugraph.jaccard, in the absence of a specified vertex pair list, will use the edges of the graph to construct a vertex pair list and will return the jaccard coefficient for those vertex pairs.
networkx.jaccard_coefficient, in the absence of a specified vertex pair list, will return an upper triangular dense matrix, excluding the diagonal as well as vertex pairs that are directly connected by an edge in the graph, of jaccard coefficients. Technically, networkx returns a lazy iterator across this upper triangular matrix where the actual jaccard coefficient is computed when the iterator is dereferenced. Computing a dense matrix of results is not feasible if the number of vertices in the graph is large (100,000 vertices would result in 4.9 billion values in that iterator).
If your graph is small enough (or you have enough memory and patience) you can get the interesting (non-zero) values that are part of the networkx solution by doing the following:
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> pairs = cugraph.get_two_hop_neighbors(G) >>> df = cugraph.jaccard(G, pairs)
But please remember that cugraph will fill the dataframe with the entire solution you request, so you’ll need enough memory to store the 2-hop neighborhood dataframe.
- Parameters
- graphcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list (edge weights are not used for this algorithm). The graph should be undirected where an undirected edge is represented by a directed edge in both direction. The adjacency list will be computed if not already present.
- vertex_paircudf.DataFrame
A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the jaccard coefficient is computed for the given vertex pairs. If the vertex_pair is not provided then the current implementation computes the jaccard coefficient for all adjacent vertices in the graph.
- Returns
- dfcudf.DataFrame
GPU data frame of size E (the default) or the size of the given pairs (first, second) containing the Jaccard weights. The ordering is relative to the adjacency list, or that given by the specified vertex pairs.
- df[‘source’]cudf.Series
The source vertex ID (will be identical to first if specified)
- df[‘destination’]cudf.Series
The destination vertex ID (will be identical to second if specified)
- df[‘jaccard_coeff’]cudf.Series
The computed Jaccard coefficient between the source and destination vertices
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> df = cugraph.jaccard(G)
- cugraph.link_prediction.jaccard.jaccard_coefficient(G, ebunch=None)[source]¶
For NetworkX Compatability. See jaccard
- Parameters
- graphcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list (edge weights are not used for this algorithm). The graph should be undirected where an undirected edge is represented by a directed edge in both direction. The adjacency list will be computed if not already present.
- ebunchcudf.DataFrame
A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the jaccard coefficient is computed for the given vertex pairs. If the vertex_pair is not provided then the current implementation computes the jaccard coefficient for all adjacent vertices in the graph.
- Returns
- dfcudf.DataFrame
GPU data frame of size E (the default) or the size of the given pairs (first, second) containing the Jaccard weights. The ordering is relative to the adjacency list, or that given by the specified vertex pairs.
- df[‘source’]cudf.Series
The source vertex ID (will be identical to first if specified)
- df[‘destination’]cudf.Series
The destination vertex ID (will be identical to second if specified)
- df[‘jaccard_coeff’]cudf.Series
The computed Jaccard coefficient between the source and destination vertices
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> df = cugraph.jaccard_coefficient(G)
- cugraph.link_prediction.wjaccard.jaccard_w(input_graph, weights, vertex_pair=None)[source]¶
Compute the weighted Jaccard similarity between each pair of vertices connected by an edge, or between arbitrary pairs of vertices specified by the user. Jaccard similarity is defined between two sets as the ratio of the volume of their intersection divided by the volume of their union. In the context of graphs, the neighborhood of a vertex is seen as a set. The Jaccard similarity weight of each edge represents the strength of connection between vertices based on the relative similarity of their neighbors. If first is specified but second is not, or vice versa, an exception will be thrown.
- Parameters
- graphcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list (edge weights are not used for this algorithm). The adjacency list will be computed if not already present.
- weightscudf.DataFrame
Specifies the weights to be used for each vertex. Vertex should be represented by multiple columns for multi-column vertices.
- weights[‘vertex’]cudf.Series
Contains the vertex identifiers
- weights[‘weight’]cudf.Series
Contains the weights of vertices
- vertex_paircudf.DataFrame
A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the jaccard coefficient is computed for the given vertex pairs, else, it is computed for all vertex pairs.
- Returns
- dfcudf.DataFrame
GPU data frame of size E (the default) or the size of the given pairs (first, second) containing the Jaccard weights. The ordering is relative to the adjacency list, or that given by the specified vertex pairs.
- df[‘source’]cudf.Series
The source vertex ID
- df[‘destination’]cudf.Series
The destination vertex ID
- df[‘jaccard_coeff’]cudf.Series
The computed weighted Jaccard coefficient between the source and destination vertices.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> df = cugraph.jaccard_w(G, M[2])
Overlap Coefficient¶
- cugraph.link_prediction.overlap.overlap(input_graph, vertex_pair=None)[source]¶
Compute the Overlap Coefficient between each pair of vertices connected by an edge, or between arbitrary pairs of vertices specified by the user. Overlap Coefficient is defined between two sets as the ratio of the volume of their intersection divided by the smaller of their two volumes. In the context of graphs, the neighborhood of a vertex is seen as a set. The Overlap Coefficient weight of each edge represents the strength of connection between vertices based on the relative similarity of their neighbors. If first is specified but second is not, or vice versa, an exception will be thrown.
- Parameters
- graphcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list (edge weights are not used for this algorithm). The adjacency list will be computed if not already present.
- vertex_paircudf.DataFrame
A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the overlap coefficient is computed for the given vertex pairs, else, it is computed for all vertex pairs.
- Returns
- dfcudf.DataFrame
GPU data frame of size E (the default) or the size of the given pairs (first, second) containing the Overlap coefficients. The ordering is relative to the adjacency list, or that given by the specified vertex pairs.
- df[‘source’]cudf.Series
The source vertex ID (will be identical to first if specified).
- df[‘destination’]cudf.Series
The destination vertex ID (will be identical to second if specified).
- df[‘overlap_coeff’]cudf.Series
The computed Overlap coefficient between the source and destination vertices.
Examples
>>> gdf = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(gdf, source='0', destination='1') >>> df = cugraph.overlap(G)
- cugraph.link_prediction.overlap.overlap_coefficient(G, ebunch=None)[source]¶
NetworkX similar API. See ‘jaccard’ for a description
- cugraph.link_prediction.woverlap.overlap_w(input_graph, weights, vertex_pair=None)[source]¶
Compute the weighted Overlap Coefficient between each pair of vertices connected by an edge, or between arbitrary pairs of vertices specified by the user. Overlap Coefficient is defined between two sets as the ratio of the volume of their intersection divided by the smaller of their volumes. In the context of graphs, the neighborhood of a vertex is seen as a set. The Overlap Coefficient weight of each edge represents the strength of connection between vertices based on the relative similarity of their neighbors. If first is specified but second is not, or vice versa, an exception will be thrown.
- Parameters
- input_graphcugraph.Graph
cuGraph graph descriptor, should contain the connectivity information as an edge list (edge weights are not used for this algorithm). The adjacency list will be computed if not already present.
- weightscudf.Series
Specifies the weights to be used for each vertex.
- vertex_paircudf.DataFrame
A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the overlap coefficient is computed for the given vertex pairs, else, it is computed for all vertex pairs.
- Returns
- dfcudf.DataFrame
GPU data frame of size E (the default) or the size of the given pairs (first, second) containing the overlap coefficients. The ordering is relative to the adjacency list, or that given by the specified vertex pairs.
- df[‘source’]cudf.Series
The source vertex ID
- df[‘destination’]cudf.Series
The destination vertex ID
- df[‘overlap_coeff’]cudf.Series
The computed weighted Overlap coefficient between the source and destination vertices.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> df = cugraph.overlap_w(G, M[2])
Sampling¶
Random Walks¶
- cugraph.sampling.random_walks.random_walks(G, start_vertices, max_depth=None, use_padding=False)[source]¶
compute random walks for each nodes in ‘start_vertices’
- Parameters
- GcuGraph.Graph or networkx.Graph
The graph can be either directed (DiGraph) or undirected (Graph). Weights in the graph are ignored. Use weight parameter if weights need to be considered (currently not supported)
- start_verticesint or list or cudf.Series or cudf.DataFrame
A single node or a list or a cudf.Series of nodes from which to run the random walks. In case of multi-column vertices it should be a cudf.DataFrame
- max_depthint
The maximum depth of the random walks
- use_paddingbool
If True, padded paths are returned else coalesced paths are returned.
- Returns
- vertex_pathscudf.Series or cudf.DataFrame
Series containing the vertices of edges/paths in the random walk.
- edge_weight_paths: cudf.Series
Series containing the edge weights of edges represented by the returned vertex_paths
- sizes: int
The path size in case of coalesced paths.
- cugraph.sampling.random_walks.rw_path(num_paths, sizes)[source]¶
Retrieve more information on the obtained paths in case use_padding is False.
- Parameters
- num_paths: int
Number of paths in the random walk output.
- sizes: int
Path size returned in random walk output.
- Returns
- path_datacudf.DataFrame
Dataframe containing vetex path offsets, edge weight offsets and edge weight sizes for each path.
Traversal¶
Breadth-first-search¶
- cugraph.traversal.bfs.bfs(G, start=None, depth_limit=None, i_start=None, directed=None, return_predecessors=None)[source]¶
Find the distances and predecessors for a breadth first traversal of a graph.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information. Edge weights, if present, should be single or double precision floating point values.
- startInteger
The index of the graph vertex from which the traversal begins
- i_startInteger, optional
Identical to start, added for API compatibility. Only start or i_start can be set, not both.
- depth_limitInteger or None
Limit the depth of the search
- directedbool, optional
- NOTE
For non-Graph-type (eg. sparse matrix) values of G only. Raises TypeError if used with a Graph object.
If True (default), then convert the input matrix to a cugraph.DiGraph, otherwise a cugraph.Graph object will be used.
- Returns
- Return value type is based on the input type. If G is a cugraph.Graph,
- returns:
- cudf.DataFrame
df[‘vertex’] vertex IDs
df[‘distance’] path distance for each vertex from the starting vertex
df[‘predecessor’] for each i’th position in the column, the vertex ID immediately preceding the vertex at position i in the ‘vertex’ column
- If G is a networkx.Graph, returns:
pandas.DataFrame with contents equivalent to the cudf.DataFrame described above.
- If G is a CuPy or SciPy matrix, returns:
a 2-tuple of CuPy ndarrays (if CuPy matrix input) or Numpy ndarrays (if SciPy matrix input) representing:
- distance: cupy or numpy ndarray
ndarray of shortest distances between source and vertex.
- predecessor: cupy or numpy ndarray
ndarray of predecessors of a vertex on the path from source, which can be used to reconstruct the shortest paths.
…or if return_sp_counter is True, returns a 3-tuple with the above two arrays plus:
- sp_counter: cupy or numpy ndarray
ndarray of number of shortest paths leading to each vertex.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> df = cugraph.bfs(G, 0)
- cugraph.traversal.bfs.bfs_edges(G, source, reverse=False, depth_limit=None, sort_neighbors=None)[source]¶
Find the distances and predecessors for a breadth first traversal of a graph.
- Parameters
- Gcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix
Graph or matrix object, which should contain the connectivity information. Edge weights, if present, should be single or double precision floating point values.
- sourceInteger
The starting vertex index
- reverseboolean
If a directed graph, then process edges in a reverse direction Currently not implemented
- depth_limitInt or None
Limit the depth of the search
- sort_neighborsNone or Function
Currently not implemented
- Returns
- Return value type is based on the input type. If G is a cugraph.Graph,
- returns:
- cudf.DataFrame
df[‘vertex’] vertex IDs
df[‘distance’] path distance for each vertex from the starting vertex
df[‘predecessor’] for each i’th position in the column, the vertex ID immediately preceding the vertex at position i in the ‘vertex’ column
- If G is a networkx.Graph, returns:
pandas.DataFrame with contents equivalent to the cudf.DataFrame described above.
- If G is a CuPy or SciPy matrix, returns:
a 2-tuple of CuPy ndarrays (if CuPy matrix input) or Numpy ndarrays (if SciPy matrix input) representing:
- distance: cupy or numpy ndarray
ndarray of shortest distances between source and vertex.
- predecessor: cupy or numpy ndarray
ndarray of predecessors of a vertex on the path from source, which can be used to reconstruct the shortest paths.
…or if return_sp_counter is True, returns a 3-tuple with the above two arrays plus:
- sp_counter: cupy or numpy ndarray
ndarray of number of shortest paths leading to each vertex.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> df = cugraph.bfs_edges(G, 0)
Breadth-first-search (MG)¶
- cugraph.dask.traversal.bfs.bfs(graph, start, depth_limit=None, return_distances=True)[source]¶
Find the distances and predecessors for a breadth first traversal of a graph. The input graph must contain edge list as dask-cudf dataframe with one partition per GPU.
- Parameters
- graphcugraph.DiGraph
cuGraph graph descriptor, should contain the connectivity information as dask cudf edge list dataframe(edge weights are not used for this algorithm). Undirected Graph not currently supported.
- startInteger
Specify starting vertex for breadth-first search; this function iterates over edges in the component reachable from this node.
- depth_limitInteger or None
Limit the depth of the search
- return_distancesbool, optional, default=True
Indicates if distances should be returned
- Returns
- dfdask_cudf.DataFrame
df[‘vertex’] gives the vertex id
df[‘distance’] gives the path distance from the starting vertex (Only if return_distances is True)
df[‘predecessor’] gives the vertex it was reached from in the traversal
Examples
>>> import cugraph.dask as dcg >>> ... Init a DASK Cluster >> see https://docs.rapids.ai/api/cugraph/stable/dask-cugraph.html >>> chunksize = dcg.get_chunksize(input_data_path) >>> ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize, delimiter=' ', names=['src', 'dst', 'value'], dtype=['int32', 'int32', 'float32']) >>> dg = cugraph.DiGraph() >>> dg.from_dask_cudf_edgelist(ddf, 'src', 'dst') >>> df = dcg.bfs(dg, 0)
Single-source-shortest-path¶
- cugraph.traversal.sssp.filter_unreachable(df)[source]¶
Remove unreachable vertices from the result of SSSP or BFS
- Parameters
- dfcudf.DataFrame
cudf.DataFrame that is the output of SSSP or BFS
- Returns
- dffiltered cudf.DataFrame with only reachable vertices
df[‘vertex’][i] gives the vertex id of the i’th vertex. df[‘distance’][i] gives the path distance for the i’th vertex from the starting vertex. df[‘predecessor’][i] gives the vertex that was reached before the i’th vertex in the traversal.
- cugraph.traversal.sssp.shortest_path(G, source=None, method=None, directed=None, return_predecessors=None, unweighted=None, overwrite=None, indices=None)[source]¶
Alias for sssp(), provided for API compatibility with NetworkX. See sssp() for details.
- cugraph.traversal.sssp.shortest_path_length(G, source, target=None)[source]¶
Compute the distance from a source vertex to one or all vertexes in graph. Uses Single Source Shortest Path (SSSP).
- Parameters
- graphcuGraph.Graph, NetworkX.Graph, or CuPy sparse COO matrix
cuGraph graph descriptor with connectivity information. Edge weights, if present, should be single or double precision floating point values.
- sourceDependant on graph type. Index of the source vertex.
- If graph is an instance of cuGraph.Graph or CuPy sparse COO matrix:
int
- If graph is an instance of a NetworkX.Graph:
str
- target: Dependant on graph type. Vertex to find distance to.
- If graph is an instance of cuGraph.Graph or CuPy sparse COO matrix:
int
- If graph is an instance of a NetworkX.Graph:
str
- Returns
- Return value type is based on the input type.
- If target is None, returns:
- cudf.DataFrame
- df[‘vertex’]
vertex id
- df[‘distance’]
gives the path distance from the starting vertex
- If target is not None, returns:
Distance from source to target vertex.
- cugraph.traversal.sssp.sssp(G, source=None, method=None, directed=None, return_predecessors=None, unweighted=None, overwrite=None, indices=None)[source]¶
Compute the distance and predecessors for shortest paths from the specified source to all the vertices in the graph. The distances column will store the distance from the source to each vertex. The predecessors column will store each vertex’s predecessor in the shortest path. Vertices that are unreachable will have a distance of infinity denoted by the maximum value of the data type and the predecessor set as -1. The source vertex’s predecessor is also set to -1. Graphs with negative weight cycles are not supported.
- Parameters
- graphcugraph.Graph, networkx.Graph, CuPy or SciPy sparse matrix Graph or
matrix object, which should contain the connectivity information. Edge weights, if present, should be single or double precision floating point values.
- sourceint
Index of the source vertex.
- Returns
- Return value type is based on the input type. If G is a cugraph.Graph,
- returns:
- cudf.DataFrame
- df[‘vertex’]
vertex id
- df[‘distance’]
gives the path distance from the starting vertex
- df[‘predecessor’]
the vertex it was reached from
- If G is a networkx.Graph, returns:
pandas.DataFrame with contents equivalent to the cudf.DataFrame described above.
- If G is a CuPy or SciPy matrix, returns:
a 2-tuple of CuPy ndarrays (if CuPy matrix input) or Numpy ndarrays (if SciPy matrix input) representing:
- distance: cupy or numpy ndarray
ndarray of shortest distances between source and vertex.
- predecessor: cupy or numpy ndarray
ndarray of predecessors of a vertex on the path from source, which can be used to reconstruct the shortest paths.
Examples
>>> M = cudf.read_csv('datasets/karate.csv', delimiter=' ', >>> dtype=['int32', 'int32', 'float32'], header=None) >>> G = cugraph.Graph() >>> G.from_cudf_edgelist(M, source='0', destination='1') >>> distances = cugraph.sssp(G, 0)
Single-source-shortest-path (MG)¶
- cugraph.dask.traversal.sssp.call_sssp(sID, data, num_verts, num_edges, vertex_partition_offsets, aggregate_segment_offsets, start)[source]¶
- cugraph.dask.traversal.sssp.sssp(graph, source)[source]¶
Compute the distance and predecessors for shortest paths from the specified source to all the vertices in the graph. The distances column will store the distance from the source to each vertex. The predecessors column will store each vertex’s predecessor in the shortest path. Vertices that are unreachable will have a distance of infinity denoted by the maximum value of the data type and the predecessor set as -1. The source vertex’s predecessor is also set to -1. The input graph must contain edge list as dask-cudf dataframe with one partition per GPU.
- Parameters
- graphcugraph.DiGraph
cuGraph graph descriptor, should contain the connectivity information as dask cudf edge list dataframe. Undirected Graph not currently supported.
- sourceInteger
Specify source vertex
- Returns
- dfdask_cudf.DataFrame
df[‘vertex’] gives the vertex id
df[‘distance’] gives the path distance from the starting vertex
df[‘predecessor’] gives the vertex id it was reached from in the traversal
Examples
>>> import cugraph.dask as dcg >>> ... Init a DASK Cluster >> see https://docs.rapids.ai/api/cugraph/stable/dask-cugraph.html >>> chunksize = dcg.get_chunksize(input_data_path) >>> ddf = dask_cudf.read_csv(input_data_path, chunksize=chunksize, delimiter=' ', names=['src', 'dst', 'value'], dtype=['int32', 'int32', 'float32']) >>> dg = cugraph.DiGraph() >>> dg.from_dask_cudf_edgelist(ddf, 'src', 'dst') >>> df = dcg.sssp(dg, 0)
Traveling-salesperson-problem¶
- cugraph.traversal.traveling_salesperson.traveling_salesperson(pos_list, restarts=100000, beam_search=True, k=4, nstart=None, verbose=False)[source]¶
Finds an approximate solution to the traveling salesperson problem (TSP). cuGraph computes an approximation of the TSP problem using hill climbing optimization.
The current implementation does not support a weighted graph.
- Parameters
- pos_list: cudf.DataFrame
Data frame with initial vertex positions containing three columns: ‘vertex’ ids and ‘x’, ‘y’ positions.
- restarts: int
Number of starts to try. The more restarts, the better the solution will be approximated. The number of restarts depends on the problem size and should be kept low for instances above 2k cities.
- beam_search: bool
Specify if the initial solution should use KNN for an approximation solution.
- k: int
Beam width to use in the search.
- nstart: int
Vertex id to use as starting position.
- verbose: bool
Logs configuration and iterative improvement.
- Returns
- routecudf.Series
cudf.Series of size V containing the ordered list of vertices than needs to be visited.
Tree¶
Minimum Spanning Tree¶
- cugraph.tree.minimum_spanning_tree.minimum_spanning_tree(G, weight=None, algorithm='boruvka', ignore_nan=False)[source]¶
Returns a minimum spanning tree (MST) or forest (MSF) on an undirected graph
- Parameters
- GcuGraph.Graph or networkx.Graph
cuGraph graph descriptor with connectivity information.
- weightstring
default to the weights in the graph, if the graph edges do not have a weight attribute a default weight of 1 will be used.
- algorithmstring
Default to ‘boruvka’. The parallel algorithm to use when finding a minimum spanning tree.
- ignore_nanbool
Default to False
- Returns
- G_mstcuGraph.Graph or networkx.Graph
A graph descriptor with a minimum spanning tree or forest. The networkx graph will not have all attributes copied over
Maximum Spanning Tree¶
- cugraph.tree.minimum_spanning_tree.maximum_spanning_tree(G, weight=None, algorithm='boruvka', ignore_nan=False)[source]
Returns a maximum spanning tree (MST) or forest (MSF) on an undirected graph
- Parameters
- GcuGraph.Graph or networkx.Graph
cuGraph graph descriptor with connectivity information.
- weightstring
default to the weights in the graph, if the graph edges do not have a weight attribute a default weight of 1 will be used.
- algorithmstring
Default to ‘boruvka’. The parallel algorithm to use when finding a maximum spanning tree.
- ignore_nanbool
Default to False
- Returns
- G_mstcuGraph.Graph or networkx.Graph
A graph descriptor with a maximum spanning tree or forest. The networkx graph will not have all attributes copied over
Generator¶
RMAT¶
- cugraph.generators.rmat(scale, num_edges, a, b, c, seed, clip_and_flip, scramble_vertex_ids, create_using=<class 'cugraph.structure.graph_classes.DiGraph'>, mg=False)[source]¶
Generate a Graph object using a Recursive MATrix (R-MAT) graph generation algorithm.
- Parameters
- scaleint
Scale factor to set the number of vertices in the graph Vertex IDs have values in [0, V), where V = 1 << ‘scale’
- num_edgesint
Number of edges to generate
- afloat
Probability of the first partition
- bfloat
Probability of the second partition
- cfloat
Probability of the thrid partition
- seedint
Seed value for the random number generator
- clip_and_flipbool
Flag controlling whether to generate edges only in the lower triangular part (including the diagonal) of the graph adjacency matrix (if set to ‘true’) or not (if set to ‘false).
- scramble_vertex_idsbool
Flag controlling whether to scramble vertex ID bits (if set to true) or not (if set to false); scrambling vertex ID bits breaks correlation between vertex ID values and vertex degrees.
- create_usingcugraph Graph type or None The graph type to construct
containing the generated edges and vertices. If None is specified, the edgelist cuDF DataFrame (or dask_cudf DataFrame for MG) is returned as-is. This is useful for benchmarking Graph construction steps that require raw data that includes potential self-loops, isolated vertices, and duplicated edges. Default is cugraph.DiGraph. NOTE: only the cugraph.DiGraph type is supported for multi-GPU
- mgbool
If True, R-MAT generation occurs across multiple GPUs. If False, only a single GPU is used. Default is False (single-GPU)
- Returns
- instance of cugraph.Graph
Examples
import cugraph from cugraph.generators import rmat
- df = rmat(
scale, (2**scale)*edgefactor, 0.1, 0.2, 0.3, seed or 42, clip_and_flip=False, scramble_vertex_ids=True, create_using=None, # return edgelist instead of Graph instance mg=False
)
DASK MG Helper functions¶
- cugraph.comms.comms.initialize(comms=None, p2p=False, prows=None, pcols=None, partition_type=1)[source]¶
Initialize a communicator for multi-node/multi-gpu communications. It is expected to be called right after client initialization for running multi-GPU algorithms (this wraps raft comms that manages underlying NCCL and UCX comms handles across the workers of a Dask cluster).
It is recommended to also call destroy() when the comms are no longer needed so the underlying resources can be cleaned up.
- Parameters
- commsraft Comms
A pre-initialized raft communicator. If provided, this is used for mnmg communications. If not provided, default comms are initialized as per client information.
- p2pbool
Initialize UCX endpoints if True. Default is False.
- prowsint
Specifies the number of rows when performing a 2D partitioning of the input graph. If specified, this must be a factor of the total number of parallel processes. When specified with pcols, prows*pcols should be equal to the total number of parallel processes.
- pcolsint
Specifies the number of columns when performing a 2D partitioning of the input graph. If specified, this must be a factor of the total number of parallel processes. When specified with prows, prows*pcols should be equal to the total number of parallel processes.
- partition_typeint
Valid values are currently 1 or any int other than 1. A value of 1 (the default) represents a partitioning resulting in prows*pcols partitions. A non-1 value currently results in a partitioning of p*pcols partitions, where p is the number of GPUs.