cugraph.jaccard#

cugraph.jaccard(input_graph: Graph, vertex_pair: Optional[DataFrame] = None, do_expensive_check: bool = False, use_weight: bool = False)[source]#

Compute the Jaccard similarity between each pair of vertices connected by an edge, or between arbitrary pairs of vertices specified by the user. Jaccard similarity is defined between two sets as the ratio of the volume of their intersection divided by the volume of their union. In the context of graphs, the neighborhood of a vertex is seen as a set. The Jaccard similarity weight of each edge represents the strength of connection between vertices based on the relative similarity of their neighbors. If first is specified but second is not, or vice versa, an exception will be thrown.

NOTE: If the vertex_pair parameter is not specified then the behavior of cugraph.jaccard is different from the behavior of networkx.jaccard_coefficient.

cugraph.jaccard, in the absence of a specified vertex pair list, will compute the two_hop_neighbors of the entire graph to construct a vertex pair list and will return the jaccard coefficient for those vertex pairs. This is not advisable as the vertex_pairs can grow exponentially with respect to the size of the datasets

networkx.jaccard_coefficient, in the absence of a specified vertex pair list, will return an upper triangular dense matrix, excluding the diagonal as well as vertex pairs that are directly connected by an edge in the graph, of jaccard coefficients. Technically, networkx returns a lazy iterator across this upper triangular matrix where the actual jaccard coefficient is computed when the iterator is dereferenced. Computing a dense matrix of results is not feasible if the number of vertices in the graph is large (100,000 vertices would result in 4.9 billion values in that iterator).

If your graph is small enough (or you have enough memory and patience) you can get the interesting (non-zero) values that are part of the networkx solution by doing the following:

>>> from cugraph.datasets import karate
>>> input_graph = karate.get_graph(download=True, ignore_weights=True)
>>> pairs = input_graph.get_two_hop_neighbors()
>>> df = cugraph.jaccard(input_graph, pairs)

But please remember that cugraph will fill the dataframe with the entire solution you request, so you’ll need enough memory to store the 2-hop neighborhood dataframe.

Parameters:

input_graphcugraph.Graph

cuGraph Graph instance, should contain the connectivity information as an edge list. The graph should be undirected where an undirected edge is represented by a directed edge in both direction.The adjacency list will be computed if not already present.

This implementation only supports undirected, non-multi Graphs.

vertex_paircudf.DataFrame, optional (default=None)

A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the jaccard coefficient is computed for the given vertex pairs. If the vertex_pair is not provided then the current implementation computes the jaccard coefficient for all adjacent vertices in the graph.

do_expensive_checkbool, optional (default=False)

Deprecated.

This option added a check to ensure integer vertex IDs are sequential values from 0 to V-1. That check is now redundant because cugraph unconditionally renumbers and un-renumbers integer vertex IDs for optimal performance, therefore this option is deprecated and will be removed in a future version.

use_weightbool, optional (default=False)

Flag to indicate whether to compute weighted jaccard (if use_weight==True) or un-weighted jaccard (if use_weight==False). ‘input_graph’ must be weighted if ‘use_weight=True’.

Returns:

dfcudf.DataFrame

GPU data frame of size E (the default) or the size of the given pairs (first, second) containing the Jaccard weights. The ordering is relative to the adjacency list, or that given by the specified vertex pairs.

df[‘first’]cudf.Series: The first vertex ID of each pair (will be identical to first if specified).
df[‘second’]cudf.Series: The second vertex ID of each pair (will be identical to second if specified).
df[‘jaccard_coeff’]cudf.Series: The computed Jaccard coefficient between the first and the second vertex ID.

Examples

>>> from cugraph.datasets import karate
>>> from cugraph import jaccard
>>> input_graph = karate.get_graph(download=True, ignore_weights=True)
>>> df = jaccard(input_graph)