cugraph.sorensen_w

cugraph.sorensen_w(input_graph: Graph, weights: DataFrame = None, vertex_pair: DataFrame = None, do_expensive_check: bool = False)

Compute the weighted Sorensen similarity between each pair of vertices connected by an edge, or between arbitrary pairs of vertices specified by the user. The Sorensen coefficient between two sets is defined as twice the volume of their intersection divided by the sum of the volumes of the two sets.
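The definition above can be sketched in plain Python. This is a minimal illustration rather than cuGraph's implementation: it assumes that the sets being compared are the neighbor sets of the two vertices, that the volume of a set is the sum of the weights of its members, and that the denominator is the sum of the two set volumes. The helper name `weighted_sorensen` and the toy data are hypothetical.

```python
def weighted_sorensen(neighbors_u, neighbors_v, weight):
    """Weighted Sorensen coefficient of two neighbor sets.

    `weight` maps a vertex ID to its weight. The volume of a set is
    assumed to be the sum of the weights of its members.
    """
    def volume(vertices):
        return sum(weight[v] for v in vertices)

    denominator = volume(neighbors_u) + volume(neighbors_v)
    if denominator == 0:
        # Assumption for this sketch: empty or zero-weight sets score 0.
        return 0.0
    return 2.0 * volume(neighbors_u & neighbors_v) / denominator


# Hypothetical toy data: vertex weights and the neighbor sets of two vertices.
weight = {0: 1.0, 1: 1.0, 2: 2.0, 3: 0.5, 4: 1.5}
n0 = {1, 2, 3}  # volume 3.5
n1 = {0, 2, 4}  # volume 4.5
score = weighted_sorensen(n0, n1, weight)  # 2 * 2.0 / (3.5 + 4.5)
```

With every weight set to 1, the same function reduces to the unweighted Sorensen coefficient, 2|A ∩ B| / (|A| + |B|).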

NOTE: This algorithm does not currently support datasets whose vertices are not numbered (or renumbered) from 0 to V-1, where V is the total number of vertices, because gaps in the ID range create isolated vertices.
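To illustrate what numbering vertices from 0 to V-1 means, the sketch below remaps arbitrary vertex IDs in an edge list to contiguous integers. It is a plain-Python illustration only, not cuGraph's renumbering code; the helper name `renumber_edges` and the sample edges are hypothetical.

```python
def renumber_edges(edges):
    """Map arbitrary vertex IDs to contiguous integers 0..V-1.

    Returns the renumbered edge list and a list that maps each new ID
    (its index) back to the original vertex ID.
    """
    originals = sorted({v for edge in edges for v in edge})
    new_id = {orig: i for i, orig in enumerate(originals)}
    renumbered = [(new_id[src], new_id[dst]) for src, dst in edges]
    return renumbered, originals


# IDs 10, 20 and 40 leave gaps; used directly as indices into 0..40 they
# would imply many unused (isolated) vertices.
edges = [(10, 20), (20, 40), (40, 10)]
renumbered, originals = renumber_edges(edges)
```

Indexing `originals` with a renumbered ID recovers the original vertex ID, so results can be mapped back after the computation.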

Parameters:

input_graph : cugraph.Graph

A cuGraph Graph instance. It should contain the connectivity information as an edge list (edge weights are not used by this algorithm). The adjacency list will be computed if not already present.

weights : cudf.DataFrame

Specifies the weight to be used for each vertex. A multi-column vertex should be represented by multiple columns.

weights['vertex'] : cudf.Series
    Contains the vertex identifiers.
weights['weight'] : cudf.Series
    Contains the weights of the vertices.

vertex_pair : cudf.DataFrame, optional (default=None)

A GPU dataframe consisting of two columns representing pairs of vertices. If provided, the Sorensen coefficient is computed for the given vertex pairs; otherwise, it is computed for all pairs of vertices connected by an edge.

do_expensive_check : bool, optional (default=False)

Deprecated. This option added a check to ensure that integer vertex IDs are sequential values from 0 to V-1. That check is now redundant because cuGraph unconditionally renumbers and un-renumbers integer vertex IDs for optimal performance. This option will therefore be removed in a future version.

Returns:

df : cudf.DataFrame

A GPU dataframe containing the weighted Sorensen coefficients. By default it has size E, where E is the number of edges; when vertex pairs (first, second) are given, it has one row per pair. The ordering follows the adjacency list, or the order of the specified vertex pairs.

df['first'] : cudf.Series
    The first vertex ID of each pair.
df['second'] : cudf.Series
    The second vertex ID of each pair.
df['sorensen_coeff'] : cudf.Series
    The computed weighted Sorensen coefficient between the first and the second vertex ID.

Examples

>>> import random
>>> import cudf
>>> import cugraph
>>> from cugraph.datasets import karate
>>> G = karate.get_graph(download=True)
>>> # Create a dataframe containing the vertices with their
>>> # corresponding weights
>>> weights = cudf.DataFrame()
>>> # Sample 10 random vertices from the graph and drop any duplicates
>>> # so that no vertex appears with two different weight values
>>> # in the 'weights' dataframe
>>> weights['vertex'] = G.nodes().sample(n=10).drop_duplicates()
>>> # Reset the indices and drop the index column
>>> weights.reset_index(inplace=True, drop=True)
>>> # Create a weight column with random weights
>>> weights['weight'] = [random.random() for _ in range(
...                      len(weights['vertex']))]
>>> df = cugraph.sorensen_w(G, weights)