cudf.core.column.string.StringMethods.minhash

StringMethods.minhash(seeds: ColumnLike | None = None, width: int = 4) -> SeriesOrIndex

Compute the minhash of each string in a strings column. Hash values are computed from substrings of each string using the MurmurHash3_x86_32 algorithm, and the minimum hash value for each seed is returned.

Parameters:
seeds : ColumnLike

The seeds used for the hash algorithm. Must be of type uint32. Each output row contains one hash value per seed.

width : int

The width, in characters, of each substring to hash. Default is 4.

Examples

>>> import cudf
>>> import numpy as np
>>> str_series = cudf.Series(['this is my', 'favorite book'])
>>> seeds = cudf.Series([0], dtype=np.uint32)
>>> str_series.str.minhash(seeds)
0     [21141582]
1    [962346254]
dtype: list
>>> seeds = cudf.Series([0, 1, 2], dtype=np.uint32)
>>> str_series.str.minhash(seeds)
0    [21141582, 403093213, 1258052021]
1    [962346254, 677440381, 122618762]
dtype: list
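
To make the computation above concrete, the following is a minimal pure-Python sketch of the general minhash technique: slide a window of width characters over the string, hash every window with each seed, and keep the minimum per seed. It is not the cudf implementation. The function names murmur3_x86_32 and minhash are hypothetical helpers written for this illustration, and two details are assumptions: each window is hashed as its UTF-8 bytes, and a string shorter than width is hashed as a single substring. The values it produces are therefore not guaranteed to match the cudf output shown above.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3_x86_32 (hypothetical helper for this sketch)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def minhash(text: str, seeds, width: int = 4) -> list:
    """Return one minimum hash value per seed for a single string."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Assumption: a string shorter than `width` yields one window (the whole string).
    n_windows = max(len(text) - width + 1, 1)
    # Assumption: each window of `width` characters is hashed as UTF-8 bytes.
    windows = [text[i:i + width].encode("utf-8") for i in range(n_windows)]
    return [min(murmur3_x86_32(w, seed) for w in windows) for seed in seeds]


# One signature (list of minimum hashes) per input string, one entry per seed.
signatures = [minhash(s, [0, 1, 2]) for s in ["this is my", "favorite book"]]
```

Because each seed yields an independent hash function, two strings that share many substrings tend to agree in many positions of their signatures, which is why minhash output is commonly used to estimate set similarity.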