cudf.core.column.string.StringMethods.minhash
- StringMethods.minhash(seeds: ColumnLike | None = None, width: int = 4) -> SeriesOrIndex
Compute the minhash of a strings column. This uses the MurmurHash3_x86_32 algorithm for the hash function.
- Parameters:
- seeds : ColumnLike
The seeds used for the hash algorithm. Must be of type uint32.
- width : int
The width of the substring to hash. Default is 4 characters.
Examples
>>> import cudf
>>> import numpy as np
>>> str_series = cudf.Series(['this is my', 'favorite book'])
>>> seeds = cudf.Series([0], dtype=np.uint32)
>>> str_series.str.minhash(seeds)
0     [21141582]
1    [962346254]
dtype: list
>>> seeds = cudf.Series([0, 1, 2], dtype=np.uint32)
>>> str_series.str.minhash(seeds)
0    [21141582, 403093213, 1258052021]
1     [962346254, 677440381, 122618762]
dtype: list
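The description above can be illustrated with a CPU-only sketch of the general minhash idea: for each seed, hash every `width`-character substring and keep the smallest hash. The helper names `murmur3_x86_32` and `minhash_sketch` are hypothetical and are not part of cuDF. The sketch also assumes that substrings are hashed as UTF-8 bytes and that a string shorter than `width` is hashed whole. These details may differ from what cuDF does internally, so the values are not guaranteed to match the output shown in the examples.

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3_x86_32 over a byte string; returns a uint32."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        # 32-bit left rotation
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask

    # Body: mix each complete 4-byte little-endian block into the state.
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def minhash_sketch(text: str, seeds, width: int = 4) -> list:
    """Illustrative minhash: one minimum hash per seed over sliding substrings.

    Assumption (not confirmed by the cuDF docs): a string shorter than
    `width` is treated as a single substring.
    """
    n_windows = max(len(text) - width + 1, 1)
    windows = [text[i:i + width].encode("utf-8") for i in range(n_windows)]
    return [min(murmur3_x86_32(w, seed) for w in windows) for seed in seeds]


if __name__ == "__main__":
    # One list of minimum hashes per input string, one entry per seed.
    for s in ["this is my", "favorite book"]:
        print(s, minhash_sketch(s, seeds=[0, 1, 2]))
```

Each input string yields a list with one entry per seed, which mirrors the list-typed result in the examples. Two strings that share many substrings tend to agree on more of these per-seed minima, which is what makes the signatures useful for approximate similarity comparisons.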