cudf.merge
- cudf.merge(left, right, *args, **kwargs)
Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.
- Parameters:
- left : Series or DataFrame
- right : DataFrame
- on : label or list, default None
Column or index level names to join on. These must be found in both DataFrames.
If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
- how : {‘left’, ‘outer’, ‘inner’, ‘leftsemi’, ‘leftanti’}, default ‘inner’
Type of merge to be performed.
left : use only keys from the left frame, similar to a SQL left outer join.
right : not supported.
outer : use the union of keys from both frames, similar to a SQL full outer join.
inner : use the intersection of keys from both frames, similar to a SQL inner join.
leftsemi : similar to an inner join, but returns only columns from the left dataframe and ignores all columns from the right dataframe.
leftanti : returns only the rows of the left dataframe that have no match in the right dataframe. This is the exact opposite of a leftsemi join.
- left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
- right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
- left_index : bool, default False
Use the index from the left DataFrame as the join key(s).
- right_index : bool, default False
Use the index from the right DataFrame as the join key(s).
- sort : bool, default False
Sort the resulting dataframe by the columns that were merged on, starting from the left.
- suffixes : Tuple[str, str], default (‘_x’, ‘_y’)
Suffixes applied to overlapping column names on the left and right sides, respectively.
- Returns:
- merged : DataFrame
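The effect of suffixes on the output column names can be sketched in plain Python. This is only an illustration of the naming rule described above, not cuDF's implementation: the helper name merged_column_names is hypothetical, it assumes the join is made with on (so key columns appear once), and the left-then-right column order it produces is an assumption.

```python
def merged_column_names(left_cols, right_cols, on, suffixes=("_x", "_y")):
    """Hypothetical helper: predict output column names for a merge on `on`.

    Columns that appear on both sides and are not join keys get a suffix;
    everything else keeps its name.
    """
    keys = set(on)
    overlap = (set(left_cols) & set(right_cols)) - keys
    # Left columns come first; overlapping ones get the left suffix.
    left_out = [c + suffixes[0] if c in overlap else c for c in left_cols]
    # Right key columns are dropped because the key is emitted only once.
    right_out = [
        c + suffixes[1] if c in overlap else c
        for c in right_cols
        if c not in keys
    ]
    return left_out + right_out


names = merged_column_names(["key", "val", "a"], ["key", "val", "b"], on=["key"])
custom = merged_column_names(
    ["key", "val"], ["key", "val"], on=["key"], suffixes=("_left", "_right")
)
```

Here "val" exists on both sides, so it is disambiguated, while "a" and "b" are unique and keep their names.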
Examples
>>> import cudf
>>> df_a = cudf.DataFrame()
>>> df_a['key'] = [0, 1, 2, 3, 4]
>>> df_a['vals_a'] = [float(i + 10) for i in range(5)]
>>> df_b = cudf.DataFrame()
>>> df_b['key'] = [1, 2, 4]
>>> df_b['vals_b'] = [float(i+10) for i in range(3)]
>>> df_merged = df_a.merge(df_b, on=['key'], how='left')
>>> df_merged.sort_values('key')
   key  vals_a  vals_b
3    0    10.0
0    1    11.0    10.0
1    2    12.0    11.0
4    3    13.0
2    4    14.0    12.0
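To make the difference between the how options concrete, the following plain-Python sketch applies each join type to the same keys and values as the example above. It is a CPU illustration of the semantics only, not how cuDF performs the join: the function merge_rows and the row representation (tuples, with None for a missing value) are assumptions made for this sketch, and its row order is deterministic whereas cuDF's is not.

```python
def merge_rows(left, right, how="inner"):
    """Join two lists of (key, value) pairs on key (illustrative only)."""
    # Index the right side by key so duplicate keys produce multiple matches.
    right_by_key = {}
    for key, rval in right:
        right_by_key.setdefault(key, []).append(rval)
    left_keys = {key for key, _ in left}

    if how in ("inner", "left", "outer"):
        rows = []
        for key, lval in left:
            matches = right_by_key.get(key)
            if matches:
                rows.extend((key, lval, rval) for rval in matches)
            elif how != "inner":
                # left/outer keep unmatched left rows with a missing right value.
                rows.append((key, lval, None))
        if how == "outer":
            # outer also keeps right rows whose key never appears on the left.
            rows.extend(
                (key, None, rval) for key, rval in right if key not in left_keys
            )
        return rows
    if how == "leftsemi":
        # Only left columns, only rows that have a match on the right.
        return [(key, lval) for key, lval in left if key in right_by_key]
    if how == "leftanti":
        # Only left columns, only rows without a match on the right.
        return [(key, lval) for key, lval in left if key not in right_by_key]
    raise ValueError(f"unsupported how: {how!r}")


left_rows = [(0, 10.0), (1, 11.0), (2, 12.0), (3, 13.0), (4, 14.0)]
right_rows = [(1, 10.0), (2, 11.0), (4, 12.0)]

left_join = merge_rows(left_rows, right_rows, how="left")
inner_join = merge_rows(left_rows, right_rows, how="inner")
semi_join = merge_rows(left_rows, right_rows, how="leftsemi")
anti_join = merge_rows(left_rows, right_rows, how="leftanti")
```

With this data, the left join keeps all five keys, the inner and leftsemi joins keep keys 1, 2 and 4, and the leftanti join keeps the complement, keys 0 and 3.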
Merging on categorical variables is only allowed in certain cases
Categorical variable typecasting logic depends on both how and the specifics of the categorical variables to be merged. Merging categorical variables when only one side is ordered is ambiguous and not allowed. Merging when both categoricals are ordered is allowed, but only when the categories are exactly equal and have equal ordering, and will result in the common dtype. When both sides are unordered, the result categorical depends on the kind of join:
- For inner joins, the result will be the intersection of the categories.
- For left or right joins, the result will be the left or right dtype respectively. This extends to semi and anti joins.
- For outer joins, the result will be the union of categories from both sides.
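The rules above can be written out as a small decision function. This is a plain-Python sketch of the stated rules, not cuDF's code: the function name merged_categories, the (categories, ordered) tuple representation, and the exception types are all assumptions made for illustration, and the order in which unordered result categories are listed carries no meaning.

```python
def merged_categories(left, right, how):
    """Return the result categories for a categorical join key.

    `left` and `right` are (categories, ordered) pairs, for example
    (("a", "b"), False). Raises TypeError for disallowed combinations.
    """
    left_cats, left_ordered = left
    right_cats, right_ordered = right

    if left_ordered != right_ordered:
        # Only one side ordered: ambiguous, so not allowed.
        raise TypeError("cannot merge ordered with unordered categoricals")

    if left_ordered:
        # Both ordered: categories and their order must match exactly.
        if tuple(left_cats) != tuple(right_cats):
            raise TypeError("ordered categoricals must have identical categories")
        return tuple(left_cats)

    # Both unordered: the result depends on the kind of join.
    if how == "inner":
        return tuple(c for c in left_cats if c in right_cats)
    if how in ("left", "leftsemi", "leftanti"):
        return tuple(left_cats)
    if how == "right":
        # Listed in the rules above even though `how` does not accept it here.
        return tuple(right_cats)
    if how == "outer":
        return tuple(left_cats) + tuple(c for c in right_cats if c not in left_cats)
    raise ValueError(f"unsupported how: {how!r}")


left_dtype = (("a", "b", "c"), False)
right_dtype = (("b", "c", "d"), False)

inner_cats = merged_categories(left_dtype, right_dtype, "inner")
outer_cats = merged_categories(left_dtype, right_dtype, "outer")
semi_cats = merged_categories(left_dtype, right_dtype, "leftsemi")
```

Treating semi and anti joins like left joins follows the statement that the left-dtype rule extends to them.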
Pandas Compatibility Note
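DataFrame merges in cuDF result in non-deterministic row ordering. When a reproducible order is needed, sort the result explicitly, as the example above does with sort_values.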