pyspark.pandas.DataFrame.update
- DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')
Modify in place using non-NA values from another DataFrame. Aligns on indices. There is no return value.
Note
When errors='raise', this method forces materialization to check for overlapping non-NA data, which may impact performance on large datasets.
- Parameters
- other : DataFrame or Series
- join : ‘left’, default ‘left’
Only left join is implemented, keeping the index and columns of the original object.
- overwrite : bool, default True
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
- filter_func : callable(1d-array) -> bool 1d-array, optional
Selects which values to replace, including values other than NA. It is applied to the original DataFrame’s values and should return True for the values that should be updated.
- errors : {‘ignore’, ‘raise’}, default ‘ignore’
If ‘raise’, a ValueError is raised when the DataFrame and other both contain non-NA data in the same place.
- Returns
- None : the method changes the calling object directly.
- Raises
- ValueError
If errors=‘raise’ and overlapping non-NA data is detected, or if errors is neither ‘ignore’ nor ‘raise’.
See also
DataFrame.merge : For column(s)-on-column(s) operations.
DataFrame.join : Join columns of another DataFrame.
DataFrame.hint : Specifies some hint on the current DataFrame.
broadcast : Marks a DataFrame as small enough for use in broadcast joins.
Examples
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': [4, 5, 6], 'C': [7, 8, 9]}, columns=['B', 'C'])
>>> df.update(new_df)
>>> df.sort_index()
   A  B
0  1  4
1  2  5
2  3  6
The DataFrame’s length does not increase because of the update; only values at matching index/column labels are updated.
>>> df = ps.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']}, columns=['B'])
>>> df.update(new_df)
>>> df.sort_index()
   A  B
0  a  d
1  b  e
2  c  f
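The left-join alignment shown above can be modeled in plain Python, without a Spark session. This is only an illustrative sketch: `left_update` is a hypothetical helper, not part of pyspark.pandas, and a column is represented as a dict keyed by index label. Unlike the real method, it returns a copy instead of modifying its input.

```python
def left_update(original, other):
    """Return a copy of `original` updated from `other` on matching labels.

    Both arguments are dicts mapping index label -> value for one column.
    Hypothetical helper for illustration; not a pyspark.pandas API.
    """
    result = dict(original)
    for label, value in other.items():
        # Left join: labels absent from the original are ignored, and
        # None (standing in for NA) never replaces an existing value.
        if label in result and value is not None:
            result[label] = value
    return result


column_b = {0: 'x', 1: 'y', 2: 'z'}
longer = {0: 'd', 1: 'e', 2: 'f', 3: 'g', 4: 'h', 5: 'i'}
updated = left_update(column_b, longer)
print(updated)  # {0: 'd', 1: 'e', 2: 'f'}
```

Labels 3 to 5 exist only in `longer`, so they are dropped and the result keeps the three rows of the original column.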
If other is a Series, its name attribute must be set.
>>> df = ps.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']}, columns=['A', 'B'])
>>> new_column = ps.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df.sort_index()
   A  B
0  a  d
1  b  y
2  c  e
If other contains None, the corresponding values are not updated in the original DataFrame.
>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': [4, None, 6]}, columns=['B'])
>>> df.update(new_df)
>>> df.sort_index()
   A      B
0  1    4.0
1  2  500.0
2  3    6.0
Using filter_func to selectively update values:
>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]})
>>> new_df = ps.DataFrame({'B': [4, 5, 6]})
>>> df.update(new_df, filter_func=lambda x: x > 450)
>>> df.sort_index()
   A    B
0  1  400
1  2    5
2  3    6
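To see how overwrite, filter_func, and errors combine, the rule for a single cell can be sketched in plain Python. This is a simplified model, not the library's implementation: `resolve_cell` is a hypothetical helper, None stands in for NA, and filter_func receives a scalar rather than a 1d-array. It also assumes that filter_func, when given, takes precedence over overwrite, as in pandas; the text above does not state how the two interact in pyspark.pandas.

```python
def resolve_cell(original, new, overwrite=True, filter_func=None, errors='ignore'):
    """Return the value one cell would hold after an update.

    Hypothetical helper for illustration; not a pyspark.pandas API.
    """
    if errors not in ('ignore', 'raise'):
        raise ValueError("errors must be 'ignore' or 'raise'")
    if errors == 'raise' and original is not None and new is not None:
        # Both sides hold non-NA data in the same place.
        raise ValueError("overlapping non-NA data")
    if new is None:
        # NA values in `other` never update the original.
        return original
    if filter_func is not None:
        # Assumption: filter_func alone decides which cells are replaced.
        return new if filter_func(original) else original
    if overwrite:
        return new
    # overwrite=False: fill only cells that are NA in the original.
    return new if original is None else original


print(resolve_cell(500, 5))                                   # 5
print(resolve_cell(500, 5, overwrite=False))                  # 500
print(resolve_cell(None, 5, overwrite=False))                 # 5
print(resolve_cell(400, 4, filter_func=lambda x: x > 450))    # 400
```

The last call matches the filter_func example above: 400 fails the x > 450 test, so it is left unchanged.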