pyspark.pandas.DataFrame.update

DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')

Modify in place using non-NA values from another DataFrame. Aligns on indices. There is no return value.

Note

When errors='raise', this method forces materialization to check for overlapping non-NA data, which may impact performance on large datasets.

Parameters

other : DataFrame or Series

join : ‘left’, default ‘left’

Only left join is implemented, keeping the index and columns of the original object.

overwrite : bool, default True

How to handle non-NA values for overlapping keys:

True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
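The two overwrite modes can be read as a per-cell decision. The following is a minimal pure-Python sketch of that rule, not the library's implementation; the helper name `updated_cell` is hypothetical, and `None` stands in for NA.

```python
from typing import List, Optional


def updated_cell(original: Optional[int], other: Optional[int],
                 overwrite: bool = True) -> Optional[int]:
    """Hypothetical per-cell model of the overwrite rule (illustration only)."""
    if other is None:
        # An NA value in `other` never replaces anything.
        return original
    if overwrite or original is None:
        # overwrite=True always takes the non-NA value from `other`;
        # overwrite=False takes it only where the original is NA.
        return other
    return original


original: List[Optional[int]] = [1, None, 3]
other: List[Optional[int]] = [10, 20, None]

overwritten = [updated_cell(a, b, overwrite=True) for a, b in zip(original, other)]
filled_only = [updated_cell(a, b, overwrite=False) for a, b in zip(original, other)]

print(overwritten)  # [10, 20, 3]
print(filled_only)  # [1, 20, 3]
```

With overwrite=False the call behaves like a fill of missing values: existing non-NA data in the original is left alone.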

filter_func : callable(1d-array) -> bool 1d-array, optional

Can choose to replace values other than NA. Return True for values which should be updated. Applied to original DataFrame’s values.

errors : {‘ignore’, ‘raise’}, default ‘ignore’

If ‘raise’, will raise a ValueError if the DataFrame and other both contain non-NA data in the same place.

Returns

None : method directly changes calling object

Raises

ValueError: If errors=‘raise’ and overlapping non-NA data is detected, or if errors is neither ‘ignore’ nor ‘raise’.
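The two ValueError conditions can be sketched in plain Python. This is an illustrative model only, assuming `None` represents NA and positions are already aligned; the helper names `has_overlap` and `check_errors` and the error messages are hypothetical, not taken from the library.

```python
from typing import List, Optional


def has_overlap(original: List[Optional[int]], other: List[Optional[int]]) -> bool:
    """True if both sequences hold non-NA data at the same position."""
    return any(a is not None and b is not None for a, b in zip(original, other))


def check_errors(original: List[Optional[int]], other: List[Optional[int]],
                 errors: str = "ignore") -> None:
    """Hypothetical model of the validation described under Raises."""
    if errors not in ("ignore", "raise"):
        # Condition 2: an unrecognised value for `errors`.
        raise ValueError("errors must be either 'ignore' or 'raise'")
    if errors == "raise" and has_overlap(original, other):
        # Condition 1: overlapping non-NA data while errors='raise'.
        raise ValueError("Data overlaps.")


# No overlap: each position is non-NA on at most one side, so nothing is raised.
check_errors([1, None], [None, 2], errors="raise")

# Overlap at position 0 is tolerated under the default errors='ignore'.
check_errors([1, None], [5, None])
```

The overlap check has to inspect every aligned pair of values, which is consistent with the note above that errors='raise' forces materialization.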

See also

DataFrame.merge: For column(s)-on-column(s) operations.
DataFrame.join: Join columns of another DataFrame.
DataFrame.hint: Specifies some hint on the current DataFrame.
broadcast: Marks a DataFrame as small enough for use in broadcast joins.

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': [4, 5, 6], 'C': [7, 8, 9]}, columns=['B', 'C'])
>>> df.update(new_df)
>>> df.sort_index()
   A  B
0  1  4
1  2  5
2  3  6

The DataFrame’s length does not increase as a result of the update; only values at matching index/column labels are updated.

>>> df = ps.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']}, columns=['B'])
>>> df.update(new_df)
>>> df.sort_index()
   A  B
0  a  d
1  b  e
2  c  f

If other is a Series, its name attribute must be set.

>>> df = ps.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']}, columns=['A', 'B'])
>>> new_column = ps.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df.sort_index()
   A  B
0  a  d
1  b  y
2  c  e

If other contains None, the corresponding values are not updated in the original DataFrame.

>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': [4, None, 6]}, columns=['B'])
>>> df.update(new_df)
>>> df.sort_index()
   A      B
0  1    4.0
1  2  500.0
2  3    6.0

Using filter_func to selectively update values:

>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]})
>>> new_df = ps.DataFrame({'B': [4, 5, 6]})
>>> df.update(new_df, filter_func=lambda x: x > 450)
>>> df.sort_index()
   A    B
0  1  400
1  2    5
2  3    6