pyspark.pandas.DataFrame.update

DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')

Modify in place using non-NA values from another DataFrame. Aligns on indices. There is no return value.

Note

When errors='raise', this method forces materialization to check for overlapping non-NA data, which may impact performance on large datasets.

Parameters
other : DataFrame or Series
join : ‘left’, default ‘left’

Only left join is implemented, keeping the index and columns of the original object.

overwrite : bool, default True

How to handle non-NA values for overlapping keys:

  • True: overwrite original DataFrame’s values with values from other.

  • False: only update values that are NA in the original DataFrame.

filter_func : callable(1d-array) -> bool 1d-array, optional

Can be used to replace values other than NA. It is applied to the original DataFrame’s values and should return True for the values that should be updated.

errors : {‘ignore’, ‘raise’}, default ‘ignore’

If ‘raise’, a ValueError is raised when the DataFrame and other both contain non-NA data in the same place.
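The overwrite and filter_func rules described above can be pictured as a per-value decision. The following is a minimal pure-Python sketch of the documented behavior for a single column, not the pyspark implementation. The helper name `updated_column` is hypothetical, both lists are assumed to be aligned on the index already, None stands in for NA, and the sketch assumes (as in pandas) that filter_func, when given, is used instead of overwrite.

```python
def updated_column(original, other, overwrite=True, filter_func=None):
    """Model of the documented update rule for one index-aligned column.

    Hypothetical helper: None stands in for NA, and both lists are assumed
    to be aligned on the index already (left join on the original).
    """
    if filter_func is not None:
        # Assumption (pandas-like): filter_func decides which original
        # values may be replaced, and overwrite is not consulted.
        replaceable = list(filter_func(original))
    elif overwrite:
        # Every original value may be replaced.
        replaceable = [True] * len(original)
    else:
        # Only NA originals may be replaced.
        replaceable = [value is None for value in original]

    # A None (NA) in `other` never replaces anything.
    return [
        new if (ok and new is not None) else old
        for old, new, ok in zip(original, other, replaceable)
    ]


# overwrite=True: NA in other leaves the original value in place.
print(updated_column([400, 500, 600], [4, None, 6]))
# overwrite=False: only the NA original is filled.
print(updated_column([400, None, 600], [4, 5, 6], overwrite=False))
# filter_func: only originals greater than 450 are replaced.
print(updated_column([400, 500, 600], [4, 5, 6],
                     filter_func=lambda col: [x > 450 for x in col]))
```

The three calls cover the default case, overwrite=False, and filter_func in turn. How overwrite and filter_func combine in pyspark itself is not stated here, so treat that branch as an assumption.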

Returns
None : method directly changes the calling object
Raises
ValueError

If errors='raise' and overlapping non-NA data is detected, or if errors is neither ‘ignore’ nor ‘raise’.
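What counts as overlapping non-NA data can be sketched for a single index-aligned column. This is an illustrative pure-Python model rather than the pyspark code path: `check_no_overlap` is a hypothetical name, None stands in for NA, and the error message text is made up for the example.

```python
def check_no_overlap(original, other):
    """Raise ValueError if both columns hold a non-NA value at one position.

    Hypothetical helper modelling errors='raise' for one index-aligned
    column; None stands in for NA.
    """
    for position, (old, new) in enumerate(zip(original, other)):
        if old is not None and new is not None:
            # Illustrative message, not the library's wording.
            raise ValueError(f"overlapping non-NA data at position {position}")


# No overlap: each position has a value on at most one side.
check_no_overlap([1, None, 3], [None, 5, None])

# Overlap at position 0: both sides are non-NA, so the check raises.
try:
    check_no_overlap([1, None, 3], [4, 5, None])
except ValueError as exc:
    print(exc)
```

Because such a check has to inspect the actual values on both sides, it is consistent with the note that errors='raise' forces materialization.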

See also

DataFrame.merge

For column(s)-on-column(s) operations.

DataFrame.join

Join columns of another DataFrame.

DataFrame.hint

Specifies some hint on the current DataFrame.

broadcast

Marks a DataFrame as small enough for use in broadcast joins.

Examples

>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': [4, 5, 6], 'C': [7, 8, 9]}, columns=['B', 'C'])
>>> df.update(new_df)
>>> df.sort_index()
   A  B
0  1  4
1  2  5
2  3  6

The update does not increase the DataFrame’s length; only values at matching index/column labels are updated.

>>> df = ps.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']}, columns=['B'])
>>> df.update(new_df)
>>> df.sort_index()
   A  B
0  a  d
1  b  e
2  c  f
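The left-join alignment in the example above can be modelled with plain dictionaries keyed by index label. This is a hedged sketch of the alignment idea only, not how pyspark performs the join; `update_aligned` is a hypothetical helper and None stands in for NA.

```python
def update_aligned(original, other):
    """Update `original` from `other`, keeping only the original's labels.

    Hypothetical helper: both arguments map index label -> value for one
    column, and None stands in for NA.
    """
    return {
        # Labels that exist only in `other` are never added (left join).
        label: other[label] if other.get(label) is not None else value
        for label, value in original.items()
    }


column_b = {0: 'x', 1: 'y', 2: 'z'}

# `other` has six labels, but only labels 0..2 exist in the original.
longer = dict(enumerate(['d', 'e', 'f', 'g', 'h', 'i']))
print(update_aligned(column_b, longer))   # {0: 'd', 1: 'e', 2: 'f'}

# `other` covers only labels 0 and 2, so label 1 keeps its value.
partial = {0: 'd', 2: 'e'}
print(update_aligned(column_b, partial))  # {0: 'd', 1: 'y', 2: 'e'}
```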

When other is a Series, its name attribute must be set.

>>> df = ps.DataFrame({'A': ['a', 'b', 'c'], 'B': ['x', 'y', 'z']}, columns=['A', 'B'])
>>> new_column = ps.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df.sort_index()
   A  B
0  a  d
1  b  y
2  c  e

If other contains None, the corresponding values are not updated in the original DataFrame.

>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]}, columns=['A', 'B'])
>>> new_df = ps.DataFrame({'B': [4, None, 6]}, columns=['B'])
>>> df.update(new_df)
>>> df.sort_index()
   A      B
0  1    4.0
1  2  500.0
2  3    6.0

Using filter_func to selectively update values:

>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]})
>>> new_df = ps.DataFrame({'B': [4, 5, 6]})
>>> df.update(new_df, filter_func=lambda x: x > 450)
>>> df.sort_index()
   A    B
0  1  400
1  2    5
2  3    6