BUG: fix empty suffix and prefix handling in pyarrow string methods #63395
base: main
Python's `str.removeprefix("")` and `str.removesuffix("")` return the
original string.
The current pyarrow-backed implementation slices with `stop=0` or
`start=0` when the prefix or suffix is empty, which can result in
unexpected behavior instead of preserving the original values.
This PR adds explicit guards for empty prefix and suffix inputs and
includes tests to ensure parity with Python semantics.
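
To make the slicing issue concrete, here is a minimal plain-Python analogue of the slice-then-select pattern described above. It is only an illustration of the arithmetic (`-len("")` is `0`), not the pyarrow code path itself; the variable names are made up for this sketch, and `str.removesuffix` requires Python 3.9 or later.

```python
# Plain-Python analogue of the unguarded pattern; no pyarrow involved.
values = ["abc", "xyz"]
suffix = ""
prefix = ""

# Suffix case: -len("") evaluates to 0, so the slice is s[0:0] (empty),
# and every string "ends with" the empty suffix, so the slice is selected.
unguarded_suffix = [s[0:-len(suffix)] if s.endswith(suffix) else s for s in values]

# Prefix case: len("") is 0, so s[0:] happens to keep the whole string.
unguarded_prefix = [s[len(prefix):] if s.startswith(prefix) else s for s in values]

# Built-in semantics: an empty suffix leaves each string unchanged.
expected = [s.removesuffix(suffix) for s in values]

print(unguarded_suffix)  # ['', '']
print(unguarded_prefix)  # ['abc', 'xyz']
print(expected)          # ['abc', 'xyz']
```

In this analogue only the suffix case loses data, which is why an explicit early return for the empty string is a simple way to restore parity.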

@vkverma9534 Thanks for the catch and for the PR! Could you add a test for this case? There is an existing test

Added explicit handling for empty prefix and suffix and added regression tests.

The Locale: it_IT failure appears to be due to an apt-get update error:
Yes, I assume you can ignore those failures; that looks like a temporary network issue.

@jorisvandenbossche I think I should open a new PR from a new branch, since my attempts to make the checks pass have only made them worse. Is that OK?

No need to do so! If you can fix the indentation in

Okay, I will try to fix them.

    def _str_removeprefix(self, prefix: str):
        if prefix == "":
            return self
        starts_with = pc.starts_with(self._pa_array, pattern=prefix)
        removed = pc.utf8_slice_codeunits(self._pa_array, len(prefix))
        result = pc.if_else(starts_with, removed, self._pa_array)
        return self._from_pyarrow_array(result)

    def _str_removesuffix(self, suffix: str):
        if suffix == "":
            return self
        ends_with = pc.ends_with(self._pa_array, pattern=suffix)
        removed = pc.utf8_slice_codeunits(self._pa_array, 0, stop=-len(suffix))
        result = pc.if_else(ends_with, removed, self._pa_array)
        return self._from_pyarrow_array(result)

    def _str_removeprefix(self, prefix: str):
        starts_with = pc.starts_with(self._pa_array, pattern=prefix)
        removed = pc.utf8_slice_codeunits(self._pa_array, len(prefix))
        result = pc.if_else(starts_with, removed, self._pa_array)
        return self._from_pyarrow_array(result)

    def _str_removesuffix(self, suffix: str):
        ends_with = pc.ends_with(self._pa_array, pattern=suffix)
        removed = pc.utf8_slice_codeunits(self._pa_array, 0, stop=-len(suffix))
        result = pc.if_else(ends_with, removed, self._pa_array)
        return self._from_pyarrow_array(result)

Suggested change:

    def _str_removeprefix(self, prefix: str):
        if prefix == "":
            return self
        starts_with = pc.starts_with(self._pa_array, pattern=prefix)
        removed = pc.utf8_slice_codeunits(self._pa_array, len(prefix))
        result = pc.if_else(starts_with, removed, self._pa_array)
        return self._from_pyarrow_array(result)

    def _str_removesuffix(self, suffix: str):
        if suffix == "":
            return self
        ends_with = pc.ends_with(self._pa_array, pattern=suffix)
        removed = pc.utf8_slice_codeunits(self._pa_array, 0, stop=-len(suffix))
        result = pc.if_else(ends_with, removed, self._pa_array)
        return self._from_pyarrow_array(result)

I think this should fix it
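
As a rough way to reason about the guarded logic, the sketch below models it element by element in plain Python and compares it against the built-in `str.removeprefix` / `str.removesuffix` (Python 3.9+). The helper names `remove_prefix_model` and `remove_suffix_model` are hypothetical and exist only for this illustration; they are not part of pandas or pyarrow, and unlike the methods above they return a new list rather than `self`.

```python
from typing import List, Optional


def remove_prefix_model(values: List[Optional[str]], prefix: str) -> List[Optional[str]]:
    # Guard: an empty prefix leaves every value unchanged.
    if prefix == "":
        return list(values)
    # Slice only where the value starts with the prefix; keep missing values as None.
    return [
        v[len(prefix):] if v is not None and v.startswith(prefix) else v
        for v in values
    ]


def remove_suffix_model(values: List[Optional[str]], suffix: str) -> List[Optional[str]]:
    # Guard: without it, v[:-len("")] would be v[:0], i.e. an empty string.
    if suffix == "":
        return list(values)
    # Slice only where the value ends with the suffix; keep missing values as None.
    return [
        v[:-len(suffix)] if v is not None and v.endswith(suffix) else v
        for v in values
    ]


data = ["foobar", "bar", None, ""]

# Compare the model with the built-in string methods for several inputs,
# including the empty prefix/suffix that the guards are meant to handle.
for affix in ["", "foo", "bar"]:
    assert remove_prefix_model(data, affix) == [
        None if v is None else v.removeprefix(affix) for v in data
    ]
    assert remove_suffix_model(data, affix) == [
        None if v is None else v.removesuffix(affix) for v in data
    ]
```

A regression test for the real methods would presumably follow the same shape: run both operations with `""` on a pyarrow-backed string array and check that the values come back unchanged.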