[Bug]: Different output upon decoding invalid utf-8 #875

@vytas7

Description

Describe the bug

I'm not sure whether this is a bug or just different behaviour.
It's not critical either way, but I thought I'd file it for reference anyway.

  • When decoding invalid UTF-8 with errors='replace', CPython (and PyPy, for that matter) decodes every invalid byte as � (U+FFFD, the replacement character).
  • GraalPy, on the other hand, replaces each invalid three-byte sequence with a single �.
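
To illustrate what an invalid triplet looks like in this case, here is a minimal sketch (not part of the original report; the variable names are made up). The bytes ED AE 80 have the shape of a three-byte UTF-8 sequence, but they would encode the surrogate U+DB80, which UTF-8 does not allow. The standard 'surrogatepass' error handler makes that visible:

```python
triplet = b"\xed\xae\x80"

# Strict decoding rejects the sequence outright.
try:
    triplet.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' reveals the code point the three bytes would encode.
code_point = ord(triplet.decode("utf-8", errors="surrogatepass"))

# With 'replace', the interpreters differ in how many U+FFFD they emit
# for this one sequence (one per byte vs. one per sequence).
replaced = triplet.decode("utf-8", errors="replace")

print(strict_ok, hex(code_point), len(replaced))
```

The length printed last is the number that differs between the interpreters, so it is deliberately left without an expected value here.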

Operating system

Linux

CPU architecture

x86_64

GraalPy version

GraalPy 3.12.8 (GraalVM CE Native 25.0.2)

JDK version

No response

Context configuration

No response

Steps to reproduce

GraalPy:

$ graalpy -c "print(b'\xed\xae\x80\xed\xb0\x80'.decode(errors='replace'))"
��

CPython:

$ python -c "print(b'\xed\xae\x80\xed\xb0\x80'.decode(errors='replace'))"
������
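
Since runs of � are hard to count by eye, a small script that prints the count may make the comparison easier. This is only a sketch that should run unchanged under either interpreter; the variable names are illustrative and not taken from either code base.

```python
import sys

# Same input as in the commands above: two invalid three-byte sequences.
data = b"\xed\xae\x80\xed\xb0\x80"

decoded = data.decode("utf-8", errors="replace")
count = decoded.count("\ufffd")

# Print the interpreter name, the number of replacement characters,
# and an ASCII-safe representation of the decoded string.
# Per the outputs reported above: 6 on CPython, 2 on GraalPy 3.12.8.
print(sys.implementation.name, count, ascii(decoded))
```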

Expected behavior

GraalPy's output should match CPython's unless there is a good reason for it not to.

Stack trace

Additional context

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
