
Parsing JSON column with integers fails if it was parsed with rapidjson on server side #712

@kfrydel

Description


Describe the bug

Selecting a JSON column value fails with StreamFailureError: unrecognized data found in stream, or crashes with a segfault (SIGSEGV), if the value was stored on the server side using rapidjson (allow_simdjson='false').

This happens for numeric values in the JSON. Their inferred type is Int64 when parsed with simdjson but UInt64 when parsed with rapidjson, which may be the cause of the failure.

Steps to reproduce

  1. Create a table with a JSON column.
  2. Insert a row whose JSON contains a number, for example {"a": 100}. The value must be parsed from a string with allow_simdjson='false'.
  3. Run a SELECT query against the table.

The script below reproduces the issue. For the simple case {"a": 100} the library raises StreamFailureError, and for a more complex payload such as {"key": "value", "nested": {"a": 1, "b": [1, 2, 3]}} it segfaults. With CLICKHOUSE_CONNECT_USE_C=0 in the environment it raises the exception in both cases.

Example output:

=== simple payload: {"a": 100} ===
  [ simdjson] inferred types    : {'a': 'Int64'}
  [ simdjson] clickhouse_connect: [{'id': 1, 'j': {'a': 100}}]
  [rapidjson] inferred types    : {'a': 'UInt64'}
  [rapidjson] clickhouse_connect: RAISED StreamFailureError: unrecognized data found in stream: `0000000164000000000000000000000000000000`

=== complex payload: {"key": "value", "nested": {"a": 1, "b": [1, 2, 3]}} ===
  [ simdjson] inferred types    : {'key': 'String', 'nested.a': 'Int64', 'nested.b': 'Array(Nullable(Int64))'}
  [ simdjson] clickhouse_connect: [{'id': 1, 'j': {'key': 'value', 'nested': {'a': 1, 'b': [1, 2, 3]}}}]
  [rapidjson] inferred types    : {'key': 'String', 'nested.a': 'UInt64', 'nested.b': 'Array(Nullable(UInt64))'}

(the last line is not printed because the process segfaults)
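For what it's worth, the hex dump in the StreamFailureError above looks consistent with a serialized UInt64: interpreting bytes 4..12 of the unrecognized stream as a little-endian unsigned 64-bit integer yields 100, the inserted value. This is a speculative reading (the offsets are guessed, not taken from the ClickHouse binary JSON format):

```python
import struct

# Hex dump reported by StreamFailureError for the {"a": 100} case.
raw = bytes.fromhex("0000000164000000000000000000000000000000")

# Decode bytes 4..12 as a little-endian unsigned 64-bit integer.
# Speculative interpretation: the offset is a guess, not taken from
# the ClickHouse binary JSON specification.
(value,) = struct.unpack_from("<Q", raw, 4)
print(value)  # → 100
```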

Expected behaviour

clickhouse-connect parses the server response correctly and returns the JSON value.

Code example

import clickhouse_connect
from clickhouse_connect.driver.client import Client

HOST = "localhost"
HTTP_PORT = 8123
USER = "default"
PASSWORD = "password"
DB = "test_rapidjson_bug"

CASES = [
    ("simple payload", '{"a": 100}'),
    ("complex payload", '{"key": "value", "nested": {"a": 1, "b": [1, 2, 3]}}'),
]


def run_case(client: Client, label: str, value: str):
    print(f"=== {label}: {value} ===")
    for parser, settings in [
        ("simdjson", None),
        ("rapidjson", {"allow_simdjson": "false"}),
    ]:
        table = f"t_{parser}"
        client.command(f"DROP TABLE IF EXISTS {table}")
        client.command(f"CREATE TABLE {table} (id Int32, j JSON) ENGINE = MergeTree ORDER BY id")
        client.command(f"INSERT INTO {table} VALUES (1, '{value}')", settings=settings)

        paths = client.query(f"SELECT JSONAllPathsWithTypes(j) FROM {table}").first_row[0]
        print(f"  [{parser:>9}] inferred types    : {paths}", flush=True)
        try:
            rows = list(client.query(f"SELECT * FROM {table}").named_results())
            print(f"  [{parser:>9}] clickhouse_connect: {rows}", flush=True)
        except Exception as exc:
            print(f"  [{parser:>9}] clickhouse_connect: RAISED {type(exc).__name__}: {exc}", flush=True)
    print(flush=True)


def main():
    ch_client_default = clickhouse_connect.get_client(host=HOST, port=HTTP_PORT, username=USER, password=PASSWORD)
    ch_client_default.command(f"DROP DATABASE IF EXISTS {DB}")
    ch_client_default.command(f"CREATE DATABASE {DB}")

    client = clickhouse_connect.get_client(host=HOST, port=HTTP_PORT, username=USER, password=PASSWORD, database=DB)
    try:
        for label, value in CASES:
            run_case(client, label, value)
    finally:
        ch_client_default.command(f"DROP DATABASE IF EXISTS {DB}")


if __name__ == "__main__":
    main()
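A possible workaround until this is fixed: select the column through ClickHouse's toJSONString() function, which returns a plain String column and so bypasses the binary JSON deserialization path in clickhouse-connect; whether this actually avoids the crash is an assumption, not something verified here. The rows can then be decoded client-side:

```python
import json

# Hypothetical workaround: have the server render the JSON column as text,
# then parse it in Python. `client` and the table name are as in the
# reproduction script above; the query is commented out because it needs
# a live server.
# rows = client.query("SELECT id, toJSONString(j) AS j FROM t_rapidjson").named_results()
# decoded = [{**row, "j": json.loads(row["j"])} for row in rows]

# Self-contained demonstration of the client-side decoding step, using the
# string the server would return for the simple payload:
row = {"id": 1, "j": '{"a": 100}'}
decoded = {**row, "j": json.loads(row["j"])}
print(decoded)  # → {'id': 1, 'j': {'a': 100}}
```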

Configuration

Environment

  • clickhouse-connect version: 0.15.1
  • Python version: 3.13.11
  • Operating system: Ubuntu 24.04.4 LTS

ClickHouse server

  • ClickHouse Server version: 26.3.4.11
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved: above script contains everything
