Describe the bug
Selecting a JSON column value fails with StreamFailureError: unrecognized data found in stream, or crashes with a segmentation fault (SIGSEGV), if the value was stored on the server side using rapidjson (allow_simdjson='false').
It happens for number values in JSON. Their inferred type is Int64 when parsed with simdjson and UInt64 when parsed with rapidjson, which may be the reason for the failure.
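The type difference can be isolated by comparing the inferred-type maps the two parsers report for the same payload. The sketch below uses a hypothetical helper (diff_inferred_types is not part of clickhouse-connect) on the maps copied from the example output in this report. It only confirms that the differing paths differ in signedness, not that this is the cause of the failure.

```python
# Hypothetical helper, not part of clickhouse-connect: compare the type maps
# returned by JSONAllPathsWithTypes(j) for the two parsers.
def diff_inferred_types(left: dict, right: dict) -> dict:
    """Return {path: (left_type, right_type)} for every path whose type differs."""
    paths = sorted(set(left) | set(right))
    return {p: (left.get(p), right.get(p)) for p in paths if left.get(p) != right.get(p)}


# Values copied from the example output for the complex payload.
simdjson_types = {"key": "String", "nested.a": "Int64", "nested.b": "Array(Nullable(Int64))"}
rapidjson_types = {"key": "String", "nested.a": "UInt64", "nested.b": "Array(Nullable(UInt64))"}

for path, (simd, rapid) in diff_inferred_types(simdjson_types, rapidjson_types).items():
    print(f"{path}: simdjson={simd} rapidjson={rapid}")
```

Only the numeric paths show up in the diff; the String path is identical for both parsers.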
Steps to reproduce
- Create a table with a JSON column.
- Insert a row with JSON containing a number, for example {"a": 100}. The value must be parsed from a string with allow_simdjson='false'.
- Run a SELECT query on the table.
The script below reproduces the issue. For the simple case, {'a': 100}, the library raises StreamFailureError; for a more complex value such as {'key': 'value', 'nested': {'a': 1, 'b': [1, 2, 3]}}, the process crashes with SIGSEGV. With CLICKHOUSE_CONNECT_USE_C=0 set in the environment, it raises the exception in both cases.
Example output:
=== simple payload: {"a": 100} ===
[ simdjson] inferred types : {'a': 'Int64'}
[ simdjson] clickhouse_connect: [{'id': 1, 'j': {'a': 100}}]
[rapidjson] inferred types : {'a': 'UInt64'}
[rapidjson] clickhouse_connect: RAISED StreamFailureError: unrecognized data found in stream: `0000000164000000000000000000000000000000`
=== complex payload: {"key": "value", "nested": {"a": 1, "b": [1, 2, 3]}} ===
[ simdjson] inferred types : {'key': 'String', 'nested.a': 'Int64', 'nested.b': 'Array(Nullable(Int64))'}
[ simdjson] clickhouse_connect: [{'id': 1, 'j': {'key': 'value', 'nested': {'a': 1, 'b': [1, 2, 3]}}}]
[rapidjson] inferred types : {'key': 'String', 'nested.a': 'UInt64', 'nested.b': 'Array(Nullable(UInt64))'}
(the last line is not printed because the process crashes with SIGSEGV)
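The bytes quoted in the StreamFailureError message above can be inspected directly. This is a minimal sketch, assuming the hex string in the message is a plain dump of the unread bytes. It shows that the inserted value 100 (0x64) appears in the leftover data and that the eight bytes starting there decode to 100 as a little-endian integer, whether read as signed or unsigned. What the bytes before it mean is not established here.

```python
# Hex string copied from the StreamFailureError message in the example output.
leftover = bytes.fromhex("0000000164000000000000000000000000000000")

# Find the first byte equal to 0x64 (decimal 100, the value inserted as {"a": 100}).
offset = leftover.index(0x64)

# Decode the 8 bytes starting there as a little-endian 64-bit integer.
chunk = leftover[offset:offset + 8]
as_unsigned = int.from_bytes(chunk, "little", signed=False)
as_signed = int.from_bytes(chunk, "little", signed=True)

print(f"0x64 found at offset {offset}; UInt64={as_unsigned}, Int64={as_signed}")
```

Since 100 has the same byte representation as Int64 and UInt64, the value bytes themselves do not distinguish the two types; any mismatch would have to come from how the surrounding structure is read.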
Expected behaviour
clickhouse-connect properly parses the response from the server and returns the JSON value.
Code example
import clickhouse_connect
from clickhouse_connect.driver.client import Client

HOST = "localhost"
HTTP_PORT = 8123
USER = "default"
PASSWORD = "password"
DB = "test_rapidjson_bug"

CASES = [
    ("simple payload", '{"a": 100}'),
    ("complex payload", '{"key": "value", "nested": {"a": 1, "b": [1, 2, 3]}}'),
]


def run_case(client: Client, label: str, value: str):
    print(f"=== {label}: {value} ===")
    # Insert the same payload twice: once with the default parser (simdjson)
    # and once with simdjson disabled so the server falls back to rapidjson.
    for parser, settings in [
        ("simdjson", None),
        ("rapidjson", {"allow_simdjson": "false"}),
    ]:
        table = f"t_{parser}"
        client.command(f"DROP TABLE IF EXISTS {table}")
        client.command(f"CREATE TABLE {table} (id Int32, j JSON) ENGINE = MergeTree ORDER BY id")
        client.command(f"INSERT INTO {table} VALUES (1, '{value}')", settings=settings)

        # Show which types the server inferred for each JSON path.
        paths = client.query(f"SELECT JSONAllPathsWithTypes(j) FROM {table}").first_row[0]
        print(f" [{parser:>9}] inferred types : {paths}", flush=True)

        # Reading the JSON column back is the step that fails for rapidjson.
        try:
            rows = list(client.query(f"SELECT * FROM {table}").named_results())
            print(f" [{parser:>9}] clickhouse_connect: {rows}", flush=True)
        except Exception as exc:
            print(f" [{parser:>9}] clickhouse_connect: RAISED {type(exc).__name__}: {exc}", flush=True)
    print(flush=True)


def main():
    # The first client has no database set; it creates and drops the test database.
    ch_client_default = clickhouse_connect.get_client(host=HOST, port=HTTP_PORT, username=USER, password=PASSWORD)
    ch_client_default.command(f"DROP DATABASE IF EXISTS {DB}")
    ch_client_default.command(f"CREATE DATABASE {DB}")
    client = clickhouse_connect.get_client(host=HOST, port=HTTP_PORT, username=USER, password=PASSWORD, database=DB)
    try:
        for label, value in CASES:
            run_case(client, label, value)
    finally:
        ch_client_default.command(f"DROP DATABASE IF EXISTS {DB}")


if __name__ == "__main__":
    main()
Configuration
Environment
- clickhouse-connect version: 0.15.1
- Python version: 3.13.11
- Operating system: Ubuntu 24.04.4 LTS
ClickHouse server
- ClickHouse Server version: 26.3.4.11
- ClickHouse Server non-default settings, if any:
- CREATE TABLE statements for tables involved: the script above contains everything