Add cross-provider multipart upload support #333

Open: nuwang wants to merge 5 commits into main from add-multipart-upload

nuwang commented Jun 26, 2026


Summary

Adds a provider-agnostic multipart upload capability to CloudBridge. Large objects can now be uploaded reliably and memory-efficiently across AWS S3, Azure Blob, GCP Storage and OpenStack Swift (and therefore the moto mock, which is backed by the AWS provider).

Two things are delivered together:

  • Explicit lifecycle API: obj.create_multipart_upload() → upload_part(n, data) → complete(parts) / abort(). Parts may be uploaded in any order or in parallel; complete() assembles them in ascending part-number order.
  • Transparent handling: upload() / upload_from_file() route inputs above a configurable threshold through the same mechanism, streaming one part at a time so large payloads are never fully buffered. Existing signatures and return values are unchanged.
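
To make the lifecycle concrete, here is a minimal in-memory sketch of the contract described above. It is not CloudBridge code: `InMemoryMultipartUpload`, the `UploadPart` fields, and the choice to return the assembled bytes from `complete()` are all assumptions made for illustration. It only shows that parts can arrive out of order and are still assembled by ascending part number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadPart:
    """Hypothetical handle returned for each uploaded part."""
    part_number: int
    size: int


class InMemoryMultipartUpload:
    """Toy stand-in for a multipart upload; keeps parts in a dict."""

    def __init__(self):
        self._parts = {}
        self._closed = False

    def _ensure_open(self):
        if self._closed:
            raise RuntimeError("upload already completed or aborted")

    def upload_part(self, part_number, data):
        # Parts may be uploaded in any order; the number fixes the position.
        self._ensure_open()
        self._parts[part_number] = bytes(data)
        return UploadPart(part_number, len(data))

    def complete(self, parts):
        # Assemble in ascending part-number order, regardless of upload order.
        self._ensure_open()
        ordered = sorted(parts, key=lambda p: p.part_number)
        blob = b"".join(self._parts[p.part_number] for p in ordered)
        self._parts.clear()
        self._closed = True
        return blob  # assumption: returned here only so the result is visible

    def abort(self):
        # Discard everything staged so far.
        self._parts.clear()
        self._closed = True


upload = InMemoryMultipartUpload()
parts = [upload.upload_part(2, b"world"), upload.upload_part(1, b"hello ")]
assembled = upload.complete(parts)
```

In the real API the upload would presumably be obtained from `obj.create_multipart_upload()` rather than constructed directly.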

Why

Large-object handling was inconsistent across providers: Azure and GCP buffered whole files into memory, Swift's single-PUT path failed above 5 GB, and only AWS/Swift handled large files well. This makes behaviour uniform, safe, and memory-efficient everywhere, and gives callers a single provider-agnostic API.

Design

Follows CloudBridge's three-layer (interface → base → provider) + subservice + @dispatch-event pattern:

  • Interfaces — new MultipartUpload / UploadPart, BucketObject.create_multipart_upload(), and four BucketObjectService methods.
  • Base — concrete BaseMultipartUpload (delegates to the _bucket_objects service) + BaseUploadPart; a memory-efficient streaming driver in BaseBucketObject; config knobs CB_MULTIPART_THRESHOLD / CB_MULTIPART_PART_SIZE (env + per-provider override) with a 5 MiB minimum enforced.
  • Providers — native S3 multipart; Azure block blobs (stage_block/commit_block_list, with a documented no-op abort since Azure has no server-side cancel); GCS objects.compose over temporary part objects with >32-source chaining + cleanup; Swift Static Large Objects with a manifest PUT. upload_from_file() keeps each provider's superior native path where one exists (AWS upload_file, Swift SwiftService); Azure's whole-file in-memory upload is replaced with a streaming path (removing the now-unused create_blob_from_* helpers).
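
As a rough illustration of how the threshold and part-size knobs might interact, the sketch below plans the parts for a payload of known size. The function names are hypothetical, and rejecting an undersized part size with `ValueError` is an assumption; the PR only states that a 5 MiB minimum is enforced and that inputs above the threshold take the multipart route.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # minimum named in the description above


def validate_part_size(part_size):
    """Reject part sizes below the minimum (assumed behaviour: raise)."""
    if part_size < MIN_PART_SIZE:
        raise ValueError(
            f"part size {part_size} is below the {MIN_PART_SIZE}-byte minimum"
        )
    return part_size


def plan_upload(total_size, threshold, part_size):
    """Return (part_number, offset, length) tuples for a multipart upload.

    An empty list means the payload is at or below the threshold and
    would take the unchanged single-shot path.
    """
    validate_part_size(part_size)
    if total_size <= threshold:
        return []
    return [
        (index + 1, offset, min(part_size, total_size - offset))
        for index, offset in enumerate(range(0, total_size, part_size))
    ]


plan = plan_upload(12 * MIB, threshold=10 * MIB, part_size=5 * MIB)
```

Here a 12 MiB payload with a 10 MiB threshold is split into two 5 MiB parts and a final 2 MiB part; only the last part may be smaller than the part size.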

Tests

New object-store tests — roundtrip, out-of-order parts, abort, transparent multipart, single-shot threshold, part-size validation. Written TDD-first against the moto mock; they run on the mock provider in CI without credentials and on aws/azure/gcp/openstack when selected via CB_TEST_PROVIDER. Existing object-store/storage suites remain green; flake8 (project CI config) is clean.

Backward compatibility

upload() / upload_from_file() keep their existing signatures and return semantics; only an internal threshold branch is added. The small-input path is unchanged.

Known limitation

GCS's >32-part compose chaining has no automated coverage (mock CI is AWS-only, and cloud tests use 3 parts). The logic is straightforward but only executes on very large GCS uploads.
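
Because this path lacks automated coverage, a small simulation can help reviewers reason about it. GCS compose accepts at most 32 source objects per request, so more parts require intermediate objects. The sketch below models one possible chaining strategy against a plain dict standing in for a bucket; the function names and temporary object naming are hypothetical and this is not necessarily how the PR implements it.

```python
MAX_COMPOSE_SOURCES = 32  # per-request source limit of GCS compose


def compose(store, sources, dest):
    """Simulated compose: concatenate up to 32 source objects into dest."""
    if not 1 <= len(sources) <= MAX_COMPOSE_SOURCES:
        raise ValueError("compose needs between 1 and 32 sources")
    store[dest] = b"".join(store[name] for name in sources)


def compose_chained(store, part_names, dest):
    """Compose any number of ordered parts into dest, then clean up."""
    current = list(part_names)
    temporaries = list(part_names)  # everything to delete afterwards
    level = 0
    while len(current) > MAX_COMPOSE_SOURCES:
        next_level = []
        for start in range(0, len(current), MAX_COMPOSE_SOURCES):
            batch = current[start:start + MAX_COMPOSE_SOURCES]
            name = f"{dest}.compose-{level}-{start // MAX_COMPOSE_SOURCES}"
            compose(store, batch, name)
            next_level.append(name)
        temporaries.extend(next_level)
        current = next_level
        level += 1
    compose(store, current, dest)
    for name in temporaries:
        store.pop(name, None)  # remove part and intermediate objects


bucket = {f"big.bin.part-{n:05d}": bytes([n]) for n in range(1, 71)}
compose_chained(bucket, sorted(bucket), "big.bin")
```

With 70 parts this produces three intermediates (32, 32 and 6 sources) and one final compose, leaving only `big.bin` in the simulated bucket.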

nuwang added 3 commits June 27, 2026 17:18
Introduce a MultipartUpload/UploadPart abstraction in the interface and
base layers, implemented by all four providers (AWS S3, Azure Blob, GCP
Storage, OpenStack Swift) and therefore the moto mock.

The explicit lifecycle is initiate -> upload_part(n) -> complete/abort,
exposed via BucketObject.create_multipart_upload(). The high-level
upload()/upload_from_file() methods now route inputs above a configurable
threshold (CB_MULTIPART_THRESHOLD/PART_SIZE) through the same mechanism,
streaming one part at a time so large payloads are never fully buffered.
Existing method signatures and return values are preserved.

Per-provider mapping: native S3 multipart; Azure block blobs
(stage_block/commit_block_list, with a documented no-op abort); GCS
compose over temporary part objects with >32-source chaining and cleanup;
Swift Static Large Objects with a manifest PUT. Azure's whole-file in-memory
upload is replaced with a streaming single-shot path, removing the unused
create_blob_from_text/create_blob_from_file helpers.

Adds object-store tests (roundtrip, out-of-order parts, abort, transparent
multipart, single-shot threshold, part-size validation) that run on the mock
provider in CI and on the cloud providers when selected.

The transparent multipart driver previously uploaded parts sequentially,
which gave none of multipart's throughput benefit. It now uploads parts
across a bounded thread pool (CB_MULTIPART_MAX_CONCURRENCY). To stay safe
even on providers whose SDK client/connection is not thread-safe, each
worker uploads through its own cloned provider, so no provider state is
shared across threads. Reads are coalesced up to the part size so non-final
parts are never undersized on short reads.
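
Read coalescing can be sketched as follows. This is an illustrative generator, not the PR's driver: `iter_parts` and `TrickleReader` are hypothetical names. It keeps calling `read()` until a full part is gathered or the stream ends, so only the final part can be short.

```python
import io


def iter_parts(stream, part_size):
    """Yield parts of exactly part_size bytes, except possibly the last."""
    while True:
        chunks = []
        remaining = part_size
        while remaining > 0:
            chunk = stream.read(remaining)
            if not chunk:  # end of stream
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        if not chunks:
            return
        yield b"".join(chunks)
        if remaining > 0:  # the stream ended inside this part
            return


class TrickleReader:
    """Test stream that never returns more than max_read bytes per call."""

    def __init__(self, data, max_read):
        self._buffer = io.BytesIO(data)
        self._max_read = max_read

    def read(self, size=-1):
        limit = self._max_read if size < 0 else min(size, self._max_read)
        return self._buffer.read(limit)


parts = list(iter_parts(TrickleReader(b"abcdefghij", max_read=3), part_size=4))
```

Even though the reader hands back at most 3 bytes per call, the first two parts are a full 4 bytes each and only the last is shorter.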

Providers with an efficient, thread-safe native parallel uploader override
the driver: AWS uses boto3 upload_fileobj (TransferManager) and Azure uses
upload_blob(max_concurrency=...). GCP and OpenStack Swift inherit the base
clone-pool driver, which gives Swift safe parallelism despite swiftclient's
non-thread-safe connection.
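
The clone-pool idea can be sketched with the standard library alone. Everything here is hypothetical scaffolding (`FakeClient`, `upload_parts_concurrently`, the `clone_log`); it only illustrates a bounded pool in which each worker thread lazily creates its own client clone and in which at most `max_concurrency` parts are in flight at once.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class FakeClient:
    """Stand-in for a provider whose connection is not thread-safe."""

    def __init__(self, server, lock, clone_log):
        self._server = server        # simulated remote store
        self._lock = lock            # protects the simulated store only
        self._clone_log = clone_log  # records how many clones were made

    def clone(self):
        with self._lock:
            self._clone_log.append(threading.get_ident())
        return FakeClient(self._server, self._lock, self._clone_log)

    def upload_part(self, part_number, data):
        with self._lock:
            self._server[part_number] = data
        return part_number


def upload_parts_concurrently(client, numbered_parts, max_concurrency):
    """Upload (part_number, data) pairs with a bounded number in flight."""
    local = threading.local()

    def worker(part_number, data):
        if not hasattr(local, "client"):
            local.client = client.clone()  # one clone per worker thread
        return local.client.upload_part(part_number, data)

    finished = []
    in_flight = set()
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for part_number, data in numbered_parts:
            if len(in_flight) >= max_concurrency:
                done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
                finished.extend(f.result() for f in done)
            in_flight.add(pool.submit(worker, part_number, data))
        finished.extend(f.result() for f in in_flight)
    return sorted(finished)


server, clone_log = {}, []
client = FakeClient(server, threading.Lock(), clone_log)
numbered = ((n, bytes([n])) for n in range(1, 11))
uploaded = upload_parts_concurrently(client, numbered, max_concurrency=3)
```

Pulling parts from the iterator only when a slot frees up keeps memory bounded to roughly `max_concurrency` parts, which matters when the parts come from a streaming reader.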

Adds a provider-agnostic unit test for the base driver (part ordering,
short-read coalescing, bounded concurrency, per-worker clone isolation,
abort-on-failure, part-size validation), since the AWS-backed mock provider
exercises the native override rather than the base driver.

The abort test asserted the target object is absent after abort, which only
holds on AWS, where objects.create() returns a bare handle. GCP, OpenStack
and Azure materialise an empty placeholder on create(), so objects.get()
returns that empty object rather than None.

Assert the provider-agnostic contract instead: after abort the target is
absent or empty, but never holds the uploaded part. Also clean up the
placeholder so bucket teardown does not leak.
@nuwang nuwang force-pushed the add-multipart-upload branch from c0c9eff to 58ae94a Compare June 27, 2026 11:57
AWS upload_from_file called boto3's upload_file with no TransferConfig, so it
used boto3's defaults and ignored CB_MULTIPART_* entirely -- unlike upload(),
which builds a TransferConfig from those knobs. Pass a TransferConfig built
from the same knobs so both upload paths honour a single configuration.

Introduce a provider-agnostic UploadConfig value object (threshold,
part_size, max_concurrency) that callers may pass to upload() and
upload_from_file() to tune a single transfer. It is deliberately not boto3's
TransferConfig, so the abstraction stays provider-neutral; each provider maps
the fields onto its native mechanism (boto3 TransferConfig, Azure
max_concurrency, the base clone-pool driver for GCP/Swift).

The three multipart resolvers now follow precedence: explicit UploadConfig
field -> provider/global config -> class default. Providers whose
upload_from_file uses a native uploader that manages its own segmenting
(GCP resumable, Swift SwiftService) accept the argument for interface
consistency but document that it does not affect that path.
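
The precedence rule can be sketched like this. The three field names come from the commit message; the `resolve` helper, the dict-shaped provider config, and the default values are assumptions chosen purely for illustration and are not CloudBridge's actual defaults.

```python
from dataclasses import dataclass
from typing import Optional

MIB = 1024 * 1024

# Illustrative class defaults only; the real values are not stated here.
CLASS_DEFAULTS = {
    "threshold": 64 * MIB,
    "part_size": 8 * MIB,
    "max_concurrency": 4,
}


@dataclass(frozen=True)
class UploadConfig:
    """Per-transfer overrides; None means 'not set by the caller'."""
    threshold: Optional[int] = None
    part_size: Optional[int] = None
    max_concurrency: Optional[int] = None


def resolve(field_name, upload_config, provider_config):
    """Explicit UploadConfig field -> provider/global config -> default."""
    if upload_config is not None:
        explicit = getattr(upload_config, field_name)
        if explicit is not None:
            return explicit
    configured = provider_config.get(field_name)
    if configured is not None:
        return configured
    return CLASS_DEFAULTS[field_name]


provider_config = {"part_size": 16 * MIB}  # hypothetical provider override
per_call = UploadConfig(threshold=1 * MIB)

threshold = resolve("threshold", per_call, provider_config)
part_size = resolve("part_size", per_call, provider_config)
concurrency = resolve("max_concurrency", per_call, provider_config)
```

In this example the threshold comes from the per-call `UploadConfig`, the part size from the provider config, and the concurrency from the class default.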