Add cross-provider multipart upload support #333
Open
nuwang wants to merge 5 commits into
Introduce a MultipartUpload/UploadPart abstraction in the interface and base layers, implemented by all four providers (AWS S3, Azure Blob, GCP Storage, OpenStack Swift) and therefore the moto mock. The explicit lifecycle is initiate -> upload_part(n) -> complete/abort, exposed via BucketObject.create_multipart_upload(). The high-level upload()/upload_from_file() methods now route inputs above a configurable threshold (CB_MULTIPART_THRESHOLD/PART_SIZE) through the same mechanism, streaming one part at a time so large payloads are never fully buffered. Existing method signatures and return values are preserved. Per-provider mapping: native S3 multipart; Azure block blobs (stage_block/commit_block_list, with a documented no-op abort); GCS compose over temporary part objects with >32-source chaining and cleanup; Swift Static Large Objects with a manifest PUT. Azure's whole-file in-memory upload is replaced with a streaming single-shot path, removing the unused create_blob_from_text/create_blob_from_file helpers. Adds object-store tests (roundtrip, out-of-order parts, abort, transparent multipart, single-shot threshold, part-size validation) that run on the mock provider in CI and on the cloud providers when selected.
The transparent multipart driver previously uploaded parts sequentially, which gave none of multipart's throughput benefit. It now uploads parts across a bounded thread pool (CB_MULTIPART_MAX_CONCURRENCY). To stay safe even on providers whose SDK client/connection is not thread-safe, each worker uploads through its own cloned provider, so no provider state is shared across threads. Reads are coalesced up to the part size so non-final parts are never undersized on short reads. Providers with an efficient, thread-safe native parallel uploader override the driver: AWS uses boto3 upload_fileobj (TransferManager) and Azure uses upload_blob(max_concurrency=...). GCP and OpenStack Swift inherit the base clone-pool driver, which gives Swift safe parallelism despite swiftclient's non-thread-safe connection. Adds a provider-agnostic unit test for the base driver (part ordering, short-read coalescing, bounded concurrency, per-worker clone isolation, abort-on-failure, part-size validation), since the AWS-backed mock provider exercises the native override rather than the base driver.
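As a rough illustration of the read coalescing and the bounded clone-per-worker pool described in this commit, a minimal sketch follows. It is not the PR's driver: `read_full`, `iter_parts`, `upload_parts` and `make_worker_upload` are hypothetical names, and the factory merely stands in for cloning a provider per worker thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def read_full(stream, size):
    """Read up to `size` bytes, looping over short reads until EOF."""
    chunks, remaining = [], size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def iter_parts(stream, part_size):
    """Yield (part_number, data); only the final part may be undersized."""
    part_number = 1
    while True:
        data = read_full(stream, part_size)
        if not data:
            return
        yield part_number, data
        part_number += 1


def upload_parts(stream, part_size, max_concurrency, make_worker_upload):
    """Upload parts on a bounded pool, one uploader per worker thread.

    `make_worker_upload()` (hypothetical) returns a callable
    upload(part_number, data) -> result, standing in for a cloned provider.
    """
    local = threading.local()
    slots = threading.BoundedSemaphore(max_concurrency)

    def task(part_number, data):
        try:
            if not hasattr(local, "upload"):
                local.upload = make_worker_upload()  # one clone per thread
            return part_number, local.upload(part_number, data)
        finally:
            slots.release()

    futures = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for part_number, data in iter_parts(stream, part_size):
            slots.acquire()  # stop reading ahead while all workers are busy
            futures.append(pool.submit(task, part_number, data))
    # Any worker exception is re-raised here; a caller would abort on failure.
    return sorted(f.result() for f in futures)
```

The semaphore keeps at most `max_concurrency` parts in memory at once, so the stream is never read far ahead of the workers.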
The abort test asserted the target object is absent after abort, which only holds on AWS, where objects.create() returns a bare handle. GCP, OpenStack and Azure materialise an empty placeholder on create(), so objects.get() returns that empty object rather than None. Assert the provider-agnostic contract instead: after abort the target is absent or empty, but never holds the uploaded part. Also clean up the placeholder so bucket teardown does not leak.
c0c9eff to 58ae94a
AWS upload_from_file called boto3's upload_file with no TransferConfig, so it used boto3's defaults and ignored CB_MULTIPART_* entirely -- unlike upload(), which builds a TransferConfig from those knobs. Pass a TransferConfig built from the same knobs so both upload paths honour a single configuration.
Introduce a provider-agnostic UploadConfig value object (threshold, part_size, max_concurrency) that callers may pass to upload() and upload_from_file() to tune a single transfer. It is deliberately not boto3's TransferConfig, so the abstraction stays provider-neutral; each provider maps the fields onto its native mechanism (boto3 TransferConfig, Azure max_concurrency, the base clone-pool driver for GCP/Swift). The three multipart resolvers now follow precedence: explicit UploadConfig field -> provider/global config -> class default. Providers whose upload_from_file uses a native uploader that manages its own segmenting (GCP resumable, Swift SwiftService) accept the argument for interface consistency but document that it does not affect that path.
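The precedence chain in this commit (explicit UploadConfig field, then provider/global config, then class default) could be sketched as below. The field names come from the commit message; everything else is an assumption for illustration: the `None` defaults on the dataclass, the `resolve` helper, the dict-shaped provider config, and the default values in `CLASS_DEFAULTS`. The 5 MiB minimum is the one the PR description says is enforced on the part size.

```python
from dataclasses import dataclass
from typing import Optional

MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB minimum from the PR description


@dataclass(frozen=True)
class UploadConfig:
    # None means "not set by the caller; fall through to the next level".
    threshold: Optional[int] = None
    part_size: Optional[int] = None
    max_concurrency: Optional[int] = None


# Hypothetical class defaults, chosen only for this sketch.
CLASS_DEFAULTS = {
    "threshold": 8 * 1024 * 1024,
    "part_size": 8 * 1024 * 1024,
    "max_concurrency": 4,
}


def resolve(field, upload_config, provider_config):
    """Resolve one knob: explicit field -> provider/global config -> default."""
    value = getattr(upload_config, field) if upload_config is not None else None
    if value is None:
        # e.g. "part_size" -> "CB_MULTIPART_PART_SIZE"
        raw = provider_config.get("CB_MULTIPART_" + field.upper())
        value = int(raw) if raw is not None else CLASS_DEFAULTS[field]
    if field == "part_size" and value < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    return value
```

Keeping unset fields as `None` lets a caller override a single knob, for example only `max_concurrency`, without restating the others.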
Summary
Adds a clean, provider-agnostic multi-part upload capability to CloudBridge. Large objects can now be uploaded reliably and memory-efficiently across AWS S3, Azure Blob, GCP Storage and OpenStack Swift (and therefore the moto mock).
Two things are delivered together:
- obj.create_multipart_upload() → upload_part(n, data) → complete(parts) / abort(). Parts may be uploaded in any order or in parallel; complete() assembles them in ascending part-number order.
- upload() / upload_from_file() route inputs above a configurable threshold through the same mechanism, streaming one part at a time so large payloads are never fully buffered. Existing signatures and return values are unchanged.
Why
Large-object handling was inconsistent across providers: Azure and GCP buffered whole files into memory, Swift's single-PUT path failed above 5 GB, and only AWS/Swift handled large files well. This makes behaviour uniform, safe, and memory-efficient everywhere, and gives callers a single provider-agnostic API.
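To make the explicit lifecycle contract from the summary concrete (parts in any order, assembly in ascending part-number order, abort leaving no uploaded data behind), here is a toy in-memory stand-in. It is not CloudBridge code and does not call any provider; `InMemoryMultipartUpload` and the dict used as a bucket are hypothetical, and only the method names mirror the ones listed in the summary.

```python
class InMemoryMultipartUpload:
    """Toy stand-in for a multipart upload; illustrates the contract only."""

    def __init__(self, store, key):
        self._store = store  # plain dict acting as the bucket
        self._key = key
        self._parts = {}     # part_number -> staged bytes

    def upload_part(self, part_number, data):
        self._parts[part_number] = bytes(data)
        return part_number   # handle later passed to complete()

    def complete(self, parts):
        # Assemble in ascending part-number order, whatever the upload order.
        self._store[self._key] = b"".join(
            self._parts[n] for n in sorted(parts)
        )
        self._parts.clear()

    def abort(self):
        # Discard staged parts; the target never receives them.
        self._parts.clear()


bucket = {}

# Parts arrive out of order but are assembled as 1, 2, 3.
upload = InMemoryMultipartUpload(bucket, "big.bin")
handles = [
    upload.upload_part(n, data)
    for n, data in [(3, b"cc"), (1, b"aa"), (2, b"bb")]
]
upload.complete(handles)

# An aborted upload leaves nothing under its key.
aborted = InMemoryMultipartUpload(bucket, "other.bin")
aborted.upload_part(1, b"zz")
aborted.abort()
```

On real providers the post-abort state may be an empty placeholder rather than a missing object, as the abort-test commit notes; the toy models only the stricter "absent" case.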
Design
Follows CloudBridge's three-layer (interface → base → provider) + subservice + @dispatch-event pattern:
- Interface: MultipartUpload/UploadPart, BucketObject.create_multipart_upload(), and four BucketObjectService methods.
- Base: BaseMultipartUpload (delegates to the _bucket_objects service) + BaseUploadPart; a memory-efficient streaming driver in BaseBucketObject; config knobs CB_MULTIPART_THRESHOLD/CB_MULTIPART_PART_SIZE (env + per-provider override) with a 5 MiB minimum enforced.
- Providers: native S3 multipart; Azure block blobs (stage_block/commit_block_list, with a documented no-op abort since Azure has no server-side cancel); GCS objects.compose over temporary part objects with >32-source chaining + cleanup; Swift Static Large Objects with a manifest PUT.
- upload_from_file() keeps each provider's superior native path where one exists (AWS upload_file, Swift SwiftService); Azure's whole-file in-memory upload is replaced with a streaming path (removing the now-unused create_blob_from_* helpers).
Tests
New object-store tests: roundtrip, out-of-order parts, abort, transparent multipart, single-shot threshold, part-size validation. Written TDD-first against the moto mock; they run on the mock provider in CI without credentials and on aws/azure/gcp/openstack when selected via CB_TEST_PROVIDER. Existing object-store/storage suites remain green; flake8 (project CI config) is clean.
Backward compatibility
upload()/upload_from_file() keep their existing signatures and return semantics; only an internal threshold branch is added. The small-input path is unchanged.
Known limitation
GCS's >32-part compose chaining has no automated coverage (mock CI is AWS-only, and cloud tests use 3 parts). The logic is straightforward but only executes on very large GCS uploads.
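One way the chaining could be covered without a GCS account is to test the grouping logic against injected callbacks. The sketch below is an assumption about the general shape of such chaining, not the PR's implementation: `compose_chained`, the intermediate object naming, and the `compose`/`delete` callbacks are all hypothetical. It relies only on the 32-source limit per compose request that the PR refers to.

```python
MAX_COMPOSE_SOURCES = 32  # source-object limit per compose request


def compose_chained(part_names, final_name, compose, delete):
    """Compose ordered parts into final_name, at most 32 sources per call.

    compose(sources, dest) and delete(name) are injected callbacks
    (hypothetical), so the chaining can be tested without a provider.
    """
    current = list(part_names)
    temporaries = []
    level = 0
    while len(current) > MAX_COMPOSE_SOURCES:
        next_level = []
        for i in range(0, len(current), MAX_COMPOSE_SOURCES):
            group = current[i:i + MAX_COMPOSE_SOURCES]
            name = f"{final_name}.compose-{level}-{i // MAX_COMPOSE_SOURCES}"
            compose(group, name)  # intermediate object, order preserved
            temporaries.append(name)
            next_level.append(name)
        current = next_level
        level += 1
    compose(current, final_name)
    # Clean up the original parts and every intermediate object.
    for name in list(part_names) + temporaries:
        delete(name)
```

With 100 parts this performs four intermediate composes (32, 32, 32 and 4 sources) followed by one final compose of the four intermediates.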