Skip to content

Add an indirection field to split/splice calls#376

Open
meroton-benjamin wants to merge 1 commit into
bazelbuild:mainfrom
meroton:main
Open

Add an indirection field to split/splice calls#376
meroton-benjamin wants to merge 1 commit into
bazelbuild:mainfrom
meroton:main

Conversation

@meroton-benjamin
Copy link
Copy Markdown

The split and splice api calls have an implicit maximum file size they can represent due to the maximum grpc message size. Remove this file size limit by adding an indirect_chunk_list field to the SplitBlobResponse and SpliceBlobRequest respectively.

This allows implementations to, in a backwards compatible manner, defer the list of chunks to an object stored in CAS making the split and splice blob requests capable of representing blobs of any size.

The split and splice api calls have an implicit maximum file size they
can represent due to the maximum grpc message size. Remove this file
size limit by adding an indirect_chunk_list field to the
SplitBlobResponse and SpliceBlobRequest respectively.

This allows implementations to, in a backwards compatible manner, defer
the list of chunks to an object stored in CAS making the split and
splice blob requests capable of representing blobs of any size.
@sluongng
Copy link
Copy Markdown
Collaborator

cc: @tyler-french

Copy link
Copy Markdown
Collaborator

@sluongng sluongng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising the PR. This looks great for first draft.

I know we discussed this on the call, but it might be worth pointing out for other reviewers who werent there:

  • Size limitation of the current split/splice + CDC chunking implementation: what's the threshold, etc...

  • Whats' the current behavior? If it's undefined, or perhaps results in wrong file content, then worth calling that out.

  • Provide some level of guidance on how a server should decide when to use the new field and/or when a client should read the new field over the old one.

I hope we can either provide this information inline with the doc, or at least leave some traces inside the PR body/commit message to help guide others when implementing it. Thanks

// Deprecated: Use DigestFunction_Value.Descriptor instead.
func (DigestFunction_Value) EnumDescriptor() ([]byte, []int) {
return file_build_bazel_remote_execution_v2_remote_execution_proto_rawDescGZIP(), []int{40, 0}
return file_build_bazel_remote_execution_v2_remote_execution_proto_rawDescGZIP(), []int{41, 0}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: we should add pb.go into gitattribute as generated file so it's treated as so during code review

Copy link
Copy Markdown
Contributor

@mostynb mostynb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a breaking change that requires an API version update?

ie if an older client doesn't use the new field gets a response that requires the use of the new field (because the list of blobs is too large), will things quietly break?

// The chunking function used to split the blob.
ChunkingFunction.Value chunking_function = 2;

// The digest of the indirect chunk list which MUST be present in the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backwards-compatibility reasons, I guess this field needs to be optional? In which case this first sentence should say something like "if specified and non-null".

Comment on lines 2016 to 2018
// The server MUST use the same digest function as the one explicitly or
// implicitly (through hash length) specified in the split request.
repeated Digest chunk_digests = 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explicitly state that this field is required?

// The chunking function that the client used to split the blob.
ChunkingFunction.Value chunking_function = 5;

// The digest of the indirect chunk list stored in the CAS. The client MAY set
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should specify that the blob is of type ChunkList.

@tyler-french
Copy link
Copy Markdown
Contributor

Perhaps I missed some prior discussion here, but I'd like to propose a slightly different approach for very large blobs: add streaming variants of the split/splice APIs, where chunk digests are sent in groups.

I agree that very large blobs can expose issues with CDC, but storing an indirect chunk list in CAS feels like using CAS as API transport. That leaks what should be a server-side storage/detail into the protocol, requires additional client behavior to read/write the indirection object, and does not address one of the main scalability issues with the current SpliceBlob flow: the server only learns about the blob/chunk mapping at the very end.

It also does not seem fully backward-compatible in practice. A client needs to indicate that it supports the indirect representation, and then needs extra read/write logic to materialize or dereference the chunk-list object. Older clients that only understand the existing SplitBlob / SpliceBlob shape would not know how to use that indirection.

For the case that motivated this PR, has the indirection approach worked well in practice? In particular, is the server still able to verify, within a reasonable time, that concatenating all chunks produces the expected blob digest?

The current client flow is roughly:

FindMissingBlobs(some group of chunks)
Write(some chunks)
FindMissingBlobs(some group of chunks)
Write(some chunks)
FindMissingBlobs(some group of chunks)
Write(some chunks)
SpliceBlob(all chunk digests) // This can be very large and may time out.

The problem is that the client is uploading a blob chunk-by-chunk, but the server does not know how those chunks compose the final blob until the final SpliceBlob call. For example, for a 20 GB blob with ~40k chunks, SpliceBlob may need to read all chunk data, hash the concatenation, and verify that it matches the requested blob digest in one RPC.

With an indirect chunk list, the write path also gets more complicated: the client uploads the chunks, writes the chunk-list object into CAS, and then calls SpliceBlob with a reference to that object. The read path has a similar extra step: SplitBlob returns indirection metadata, and the client then has to fetch and parse the chunk-list object from CAS. That solves one message-size limit, but by adding more protocol states and moving chunk-list transport into CAS.

A more idiomatic and scalable approach would be to add streaming variants of these APIs. This keeps the chunk-list transport in the RPC layer, separate from any particular server-side storage strategy. It also lets clients and servers exchange chunk digests incrementally without exceeding unary message limits or requiring clients to understand how the server represents oversized chunk lists internally.

For example, the write-side API could be:

// SpliceChunks is a streaming variant of SpliceBlob.
// Clients stream chunk digests as they become available, then commit the
// splice by sending the expected blob digest on the final request.
rpc SpliceChunks(stream SpliceChunksRequest) returns (SpliceBlobResponse);
message SpliceChunksRequest {
  // The instance of the execution system to operate against. A server may
  // support multiple instances of the execution system (with their own workers,
  // storage, caches, etc.). The server MAY require use of this field to select
  // between them in an implementation-defined fashion, otherwise it can be
  // omitted. Subsequent requests MAY omit it, but if set it MUST match the
  // first request.
  string instance_name = 1;

  // Expected digest of the spliced blob. This commits the splice. Clients
  // SHOULD only set this on the final request and MUST set it exactly once.
  Digest blob_digest = 2;

  // The ordered list of digests of the chunks which need to be concatenated to
  // assemble the original blob. Chunk digests may be split across multiple
  // stream requests.
  repeated Digest chunk_digests = 3;

  // The digest function of all chunks to be concatenated and of the blob to be
  // spliced. The server MUST use the same digest function for both cases.
  // Subsequent requests MAY omit it, but if set it MUST match the first
  // request.
  DigestFunction.Value digest_function = 4;

  // The chunking function that the client used to split the blob.
  ChunkingFunction.Value chunking_function = 5;
}

And the read-side equivalent could be:

// SplitChunks is a streaming variant of SplitBlob.
// The complete chunk list is the concatenation of chunk_digests across all
// responses in stream order. The server sets blob_digest on the final response
// to indicate that there are no more chunks.
rpc SplitChunks(SplitBlobRequest) returns (stream SplitChunksResponse);
message SplitChunksResponse {
  // The ordered list of digests of the chunks into which the blob was split.
  repeated Digest chunk_digests = 1;

  // The digest of the split blob. This indicates that all chunk digests have
  // been sent. Servers SHOULD only set this on the final response and MUST set
  // it exactly once.
  Digest blob_digest = 2;

  // The chunking function used to split the blob. The server MUST send the same
  // value in all responses.
  ChunkingFunction.Value chunking_function = 3;
}

I put together a draft here: #377

Would you be open to this kind of streaming API instead? Are there benefits to keeping the chunk-list extension on the unary APIs that I am missing?

@meroton-benjamin
Copy link
Copy Markdown
Author

Thanks for raising the PR. This looks great for first draft.

I know we discussed this on the call, but it might be worth pointing out for other reviewers who werent there:

* Size limitation of the current split/splice + CDC chunking implementation: what's the threshold, etc...

With regards to the current limit, there are quite a bit of moving pieces to the current behavior as there are many parameters that could be set. In essence its $x_{min}\cdot\frac{g}{d}$ where $x_{min}$ is the minimum chunk size, $g$ is your max grpc message size and $d$ is the size of your digest including overhead for repeated etc. For small sized blobs in sha256 $d$ is about 73 bytes.

This gives us multiple scenarios. Using RepMaxCDC with an optimized profile (256KiB min chunks, 2MiB grpc messages) we get a worst case blob size of 7GiB (average 9GiB). When doing something similar with FastCDC and 512KiB average chunk size we get a worst case blob size of 3.5GiB (average 14GiB). The worst case chunking is unlikely to materialize outside of synthetic data.

Sacrificing performance to maximize the chunk size while remaining inside of 4MiB grpc messages we can get a worst case blobs of 112GiB lower limit (150GiB average) for RepMaxCDC or 14GiB lower limit (56GiB average) for FastCDC.

* Whats' the _current_ behavior? If it's undefined, or perhaps results in wrong file content, then worth calling that out.

The current behavior is a transport layer error. If a client were to attempt to send chunk lists that wouldn't fit they would get an RESOURCE_EXHAUSTED error together with a helpful message like "Message was N bytes but this server only accepts messages up to M bytes" or similar.

If the Server on the other hand tried to respond with a message that was too big the client would simply hang up with an RST_STREAM frame.

I was previously under the impression that we negotiated max_message_size as part of the GetCapabilities call but I don't think we do. We simply rely on all messages always being below the maximum message size of the server/client pair.

* Provide some level of guidance on how a server should decide when to use the new field and/or when a client should read the new field over the old one.

I hope we can either provide this information inline with the doc, or at least leave some traces inside the PR body/commit message to help guide others when implementing it. Thanks

My intention was that if the new field is set then the full list of chunks must be assumed to be in the new field, the old field should still be set to allow fetching of blobs during the roundtrip to the cas. Servers and clients should set the new field as soon as they believe they are close to exceeding maximum message size if they want to maximize compatibility.

@meroton-benjamin
Copy link
Copy Markdown
Author

@tyler-french I agree that this is less than ideal, but I think the larger issue stems from that we in REv2 only represent blobs by their hash leaving no room for having e.g. a file represented by a more lightweight tree construct that doesn't require streaming the full blob to splice even if 99% of the blob is unchanged from earlier versions.

This means that no matter how we construct api we do we will always have to deal with that issue, a compliant splice call for a 100GB blob will always have to download all 100GB worth of chunks to hash them and write the resulting blob somewhere. If we instead allowed files to be represented by more than one digest we we wouldn't be in this pickle but that's probably a change that requires a v3.

For your streaming variant suggestion, one issue I see is that it makes it difficult to cache a splice call, the splicing server needs to know which blob it is trying to splice at first call in order to be able to skip it if it has already been done. In flight deduplication is also not really possible as we at no point can know whether two streams are trying to splice the same blob with the same chunks until the very last message arrives for the stream.

@tyler-french
Copy link
Copy Markdown
Contributor

This means that no matter how we construct api we do we will always have to deal with that issue, a compliant splice call for a 100GB blob will always have to download all 100GB worth of chunks to hash them and write the resulting blob somewhere. If we instead allowed files to be represented by more than one digest we we wouldn't be in this pickle but that's probably a change that requires a v3.

Agreed. I like the idea mentioned about using Prolly trees - seems pretty nifty.

the splicing server needs to know which blob it is trying to splice at first call in order to be able to skip it if it has already been done. In flight deduplication is also not really possible as we at no point can know whether two streams are trying to splice the same blob with the same chunks until the very last message arrives for the stream.

This is good feedback, thank you! I adjusted the implementation to use a more "finish_write" style bool flag to commit the splice. I agree this is an important implementation detail: #377

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants