Rest.li applications are built around Resources. The key ingredient in creating a resource is a data model, whose internal structure is defined by a Pegasus Data Schema in key-value style. A fundamental presumption is that such a structure exists for every Rest.li data model. However, that is not the case for unstructured data such as images or PDFs, which are usually consumed in raw binary form without a containing data structure.
This user guide is about working with unstructured data in the Rest.li framework. It is not a comprehensive guide to building Rest.li resources in general, which is already covered in great detail at Writing Resources. This guide focuses on what is different about unstructured data resources.
See also Unstructured Data in Rest.li Quick Start.
To Rest.li, the key difference about unstructured data is that it has no defined schema and does not have to be represented by a single generated class (RecordTemplate) the way schema-based data does. Unstructured data can be handled in its rawest form as a byte array, or in a more advanced form such as InputStream or ByteBuffer in Java.
Additionally, there are several other differences that set them apart from the typical structured data:
By default, unstructured data enjoys the same level of support as structured data in Rest.li: it can be modeled as various resource types, and most resource-supporting features and tooling should work. However, because there is no RecordTemplate-based data model, any feature that operates on the structure of the resource value, such as Field Projections or Entity Validation, does not apply to unstructured data resources. Such features continue to work for structured data resources that live in the same Rest.li application.
Features Highlights:
Not Supported Features:
Streaming Support
In this context, streaming means the ability to transport and process unstructured data in small chunks without buffering the whole content in memory. It sounds appealing, but it introduces complexity for app developers that may be unnecessary in most simple use cases. Therefore, Rest.li supports both non-streaming and streaming methods.
The base interface determines the resource type, the resource key type, and the value type. Unstructured data has its own set of base interfaces; the main difference is that they take no resource value type. Each resource type comes in two variants, non-streaming and streaming. Non-streaming interfaces come in synchronous and asynchronous styles, while streaming has no such distinction.
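To illustrate what processing data in small chunks means, here is a minimal sketch in plain Java that does not use Rest.li at all. The class name `ChunkedCopy`, the method `copy`, and the 4096-byte chunk size are assumptions made for this example. The point is only that memory use is bounded by the chunk size rather than by the size of the payload.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Rest.li: copies a stream through a
// fixed-size buffer so that only one chunk is held in memory at a time.
final class ChunkedCopy {
    private ChunkedCopy() {}

    // Returns the total number of bytes copied from in to out.
    static long copy(InputStream in, OutputStream out, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        long total = 0;
        try {
            int read;
            // read() returns -1 at end of stream; otherwise the number of bytes placed in buffer.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[10_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(payload), sink, 4096);
        System.out.println("copied " + copied + " bytes");
    }
}
```

A non-streaming approach would instead read the entire payload into a single byte array before handing it on, which is simpler but requires memory proportional to the payload size.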
Non-Streaming base interfaces
UnstructuredDataCollectionResource
UnstructuredDataCollectionResourceAsync
UnstructuredDataCollectionResourceTask
UnstructuredDataCollectionResourcePromise
UnstructuredDataAssociationResource
UnstructuredDataAssociationResourceAsync
UnstructuredDataAssociationResourceTask
UnstructuredDataAssociationResourcePromise
UnstructuredDataSimpleResource
UnstructuredDataSimpleResourceAsync
UnstructuredDataSimpleResourceTask
UnstructuredDataSimpleResourcePromise
Streaming base interfaces
UnstructuredDataCollectionResourceReactive
UnstructuredDataAssociationResourceReactive
UnstructuredDataSimpleResourceReactive
Highlights:
The definition of a streaming unstructured data resource is similar to that of a regular resource, except that no value type is needed.
@RestLiCollection(name = "resumes", namespace = "com.mycompany")
public class ResumesResource extends UnstructuredDataCollectionResourceReactiveTemplate<String> { ... }
The interface of a streaming resource is similar to that of an asynchronous resource: a callback parameter is used to return the result.
@Override
public void get(String resumeId, @CallbackParam Callback<UnstructuredDataReactiveResult> callback) {
  // Wrap the payload in a Writer that supplies it to the stream.
  Writer<ByteString> writer = new SingletonWriter<>(ByteString.copy(UNSTRUCTURED_DATA_BYTES));
  // Return the stream together with its MIME type through the callback.
  callback.onSuccess(new UnstructuredDataReactiveResult(EntityStreams.newEntityStream(writer), MIME_TYPE));
}
Get Response
UnstructuredDataReactiveResult represents the download response. It encapsulates the unstructured data EntityStream as well as the metadata needed to return a successful response. It is merely a container and can be subclassed if desired.
Writing Unstructured Data
Streaming requires the data to be read and written as a continuous sequence of chunks. A simple byte array or InputStream won’t do the job, so Rest.li adopts the EntityStream interface.
ByteString is essentially Rest.li’s immutable byte array implementation and is used here to represent a single chunk. EntityStream is the interface that provides the chunks when they are requested. Note that the chunk size is not enforced; however, it’s recommended to keep the size reasonable and consistent.
Writing Unstructured Data with an R2 Writer
Rest.li’s R2 layer has its own, similar EntityStream implementation. If an R2 writer is already available, it can easily be converted to a Rest.li Writer using the EntityStreamAdapters utility.
Writer dataWriter = new ResumeDataWriter(id);
com.linkedin.entitystream.Writer<ByteString> writer = EntityStreamAdapters.toGenericWriter(dataWriter);
Setting the Content-Type
A content-specific MIME content type is required for the unstructured response to be handled correctly by its clients. It is a required part of the UnstructuredDataReactiveResult and is used as-is in the HTTP response header. Rest.li does not validate it.
Setting Additional Headers
More headers and metadata can be set using the ResourceContext. Here is an example that adds a Content-Disposition header to the response:
getContext().setResponseHeader("Content-Disposition", "attachment; filename=\"filename.jpg\"");
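When the file name is not a fixed literal, the header value has to be assembled at runtime. The following is a minimal sketch in plain Java of one way to do that; `ContentDisposition` and `attachment` are hypothetical names and not part of Rest.li. It assumes ASCII file names only, since non-ASCII names need the extended `filename*` parameter, which this sketch does not cover.

```java
// Hypothetical helper, not part of Rest.li: builds a Content-Disposition
// header value for a download with the given file name.
final class ContentDisposition {
    private ContentDisposition() {}

    static String attachment(String filename) {
        // Reject line breaks so a file name cannot inject extra header lines.
        if (filename.indexOf('\r') >= 0 || filename.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("filename must not contain line breaks");
        }
        // Escape backslashes first, then double quotes, so the name stays
        // inside the quoted string.
        String escaped = filename.replace("\\", "\\\\").replace("\"", "\\\"");
        return "attachment; filename=\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(attachment("filename.jpg"));
    }
}
```

The resulting string can then be passed as the second argument to `getContext().setResponseHeader("Content-Disposition", ...)` in the resource method.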
Rest.li filters currently don’t support access to the unstructured data payload. Any existing or new filter that tries to access the payload will get an empty record. (No, they won’t just fail.)
Resource IDLs are also generated for unstructured data resources, with a few minor differences in the generated IDL and Restspec files:
Rest.li generates online API documentation for every resource, including unstructured data resources. On the API page, however, unstructured data is currently treated as an empty, missing model.
Highlights
A simple unstructured data GET response:
curl 'http://myhost/resumes/1'
HTTP/1.1 200 OK
Content-Type: application/pdf
Content:
<<< bytes >>>
One common HTTP client is the browser itself (as opposed to a JavaScript client that lives in the browser). Unstructured data resource endpoints can be used wherever a standard web resource link is expected.
<html>
<a href="http://myhost/resumes/1">Download Resume</a>
</html>
D2 is what powers the host finding capability of RestClient under the hood. With a D2 client, you can send a request without having to specify the actual hostname of your resources:
URI uri = URI.create("d2://resumes/1");
StreamRequest req = new StreamRequestBuilder(uri).build(...);
d2Client.streamRequest(req, responseCallback);
...
A request to an unstructured data resource can fail in the same ways as a request to a regular resource. Moreover, even when a request such as a GET succeeds, the data flow can still be interrupted or time out. When that happens, you may receive a successful HTTP status such as 200 and still get an incomplete response, or a response that hangs until the client times out.
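Because a 200 status alone does not prove that the whole payload arrived, a client may want to check the byte count itself. The following is a minimal sketch in plain Java (11 or later), assuming the server declared the expected length, for example in a Content-Length header. That header is not always present, so this check is not always possible. `DownloadCheck` and `isComplete` are hypothetical names, not part of Rest.li.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical client-side helper, not part of Rest.li.
final class DownloadCheck {
    private DownloadCheck() {}

    // Drains the response body and reports whether the number of bytes
    // received matches the length the server declared.
    static boolean isComplete(InputStream body, long declaredLength) {
        try {
            // transferTo returns the number of bytes moved; the null stream discards them.
            long received = body.transferTo(OutputStream.nullOutputStream());
            return received == declaredLength;
        } catch (IOException e) {
            // Surface I/O failures unchecked so callers can treat them as a failed download.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulates a body that ended after 5 bytes although 8 were declared.
        InputStream truncated = new ByteArrayInputStream(new byte[5]);
        System.out.println("complete: " + isComplete(truncated, 8));
    }
}
```

A real client would usually write the bytes to their destination while counting them instead of discarding them; the discard here only keeps the sketch short.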
Q: Is there a limit on the size of unstructured data that a Rest.li resource can serve? A: No. In practice, however, the size is capped by your server’s timeout value. If you are seeing incomplete content on the client, the cause could be a server timeout that is too short.
Q: What is a reasonable server timeout? A: It depends on whether the Rest.li application hosts a mix of structured and unstructured resources. Currently, Rest.li allows only one timeout setting for the entire app. You may not want a long timeout for APIs that serve small structured data; on the other hand, a short timeout may be too small for APIs that serve large unstructured data.
Q: Should I create a streaming or non-streaming resource for my unstructured data? A: First, streaming doesn’t come for free, and true end-to-end streaming also depends on the other nodes in your network, so make sure you understand what you are getting into. Second, performance depends on many factors, such as the size of the data and I/O performance.
Q: What is reactive streaming and how can I leverage it? A: EntityStream