There are three architectural layers that define how data is stored in-memory and provide the APIs used to access this data.
At the conceptual level, the Data layer provides generic in-memory
representations of JSON objects and arrays. A DataMap
and a DataList
provide the in-memory representation of a JSON object and a JSON array
respectively. These DataMaps and DataLists are the primary in-memory
data structures that store and manage data belonging to instances of
complex schema types. This layer allows data to be serialized and
de-serialized into in-memory representations without requiring the
schema to be known. In fact, the Data layer is not aware of schemas and
does not require a schema to access the underlying data.
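To make this schema-free model concrete, the sketch below builds the generic in-memory equivalent of a JSON object using plain `java.util.Map` and `java.util.List` as stand-ins for `DataMap` and `DataList` (which implement those interfaces); the document and field names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenericDataDemo {
    // Builds the generic in-memory equivalent of the JSON document
    // {"name": "Alice", "scores": [1, 2, 3]} without consulting any schema.
    static Map<String, Object> buildRecord() {
        List<Object> scores = new ArrayList<>();      // stands in for a DataList
        scores.add(1);
        scores.add(2);
        scores.add(3);
        Map<String, Object> record = new HashMap<>(); // stands in for a DataMap
        record.put("name", "Alice");
        record.put("scores", scores);
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> record = buildRecord();
        // Access is by string key and list index; no schema is involved.
        System.out.println(record.get("name"));                      // Alice
        System.out.println(((List<?>) record.get("scores")).size()); // 3
    }
}
```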
The main motivation behind the Data layer is to provide a simple, generic
in-memory representation of data that can be serialized and de-serialized
without schema knowledge.

The Data layer implements the following constraints:

- Values that are not complex (i.e., not a DataMap and not a DataList) are
  immutable.
- The Data.NULL constant is used to represent null deserialized from or to
  be serialized to JSON. Avoiding null Java values reduces complexity by
  reducing the number of states a field may have. Without null values, a
  field can have two states, "absent" or "has valid value". If null values
  are permitted, a field can have three states, "absent", "has null value",
  and "has valid value".
- The key type of a DataMap is always java.lang.String.

The Data layer provides the following additional features (above and beyond what the Java library provides):
- A DataMap or DataList may be made read-only. Once it is read-only,
  mutations will no longer be allowed and will throw
  java.lang.UnsupportedOperationException. There is no way to revert a
  read-only instance to read-write.
- Access instrumentation. See com.linkedin.data.Instrumentable for details.
- Enforcement of the allowed value types. The values stored in a DataMap or
  DataList must be of the following types:
  - java.lang.Integer
  - java.lang.Long
  - java.lang.Float
  - java.lang.Double
  - java.lang.Boolean
  - java.lang.String
  - com.linkedin.data.ByteString
  - com.linkedin.data.DataMap
  - com.linkedin.data.DataList

Note: Enum types are not allowed because enum types are not generic and portable. Enum values are stored as strings.
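The benefit of the Data.NULL constraint above can be sketched with plain Java collections; `NULL` here is a hypothetical stand-in for the `Data.NULL` singleton:

```java
import java.util.HashMap;
import java.util.Map;

public class NullSentinelDemo {
    // Hypothetical stand-in for the Data.NULL singleton.
    static final Object NULL = new Object();

    // Because Java nulls are never stored, get() returning null
    // unambiguously means the field is absent; a JSON null is present
    // as the NULL sentinel, which is itself a valid value.
    static boolean isAbsent(Map<String, Object> map, String key) {
        return map.get(key) == null;
    }

    public static void main(String[] args) {
        Map<String, Object> map = new HashMap<>();
        map.put("a", 42);   // field with a valid value
        map.put("b", NULL); // deserialized JSON null, stored as the sentinel
        System.out.println(isAbsent(map, "a")); // false
        System.out.println(isAbsent(map, "b")); // false: present as JSON null
        System.out.println(isAbsent(map, "c")); // true: absent
    }
}
```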
Both DataMap and DataList implement the
com.linkedin.data.DataComplex interface. This interface declares the
methods that support the additional features common to a DataMap and
a DataList. These methods are:
Method | Declared by | Description |
---|---|---|
DataComplex clone() | DataComplex | A shallow copy of the instance. The read-only state is not copied; the clone will be mutable. The instrumentation state is also not copied. Although java.lang.CloneNotSupportedException is declared in the throws clause, the method should not throw this exception. |
DataComplex copy() | DataComplex | A deep copy of the object graph rooted at the instance. The copy will be isomorphic to the original. The read-only state is not deep copied, and the new DataComplex copies will be mutable. The instrumentation state is also not copied. Although java.lang.CloneNotSupportedException is declared in the throws clause, the method should not throw this exception. |
void setReadOnly() | CowCommon | Makes the instance read-only. It does not affect the read-only state of contained DataComplex values. |
boolean isReadOnly() | CowCommon | Whether the instance is in the read-only state. |
void makeReadOnly() | DataComplex | Makes the object graph rooted at this instance read-only. |
boolean isMadeReadOnly() | DataComplex | Whether the object graph rooted at this instance has been made read-only. |
Collection<Object> values() | DataComplex | Returns the values stored in the DataComplex instance, i.e. the values of a DataMap or the elements of a DataList. |
void startInstrumentingAccess() | Instrumentable | Starts instrumenting access. |
void stopInstrumentingAccess() | Instrumentable | Stops instrumenting access. |
void clearInstrumentedData() | Instrumentable | Clears instrumentation data collected. |
void collectInstrumentedData(...) | Instrumentable | Collects data gathered while instrumentation was enabled. |
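The read-only behavior described above can be sketched with the Java library's unmodifiable views, which throw the same exception on mutation. This is only an analogy: `setReadOnly()` makes the DataComplex instance itself reject mutation rather than producing a separate view.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ReadOnlyDemo {
    // Returns true if mutating the map throws, mirroring the behavior of a
    // DataMap after setReadOnly(): mutation attempts raise
    // java.lang.UnsupportedOperationException.
    static boolean mutationRejected(Map<String, Object> readOnlyMap) {
        try {
            readOnlyMap.put("k", "v");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> map = new HashMap<>();
        map.put("existing", 1);
        Map<String, Object> view = Collections.unmodifiableMap(map);
        System.out.println(mutationRejected(view)); // true
        System.out.println(view.get("existing"));   // reads still work: 1
    }
}
```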
Note: Details on CowCommon, CowMap, and CowList have been omitted
or covered under DataComplex. Cow provides copy-on-write
functionality. The semantics of CowMap and CowList are similar to
HashMap and ArrayList.
The com.linkedin.data.DataMap class has the following characteristics:

- DataMap implements java.util.Map<String, Object>.
- Its entrySet(), keySet(), and values() methods return unmodifiable set and collection views.
- Its clone() and copy() methods return a DataMap.

The com.linkedin.data.DataList class has the following characteristics:

- DataList implements java.util.List<Object>.
- Its clone() and copy() methods return a DataList.

The Data Schema layer provides the in-memory representation of the data schema.
The common base class for the Data Schema classes is
com.linkedin.data.schema.DataSchema. It defines the following methods:
Method | Description |
---|---|
Type getType() | Provides the type of the schema, which can be BOOLEAN, INT, LONG, FLOAT, DOUBLE, BYTES, STRING, FIXED, ENUM, NULL, ARRAY, RECORD, MAP, or UNION. |
boolean hasError() | Whether the schema definition contains at least one error. |
boolean isPrimitive() | Whether the schema type is a primitive schema type. |
boolean isComplex() | Whether the schema type is a complex schema type, i.e. not a primitive type. |
Map<String,Object> getProperties() | Returns the properties of the schema. These properties are the keys and values from the JSON fields in complex schema definitions that are not processed and interpreted by the schema parser. For primitive types, this method always returns an immutable empty map. |
String getUnionMemberKey() | If this type is used as a member of a union without an alias, this is the key that uniquely identifies/selects this type within the union. The value of this key is as defined by the Avro 1.4.1 specification for JSON serialization. |
String toString() | A more human-consumable formatting of the schema in JSON encoding. Spaces are added between fields, items, names, values, etc. |
Type getDereferencedType() | If the type is a typeref, follows the typeref reference chain and returns the type referenced at the end of the typeref chain. |
DataSchema getDereferencedSchema() | If the type is a typeref, follows the typeref reference chain and returns the DataSchema referenced at the end of the typeref chain. |
The following table shows the mapping of schema types to Data Schema classes.

Schema Type | Data Schema Class | Relevant Specific Attributes |
---|---|---|
int | IntegerDataSchema | |
long | LongDataSchema | |
float | FloatDataSchema | |
double | DoubleDataSchema | |
boolean | BooleanDataSchema | |
string | StringDataSchema | |
bytes | BytesDataSchema | |
enum | EnumDataSchema | List<String> getSymbols(), int index(String symbol), boolean contains(String symbol) |
array | ArrayDataSchema | DataSchema getItems() |
map | MapDataSchema | DataSchema getValues() |
fixed | FixedDataSchema | int getSize() |
record, error | RecordDataSchema | RecordType recordType() (record or error), boolean isErrorRecord(), List<Field> getFields(), int index(String fieldName), boolean contains(String fieldName), Field getField(String fieldName) |
union | UnionDataSchema | List<Member> getMembers(), boolean contains(String memberKey), DataSchema getTypeByMemberKey(String memberKey), boolean areMembersAliased() |
null | NullDataSchema | |
The ValidateDataAgainstSchema class provides methods for validating
Data layer instances against a Data Schema. The ValidationOption class is
used to specify how validation should be performed and how to fix up the
input Data layer objects to conform to the schema. There are two
independently configurable options:

- The RequiredMode option indicates how required fields should be handled during validation.
- The CoercionMode option indicates how to coerce Data layer objects to the Java type corresponding to their schema type.

Example usage:
```java
ValidationResult validationResult =
    ValidateDataAgainstSchema.validate(dataTemplate, dataTemplate.schema(),
                                       new ValidationOptions());
if (!validationResult.isValid())
{
  // do something
}
```
The available RequiredModes are:

- IGNORE: required fields may be absent.
- MUST_BE_PRESENT: required fields must be present.
- CAN_BE_ABSENT_IF_HAS_DEFAULT: required fields may be absent if the field has a default value.
- FIXUP_ABSENT_WITH_DEFAULT: if a required field is absent and it has a default value, fix up the field by setting it to the default value. This fix-up is not possible if the DataMap containing the field cannot be modified because it is read-only.

Since JSON does not have or encode enough information on the actual
types of primitives, and schema types like bytes and fixed are not
represented by native types in JSON, the initial de-serialized in-memory
representation of instances of these types may not be the actual type
specified in the schema. For example, when de-serializing the number 52,
it will be de-serialized into an Integer
even though the schema type
may be a Long
. This is because a schema is not required to serialize
or de-serialize.
When the data is accessed via a schema-aware language binding like the
Java binding, the conversion/coercion can occur at the language binding
layer. In cases where the language binding is not used, it may be
desirable to fix up a Data layer object by coercing it to the Java type
corresponding to the object's schema. For example, the appropriate Java
type in the above example would be a Long. Another fix-up would be to
convert the Avro-specified string encoding of binary data (bytes or fixed)
into a ByteString. In another case, it may be desirable to coerce the
string representation of a value to the Java type corresponding to the
object's schema. For example, coerce "65" to 65, the integer, if the
schema type is "int".
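The integer-to-long fix-up described above amounts to calling `Number.longValue()` on whatever `Number` the deserializer produced; a minimal sketch (the helper method name is hypothetical):

```java
public class NormalCoercionDemo {
    // Coerce a deserialized numeric value to the Java type for a "long"
    // schema, the way the NORMAL coercion mode uses Number.longValue().
    static Long coerceToLong(Object value) {
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException("value must be a Number");
        }
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        // A JSON deserializer stores 52 in the smallest fitting type, an Integer...
        Object deserialized = Integer.valueOf(52);
        // ...but the schema type is long, so fix up to a Long.
        Long fixed = coerceToLong(deserialized);
        System.out.println(fixed);            // 52
        System.out.println(fixed.getClass()); // class java.lang.Long
    }
}
```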
Whether and how coercion is performed is specified by CoercionMode. The
available CoercionModes are:

- OFF: no coercion is performed.
- NORMAL: performs the validation and coercion described under the NORMAL coercion mode below.
- STRING_TO_PRIMITIVE: includes the coercions of NORMAL. In addition, also coerces string representations of numbers to the schema's corresponding numeric type, and string representations of booleans ("true" or "false", case-insensitive) to Boolean.

NORMAL Coercion Mode

The following table provides additional details on the NORMAL
validation and coercion mode.
Schema Type | Post-coercion Java Type | Pre-coercion Input Java Types | Validation Performed | Coercion Method |
---|---|---|---|---|
int | java.lang.Integer | java.lang.Number (1) | Value must be a Number. | Number.intValue() |
long | java.lang.Long | java.lang.Number (1) | Value must be a Number. | Number.longValue() |
float | java.lang.Float | java.lang.Number (1) | Value must be a Number. | Number.floatValue() |
double | java.lang.Double | java.lang.Number (1) | Value must be a Number. | Number.doubleValue() |
boolean | java.lang.Boolean | java.lang.Boolean (2) | Value must be a Boolean. | |
string | java.lang.String | java.lang.String (2) | Value must be a String. | |
bytes | com.linkedin.data.ByteString | com.linkedin.data.ByteString, java.lang.String (3) | If the value is a String, the String must be a valid encoding of binary data as specified by the Avro specification for encoding bytes into a JSON string. | ByteString.copyFromAvroString() |
enum | java.lang.String | java.lang.String | The value must be a symbol defined by the enum schema. | |
array | com.linkedin.data.DataList | com.linkedin.data.DataList (2) | Each element in the DataList must be a valid Java type for the schema's item type. For example, if the schema is an array of longs, then every element in the DataList must be a Number. | |
map | com.linkedin.data.DataMap | com.linkedin.data.DataMap (2) | Each value in the DataMap must be a valid Java type for the schema's value type. For example, if the schema is a map of longs, then every value in the DataMap must be a Number. | |
fixed | com.linkedin.data.ByteString | com.linkedin.data.ByteString (2), java.lang.String (3) | If the value is a String, the String must be a valid encoding of binary data as specified by the Avro specification for encoding bytes into a JSON string, and the correct size for the fixed schema type. If the value is a ByteString, the ByteString must be the correct size for the fixed schema type. | ByteString.copyFromAvroString() |
record | com.linkedin.data.DataMap | com.linkedin.data.DataMap (2) | Each key in the DataMap is used to look up a field in the record schema. The value associated with this key must be a valid Java type for the field's type. If the required validation option is enabled, then all required fields must also be present. | |
union | com.linkedin.data.DataMap | java.lang.String, com.linkedin.data.DataMap (2) | If the value is a String, the value must be Data.NULL. If the value is a DataMap, then the DataMap must have exactly one entry. The key of the entry must identify a member of the union schema, and the value must be a valid type for the identified union member's type. | |
(1) Even though any Number type is allowed and used for fixing up to the
desired type, the Data layer only allows Integer, Long, Float, and
Double values to be held in a DataMap or DataList.

(2) No fix-up is performed.

(3) The String must be a valid encoding of binary data as specified by
the Avro specification for encoding bytes into a JSON string.
STRING_TO_PRIMITIVE Coercion Mode

This mode includes the allowed input types and associated validation and
coercions of NORMAL. In addition, it allows the following additional
input types and performs the following coercions on these additional
allowed input types.
Schema Type | Post-coercion Java Type | Pre-coercion Input Java Types | Validation Performed | Coercion Method |
---|---|---|---|---|
int | java.lang.Integer | java.lang.String | If the value is a String, it must be acceptable to BigDecimal(String val); otherwise it has to be a Number (see "NORMAL"). | (new BigDecimal(value)).intValue() |
long | java.lang.Long | java.lang.String | If the value is a String, it must be acceptable to BigDecimal(String val); otherwise it has to be a Number (see "NORMAL"). | (new BigDecimal(value)).longValue() |
float | java.lang.Float | java.lang.String | If the value is a String, it must be acceptable to BigDecimal(String val); otherwise it has to be a Number (see "NORMAL"). | (new BigDecimal(value)).floatValue() |
double | java.lang.Double | java.lang.String | If the value is a String, it must be acceptable to BigDecimal(String val); otherwise it has to be a Number (see "NORMAL"). | (new BigDecimal(value)).doubleValue() |
boolean | java.lang.Boolean | java.lang.String | If the value is a String, its value must be either "true" or "false" ignoring case; otherwise it has to be a Boolean (see "NORMAL"). | "true".equalsIgnoreCase(value) |
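Following the table above, a sketch of the string-to-primitive paths; the real implementation lives in the Data layer's validation code, so the helper names here are hypothetical and only mirror the documented BigDecimal route:

```java
import java.math.BigDecimal;

public class StringToPrimitiveDemo {
    // Coerce a String to an Integer the way STRING_TO_PRIMITIVE does for an
    // "int" schema: the string must be acceptable to new BigDecimal(String).
    static Integer coerceToInt(String value) {
        // throws NumberFormatException if the string is not numeric
        return new BigDecimal(value).intValue();
    }

    // Boolean coercion accepts "true" or "false", ignoring case.
    static Boolean coerceToBoolean(String value) {
        if ("true".equalsIgnoreCase(value)) return Boolean.TRUE;
        if ("false".equalsIgnoreCase(value)) return Boolean.FALSE;
        throw new IllegalArgumentException("not a boolean: " + value);
    }

    public static void main(String[] args) {
        System.out.println(coerceToInt("65"));       // 65
        System.out.println(coerceToBoolean("TRUE")); // true
    }
}
```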
The result of validation is returned through an instance of the
ValidationResult class. This class has the following methods:
Method | Description |
---|---|
boolean hasFix() | Whether any fix-ups (i.e., modification or replacement of input Data layer objects) have been proposed. Fixes may be proposed but not applied because fixes cannot be applied to read-only complex objects. |
boolean hasFixupReadOnlyError() | Whether any fix-ups could not be applied because of read-only complex objects. |
Object getFixed() | Returns the fixed object. In-place fixes may or may not be possible because some objects are immutable. For example, if the schema type is "fixed" and a String object is provided as the Data object, the fixed-up object returned will be a ByteString. Since String and ByteString are both immutable and have different types, the fixed object will be a different object, i.e. the fix-up cannot be done in-place. For complex objects, the fix-ups can be applied in place, because the new values can replace the old values in a DataMap or DataList. |
boolean isValid() | Whether the fixed object returned by getFixed() is free of errors. If it returns true, then the fixed object and its descendants conform to the provided schema. |
String getMessage() | Provides details on validation and fix-up failures. Returns an empty string if isValid() is true and fix-ups/validation have occurred without problems. |
Note: Schema validation and coercion are currently explicit operations. They are not implicitly performed when data are de-serialized as part of remote invocations.
The Data Template layer provides Java type-safe access to the underlying
data stored in the Data layer. It has explicit knowledge of the schema
of the data stored. The code generator generates classes for complex
schema types that derive from base classes in this layer. The common
base class of these generated classes is
com.linkedin.data.template.DataTemplate. Typically,
a DataTemplate instance is an overlay or wrapper for a DataMap or
DataList instance. It allows type-safe access to the underlying data
in the DataMap or DataList. (The exception is FixedTemplate,
which is a subclass of DataTemplate for fixed schema types.)
The Data Template layer provides the following abstract base classes that are used to construct Java bindings for different complex schema types.

Class | Underlying Data | Description |
---|---|---|
AbstractArrayTemplate | DataList | Base class for array types. |
DirectArrayTemplate | DataList | Base class for array types containing unwrapped item types; extends AbstractArrayTemplate. |
WrappingArrayTemplate | DataList | Base class for array types containing wrapped item types; extends AbstractArrayTemplate. |
AbstractMapTemplate | DataMap | Base class for map types. |
DirectMapTemplate | DataMap | Base class for map types containing unwrapped value types; extends AbstractMapTemplate. |
WrappingMapTemplate | DataMap | Base class for map types containing wrapped value types; extends AbstractMapTemplate. |
FixedTemplate | ByteString | Base class for fixed types. |
RecordTemplate | DataMap | Base class for record types. |
ExceptionTemplate | DataMap | Base class for record types that are declared as errors. |
UnionTemplate | DataMap | Base class for union types. |
The unwrapped schema types are those whose Java type-safe binding is the
same as their data type in the Data layer: int, long, float, double,
boolean, string, bytes, and enum.

The wrapped schema types are types whose Java type-safe bindings are not
the same as their data type in the Data layer. These types require a
DataTemplate wrapper to provide type-safe access to the underlying
data managed by the Data layer. The wrapped types are: array, map,
fixed, record, error, and union.
Enum is an unwrapped type even though its Java type-safe binding is
not the same as its storage type in the Data layer. This is because enum
conversions are done through coercing to and from java.lang.String,
implemented by the Data Template layer. This is similar to the coercing
between different numeric types, also implemented by the Data Template
layer.
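This string coercion for enums can be sketched as follows; the `Fruits` enum is hypothetical and stands in for a generated enum class:

```java
public class EnumCoercionDemo {
    // Hypothetical enum standing in for a generated enum class.
    enum Fruits { APPLE, ORANGE }

    // The Data layer stores the symbol as a String; the Data Template layer
    // coerces between the String and the enum constant.
    static String toData(Fruits value) {
        return value.name();
    }

    static Fruits fromData(String symbol) {
        return Fruits.valueOf(symbol);
    }

    public static void main(String[] args) {
        String stored = toData(Fruits.APPLE); // what the Data layer holds
        System.out.println(stored);           // APPLE
        System.out.println(fromData(stored)); // APPLE
    }
}
```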
The following table shows the relationships among the types defined in the data schema, the types stored and managed by the Data layer, and the types of the Java binding in the Data Template layer.

Schema Type | Data Layer | Data Template Layer |
---|---|---|
int | java.lang.Integer | Coerced to java.lang.Integer or int (2) |
long | java.lang.Integer or java.lang.Long (1) | Coerced to java.lang.Long or long (2) |
float | java.lang.Integer, java.lang.Long, java.lang.Float, or java.lang.Double (1) | Coerced to java.lang.Float or float (2) |
double | java.lang.Integer, java.lang.Long, java.lang.Float, or java.lang.Double (1) | Coerced to java.lang.Double or double (2) |
boolean | java.lang.Boolean | Coerced to java.lang.Boolean or boolean (2) |
string | java.lang.String | java.lang.String |
bytes | java.lang.String or com.linkedin.data.ByteString (3) | com.linkedin.data.ByteString |
enum | java.lang.String | Generated enum class. |
array | com.linkedin.data.DataList | Generated or built-in array class. |
map | com.linkedin.data.DataMap | Generated or built-in map class. |
fixed | java.lang.String or com.linkedin.data.ByteString | Generated class that derives from FixedTemplate. |
record | com.linkedin.data.DataMap | Generated class that derives from RecordTemplate. |
error | com.linkedin.data.DataMap | Generated class that derives from ExceptionTemplate. |
union | com.linkedin.data.DataMap | Generated class that derives from UnionTemplate. |
(1) When a JSON object is deserialized, the actual schema type is not
known. Typically, the smallest sized type that can represent the
deserialized value will be used to store the value in-memory.
(2) Depending on the method, un-boxed types will be preferred to boxed
types if applicable and the input or output arguments can never be
null.
(3) When a JSON object is deserialized, the actual schema type is not
known for bytes and fixed. Values of bytes and fixed types are stored as
strings because the serialized representation is a string. However, ByteString
is an equally valid Java type for these schema types.