PDSC is a Pegasus schema definition language, and it is inspired by the Avro 1.4.1 specification.The Pegasus schema definition language is inspired by the Avro 1.4.1 specification.
The Pegasus schema differs from Avro in the following ways:
.pdsc
instead
of .avsc
.Each schema should be stored in its own file with a .pdsc
extension.
The Pegasus code generator implements a resolver that is similar to Java
class loaders. If there is a reference to a named schema, the code
generator will try to look for a file in the code generator’s resolver
path. The resolver path is similar to a Java classpath. The fully
qualified name of the named schema will be translated to a relative file
name. The relative file name is computed by replacing dots (“.”) in the
fully qualified name by the directory path separator (typically “/”) and
appending a .pdsc
extension. This relative file name is appended to
each path in the resolver path. The resolver opens each of these files
until it finds a file that contains the named schema.
The named schema declarations support the following attributes:
type
a JSON string providing the type of the named schema
(required).name
a JSON string providing the name of the named schema
(required).namespace
a JSON string that qualifies the namespace for the named
schema.package
a JSON string that qualifies the language binding
namespace for the named schema (optional). If this is not specified,
language binding class for the named schema will use namespace
as
its default namespace.doc
a JSON string providing documentation to the user of this
named schema (optional).The named schemas with type “enum” also supports a symbolDocs
attribute to provide documentation for each enum symbol.
Note: Due to the addition of doclint in JDK8, anything under the
doc
or symbolDocs
attribute must be W3C HTML 4.01 compliant. This is
because the contents of this string will appear as Javadocs in the
generated Java ‘data template’ classes later. Please take this into
consideration when writing your documentation.
The following are a few example schemas and their file names.
com/linkedin/pegasus/generator/examples/Foo.pdsc
{
"type" : "record",
"name" : "Foo",
"namespace" : "com.linkedin.pegasus.generator.examples",
"doc" : "A foo record",
"fields" : [
{ "name" : "intField", "type" : "int" },
{ "name" : "longField", "type" : "long" },
{ "name" : "floatField", "type" : "float" },
{ "name" : "doubleField", "type" : "double" },
{ "name" : "bytesField", "type" : "bytes" },
{ "name" : "stringField", "type" : "string" },
{ "name" : "fruitsField", "type" : "Fruits" },
{ "name" : "intArrayField", "type" : { "type" : "array", "items" : "int" } },
{ "name" : "stringMapField", "type" : { "type" : "map", "values" : "string" } },
{
"name" : "unionField",
"type" : [
"int",
"string",
"Fruits",
"Foo",
{ "type" : "array", "items" : "string" },
{ "type" : "map", "values" : "long" },
"null"
]
}
]
}
com/linkedin/pegasus/generator/examples/FooWithNamespaceOverride.pdsc (please see Java Binding section to see how this “package” affects generated java class.)
{
"type" : "record",
"name" : "FooWithNamespaceOverride",
"namespace" : "com.linkedin.pegasus.generator.examples",
"package" : "com.linkedin.pegasus.generator.examples.record",
"doc" : "A foo record",
"fields" : [
{ "name" : "intField", "type" : "int" },
{ "name" : "longField", "type" : "long" },
{ "name" : "floatField", "type" : "float" },
{ "name" : "doubleField", "type" : "double" },
{ "name" : "bytesField", "type" : "bytes" },
{ "name" : "stringField", "type" : "string" },
{ "name" : "fruitsField", "type" : "Fruits" },
{ "name" : "intArrayField", "type" : { "type" : "array", "items" : "int" } },
{ "name" : "stringMapField", "type" : { "type" : "map", "values" : "string" } },
{
"name" : "unionField",
"type" : [
"int",
"string",
"Fruits",
"Foo",
{ "type" : "array", "items" : "string" },
{ "type" : "map", "values" : "long" },
"null"
]
}
]
}
com/linkedin/pegasus/generator/examples/Fruits.pdsc
{
"type" : "enum",
"name" : "Fruits",
"namespace" : "com.linkedin.pegasus.generator.examples",
"doc" : "A fruit",
"symbols" : [ "APPLE", "BANANA", "ORANGE", "PINEAPPLE" ],
"symbolDocs" : { "APPLE":"A red, yellow or green fruit.", "BANANA":"A yellow fruit.", "ORANGE":"An orange fruit.", "PINEAPPLE":"A yellow fruit."}
}
com/linkedin/pegasus/generator/examples/MD5.pdsc
{
"type" : "fixed",
"name" : "MD5",
"namespace" : "com.linkedin.pegasus.generator.examples",
"doc" : "MD5",
"size" : 16
}
com/linkedin/pegasus/generator/examples/StringList.pdsc
{
"type" : "record",
"name" : "StringList",
"namespace" : "com.linkedin.pegasus.generator.examples",
"doc" : "A list of strings",
"fields" : [
{ "name" : "element", "type" : "string" },
{ "name" : "next" , "type" : "StringList", "optional" : true }
]
}
com/linkedin/pegasus/generator/examples/InlinedExample.pdsc
{
"type": "record",
"name": "InlinedExample",
"namespace": "com.linkedin.pegasus.generator.examples",
"doc": "Example on how you can declare an enum and a record inside another record",
"fields": [
{
"name": "myEnumField",
"type": {
"type" : "enum",
"name" : "EnumDeclarationInTheSameFile",
"symbols" : ["FOO", "BAR", "BAZ"]
},
"doc": "This is how we inline enum declaration without creating a new pdsc file",
"symbolDocs": {"FOO":"It's a foo!", "HASH":"It's a bar!", "NONE":"It's a baz!"}
},
{
"name": "stringField",
"type": "string",
"doc": "A regular string"
},
{
"name": "intField",
"type": "int",
"doc": "A regular int"
},
{
"name": "UnionFieldWithInlineRecordAndEnum",
"doc": "In this example we will declare a record and an enum inside a union",
"type": [
{
"type" : "record",
"name" : "myRecord",
"fields": [
{
"name": "foo1",
"type": "int",
"doc": "random int field"
},
{
"name": "foo2",
"type": "int",
"doc": "random int field"
}
]
},
{
"name": "anotherEnum",
"type" : "enum",
"symbols" : ["FOOFOO", "BARBAR"],
"doc": "Random enum",
"symbolDocs": {"FOOFOO":"description about FOOFOO", "BARBAR":"description about BARBAR"}
}
],
"optional": true
}
]
}
Pegasus supports a new schema type known as a typeref. A typeref is like a typedef in C. It does not declare a new type but declares an alias to an existing type.
Typerefs use the type name typeref
and support the following attributes:
name
a JSON string providing the name of the typeref (required).namespace
a JSON string that qualifies the name;package
a JSON string that qualifies the language binding
namespace for this typeref (optional). If this is not specified,
language binding class for the typeref will use namespace
as its
default namespace.doc
a JSON string providing documentation to the user of this
schema (optional).ref
the schema that the typeref refers to.Note: Due to the addition of doclint in JDK8, anything under the
doc
attribute must be W3C HTML 4.01 compliant. This is because the
contents of this string will appear as Javadocs in the generated Java
‘data template’ classes later. Please take this into consideration
when writing your documentation.
Here are a few examples:
Differentiate URN from string
{
"type" : "typeref",
"name" : "URN",
"ref" : "string",
"doc" : "A URN, the format is defined by RFC 2141"
}
Differentiate time from long
{
"type" : "typeref",
"name" : "time",
"ref" : "long",
"doc" : "Time in milliseconds since Jan 1, 1970 UTC"
}
Typerefs (by default) do not alter the serialization format or in-memory representation.
You can use the following record field attributes to annotate fields with metadata.
optional
- marks the field as optional, meaning it doesn’t require
a value. See Optional Fields.Pegasus supports optional fields implicitly through the “optional” flag in the field definition. A field is required unless the field is declared with the optional flag set to true. A field without the optional flag is required. A field with the optional flag set to false is also required. An optional field may be present or absent in the in-memory data structure or serialized data.
The Java binding provides methods to determine if an optional field is present and a specialized get accessor allows the caller to specify whether the default value or null should be returned when an absent optional field is accessed. The has field accessor may also be used to determine if an optional field is present.
{
"type" : "record",
"name" : "Optional",
"namespace" : "com.linkedin.pegasus.generator.examples",
"fields" :
[
{
"name" : "foo",
"type" : "string",
"optional" : true
}
]
}
Optional field present
{
"foo" : "abcd"
}
Optional field absent
{
}
See GetMode for more detailed information on how the code generated data stubs access optional/default fields.
DO NOT USE UNION WITH NULL TO DECLARE AN OPTIONAL FIELD IN PEGASUS.
Avro’s approach to declaring an optional field is to use a union type whose values may be null or the desired value type. Pegasus discourages this practice and may remove support for union with null as well as the null type in the future. The reason for this is that declaring an optional field using a union causes the optional field to be always present in the underlying in-memory data structure and serialized data. Absence of a value is represented by a null value. Presence of a value is represented by a value in the union of the value’s type.
{
"type" : "record",
"name" : "OptionalWithUnion",
"namespace" : "com.linkedin.pegasus.generator.examples",
"fields" :
[
{
"name" : "foo",
"type" : ["string", "null"]
}
]
}
optional field present
{
"foo" : { "string" : "abcd" }
}
optional field absent
{
"foo" : null
}
Note Avro uses the union approach because Avro serialization is optimized to not include a field identifier in the serialized data.
Union is a powerful way to model data that can be of different types at
different scenarios. Record fields that have this behavior can be
defined as a Union type with the expected value types as its members.
Here is an example of a record field defined an a union with string
and array
as its members.
{
"type" : "record",
"name" : "RecordWithUnion",
"namespace" : "com.linkedin.pegasus.examples",
"fields" :
[
{
"name" : "result",
"type" : [
"string",
{
"type" : "array",
"items" : "Result"
}
]
}
]
}
In addition to the unions with disparate member types, Pegasus unions can have more than one member of the same type inside a single union definition. For example a union can have multiple members that are of,
Such unions must specify an unique alias for each member in the union definition. When an alias is specified, it acts as the member’s discriminator unlike the member type name on the standard unions defined above.
In the example below, the union definition has a string
and two
array
members with unique aliases for each of them.
{
"type" : "record",
"name" : "RecordWithAliasedUnion",
"namespace" : "com.linkedin.pegasus.examples",
"fields" :
[
{
"name" : "result",
"type" : [
{
"type" : "string",
"alias" : "message"
},
{
"type": {
"type" : "array",
"items" : "Result"
},
"alias" : "successResults"
},
{
"type": {
"type" : "array",
"items" : "Result"
},
"alias" : "failureResults"
}
]
}
]
}
There are few constraints that must be taken in consideration while specifying aliases for union members,
null
member types which means
there can only be one null member inside a union definition.A field may be declared to have a default value. The default value will be validated against the schema type of the field. For example, if the type is record and it has non-optional fields, then the default value must include the non-optional fields in the default value.
A default value may be declared for a required or an optional field. The default value is used by the get accessor. Unlike Avro, default values are not assigned to absent fields on de-serialization.
If a default value is declared for a required field, for all three
GetMode, the get accessor behaves the same as that for an
optional field with a default value. More specifically, the get accessor
doesn’t fail even if the field is absent. However, in this case, data
validation may fail if
RequiredMode is set to MUST_BE_PRESENT
.
{
"type" : "record",
"name" : "Default",
"namespace" : "com.linkedin.pegasus.generator.examples",
"fields" :
[
{
"name" : "mandatoryWithDefault",
"type" : "string",
"default" : "this is the default string"
},
{
"name" : "optionalWithDefault",
"type" : "string",
"optional" : true,
"default" : "this is the default string"
}
]
}
See GetMode for more detailed information on how the code generated data stubs access optional/default fields.
An Avro default value for a union type does not include the member discriminator and the type of the default value must be the first member type in the list of member types.
{
...
"fields" :
[
{
"name" : "foo",
"type" : [ "int", "string" ],
"default" : 42
}
]
}
A Pegasus default value for a union type must include the member discriminator. This allows the same typeref’ed union to have default values of different member types.
{
...
"fields" :
[
{
"name" : "foo",
"type" : [ "int", "string" ]
"default" : { "int" : 42 }
}
]
}
For unions with aliased members, the specified alias is used as member discriminator instead of the type name.
{
...
"fields" :
[
{
"name" : "foo",
"type" : [
{ "type" : "int", "alias" : "count" },
{ "type" : "string", "alias" : "message" }
],
"default" : { "count" : 42 }
}
]
}
Note the Avro syntax optimizes the most common union with null pattern that is used to represent optional fields to be less verbose. However, Pegasus has explicit support for optional fields and discourages the use of union with null. This significantly reduces the syntactical sugar benefit of the Avro optimization.
If a union type may be translated to Avro, all default values provided for the union type must be of the first member type in the list of member types. This restriction is because Avro requires default values of a union type to be of the first member type of the union.
Pegasus implements the “include” attribute. It is used to include all fields from another record into the current record. It does not include any other attribute of the record. Include is transitive, if record A includes record B and record B includes record C, record A contains all the fields declared in record A, record B and record C.
The value of the “include” attribute should be a list of records or typerefs of records. It is an error to specify non-record types in this list.
{
"doc" : "Bar includes fields of Foo, Bar will have fields f1 from itself and b1 from Bar",
"type" : "record",
"name" : "Bar",
"include" : [ "Foo" ],
"fields" : [
{
"name" : "b1",
"type" : "string",
}
]
}
{
"type" : "record",
"name" : "Foo",
"fields" : [
{
"name" : "f1",
"type" : "string",
}
]
}
Named schema, record field, and enum symbol can be deprecated using
deprecated
property or deprecatedSymbols
property in enum
declaration. The property value can be a string describing why the
schema element is deprecated or an alternative, or simply boolean
true
. However, the latter use is discouraged and may be removed in the
future. The Java binding generated for these elements will be marked as
deprecated.
To deprecate a named schema, add deprecated
property to its
declaration.
{
"type" : "record",
"name" : "Deprecated",
"namespace" : "com.linkedin.pegasus.generator.test",
"deprecated": "Use Foo instead.",
"fields" : [
...
]
}
To deprecate a record field, add deprecated
property to its
declaration.
{
"type" : "record",
"name" : "Foo",
"namespace" : "com.linkedin.pegasus.generator.test",
"fields" : [
{
"name" : "deprecatedInt",
"type" : "int",
"deprecated": "Reason for int deprecation."
}
]
}
To deprecate an enum symbol, add deprecatedSymbols
property to the
enum declaration. The value of the property is a map from the symbol
name to a string description.
{
"name" : "Planet",
"namespace" : "com.linkedin.pegasus.generator.test",
"type" : "enum",
"symbols" : [ "MERCURY", "VENUS", "EARTH", "MARS", "JUPITER", "SATURN", "URANUS", "NEPTUNE", "PLUTO" ],
"deprecatedSymbols": {
"PLUTO": "Reclassified as dwarf planet."
}
}