Protocol buffers
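Serialization
Serialization (also called marshalling) is the process of converting a data structure or an object's state into a format that can be stored (for example, in a file or in memory) or transmitted (for example, over a network). The reverse operation is deserialization (unmarshalling).
In 1987, Sun Microsystems published XDR. In the late 1990s, XML became popular: it is a human-readable, text-based encoding that is easy to read and understand, but it gives up the compactness of byte-stream-based encodings.
JSON is a more lightweight text-based encoding that is widely used for client/server communication. YAML is similar to JSON; it adds more powerful features, is easier for humans to read, and is often more compact in its written form.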
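To make the size trade-off between text and binary encodings concrete, here is a minimal sketch that encodes the same record as JSON text and as a fixed binary layout, using only Python's standard `json` and `struct` modules. The record and its field names are made up for illustration, and the binary layout is an ad hoc one, not Protobuf's wire format.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"id": 7, "score": 0.5, "active": True}

# Text encoding: readable, but field names and punctuation travel with every message.
text_bytes = json.dumps(record).encode("utf-8")

# Binary encoding: both sides agree on the layout in advance.
# "<" = little-endian without padding, I = uint32, d = float64, ? = bool.
LAYOUT = "<Id?"
binary_bytes = struct.pack(LAYOUT, record["id"], record["score"], record["active"])

# Deserialization reverses each step.
decoded_text = json.loads(text_bytes.decode("utf-8"))
decoded_binary = struct.unpack(LAYOUT, binary_bytes)

print(len(text_bytes), len(binary_bytes))
```

The binary form is smaller because the schema lives outside the message, which is the same idea Protobuf builds on with its `.proto` definitions.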
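Besides these formats and Protobuf, there are many other serialization formats, such as Thrift, Avro, BSON, CBOR, and MessagePack, as well as many encodings that are not cross-language. The gosercomp project compares a range of Go serialization libraries, covering serialization and deserialization performance as well as the size of the serialized data. Overall, Protobuf performs comparatively well at both serialization and deserialization, and its encoded size is also good.
Protocol buffers provide a serialization format for packets of typed, structured data that are up to a few megabytes in size. The format is suitable for both ephemeral network traffic and long-term data storage. Protocol buffers can be extended with new information without invalidating existing data or requiring code to be updated.
Protobuf comprises the definition of the serialization format, libraries for various languages, and an IDL compiler. Normally you define a .proto file and then use the IDL compiler to generate code for the language you need.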
Defining a Message Type
https://protobuf.dev/programming-guides/proto3/#scalar
First let’s look at a very simple example. Let’s say you want to define a search request message format:

    syntax = "proto3";

    message SearchRequest {
      string query = 1;
      repeated int32 page_number = 2;
      int32 results_per_page = 3;
    }

- The first line of the file specifies that you’re using the proto3 revision of the protobuf language spec.
  - The `edition` (or `syntax` for proto2/proto3) must be the first non-empty, non-comment line of the file.
  - If no `edition` or `syntax` is specified, the protocol buffer compiler will assume you are using proto2.
- The `SearchRequest` message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.
Assigning Field Numbers
You must give each field in your message definition a number between 1 and 536,870,911 with the following restrictions:
- The given number must be unique among all fields for that message.
- Field numbers 19,000 to 19,999 are reserved for the Protocol Buffers implementation. The protocol buffer compiler will complain if you use one of these reserved field numbers in your message.
- You cannot use any previously reserved field numbers or any field numbers that have been allocated to extensions.
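
The numbering rules above can be sketched as a small checker. This is an illustrative helper, not part of any Protobuf library: `check_field_numbers` and its `reserved` parameter are hypothetical names, and the real checks are performed by the protocol buffer compiler.

```python
MAX_FIELD_NUMBER = 536_870_911                    # 2**29 - 1
IMPLEMENTATION_RESERVED = range(19_000, 20_000)   # 19,000 through 19,999


def check_field_numbers(numbers, reserved=()):
    """Return a list of human-readable problems; an empty list means all rules pass.

    `reserved` models numbers previously reserved in the message (hypothetical input).
    """
    problems = []
    seen = set()
    for number in numbers:
        if number in seen:
            problems.append(f"{number}: used more than once in this message")
        seen.add(number)
        if not 1 <= number <= MAX_FIELD_NUMBER:
            problems.append(f"{number}: outside 1..{MAX_FIELD_NUMBER}")
        elif number in IMPLEMENTATION_RESERVED:
            problems.append(f"{number}: reserved for the Protocol Buffers implementation")
        elif number in reserved:
            problems.append(f"{number}: previously reserved in this message")
    return problems


# The three fields of SearchRequest pass every rule.
print(check_field_numbers([1, 2, 3]))
# A duplicate and an implementation-reserved number are both reported.
print(check_field_numbers([1, 1, 19_500]))
```
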
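Once your message type is in use, this number cannot be changed, because it identifies the field in the message's binary (wire) format. "Changing" a field number is equivalent to deleting that field and creating a new field with the same type but a new number.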
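
Why renumbering breaks compatibility can be seen from how a field is laid out on the wire. The sketch below assumes the tag layout described in the Protobuf encoding guide (field number shifted left by three bits, combined with a wire type, then varint-encoded), which this section does not itself cover. The helper names are hypothetical, and the encoder handles only a single string field.

```python
WIRE_TYPE_LEN = 2  # length-delimited wire type, used for strings


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field as tag + length + UTF-8 payload."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | WIRE_TYPE_LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


as_field_1 = encode_string_field(1, "go")  # e.g. `string query = 1;`
as_field_4 = encode_string_field(4, "go")  # same data after renumbering to 4

print(as_field_1.hex(), as_field_4.hex())
```

Only the leading tag byte differs between the two encodings, yet a reader built against the old definition looks for field 1 and treats field 4 as unknown, so the value is effectively lost to it.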
Specifying Field Cardinality
Message fields can be one of the following:
- Singular: In proto3, there are two types of singular fields:
  - `optional`: (recommended) An `optional` field is in one of two possible states:
    - the field is set, and contains a value that was explicitly set or parsed from the wire. It will be serialized to the wire.
    - the field is unset, and will return the default value. It will not be serialized to the wire.

    You can check to see if the value was explicitly set. `optional` is recommended over implicit fields for maximum compatibility with protobuf editions and proto2.
  - implicit: (not recommended) An implicit field has no explicit cardinality label and behaves as follows:
    - if the field is a message type, it behaves just like an `optional` field.
    - if the field is not a message, it has two states:
      - the field is set to a non-default (non-zero) value that was explicitly set or parsed from the wire. It will be serialized to the wire.
      - the field is set to the default (zero) value. It will not be serialized to the wire. In fact, you cannot determine whether the default (zero) value was set or parsed from the wire or not provided at all. For more on this subject, see Field Presence.
- `repeated`: this field type can be repeated zero or more times in a well-formed message. The order of the repeated values will be preserved.
- `map`: this is a paired key/value field type. See Maps for more on this field type.
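
The difference between `optional` and implicit presence can be modeled with a toy encoder for a single integer field. This is a simplified sketch rather than generated Protobuf code: `None` stands in for "unset", the function names are hypothetical, and the tag layout (field number shifted left by three bits, wire type 0 for varints) is taken from the Protobuf encoding guide.

```python
from typing import Optional

WIRE_TYPE_VARINT = 0


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    return encode_varint((field_number << 3) | WIRE_TYPE_VARINT) + encode_varint(value)


def serialize_implicit(field_number: int, value: int) -> bytes:
    """Implicit presence: the default (zero) value is never written."""
    return b"" if value == 0 else encode_int_field(field_number, value)


def serialize_optional(field_number: int, value: Optional[int]) -> bytes:
    """Explicit presence: any value that was set, including zero, is written."""
    return b"" if value is None else encode_int_field(field_number, value)


print(serialize_implicit(3, 0))     # nothing on the wire
print(serialize_optional(3, 0))     # tag and value are written
print(serialize_optional(3, None))  # unset, nothing on the wire
```

With implicit presence, a reader that receives no bytes for the field cannot tell "explicitly zero" from "never set". With `optional`, the two cases produce different output.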
Scalar Value Types
A scalar message field can have one of the following types – the table shows the type specified in the .proto file, and the corresponding type in the automatically generated class:
| Proto Type | Notes |
|---|---|
| double | |
| float | |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. |
| uint32 | Uses variable-length encoding. |
| uint64 | Uses variable-length encoding. |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. |
| sfixed32 | Always four bytes. |
| sfixed64 | Always eight bytes. |
| bool | |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2^32. |
| bytes | May contain any arbitrary sequence of bytes no longer than 2^32. |
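
The size notes in the table can be checked with a small sketch of the three integer encodings. It assumes the varint and ZigZag schemes as described in the Protobuf encoding guide. The function names are hypothetical, and only 32-bit values are covered.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Base-128 varint; the value is treated as an unsigned 64-bit integer."""
    value &= 0xFFFFFFFFFFFFFFFF  # negative numbers become their two's-complement form
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def zigzag32(n: int) -> int:
    """Map signed 32-bit integers to unsigned ones: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def encode_int32(n: int) -> bytes:
    return encode_varint(n)            # negatives are sign-extended to 64 bits


def encode_sint32(n: int) -> bytes:
    return encode_varint(zigzag32(n))  # small magnitudes stay small


def encode_fixed32(n: int) -> bytes:
    return struct.pack("<I", n)        # always four bytes, little-endian


print(len(encode_int32(-1)), len(encode_sint32(-1)))
print(len(encode_varint(2**28)), len(encode_fixed32(2**28)))
```

A negative `int32` occupies the full ten-byte varint, while the same value as `sint32` fits in one byte. From 2^28 upward, a varint needs five bytes, which is where `fixed32` starts to win.
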
| Proto Type | C++ Type | Java/Kotlin Type[1] | Python Type[3] | Go Type | Ruby Type | C# Type | PHP Type | Dart Type | Rust Type |
|---|---|---|---|---|---|---|---|---|---|
| double | double | double | float | float64 | Float | double | float | double | f64 |
| float | float | float | float | float32 | Float | float | float | double | f32 |
| int32 | int32_t | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int | i32 |
| int64 | int64_t | long | int/long[4] | int64 | Bignum | long | integer/string[6] | Int64 | i64 |
| uint32 | uint32_t | int[2] | int/long[4] | uint32 | Fixnum or Bignum (as required) | uint | integer | int | u32 |
| uint64 | uint64_t | long[2] | int/long[4] | uint64 | Bignum | ulong | integer/string[6] | Int64 | u64 |
| sint32 | int32_t | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int | i32 |
| sint64 | int64_t | long | int/long[4] | int64 | Bignum | long | integer/string[6] | Int64 | i64 |
| fixed32 | uint32_t | int[2] | int/long[4] | uint32 | Fixnum or Bignum (as required) | uint | integer | int | u32 |
| fixed64 | uint64_t | long[2] | int/long[4] | uint64 | Bignum | ulong | integer/string[6] | Int64 | u64 |
| sfixed32 | int32_t | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int | i32 |
| sfixed64 | int64_t | long | int/long[4] | int64 | Bignum | long | integer/string[6] | Int64 | i64 |
| bool | bool | boolean | bool | bool | TrueClass/FalseClass | bool | boolean | bool | bool |
| string | std::string | String | str/unicode[5] | string | String (UTF-8) | string | string | String | ProtoString |
| bytes | std::string | ByteString | str (Python 2), bytes (Python 3) | []byte | String (ASCII-8BIT) | ByteString | string | List | ProtoBytes |
Default Field Values
When a message is parsed, if the encoded message bytes do not contain a particular field, accessing that field in the parsed object returns the default value for that field. The default values are type-specific:
- For strings, the default value is the empty string.
- For bytes, the default value is empty bytes.
- For bools, the default value is false.
- For numeric types, the default value is zero.
- For message fields, the field is not set. Its exact value is language-dependent. See the generated code guide for details.
- For enums, the default value is the first defined enum value, which must be 0. See Enum Default Value.
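
The defaulting rules above can be modeled with a small lookup. This is a toy sketch, not the behavior of any generated class: `ParsedMessage`, `default_for`, and the type-name strings are hypothetical, `None` stands in for an unset message field (whose exact value is language-dependent), and enum values are represented by their numbers.

```python
INTEGER_TYPES = {
    "int32", "int64", "uint32", "uint64", "sint32", "sint64",
    "fixed32", "fixed64", "sfixed32", "sfixed64",
}


def default_for(proto_type: str):
    """Return the default value for a field type, following the list above."""
    if proto_type == "string":
        return ""
    if proto_type == "bytes":
        return b""
    if proto_type == "bool":
        return False
    if proto_type in INTEGER_TYPES:
        return 0
    if proto_type in ("double", "float"):
        return 0.0
    if proto_type == "enum":
        return 0      # the first defined enum value, which must be 0
    if proto_type == "message":
        return None   # unset; the real representation depends on the language
    raise ValueError(f"unknown type: {proto_type}")


class ParsedMessage:
    """Holds the fields found in the encoded bytes and defaults the rest."""

    def __init__(self, schema, present):
        self._schema = schema          # field name -> type name
        self._present = dict(present)  # only the fields that were on the wire

    def get(self, name):
        if name in self._present:
            return self._present[name]
        return default_for(self._schema[name])


schema = {"query": "string", "results_per_page": "int32"}
message = ParsedMessage(schema, {"query": "protobuf"})
print(message.get("query"), message.get("results_per_page"))
```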