From: Mark Slee
Date: Tue, 3 Apr 2007 19:47:04 +0000 (+0000)
Subject: Fixing typos in the Thrift whitepaper
X-Git-Tag: 0.2.0~1390
X-Git-Url: https://source.supwisdom.com/gerrit/gitweb?a=commitdiff_plain;h=afaf27607d24de7f494a7823d38b2c8a31878934;p=common%2Fthrift.git

Fixing typos in the Thrift whitepaper

Reviewed By: eugene, yishan

git-svn-id: https://svn.apache.org/repos/asf/incubator/thrift/trunk@665082 13f79535-47bb-0310-9956-ffa450edef68
---

diff --git a/doc/thrift.pdf b/doc/thrift.pdf
index 2bbea909..8aba919e 100644
Binary files a/doc/thrift.pdf and b/doc/thrift.pdf differ
diff --git a/doc/thrift.tex b/doc/thrift.tex
index cb7582f6..d93bfa38 100644
--- a/doc/thrift.tex
+++ b/doc/thrift.tex
@@ -43,7 +43,7 @@ backend services. Its primary goal is to enable efficient and reliable communication across programming languages by abstracting the portions of each language that tend to require the most customization into a common library that is implemented in each language. Specifically, Thrift allows developers to
-define data types and service interfaces in a single language-neutral file
+define datatypes and service interfaces in a single language-neutral file
and generate all the necessary code to build RPC clients and servers. This paper details the motivations and design choices we made in Thrift, as
@@ -69,13 +69,13 @@ these services, various programming languages have been selected to optimize for the right combination of performance, ease and speed of development, availability of existing libraries, etc. By and large, Facebook's engineering culture has tended towards choosing the best
-tools and implementations avaiable over standardizing on any one
+tools and implementations available over standardizing on any one
programming language and begrudgingly accepting its inherent limitations. Given this design choice, we were presented with the challenge of building a transparent, high-performance bridge across many programming languages.
We found that most available solutions were either too limited, did not offer
-sufficient data type freedom, or suffered from subpar performance.
+sufficient datatype freedom, or suffered from subpar performance.
\footnote{See Appendix A for a discussion of alternative systems.} The solution that we have implemented combines a language-neutral software
@@ -83,8 +83,8 @@ stack implemented across numerous programming languages and an associated code generation engine that transforms a simple interface and data definition language into client and server remote procedure call libraries. Choosing static code generation over a dynamic system allows us to create
-validated code with implicit guarantees that can be run without the need for
-any advanced intropsecive run-time type checking. It is also designed to
+validated code that can be run without the need for
+any advanced introspective run-time type checking. It is also designed to
be as simple as possible for the developer, who can typically define all the necessary data structures and interfaces for a complex service in a single short file.
@@ -97,7 +97,7 @@ In evaluating the challenges of cross-language interaction in a networked environment, some key components were identified: \textit{Types.} A common type system must exist across programming languages
-without requiring that the application developer use custom Thrift data types
+without requiring that the application developer use custom Thrift datatypes
or write their own serialization code. That is, a C++ programmer should be able to transparently exchange a strongly typed STL map for a dynamic Python dictionary. Neither
@@ -111,14 +111,14 @@ The same application code should be able to run against TCP stream sockets, raw data in memory, or files on disk. Section 3 details the Thrift Transport layer.
-\textit{Protocol.} Data types must have some way of using the Transport
+\textit{Protocol.} Datatypes must have some way of using the Transport
layer to encode and decode themselves. Again, the application developer need not be concerned by this layer. Whether the service uses an XML or binary protocol is immaterial to the application code. All that matters is that the data can be read and written in a consistent, deterministic matter. Section 4 details the Thrift Protocol layer.
-\textit{Versioning.} For robust services, the involved data types must
+\textit{Versioning.} For robust services, the involved datatypes must
provide a mechanism for versioning themselves. Specifically, it should be possible to add or remove fields in an object or alter the argument list of a function without any interruption in service (or,
@@ -138,7 +138,8 @@ The goal of the Thrift type system is to enable programmers to develop using completely natively defined types, no matter what programming language they use. By design, the Thrift type system does not introduce any special dynamic types or wrapper objects. It also does not require that the developer write
-any code for object serialization or transport. The Thrift IDL file is
+any code for object serialization or transport. The Thrift IDL (Interface
+Definition Language) file is
logically a way for developers to annotate their data structures with the minimal amount of extra information necessary to tell a code generator how to safely transport the objects across languages.
@@ -172,20 +173,32 @@ identifiers. In this case, the sign is irrelevant. Signed integers serve this same purpose and can be safely cast to their unsigned counterparts (most commonly in C++) when absolutely necessary.
+\subsection{Structs}
+
+A Thrift struct defines a common object to be used across languages. A struct
+is essentially equivalent to a class in object oriented programming
+languages. A struct has a set of strongly typed fields, each with a unique
+name identifier. The basic syntax for defining a Thrift struct looks very
+similar to a C struct definition. Fields may be annotated with an integer field
+identifier (unique to the scope of that struct) and optional default values.
+Field identifiers will be automatically assigned if omitted, though they are
+strongly encouraged for versioning reasons discussed later.
+
\subsection{Containers}
Thrift containers are strongly typed containers that map to the most commonly used containers in common programming languages. They are annotated using
-C++ template (or Java Generics) style. There are three types available:
+the C++ template (or Java Generics) style. There are three types available:
\begin{itemize}
\item \texttt{list} An ordered list of elements. Translates directly into
-an STL vector, Java ArrayList, or native array in scripting languages. May
+an STL \texttt{vector}, Java \texttt{ArrayList}, or native array in scripting languages. May
contain duplicates.
\item \texttt{set} An unordered set of unique elements. Translates into
-an STL set, Java HashSet, or native dictionary in PHP/Python/Ruby.
+an STL \texttt{set}, Java \texttt{HashSet}, \texttt{set} in Python, or native
+dictionary in PHP/Ruby.
\item \texttt{map} A map of strictly unique keys to values
-Translates into an STL map, Java HashMap, PHP associative array,
-or Python/Ruby dictionary.
+Translates into an STL \texttt{map}, Java texttt{HashMap}, PHP associative
+array, or Python/Ruby dictionary.
\end{itemize}
While defaults are provided, the type mappings are not explicitly fixed. Custom
@@ -196,17 +209,6 @@ only requirement is that the custom types support all the necessary iteration primitives. Container elements may be of any valid Thrift type, including other containers or structs.
-\subsection{Structs}
-
-A Thrift struct defines a common object to be used across languages. A struct
-is essentially equivalent to a class in object oriented programming
-languages. A struct has a set of strongly typed fields, each with a unique
-name identifier. The basic syntax for defining a Thrift struct looks very
-similar to a C struct definition. Fields may be annotated with an integer field
-identifier (unique to the scope of that struct) and optional default values.
-Field identifiers will be automatically assigned if omitted, though they are
-strongly encouraged for versioning reasons discussed later.
-
\begin{verbatim}
struct Example { 1:i32 number=10,
@@ -226,15 +228,16 @@ that they are declared using the \texttt{exception} keyword instead of the \texttt{struct} keyword. The generated objects inherit from an exception base class as appropriate
-in each target programming language, the goal being to offer seamless
-integration with native exception handling for the developer in any given
+in each target programming language, in order to seamlessly
+integrate with native exception handling in any given
language. Again, the design emphasis is on making the code familiar to the application developer.
\subsection{Services}
Services are defined using Thrift types. Definition of a service is
-semantically equivalent to defining a pure virtual interface in object oriented
+semantically equivalent to defining an interface (or a pure virtual abstract
+class) in object oriented
programming. The Thrift compiler generates fully functional client and server stubs that implement the interface. Services are defined as follows:
@@ -257,19 +260,19 @@ service StringCache { Note that \texttt{void} is a valid type for a function return, in addition to all other defined Thrift types. Additionally, an \texttt{async} modifier
-keyword may be added to a void function, which will generate code that does
+keyword may be added to a \texttt{void} function, which will generate code that does
not wait for a response from the server.
Note that a pure \texttt{void} function will return a response to the client which guarantees that the operation has completed on the server side. With \texttt{async} method calls
-the client can only be guaranteed that the request succeeded at the
+the client will only be guaranteed that the request succeeded at the
transport layer. (In many transport scenarios this is inherently unreliable due to the Byzantine Generals' Problem. Therefore, application developers
-should take care only to use the async optimization in cases where dopped
+should take care only to use the async optimization in cases where dropped
method calls are acceptable or the transport is known to be reliable.)
-Also of note is the fact that argument and exception lists to functions are
-implemented as Thrift structs. They are identical in both notation and
-behavior.
+Also of note is the fact that argument lists and exception lists for functions
+are implemented as Thrift structs. All three constructs are identical in both
+notation and behavior.
\section{Transport}
@@ -277,7 +280,7 @@ The transport layer is used by the generated code to facilitate data transfer.
\subsection{Interface}
-A key design choice in the implementation of Thrift was to abstract the
+A key design choice in the implementation of Thrift was to decouple the
transport layer from the code generation layer. Though Thrift is typically used on top of the TCP/IP stack with streaming sockets as the base layer of communication, there was no compelling reason to build that constraint into
@@ -287,22 +290,22 @@ immaterial compared to the cost of actual I/O operations (typically invoking system calls). Fundamentally, generated Thrift code only needs to know how to read and
-write data. Where the data is going is irrelevant, it may be a socket, a
-segment of shared memory, or a file on the local disk. The Thrift transport
-interface supports the following methods.
+write data. The origin and destination of the data are irrelevant; it may be a
+socket, a segment of shared memory, or a file on the local disk. The Thrift
+transport interface supports the following methods:
\begin{itemize}
-\item \texttt{open()} Opens the tranpsort
-\item \texttt{close()} Closes the tranport
-\item \texttt{isOpen()} Whether the transport is open
-\item \texttt{read()} Reads from the transport
-\item \texttt{write()} Writes to the transport
-\item \texttt{flush()} Force any pending writes
+\item \texttt{open} Opens the tranpsort
+\item \texttt{close} Closes the tranport
+\item \texttt{isOpen} Indicates whether the transport is open
+\item \texttt{read} Reads from the transport
+\item \texttt{write} Writes to the transport
+\item \texttt{flush} Forces any pending writes
\end{itemize}
There are a few additional methods not documented here which are used to aid
-in batching reads and optionally signaling completion of reading or writing
-chunks of data by the generated code.
+in batching reads and optionally signaling the completion of a read or
+write operation from the generated code.
In addition to the above \texttt{TTransport} interface, there is a\\
@@ -311,11 +314,10 @@ used to accept or create primitive transport objects. Its interface is as follows:
\begin{itemize}
-\item \texttt{open()} Opens the tranpsort
-\item \texttt{listen()} Begins listening for connections
-\item \texttt{accept()} Returns a new client transport
-\item \texttt{close()} Closes the transport
-
+\item \texttt{open} Opens the transport
+\item \texttt{listen} Begins listening for connections
+\item \texttt{accept} Returns a new client transport
+\item \texttt{close} Closes the transport
\end{itemize}
\subsection{Implementation}
@@ -332,28 +334,28 @@ provides a common, simple interface to a TCP/IP stream socket.
\subsubsection{TFileTransport}
The \texttt{TFileTransport} is an abstraction of an on-disk file to a data
-stream. It can be used to write out a set of incoming Thrift request to a file
-on disk. The on-disk data can then be replayed from the log, either for post-processing
-or for recreation and simulation of past events. \texttt(TFileTransport).
+stream. It can be used to write out a set of incoming Thrift requests to a file
+on disk. The on-disk data can then be replayed from the log, either for
+post-processing or for reproduction and/or simulation of past events.
\subsubsection{Utilities}
The Transport interface is designed to support easy extension using common
-OOP techniques such as composition. Some simple utilites include the
-\texttt{TBufferedTransport}, which buffers writes and reads on an underlying
-transport, the \texttt{TFramedTransport}, which transmits data with frame
-size headers for chunking optimzation or nonblocking operation, and the
-\texttt{TMemoryBuffer}, which allows reading and writing directly from heap or
-stack memory owned by the process.
+OOP techniques, such as composition. Some simple utilites include the
+\texttt{TBufferedTransport}, which buffers the writes and reads on an
+underlying transport, the \texttt{TFramedTransport}, which transmits data with frame
+size headers for chunking optimization or nonblocking operation, and the
+\texttt{TMemoryBuffer}, which allows reading and writing directly from the heap
+or stack memory owned by the process.
\section{Protocol}
A second major abstraction in Thrift is the separation of data structure from transport representation. Thrift enforces a certain messaging structure when transporting data, but it is agnostic to the protocol encoding in use. That is,
-it does not matter whether data is encoded in XML, human-readable ASCII, or a
-dense binary format, so long as the data supports a fixed set of operations
-that allow generated code to deterministically read and write.
+it does not matter whether data is encoded as XML, human-readable ASCII, or a
+dense binary format as long as the data supports a fixed set of operations
+that allow it to be deterministically read and written by generated code.
\subsection{Interface}
@@ -404,27 +406,27 @@ double = readDouble() string = readString()
\end{verbatim}
-Note that every write function has exactly one read function counterpart, with
-the exception of the \texttt{writeFieldStop()} method. This is a special method
+Note that every \texttt{write} function has exactly one \texttt{read} counterpart, with
+the exception of \texttt{writeFieldStop()}. This is a special method
that signals the end of a struct. The procedure for reading a struct is to
-\texttt{readFieldBegin()} until the stop field is encountered, and to then
+\texttt{readFieldBegin()} until the stop field is encountered, and then to
\texttt{readStructEnd()}. The
-generated code relies upon this structure to ensure that everything written by
+generated code relies upon this call sequence to ensure that everything written by
a protocol encoder can be read by a matching protocol decoder. Further note that this set of functions is by design more robust than necessary. For example, \texttt{writeStructEnd()} is not strictly necessary, as the end of a struct may be implied by the stop field. This method is a convenience for
-verbose protocols where it is cleaner to separate these calls (i.e. a closing
+verbose protocols in which it is cleaner to separate these calls (e.g. a closing
\texttt{} tag in XML).
\subsection{Structure}
Thrift structures are designed to support encoding into a streaming
-protocol. That is, the implementation should never need to frame or compute the
+protocol. The implementation should never need to frame or compute the
entire data length of a structure prior to encoding it. This is critical to performance in many scenarios. Consider a long list of relatively large
-strings. If the protocol interface required reading or writing a list as an
-atomic operation, then the implementation would require a linear pass over the
+strings. If the protocol interface required reading or writing a list to be an
+atomic operation, then the implementation would need to perform a linear pass over the
entire list before encoding any data. However, if the list can be written as iteration is performed, the corresponding read may begin in parallel, theoretically offering an end-to-end speedup of $(kN - C)$, where $N$ is the size
@@ -434,11 +436,11 @@ and becoming available to read. Similarly, structs do not encode their data lengths a priori. Instead, they are encoded as a sequence of fields, with each field having a type specifier and a
-unique field identifier. Note that the inclusion of type specifiers enables
+unique field identifier. Note that the inclusion of type specifiers allows
the protocol to be safely parsed and decoded without any generated code or access to the original IDL file. Structs are terminated by a field header with a special \texttt{STOP} type. Because all the basic types can be read
-deterministically, all structs (including those with nested structs) can be
+deterministically, all structs (even those containing other structs) can be
read deterministically. The Thrift protocol is self-delimiting without any framing and regardless of the encoding format.
@@ -459,14 +461,14 @@ sufficient. We decided against some extreme storage optimizations (i.e. packing small integers into ASCII or using a 7-bit continuation format) for the sake of simplicity and clarity in the code. These alterations can easily be made
-if and when we encounter a performance critical use case that demands them.
+if and when we encounter a performance-critical use case that demands them.
\section{Versioning}
Thrift is robust in the face of versioning and data definition changes. This
-is critical to enable a staged rollout of changes to deployed services. The
-system must be able to support reading of old data from logfiles, as well as
-requests from out of date clients to new servers, or vice versa.
+is critical to enable staged rollouts of changes to deployed services. The
+system must be able to support reading of old data from log files, as well as
+requests from out-of-date clients to new servers, and vice versa.
\subsection{Field Identifiers}
@@ -486,7 +488,8 @@ struct Example { 4:string name="thrifty" }\end{verbatim}
-To avoid conflicts, fields with omitted identifiers are automatically assigned
+To avoid conflicts between manually and automatically assigned identifiers,
+fields with identifiers omitted are assigned
identifiers decrementing from -1, and the language only supports the manual assignment of positive identifiers.
@@ -494,7 +497,7 @@ When data is being deserialized, the generated code can use these identifiers to properly identify the field and determine whether it aligns with a field in its definition file. If a field identifier is not recognized, the generated code can use the type specifier to skip the unknown field without any error.
-Again, this is possible due to the fact that all data types are self
+Again, this is possible due to the fact that all datatypes are self
delimiting. Field identifiers can (and should) also be specified in function argument
@@ -512,7 +515,7 @@ service StringCache { The syntax for specifying field identifiers was chosen to echo their structure. Structs can be thought of as a dictionary where the identifiers are keys, and
-the values are strongly typed, named fields.
+the values are strongly-typed named fields.
Field identifiers internally use the \texttt{i16} Thrift type. Note, however, that the \texttt{TProtocol} abstraction may encode identifiers in any format.
@@ -522,8 +525,8 @@ that the \texttt{TProtocol} abstraction may encode identifiers in any format. When an unexpected field is encountered, it can be safely ignored and discarded.
When an expected field is not found, there must be some way to signal to the developer that it was not present. This is implemented via an
-inner \texttt{isset} structure inside the defined objects. (In PHP, this is
-implicit with a \texttt{null} value, or \texttt{None} in Python
+inner \texttt{isset} structure inside the defined objects. (Isset functionality
+is implicit with a \texttt{null} value in PHP, \texttt{None} in Python
and \texttt{nil} in Ruby.) Essentially, the inner \texttt{isset} object of each Thrift struct contains a boolean value for each field which denotes whether or not that field is present in the
@@ -566,7 +569,7 @@ There are four cases in which version mismatches may occur.
\begin{enumerate}
\item \textit{Added field, old client, new server.} In this case, the old client does not send the new field. The new server recognizes that the field
-is not set, and implements default behavior for out of date requests.
+is not set, and implements default behavior for out-of-date requests.
\item \textit{Removed field, old client, new server.} In this case, the old client sends the removed field. The new server simply ignores it.
\item \textit{Added field, new client, old server.} The new client sends a
@@ -591,7 +594,7 @@ Note that the exact same is true of the \texttt{TTransport} interface. For example, if we wished to add some new checksumming or error detection to the \texttt{TFileTransport}, we could simply add a version header into the data it writes to the file in such a way that it would still accept old
-logfiles without the given header.
+log files without the given header.
\section{RPC Implementation}
@@ -645,11 +648,11 @@ A client class is generated, which implements the interface and uses two \texttt{TProtocol} instances to perform the I/O operations. The generated processor implements the \texttt{TProcessor} interface.
The generated code has all the logic to handle RPC invocations via the \texttt{process()}
-call, and takes as a parameter an instance of the service interface,
+call, and takes as a parameter an instance of the service interface,
as implemented by the application developer.
-The user provides an implementation of the application interface in their own,
-non-generated source file.
+The user provides an implementation of the application interface in separate,
+non-generated source code.
\subsection{TServer}
@@ -697,14 +700,14 @@ Though Thrift was explicitly designed to be much more efficient and robust than typical web technologies, as we were designing our XML-based REST web services API we noticed that Thrift could be easily used to define our service interface. Though we do not currently employ SOAP envelopes (in the
-author's opinion there is already far too much repetetive enterprise Java
+authors' opinions there is already far too much repetitive enterprise Java
software to do that sort of thing), we were able to quickly extend Thrift to generate XML Schema Definition files for our service, as well as a framework for versioning different implementations of our web service. Though public web services are admittedly tangential to Thrift's core use case and design, Thrift facilitated rapid iteration and affords us the ability to quickly migrate our entire XML-based web service onto a higher performance system
-should the future need arise.
+should the need arise.
\subsection{Generated Structs}
We made a conscious decision to make our generated structs as transparent as
@@ -715,13 +718,13 @@ Developers have the option to use these fields to write more robust code, but the system is robust to the developer ignoring the \texttt{isset} construct entirely and will provide suitable default behavior in all cases.
-The reason for this choice was for ease of application development. Our stated
+This choice was motivated by the desire to ease application development. Our stated
goal is not to make developers learn a rich new library in their language of choice, but rather to generate code that allow them to work with the constructs that are most familiar in each language. We also made the \texttt{read()} and \texttt{write()} methods of the generated
-objects public members so that the objects can be used outside of the context
+objects public so that the objects can be used outside of the context
of RPC clients and servers. Thrift is a useful tool simply for generating objects that are easily serializable across programming languages.
@@ -768,15 +771,15 @@ std::map