Readers of Issuu’s engineering blog will know that we use the OCaml programming language in many of our production systems. As members of the small but vibrant OCaml community, we are proud to contribute to the health of this ecosystem by releasing as open-source many of the libraries and tools developed for internal use. This blog post describes the genesis and development of one such tool: ppx_mysql, a syntax extension aimed at reducing the pain and boilerplate traditionally associated with using MySQL bindings from OCaml.

The Genesis of ppx_mysql

Issuu makes extensive use of MySQL for relational data storage. Fortunately, ocaml-mysql offers OCaml bindings to libmysqlclient, and we have relied on those for many years. Though grateful that we could find in the OCaml community a ready-made way to communicate with MySQL from OCaml, we have long suffered the disadvantages of the fairly low-level interface offered by these bindings.

Our primary source of frustration with the low-level interface is the copious amount of boilerplate that it requires. Statements must be prepared, the input parameters must be serialised into strings, the result set must be looped through, and the output parameters must be deserialised into the correct types. This is cumbersome to write, tedious to review, and prone to errors. There was a strong consensus that we needed a better solution!

In many language ecosystems, the common approach that sidesteps the abovementioned issues is to rely instead on a library offering some form of Object Relational Mapping (ORM). ORMs bring their own set of problems, however. For one, there is typically some impedance mismatch between the relational model and the object model used by the ORM. Moreover, effective use of an ORM demands learning yet another API that typically encompasses only a subset of SQL — and seldom the subset you actually require! This is particularly galling at a company like Issuu, where our backend engineers are comfortable with SQL and want to make practical use of its features and expressiveness.

Fortunately, there is a third way. When it comes to interfacing with SQL databases — though alas restricted to PostgreSQL — the best known example in the OCaml community is arguably PG’OCaml. Originally developed by Richard W. M. Jones, and maintained and upgraded over the years by various members of the OCaml community, PG’OCaml leveraged the meta-programming facilities of OCaml to offer the best of both worlds: it allows you to write SQL directly within your program (with some special sigils for identifying input parameters), and automatically generates the boilerplate code required for interfacing with the PostgreSQL server. Moreover, it takes advantage of the Describe command in PostgreSQL’s wire protocol to determine the types of the query parameters at compile-time, and is this way able to automatically invoke the appropriate (de)serialisation functions.

Even though MySQL does not have the same Describe feature offered by PostgreSQL, and therefore some degree of manual type annotations would always be required, we felt that a syntax extension in the spirit of PG’OCaml’s could nevertheless offer a substantial improvement to the way we used the MySQL bindings for OCaml.

Design considerations

Introduced with OCaml 4.02, PPX rewriters quickly replaced the old Camlp4 system as the standard way to annotate and extend OCaml’s grammar for compile-time code generation. It was therefore obvious from the get-go that some sort of PPX would be the way forward for us as well. The space of design possibilities is nevertheless vast, so in the following sections we discuss and justify the decisions we took. These sections also serve as an introduction to the various features of the PPX.

Overall format

The first design decision concerns the type of code that the PPX should generate. Interestingly, the two well-known projects using syntax extensions to facilitate communication with SQL databases (the aforementioned PG’OCaml and OCaml-sqlexpr) each took a different route. Our approach was inspired by PG’OCaml, but it is not quite the same.

Basically, our PPX expects its extension points to take the form [%mysql ACTION SQL], where ACTION is one of execute / select_all / select_one / select_opt, and SQL is a string containing the actual SQL statement. The code it generates is a function that takes at least one parameter (the database handle) and potentially more depending on the number of input parameters declared within the SQL query. The function’s return type depends on the output parameters in the query, and on the ACTION. The latter also determines how the result set from MySQL should be processed.

The existence of the various actions is based on the realisation that one invariably performs one of four different kinds of post-processing on a query’s result set. Action execute is meant for queries whose sole purpose is to modify the database, and whose result set is meant to be empty. Actions select_all, select_one, and select_opt are meant for SELECT queries whose result set should contain a list with an arbitrary number of rows, a single row, or either zero or one rows, respectively.

Use the result type instead of exceptions

At Issuu, we avoid using exceptions for signaling errors, preferring to rely instead on the result type using polymorphic variants for errors. It should not come as a surprise that we have adopted this same policy for ppx_mysql. Note that in practice this means that the set of error variants in the generated function’s result type will depend on its action and on the parameters declared in the SQL string. As an example, a query using select_one as its action includes Expected_one_found_none among the possible error variants, which is absent if the action is instead select_all. In a similar vein, the error variant Column_errors — used to signal deserialisation errors — is only present if the SQL string actually contains any output parameters. The only variant common to all possible generated functions is Mysql_error, which is returned in case an internal error is encountered.

Manual type annotations

Unfortunately, MySQL does not provide anything similar to the Describe feature offered by PostgreSQL, and therefore manual type annotations are required. Superficially, our conventions resemble those used by OCaml-sqlexpr, where input parameters are prefixed with %, output parameters are prefixed with @, and where a ? after the type specifier indicates we are dealing with a nullable column. The similarities end there, however. Instead of single-letter type specifications, we use the full name of the OCaml type. Moreover, we also require that input parameters include the name of a variable inside curly brackets. This is because the function generated by the PPX uses named parameters for all the inputs. As an example, consider the signature of the function generated by the PPX in the code below. Note how the names and types of the function’s parameters match those provided within the query.

# [%mysql execute "INSERT INTO users (id, name, phone) VALUES (%int{id}, %string{name}, %string?{phone})"];;

- : Mysql.dbd -> id:int -> name:string -> phone:string option ->
    (unit, [> `Expected_none_found_one | `Mysql_error of exn ]) IO_result.t
= <fun>

Compared with PG’OCaml, we lose the compile-time verification that the column types in the database match those used in the OCaml code. We even lose the ability to check at compile-time that the SQL statement is valid in the first place. On the other hand, the lack of coupling between our PPX and the database does have the advantage that compile-time access to the MySQL server is not required.

Support for custom types and (de)serialisation functions

As described in a previous post about Issuu’s best OCAML practices, we tend to wrap all sorts of identifiers in modules, and we make the type itself opaque. This practice has many advantages when it comes to safety and abstraction, but it did pose a problem for the first versions of ppx_mysql. Suppose we have an opaque type Foobar.t, represented internally as just an integer, and that module Foobar offered functions to_int and of_int. If ppx_mysql supported only OCaml’s base types, one would need to use the int type specifier within the SQL expression, and manually call Foobar.to_int or Foobar.of_int depending on whether we were dealing with an input or output parameter. Not only does this force us to write boring boilerplate code, it also breaks the abstraction offered by the Foobar module. A better solution was in order.

The solution we converged upon is simple yet effective. Whenever the PPX encounters a type specification whose first letter is capitalised, it assumes it is dealing with a module that satisfies the signature below. Note that the reason why functions of_mysql and to_mysql use strings for (de)serialisation is because that is the representation used by MySQL’s wire protocol.

module type SERIALIZABLE = sig
  type t
  val of_mysql : string -> (t, string) result
  val to_mysql : t -> string
end

As an example, suppose the PPX encounters the type specification %Foobar{foobar}. It will generate a function with a named parameter foobar of type Foobar.t, and it will call function Foobar.to_mysql to serialise it. No more boilerplate needed and abstraction is preserved.

Support for lists as input parameters

Consider the SQL statements below, which exemplify two common use cases for lists in SQL: Inserting multiple rows with a single statement, and selecting based on list membership.

INSERT INTO users (id, name) VALUES (1, 'john'), (2, 'mary'), (3, 'claire');

SELECT COUNT(*) FROM users WHERE id IN (1, 2, 3);

Our PPX supports both use cases via a special list type specification available for input parameters. The code below illustrates its use for each of the SQL statements under discussion:

# [%mysql execute "INSERT INTO users (id, name) VALUES %list{(%int{id}, %string{name})}"];;
- : Mysql.dbd -> (int * string) list -> (unit, [> ... ]) IO_result.t = <fun>

# [%mysql select_one "SELECT @int{COUNT(*)} FROM users WHERE id IN (%list{%int{id}})"];;
- : Mysql.dbd -> int list -> (int, [> ... ]) IO_result.t = <fun>

Note that for the sake of clarity and brevity we replaced the lengthy list of error variants with an ellipsis (...). More to the point, notice the placement of the parentheses in each of the SQL statements: inside the %list curly brackets in the first example, but outside the %list in the second one. The reason for this becomes obvious once you realise that the PPX replaces whatever it encounters within the %list by a comma-separated list.

PG’OCaml users will find parallels between the usage of %list in the second example and PG’OCaml’s $@ notation for list expressions. There is no PG’OCaml equivalent to the first example, however. Being able to provide a list of tuples as input parameter is a novel and useful feature offered by our PPX.

Notice also that there is no way to specify a name for the list parameter, and that it shows up as a positional argument to the generated function. This reflects an underlying limitation of this feature: only one list parameter may be used per statement, and you will get a compile-time error if multiple uses are attempted. In fairness though, we have yet to encounter a situation where this limitation impacted us in practice.

A discussion of this feature is not complete without mentioning an important caveat. The generated function must always be provided with a non-empty list, and it will return an Empty_input_list error variant otherwise. We briefly considered creating an explicit type for non-empty lists (such as type 'a nelist = 'a * 'a list) and using that in the generated function. However, experience tells us that such types end up being in practice too much of a hassle, so we opted for plain lists in the end.

Not tied to any particular I/O monad or MySQL binding

There is a well know schism in the OCaml community between the two main libraries offering monadic concurrency for I/O, Lwt and Async. Therefore, being able to abstract over the I/O monad is an almost mandatory requirement, which our PPX satisfies. Moreover, though the PPX itself depends neither on Lwt or Async, we do include a simple “Hello World” example illustrating how to use the PPX with projects based on Lwt or Async.

Note that the code generated by the PPX does not even have a dependency on ocaml-mysql itself. Instead, it expects the presence in the environment of the signature PPX_MYSQL_CONTEXT described in the project’s README, and therefore the declaration of the three modules (IO, IO_result, and Prepared) used by the generated code. This is the mechanism used to ensure that the generated code can be used with any I/O monad. It has the additional benefit of not tying the PPX to any particular MySQL client library. In fact, it should be easy to use a MariaDB client library instead.

Also part of the project though packaged separately is ppx_mysql_identity. This library offers an implementation of the PPX_MYSQL_CONTEXT signature described above, using the identity monad for I/O and ocaml-mysql as the client library.

Automatic caching of prepared statements

The code generated by our PPX relies solely on MySQL prepared statements. There are two reasons behind this choice. The first is about security: SQL injection attacks are the poster child of bad security practices, and using prepared statements avoids them. The second reason concerns performance: there is some overhead associated with parsing a SQL statement, and by using prepared statements we need only incur that cost the first time the statement is used.

At this point, the astute reader may be wondering how the code generated by the PPX knows that a given statement has already been prepared. The approach we chose is similar to the one used by PG’OCaml: the database handle required by the PPX is not the raw handle used by the underlying MySQL client library. Instead, it’s an opaque type whose internal representation is a record that includes the original raw handle and a hash table that maps SQL query strings into the corresponding prepared statements.

There is one important caveat with the approach described above. If the SQL query contains list parameters, each list size will result in the preparation of a different statement. Should you use list parameters where the list size varies dramatically, you may end up with a very large cache of prepared statements taking up resources both on the client and on the MySQL server. We are fully aware that this is not a desirable situation. Nevertheless, we also feel that it would be draconian to simply disable statement caching for all queries with list parameters. After all, there are use cases where the list size is bounded and small, and therefore where it makes sense to cache prepared statements for all possible list sizes.

Obviously the PPX cannot guess whether it is reasonable or not to cache any given statement. The burden of this decision should fall on the user. Therefore, we plan to release soon a 1.1 version of our PPX offering per-query control of whether or not the resulting statement(s) should be cached.

Conclusion

Looking back at the development of this project, some lessons stand out. First, we were fortunate to have started working on this PPX after ppxlib came into existence. Not only does ppxlib bring the disparate libraries and tools required for writing a PPX under the same umbrella, it also provides a framework that abstracts away the low-level details of hooking into the compiler, and allows for multiple PPXs to coexist peacefully. Unfortunately, most of the resources found on the Internet were written in the age before ppxlib, and are therefore out of date and a potential source of confusion for newcomers to the ecosystem.

A perhaps more important lesson concerns the PPX mechanism itself. It started out as a clever hack, but a hack nonetheless. In an ideal world, a principled way to generate code at compile-time would’ve been designed from the ground up together with the language. Some of the pitfalls associated with writing a PPX — such as forgetting to fully qualify identifiers in the generated code and thus fall prey to the lack of hygiene — could have been sidestepped altogether in this alternate world. We are nevertheless cognisant of the Nirvana fallacy. Though not perfect, the PPX mechanism has made a positive contribution to the OCaml ecosystem, and we encourage those on the fence to give it a try.

The reader looking for more information about ppx_mysql should refer to the project’s home page, which contains a fairly detailed README. And there are of course OPAM packages for both ppx_mysql and ppx_mysql_identity.