
Interfacing C with Objective Caml

This chapter describes how user-defined primitives, written in C, can be linked with Caml code and called from Caml functions.

Overview and compilation information

Declaring primitives

User primitives are declared in an implementation file or struct...end module expression using the external keyword:

        external name : type = C-function-name
This defines the value name as a function with type type that executes by calling the given C function. For instance, here is how the input primitive is declared in the standard library module Pervasives:
        external input : in_channel -> string -> int -> int -> int
                       = "input"
Primitives with several arguments are always curried. The C function does not necessarily have the same name as the ML function.

External functions thus defined can be specified in interface files or sig...end signatures either as regular values

        val name : type
thus hiding their implementation as a C function, or explicitly as ``manifest'' external functions
        external name : type = C-function-name
The latter is slightly more efficient, as it allows clients of the module to call the C function directly instead of going through the corresponding Caml function.

Implementing primitives

User primitives with arity n <= 5 are implemented by C functions that take n arguments of type value, and return a result of type value. The type value is the type of the representations for Caml values. It encodes objects of several base types (integers, floating-point numbers, strings, ...), as well as Caml data structures. The type value and the associated conversion functions and macros are described in detail below. For instance, here is the declaration for the C function implementing the input primitive:

        value input(channel, buffer, offset, length)
                value channel, buffer, offset, length;
        {
         ...
        }

When the primitive function is applied in a Caml program, the C function is called with the values of the expressions to which the primitive is applied as arguments. The value returned by the function is passed back to the Caml program as the result of the function application.

User primitives with arity greater than 5 should be implemented by two C functions. The first function, to be used in conjunction with the bytecode compiler ocamlc, receives two arguments: a pointer to an array of Caml values (the values for the arguments), and an integer which is the number of arguments provided. The other function, to be used in conjunction with the native-code compiler ocamlopt, takes its arguments directly. For instance, here are the two C functions for the 7-argument primitive Nat.add_nat:

        value add_nat_native(nat1, ofs1, len1, nat2, ofs2, len2, carry_in)
             value nat1, ofs1, len1, nat2, ofs2, len2, carry_in;
        {
          ...
        }
        value add_nat_bytecode(argv, argn)
             value * argv;
             int argn;
        {
          return add_nat_native(argv[0], argv[1], argv[2], argv[3],
                                argv[4], argv[5], argv[6]);
        }
The names of the two C functions must be given in the primitive declaration, as follows:
        external name : type =
                 bytecode-C-function-name native-code-C-function-name
For instance, in the case of add_nat, the declaration is:
        external add_nat: nat -> int -> int -> nat -> int -> int -> int -> int
                        = "add_nat_bytecode" "add_nat_native"

Implementing a user primitive is actually two separate tasks: on the one hand, decoding the arguments to extract C values from the given Caml values, and encoding the return value as a Caml value; on the other hand, actually computing the result from the arguments. Except for very simple primitives, it is often preferable to have two distinct C functions to implement these two tasks. The first function actually implements the primitive, taking native C values as arguments and returning a native C value. The second function, often called the ``stub code'', is a simple wrapper around the first function that converts its arguments from Caml values to C values, calls the first function, and converts the returned C value to a Caml value. For instance, here is the stub code for the input primitive:

        value input(channel, buffer, offset, length)
                value channel, buffer, offset, length;
        {
          return Val_long(getblock((struct channel *) channel,
                                   &Byte(buffer, Long_val(offset)),
                                   Long_val(length)));
        }
(Here, Val_long, Long_val and so on are conversion macros for the type value, that will be described later.) The hard work is performed by the function getblock, which is declared as:
        long getblock(channel, p, n)
             struct channel * channel;
             char * p;
             long n;
        {
          ...
        }

To write C code that operates on Objective Caml values, the following include files are provided:
Include file        Provides
caml/mlvalues.h     definition of the value type, and conversion macros
caml/alloc.h        allocation functions (to create structured Caml objects)
caml/memory.h       miscellaneous memory-related functions (for in-place modification of structures, etc.)
caml/callback.h     callback from C to Caml (see section 15.6)
These files reside in the caml/ subdirectory of the Objective Caml standard library directory (usually /usr/local/lib/ocaml).

Linking C code with Caml code

The Objective Caml runtime system comprises three main parts: the bytecode interpreter, the memory manager, and a set of C functions that implement the primitive operations. Some bytecode instructions are provided to call these C functions, designated by their offset in a table of functions (the table of primitives).
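
The idea of a table of primitives addressed by offset can be illustrated with a small stand-alone model. This is only a sketch: the value typedef, the two toy primitives and every model_ name are assumptions made for the example, and it does not reproduce the runtime's actual table or dispatch code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the runtime's value type (the real one is in caml/mlvalues.h). */
typedef long value;

/* In this model every primitive takes one value and returns one value. */
typedef value (*model_primitive)(value);

/* Two hypothetical primitives. */
static value model_succ(value v) { return v + 1; }
static value model_neg(value v)  { return -v; }

/* The "table of primitives": callers designate an entry by its offset. */
static const model_primitive model_table[] = { model_succ, model_neg };

/* What a "call C primitive" instruction amounts to in this model. */
value model_call_primitive(size_t offset, value arg)
{
  assert(offset < sizeof(model_table) / sizeof(model_table[0]));
  return model_table[offset](arg);
}
```

For example, model_call_primitive(0, 41) goes through entry 0 of the table and applies model_succ to 41.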

In the default mode, the Caml linker produces bytecode for the standard runtime system, with a standard set of primitives. References to primitives that are not in this standard set result in the ``unavailable C primitive'' error.

In the ``custom runtime'' mode, the Caml linker scans the object files and determines the set of required primitives. Then, it builds a suitable runtime system, by calling the native code linker with:

This builds a runtime system with the required primitives. The Caml linker generates bytecode for this custom runtime system. The bytecode is appended to the end of the custom runtime system, so that it will be automatically executed when the output file (custom runtime + bytecode) is launched.

To link in ``custom runtime'' mode, execute the ocamlc command with:

The value type

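All Caml objects are represented by the C type value, defined in the include file caml/mlvalues.h, along with macros to manipulate values of that type. An object of type value is either an unboxed integer, a pointer to a block inside the heap, or a pointer to an object outside the heap. The following subsections describe each case.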

Integer values

Integer values encode 31-bit signed integers (63-bit on 64-bit architectures). They are unboxed (unallocated).
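
The text above does not spell out the bit-level encoding. Here is a minimal sketch, assuming the usual scheme in which an integer n is stored as 2n+1, so that the low bit distinguishes it from a word-aligned pointer and one bit of range is lost. The model_ names are made up for this example; real code should use the Val_long and Long_val macros from caml/mlvalues.h.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the value type: one machine word. */
typedef intptr_t model_value;

/* Encode n as 2n+1 (assumed scheme). One bit is used by the tag,
   hence the 31-bit / 63-bit range mentioned in the text. */
model_value model_val_long(intptr_t n)
{
  assert(n >= INTPTR_MIN / 2 && n <= INTPTR_MAX / 2);
  return (model_value)(((uintptr_t)n << 1) + 1);
}

/* Decode: drop the tag bit. This relies on >> being an arithmetic
   shift for negative numbers, which mainstream compilers provide. */
intptr_t model_long_val(model_value v)
{
  return v >> 1;
}

/* An unboxed integer has its low bit set; an aligned pointer does not. */
int model_is_long(model_value v)
{
  return (v & 1) != 0;
}
```

Under this assumption the integer 3 is represented by the word 7, and decoding the encoding of a negative number such as -5 gives back -5.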

Blocks

Blocks in the heap are garbage-collected, and therefore have strict structure constraints. Each block includes a header containing the size of the block (in words), and the tag of the block. The tag governs how the contents of the block are structured. A tag lower than No_scan_tag indicates a structured block, containing well-formed values, which is recursively traversed by the garbage collector. A tag greater than or equal to No_scan_tag indicates a raw block, whose contents are not scanned by the garbage collector. For the benefit of ad-hoc polymorphic primitives such as equality and structured input-output, structured and raw blocks are further classified according to their tags as follows:
Tag                     Contents of the block
0 to No_scan_tag-1      A structured block (an array of Caml objects). Each field is a value.
Closure_tag             A closure representing a functional value. The first word is a pointer to a piece of bytecode, the remaining words are values containing the environment.
String_tag              A character string.
Double_tag              A double-precision floating-point number.
Double_array_tag        An array of double-precision floating-point numbers (for the native-code compiler only).
Abstract_tag            A block representing an abstract datatype.
Final_tag               A block representing an abstract datatype with a ``finalization'' function, to be called when the block is deallocated.
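
To make the role of the header concrete, here is a toy model of a header word holding a size and a tag, together with the structured/raw test. The bit layout (8-bit tag in the low bits, size starting at bit 10) and the threshold value 251 are assumptions of this sketch, not something stated in this chapter; the model_ names are hypothetical, and real code should only go through the macros of caml/mlvalues.h.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t model_header;

/* Assumed value of the threshold between structured and raw blocks. */
enum { MODEL_NO_SCAN_TAG = 251 };

/* Pack a size (in words) and a tag into one header word.
   Bits 8 and 9 are left at zero in this model. */
model_header model_make_header(uintptr_t wosize, unsigned tag)
{
  assert(tag < 256);
  return (wosize << 10) | tag;
}

/* Size of the block in words. */
uintptr_t model_wosize(model_header hd) { return hd >> 10; }

/* Tag of the block. */
unsigned model_tag(model_header hd) { return (unsigned)(hd & 0xFF); }

/* Structured blocks (tag below the threshold) are traversed by the
   collector; raw blocks are not. */
int model_is_scanned(model_header hd)
{
  return model_tag(hd) < MODEL_NO_SCAN_TAG;
}
```

In this model a two-field block with tag 0 (a cons cell, for instance) is scanned, while a block whose tag is at or above the threshold is treated as raw bytes.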

Pointers to outside the heap

Any word-aligned pointer to outside the heap can be safely cast to and from the type value. This includes pointers returned by malloc, and pointers to C variables (of size at least one word) obtained with the & operator.

Representation of Caml data types

This section describes how Caml data types are encoded in the value type.

Atomic types

Caml type       Encoding
int             Unboxed integer values.
char            Unboxed integer values (ASCII code).
float           Blocks with tag Double_tag.
string          Blocks with tag String_tag.

Tuples and records

Tuples are represented by pointers to blocks, with tag 0.

Records are also represented by zero-tagged blocks. The ordering of labels in the record type declaration determines the layout of the record fields: the value associated to the label declared first is stored in field 0 of the block, the value associated to the label declared next goes in field 1, and so on.

The native-code compiler uses a special representation for records whose fields all have type float. These are represented as arrays of floating-point numbers, with tag Double_array_tag. (See the section below on arrays.)

Arrays

In the bytecode compiler, all arrays are represented like tuples, that is, as pointers to blocks tagged 0.

The native-code compiler has a special, unboxed, more efficient representation for arrays of floating-point numbers (type float array). These arrays are represented by pointers to blocks with tag Double_array_tag. They should be accessed with the Double_field and Store_double_field macros.

Arrays of floating-point numbers in the bytecode compiler should be accessed with Double_val(Field(v, n)) for reading and modify(&Field(v, n), copy_double(d)) for writing.

Concrete types

Constructed terms are represented either by unboxed integers (for constant constructors) or by blocks whose tag encodes the constructor (for non-constant constructors). The constant constructors and the non-constant constructors for a given concrete type are numbered separately, starting from 0, in the order in which they appear in the concrete type declaration. Constant constructors are represented by unboxed integers equal to the constructor number. Non-constant constructors declared with an n-tuple as argument are represented by a block of size n, tagged with the constructor number; the n fields contain the components of its tuple argument. Other non-constant constructors are represented by a block of size 1, tagged with the constructor number; field 0 contains the value of the constructor argument. Example:

Constructed term    Representation
()                  Val_int(0)
false               Val_int(0)
true                Val_int(1)
[]                  Val_int(0)
h::t                Block with size = 2 and tag = 0; first field contains h, second field t

As a convenience, caml/mlvalues.h defines the macros Val_unit, Val_false and Val_true to refer to (), false and true.
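
The separate numbering of constant and non-constant constructors is easy to get wrong, so here is a small sketch that computes it. The type used as an example, t = A | B of int | C | D of int * int, and all model_ names are hypothetical; the function only restates the numbering rule above and does not touch real Caml values.

```c
#include <assert.h>

/* How one constructor is represented, according to the rule in the text. */
typedef struct {
  int is_block;   /* 0: unboxed integer, 1: heap block             */
  int number;     /* integer value, or tag of the block            */
  int size;       /* number of fields (0 for constant constructors) */
} model_repr;

/* Arities in declaration order for the hypothetical type
   t = A | B of int | C | D of int * int
   (0 means constant constructor). */
const int model_t_arity[4] = { 0, 1, 0, 2 };

/* Number constructor k among the constructors of the same kind
   (constant or non-constant) that are declared before it. */
model_repr model_constructor_repr(const int *arity, int count, int k)
{
  model_repr r;
  int i, n = 0;
  assert(k >= 0 && k < count);
  for (i = 0; i < k; i++)
    if ((arity[i] == 0) == (arity[k] == 0))
      n++;
  r.is_block = (arity[k] != 0);
  r.number = n;
  r.size = arity[k];
  return r;
}
```

With this type, A and C are the integers 0 and 1, B is a block of size 1 with tag 0, and D is a block of size 2 with tag 1.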

Objects

Objects are represented as zero-tagged blocks. The first field of the block refers to the object class and associated method suite, in a format that cannot easily be exploited from C. The remaining fields of the object contain the values of the instance variables of the object. Instance variables are stored in the order in which they appear in the class definition (taking inherited classes into account).

Operations on values

Kind tests

Operations on integers

Accessing blocks

The expressions Field(v, n), Byte(v, n) and Byte_u(v, n) are valid l-values. Hence, they can be assigned to, resulting in an in-place modification of value v. Assigning directly to Field(v, n) must be done with care to avoid confusing the garbage collector (see below).

Allocating blocks

From the standpoint of the allocation functions, blocks are divided according to their size as zero-sized blocks, small blocks (with size less than or equal to Max_young_wosize), and large blocks (with size greater than Max_young_wosize). The constant Max_young_wosize is declared in the include file mlvalues.h. It is guaranteed to be at least 64 (words), so that any block with constant size less than or equal to 64 can be assumed to be small. For blocks whose size is computed at run-time, the size must be compared against Max_young_wosize to determine the correct allocation procedure.
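
The run-time comparison described above can be sketched as a small classification function. This is an illustration only: the enum and function names are hypothetical, and the threshold is passed as a parameter here, whereas real code would compare against the Max_young_wosize constant from the include file.

```c
#include <assert.h>

/* The three size classes distinguished by the allocation functions. */
typedef enum {
  MODEL_ZERO_SIZED,
  MODEL_SMALL,
  MODEL_LARGE
} model_size_class;

/* Classify a block size (in words) computed at run-time. */
model_size_class model_classify(unsigned long wosize,
                                unsigned long max_young_wosize)
{
  /* The text guarantees that the threshold is at least 64 words. */
  assert(max_young_wosize >= 64);
  if (wosize == 0)
    return MODEL_ZERO_SIZED;
  if (wosize <= max_young_wosize)
    return MODEL_SMALL;
  return MODEL_LARGE;
}
```

A caller would then pick the allocation procedure that matches the returned class.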

Raising exceptions

Two functions are provided to raise two standard exceptions: failwith, which raises Failure, and invalid_argument, which raises Invalid_argument.

Raising arbitrary exceptions from C is more delicate: the exception identifier is dynamically allocated by the Caml program, and therefore must be communicated to the C function using the registration facility described below in section 15.6. Once the exception identifier is recovered in C, the functions raise_constant, raise_with_arg and raise_with_string actually raise the exception.

Living in harmony with the garbage collector

Unused blocks in the heap are automatically reclaimed by the garbage collector. This requires some cooperation from C code that manipulates heap-allocated blocks.

Rule:
After a structured block (a block with tag less than No_scan_tag) is allocated, all fields of this block must be filled with well-formed values before the next allocation operation. If the block has been allocated with alloc or alloc_tuple, filling is performed by direct assignment to the fields of the block:
        Field(v, n) = vn;
If the block has been allocated with alloc_shr, filling is performed through the initialize function:
        initialize(&Field(v, n), vn);

The next allocation can trigger a garbage collection. The garbage collector assumes that all structured blocks contain well-formed values. Newly created blocks contain random data, which generally do not represent well-formed values.

If you really need to allocate before the fields can receive their final value, first initialize with a constant value (e.g. Val_long(0)), then allocate, then modify the fields with the correct value (see the rule on direct assignment below).

Rule:
Local variables containing values must be registered with the garbage collector (using the Begin_roots and End_roots macros), if they are to survive a call to an allocation function.

Registration is performed with the Begin_roots set of macros. Begin_roots1(v) registers variable v with the garbage collector. Generally, v will be a local variable or a parameter of your function. It must be initialized to a valid value (e.g. Val_unit) before the first allocation. Likewise, Begin_roots2, ..., Begin_roots5 let you register up to 5 variables at the same time. Begin_root is the same as Begin_roots1. Begin_roots_block(ptr,size) allows you to register an array of roots. ptr is a pointer to the first element, and size is the number of elements in the array.

Once registered, each of your variables (or array element) has the following properties: if it points to a heap-allocated block, this block (and its contents) will not be reclaimed; moreover, if this block is relocated by the garbage collector, the variable is updated to point to the new location for the block.

Each of the Begin_roots macros opens a C block that must be closed with a matching End_roots at the same nesting level. The block must be exited normally (i.e. not with return or goto). However, the roots are automatically un-registered if a Caml exception is raised, so you can exit the block with failwith, invalid_argument, or one of the raise functions.

Note: The Begin_roots macros use a local variable and a structure tag named caml__roots_block. Do not use this identifier in your programs.
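
Why roots are registered by address, and what "the variable is updated" means, can be shown with a toy model. The sketch below is not the runtime's mechanism: the root stack, the fake relocation and all model_ names are assumptions made for the illustration. It only demonstrates that a registered variable follows a moved block while an unregistered copy keeps the stale address.

```c
#include <assert.h>

/* In this model a "value" is simply a pointer to a block of longs. */
typedef long *model_value;

#define MODEL_MAX_ROOTS 8
static model_value *model_roots[MODEL_MAX_ROOTS];
static int model_root_count = 0;

/* Register the ADDRESS of a variable (the role played by Begin_roots). */
void model_push_root(model_value *var)
{
  assert(model_root_count < MODEL_MAX_ROOTS);
  model_roots[model_root_count++] = var;
}

/* Un-register the most recent root (the role played by End_roots). */
void model_pop_root(void)
{
  assert(model_root_count > 0);
  model_root_count--;
}

/* Pretend the collector moved a block: every registered variable that
   pointed to the old location is rewritten to the new one. */
void model_relocate(model_value old_block, model_value new_block)
{
  int i;
  for (i = 0; i < model_root_count; i++)
    if (*model_roots[i] == old_block)
      *model_roots[i] = new_block;
}

/* Returns 1 when only the registered variable followed the move. */
int model_demo(void)
{
  long old_place[2] = { 1, 2 };
  long new_place[2] = { 1, 2 };
  model_value registered = old_place;
  model_value unregistered = old_place;
  int ok;

  model_push_root(&registered);
  model_relocate(old_place, new_place);
  ok = (registered == new_place) && (unregistered == old_place);
  model_pop_root();
  return ok;
}
```

The unregistered variable in model_demo is exactly the kind of dangling local that the rule above is meant to prevent.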

Rule:
Global variables containing values must be registered with the garbage collector using the register_global_root function.

Registration of a global variable v is achieved by calling register_global_root(&v) just before a valid value is stored in v for the first time.

A registered global variable v can be un-registered by calling remove_global_root(&v).

Rule:
Direct assignment to a field of a block, as in
        Field(v, n) = w;
is safe only if v is a block newly allocated by alloc or alloc_tuple; that is, if no allocation took place between the allocation of v and the assignment to the field. In all other cases, never assign directly. If the block has just been allocated by alloc_shr, use initialize to assign a value to a field for the first time:
        initialize(&Field(v, n), w);
Otherwise, you are updating a field that previously contained a well-formed value; then, call the modify function:
        modify(&Field(v, n), w);

To illustrate the rules above, here is a C function that builds and returns a list containing the two integers given as parameters:

value alloc_list_int(i1, i2)
        int i1, i2;
{
  value result = Val_unit;
  value r = Val_unit;

  Begin_roots2 (r, result);
    r = alloc(2, 0);                        /* Allocate a cons cell */
    Field(r, 0) = Val_int(i2);              /* car = the integer i2 */
    Field(r, 1) = Val_int(0);               /* cdr = the empty list [] */
    result = alloc(2, 0);                   /* Allocate the other cons cell */
    Field(result, 0) = Val_int(i1);         /* car = the integer i1 */
    Field(result, 1) = r;                   /* cdr = the first cons cell */
  End_roots ();
  return result;
}
Here, the registering of result is not strictly needed, because no allocation takes place after it gets its value, but it's easier and safer to simply register all the local variables that have type value.

In the example above, the list is built bottom-up. Here is an alternate way, that proceeds top-down. It is less efficient, but illustrates the use of modify.

value alloc_list_int(i1, i2)
        int i1, i2;
{
  value tail;
  value r = Val_unit;

  Begin_root (r);
    r = alloc(2, 0);                        /* Allocate a cons cell */
    Field(r, 0) = Val_int(i1);              /* car = the integer i1 */
    Field(r, 1) = Val_int(0);               /* A dummy value */
    tail = alloc(2, 0);                     /* Allocate the other cons cell */
    Field(tail, 0) = Val_int(i2);           /* car = the integer i2 */
    Field(tail, 1) = Val_int(0);            /* cdr = the empty list [] */
    modify(&Field(r, 1), tail);             /* cdr of the result = tail */
  End_roots ();
  return r;
}
It would be incorrect to perform Field(r, 1) = tail directly, because the allocation of tail has taken place since r was allocated. tail is not registered as a root because there is no allocation between the assignment where it takes its value and the modify statement that uses the value.

Callbacks from C to Caml

So far, we have described how to call C functions from Caml. In this section, we show how C functions can call Caml functions, either as callbacks (Caml calls C which calls Caml), or because the main program is written in C.

Applying Caml closures from C

C functions can apply Caml functional values (closures) to Caml values. The following functions are provided to perform the applications:

If the function f does not return, but raises an exception that escapes the scope of the application, then this exception is propagated to the next enclosing Caml code, skipping over the C code. That is, if a Caml function f calls a C function g that calls back a Caml function h that raises a stray exception, then the execution of g is interrupted and the exception is propagated back into f.

Registering Caml closures for use in C functions

The main difficulty with the callback functions described above is obtaining a closure to the Caml function to be called. For this purpose, Objective Caml provides a simple registration mechanism, by which Caml code can register Caml functions under some global name, and then C code can retrieve the corresponding closure by this global name.

On the Caml side, registration is performed by evaluating Callback.register n v. Here, n is the global name (an arbitrary string) and v the Caml value. For instance:

    let f x = print_string "f is applied to "; print_int x; print_newline()
    let _ = Callback.register "test function" f

On the C side, a pointer to the value registered under name n is obtained by calling caml_named_value(n). The returned pointer must then be dereferenced to recover the actual Caml value. If no value is registered under the name n, the null pointer is returned. For example, here is a C wrapper that calls the Caml function f above:

    void call_caml_f(int arg)
    {
        callback(*caml_named_value("test function"), Val_int(arg));
    }

The pointer returned by caml_named_value is constant and can safely be cached in a C variable to avoid repeated name lookups. On the other hand, the value pointed to can change during garbage collection and must always be recomputed at the point of use. Here is a more efficient variant of call_caml_f above that calls caml_named_value only once:

    void call_caml_f(int arg)
    {
        static value * closure_f = NULL;
        if (closure_f == NULL) {
            /* First time around, look up by name */
            closure_f = caml_named_value("test function");
        }
        callback(*closure_f, Val_int(arg));
    }

Registering Caml exceptions for use in C functions

The registration mechanism described above can also be used to communicate exception identifiers from Caml to C. The Caml code registers the exception by evaluating Callback.register_exception n exn, where n is an arbitrary name and exn is an exception value of the exception to register. For example:

    exception Error of string
    let _ = Callback.register_exception "test exception" (Error "any string")
The C code can then recover the exception identifier using caml_named_value and pass it as first argument to the functions raise_constant, raise_with_arg, and raise_with_string (described in section 15.4) to actually raise the exception. For example, here is a C function that raises the Error exception with the given argument:
    void raise_error(char * msg)
    {
        raise_with_string(*caml_named_value("test exception"), msg);
    }

Main program in C

In normal operation, a mixed Caml/C program starts by executing the Caml initialization code, which then can proceed to call C functions. We say that the main program is the Caml code. In some applications, it is desirable that the C code plays the role of the main program, calling Caml functions when needed. This can be achieved as follows:

Embedding the Caml code in the C code

The bytecode compiler in custom runtime mode (ocamlc -custom) normally appends the bytecode to the executable file containing the custom runtime. This has two consequences. First, the final linking step must be performed by ocamlc. Second, the Caml runtime library must be able to find the name of the executable file from the command-line arguments. When using caml_main(argv) as in section 15.6, this means that argv[0] or argv[1] must contain the executable file name.

An alternative is to embed the bytecode in the C code. The -output-obj option to ocamlc is provided for this purpose. It causes the ocamlc compiler to output a C object file (.o file) containing the bytecode for the Caml part of the program, as well as a caml_startup function. The C object file produced by ocamlc -output-obj can then be linked with C code using the standard C compiler, or stored in a C library.

The caml_startup function must be called from the main C program in order to initialize the Caml runtime and execute the Caml initialization code. Just like caml_main, it takes one argv parameter containing the command-line parameters. Unlike caml_main, this argv parameter is used only to initialize Sys.argv, but not for finding the name of the executable file.

The native-code compiler ocamlopt also supports the -output-obj option, causing it to output a C object file containing the native code for all Caml modules on the command-line, as well as the Caml startup code. Initialization is performed by calling caml_startup as in the case of the bytecode compiler.

Warning:
On some ports, special options are required on the final linking phase that links together the object file produced by the -output-obj option and the remainder of the program. Those options are shown in the configuration file config/Makefile generated during compilation of Objective Caml, as the variables BYTECCLINKOPTS (for object files produced by ocamlc -output-obj) and NATIVECCLINKOPTS (for object files produced by ocamlopt -output-obj). Currently, the only ports that require special attention are:

A complete example

This section outlines how the functions from the Unix curses library can be made available to Objective Caml programs. First of all, here is the interface curses.mli that declares the curses primitives and data types:

type window                   (* The type "window" remains abstract *)
external initscr: unit -> window = "curses_initscr"
external endwin: unit -> unit = "curses_endwin"
external refresh: unit -> unit = "curses_refresh"
external wrefresh : window -> unit = "curses_wrefresh"
external newwin: int -> int -> int -> int -> window = "curses_newwin"
external mvwin: window -> int -> int -> unit = "curses_mvwin"
external addch: char -> unit = "curses_addch"
external mvwaddch: window -> int -> int -> char -> unit = "curses_mvwaddch"
external addstr: string -> unit = "curses_addstr"
external mvwaddstr: window -> int -> int -> string -> unit = "curses_mvwaddstr"
(* lots more omitted *)
To compile this interface:
        ocamlc -c curses.mli

To implement these functions, we just have to provide the stub code; the core functions are already implemented in the curses library. The stub code file, curses.c, looks like:

#include <curses.h>
#include <caml/mlvalues.h>

value curses_initscr(unit)
        value unit;
{
  return (value) initscr();     /* OK to coerce directly from WINDOW * to value
                                   since that's a block created by malloc() */
}

value curses_wrefresh(win)
        value win;
{
  wrefresh((WINDOW *) win);
  return Val_unit;
}

value curses_newwin(nlines, ncols, x0, y0)
        value nlines, ncols, x0, y0;
{
  return (value) newwin(Int_val(nlines), Int_val(ncols),
                        Int_val(x0), Int_val(y0));
}

value curses_addch(c)
        value c;
{
  addch(Int_val(c));            /* Characters are encoded like integers */
  return Val_unit;
}

value curses_addstr(s)
        value s;
{
  addstr(String_val(s));
  return Val_unit;
}

/* This goes on for pages. */
(Actually, it would be better to create a library for the stub code, with each stub code function in a separate file, so that linking would pick only those functions from the curses library that are actually used.)

The file curses.c can be compiled with:

        cc -c -I/usr/local/lib/ocaml curses.c
or, even simpler,
        ocamlc -c curses.c
(When passed a .c file, the ocamlc command simply calls the C compiler on that file, with the right -I option.)

Now, here is a sample Caml program test.ml that uses the curses module:

open Curses
let main_window = initscr () in
let small_window = newwin 10 5 20 10 in
  mvwaddstr main_window 10 2 "Hello";
  mvwaddstr small_window 4 3 "world";
  refresh();
  for i = 1 to 100000 do () done;
  endwin()
To compile this program, run:
        ocamlc -c test.ml
Finally, to link everything together:
        ocamlc -custom -o test test.cmo curses.o -cclib -lcurses

Advanced example with callbacks

This section illustrates the callback facilities described in section 15.6. We are going to package some Caml functions in such a way that they can be linked with C code and called from C just like any C functions. The Caml functions are defined in the following mod.ml Caml source:

(* File mod.ml -- some ``useful'' Caml functions *)

let rec fib n = if n < 2 then 1 else fib(n-1) + fib(n-2)

let format_result n = Printf.sprintf "Result is: %d\n" n

(* Export those two functions to C *)

let _ = Callback.register "fib" fib
let _ = Callback.register "format_result" format_result

Here is the C stub code for calling these functions from C:

/* File modwrap.c -- wrappers around the Caml functions */

#include <stdio.h>
#include <string.h>
#include <caml/mlvalues.h>
#include <caml/callback.h>

int fib(int n)
{
  static value * fib_closure = NULL;
  if (fib_closure == NULL) fib_closure = caml_named_value("fib");
  return Int_val(callback(*fib_closure, Val_int(n)));
}

char * format_result(int n)
{
  static value * format_result_closure = NULL;
  if (format_result_closure == NULL)
    format_result_closure = caml_named_value("format_result");
  return strdup(String_val(callback(*format_result_closure, Val_int(n))));
  /* We copy the C string returned by String_val to the C heap
     so that it remains valid after garbage collection. */
}

We now compile the Caml code to a C object file and put it in a C library along with the stub code in modwrap.c and the Caml runtime system:

        ocamlc -custom -output-obj -o modcaml.o mod.ml
        ocamlc -c modwrap.c
        cp /usr/local/lib/ocaml/libcamlrun.a mod.a
        ar r mod.a modcaml.o modwrap.o

Now, we can use the two functions fib and format_result in any C program, just like regular C functions. Just remember to call caml_startup once before.

/* File main.c -- a sample client for the Caml functions */

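#include <stdio.h>

/* Prototypes for the wrappers defined in modwrap.c */
int fib(int n);
char * format_result(int n);
/* Provided by the object file generated with ocamlc -output-obj */
void caml_startup(char ** argv);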

int main(int argc, char ** argv)
{
  int result;

  /* Initialize Caml code */
  caml_startup(argv);
  /* Do some computation */
  result = fib(10);
  printf("fib(10) = %s\n", format_result(result));
  return 0;
}

To build the whole program, just invoke the C compiler as follows:

        cc -o prog main.c mod.a

