Thicket data repository for the EEG
{
  "id": "https://anil.recoil.org/notes/datacaml-with-ciel",
  "title": "DataCaml: distributed dataflow programming in OCaml",
  "link": "https://anil.recoil.org/notes/datacaml-with-ciel",
  "updated": "2011-06-11T00:00:00",
  "published": "2011-06-11T00:00:00",
  "content": "<p>Distributed programming frameworks like\n<a href=\"http://wiki.apache.org/hadoop\">Hadoop</a> and\n<a href=\"http://research.microsoft.com/en-us/projects/dryad/\">Dryad</a> are popular\nfor performing computation over large amounts of data. The reason is\nprogrammer convenience: they accept a query expressed in a simple form\nsuch as <a href=\"http://wiki.apache.org/hadoop/HadoopMapReduce\">MapReduce</a>, and\nautomatically take care of distributing computation to multiple hosts,\nensuring the data is available at all nodes that need it, and dealing\nwith host failures and stragglers.</p>\n<p>A major limitation of Hadoop and Dryad is that they are not well-suited\nto expressing <a href=\"http://en.wikipedia.org/wiki/Iterative_method\">iterative\nalgorithms</a> or <a href=\"http://en.wikipedia.org/wiki/Dynamic_programming\">dynamic\nprogramming</a> problems.\nThese are very commonly found patterns in many algorithms, such as\n<a href=\"http://en.wikipedia.org/wiki/K-means_clustering\">k-means clustering</a>,\n<a href=\"http://en.wikipedia.org/wiki/Binomial_options_pricing_model\">binomial options\npricing</a> or\n<a href=\"http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm\">Smith Waterman</a>\nfor sequence alignment.</p>\n<p>Over in the SRG in Cambridge,\n<a href=\"http://www.cl.cam.ac.uk/research/srg/netos/ciel/who-we-are/\">we</a>\ndeveloped a Turing-powerful distributed execution engine called\n<a href=\"http://www.cl.cam.ac.uk/research/srg/netos/ciel/\">CIEL</a> that addresses\nthis. The <a href=\"https://anil.recoil.org/papers/2011-nsdi-ciel\">CIEL: A universal execution engine for distributed data-flow computing</a>\npaper describes the system in detail, but here\u2019s a shorter introduction.</p>\n<h2><a href=\"https://anil.recoil.org/#the-ciel-execution-engine\"></a>The CIEL Execution Engine</h2>\n<p>CIEL consists of a master coordination server and workers installed on\nevery host. 
The engine is job-oriented: a job consists of a graph of\ntasks which results in a deterministic output. CIEL tasks can run in any\nlanguage and are started by the worker processes as needed. Data flows\naround the cluster in the form of <em>references</em> that are fed to tasks as\ndependencies. Tasks can publish their outputs either as <em>concrete</em>\nreferences if they can finish the work immediately or as a <em>future</em>\nreference. Additionally, tasks can dynamically spawn more tasks and\ndelegate references to them, which makes the system Turing-powerful and\nsuitable for iterative and dynamic programming problems where the task\ngraph cannot be computed statically.</p>\n<p>The first iteration of CIEL used a domain-specific language called\n<a href=\"https://anil.recoil.org/papers/2011-nsdi-ciel.pdf\">Skywriting</a> to\ncoordinate how tasks should run across a cluster. Skywriting is an\ninterpreted language that is \u201cnative\u201d to CIEL, and when it needs to\nblock it stores its entire execution state inside CIEL as a\ncontinuation. <a href=\"http://www.cl.cam.ac.uk/~dgm36/\">Derek Murray</a> has\nwritten a blog post <a href=\"http://www.syslog.cl.cam.ac.uk/2011/04/06/ciel/\">explaining this in more\ndetail</a>.</p>\n<p>More recently, we have been working on eliminating the need for\nSkywriting entirely, by adding direct support for CIEL into languages\nsuch as <a href=\"http://www.stackless.com/\">Python</a>, Java,\n<a href=\"http://www.scala-lang.org/\">Scala</a>, and the main subject of this post \u2013\n<a href=\"http://caml.inria.fr\">OCaml</a>. It works via libraries that communicate\nwith CIEL to spawn tasks, publish references, or suspend itself into the\ncluster to be woken up when a future reference is completed.</p>\n<h2><a href=\"https://anil.recoil.org/#datacaml-api\"></a>DataCaml API</h2>\n<p>Rather than go into too much detail about the innards of CIEL, this post\ndescribes the OCaml API and gives some examples of how to use it. 
The\nsimplest interface to start with is:</p>\n<pre><code>type 'a ref\nval deref : 'a ref -> 'a\n</code></pre>\n<p>The type <code>'a ref</code> represents a CIEL reference. This data might not be\nimmediately present on the current node, and so must be dereferenced\nusing the <code>deref</code> function.</p>\n<p>If the reference has been completed, then the OCaml value is\nunmarshalled and returned. If it is not present, then the program needs\nto wait until the computation involving the reference has completed\nelsewhere. The future reference might contain a large data structure and\nbe on another host entirely, and so we should serialise the program\nstate and spawn a task that is dependent on the future\u2019s completion.\nThis way, CIEL can resume execution on whatever node finished that\ncomputation, avoiding the need to move data across the network.</p>\n<p>Luckily, we do not need to serialise the entire heap to suspend the\nprogram. DataCaml uses the\n<a href=\"http://okmij.org/ftp/continuations/implementations.html\">delimcc</a>\ndelimited continuations library to walk the stack and save only the\nsubset required to restart this particular task. Delimcc abstracts this\nin the form of a \u201crestartable exception\u201d that supplies a closure which can\nbe called later to resume the execution, as if the exception had never\nhappened. Delimcc supports serialising this closure to an output\nchannel, which you can read about in Oleg\u2019s\n<a href=\"http://okmij.org/ftp/continuations/caml-shift.pdf\">paper</a>.</p>\n<p>So how do we construct references? Let\u2019s fill in more of the interface:</p>\n<pre><code>module Ciel : sig\n type 'a ref\n val deref : 'a ref -> 'a\n val spawn : ('a -> 'b) -> 'a -> 'b ref\n val run : (string list -> 'a) -> ('a -> string) -> unit\nend\n</code></pre>\n<p>The <code>spawn</code> function accepts a closure and an argument, and returns a\nfuture of the result as a reference. 
The <code>run</code> function begins the\nexecution of a job, with the first parameter taking some\n<code>string arguments</code> and returning an <code>'a</code> value. We also supply a\npretty-printer as the second argument to convert the <code>'a</code> into a string,\nwhich is returned as the result of the job (in CIEL this can actually be any\nJSON value; it is simplified to a string here).</p>\n<pre><code>let r1 = spawn (fun x -> x + 5) arg1 in\nlet r2 = spawn (fun x -> deref r1 + 5) arg1 in\nderef r2\n</code></pre>\n<p>We first spawn a task, bound to <code>r1</code>, which simply adds 5 to the job argument.\nA job in CIEL is <em>lazily scheduled</em>, so this marshals the function to\nCIEL, creates a future, and returns immediately. Next, <code>r2</code> is bound to a\nspawned task which also adds 5, but to the dereferenced value of <code>r1</code>.\nAgain, it is not scheduled yet as the return reference has not been\ndereferenced.</p>\n<p>Finally, we attempt to dereference <code>r2</code>, which causes it to be scheduled on\na worker. While executing, it will try to dereference <code>r1</code>, which schedules\nthat task in turn, and all the tasks will run to completion.</p>\n<p>Programming language boffins will recognise that this interface is very\nsimilar to <a href=\"http://www.ps.uni-saarland.de/alice/\">AliceML</a>\u2019s concept of\n<a href=\"http://www.ps.uni-saarland.de/alice/manual/futures.html\">lazy futures</a>.\nThe main difference is that it is implemented as a pure OCaml library,\nand uses a general-purpose distributed engine that can also work with\nother languages.</p>\n<h2><a href=\"https://anil.recoil.org/#streaming-references\"></a>Streaming References</h2>\n<p>The references described so far only have two states: they are either\nconcrete or futures. However, there are times when a task can\nprogressively accept input and make forward progress. 
For these\nsituations, references can also be typed as <em>opaque</em> references that are\naccessed via <code>in_channel</code> and <code>out_channel</code>, as networks are:</p>\n<pre><code>type opaque_ref\n\nval spawn_ref : (unit -> opaque_ref) -> opaque_ref\nval output : ?stream:bool -> ?pipe:bool -> (out_channel -> unit) -> opaque_ref\nval input : (in_channel -> 'a) -> opaque_ref -> 'a\n</code></pre>\n<p>This interface is a lower-level version of the previous one:</p>\n<ul>\n<li><code>spawn_ref</code> creates a lazy future as before, but the type of\nreferences here is completely opaque to the program.</li>\n<li>Inside a spawned function, <code>output</code> is called with a closure that\naccepts an <code>out_channel</code>. The <code>stream</code> argument informs CIEL that a\ndependent task can consume the output before it is completed, and\n<code>pipe</code> forms an even more closely coupled shared-memory connection\n(requiring the tasks to be scheduled on the same host). Piping is\nmore efficient, but will require more work to recover from a fault,\nand so using it is left to the programmer to decide.</li>\n<li>The <code>input</code> function is used by the receiving task to parse the\ninput as a standard <code>in_channel</code>.</li>\n</ul>\n<p>The CIEL engine actually supports multiple concurrent input and output\nstreams to a task, but I\u2019ve just bound it as a single version for now\nwhile the bindings find their feet. Here\u2019s an example of how streaming\nreferences can be used:</p>\n<pre><code>let x_ref = spawn_ref (fun () ->\n output ~stream:true (fun oc ->\n for i = 0 to 5 do\n Unix.sleep 1;\n fprintf oc "%d\\n%!" i;\n done\n )\n ) in\n let y_ref = spawn_ref (fun () ->\n input (fun ic ->\n output ~stream:true (fun oc ->\n for i = 0 to 5 do\n let line = input_line ic in\n fprintf oc "LINE=%s\\n%!" 
line\n done\n )\n ) x_ref\n ) in\n</code></pre>\n<p>We first spawn an <code>x_ref</code> which pretends to do 6 seconds of work by\nsleeping for a second and outputting a number on each of its six loop\niterations. This would of course be heavy number\ncrunching in a real program. The <code>y_ref</code> then inputs this stream, and\noutputs its own result by prepending a string to each line.</p>\n<h2><a href=\"https://anil.recoil.org/#try-it-out\"></a>Try it out</h2>\n<p>If you are interested in a more realistic example, then read through the\n<a href=\"https://github.com/avsm/ciel/blob/master/src/ocaml/binomial.ml\">binomial\noptions</a>\ncalculator that uses streaming references to parallelise a dynamic\nprogramming problem (this would be difficult to express in MapReduce).\nOn my Mac, I can run it as follows:</p>\n<ul>\n<li>check out CIEL from Derek\u2019s <a href=\"http://github.com/mrry/ciel\">Git\nrepository</a>.</li>\n<li>install all the Python libraries required (see the <code>INSTALL</code> file)\nand OCaml libraries\n(<a href=\"http://okmij.org/ftp/continuations/implementations.html\">delimcc</a>\nand <a href=\"http://martin.jambon.free.fr/yojson.html\">Yojson</a>).</li>\n<li>add <code>&lt;repo&gt;/src/python</code> to your <code>PYTHONPATH</code></li>\n<li>in one terminal: <code>./scripts/run_master.sh</code></li>\n<li>in another terminal: <code>./scripts/run_worker.sh -n 5</code> (this allocates\n5 execution slots)</li>\n<li>build the OCaml libraries: <code>cd src/ocaml && make</code></li>\n<li>start the binomial options job:\n<code>./scripts/sw-start-job -m http://localhost:8000 ./src/package/ocaml_binopt.pack</code></li>\n<li>there will be a URL printed which shows the execution progress in\nreal-time</li>\n<li>you should see log activity on the worker(s), and a result reference\nwith the answer (<code>10.x</code>)</li>\n<li>let us know the happy news if it worked or sad news if something\nbroke</li>\n</ul>\n<h2><a href=\"https://anil.recoil.org/#discussion\"></a>Discussion</h2>\n<p>The DataCaml bindings 
outlined here provide an easy way to write\ndistributed, fault-tolerant and cluster-scheduled jobs in OCaml. The\ncurrent implementation of the engine is aimed at cluster computation,\nbut <a href=\"http://www.cl.cam.ac.uk/~ms705\">Malte</a> has been working on\n<a href=\"http://www.cl.cam.ac.uk/~ms705/pub/papers/2011-ciel-sfma.pdf\">condensing CIEL onto multicore\nhardware</a>.\nThus, this could be one approach to \u2018solving the OCaml multicore\nproblem\u2019 for problems that fit nicely into the dataflow paradigm.</p>\n<p>The biggest limitation of these bindings is that delimited\ncontinuation serialisation only works in bytecode. Native code delimcc\nsupports <code>shift/reset</code> in the same program, but serialising is\nproblematic since native code continuations contain a C stack, which may\nhave unboxed integers. One way to work around this is by switching to\na monadic approach to dereferencing, but I find delimcc programming more\nnatural (also see <a href=\"http://www.openmirage.org/wiki/delimcc-vs-lwt\">this\ndiscussion</a>).</p>\n<p>Another important point is that tasks are lazy and purely functional\n(remind you of Haskell?). This is essential for reliable fault-tolerance\nand reproducibility, while still allowing the code inside an individual task\nto be fast, strict and mutable OCaml. Seen from outside, the tasks must remain\nreferentially transparent\nand idempotent, as CIEL may choose to schedule them multiple times (in\nthe case of faults or straggler correction). Derek has been working on\n<a href=\"http://www.cl.cam.ac.uk/~dgm36/publications/2011-murray2011nondet.pdf\">integrating non-determinism into\nCIEL</a>,\nso this restriction may be relaxed soon.</p>\n<p>Finally, these ideas are not limited to OCaml at all, but also apply to\nScala, Java, and Python. 
We have submitted a draft paper dubbed <em>\u2018<a href=\"http://www.cl.cam.ac.uk/~ms705/pub/papers/2011-ciel-socc-draft.pdf\">A\nPolyglot Approach to Cloud\nProgramming</a>\u2019</em>\nwith more details and the ubiquitous evaluation versus Hadoop. There is\na really interesting line to explore between low-level\n<a href=\"http://en.wikipedia.org/wiki/Message_Passing_Interface\">MPI</a> coding and\nhigh-level MapReduce, and we think CIEL is a useful spot in that design\nspace.</p>\n<p>Incidentally, I was recently hosted by <a href=\"http://research.nokia.com/\">Nokia\nResearch</a> in Palo Alto by my friend\n<a href=\"http://www.linkedin.com/pub/prashanth-mundkur/6/b44/27\">Prashanth\nMundkur</a>, where\nthey work on the Python/Erlang/OCaml <a href=\"http://discoproject.org/\">Disco</a>\nMapReduce engine. I\u2019m looking forward to seeing more critical\ncomparisons and discussions of alternatives to Hadoop, from them and\nothers.</p>\n<p><em>Thanks are due to <a href=\"http://www.cl.cam.ac.uk/~dgm36/\">Derek</a>,\n<a href=\"https://twitter.com/#!/chrissmowton\">Chris</a> and\n<a href=\"http://www.cl.cam.ac.uk/~ms705\">Malte</a> for answering my incessant CIEL\nquestions while writing this post! Remember that DataCaml is a work in\nprogress and a research prototype, and feedback is most welcome.</em></p>",
  "content_type": "html",
  "author": {
    "name": "Anil Madhavapeddy",
    "email": "anil@recoil.org",
    "uri": "https://anil.recoil.org"
  },
  "categories": [],
  "rights": "(c) 1998-2025 Anil Madhavapeddy, all rights reserved",
  "source": "https://anil.recoil.org/news.xml"
}