{
  "id": "https://www.tunbury.org/2025/03/30/box-diff",
  "title": "Box Diff Tool",
  "link": "https://www.tunbury.org/2025/03/30/box-diff/",
  "updated": "2025-03-30T00:00:00",
  "published": "2025-03-30T00:00:00",
  "summary": "Box has an unlimited storage model but has an upload limit of 1TB per month. I have been uploading various data silos but would now like to verify that the data is all present. Box has an extensive API, but I only need the list items in folder call.",
  "content": "<p>Box has an unlimited storage model but has an upload limit of 1TB per month. I have been uploading various data silos but would now like to verify that the data is all present. Box has an extensive <a href=\"https://developer.box.com/reference/\">API</a>, but I only need the <a href=\"https://developer.box.com/reference/get-folders-id-items/\">list items in folder</a> call.</p>\n\n<p>The list-items call assumes that you have a folder ID which you would like to query. The root of the tree is always ID 0. To check for the presence of file <code>foo</code> in a folder tree <code>a/b/c/foo</code>, we need to call the API with folder ID 0. This returns a list of entries in that folder. e.g.</p>\n\n<div><div><pre><code><span>{</span><span>\n </span><span>\"entries\"</span><span>:</span><span> </span><span>[</span><span>\n </span><span>{</span><span>\n </span><span>\"id\"</span><span>:</span><span> </span><span>\"12345\"</span><span>,</span><span>\n </span><span>\"type\"</span><span>:</span><span> </span><span>\"folder\"</span><span>,</span><span>\n </span><span>\"name\"</span><span>:</span><span> </span><span>\"a\"</span><span>\n </span><span>}</span><span>\n </span><span>]</span><span>\n</span><span>}</span><span>\n</span></code></pre></div></div>\n\n<p>The API must now be called again with the new ID number to get the contents of folder <code>a</code>. This is repeated until we finally have the entries for folder <code>c</code> which would contain the file itself. I have used a <code>Hashtbl</code> to cache the results of each call.</p>\n\n<div><div><pre><code><span>{</span><span>\n </span><span>\"entries\"</span><span>:</span><span> </span><span>[</span><span>\n </span><span>{</span><span>\n </span><span>\"id\"</span><span>:</span><span> </span><span>\"78923434\"</span><span>,</span><span>\n </span><span>\"type\"</span><span>:</span><span> </span><span>\"file\"</span><span>,</span><span>\n </span><span>\"name\"</span><span>:</span><span> </span><span>\"foo\"</span><span>\n </span><span>}</span><span>\n </span><span>]</span><span>\n</span><span>}</span><span>\n</span></code></pre></div></div>\n\n<p>Each call defaults to returning at most 100 entries. This can be increased to a maximum of 1000 by passing <code>?limit=1000</code> to the GET request. For more results, Box offers two pagination systems: <code>offset</code> and <code>marker</code>. Offset allows you to pass a starting item number along with the call, but this is limited to 10,000 entries.</p>\n\n<blockquote>\n <p>Queries with offset parameter value exceeding 10000 will be rejected with a 400 response.</p>\n</blockquote>\n\n<p>To deal with folders of any size, we should use the marker system. For this, we pass <code>?usemarker=true</code> to the first GET request, which causes the API to return <code>next_marker</code> and <code>prev_marker</code> as required as additional JSON properties. Subsequent calls would use <code>?usemarker=true&amp;marker=XXX</code>. The end is detected by the absence of the <code>next_marker</code> when no more entries are available.</p>\n\n<p>The project can be found on GitHub in <a href=\"https://github.com/mtelvers/ocaml-box-diff\">mtelvers/ocaml-box-diff</a>.</p>",
  "content_type": "html",
  "author": {
    "name": "Mark Elvers",
    "email": "mark.elvers@tunbury.org",
    "uri": null
  },
  "categories": [
    "OCaml,Box",
    "tunbury.org"
  ],
  "source": "https://www.tunbury.org/atom.xml"
}
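
The post describes two mechanisms: walking a path such as a/b/c/foo one folder at a time from folder ID 0 while caching each folder listing, and following next_marker until it is absent. The sketch below is one possible illustration of both, not the code from mtelvers/ocaml-box-diff. It is written in Python rather than OCaml, with a dict standing in for the Hashtbl. The names fetch_page, list_folder, find_path and make_fake_fetch are hypothetical, and no real HTTP request is made: fetch_page is assumed to wrap the list-items call (with usemarker=true and the marker, if any) and to return one page of entries plus the next_marker, or None when there is none.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical callback: (folder_id, marker) -> (entries on this page, next_marker or None).
FetchPage = Callable[[str, Optional[str]], Tuple[List[dict], Optional[str]]]


def list_folder(fetch_page: FetchPage, folder_id: str) -> List[dict]:
    """Collect all entries of one folder by following next_marker until it is absent."""
    entries: List[dict] = []
    marker: Optional[str] = None
    while True:
        page, marker = fetch_page(folder_id, marker)
        entries.extend(page)
        if not marker:  # no next_marker: the listing is complete
            return entries


def find_path(fetch_page: FetchPage, path: str,
              cache: Dict[str, List[dict]]) -> Optional[dict]:
    """Resolve a path like 'a/b/c/foo' starting at folder ID 0; None if any part is missing."""
    folder_id = "0"  # the root of the tree is always ID 0
    entry: Optional[dict] = None
    parts = [p for p in path.split("/") if p]
    for i, name in enumerate(parts):
        if folder_id not in cache:  # the dict plays the role of the Hashtbl cache
            cache[folder_id] = list_folder(fetch_page, folder_id)
        entry = next((e for e in cache[folder_id] if e["name"] == name), None)
        if entry is None:
            return None
        if i < len(parts) - 1:
            if entry["type"] != "folder":
                return None  # cannot descend into a file
            folder_id = entry["id"]
    return entry


def make_fake_fetch(tree: Dict[str, List[dict]], page_size: int = 1):
    """In-memory stand-in for the API. The marker here is just a stringified index,
    which is an assumption of this fake; real markers should be treated as opaque."""
    calls: List[Tuple[str, Optional[str]]] = []

    def fetch_page(folder_id: str, marker: Optional[str]):
        calls.append((folder_id, marker))
        items = tree.get(folder_id, [])
        start = int(marker) if marker else 0
        end = start + page_size
        next_marker = str(end) if end < len(items) else None
        return items[start:end], next_marker

    return fetch_page, calls


# Usage with made-up IDs, apart from the two that appear in the post's examples.
tree = {
    "0": [{"id": "12345", "type": "folder", "name": "a"},
          {"id": "9", "type": "file", "name": "x"}],
    "12345": [{"id": "2", "type": "folder", "name": "b"}],
    "2": [{"id": "3", "type": "folder", "name": "c"}],
    "3": [{"id": "78923434", "type": "file", "name": "foo"}],
}
fetch, calls = make_fake_fetch(tree, page_size=1)
cache: Dict[str, List[dict]] = {}
found = find_path(fetch, "a/b/c/foo", cache)
```

Because listings are cached per folder ID, checking many files under the same directories costs one set of paginated calls per folder rather than one per file.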