{
  "id": "https://gabrielmahler.org/walkability/compsci/2025/06/04/designimplementation",
  "title": "Walkability Chapter 4: Design and Implementation",
  "link": "https://gabrielmahler.org/walkability/compsci/2025/06/04/designimplementation.html",
  "updated": "2025-06-04T10:40:11",
  "published": "2025-06-04T10:40:11",
  "summary": "Design and implementation",
8 "content": "<h3>Design and implementation</h3>\n\n<p>To address the issue of generating walkability-friendly and\nuser-customizable pedestrian routes, our approach is divided into four\nparts: (1) data aggregation, conflation, and pre-processing, (2) the\ndevelopment of a specialized fine-tuning pipeline for sentence\nembedders, leveraging contrastive learning to learn representations of\ngenerally walkable (and unwalkable) place descriptions, (3) inference\nof point-wise scores based on “general walkability” and\npreference-specific criteria from generated comprehensive embedding sets, and (4)\nintegration of the point-wise scores in an A*-based path-finding\nalgorithm.</p>\n\n<h2>Data Preparation</h2>\n\n<p>As already as discussed earlier, we concluded that the\nFoursquare and Overture Maps suffered from various insufficiencies. In\nthe context of our work, both exhibited low temporal accuracy and\nfocused on a relatively narrow selection of geospatial features with\nnormalized but limited descriptions. Furthermore (in contrast to OSM),\nthe feasibility of efficiently aggregating additional information from\nexternal sources in both of these datasets was minimal, as they only\never referenced private websites or social media profiles. Subsequently,\nOSM was eventually chosen to constitute the skeleton of our knowledge\nbase.</p>\n\n<h3>OSM Pre-Processing</h3>\n\n\n\n \n \n <strong>Feature Type</strong>\n Quantity (in thousands)\n with Wikidata Reference\n \n \n \n \n Ways\n 19.1\n 362\n \n \n Segmented Ways\n 38.6\n 362\n \n \n Nodes\n 34.6\n 1086\n \n \n Buildings\n 35.9\n 133\n \n \n Outdoor areas\n 2.3\n 35\n \n \n\n\n\n Summary of extracted OSM feature counts for Cambridge, UK. \n\n<p>To construct a robust knowledge base from OSM and to minimize the risk\nof losing potentially useful information or data points, we chose to\nmanually implement our own filters and process raw OSM data (instead of\nrelying on existing third-party post-processed datasets or APIs).</p>\n\n<p>The segment network used in our work was created from segmented OSM\n“ways”, where each segment is defined at both ends either by a junction\nwith another segment or an isolated end. In the particular case of\nCambridge, OSM holds all kinds of transportation segments, from highways\nto unofficial “desire paths”. Next, all nodes, as well as recorded\nbuildings, were extracted and stored. However, for both of these feature\ntypes, only the entries with some informative descriptions were kept.\nLastly, relevant outdoor areas were extracted, such as playgrounds,\nwater bodies, or parks. Where appropriate, these areas were conflated,\nsince raw data from OSM sometimes suffers from redundant or segmented\narea entries. Furthermore, for all OSM buildings, ways, and nodes, a\nwritten English description from Wikidata was scraped and appended to\nthe database whenever available. In the context of our model, and\nsimilarly to some user-uploaded text descriptions of nodes in OSM,\nWikidata’s descriptions suffer from non-regularity. The database\npresents descriptions of varying lengths and informative values.\nTherefore, the scraped descriptions were cleaned of, for example,\nunwanted geographical names (since those were expected to provide little\nbenefit later on), and shortened where appropriate. 
<h3>Tree Dataset</h3>\n\n<p>Since greenery can play a vital role in a data-driven inference of walkability, particularly in the geographical regions we were interested in (the UK), accurate estimates of the locations and quantities of trees are highly valuable. Although trees (and other greenery) are a common node type in OSM data, their representation substantially underestimates reality: within the boundaries of Cambridge, OSM tracks fewer than 3.5 thousand trees. In contrast, the specialized tree datasets (as introduced earlier) offer a more comprehensive and reliable source of tree-related data. Therefore, the VOM data was leveraged. Specifically, this project relies on a processed version of the VOM raster, after tree segmentation performed with the lidR package (Roussel, Goodbody, and Tompalski 2025). This version of the dataset was kindly provided by Andrés Camilo Zúñiga-González (an AI4ER Ph.D. student at the University of Cambridge) (Zúñiga González 2025), and served as the sole source of tree records for this project. Tree entries from OSM were therefore ignored. Within the boundaries of Cambridge, the segmented VOM supplied over 102 thousand trees.</p>\n\n<h3>Open Greenspace Dataset</h3>\n\n<p>The final “supplementary” dataset used was the “Greenspace Dataset”. As it narrowly specializes in public green spaces (such as public parks or playgrounds), it was used merely to enhance the spatial accuracy of, and fill gaps in, the OSM data; for Cambridge, it included only 398 entries. The Greenspace Dataset and OSM areas were therefore iteratively matched and merged on descriptions and spatial parameters, and stored in one database.</p>\n\n<h3>Point-Geospatial Context Dataset</h3>\n\n<p>This aggregated knowledge base was used to create the final point-to-geospatial-context mappings. First, a set of points was sampled from each of the segments at 10-meter intervals. For each of these points, all entities within a pre-defined buffer zone were recorded. These buffer zones were set to a 40-meter radius for buildings and a 30-meter radius for all other feature types. Furthermore, each of these segment points was also mapped to any outdoor areas it intersected.</p>\n\n<p>Given a specific point on a segment, these mappings were then used to retrieve text descriptions of the features from the parsed datasets. For each data type (such as nodes or areas), a priority mechanism selected the most desirable attributes (such as building or business type, or Wikidata description). The entity descriptions were then compiled into sentence descriptions. While the exact structure of the sentence description was subject to much experimentation (partly because some sentence encoders are better suited to specific structures), the eventual structure introduced the different feature types in order, transitioned between these types with consistent connector phrases, and represented missing entities of a given feature type with “<code>nothing</code>”. Specifically, the default descriptions followed this format:</p>\n\n<div><div><pre><code>[segment infrastructure description];\n IN AREAS: [list of areas];\n NEARBY: [list of nearby nodes and buildings].\n</code></pre></div></div>
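<p>To make this format concrete, the following minimal sketch compiles such a description for a single point; the function and its inputs are illustrative assumptions, not our actual pipeline code.</p>\n\n<div><div><pre><code># A minimal sketch of compiling a point description in the default\n# format above; names and inputs are illustrative assumptions.\ndef compile_description(infrastructure, areas, nearby):\n    # Missing entities of a feature type are represented with 'nothing'.\n    areas_part = ', '.join(areas) if areas else 'nothing'\n    nearby_part = ', '.join(nearby) if nearby else 'nothing'\n    return (f'{infrastructure};'\n            f' IN AREAS: {areas_part};'\n            f' NEARBY: {nearby_part}.')\n\n# Example: a footpath through a park, near a cafe and a library.\nprint(compile_description('a paved footpath', ['a public park'], ['a cafe', 'a library']))\n</code></pre></div></div>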
<h2>Encoder Fine-Tuning</h2>\n\n<p>To produce representations from the assembled dataset of point-to-description mappings, we used sentence encoders. While the ability to make semantic associations was the key reason for adopting pre-trained sentence encoders, these models first had to be lightly re-focused towards representing our specific descriptions. This was achieved through a contrastive fine-tuning process.</p>\n\n<h3>Fine-Tuning Dataset</h3>\n\n<p>To create a dataset for the encoder fine-tuning, a set of compiled place descriptions was encoded with an off-the-shelf encoder (specifically, with “all-MiniLM-L6-v2” from the “sentence-transformers” library (Reimers and Gurevych 2019)). Afterwards, 12,500 unique data points were selected, based on their respective embeddings, with furthest-point sampling to maximize the degree of diversity within the dataset (a minimal sketch of this step follows below).</p>\n\n<p>These points were then scored and labeled on the basis of walkability with the Mistral 7B language model (Jiang et al. 2023). The language model was prompted to assign a numerical score on a scale of zero to ten, where zero stood for the least walkable descriptions (such as descriptions of points on highways) and ten for the most walkable descriptions (such as descriptions from park footpaths). The prompt used for this purpose drew on the concepts of walkability summarized earlier, particularly the work of <em>Alfonzo</em> (Alfonzo 2005).</p>
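<p>The sketch below shows the furthest-point sampling step, assuming the description embeddings are held in an (N, d) NumPy array; the function name and details are illustrative, not our exact implementation.</p>\n\n<div><div><pre><code># A minimal sketch of furthest-point sampling over description embeddings.\nimport numpy as np\n\ndef furthest_point_sampling(embeddings, k, seed=0):\n    rng = np.random.default_rng(seed)\n    selected = [int(rng.integers(len(embeddings)))]\n    # Distance from every point to its nearest selected point.\n    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)\n    for _ in range(k - 1):\n        # Greedily pick the point furthest from the current selection.\n        nxt = int(np.argmax(dists))\n        selected.append(nxt)\n        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))\n    return selected  # e.g., k = 12500 in our setting\n</code></pre></div></div>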
<h3>Embedding Architecture</h3>\n\n<p>There is a plethora of pre-trained, publicly available sentence encoders, many of which advertise broad domain versatility in information retrieval, sentence similarity, or clustering tasks. Hence, the selection of the most suitable encoder models was a highly iterative process. Moreover, the strategy for employing these encoder models was initially unclear, and two main options were considered.</p>\n\n<p>The first option was to encompass all of the desired information for a given point in a single sentence, and then use a single encoder to generate the point embeddings. This approach offered much simplicity, but risked relying too heavily on the encoder model’s ability to extract and represent all of the important features. Moreover, this approach was less flexible for potential future implementations where, for instance, not all features should be used to generate embeddings.</p>\n\n<p>The second option was to encode each feature or section of the description individually, potentially with different encoder models, later composing these embeddings into a singular vector. A similar approach is developed in, for instance, the aforementioned work by <em>Tempelmeier et al.</em> (Tempelmeier, Gottschalk, and Demidova 2021). Accordingly, several implementations of this approach were tested, none with satisfying results. In some of the attempts, the set of embeddings of a given point’s individual features was composed by simply averaging those feature embeddings. Alternatively, the composed vector was generated via a fusion component, which was also trained during the fine-tuning phase.</p>\n\n<p>Nonetheless, none of the attempts to compose embeddings of individual features into a singular vector proved useful. The models were prone to over-clustering (pulling samples of the same class too close together) during the contrastive fine-tuning phase, and generally failed to retain the ability of the original off-the-shelf models to later make relevant semantic associations.</p>\n\n<p>Hence, this work relies on a single-encoder architecture, processing descriptions composed as single sentences. Furthermore, the fine-tuning of the sentence encoders was done via LoRA adapters. The adapters were injected into each of the pre-trained models, and while the models’ weights remained frozen during the fine-tuning, the adapters’ weights adjusted to the contrastive objective.</p>\n\n<h3>Contrastive Fine-Tuning</h3>\n\n<p>With the LLM-labeled dataset, sentence encoders were fine-tuned using a Triplet Loss-based strategy. This strategy was implemented by simply splitting the training examples into a positive and a negative bin: the positive bin contained data points with an LLM-assigned score of seven or higher, and the negative bin those with a score of three or lower. In order to create a clear contrast between the “walkable” and the “unwalkable”, data points that fell into neither of the two bins were discarded. After this binning, the positive bin contained 5390 examples, and the negative bin 1060. This disparity between the sizes of the two bins was most likely caused by the fact that points with low walkability scores were frequently associated with fewer features (e.g., high-speed roads in urban outskirts), whereas highly walkable places were more commonly surrounded by heterogeneous elements (e.g., paths surrounded by amenities or places). Hence, there were fewer unique points with poor walkability than unique points with high walkability.</p>\n\n<p>During the training, and due to the differing cardinalities of the two bins, the dataloader sampled the positive and negative examples randomly for each iterated anchor. Furthermore, every time an example data point was used, its lists of associated areas and of nearby nodes and buildings were first randomly shuffled to instill a degree of permutation invariance in the encoder.</p>\n\n<p>Extended with the LoRA adapters, the models adjusted to the fine-tuning objective after only a few epochs and required minimal training durations. Although no model was fine-tuned for more than fifteen epochs, generally only models trained for fewer than five epochs proved useful. Unsurprisingly, due to the contrastive objective and the crudeness of the data bins, preventing over-clustering was essential. While thoroughly fine-tuned encoders successfully managed to classify examples as walkable or non-walkable in downstream tasks, their representations over-emphasized this distinction and neglected other features present in the examples.</p>
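<p>The sketch below illustrates this triplet-based setup with the classic “sentence-transformers” training API. It is a minimal sketch under stated assumptions: the LoRA adapter injection (and base-weight freezing) used in the actual pipeline is omitted for brevity, the bin contents are placeholders, and sampling anchors from the positive bin is one plausible reading of the dataloader described above.</p>\n\n<div><div><pre><code># A minimal sketch of the triplet-based contrastive fine-tuning;\n# bins and names are placeholders, and LoRA injection is omitted.\nimport random\nfrom torch.utils.data import DataLoader\nfrom sentence_transformers import SentenceTransformer, InputExample, losses\n\npositive_bin = ['...walkable descriptions...']    # LLM score of 7 or higher\nnegative_bin = ['...unwalkable descriptions...']  # LLM score of 3 or lower\n\nmodel = SentenceTransformer('all-MiniLM-L6-v2')\ntriplets = [\n    InputExample(texts=[anchor,\n                        random.choice(positive_bin),\n                        random.choice(negative_bin)])\n    for anchor in positive_bin\n]\nloader = DataLoader(triplets, shuffle=True, batch_size=16)\nloss = losses.TripletLoss(model=model)\n\n# Only a few epochs were needed before over-clustering set in.\nmodel.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)\n</code></pre></div></div>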
<h2>Urban Embeddings and Scoring</h2>\n\n<p>Leveraging the ability of sentence encoders to independently project individual examples into the embedding space, we developed an anchor-based method for the generation of absolute walkability scores. Furthermore, because of the use of anchors and the encoder’s ability to highlight semantic associations, we were able to readjust the scoring pipeline and generate not only general walkability scores but also scores reflective of more specific pedestrian preferences.</p>\n\n<h3>Walkability Scoring</h3>\n\n<p>Although simple distance metrics, such as cosine similarity, are frequently used for tasks such as embedding-based retrieval, their outputs reflect relative relationships only within the considered set of examples. For instance, if plain cosine similarity were used to infer walkability indices in a specific area, the obtained “scores” would imply walkability only relative to the other points in the sample, and not relative to any general expectations regarding walkability.</p>\n\n<p>Therefore, we used an anchor-based linear scaling approach to establish these expectations. The approach considers three anchor vectors: a completely negative anchor (representing highly unwalkable data points), a neutral anchor (representing data points of average walkability), and a positive anchor (representing data points with the highest possible walkability indices). These anchors were used to establish a set of thresholds, i.e., where specific ranges of walkability indices begin in the embedding space and where they end. Each threshold was defined as a cosine distance from the positive anchor. More specifically, since in this work we used three thresholds, the negative anchor defined the distance-from-the-positive-anchor threshold for walkability scores equal to zero, the neutral anchor the threshold for scores equal to five, and the positive anchor itself (at distance zero) the threshold for scores equal to ten. Since distances in the embedding space may not be proportional to the actual walkability scores, the neutral anchor was added with the intention of adjusting for this disparity and improving the scoring system’s outputs. Then, the embedding of a given example was situated on the threshold scale based on its similarity to the positive anchor, and its absolute score was calculated through linear scaling, with the two thresholds as points of reference.</p>\n\n<p>To obtain each of the anchors, a set of manually selected example sentences was constructed. Each sentence was meant to provide a specific, strong example of the type of descriptions the given anchor represents. Each sentence was then embedded with the fine-tuned encoder, and the entire set was averaged to produce the final vectorized anchor. The curation of the sentences used in the anchors was, nevertheless, not guided by any exact notions; after a number of experimental iterations, all three sets consisted of twelve exemplary sentences, following the sentence structure described above.</p>
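<p>A minimal sketch of this scoring scheme is shown below, assuming pre-computed anchor vectors; the piecewise linear scaling between the two thresholds is our reading of the description above, and all names are illustrative.</p>\n\n<div><div><pre><code># A minimal sketch of anchor-based linear scaling; anchors are assumed\n# to be averaged embeddings of curated example sentences.\nimport numpy as np\n\ndef cosine(a, b):\n    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n\ndef walkability_score(embedding, positive, neutral, negative):\n    # Thresholds: distance from the positive anchor at which scores\n    # should read 0 (negative anchor) and 5 (neutral anchor).\n    d_zero = 1.0 - cosine(negative, positive)\n    d_five = 1.0 - cosine(neutral, positive)\n    d = 1.0 - cosine(embedding, positive)\n    if d >= d_five:\n        # Scale linearly between the 0- and 5-score thresholds.\n        score = 5.0 * (d_zero - d) / (d_zero - d_five)\n    else:\n        # Scale linearly between the 5-score threshold and distance 0.\n        score = 5.0 + 5.0 * (d_five - d) / d_five\n    return float(np.clip(score, 0.0, 10.0))\n</code></pre></div></div>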
<h3>Embedding Sets</h3>\n\n<p>A significant advantage of using a similarity-based scoring system lies in its computational efficiency once the point-wise embeddings are generated. After obtaining a fine-tuned model, the preferences (such as the various reference points) are reflected only in the anchors, and not in the representations of the geospatial points. Therefore, to generate scores, the system only needs to embed the few walkability anchors and perform the linear-scaling scoring. Since cosine similarity is computationally inexpensive, this process is very quick and allows the geospatial embeddings of the entire area of interest to be pre-computed. As a result, a dataset of mappings from points (defined by geographical coordinates) to embedded descriptions can be stored and used later in various downstream tasks.</p>\n\n<h3>Custom Preference Scoring</h3>\n\n<p>Despite the specialized fine-tuning, the embeddings created from descriptions of geospatial points can be used for more than strictly general walkability-focused tasks; they can, for instance, capture preferences towards particular geospatial areas or elements. In fact, by adjusting the anchors used in our linear scoring method, more specific pedestrian preferences can be used to generate the walkability scores. If the fine-tuning performed is sufficiently light, these preferences remain reflected in the embeddings generated by the encoder. The scoring pipeline then rewards data points closer to the preference-adjusted anchors and generates scores that lean towards the initial preferences. Specific implementations of this feature are discussed in the <em>Evaluation</em> chapter of this series.</p>\n\n<h2>Path-Finding</h2>\n\n<p>With access to point-wise walkability indices generated by our scoring pipeline, capable of producing evaluations of unrestricted spatial granularity, we assembled a new routing algorithm. Unlike existing approaches, our algorithm did not have to rely on costs calculated with manually fine-tuned static profiles. Instead, it was supported by scores calculated from embeddings generated by the custom sentence encoders, and thus reflected the variety of our aggregated geospatial data. We used our OSM segment database to construct an infrastructure network. Then, we combined aggregates of the walkability or preference-specific scores with the segment lengths to calculate a total cost for each segment in the network. To generate paths in this network, we used an A*-based search algorithm. The implementation of our A* was relatively straightforward: it relied on a unidirectional search with no particular tweaks or optimizations (such as contraction hierarchies). This was because, in the scope of this work, pedestrian routing in urban areas was our only focus; hence, such adjustments and optimizations, often implemented by existing path-finding frameworks, were deemed unnecessary.</p>\n\n<h3>Cost Estimation</h3>\n\n<p>Establishing an effective approach to calculating the overall cost-so-far $g(n)$ for the A* algorithm required more nuance. This was primarily because of the point-based approach, where highly desirable (or undesirable) features were often reflected in only a few points. Moreover, depending on the anchor configuration, considerable differences between points were reflected only by marginal differences in the scores. Therefore, effectively preventing the “average” points from outweighing the critically important points was necessary. Similarly, finding a working balance between the walkability scores and the distance (which still had to be reflected in the cost calculation) was crucial for the generation of desirable routes.</p>\n\n\\[segment\\ cost = \\frac{n}{\\sum_{i=1}^{n} \\frac{1}{inv.\\ score_i + \\delta}} \\cdot segment\\ length\\]\n\n<p>Considering these factors, a harmonic mean-based approach was eventually adopted. To calculate the cost of a specific segment, the above formula was used, with the $\delta$ constant equal to $10^{-6}$ and the scores proportionately inverted so that lower scores were “better” and resulted in lower costs.</p>
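<p>The minimal sketch below implements this segment cost, assuming the 0-10 scores are inverted as (10 - score); the exact inversion used in the pipeline is an assumption for illustration.</p>\n\n<div><div><pre><code># A minimal sketch of the harmonic mean-based segment cost; the\n# inversion (10 - score) is an illustrative assumption.\ndef segment_cost(scores, segment_length, delta=1e-6):\n    inv_scores = [10.0 - s for s in scores]  # lower is now 'better'\n    # Harmonic mean: a few very walkable points (small inverted\n    # scores) dominate the sum and pull the segment's cost down.\n    harmonic = len(inv_scores) / sum(1.0 / (s + delta) for s in inv_scores)\n    return harmonic * segment_length\n\n# Example: a 120 m segment with one highly walkable point among\n# otherwise average points.\nprint(segment_cost([5.0, 5.0, 9.5, 5.0], 120.0))\n</code></pre></div></div>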
<h3>Heuristic Function</h3>\n\n<p>Similarly to related path-finding frameworks and implementations, the heuristic function used in this work remained simple. In fact, our A* simply used the total Euclidean distance between the iterated node and the target node, scaled by the globally lowest calculated cost per unit of length. By scaling the distance with this lowest per-unit cost, the heuristic remained a guaranteed underestimate of the true path cost and was, therefore, admissible. In this way, A* received an informed estimate with minimal computational overhead and without the risk of sub-optimality.</p>\n\n<h3>References</h3>\n\n<ul>\n  <li>Alfonzo, M. A. (2005). <em>To Walk or Not to Walk? The Hierarchy of Walking Needs</em>. <em>Environment and Behavior</em>, 37(6), 808–836.</li>\n  <li>Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., et al. (2023). <em>Mistral 7B</em>. <a href=\"https://arxiv.org/abs/2310.06825\">https://arxiv.org/abs/2310.06825</a></li>\n  <li>Reimers, N., & Gurevych, I. (2019). <em>Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks</em>. In <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</em>. Association for Computational Linguistics. <a href=\"https://arxiv.org/abs/1908.10084\">https://arxiv.org/abs/1908.10084</a></li>\n  <li>Roussel, J.-R., Goodbody, T. R. H., & Tompalski, P. (2025). <em>The lidR Package</em>. <a href=\"https://r-lidar.github.io/lidRbook/\">https://r-lidar.github.io/lidRbook/</a></li>\n  <li>Tempelmeier, N., Gottschalk, S., & Demidova, E. (2021). <em>GeoVectors: A Linked Open Corpus of OpenStreetMap Embeddings on World Scale</em>. In <em>Proceedings of the 30th ACM International Conference on Information & Knowledge Management</em>, 4604–4612.</li>\n  <li>Zúñiga González, A. C. (2025). <em>Post-Processed LiDAR Point-Cloud Dataset</em>. Unpublished dataset, provided by the author.</li>\n</ul>",
9 "content_type": "html",
10 "author": {
11 "name": "",
12 "email": null,
13 "uri": null
14 },
15 "categories": [
16 "walkability",
17 "compsci"
18 ],
19 "source": "https://gabrielmahler.org/feed.xml"
20}