{
  "id": "https://gabrielmahler.org/walkability/ai/ml/2025/06/06/evaluation2",
  "title": "Walkability Chapter 5: Evaluation (Part 2: Walkability Assessment)",
  "link": "https://gabrielmahler.org/walkability/ai/ml/2025/06/06/evaluation2.html",
  "updated": "2025-06-06T10:40:11",
  "published": "2025-06-06T10:40:11",
  "summary": "Walkability Assessment",
  "content": "<h1>Walkability Assessment</h1>\n\n<p>To address the inability of the open-source frameworks to identify\nhighly walkable (or otherwise interesting) urban spaces, we utilize our\nwalkability assessment tool (as presented in the previous parts).\nFurthermore, we perform our walkability assessment experiments with two\ndistinct sentence transformers to build a more comprehensive overview\nand underline the dangers of over-clustering under the contrastive\nfine-tuning.</p>\n\n<p>In this section, we also highlight the accessibility and\ncomprehensiveness of our approach to defining specific preferences. In\ncontrast to the complicated routing profiles of the open-source baseline\nframeworks, our method relies on preferences expressed through plain natural\nlanguage sentences. Therefore, we demonstrate how our framework provides\na solution to our third research question: how can we <strong>simplify user\ninputs</strong>?</p>\n\n<h3>Experimental Encoder Models</h3>\n\n<p>Besides the design of the sentence embedding strategy, the\nselection of the specific pre-trained sentence encoder and the degree of\nfine-tuning proved equally critical. While most of the considered\nencoders were trained on similar large text corpora lacking any\nparticular thematic specialization, their responses to fine-tuning were\nvery diverse. Consequently, the selection of considered encoders\neventually narrowed to two models: “all-mpnet-base-v2” and\n“all-MiniLM-L12-v2”. 
Both of these encoders are part of HuggingFace’s\n“sentence-transformers” library (Reimers and Gurevych 2019).</p>\n\n<p>all-mpnet-base-v2 projects text inputs into a 768-dimensional vector\nspace, and is a fine-tuned variation of MPNet - a transformer-based\nmodel improving over BERT and RoBERTa by combining masked language\nmodeling with permutation-based training, thus improving the model’s\nability to capture semantic dependencies (Song et al. 2020).</p>\n\n<p>The second sentence encoder, all-MiniLM-L12-v2, is based on MiniLM, an\napproach developed with the goal of compressing large transformer-based\nmodels, such as BERT, while minimizing the loss in performance (Wang et al.\n2020). The approach relies on deep self-attention distillation, where\na smaller “student” model learns by mimicking the self-attention\nbehavior of a larger “teacher” model. Similar to all-mpnet-base-v2,\nall-MiniLM-L12-v2 is fine-tuned under a contrastive objective, but\noutputs embeddings of only 384 dimensions.</p>\n\n<h3>General Walkability</h3>\n\n<p>The same settings were used for both of the encoder models. In the\nanchor-based scoring system, outputs from the models were weighted\nagainst embeddings of identical preference anchors, generated from the\nsame twelve sentences.</p>\n\n<p><strong>Table: Average general walkability score across various fine-tuning epochs</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>0.03</td><td>3.94</td><td>3.94</td><td>3.63</td><td>3.94</td><td>3.80</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>0.01</td><td>2.45</td><td>2.65</td><td>3.42</td><td>2.97</td><td>3.23</td></tr>\n</table>\n\n<p>The scores generated by these two encoders exhibited several\nshared patterns. As illustrated, embeddings generated\nby the off-the-shelf encoders largely failed to relate to\nthe anchor embeddings. 
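The anchor-based weighting described above can be sketched in a few lines. Note that the anchor sentences, the per-anchor target values, the toy two-dimensional embeddings, and the softmax weighting below are all illustrative assumptions for this sketch, not the pipeline's actual implementation, which is defined in the earlier parts.

```python
import numpy as np

# Hypothetical multi-anchor scoring sketch: each preference anchor carries a
# target score, and a point description is scored by a cosine-similarity-
# weighted average over the anchor embeddings. The 2-D vectors below are toy
# stand-ins for real encoder outputs.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def anchor_score(point_emb, anchor_embs, anchor_values):
    sims = np.array([cosine(point_emb, a) for a in anchor_embs])
    weights = np.exp(sims) / np.exp(sims).sum()  # softmax over similarities
    return float(weights @ np.array(anchor_values))

# Toy anchors: a positive one (score 9), a neutral one (5), a negative one (1).
anchors = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
values = [9.0, 5.0, 1.0]

park_point = np.array([0.9, 0.1])      # embedding close to the positive anchor
service_point = np.array([-0.8, 0.2])  # embedding close to the negative anchor
```

With this weighting, a point whose embedding sits near the positive anchor scores well above the neutral 5, and one near the negative anchor scores well below it.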
As a result, most of the representations\ngenerated with these vanilla encoders resulted in extremely negative\nscores.</p>\n\n<p>Nevertheless, both encoder models also demonstrated an ability to align\ntheir projections and adjust to the specific settings extremely quickly.\nThe averages of the inferred scores jumped significantly after only a\nsingle fine-tuning epoch. This highlighted the encoders’ ability to\nadjust to the specific format of the point description sentences and the\nefficiency of the contrastive fine-tuning approach.</p>\n\n<p>Furthermore, in terms of the mean walkability scores, both\nall-mpnet-base-v2 and all-MiniLM-L12-v2 achieved relative consistency\nafter the initial alignment during the first fine-tuning phase.</p>\n\n<p><img alt=\"alt text\" src=\"https://gabrielmahler.org/assets/images/thesis/new%20images/mpnet/mpnet-general.jpeg\"></p>\n\n<p>However, the relative consistency of the scores generated by our\nencoders did not imply stalled training. Instead, with rising numbers of\nfine-tuning epochs, the encoders started over-clustering under the\ncontrastive objective. Due to the multi-anchored scoring system,\nhowever, this did not result only in extremely positive or negative\nscores, but also in extremely “average” scores. This is well apparent in the table, where\nthe scores of a rather positive example converge towards 5 as the\nfine-tuning proceeds. We hypothesize this is because of the\nprogressively expanding margin between the highly positive (walkability\nscores greater than or equal to 7) and negative (walkability scores less\nthan or equal to 3) examples, which places examples “in the middle” into\nrelative proximity to the neutral anchor. 
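This expanding-margin hypothesis can be illustrated with a deliberately simplified one-dimensional model. The geometry (anchors on a line, inverse-distance weighting, the specific margin values) is an assumption made only for this sketch and is not the actual scoring pipeline.

```python
import numpy as np

# Toy 1-D illustration of the expanding-margin hypothesis: the positive and
# negative anchors sit at +m and -m, the neutral anchor at 0, and a "rather
# positive" point stays fixed at x = 2. Scoring by inverse-distance weighting
# of the anchor values shows the point's score collapsing towards the
# neutral 5 as the margin m grows.

def score(x, margin, values=(9.0, 5.0, 1.0)):
    anchors = np.array([margin, 0.0, -margin])  # positive, neutral, negative
    weights = 1.0 / (np.abs(x - anchors) + 1e-9)
    weights = weights / weights.sum()
    return float(weights @ np.array(values))

early = score(2.0, margin=3.0)    # light fine-tuning: clearly positive score
late = score(2.0, margin=10.0)    # heavy fine-tuning: almost "average" score
```

As the margin widens, the unchanged mid-range point ends up closest to the neutral anchor, which is exactly the "extremely average" behavior observed in the table.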
This\nover-clustering not only results in distorted final scores but also\nsuppresses the models’ ability to derive associations between various\nsemantic features.</p>\n\n<p><strong>Table: Variance of general walkability scores over fine-tuning epochs</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>0.04</td><td>10.18</td><td>9.77</td><td>7.04</td><td>5.11</td><td>5.58</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>0.01</td><td>7.74</td><td>7.76</td><td>7.33</td><td>5.30</td><td>5.11</td></tr>\n</table>\n\n<p>The negative effects of prolonged fine-tuning are also reflected in the\nvariances of the final scores. Across both\nmodels, the generated scores attain peak variance after one or two\nepochs of fine-tuning, but the variance begins falling as the training continues. As\ndiscussed earlier, this is presumably because of the increasing\ndistances between the projections of the positive and negative anchors.\nHowever, as the contrastive fine-tuning shifts the projections to\nmaximize this distance, the models’ original ability to extract features\nalso starts to vanish. Therefore, a high variance of the scores is, in\nthis case, desirable because it reflects the system’s attention to\nindividual features.</p>\n\n<p>Reflecting upon these observations, fine-tuning the models over two\nepochs appears to be a generally reasonable approach. During such short\nfine-tuning, the encoders adjust to the task and sentence description\nformatting while maintaining a high variability of outputs. The outputs\ngenerated by these models are also generally agreeable upon manual\nreview. For instance, while\nfootpath segments in parks or pedestrian zones receive high scores,\nsegments associated with private infrastructure or service areas are\ngenerally rated very poorly.</p>\n\n<h3>Greenery-focused objective</h3>\n\n<p>In the first hypothetical preference set, the scoring pipeline was\nconfigured to evaluate points with a preference towards the presence of\ngreenery and green spaces. 
However, as the notion of greenery already\nconstitutes an important aspect of evaluation under the general\nwalkability criterion (which is also embedded in the encoder\nfine-tuning), this specific configuration aimed merely to emphasize the\ngreenery preference. Therefore, a new set of positive anchors was\ncreated to reflect this objective, mainly consisting of common relevant\nelements, such as trees, public furniture, or parks and gardens.</p>\n\n<p><strong>Table: Percentage difference between the mean greenery-focused and general walkability scores, calculated based on embeddings from the same model</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>587.21%</td><td>-21.62%</td><td>-11.12%</td><td>-7.46%</td><td>-4.30%</td><td>-6.65%</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>1504.94%</td><td>-17.18%</td><td>-10.98%</td><td>-12.76%</td><td>-1.98%</td><td>-0.92%</td></tr>\n</table>\n\n<p><strong>Table: Variance in greenery-focused scores across fine-tuning epochs</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>0.82</td><td>5.29</td><td>6.78</td><td>4.85</td><td>4.38</td><td>3.94</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>0.48</td><td>5.25</td><td>5.36</td><td>4.80</td><td>5.16</td><td>4.96</td></tr>\n</table>\n\n<p>Despite partially overlapping with the general case, the reemphasis on\ngreenery was still reflected in the generated outputs. In fact, greenery-focused scores\nwere typically lower than the general walkability ones, but they converged\nto the general scores as the fine-tuning went on. This, we hypothesize,\nis also a result of the over-clustering phenomenon. 
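A preference objective is swapped in simply by replacing the positive anchor sentences, and the tables compare the resulting means. The anchor sentences and the exact percentage-difference formula below are our illustrative assumptions (the sketch reads the tables as the relative difference of the two means), not the thesis's literal anchor sets.

```python
import numpy as np

# Sketch: only the positive anchor sentences change between objectives,
# while the scoring machinery stays fixed. These sentences are illustrative.
general_positive_anchors = [
    "A pleasant pedestrian street with wide sidewalks.",
    "A calm footpath through a park.",
]
greenery_positive_anchors = [
    "A path lined with mature trees.",
    "A green park with gardens and public benches.",
]

def pct_difference(pref_scores, general_scores):
    """Percentage difference between the mean preference-specific score and
    the mean general walkability score (our assumed reading of the tables)."""
    pref_mean = np.mean(pref_scores)
    general_mean = np.mean(general_scores)
    return float((pref_mean - general_mean) / abs(general_mean) * 100.0)

# Toy example: greenery scores sitting ~11% below the general ones,
# comparable to the lightly fine-tuned entries in the table above.
general = np.array([4.0, 5.0, 3.0])    # mean 4.0
greenery = np.array([3.4, 4.6, 2.68])  # mean ≈ 3.56
# pct_difference(greenery, general) ≈ -11%
```

The near-zero means of the vanilla encoders also explain the extreme vanilla-column percentages: dividing by a tiny general-walkability mean inflates the relative difference.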
Furthermore,\nmirroring the findings in the previous example, the gradual\nsuppression of the features’ variability is reflected in the score variance as well: as the number of\nepochs increases, the variance of the scores decreases.</p>\n\n<p><img alt=\"alt text\" src=\"https://gabrielmahler.org/assets/images/thesis/new%20images/mpnet/mpnet-greenery-difference.jpeg\"></p>\n\n<p>The overall scores, nonetheless, reflected most expectations. As\nillustrated here, points in parks and\nclose to natural elements were generally rated highly, while points in dense\nurban areas received low scores.</p>\n\n<h3>Shopping-focused objective</h3>\n\n<p>In the next experiment, we conceived a hypothetical preference towards\nshopping-related areas (such as shopping malls and places near various\nkinds of stores) and embedded it into the scoring pipeline. Again, this\nwas done simply by rewriting the set of positive anchor sentences. In this example, we\nfurther measured the ability of the scoring mechanism and, more\nimportantly, of the generated embeddings to reflect individual elements\ndirectly stated in the anchors. Although a preference towards shopping\nareas does not necessarily require a high degree of the encoder’s\nability to create semantic associations (as the number of related\nfeatures and terms is much more limited), this objective was situated\nfurther from the general walkability objective than the greenery-focused\ncase. 
While we could expect shopping areas to be, on average, relatively\nwalkable, they are not correlated with walkability in a generalizable\nway.</p>\n\n<p><strong>Table: Percentage difference between the mean shopping-focused and general walkability scores, calculated based on embeddings from the same model</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>-98.56%</td><td>-24.29%</td><td>-20.36%</td><td>-10.07%</td><td>-4.76%</td><td>-6.93%</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>-94.22%</td><td>-19.46%</td><td>-10.84%</td><td>-10.49%</td><td>-0.90%</td><td>-2.12%</td></tr>\n</table>\n\n<p><strong>Table: Variance in shopping-focused scores through fine-tuning epochs</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>0.00</td><td>4.78</td><td>4.63</td><td>4.60</td><td>4.19</td><td>3.85</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>0.00</td><td>4.28</td><td>4.62</td><td>4.61</td><td>5.12</td><td>5.03</td></tr>\n</table>\n\n<p>We hypothesize that this narrowed intersection of the shopping-focused\nobjective and the general walkability is reflected in the performed\nmeasurements, which diverge from the trends observed under the\ngreenery-focused objective. In the scores analysis, an unforeseen spike in the\ndifference margin between the general walkability and store-focused\nscores appears at the fifteenth epoch of fine-tuning across both\nencoders. We hypothesize this is because some of the features that are,\nunder the shopping objective, expected to be close together are pulled\napart by the contrastive training. 
Similar noise, likely rooted in the\nsame conflict of representations, is observed in the scores’ variance\nmeasurements.</p>\n\n<p><img alt=\"alt text\" src=\"https://gabrielmahler.org/assets/images/thesis/new%20images/mpnet/mpnet-strores-difference.jpeg\"></p>\n\n<p><img alt=\"alt text\" src=\"https://gabrielmahler.org/assets/images/thesis/new%20images/mpnet/mpnet-stores.jpeg\"></p>\n\n<p>Despite that, the embeddings generated by lightly fine-tuned encoders\nstill produced relevant point-wise scores with high variance. For\ncompleteness, the visual comparison between the shopping- and\ngeneral walkability-focused scores is included, although, in this case, the\nvisualization of the actual scores indicates the overall accuracy better.</p>\n\n<h3>Historically-focused objective</h3>\n\n<p>In the next experimental case, the scoring pipeline is repositioned to\nreward points associated with historical elements, such as old\nbuildings, monuments, or museums. This case was meant to represent an\nobjective even more distant from general walkability than the\nstore-focused one. In terms of relatedness to the definition of\nwalkability that is used in the contrastive task, the historical\nelements are even more semantically distant than the factors defined by\nthe shopping- or greenery-focused objectives. 
Furthermore, the notion of\nhistoricity was expected to be more challenging to capture in the\ntextual anchors.</p>\n\n<p><strong>Table: Percentage difference between the mean historically-focused and general walkability scores, calculated based on embeddings from the same model</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>587.21%</td><td>-21.62%</td><td>-11.12%</td><td>-7.46%</td><td>-4.30%</td><td>-6.65%</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>1504.94%</td><td>-17.18%</td><td>-10.98%</td><td>-12.76%</td><td>-1.98%</td><td>-0.92%</td></tr>\n</table>\n\n<p><strong>Table: Variance in historically-focused scores through fine-tuning epochs</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>0.78</td><td>8.50</td><td>5.58</td><td>4.73</td><td>4.34</td><td>3.89</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>1.02</td><td>6.24</td><td>5.72</td><td>4.99</td><td>6.99</td><td>5.71</td></tr>\n</table>\n\n<p>Mirroring these challenges, “noise” similar to that in the\nshopping-based case is present in the scores evaluation here. For\ninstance, in the case of the architecture based on all-mpnet-base-v2,\nthe convergence towards the general walkability scores is not as\nconsistent as it was in the greenery-focused case. 
Similarly, the\nvariance of scores generated with a model based on all-MiniLM-L12-v2\nexhibits similar behavior, as shown.</p>\n\n<p><img alt=\"alt text\" src=\"https://gabrielmahler.org/assets/images/thesis/new%20images/mpnet/mpnet-historical-difference.jpeg\"></p>\n\n<p>Nonetheless, even in these challenging settings, the scores generated\nwith lightly fine-tuned encoders seemed to satisfy our objective,\nas highlighted by the visualization\nin <a href=\"https://gabrielmahler.org/walkability/ai/ml/2025/06/06/evaluation2.html#img:difference-historical-mpnet2eps\">1.10</a>.</p>\n\n<h3>Safety-focused objective</h3>\n\n<p>Finally, we utilize our scoring system in a difficult-to-define yet\nhighly practical safety-oriented objective. By relying on the richness\nof data provided by OSM, elements that typically contribute to the\nfeeling of public safety (such as street lighting, security cameras, or\npublic service-related facilities and infrastructure) are used in the\nanchor definitions. 
Nevertheless, due to the\nloose correlation between these particular elements and the general\nwalkability evaluation, generating scores under this objective proved to\nbe the most difficult.</p>\n\n<p><strong>Table: Percentage difference between the mean safety-focused and general walkability scores, calculated based on embeddings from the same model</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>-99.73%</td><td>-27.07%</td><td>-20.21%</td><td>16.50%</td><td>11.43%</td><td>20.77%</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>-99.60%</td><td>-32.54%</td><td>-22.02%</td><td>-12.53%</td><td>57.89%</td><td>56.92%</td></tr>\n</table>\n\n<p><strong>Table: Variance in safety-focused scores through fine-tuning epochs</strong></p>\n\n<table>\n <tr><th>Encoder Model</th><th>Vanilla</th><th>1 ep.</th><th>2 eps.</th><th>5 eps.</th><th>10 eps.</th><th>15 eps.</th></tr>\n <tr><td><em>all-mpnet-base-v2</em></td><td>0.00</td><td>4.46</td><td>4.87</td><td>10.16</td><td>6.93</td><td>9.03</td></tr>\n <tr><td><em>all-MiniLM-L12-v2</em></td><td>0.00</td><td>3.50</td><td>4.29</td><td>4.82</td><td>16.20</td><td>15.53</td></tr>\n</table>\n\n<p>Unlike in any of the previous preference-specific cases, the safety-focused\nobjective caused the mean scores to rise above the mean of\nthe general walkability scores, and the two never converged. Furthermore, the variance\nof the safety-focused scores was somewhat inconsistent, rising\nand falling across epochs.</p>\n\n<p><img alt=\"alt text\" src=\"https://gabrielmahler.org/assets/images/thesis/new%20images/mpnet/mpnet-safety-difference.jpeg\"></p>\n\n<p>The generated safety-focused map reflected these observations. As demonstrated, scores of certain areas\n(such as parks) generally seemed to suffer under these specific\npreferences, whereas other areas did unexpectedly well. We conclude this\nis due to both the high diversity and the sparsity of geospatial records\nthat could be used to reliably measure safety levels across entire urban\nareas. 
Furthermore, we argue this was also caused by the obvious\nsemantic divergence between elements associated with the fine-tuning\nobjective (general walkability) and the scoring objective (safety).</p>\n\n<h3>References</h3>\n\n<ul>\n <li>Reimers, Nils, &amp; Gurevych, Iryna. (2019). <em>Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks</em>. In <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</em>. Association for Computational Linguistics.\n<a href=\"https://arxiv.org/abs/1908.10084\">https://arxiv.org/abs/1908.10084</a></li>\n <li>Song, Kaitao, Tan, Xu, Qin, Tao, Lu, Jianfeng, &amp; Liu, Tie-Yan. (2020). <em>MPNet: Masked and Permuted Pre-Training for Language Understanding</em>. <em>Advances in Neural Information Processing Systems</em>, 33, 16857–16867.</li>\n <li>Wang, Wenhui, Wei, Furu, Dong, Li, Bao, Hangbo, Yang, Nan, &amp; Zhou, Ming. (2020). <em>MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers</em>. <em>Advances in Neural Information Processing Systems</em>, 33, 5776–5788.</li>\n</ul>",
  "content_type": "html",
  "author": {
    "name": "",
    "email": null,
    "uri": null
  },
  "categories": [
    "Walkability",
    "AI/ML"
  ],
  "source": "https://gabrielmahler.org/feed.xml"
}