H3-MOSAIC: multimodal generative AI for semantic place detection from high-frequency GPS on H3 grids in mental health geomatics

Int J Health Geogr. 2025 Nov 22;24(1):35. doi: 10.1186/s12942-025-00423-9.

ABSTRACT

BACKGROUND: Mental-health geomatics requires reliable ways to convert high-frequency GPS trajectories into meaningful place types that support indicators such as homestay, location entropy, and the spatial extent of daily activities. Raw coordinates are typically noisy and carry little semantic information. We introduce H3-MOSAIC (H3-based Multimodal OSM-and-Satellite AI for Classification), a multimodal generative framework that fuses OpenStreetMap (OSM) building text and satellite imagery on H3 grids to infer place semantics from high-frequency GPS.
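As an illustration of the indicators named above, the following minimal sketch computes homestay and location entropy from per-place dwell minutes. The abstract does not give formulas, so the definitions here are assumptions: homestay is taken as the share of tracked time spent at home, and location entropy as the Shannon entropy (natural log) of the time shares across places. The function names and the `"Home"` label are hypothetical.

```python
import math


def homestay(minutes_by_place, home="Home"):
    """Share of tracked minutes spent at the home place (assumed definition)."""
    total = sum(minutes_by_place.values())
    return minutes_by_place.get(home, 0) / total if total else 0.0


def location_entropy(minutes_by_place):
    """Shannon entropy (nats) of time shares across places (assumed definition)."""
    total = sum(minutes_by_place.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for minutes in minutes_by_place.values():
        if minutes > 0:  # places with zero time contribute nothing
            share = minutes / total
            entropy -= share * math.log(share)
    return entropy


# Hypothetical day: equal time at two places.
day = {"Home": 720, "Office": 720}
print(homestay(day), location_entropy(day))
```

Both indicators depend on which place label each dwell period receives, which is why misclassified place types propagate directly into them.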

METHODS: Raw GPS was smoothed by minute-level speed filtering, then assigned to Level 10 H3 hexagons. Cells were retained if the mean speed was ≤ 1.2 m/s and the cumulative duration was ≥ 15 min, and contiguous cells were merged. Home was defined as the cell with the longest dwell from 23:45 to 06:00. We compared text-only OSM classification with image-based and fused approaches across open-source models (DeepSeek, CLIP, LLaVA, Qwen-VL) and proprietary models (GPT-4o-mini, Gemini-2.5-flash-lite). Performance was assessed by accuracy, Cohen’s kappa, precision, recall, F-measure, and confusion matrices. Day-level associations between H3 semantic exposures and stress were examined with a random forest model and explainable methods.
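The cell-retention and home-detection rules can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes one sample per minute (so a sample count equals minutes), assumes each sample already carries an H3 cell ID assigned by an H3 library, treats the night window as 23:45 inclusive to 06:00 exclusive, and omits the merging of contiguous cells. The function names and the `(timestamp, cell_id, speed_mps)` tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta

MAX_MEAN_SPEED = 1.2  # m/s, threshold stated in the Methods
MIN_DWELL_MIN = 15    # minutes, threshold stated in the Methods


def stay_cells(samples):
    """Return {cell_id: dwell_minutes} for cells passing both thresholds.

    samples: list of (timestamp, cell_id, speed_mps), one per minute (assumed).
    """
    speeds = defaultdict(list)
    for _, cell, speed in samples:
        speeds[cell].append(speed)
    return {
        cell: len(values)
        for cell, values in speeds.items()
        if sum(values) / len(values) <= MAX_MEAN_SPEED
        and len(values) >= MIN_DWELL_MIN
    }


def is_night(ts):
    """True inside the 23:45-06:00 window, which wraps past midnight."""
    t = ts.time()
    return t >= time(23, 45) or t < time(6, 0)


def home_cell(samples, retained):
    """Retained cell with the most night-time minutes, or None if there is none."""
    night = defaultdict(int)
    for ts, cell, _ in samples:
        if cell in retained and is_night(ts):
            night[cell] += 1
    return max(night, key=night.get) if night else None


# Hypothetical trace: a long overnight stay in cell "A", a short fast pass through "B".
start = datetime(2025, 1, 1, 23, 0)
trace = [(start + timedelta(minutes=i), "A", 0.2) for i in range(120)]
trace += [(datetime(2025, 1, 2, 13, 0) + timedelta(minutes=i), "B", 3.0) for i in range(30)]
kept = stay_cells(trace)
print(kept, home_cell(trace, kept))
```

With irregular sampling, dwell time would need to be accumulated from timestamp differences instead of sample counts.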

RESULTS: Multimodal methods outperformed single-modality baselines. In the 11-class task, accuracies were: CLIP 0.179, LLaVA 0.269, Qwen-VL 0.565, GPT-4o-mini 0.779, and Gemini-2.5-flash-lite 0.790. In the 5-class consolidation, accuracies rose to 0.702 (Qwen-VL), 0.849 (GPT-4o-mini), and 0.858 (Gemini-2.5-flash-lite). Text-only OSM baselines were lower (≈ 0.60–0.68). Across 3,845 hexagons with OSM text, the closed-source models agreed on 79% of labels; disagreements were concentrated in the mixed-use, office, and green classes. Error modes reflected area-dominant versus keyword-triggered reasoning, hybrid-parcel ambiguity, tag sparsity, and symbolic artifacts. Stabilized semantics support more robust computation of homestay, entropy, and activity space and are suitable for privacy-aware, cross-city reuse. In a day-level case study, more minutes at Home were associated with lower stress; Green showed a U-shaped pattern.
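For readers who want to reproduce this kind of comparison, here is a minimal sketch of accuracy and Cohen's kappa for two label sequences, using the standard definitions. The labels in the usage example are hypothetical and are not taken from the study's data. Kappa corrects raw agreement for the agreement expected by chance given each rater's label frequencies.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences match."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two models on four hexagons.
model_1 = ["home", "home", "green", "office"]
model_2 = ["home", "green", "green", "office"]
print(accuracy(model_1, model_2), cohen_kappa(model_1, model_2))
```

Raw agreement, such as the 79% figure above, corresponds to `p_o`; kappa will generally be lower because some of that agreement is expected by chance.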

CONCLUSIONS: H3-MOSAIC provides a scalable, auditable pipeline for semantic place detection from high-frequency GPS. Multimodal fusion markedly improves accuracy and consistency. Proprietary models are most robust on hard classes, and open-source models are practical for coarse taxonomies. Day-level H3 exposures show stress patterns consistent with established mental health pathways, supporting face validity. The framework enables downstream exposure analyses with reduced misclassification and improved interpretability.

PMID:41275284 | DOI:10.1186/s12942-025-00423-9

By Nevin Manimala