Field Notes / №07 — Y.F.M. tracking: CAI–GLR–BRS–ARN–BCN (you are here) --:--:-- BCN   ·   41.38 °N
Portfolio · 2026 · Vol. i

Youssef,
being observed.
& observing back.

Youssef Fahmy Mohamed · Researcher in Social Artificial Intelligence

Dr. Youssef Fahmy Mohamed spent his PhD teaching robots to respond better to the person in front of them — using thermal imaging and social-context fusion in consented lab studies of effort and frustration signals. Along the way he built ROS4HRI, an open-source framework now running inside every PAL Robotics commercial robot; founded Pokamind, which turned that research into an opt-in communication-skills training product (€43K in Vinnova/Almi grants and €150K from enterprise clients); and keeps a quiet habit of publishing papers people actually cite — six of them, 92 citations and counting.

  • §1 PhD, Social AI · KTH Royal Institute of Technology, Stockholm (WASP-funded) · 2021 – 2025.
  • §2 MSc Robotics, University of Bristol · BSc Mechanical Engineering, University of Surrey (Fiat-sponsored thesis).
  • §3 Founder & CEO: Pokamind AB · Founder: Vizioneer, HighlightsHub · Creator: ROS4HRI · Consultant: many clients, most under NDA — the bulk of the week-to-week work.
  • §4 Publications in HRI, RO-MAN, IROS, IJSR · 92 citations · h-index 5.
  • §5 Native Arabic. Bilingual English. Swedish at B1. Fluent in PyTorch.
  • §6 Currently: Barcelona, via Stockholm, via Bristol, via Guildford, via Cairo.
youssefezzat12@gmail.com
+46 790 489 221
linkedin / youssefmohamed-phd
to the abstract
Section 01 · abstract

A brief, honest account
of what Youssef actually does.

A PhD in how robots can respond to the person in front of them. A Master's in how machines can sense the dynamics of a group. Three companies in between. The thread is the same thread it always was: how do we build AI that actually helps the humans it interacts with?

Youssef's doctoral thesis — “Adaptive systems for real-time affective state detection in human-robot interaction using thermal imaging, multimodal fusion, and contextual understanding” — was supported by the Wallenberg AI, Autonomous Systems and Software Program at KTH's Division of Robotics, Perception and Learning. Along the way he published the first automatic frustration detection paper using thermal imaging at ACM/IEEE HRI, then the first context-aware affective fusion paper at RO-MAN, and built the tooling underneath the rest of the field.

That tooling was ROS4HRI — an open-source framework standardising how social robots handle faces, bodies, voices, and the slippery problem of knowing which person is which in a crowded room. PAL Robotics adopted it as the standard HRI toolkit for their entire commercial fleet — ARI, TIAGo, and the receptionist prototype Youssef helped architect as a Visiting Researcher in Barcelona.

Today he channels the same obsession into companies. Pokamind AB is the PhD, productised — €43K in Vinnova/Almi grants, €150K from enterprise clients, and a cross-functional team of 8+ shipping a GCP-backed SaaS that gives each learner structured, opt-in practice on their own communication style — then returns a private, explainable report the learner owns and controls. Vizioneer turns existing cameras in dark stores and cloud kitchens into an operational command centre. HighlightsHub gives every Sunday-league football match its own highlight reel.

Before any of that, a Master's at Bristol (building a system that could tell, from across a room, whether two people were actually enjoying each other's company) and a Fiat-sponsored BSc at Surrey (teaching a light commercial vehicle to steer itself, politely, around pedestrians). The research hasn't really changed since. Only the table size.

92
citations
6
papers
€193K
grants + pilots
5
cities lived
lines of python
Section 02 · specimens

Five public builds.
Plenty more behind NDAs.

Each public entry is a working system — not a slide, not a mockup. Shipped, deployed, used by humans and robots. Where possible, click through and try them. The consultancy work, which is most of what he actually ships week-to-week, lives behind signed NDAs — summarised in the final card.

Specimen 01
computer vision · GCP

Vizioneer

AI-powered camera analytics for dark stores & cloud kitchens.

visit site

Vizioneer turns existing camera infrastructure into a real-time operational command centre for rapid commerce. Vision Language Models and classical machine vision work in concert to track orders, optimise picking routes, and detect bottlenecks — from cameras already mounted on the ceiling. No new hardware. Setup measured in minutes. Cameras are used for order and route tracking; the system does not identify, profile, or score staff.
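For the technically curious: a minimal sketch of that ingest loop, with an off-the-shelf Ultralytics YOLO detector standing in for the private model stack. The stream URL, class filters, and event wiring are all illustrative assumptions, not the shipped pipeline.

```python
# Sketch of a Vizioneer-style ingest loop: subscribe to an existing
# RTSP camera, detect and track items, emit events downstream.
# YOLO here is a stand-in for the actual (private) models.
import cv2
from ultralytics import YOLO

STREAM_URL = "rtsp://camera.local/stream1"  # hypothetical camera endpoint

model = YOLO("yolov8n.pt")          # generic pretrained weights, illustrative
cap = cv2.VideoCapture(STREAM_URL)  # OpenCV reads RTSP via its FFmpeg backend

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # persist=True keeps tracker state between frames, giving stable track IDs
    results = model.track(frame, persist=True, classes=[39, 41])  # bottles, cups
    for box in results[0].boxes:
        if box.id is None:
            continue
        # In the real product, track IDs crossed with floor-plan zones become
        # "order picked" / "bottleneck at aisle 3" events over a WebSocket.
        print(int(box.id), box.xyxy[0].tolist())

cap.release()
```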

technology stack

Vision Language Models · Machine Vision · RTSP streaming · WebSockets · Real-time processing · Google Cloud Platform · Object detection · Tracking

key features

Smart picking routes
Real-time inventory
Order verification
Bottleneck detection
40%
faster picks
99.5%
accuracy
2min
setup
images: analytics dashboard · dark store optimisation · platform architecture
Specimen 02
LLMs · communication coaching · GCP

Pokamind

Adaptive, opt-in practice for leadership & communication skills.

visit site

Pokamind is a skills-training platform where learners opt in to role-play sessions with an AI counterpart. During each session the system analyses the learner's own communication style — word choices, pacing, clarity, and turn-taking — then returns a structured, explainable summary at the end. Analysis runs per-session; no continuous monitoring, no profiles of third parties, no biometric categorisation. Reports belong to the learner and can be deleted on request. Built GDPR- and EU AI-Act-aware by design, with SHAP/LIME explainability on top of enterprise-grade GCP + Kubernetes (99.9% uptime).
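A toy version of the per-session idea, for flavour. The production pipeline is private, so the structure, feature names, and metrics below are assumptions rather than shipped code:

```python
# Toy per-session analysis: given the timed turns of one consented
# role-play, derive simple pacing and turn-taking features that an
# explainable report could surface. Everything here is illustrative.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "learner" or "ai"
    start: float   # seconds from session start
    end: float
    text: str

def session_report(turns: list[Turn]) -> dict:
    learner = [t for t in turns if t.speaker == "learner"]
    words = sum(len(t.text.split()) for t in learner)
    talk_time = sum(t.end - t.start for t in learner) or 1e-9
    total = (turns[-1].end - turns[0].start) if turns else 1e-9
    return {
        "words_per_minute": 60 * words / talk_time,        # pacing
        "talk_share": talk_time / total,                   # turn-taking balance
        "avg_turn_seconds": talk_time / max(len(learner), 1),
    }
```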

technology stack

Large Language Models · Explainable AI (SHAP / LIME) · RAG architecture · LoRA fine-tuning · Self-review of delivery cues (opt-in) · Voice pacing / clarity · NLP & keyword extraction · Kubernetes · OAuth 2.0 · GDPR-aware design

AI components

Conversational engine
Explainable feedback
Adaptive personalisation
Communication analysis
4x
ROI
68%
faster dev
92%
satisfaction
99.9%
uptime
images: AI-native learning platform · enterprise security & scale
Specimen 03
computer vision · sports analytics

HighlightsHub

A TV crew for every Sunday league.

visit site

HighlightsHub watches amateur football matches and decides, on its own, when something interesting happened — then stitches together a highlight reel. The system handles goal detection, player tracking, and action recognition through fine-tuned computer vision. Built because the best goal of the weekend shouldn't vanish with the final whistle.
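The stitching stage, roughly: the detection models are the hard (and private) part, so assume a list of event timestamps already exists. moviepy (1.x API) is an illustrative choice here, not necessarily what runs in production:

```python
# Stitch a highlight reel from detected event timestamps: cut a short
# window around each event and concatenate. Assumes moviepy 1.x.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def build_reel(match_path: str, events: list[float], out: str,
               pre: float = 8.0, post: float = 4.0) -> None:
    match = VideoFileClip(match_path)
    clips = [
        match.subclip(max(t - pre, 0.0), min(t + post, match.duration))
        for t in sorted(events)               # e.g. goal times, in seconds
    ]
    concatenate_videoclips(clips).write_videofile(out, codec="libx264")

# build_reel("sunday_league.mp4", events=[612.0, 1498.5], out="highlights.mp4")
```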

technology stack

Fine-tuned vision models · Object tracking · Action recognition · Event detection · Video analytics

core features

Automatic goal detection
Player identification
Highlight generation
Match statistics
auto
detection
live
processing
cloud
architecture
image: platform interface
Specimen 04
NLP · social robotics · HRI

Furhat · Arabic NLP

A social robot that speaks Saudi Arabic, deployed in NEOM.

Furhat Robotics

A full Arabic NLP stack for Furhat Robotics' social humanoid — tuned for dialectal Arabic and Middle-Eastern cultural context, then deployed at the NEOM Exhibit Center in Saudi Arabia. The robot held conversations with visitors, conducted surveys, and navigated the cultural beats of a Saudi greeting without flinching.

technology stack

Kotlin · Java · NLP · Dialectal Arabic · Speech recognition · Speech synthesis · ROS · Cultural adaptation · Social robotics

achievements

Arabic language processing
Cultural context adaptation
Natural conversation
NEOM deployment
NEOM
exhibit centre
ar-SA
NLP engine
HRI
social robotics

NEOM is Saudi Arabia's $500 billion futuristic smart-city project under Vision 2030. Furhat Robotics is a Swedish startup building social robots with human-like facial expressions — used in research, healthcare, education, and recruitment worldwide.

image: Furhat, human-like expressions
Specimen 05
open-source · ROS · HRI infrastructure

ROS4HRI

The framework now running inside every PAL Robotics commercial robot.

view on github

Social robots need to answer a deceptively hard question — who is in the room, and are they the same person they were a second ago? — while merging faces, bodies, voices, and identities from different ROS nodes that never agreed on a common schema. ROS4HRI is the answer: a standard, ROS-native toolkit for social signal processing, co-led with Séverin Lemaignan and presented as the tutorial workshop at IROS 2022 in Kyoto. Adopted by PAL Robotics as the default HRI stack across ARI, TIAGo, and the entire commercial fleet.
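In practice, consuming the framework looks something like this. Topic and message names follow the published REP-155 spec as best recalled here; treat the snippet as a sketch rather than canonical usage (the pyhri helper library wraps all of this more comfortably):

```python
# Listen for the stable person identifiers ROS4HRI reconciles from
# face, body and voice tracks (ROS 1 flavour; the ROS 2 port is similar).
import rospy
from hri_msgs.msg import IdsList  # assumption: IdsList exposes an `ids` field

def on_persons(msg: IdsList) -> None:
    # Each id stays stable even if a face track drops and re-acquires:
    # the "are they the same person they were a second ago?" problem.
    rospy.loginfo("persons in the room: %s", ", ".join(msg.ids))

rospy.init_node("who_is_here")
rospy.Subscriber("/humans/persons/tracked", IdsList, on_persons)
rospy.spin()
```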

stack

ROS / ROS2 · Python · C++ · OpenFace · OpenPose · Multi-party tracking · Social signal processing · Sensor fusion · Identity reconciliation

what it solves

Multi-party tracking
Face / body / voice fusion
Stable person identifiers
ROS-native integration
PAL
adopted by
IROS
'22 Kyoto workshop
OSS
MIT licensed
ARI
+ TIAGo fleet

The paper — “ROS for Human-Robot Interaction” (IROS 2021, Mohamed & Lemaignan) — is the reason there's a common language for social signals in ROS today. It is, quietly, the piece of work Youssef is most proud of: a thing researchers cite without knowing whose thesis it came out of.

Specimen 06
consultancy · advisory · confidential

         ,       ,             & many more

The bulk of the week-to-week work — all under NDA.

under NDA

The five entries above are the public-facing tip of the iceberg. The larger share of what Youssef ships week-to-week is consultancy work across MENA, the EU, and the Gulf — architecting systems for AI startups, guiding enterprises on responsible LLM and vision-model rollouts, and occasionally building the first prototype himself. Most of these engagements are under signed NDA, which is why the interesting parts of this paragraph look like                  and            , and why the client list isn't enumerable here. If one of these descriptions describes you, feel free to reach out — the rest of the story is available over coffee.

categories he can name

LLM architecture review · RAG rollouts · CV pipeline design · Responsible-AI strategy · MLOps on GCP · Go-to-market for AI products · AI literacy (Career180, Egypt) · E3.Ventures portfolio (KSA) · Saudi university workshops
MENA
+ EU + Gulf
NDA
signed × n
can't show
ask over coffee
Section 03 · experiment

the mirror test.
try the tech yourself.

A working demo of the open-source face-landmarking tech underneath ROS4HRI — MediaPipe, compiled to WebAssembly and running 100% on-device in your browser. 478 facial landmarks, 52 blendshapes, 30 fps. Smile, raise your eyebrows, or open your mouth wide and watch the page react. Nothing is uploaded or stored. When you close the panel, the camera stops.

MediaPipe · WebAssembly · on-device

run the face model.

Click to grant webcam access. An open-source face-landmarking model loads into your browser and runs locally. Your expressions drive three small visual effects — smile for a burst of hearts, open your mouth wide for a goal celebration, raise your eyebrows to summon confetti. That's it. No analysis is saved, sent, or inferred beyond those three triggers.

100% on-device. MediaPipe runs inside your browser via WebAssembly. Camera frames are processed locally and discarded immediately — nothing is uploaded, logged, or stored. Hit the close button (or leave the page) to release the camera.
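The demo itself runs the JS/WASM build, but the same model is one pip install away. A sketch of the equivalent Python Tasks API, where the trigger thresholds are arbitrary illustrations rather than the page's actual values:

```python
# Equivalent of the browser demo via MediaPipe's Python Tasks API:
# load the face landmarker, read blendshape scores, fire the same
# three triggers. Thresholds are illustrative, not the page's values.
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,   # the 52 blendshape scores the page reacts to
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

result = landmarker.detect(mp.Image.create_from_file("frame.png"))
scores = {b.category_name: b.score for b in result.face_blendshapes[0]}

if scores["mouthSmileLeft"] > 0.5 and scores["mouthSmileRight"] > 0.5:
    print("hearts")          # smile
if scores["jawOpen"] > 0.6:
    print("goal!")           # mouth wide open
if scores["browInnerUp"] > 0.5:
    print("confetti")        # eyebrows raised
```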

Section 04 · side quests

What Youssef builds on a Saturday.

When the companies are running on their own for an afternoon, the terminal gets noisy. Local LLMs, VLMs, video generation, facial-expression trackers, communication-style classifiers — things nobody asked for, built because they looked fun. The one below is a Pokamind ad generated end-to-end at his desk: local text-to-image for keyframes, local image-to-video, local voiceover, local music, one ffmpeg command to stitch.

LOCAL · HOMEBREW Pokamind · ad spec · gen 1
weekend build

A communication-style classifier on top of Whisper

Given a consented recording of a practice role-play, it times each turn and suggests one of four communication styles — assertive / analytical / amiable / expressive — as a self-review prompt. Trained on a tiny curated set plus synthetic role-play dialogues. A minimal version of the timing step is sketched below.

whisper · lora · local
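The timing step, minimally: Whisper's segment timestamps mapped to pacing features. The four-way style classifier is the fine-tuned part and isn't reproduced here; filename and model size are illustrative.

```python
# Feature extraction beneath the classifier: Whisper gives per-segment
# timestamps; turn those into pacing features per stretch of speech.
import whisper

model = whisper.load_model("base")
out = model.transcribe("roleplay.wav")   # consented practice recording

for seg in out["segments"]:
    dur = max(seg["end"] - seg["start"], 1e-9)
    wpm = 60 * len(seg["text"].split()) / dur
    print(f"{seg['start']:7.2f}s  {wpm:5.1f} wpm  {seg['text'].strip()}")
```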
tinkering · personal

Real-time facial-expression landmarks on a laptop cam

MediaPipe landmarks piped through a tiny transformer, 32 ms/frame on a MacBook. A personal weekend experiment on his own face — not part of any product, not pointed at anyone else; it writes to a ROS topic because old habits.

mediapipe · onnx · ros2
proof of concept

Short-form video generation, entirely local

FLUX for keyframes, Wan 2.2 for motion, Chatterbox for voiceover, Stable Audio for score, ffmpeg for glue (the stitch step is sketched below). No API calls. No cloud bills. The ad on the left was made this way.

comfyui · flux · wan-2.2 · ffmpeg
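The glue step, wrapped in Python for consistency with the other sketches here. Filenames and the audio mix are assumptions, not the exact command:

```python
# "One ffmpeg command to stitch": concat the generated clips (listed
# in shots.txt as lines of `file 'clip_01.mp4'`), then mix voiceover
# and score under the video. Filenames and mix are illustrative.
import subprocess

subprocess.run([
    "ffmpeg",
    "-f", "concat", "-safe", "0", "-i", "shots.txt",  # Wan 2.2 clips, in order
    "-i", "voiceover.wav",                            # Chatterbox output
    "-i", "score.wav",                                # Stable Audio output
    "-filter_complex", "[1:a][2:a]amix=inputs=2:duration=first[a]",
    "-map", "0:v", "-map", "[a]",
    "-c:v", "libx264", "-shortest", "ad_gen1.mp4",
], check=True)
```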
toy

A VLM that describes the kitchen fridge at 3pm

Local Qwen2-VL pointed at a webcam, prompted hourly (roughly the job sketched below). Outputs a haiku about the state of lunch and whether it's time to grocery shop. Technically useful.

qwen2-vl · ollama · cron
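Roughly the hourly job (cron calls a script like this). The model tag and prompt are illustrative assumptions, not the exact setup:

```python
# Grab a webcam frame, send it to a local vision model over Ollama's
# HTTP API, print the haiku. Model tag and prompt are assumptions.
import base64
import cv2
import requests

ok, frame = cv2.VideoCapture(0).read()
assert ok, "no webcam frame"
_, jpg = cv2.imencode(".jpg", frame)

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen2-vl",   # whichever vision-capable tag is pulled locally
    "prompt": "Describe this kitchen scene as a haiku, then say whether "
              "it's time to grocery shop.",
    "images": [base64.b64encode(jpg.tobytes()).decode()],
    "stream": False,
})
print(resp.json()["response"])
```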
research-adjacent · personal

Multimodal signal fusion, v2 (lab code, cleaned up)

PhD code, refactored for fun. Thermal + RGB + audio prosody fused via a small cross-attention head (the shape of it is sketched below). Runs offline on his own device in his own home office — a personal toy for replaying consented lab recordings, not a product and not pointed at anyone else.

pytorch · thermal · edge
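The shape of the idea, in a dozen lines of PyTorch. Dimensions, head count, and thermal-as-query are assumptions for illustration, not the thesis architecture:

```python
# A small cross-attention fusion head: thermal, RGB and prosody
# encoders each yield one embedding; thermal queries the other two.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, dim: int = 256, n_classes: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, thermal, rgb, audio):
        # each input: (batch, dim) embedding from a frozen encoder
        ctx = torch.stack([rgb, audio], dim=1)   # (B, 2, dim) keys/values
        q = thermal.unsqueeze(1)                 # (B, 1, dim) query
        fused, _ = self.attn(q, ctx, ctx)        # cross-attention
        return self.cls(fused.squeeze(1))        # affect-state logits

logits = FusionHead()(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```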
Section 05 · further reading

The peer-reviewed paper trail.

Six papers in Human-Robot Interaction, Affective Computing, and Social AI. 92 citations and counting — published at HRI, RO-MAN, IROS, and the International Journal of Social Robotics.

2025
Fusion in Context: A Multimodal Approach to Affective State Recognition · IEEE RO-MAN 2025 · Mohamed, Lemaignan, Güneysu, Jensfelt, Smith
peer-reviewed
2025
Are You an Expert? Instruction Adaptation Using Multi-Modal Affect Detections with Thermal Imaging and Context · IEEE RO-MAN 2025 · Mohamed et al.
peer-reviewed
2025
Context Matters: Understanding Socially Appropriate Affective Responses via Sentence Embeddings · ICSR + AI 2024 · Lecture Notes in Computer Science vol. 15561, Springer
peer-reviewed
2024
Multi-modal Affect Detection Using Thermal and Optical Imaging in a Gamified Robotic Exercise · International Journal of Social Robotics, 16(5), 981–997
57 cites
2022
Automatic Frustration Detection Using Thermal Imaging · ACM/IEEE International Conference on Human-Robot Interaction (HRI), 451–459
22 cites
2021
ROS for Human-Robot Interaction · IEEE/RSJ IROS 2021, 676–683 · the paper that became ROS4HRI
13 cites
full publications list · google scholar · the longer version