UX Research · Usability Testing · AI Platform
End-to-end usability testing and research that drove strategic design improvements for Microsoft's AI developer platform by uncovering critical blockers in the agent-creation flow.
Introduction
Microsoft Foundry lets students, developers, and startup founders build AI agents for their products. The vision: democratize AI agent creation for people still learning. But that vision only matters if users can get through the flow.
As a User Experience Researcher on Microsoft's CoreAI team, I led 8 moderated usability sessions testing the Knowledge, Data, and FoundryIQ features. I defined research questions, designed the study protocol, and owned team communication. In the first session, we discovered a critical blocker the team hadn't anticipated. Every subsequent session confirmed it: 100% of participants hit it.
This study went from "let's see how intuitive the flow is" to "we need to fix this before anything else matters."
My role: Led UX research on a 4-person team. I defined research questions, designed the study protocol, ran 8 think-aloud usability sessions, owned team communication, and synthesized findings for the CoreAI team.
Note: Due to the confidential nature of this product and user privacy, visuals in this case study are limited.
Research Process
Research Questions
I shaped research questions to go beyond surface-level usability, investigating where confusion transforms from "learning curve" into "I'm done." The distinction between temporary confusion and hard blockers turned out to be the most critical framing of the study.
Method
I chose moderated usability testing because the research centered on mental models, and I needed to be there when confusion happened. Remote unmoderated testing would capture task failure, but not the reason behind it. The think-aloud protocol revealed participants' reasoning in real time, and their silences were often the most revealing part.
I recruited participants matching Foundry's target audience: students and early-career professionals with technical backgrounds, curious about AI and looking to leverage it for projects.
Findings
Not everything was broken, and that matters. Identifying what works is just as important as flagging problems. It tells the team which patterns to protect as they iterate.
Areas of Improvement
I prioritized findings using Nielsen's severity rating scale. A confusing label is a different problem from a blocker that prevents every user from completing the core task. Clear severity framing gave the Foundry team an actionable roadmap, not just a list of complaints.
Study at a Glance
Eight sessions. Five distinct findings. One critical blocker affected every single participant, a finding so consistent it immediately became the team's top priority.
Next Steps
Good research doesn't just answer questions. It reveals the next ones. My study surfaced both immediate fixes and areas needing deeper exploration.
"You dived into a complex product, asked all the right questions, and it's very clear that you put a lot of thought into planning and executing the study, and then translated that into a clear, engaging readout."
Research Mentor · Microsoft CoreAI
Reflection
The biggest lesson was about adaptability. The guardrail blocker emerged in Session 1 and threatened to derail the study. Every participant hit the same wall, forcing a real-time decision: help them past it (losing data about the blocker's impact) or let them struggle (losing downstream data)? I chose a hybrid approach. Participants attempted the task fully while I documented confusion, then I provided a workaround so we could test the rest of the flow. That decision preserved both the critical finding and downstream insights.
Biweekly check-ins with the Microsoft sponsor became more critical than I'd expected: not just for logistics, but for real-time alignment on what mattered most. When I flagged the guardrail issue after Session 2, the sponsor confirmed it was a known-but-underestimated bug. My data gave the engineering team the evidence they needed to prioritize the fix. That moment of watching research directly influence a product decision was the highlight of this experience.