
Under the hood – How does Cavell go from speech to a note?

  • Writer: Cavell
  • Jan 23
  • 4 min read

From the start, Cavell was designed as a generic AI engine that can be deployed across a wide range of care settings. General practitioners, home nurses and medical specialists all work differently, with distinct workflows and expectations from their medical records. Yet all these applications are built on the same technological core. From a technical perspective, Cavell is a speech-to-text-to-code engine. The components that convert speech into text, and subsequently transform that text into a coded clinical report, are built as generic, reusable building blocks. This architectural choice is deliberate: improvements to the Cavell engine automatically benefit all applications, create economies of scale, and allow new use cases to be supported quickly without having to start from scratch each time.


Anthony, our technical lead and co-founder, recently went under the hood of the CareConnect AI Assistant during a Corilus podcast. Building on that conversation, we provide further insight into the different building blocks that make up the Cavell engine, and show how these blocks are configured differently depending on the care setting. The building blocks remain the same; their configuration is tailored to the reality of each clinical context.


Step 1: capturing speech

Everything starts with capturing spoken information. How that information is provided varies significantly depending on the context. Home nurses typically work with short voice notes of twenty to thirty seconds, usually recorded after a home visit. In these recordings, they dictate all relevant observations and actions. Since there is only one speaker and the dictation is deliberate, a smartphone microphone is perfectly sufficient.

Consultations are different. During visits with general practitioners, specialists or psychologists, crucial information is not only spoken by the care provider but also by the patient, often throughout the entire conversation. To reliably capture all relevant information, audio capture must therefore be broader and more consistent. That's why we recommend using an external microphone. Anthony elaborated on that during the podcast:

“The reason we provide an external microphone is not because there is no microphone in your computer. Built-in computer microphones are designed for video calls. When a patient is sitting opposite or at an angle, the audio is not captured properly. They are simply not designed for that.”

To strike the right balance between audio quality, range and cost, we developed this external microphone specifically for clinical use. The current microphone connects to the computer via USB and delivers audio of sufficient quality to accurately capture multiple speakers, without disrupting the workflow in the practice.


Step 2: transcription

The captured audio then serves as input for the next step: transcription. In this phase, spoken language is converted into text through cloud-based processing, which is required to achieve the necessary speed and scalability. During the podcast, Anthony explained why this step does not happen locally:

“Early on, we explored running this locally on the doctor’s computer, but it quickly became clear that the AI models required for high-quality transcription are so large and computationally heavy that it’s simply not feasible on a standard workstation. The infrastructure alone would cost around €100,000 to serve a single practice. That’s why we do this in the cloud.”

An important factor during consultations is speaker recognition (speaker diarization). In GP or specialist consultations, a caregiver or companion is often present, making it essential to distinguish what is said by the patient, by the accompanying person, or by the clinician. This distinction is crucial for correctly interpreting the consultation. For home nurses, where typically only one person dictates, speaker recognition is far less relevant and the processing pipeline can be kept simpler.
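The contrast between the two settings can be sketched as a configuration choice on a single transcription pipeline. The names below are illustrative, not Cavell's actual API; the point is that the same building block runs with diarization enabled for consultations and disabled for single-speaker voice notes.

```python
from dataclasses import dataclass

@dataclass
class TranscriptionConfig:
    """Configuration for the cloud transcription step (illustrative names)."""
    diarization: bool   # distinguish speakers (clinician, patient, companion)
    max_speakers: int   # upper bound on the number of expected speakers

# A home nurse's voice note has a single, deliberate speaker:
# no diarization needed, so the pipeline stays simpler.
HOME_NURSE = TranscriptionConfig(diarization=False, max_speakers=1)

# A GP or specialist consultation may involve the clinician,
# the patient and an accompanying person.
CONSULTATION = TranscriptionConfig(diarization=True, max_speakers=3)

def transcribe(audio: bytes, config: TranscriptionConfig) -> list[dict]:
    """Placeholder for the cloud transcription call.

    Returns transcript segments; when diarization is on, each segment
    carries a speaker label so later steps can attribute statements.
    """
    # A real implementation would call a cloud speech-to-text service here;
    # this stub only illustrates the shape of the output.
    label = "speaker_0" if config.diarization else None
    return [{"speaker": label, "text": "..."}]
```

The design choice is that diarization is a per-deployment switch rather than a separate engine, which is what lets one codebase serve both home nursing and consultation settings.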


Step 3: from transcription to a coded clinical report

The transcription is not the end point. In the third step, the text is transformed into a coded clinical report, adapted to the care setting and to how the electronic patient record (EPR) expects information to be structured. Such a report typically combines free text with diagnosis codes and structured, coded parameters.


For home nurses, Cavell extracts a limited free-text section alongside approximately forty parameters that are specifically relevant for nursing observations and wound care. For general practitioners, Cavell generates a report in SOAP format, clearly separating the subjective input from the patient, the objective observations and measurements, the coded assessment, and the care plan. Here too, around forty parameters are automatically identified and structured, ranging from blood pressure and weight to more specialised parameters, for example in diabetes consultations. For medical specialists, the format of the report becomes even more critical. Each specialty has its own focus, terminology and reporting structure. Cavell therefore includes templates for more than twenty-five specialties and subspecialties, ranging from endocrinology and cardiology to orthopaedics and psychiatry.
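The SOAP report described above can be pictured as a small data model: free text per section, plus coded fields. This is a minimal sketch with made-up field names and an illustrative ICPC-2 code, not Cavell's internal schema.

```python
from dataclasses import dataclass, field

@dataclass
class CodedParameter:
    """One structured measurement extracted from the consultation."""
    name: str     # e.g. "systolic_blood_pressure"
    value: float
    unit: str     # e.g. "mmHg"

@dataclass
class SoapReport:
    """A coded GP report in SOAP format (field names are illustrative)."""
    subjective: str               # the patient's own account
    objective: str                # observations and measurements
    assessment_codes: list[str]   # diagnosis codes, e.g. ICPC-2
    plan: str                     # the agreed care plan
    parameters: list[CodedParameter] = field(default_factory=list)

report = SoapReport(
    subjective="Patient reports increased fatigue over the past month.",
    objective="Blood pressure measured during the visit.",
    assessment_codes=["T90"],  # illustrative ICPC-2 code
    plan="Adjust medication; follow-up consultation in three months.",
    parameters=[CodedParameter("systolic_blood_pressure", 135.0, "mmHg")],
)
```

Specialty templates would then vary which sections and which of the roughly forty parameters appear, while the underlying structure stays the same.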


To support all these care settings effectively, our AI engineers designed a set of collaborating AI models. Together, these models ensure that coded reports are generated not only quickly, but also with a high level of clinical accuracy and relevance for each specific context. Anthony described this as follows during the podcast, specifically for GP consultations:

“It’s essentially a team of AI models working together. One model generates the narrative report in free text. Another model then extracts coded information from that report, such as diagnosis codes, links to existing care elements, or parameters like blood pressure.”

At the end of this process, the coded report is made available directly in the electronic patient record.
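The two-stage pattern from the quote above, narrative generation first, coded extraction second, can be sketched like this. The function names and the output keys are assumptions for illustration; the real models are large language models running in the cloud, stubbed out here.

```python
def generate_narrative(transcript: str) -> str:
    """Stage 1: a model turns the transcript into a free-text report (stub)."""
    return f"Narrative report based on: {transcript}"

def extract_coded_data(report: str) -> dict:
    """Stage 2: a second model extracts coded information from the narrative.

    A real extraction model would return diagnosis codes, links to existing
    care elements and parameters such as blood pressure; the keys below are
    illustrative.
    """
    return {"diagnosis_codes": [], "parameters": {}, "narrative": report}

def speech_to_coded_report(transcript: str) -> dict:
    """Chain the two stages: the coded report is derived from the narrative."""
    narrative = generate_narrative(transcript)
    return extract_coded_data(narrative)
```

Splitting the work this way means each model does one job well: the first focuses on clinically fluent prose, the second on reliable structured extraction from that prose.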


To conclude, Cavell was built as a single, generic AI engine that adapts to the context in which it is used. Whether it is a short voice note from a home nurse, a GP consultation or a specialist report, Cavell always follows the same fundamental steps: capturing speech, transcription, and conversion into a coded clinical report. What differs is the configuration of those steps, aligned with the workflow, content and requirements of each care setting. By working with reusable building blocks, Cavell combines quality, speed and scalability without sacrificing specificity. This makes Cavell broadly deployable in healthcare today, and ready to evolve alongside new use cases and care models.


If you would like to learn more about what's under the hood at Cavell, feel free to listen to the full podcast (only available in Dutch) via the following link.



