Google Gemini Outperforms Humans In Health Coaching

Komal Patil June 28, 2023

Google Gemini is only six months old, yet it has already demonstrated outstanding capabilities in security, coding, and debugging.

The large language model (LLM) now outperforms humans in sleep and fitness advice.

Google researchers have developed the Personal Health Large Language Model (PH-LLM), a version of Gemini that can understand and reason about time-series personal health data from wearable devices such as smartwatches and heart rate monitors. In their studies, the model answered questions and predicted outcomes as well as or better than experts with years of experience in the health and fitness disciplines.

“Our work…employs generative AI to expand model utility from only predicting health states to also providing coherent, contextual and potentially prescriptive outputs that depend on complex health behaviors,” the authors write.

Gemini Is A Sleep and Fitness Specialist

Wearable technology can help people monitor and, ideally, improve their health. These gadgets provide a “rich and longitudinal source of data” for personal health monitoring that is “passively and continuously acquired” from inputs such as exercise and diet diaries, mood journals, and, in certain cases, social media activity, according to Google researchers.

However, the data they collect on sleep, physical activity, cardiometabolic health, and stress is rarely used in clinical settings that are “sporadic in nature.” According to the researchers, this is most likely because the data is gathered without context and requires a significant amount of computation to store and analyze. It can also be difficult to interpret.

Moreover, while LLMs have performed well in areas such as medical question answering, electronic health record analysis, medical image diagnosis, and mental health evaluations, they frequently lack the ability to reason about and offer suggestions based on wearable data.

However, Google researchers made a significant breakthrough by training PH-LLM to offer recommendations, answer professional examination questions, and predict self-reported sleep disruption and sleep impairment outcomes. The model was given multiple-choice questions, and researchers used chain-of-thought prompting (which mimics step-by-step human reasoning) and zero-shot methods (in which the model tackles a task without first being shown examples of it).

PH-LLM scored 79% on the sleep exam and 88% on the fitness exam, outperforming the average scores of a sample of human experts: five professional athletic trainers (with an average of 13.8 years of experience) and five sleep medicine experts (with an average of 25 years of experience). The human experts scored an average of 76% on the sleep exam and 71% on the fitness exam.
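To make those margins concrete, the short sketch below computes the gap between the model and the expert averages in percentage points. It uses only the four percentages reported above; the dictionary and function names are illustrative and do not come from the study.

```python
# Scores reported in the article, in percent correct.
MODEL_SCORES = {"sleep": 79, "fitness": 88}
EXPERT_SCORES = {"sleep": 76, "fitness": 71}


def score_gaps(model, experts):
    """Return model-minus-expert gaps in percentage points, per exam."""
    return {exam: model[exam] - experts[exam] for exam in model}


gaps = score_gaps(MODEL_SCORES, EXPERT_SCORES)
for exam, gap in gaps.items():
    print(f"{exam}: PH-LLM ahead by {gap} percentage points")
```

On these figures the lead is 3 points in sleep and 17 points in fitness, so the margin over the experts is much narrower in sleep medicine than in fitness.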

In one coaching recommendation example, researchers told the model, “You are a sleep medicine expert. You are given the following sleep data. The user is male, 50 years old. List the most important insights.”

PH-LLM replied: “They are having trouble falling asleep…adequate deep sleep [is] important for physical recovery.” The model further advised: “Make sure your bedroom is cool and dark…avoid naps and keep a consistent sleep schedule.”

Meanwhile, when asked what type of muscle contraction occurs in the pectoralis major “during the slow, controlled, downward phase of a bench press” and given four options for a response, PH-LLM correctly chose “eccentric.”
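A multiple-choice question like this one can be posed in a zero-shot, chain-of-thought style with very little code. The sketch below is a minimal illustration, not the researchers' actual setup: the function names, the prompt wording, and the three distractor options are assumptions (the article reports only that “eccentric” was the correct choice), and the model reply is a hand-written stand-in rather than real PH-LLM output.

```python
import re


def build_zero_shot_cot_prompt(question, options):
    """Format a multiple-choice question with no worked examples (zero-shot)
    and an instruction to reason step by step (chain-of-thought)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Question: {question}"]
    for letter, option in zip(letters, options):
        lines.append(f"({letter}) {option}")
    lines.append("Think step by step, then finish with 'Answer: <letter>'.")
    return "\n".join(lines)


def extract_answer(response, options):
    """Find the last 'Answer: X' letter in a reply and map it to its option text."""
    matches = re.findall(r"Answer:\s*\(?([A-Z])\b", response)
    if not matches:
        return None
    index = ord(matches[-1]) - ord("A")
    return options[index] if index < len(options) else None


question = (
    "What type of muscle contraction occurs in the pectoralis major "
    "during the slow, controlled, downward phase of a bench press?"
)
# Only "eccentric" is reported in the article; the other options are hypothetical.
options = ["concentric", "eccentric", "isometric", "isokinetic"]

prompt = build_zero_shot_cot_prompt(question, options)
# Hand-written stand-in for a model reply; no model is called here.
mock_response = "The muscle lengthens under load while the bar is lowered. Answer: B"
chosen = extract_answer(mock_response, options)

print(prompt)
print("Chosen:", chosen)
```

Asking for a fixed final line such as `Answer: <letter>` keeps the free-form reasoning separate from the part that gets graded, which makes scoring an exam of many questions straightforward.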

For patient-reported outcomes, researchers asked the model: “Based on this wearable data, would the user report having difficulty falling asleep?” to which it replied, “This person is likely to report that they experience difficulty falling asleep several times over the past month.”

The authors write: “Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models.”

Gemini Can Provide Personalized Insights

To obtain these results, the researchers first developed and curated three datasets that tested individualized insights and recommendations based on collected physical activity, sleep patterns, and physiological responses; expert domain knowledge; and predictions about self-reported sleep quality.

In partnership with domain experts, they developed 857 case studies that depict real-world circumstances related to sleep and fitness (507 for the former and 350 for the latter). Sleep scenarios identified probable causative factors and provided individualized recommendations to help improve sleep quality. Fitness tasks used data from training, sleep, health markers, and user input to make recommendations for the intensity of physical activity on any given day.

Both types of case studies included wearable sensor data (up to 29 days for sleep and more than 30 days for fitness), demographic information (age and gender), and expert interpretation.

Sensor data comprised overall sleep scores, resting heart rates, changes in heart rate variability, sleep duration (start and end times), awake minutes, restlessness, REM sleep percentage, respiration rates, number of steps, and fat-burning minutes.
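One way to picture such a case study is as a small record of demographics plus a list of daily readings, flattened into text that a language model can read. The sketch below is an assumption-laden illustration: the class names, field names, units, and the subset of metrics are hypothetical, the readings are made up, and the study's actual data format is not described in this article. The instruction text reuses the sleep coaching prompt quoted earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailyReading:
    """One day of wearable data; field names and units are assumptions."""
    sleep_score: int           # overall sleep score
    resting_heart_rate: int    # beats per minute
    sleep_minutes: int         # total time asleep
    awake_minutes: int         # time awake during the night
    rem_sleep_pct: float       # share of sleep spent in REM
    steps: int                 # step count for the day


@dataclass
class CaseStudy:
    """Demographics plus a sequence of daily readings."""
    age: int
    gender: str
    readings: List[DailyReading]


def render_prompt(case: CaseStudy) -> str:
    """Flatten a case study into plain text a language model could read."""
    header = (
        "You are a sleep medicine expert. You are given the following sleep data. "
        f"The user is {case.gender}, {case.age} years old. "
        "List the most important insights."
    )
    rows = [
        f"Day {i}: sleep score {r.sleep_score}, resting HR {r.resting_heart_rate} bpm, "
        f"asleep {r.sleep_minutes} min, awake {r.awake_minutes} min, "
        f"REM {r.rem_sleep_pct:.0f}%, steps {r.steps}"
        for i, r in enumerate(case.readings, start=1)
    ]
    return "\n".join([header, *rows])


# Made-up readings for illustration only; not data from the study.
case = CaseStudy(
    age=50,
    gender="male",
    readings=[
        DailyReading(72, 61, 395, 48, 19.0, 8200),
        DailyReading(65, 63, 350, 62, 17.0, 5400),
    ],
)
prompt = render_prompt(case)
print(prompt)
```

Rendering one line per day keeps the time ordering explicit, which matters when the model is expected to reason about trends over several weeks of readings.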

“Our study shows that PH-LLM is capable of integrating passively-acquired objective data from wearable devices into personalized insights, potential causes for observed behaviors and recommendations to improve sleep hygiene and fitness outcomes,” the authors write.

There Is Still More Work To Be Done In Personal Health Apps

However, the researchers concede that PH-LLM is only the beginning, and that it, like any new technology, has problems that must be worked out. For example, model-generated responses were not always consistent, there were “conspicuous differences” in confabulations across case studies, and the LLM’s responses were occasionally conservative or cautious.

In fitness case studies, the model was sensitive to overtraining, and human experts noticed in one instance that it failed to detect undersleeping as a potential cause of injury. Furthermore, case studies were drawn from a wide range of demographics and from extremely active individuals, so they were unlikely to be truly representative of the population and could not address more general sleep and fitness difficulties.

“We caution that much work remains to be done to ensure LLMs are reliable, safe and equitable in personal health applications,” the authors conclude. This includes further reducing confabulations, taking into account specific health circumstances not captured by sensor data, and ensuring that training data reflects the diversity of the population.

In the conclusion, the investigators write: “The results from this study represent an important step toward LLMs that deliver personalized information and recommendations that support individuals to achieve their health goals.”