\[ TS_{d}=\frac{\sum_{j=1}^{J}\sum_{i=1}^{N_{d'}^{(s)}}R_{d,j,i}^{(s)}}{\sum_{j=1}^{J}\sum_{i=1}^{N_{d'}^{(s)}}I_{d,j,i}^{(s)}} \tag{4} \]

The geometric mean is used to ensure that strong performance in one area cannot mask weaknesses in another. This method penalizes low component scores, requiring models to perform well across all dimensions of human flourishing simultaneously to achieve a high score. This comprehensive evaluation framework provides a holistic assessment of how effectively AI models support human flourishing across multiple dimensions, creating a new standard for trusted, values-aligned AI development.
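As a minimal sketch of the two aggregation steps described above, the Python below computes a ratio of double sums in the shape of Eq. (4) and then a geometric mean over component scores. The function names `ratio_score` and `geometric_mean`, the nested-list layout (outer index for judge j, inner index for question i), and all numeric values are assumptions made for illustration; they are not taken from the benchmark's implementation or results.

```python
import math
from typing import Sequence


def ratio_score(r: Sequence[Sequence[float]], i: Sequence[Sequence[float]]) -> float:
    """Sum R over judges and questions, divided by the same double sum of I.

    Assumed layout: r[j][k] and i[j][k] hold the value for judge j, question k.
    """
    numerator = sum(sum(row) for row in r)
    denominator = sum(sum(row) for row in i)
    if denominator == 0:
        raise ValueError("sum of I is zero; the ratio is undefined")
    return numerator / denominator


def geometric_mean(scores: Sequence[float]) -> float:
    """Geometric mean of non-negative scores, computed in log space."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0  # a single zero component drives the whole mean to zero
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


if __name__ == "__main__":
    # Hypothetical component scores: six strong dimensions and one weak one.
    hypothetical = [80, 80, 80, 80, 80, 80, 43]
    arithmetic = sum(hypothetical) / len(hypothetical)
    print(f"arithmetic mean: {arithmetic:.1f}")
    print(f"geometric mean:  {geometric_mean(hypothetical):.1f}")
```

With these hypothetical inputs the geometric mean comes out below the arithmetic mean, which is the penalizing behavior the text describes: a single weak component lowers the aggregate more than simple averaging would.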
5 Experimental Results and Insights

This section presents results from the initial benchmarking of large language models (LLMs) using the Flourishing AI Benchmark (FAI Benchmark). These evaluations assess alignment with seven dimensions of human flourishing: Character and Virtue, Close Social Relationships, Happiness and Life Satisfaction, Meaning and Purpose, Mental and Physical Health, Financial and Material Stability, and Faith and Spirituality.

A threshold score of 90 was selected to indicate strong alignment with the principles of human flourishing across all seven dimensions. Although this threshold is somewhat arbitrary, it reflects a high standard that balances aspirational intent with the practical limitations of current LLMs. A perfect score of 100 is neither realistic nor expected given the complexity of human values and the evolving nature of LLMs; such a result would imply an idealized level of performance unlikely to be achieved in practice. At the same time, a lower threshold (such as 80) might overstate a model’s readiness to support holistic well-being, especially in areas requiring deep contextual understanding or moral sensitivity. By setting the bar at 90, the benchmark offers a rigorous yet achievable target that encourages continuous improvement while providing a clear signal of meaningful alignment with the multifaceted dimensions of human flourishing.

While current models show some promising capabilities, none meets or exceeds a threshold score of 90 across all dimensions. This reinforces the notion that significant room for improvement remains for the development of models that support holistic human flourishing.

5.1 Summary of Initial Evaluation

Each model listed in Appendix C was evaluated once using the FAI Benchmark. The results represent point-in-time snapshots of each model’s ability to respond to complex, value-aligned prompts.
The top-scoring model, OpenAI’s o3, achieved an overall geometric mean score of 72. Other high-performing models include Gemini 2.5 Flash Thinking (68) and Grok 3 (67), as well as GPT-4.5 Preview, o1, and o4-mini (66 each). All of these scores fall well short of the 90-point threshold that would indicate robust alignment with human flourishing. We note that one of the judges (GPT-4o mini) is from the OpenAI family. Currently, research is mixed on when and how models might favor themselves or their family of models (Zheng et al., 2023; Xu et al., 2024). We plan further research and analysis to limit model judge bias, if it exists.

Figure 1: Overall FAI Benchmark Scores by Model. The red dashed line represents the target alignment score of 90. All models fall short of this threshold, indicating that substantial opportunity remains to improve model alignment with the dimensions of human flourishing.

5.2 Observed Dimension Gaps

Although model performance varied, a consistent pattern emerged across the evaluation. No model achieved the 90-point flourishing threshold score across all dimensions. Certain areas, notably Character and Finances, exhibited comparatively higher scores across models, with several systems such as o3, Grok 3, and GPT-4.1 achieving their highest scores in these categories. However, critical dimensions like Faith, Relationships, and Purpose consistently lagged across the benchmarked models. For example, while o3 had the highest overall score at 72, it performed considerably worse in Faith, scoring only 43. This persistent pattern suggests that contemporary LLMs are relatively stronger at optimizing for pragmatic or emotionally supportive outputs, yet continue to underperform in dimensions requiring ethical reflection, existential reasoning, and virtue-based considerations. The results underscore the necessity of advancing model training methods that explicitly target multi-dimensional flourishing rather than narrow task optimization.
Table 1: Flourishing AI Benchmark. Columns: Model, Overall, Character, Relationships, Faith, Finance, Happiness, Meaning, Health. [Table body not recovered.]