Chinese General Practice


Analysis of the Quality and Readability of Thyroid Cancer-related Information in Large Language Models Based on TikTok Index

  

  1. Head and Neck Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China; 2. Department of Nursing, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China; 3. Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
  • Received: 2025-04-29 Accepted: 2025-05-30
  • Contact: NING Yanting, Associate chief nurse; E-mail: ningyanting@cicams-sz.org.cn

基于抖音指数的甲状腺癌问题集在大型语言模型中的信息质量及可读性分析

  

  1. 1.518116 广东省深圳市,国家癌症中心 国家肿瘤临床医学研究中心 中国医学科学院北京协和医学院肿瘤医院深圳医院头颈外科;2.518116 广东省深圳市,国家癌症中心 国家肿瘤临床医学研究中心 中国医学科学院北京协和医学院肿瘤医院深圳医院护理部;3.100021 北京市,国家癌症中心 国家肿瘤临床医学研究中心 中国医学科学院北京协和医学院肿瘤医院胸外科
  • 通讯作者: 宁艳婷,副主任护师;E-mail:ningyanting@cicams-sz.org.cn
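  • Funding:
    Intramural Research Project, Nursing Special Program, Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences (E010422009)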

Abstract: Background Large language models (LLMs) are becoming familiar to the public and are increasingly used in healthcare. Thyroid cancer is a common malignancy in China, and patients have a high demand for health education information about the disease. However, no study has assessed the quality and readability of LLM-generated responses about thyroid cancer in the Chinese context. Objective To evaluate and compare the information quality and readability of responses generated by domestic LLMs to thyroid cancer-related questions. Methods The Douyin Index was used to select a set of 25 thyroid cancer-related questions. Responses were generated with DeepSeek (DeepSeek-R1-0120), Qwen (qwen-max-2025-01-25), and GLM (GLM-4Plus). Cosine similarity between texts generated at different time points was calculated to assess model stability. Information quality was rated with the modified Health Information Quality Assessment Tool (mDISCERN), and readability was evaluated with the Chinese Readability Formula. Differences in information quality between models were explored with cluster heatmaps, principal component analysis (PCA), Friedman tests, and signed-rank tests, and Pearson correlation analysis was used to examine the relationship between information quality and readability. Results In the text similarity evaluation, 12% of DeepSeek's paired responses were moderately similar and 88% were highly similar, whereas 100% of the paired responses from Qwen and from GLM were highly similar. Information quality and readability differed significantly across the three models (P<0.001). DeepSeek outperformed the other models in information quality (Z=35.396, P<0.001), but its readability was comparatively poor (R=7.525±1.006). Qwen and GLM showed similar information quality; GLM performed better on question clusters 2 and 3, while Qwen performed better on question cluster 1. Information quality was negatively correlated with readability (r=-0.370, P=0.010). Conclusion Domestic LLMs can provide basic health education for patients with thyroid cancer, but the generated content contains inaccuracies and AI hallucinations. Patients who use LLMs to obtain health information should weigh the responses from different platforms together with their doctor's advice. On the model side, professional rigor must be balanced against plain-language accessibility, and a safety review mechanism for medical content should be established to ensure that the information is accurate and professional.
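
Two of the quantities named in the Methods, the cosine similarity between repeated responses and the Pearson correlation between quality and readability scores, can be illustrated with a minimal Python sketch. The abstract does not say how the texts were turned into vectors, so the character-bigram counts used here are an assumption, and every text and score in the usage lines is a made-up placeholder rather than study data.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (assumed vectorization, not the paper's)."""
    cleaned = "".join(text.split())  # drop all whitespace before counting
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(text_a, text_b, n=2):
    """Cosine of the angle between two n-gram count vectors (0 = no overlap)."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield an n-gram
    return dot / (norm_a * norm_b)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical pair of responses to the same question at two time points.
    first = "甲状腺癌术后需要定期复查甲状腺功能"
    second = "甲状腺癌术后应当定期复查甲状腺功能和颈部超声"
    print(f"stability (cosine similarity): {cosine_similarity(first, second):.3f}")

    # Hypothetical per-question quality and readability scores.
    quality = [4.0, 3.5, 4.5, 3.0, 5.0]
    readability = [7.1, 6.4, 7.9, 6.0, 8.2]
    print(f"Pearson r: {pearson_r(quality, readability):.3f}")
```

A word-level or embedding-based representation would be an equally plausible choice for the similarity step; character bigrams are used only because they need no Chinese word segmenter.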

Key words: Large language models, Thyroid cancer, Information quality, Readability analysis, Medical artificial intelligence

摘要: 背景 大型语言模型作为新技术逐渐被民众熟知与应用。甲状腺癌作为我国恶性肿瘤中的常见类型,患者对甲状腺癌科普信息需求量高,但国内仍未有对大型语言模型中甲状腺癌领域应答文本的信息质量和可读性分析的研究。目的 评估和比较国内大型语言模型(LLMs)对甲状腺癌相关问题应答文本的信息质量与可读性。方法 基于抖音指数筛选25个甲状腺癌问题作为问题集,利用DeepSeek(DeepSeek-R1-0120)、通义千问(qwen-max-2025-01-25)、智谱清言(GLM-4Plus)分别生成应答文本。采用余弦相似度计算不同时间节点生成文本的相似度以评估模型稳定性。采用改良版健康信息质量评价工具(mDISCERN)进行信息质量评价,结合中文可读性计算公式评估文本可读性。通过绘制聚类热力图、主成分分析及Friedman检验、符号秩和检验探索各模型间应答文本信息质量的差异,采用Pearson相关性分析探究信息质量和可读性的关联。结果 文本相似度评价结果显示,DeepSeek文本中度相似占12%,文本高度相似占88%,通义千问和智谱清言2次应答文本高度相似占100%。3个模型的信息质量与可读性比较,差异有统计学意义(P<0.001),DeepSeek在信息质量上优于其他模型(Z=35.396,P<0.001),但可读性相对较差(R=7.525±1.006)。通义千问与智谱清言信息质量相似,但智谱清言更擅长对问题集聚类2、聚类3的应答,通义千问更擅长对问题集聚类1的应答。信息质量与可读性呈负相关(r=-0.370,P=0.010)。结论 国内大型语言模型可为甲状腺癌患者提供基础健康科普,但存在生成内容不准确与人工智能(AI)幻觉,患者在实际应用大型语言模型(LLMs)获取健康信息时,应结合不同平台的应答文本及医生建议综合考虑。模型方面需要平衡信息的专业性与通俗性,并建立医疗内容安全审核机制,以确保信息的准确性与专业性。

关键词: 大型语言模型, 甲状腺癌, 信息质量, 可读性分析, 医疗人工智能
