Computer science
Data science
Cluster analysis
Contradiction
Earth science
Big data
Natural language
Ontology
Information retrieval
Tone (literature)
Information extraction
Topic model
Tracking (education)
Artificial intelligence
Corpus linguistics
Nature (archaeology)
Computational linguistics
Language model
Information system
Semantics (computer science)
Data modeling
Abstract
Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than relying on LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from the extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences was constructed from 95 leading peer-reviewed geoscience journals, such as Geophysical Research Letters and Earth and Planetary Science Letters, published between 2000 and 2024. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from this corpus through semantic search and sentence-level indexing. Unlike LLMs such as ChatGPT-4, which often produce generalized responses, this approach excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters via unsupervised clustering at the sentence level, MiniLMs provide a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLMs hold significant potential within the geoscience community for applications such as fact and image retrieval, trend analysis, contradiction analysis, and education.
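A minimal sketch of the kind of sentence-level semantic search and unsupervised clustering the abstract describes, built on the open-source sentence-transformers and scikit-learn libraries. The checkpoint name (all-MiniLM-L6-v2), the toy three-sentence corpus, the query, and the cluster count are illustrative assumptions; the paper's actual MiniLM variant, indexing pipeline, and corpus are not reproduced here.

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.cluster import KMeans

# Load a freely available MiniLM sentence encoder (checkpoint name is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy stand-in for the 77-million-sentence geoscience corpus described in the abstract.
corpus = [
    "Global mean sea level rose by roughly 3 mm per year between 1993 and 2018.",
    "The Chicxulub impact coincides with the Cretaceous-Paleogene boundary.",
    "Mantle plumes may explain intraplate volcanism such as the Hawaiian chain.",
]

# Encode the sentences once; the embeddings serve as the sentence-level index.
corpus_embeddings = model.encode(
    corpus, convert_to_tensor=True, normalize_embeddings=True
)

# Semantic search: embed the query and rank sentences by cosine similarity.
query = "How fast is sea level rising?"
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")

# Unsupervised clustering of the same embeddings to surface topical groups
# (cluster count chosen arbitrarily for this toy example).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    corpus_embeddings.cpu().numpy()
)
print(labels)
```

In practice one would precompute and store embeddings for the full corpus and restrict queries to the top-ranked sentences, which keeps retrieval fast and cheap compared with prompting a large generative model.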