OpenAI has unveiled GPT-Rosalind, a large language model (LLM) specialised for biology research. Tech outlet Ars Technica reported on April 16 (local time) that, unlike general-purpose science models, GPT-Rosalind was trained to follow the workflows of biology research itself.
GPT-Rosalind targets two problems that biology researchers face. The first is that the genome sequencing and protein biochemistry data accumulated over decades is now so vast that individual researchers struggle to absorb it. The second is that techniques and terminology differ widely across subfields, making it difficult for researchers in one area to follow the literature in another. For example, a geneticist studying genes that are activated in brain cells may struggle to make sense of the extensive neurobiology literature.
Yoon-Yoon Wang (윤윤 왕), OpenAI's head of life sciences product, said at a press briefing, "We trained GPT-Rosalind on the 50 most common workflows in biology and also taught it how to access major public biology databases." Wang said additional training was used so that the model can propose plausible biological pathways and prioritise potential drug targets.
Wang also said the model can link genotype to phenotype based on known pathways and regulatory mechanisms, and can infer the structural and functional properties of proteins. OpenAI said its focus was on putting this understanding of biological mechanisms to use in real research.
OpenAI said it also tried to reduce the excessive agreeableness and overly optimistic answers typical of LLMs, tuning the model to judge unsuitable drug targets more critically. The company also highlighted GPT-Rosalind's "reasoning" and "expert-level" capabilities. OpenAI defined reasoning as the ability to carry out complex multi-step processes and cited performance on some benchmarks as the basis for its expert-level claim.
It is still unclear how far the model has solved the hallucination problem. LLMs can generate incorrect content even when asked to explain how they reached a conclusion. In practical use, the model may win praise for finding unexpected links, but it may also make suggestions that are clearly wrong.
Access will remain limited for the time being. OpenAI is rolling the model out conservatively because of concerns that it could be misused to increase viral infectivity. For now, applications are limited to institutions headquartered in the United States, and OpenAI plans to select separately who can use it.
OpenAI also said it will make a more restricted life sciences research plugin available to general users. Rather than fully opening its specialised life sciences functions, it will release the capability in stages, tiered by risk.
Other companies are also releasing AI models aimed at life sciences, but OpenAI is seeking to differentiate GPT-Rosalind by narrowing its focus further to biology. Whether this focused strategy improves research efficiency remains to be seen and will have to be judged on results from use in the field.