A language model called Mr. Chatterbox has been unveiled after being trained from scratch solely on materials published in Victorian-era Britain.
Online media outlet Gigazine reported on April 1 that the model was trained on 28,035 selected documents from 1837 to 1899.
The model’s core feature is era-limited training. Developer Trip Venturella used a British Library dataset released on the AI platform Hugging Face, selecting only materials from the Victorian era to build the training data. The British Library, working with Microsoft, has released datasets of more than 25 million pages of out-of-copyright books and documents, and the model narrows that vast trove to a single period.
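The era filter described above amounts to a date-range selection over dataset records. The sketch below illustrates the idea with an in-memory sample; the field names and titles are illustrative assumptions, not the actual schema or contents of the British Library dataset:

```python
# Sketch of a Victorian-era filter over book records.
# The record layout ("title", "year") and the sample entries are
# illustrative assumptions, not the dataset's real schema or catalogue.

VICTORIAN_START, VICTORIAN_END = 1837, 1899

def is_victorian(record):
    """Keep only records dated within the Victorian era (1837-1899)."""
    year = record.get("year")
    return year is not None and VICTORIAN_START <= year <= VICTORIAN_END

# Illustrative sample records (not real catalogue entries).
records = [
    {"title": "A Treatise on Railways", "year": 1845},
    {"title": "Early Georgian Sermons", "year": 1780},
    {"title": "On the Crystal Palace", "year": 1852},
    {"title": "Edwardian Memoirs", "year": 1905},
]

victorian_corpus = [r for r in records if is_victorian(r)]
print([r["title"] for r in victorian_corpus])
# → ['A Treatise on Railways', 'On the Crystal Palace']
```

Against the real dataset, the same predicate would be applied over the full collection (for example via a streaming filter) to produce the 28,035-document training set.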
Mr. Chatterbox has about 340 million parameters and was introduced as similar in size to OpenAI’s GPT-2 Medium. By the standards of today’s race for ever-larger models it is small, but it was deliberately limited to texts from a single era so that it would strongly reflect the period’s writing style and knowledge context.
Venturella stressed that the model connects with Victorian-era life, literature, science, philosophy and etiquette. He suggests users ask about railways, the Crystal Palace, Darwin’s theory of evolution, or how to behave as a gentleman to experience the model’s character. Rather than covering up-to-date general knowledge or a wide range of topics like a general-purpose chatbot, the approach focuses on reproducing the text-based worldview of a specific era.
He acknowledged clear limits at this stage. Venturella said Mr. Chatterbox is still a beta version, so responses may be unstable or unnatural, and users may need to regenerate an answer if one does not come out smoothly. The model was released with the caveat that the user experience may be inconsistent.
The release shows that models specialised in a specific topic or era can be built using only publicly available material free of copyright entanglements. At the same time, raising general conversational quality with era-limited data alone remains difficult, leaving key tasks ahead: securing additional training data, refining the model and stabilising its output.