Hancom said on Sunday that its open-source PDF data extraction project, OpenDataLoader PDF v2.0, ranked No. 1 on GitHub’s Trending list across all programming languages as of March 20 and received a trending badge.
GitHub Trending is a ranking that tracks in real time which open-source projects are drawing the most attention from developers worldwide.
Hancom said OpenDataLoader PDF v2.0 gained more than 1,800 GitHub stars in a single day on March 21. The project’s total has passed 7,000 stars and 500 forks, it said.
OpenDataLoader PDF breaks down PDF documents with complex structures into text, tables and images, converting them into a form that artificial intelligence can process immediately.
PDF is the document format most widely used for AI training worldwide, but its complex internal structure makes data extraction difficult, a problem that has been cited as a key bottleneck in AI development.
Hancom signed a memorandum of understanding in July 2025 with Duallab, a global PDF technology specialist, and began joint development. It released an initial version in September that year and launched v2.0 on March 12.
Version 2.0 uses a hybrid engine that combines an AI-based method with direct extraction, and it runs locally without sending data to external servers. It ships with four AI add-ons by default: optical character recognition, table extraction, formula extraction and chart analysis. It is also compatible with other companies’ open-source AI models, such as Docling.
Hancom Chief Executive Kim Yeon-su (김연수) said, "This achievement shows that the global developer community has directly verified the completeness and practicality of Hancom’s document data extraction technology, and it also confirms the potential to expand the technology ecosystem through diverse uses." He added, "By switching to the Apache 2.0 licence, we will develop it into an open PDF data platform that companies and developers around the world can freely use and extend."