'Sovereign AI language database' due before year-end: Digital ministry

Taipei, July 15 (CNA) Taiwan's Ministry of Digital Affairs (MODA) said Tuesday it plans to release the first version of its "sovereign AI language database" in the fourth quarter of this year.
The release will be based on licensing terms for the training corpus -- the body of data used to train an AI model -- that the ministry has drafted to help agencies identify suitable data for inclusion while addressing copyright issues, MODA said at a news conference.
Government ministries are currently reviewing their datasets, and both public and private sector entities will be able to apply to access the database once it goes online, according to the ministry.
Chuang Ming-fen (莊明芬), head of MODA's Department of Data Innovation, said the ministry began preparing the licensing terms to address copyright concerns that have arisen over AI training.
She said the corpus is expected to include open government data, policy reports and government publications.
Rather than measuring the dataset by volume, the ministry plans to use tokens as the unit for quantifying the data, Chuang added.
She noted that only around 1,000 of the more than 50,000 open datasets currently available are textual in nature, the type of data large language models require.
Agencies such as the Hakka Affairs Council (HAC), Ministry of Education (MOE), Council of Indigenous Peoples (CIP), and Ministry of Culture (MOC) are among those now reviewing language data for possible inclusion, she said.
The announcement came at a press conference introducing draft legislation on promoting data innovation and utilization, which is open for public comment until Aug. 15.
Chuang said the draft act focuses on four key areas: expanding open data for AI, promoting cross-industry data-sharing mechanisms, lowering agency data costs, and building a data innovation ecosystem, including requiring municipalities to appoint chief data officers.
Deputy Digital Affairs Minister Lin Yi-jing (林宜敬) said the proposed law aims to "train more AI models with Taiwanese perspectives" by allowing copyrighted data to be released with privacy safeguards.