In the past, datasets have been held back by limited information, simple annotations and a lack of character sets.
For example, detection datasets could only isolate characters without identifying them, character recognition datasets could recognize only a few hundred characters, and some annotations were incomplete due to a lack of specialized knowledge. These issues hindered the development of algorithms for oracle bone scripts.
The digitalization initiative seeks to address these problems.
"We have now achieved several significant milestones," says Wang Chaoyang, chief architect at the digital cultural lab of Tencent's sustainable social value division, which has been a major force behind the initiative.
To date, some 160,000 oracle bones have been unearthed and around 4,500 characters identified, of which about 3,000 have yet to be deciphered.
Wang says that the bones deteriorate rapidly and that the number of experts and scholars specializing in reading oracle bone scripts is limited.
"There is an urgent need for digital preservation and intelligent tools to accelerate the decoding process," he adds.