by Tarin Clanuwat
About the project
Our project uses machine learning to transcribe Kuzushiji characters (a cursive writing style) into modern Japanese characters, because most Japanese people can’t read Kuzushiji anymore. The reason I started this project is because I want to make historical documents more accessible so the general public can appreciate them and help protect them. We built KuroNet, and now the service is available through IIIF on the CODH website.
More information about our project is available through the blog posts By the Book: AI Making Millions of Ancient Japanese Texts More Accessible and How Machine Learning Can Help Unlock the World of Ancient Japan, as well as our paper on Arxiv, KuroNet: Pre-Modern Japanese Kuzushiji Character Recognition with Deep Learning.
What are the stakes?
The end goal of our Kuzushiji recognition project is to create a search engine for historical documents, even though they are not transcribed into plain text. This will help researchers work more efficiently, possibly leading to new discoveries in literature and history. My own goal is to help people better appreciate classical Japanese literature. Kuzushiji is a huge wall that prevents them from enjoying the original texts, so I am trying to eliminate that barrier. In the future, we can try to do something more challenging, like machine translation from classical Japanese to modern Japanese. I also want to open up Japanese literature to other fields like machine learning, too.
What have you learned doing this project?
Building a one-size-fits-all model is extremely hard. There will always be exceptional cases. A model (KuroNet or any other model) will never be perfect. We can keep optimizing the model, but it can’t go beyond the scope of the data set. We need to think about how to expand the model through collaborations between humanities and computer sciences researchers which, in my experience, not very easy path. Making people (especially humanities researchers) understand our goals is important but can be hard sometime.
We mainly use Python and the deep learning framework Pytorch.
Our model is trained on a Kuzushiji dataset created by the National Institute of Japanese Literature. It’s based on 44 books printed in the Edo period (17th - 19th century Japan).
There are 3 people in KuroNet project. It’s me, Tarin Clanuwat (ROIS-DS Center for Open Data in the Humanities, National Institute of Informatics, Japan); Alex Lamb from Mila, University of Montreal; and my supervisor, Asanobu Kitamoto (ROIS-DS Center for Open Data in the Humanities, National Institute of Informatics, Japan). Alex was in Japan when we built the prototype, but he’s based in Canada. So most of the time, we work through online communication like Google Docs, Overleaf, and video chat.
Key: Collaborators: blue, Sources: purple. Map projection: Dymaxion.