This data set contains transcribed and untranscribed cursive Dutch handwriting from historical (16th to 18th century) manuscripts.
Historical handwritten documents guard an important part of human knowledge that is within reach of only a few scholars and experts. Recent developments in machine learning and handwriting research have the potential to render this information accessible to a larger audience. Data-driven approaches to automatic manuscript recognition require large amounts of transcribed scans to work. To this end, we introduce a new handwritten corpus based on 400-year-old, cursive, early modern Dutch documents such as ship journals and daily logbooks. The 1000-page collection has been segmented into lines, and we provide textual transcriptions for 20% of the pages. Other annotations such as handwriting slant, year of origin, complexity, and writer identity have been manually added. With over 80 writers, this corpus is significantly larger and more varied than other existing data sets such as the Spanish RODRIGO corpus. We provide train/test splits, experimental results from an automatic transcription baseline, and tools to facilitate its use in deep learning research. The manuscripts span over 150 years of significant journeys by captains and traders of the Vereenigde Oostindische Compagnie (VOC) such as Tasman, Brouwer and Van Neck, making this resource also valuable to historians and the paleography community.
Contact: scribblelens@protonmail.com
The data has been used for academic research as part of the JSALT'19 project "Distant Supervision for Representation Learning".
You can cite the data using the following BibTeX entry:
@inproceedings{Dolfing20,
  author={Hans J.G.A. Dolfing and Jerome Bellegarda and Jan Chorowski and Ricard Marxer and Antoine Laurent},
  title={{The ``ScribbleLens'' Dutch historical handwriting corpus}},
  booktitle={International Conference on Frontiers of Handwriting Recognition (ICFHR)},
  pages={To Appear},
  year={2020},
  note="{http://www.openslr.org/84/}"
}