The aim of this paper is to share the experience of developing a Hittite treebank. The challenges are not limited to the usual peculiarities of ancient-language data, such as the need for spelling normalization. Hittite is attested in cuneiform script: the standard procedure is to transcribe it syllable by syllable, so normalization requires an additional step called broad transcription (a rough phonological interpretation of the script). Cuneiform signs in Hittite documents can be Hittite, Akkadian, or Sumerian. In the latter two cases, the real Hittite word, known to contemporary scribes, is concealed from the modern reader but should be restored in a treebank on the basis of various copies of the document. This can rarely be done automatically. The lack of digitized Hittite dictionaries complicates automated tokenization; conversely, one particular aim of the Hittite treebank is to serve as a tool for building such a dictionary. Computer tools for processing Hittite include, at least in part, instruments for automated broad transcription. I will show the results and the principles of this type of document processing. So far, all the Hittite letters have been digitized and tokenized and, where possible, provided with an additional layer of correspondences between Hittite, Akkadian, and Sumerian script. Our next step is to create a treebank in accordance with the requirements of Universal Dependencies: the material demands syntactic annotation, and I will briefly discuss the results of our preliminary attempts to annotate the Hittite data.
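To illustrate what an automated broad-transcription step might involve, here is a minimal Python sketch. It is not the tool described in the paper. It rests on several assumptions: signs are separated by hyphens, Hittite syllabic signs are written in lower case, logograms are written in upper case, and diacritics are omitted. The function names `broad_transcribe`, `classify`, and `process` are hypothetical. The sketch writes a vowel once when it is repeated across a sign boundary, which also collapses plene spellings without marking vowel length. It does not distinguish Sumerian from Akkadian logograms and does not restore the underlying Hittite word.

```python
VOWELS = "aeiu"

def broad_transcribe(word: str) -> str:
    """Join hyphen-separated syllabic signs into a rough broad transcription."""
    signs = word.split("-")
    out = signs[0]
    for sign in signs[1:]:
        # CV-VC spellings repeat the vowel across the sign boundary:
        # write that vowel only once (vowel length is NOT marked here).
        if out and sign and out[-1] == sign[0] and sign[0] in VOWELS:
            out += sign[1:]
        else:
            out += sign
    return out

def classify(token: str) -> str:
    """Label a token by script layer, using the assumed case convention."""
    signs = token.split("-")
    upper = [s.isupper() for s in signs]
    if all(upper):
        return "logogram"             # Sumerian or Akkadian; not resolved here
    if any(upper):
        return "logogram+complement"  # logogram with a syllabic ending
    return "syllabic"

def process(line: str):
    """Tokenize on whitespace; transcribe only fully syllabic tokens."""
    result = []
    for token in line.split():
        kind = classify(token)
        form = broad_transcribe(token) if kind == "syllabic" else None
        result.append((token, kind, form))
    return result

if __name__ == "__main__":
    for row in process("LUGAL-us pa-ah-hu-ur"):
        print(row)
```

Tokens containing a logogram are deliberately left without a transcribed form, which reflects the point above that restoring the concealed Hittite word can rarely be done automatically.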