Reuse or New Development: sustainability of resources and tools for multi-facetted historical data and languages
Workshop in conjunction with the Conference
Forschungsdaten in den Geisteswissenschaften (FORGE 2016) "Jenseits der Daten"
University of Hamburg, 14 September 2016
Organisers
Dr. Cristina Vertan, University of Hamburg, ERC Project TraCES
Dr. Alicia González, University of Hamburg, ERC Project COBHUNI
Dr. Peter Verkinderen, University of Hamburg, ERC Project The Early Islamic Empire at Work
Publication
Selected PowerPoint presentations have now been published online (see Programme below). Proceedings are being prepared for publication.
Rationale
Data in the humanities, especially historical data, is characterized by a strong presence of vague information and uncertainty. The available Content Management Systems and annotation tools have often disregarded the requirements of research projects dealing with fuzzy data, languages with non-concatenative morphologies and scripts of non-Latin writing systems. Additionally, data encoding standards often overstress the importance of mere standardization at the expense of human readability and efficiency in terms of storage and parsing performance. Similarly, morphological tag sets and natural language processing frameworks primarily based on Indo-European languages are presented as universal solutions, but fail to account for some of the linguistic phenomena characteristic of other languages.
This workshop brought together scholars using annotation tools for non-Western languages with people involved in the development of such tools and content management systems, in order to exchange experiences, discuss problems, and search for ways to overcome these barriers.
Three main directions were discussed:
1. Sustainability in terms of data repository
- What kind of data management framework is needed for which type of data?
- How can one store enough explanation about vague, imprecise and ambiguous data (like place and person names, historical dates, complex relationships among actors)? (see the sketch after this list)
- How can currently available systems handle multilingual issues?
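One possible answer to the question of storing vague data is to record the uncertainty alongside the value instead of forcing a single normalised reading. The following is a minimal sketch in Python under that assumption; the classes, field names and example values (FuzzyDate, PersonMention, the authority-file IDs) are invented for illustration and are not taken from any of the systems discussed at the workshop.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FuzzyDate:
    """A historical date known only as an interval of possible years."""
    earliest: int          # earliest plausible year
    latest: int            # latest plausible year
    note: str = ""         # free-text justification, e.g. the source of the dating

@dataclass
class PersonMention:
    """A person name as attested in a source, possibly not securely identified."""
    surface_form: str                                        # name as written in the source
    candidate_ids: List[str] = field(default_factory=list)   # possible authority-file identifiers
    confidence: Optional[float] = None                       # annotator's confidence in the best candidate

# Example: an appointment datable only to "around 750 CE", with an uncertain identification
appointment = FuzzyDate(earliest=745, latest=755, note="dated relative to a caliph's reign")
governor = PersonMention(surface_form="Yazid",
                         candidate_ids=["pers_0123", "pers_0456"],
                         confidence=0.6)
print(appointment, governor)
```

Keeping the interval and the candidate list explicit means a later user can still see how vague the original evidence was, instead of inheriting a silently normalised value.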
2. Sustainability of (annotation) tools
The analysis of historical data implies the annotation of textual material in languages which usually:
- are either no longer (actively) spoken, like Ge'ez, differ considerably from their modern form (e.g. Classical Arabic), or use a writing system that is no longer fully known (e.g. Maya);
- are under-resourced in terms of digital resources;
- have linguistic structures and features that differ considerably from those of the modern languages targeted by most available annotation tools;
- cannot be annotated in the original script but only in transliteration, while both versions have to be kept synchronised (see the sketch after this list);
- have a grammatical structure that is not yet fully researched: some phenomena are only discovered during the analysis and therefore cannot be taken into account in the modelling phase.
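One common strategy for the synchronisation problem is to treat the original script and the transliteration as two views over the same sequence of token IDs, so that annotations attach to IDs rather than to characters of either layer. A minimal sketch under that assumption (the Token class, its field names and the Ge'ez example words are invented for illustration, not the data model of any tool presented below):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Token:
    tid: str        # stable token ID shared by both layers
    original: str   # token in the original script
    translit: str   # token in transliteration

# Both layers are views over the same token IDs, so they stay aligned
# when the text is corrected or annotations are added.
tokens: List[Token] = [
    Token("t1", "ቃለ", "qāla"),     # hypothetical Ge'ez example
    Token("t2", "ንጉሥ", "nəguś"),   # hypothetical Ge'ez example
]

# Annotations reference token IDs rather than character offsets in either layer.
pos_annotations: Dict[str, str] = {"t1": "NOUN", "t2": "NOUN"}

def rendered(layer: str) -> str:
    """Render the running text from either layer of the same token sequence."""
    return " ".join(getattr(t, layer) for t in tokens)

print(rendered("original"))   # ቃለ ንጉሥ
print(rendered("translit"))   # qāla nəguś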
The workshop tackled issues like:
- which annotation tools can be considered, and what limitations do they have?
- what does the design of a new annotation tool imply, and how can its sustainability be ensured?
- visualisation of complex annotations;
- local solutions versus general tools;
- frequent requirements of structurally complex languages: multi-level annotation, correction of the text during annotation, multi-level segmentation.
3. Sustainability of annotated objects (standards)
Nowadays XML, and TEI-XML in particular, is generally accepted as an export format. Whilst TEI-XML is very useful for this purpose, ensuring the portability of data, there are still issues to be discussed, such as:
- For internal processing, TEI-XML is quite difficult to use; in the case of intensive, fine-grained interlinking between components it becomes practically impossible. Thus project-specific solutions are often required, and an export function should ensure TEI-XML output at the end.
- TEI-XML ensures that data is readable by everyone. Can this data then also be processed easily by a third party?
- Which other formats could be used (e.g. JSON)? (see the sketch after this list)
- Are the currently existing tag sets (like the Tübinger Tagset) sufficiently specified for non-European languages?
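As an illustration of the trade-off between the two formats, the sketch below serialises one invented token annotation both as a TEI-style XML fragment and as JSON. The element and attribute names merely imitate TEI conventions (a <w> element with xml:id, lemma and pos); this is not a validated TEI encoding, and the example token is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# One annotated token (invented example data)
token = {"id": "t1", "form": "qāla", "lemma": "qāl", "pos": "NOUN"}

# TEI-style XML fragment: verbose, but portable and human-readable.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # serialised by ElementTree as xml:id
w = ET.Element("w", attrib={XML_ID: token["id"], "lemma": token["lemma"], "pos": token["pos"]})
w.text = token["form"]
tei_fragment = ET.tostring(w, encoding="unicode")
# -> <w xml:id="t1" lemma="qāl" pos="NOUN">qāla</w>

# JSON: compact, cheap to parse, convenient for internal processing pipelines.
json_fragment = json.dumps(token, ensure_ascii=False)
# -> {"id": "t1", "form": "qāla", "lemma": "qāl", "pos": "NOUN"}

print(tei_fragment)
print(json_fragment)
```

A project could therefore work internally with the JSON-like representation and generate the TEI-style fragment only at export time, which is the division of labour suggested in the first point above.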
Programme
09:30 – 09:45 Cristina Vertan, Alicia González, Peter Verkinderen, University of Hamburg - "Reuse or New Development: sustainability of resources and tools for multi-facetted historical data and languages: Introduction" (doi.org/10.5281/zenodo.160375)
Session I – Computational Approaches
Chair: Peter Verkinderen, Hamburg
09:45 – 10:10 Alicia González, Tillmann Feige, University of Hamburg - “Reuse: A symbiosis between developers and researchers” (doi.org/10.5281/zenodo.160372)
10:10 – 10:35 Johannes Daxenberger, Technical University of Darmstadt - “How to computationally approach extinct languages: A case study on Hittite”
10:35 – 11:00 Christian Prager, University of Bonn - “Of Codes and Kings: Approaches in the Encoding of Classic Maya Hieroglyphic Inscriptions” (doi.org/10.5281/zenodo.160378)
11:00 – 11:25 Heiko Werwick, University of Jena - “Technische Hintergründe der Erstellung eines Wörterbuchs für das Sabäische”
11:25 – 11:45 Coffee Break
Session II - Annotation Tools
Chair: Tillmann Feige, Hamburg
11:45 – 12:10 Cristina Vertan, University of Hamburg - “GeTa: a multi-level semi-automatic annotation tool for Classical Ethiopic” (doi.org/10.5281/zenodo.160366)
12:10 – 12:35 Seid Muhie Yimam, Technical University of Darmstadt - “WebAnno for less resourced and historical data annotation” (doi.org/10.5281/zenodo.160380)
12:35 – 13:00 Thomas Krause, Humboldt University Berlin - “Utilising ANNIS for search and analysis of historical data” (doi.org/10.5281/zenodo.160368)
13:00 – 13:30 Discussions
Abstracts
Alicia González, Tillmann Feige, University of Hamburg - “Reuse: A symbiosis between developers and researchers” (Challenges in using existing tools for Arabic data)
We describe our approach to making an Arabic text corpus analysable with computational-linguistic and semantic methods. The idea was to rely on existing software for the two main tasks, annotation and analysis. To this end we drew up a requirements specification and matched it against the existing software landscape.
The requirements
Since we work with Arabic data, the biggest challenge is the script itself: a right-to-left, connected script in which only consonants and long vowels are written, while short vowels are diacritics that may or may not be set. The biggest hurdle is therefore full UTF-8 support and a clean (connected) rendering of the script, which reduces the choice of tools considerably. In addition, we depend on flexible import and export options. Our approach imposes further constraints, such as multi-level, multi-token and also sub-token annotation.
The selection
For annotation we chose WebAnno, which, through its use of UIMA XMI, allows the use of DKPro Core, a framework with which the data can be properly checked and prepared. This also works for Arabic. Furthermore, the right-to-left support has been steadily extended in collaboration with the developers, so that the right-to-left, connected script is no longer an obstacle.
As visualisation tool we chose ANNIS, which likewise supports Arabic, comes with a configurable converter and allows multi-level corpora, so that the main criteria were met here as well.
Summary: using existing software
In our case, the biggest obstacle turned out to be the characteristics of the Arabic script itself, which considerably reduce the range of available software. Thanks to the very good support for the chosen applications, we could do without an in-house development, which, for the project, removes the question of long-term provision, support and further development of the tools. By using ANNIS, the data can be made available as a closed corpus, so reuse is ensured as well.
*****
Johannes Daxenberger, Technical University of Darmstadt - “How to computationally approach extinct languages: A case study on Hittite”
Extinct ancient languages from the Middle East contain valuable cultural, historical and linguistic information. However, sustainable access to these languages is threatened both by political conflict and by a lack of experts with the necessary knowledge to process them.
As a consequence, their rapid digitization and automatic processing are a desirable goal. State-of-the-art natural language processing (NLP) methodology is mostly tuned towards modern languages and, due to the lack of available digitized data, can only be adapted to extinct languages with substantial effort. In this study, we therefore explore a novel way to enhance digital access to Hittite, a cuneiform language spoken in ancient Asia Minor.
To make the existing transliterations and translations more accessible to non-experts, we implement a set of methods to allow semantic search on a small set of parallel texts. In particular, we explore the use of lexical-semantic methods to semantically enrich the translations for search. Some of the problems we faced include the very specialized vocabulary, fragmentary texts, and divergent markup conventions used by the multiple translators. The evaluation of the developed search tool in a user study showed that our strategy is a promising first step towards computationally approaching extinct languages.
*****
Christian Prager, University of Bonn - “Of Codes and Kings: Approaches in the Encoding of Classic Maya Hieroglyphic Inscriptions”
So far, no existing digital work environment can sufficiently represent the traditional epigraphic workflow of documentation, analysis, interpretation, and publication for texts written in complex writing systems such as Egyptian hieroglyphs, cuneiform, or Classic Maya.
The project “Text Database and Dictionary of Classic Mayan” will transpose this workflow into a digital epigraphy by reusing and developing digital methods and tools within a Virtual Research Environment. Maya writing is a semi-deciphered logographic-syllabic system with approximately 10,000 text carriers discovered at sites throughout Mexico, Guatemala, Belize, and Honduras (300 B.C. to A.D. 1500).
When designing the digital epigraphic work environment, the documentation of the current state of decipherment of the script and language must be considered. The digital decoding of undeciphered scripts requires a machine-readable corpus of annotated textual data that meets the technical requirements for applying corpus-linguistic and computational-linguistic methods. For digitally encoding texts and marking up linguistic information, the annotation guidelines of the TEI (Text Encoding Initiative) have become a standard.
The project will therefore investigate the usability of TEI, which was designed primarily for marking up transcriptions of fully readable texts originally written linearly and in alphabetic writing systems. A linear transcription of Maya inscriptions alone cannot represent the original spelling or the primary source in its entirety, as many potentially significant details remain undocumented. Marking up the original text and its structure is therefore of great importance, particularly for partially deciphered or undeciphered scripts. We identify this issue as a significant desideratum in epigraphic TEI research by assessing the limits of, and restating the requirements for, encoding standards like TEI.
*****
Heiko Werwick, University of Jena - “Technische Hintergründe der Erstellung eines Wörterbuchs für das Sabäische” (Talk in German)
The talk presents the individual processing steps, from the initial capture of the texts to the finished translation, and shows how the data structures adapt to each of these steps.
The individual work steps are:
- Text capture via an external tool
- Post-processing of the text
- Grammatical analysis
- Editing of a lemma
- Creation of the master text
- Translation
*****
Cristina Vertan, University of Hamburg - “GeTa: a multi-level semi-automatic annotation tool for Classical Ethiopic”
Although classical Ethiopic plays an essential role in research on early Christian literature, up to now no digital tools and resources (e.g. corpora, dictionaries) have been available for it. In contrast, manuscripts in classical Greek, Hebrew and Arabic are already digitized and annotated with linguistic and philological information, so that diachronic analyses of language development, the linking of multilingual versions, and cross-language comparisons are possible.
The project TraCES (From Translation to Creation: Changes in Ethiopic Style and Lexicon from Late Antiquity to the Middle Ages), funded through an Advanced Grant from the European Research Council, aims to fill this gap for classical Ethiopic:
- on the one hand, to build the electronic resources necessary for bringing Ge'ez into the digital age, and
- on the other hand, to use these new information technology tools to gain new insights into Ge'ez literature and language.
The complexity of the language and the number of linguistic features to be marked, together with the lack of electronic corpora, make a completely automatic annotation process impossible. No annotation tool currently available offers all the functionality necessary for a deep linguistic annotation. In this contribution we will present a semi-automatic annotation tool which allows corrections of the text during the annotation process and enables annotation at multiple levels. Automatic processes are marked so that the user can check and correct them. The tool can manage different scripts and keeps the original script and the transliteration synchronised. We will show that an adaptation to other languages is also possible.
*****
Seid Muhie Yimam, Technical University of Darmstadt - “WebAnno for less resourced and historical data annotation”
WebAnno is a generic, web-based, and distributed annotation tool. It supports the annotation of different linguistic types and structures, such as token spans (e.g. part of speech), sub-tokens (e.g. morphology markers), relations (e.g. dependency grammar), chains (e.g. co-reference), and complex slot-based annotations (e.g. semantic role labelling).
Unlike many annotation tools, it supports the annotation of different languages, including low-resourced and historical languages, as long as the writing systems use valid Unicode representations. In addition to the left-to-right writing systems of European languages, the latest WebAnno release also supports annotation of right-to-left scripts such as Arabic and Hebrew. To facilitate rapid annotation of less resourced languages, the tool also includes an integrated automation component, which suggests annotations automatically and incrementally so that annotators can easily correct the suggestions.
In an annotation study for Amharic, this automation component led to a 21% increase in annotation speed. There is ample support for the annotation workflow, including user management, agreement computation, adjudication of multiply-annotated material, as well as various import and export formats. WebAnno has been developed over the past three years as part of the CLARIN-D infrastructure and is available as open source, enabling others to customize it according to specific needs.
*****
Thomas Krause, Humboldt University Berlin - “Utilising ANNIS for search and analysis of historical data”
Tools for the analysis of historical data, especially from non-Indo-European languages, have to solve specific challenges pertaining to, e.g., the synchronised representation of original script and transliterations, deep search over non-Latin scripts, data models that allow for customised tokenisations, etc. While implementing new software for a specific research question and specific data is a plausible solution in this context, it is far from sustainable. We present ANNIS, a browser-based, re-usable search and analysis tool for multi-layer linguistic corpora. ANNIS can be, and has been, used for searches and analyses of a number of historical corpora as well as corpora with non-Latin scripts. It is driven by a graph-based data model that can accommodate potentially unlimited types of annotation and can therefore represent data coming from various different sources and formats. The possibility of conversion from several different formats via the compatible conversion framework Pepper makes ANNIS highly re-usable in a wide variety of research contexts. It also features different, pluggable visualisation options so that the different corpus strata can be presented in the most suitable form. As an example, we present a use case for search in Coptic SCRIPTORIUM, a multi-layer corpus of Coptic.