Infrastructure for Multilingual Text Analysis

2022-01-27

Simon Musgrave
Peter Sefton
Language Data Commons of Australia (LDaCA)
University of Queensland

This presentation was delivered by Simon Musgrave and Peter Sefton at an online event, Digital Approaches to Multilingual Text Analysis, on January 27th 2022.

About this event

DH tools and methods have been applied across a variety of corpora, but text analysis of English-language sources has dominated this field. These approaches are increasingly being used in languages and linguistics research for non-English corpora. At the same time, the integration of these tools has seen new research questions and possibilities emerge, including questions such as “Is there a non-Anglo digital humanities (DH), and if so, what are its characteristics?” (Fiormonte 2016: 438). Recent studies have begun to examine aspects such as OCR for historical text analysis and data mining (Hill & Hengchen 2019; Goodman et al. 2018), multilingual computational analysis (Dombrowski 2020), semantic and sentiment analysis (Daems et al. 2019) and historical linguistics (Evans 2016), among others. The papers in this conference present a diverse range of projects and critiques of digital methods across different languages.

January 27th 1:45pm – 7:30pm AEDT

Conveners: Joshua Brown (Senior Lecturer and Convenor, Italian Studies, Australian National University) and Katrina Grant (Senior Lecturer, Centre for Digital Humanities Research, Australian National University)

Why we need research infrastructure
Collecting data is time-consuming (expensive)
Making data reusable while respecting rights is very desirable
FAIR and CARE principles should guide us
Managing this at the level of individual projects is onerous
Even for small datasets
Separate infrastructure encourages best practices
FAIRer data
Wider availability of data management expertise
Better alignment with technology change

If we accept that sharing and reuse of data (consistent with ethical considerations) should be the default, then managing even small amounts of data can be onerous. Infrastructure that takes on this task relieves researchers of some of the burden and brings advantages: more reliable FAIR compliance, access to data management experts, and better responsiveness to changing technology (at least for the life of the infrastructure).

Introducing the Language Data Commons of Australia (LDaCA)
LDaCA will make nationally significant language data available for academic and non-academic use and provide a model for ensuring continued access with appropriate community control
LDaCA aims to provide access to materials which record language use in Australia
In some cases, LDaCA will provide federated access to existing collections
In other cases, LDaCA will be a repository 

Regardless of where data is housed, access will be through one portal (external data may also be accessible by other routes). Access control will follow the CARE Principles for Indigenous Data Governance: the original providers of data have moral rights which must be considered, and data owners/custodians will control lists of authorised users.
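The idea of custodian-controlled lists of authorised users can be sketched in a few lines of Python. This is only an illustration of the principle, not the LDaCA implementation: the `Licence` and `DataObject` classes, the `can_access` and `grant` functions, and the rule that an object with no licence is open are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Licence:
    """Hypothetical licence: a list of authorised users kept by one custodian."""
    name: str
    custodian: str
    authorised_users: set = field(default_factory=set)


@dataclass
class DataObject:
    """Hypothetical repository object; licence=None is assumed to mean open."""
    identifier: str
    licence: Optional[Licence] = None


def can_access(user: str, obj: DataObject) -> bool:
    """Allow access to open objects, or to users on the licence's list."""
    if obj.licence is None:
        return True
    return user in obj.licence.authorised_users


def grant(licence: Licence, acting_user: str, new_user: str) -> None:
    """Only the custodian may add a name to the authorised list."""
    if acting_user != licence.custodian:
        raise PermissionError(
            f"{acting_user} is not the custodian of {licence.name}"
        )
    licence.authorised_users.add(new_user)
```

The point of the design is that the decision about who may see an object sits with the custodian's list rather than with the portal or the person requesting access.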

Multilingual material
Multilingual Australia (ABS data):
In 2016, there were over 300 separately identified languages spoken in Australian homes
More than one-fifth (21 per cent) of Australians spoke a language other than English at home
An infrastructure with the stated aims of LDaCA has to be able to handle data:
From multiple languages
With multiple writing systems
With multiple annotations (translations, phonetics, syntax etc)
In principle, Unicode encoding and suitable fonts should be capable of doing this
How does it work in practice?

'Record language use in Australia' covers a huge range of possibilities: the current figures on the slide, plus at least 250 Australian languages spoken before European arrival. At least half of these are no longer spoken, but records remain (though not for all of them). Unicode is not always without problems, so how are we doing in meeting these goals so far? First, the architecture...

LDaCA Architecture

The LDaCA technical architecture is based on the Arkisto platform. Data is stored in the Oxford Common File Layout (OCFL), and data objects such as linguistic items and collections are described in detail using Research Object Crate (RO-Crate). RO-Crate is a linked-data approach to describing data, based on widely used standards for structural and descriptive properties such as dates and contributors, with extensions for language data being built on the work of the Open Language Archives Community (OLAC). RO-Crate is an international collaboration with diverse contributors. The specification is in English, and most RO-Crates at this point have English metadata and contents, but there is demand for content in other languages, and future versions of the spec will cover multilingual use cases.
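To give a feel for what an RO-Crate description looks like, here is a minimal Python sketch that builds the JSON-LD for a crate whose files are translations of one document. It follows the general shape of RO-Crate 1.1 (a metadata descriptor, a root dataset, and file entities) and uses the schema.org `inLanguage` property with standard language tags. The function name `build_crate`, the file names and the crate name are made up for illustration, and the sketch leaves out other root-dataset properties that the specification asks for (such as a description, publication date and licence). It is not taken from the LDaCA codebase.

```python
import json


def build_crate(name: str, files: list) -> dict:
    """Return a minimal RO-Crate-style metadata document as a Python dict."""
    file_entities = [
        {
            "@id": f["path"],
            "@type": "File",
            "name": f["name"],
            "encodingFormat": f["media_type"],
            # schema.org inLanguage, given here as a BCP 47 language tag
            "inLanguage": f["language"],
        }
        for f in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # metadata descriptor: says which entity is the crate root
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # root dataset: lists the files that belong to the crate
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "hasPart": [{"@id": e["@id"]} for e in file_entities],
            },
            *file_entities,
        ],
    }


# Hypothetical file names, one plain-text translation per language.
LANGUAGES = {"ar": "Arabic", "fa": "Farsi", "tr": "Turkish",
             "vi": "Vietnamese", "zh-Hans": "Chinese (simplified)"}
example = build_crate(
    "Example multilingual fact sheet",
    [
        {"path": f"factsheet.{tag}.txt", "name": f"Fact sheet ({label})",
         "media_type": "text/plain", "language": tag}
        for tag, label in LANGUAGES.items()
    ],
)

# ensure_ascii=False keeps any non-English metadata readable in the output.
print(json.dumps(example, indent=2, ensure_ascii=False))
```

Because the description is plain JSON-LD stored alongside the files, it can travel with the data wherever the object is housed.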

The demonstration material
All levels of government in Australia make documents available in multiple languages
Demonstration corpus uses documents from:
Services Australia
Department of Health (Victoria)
Languages:
Arabic
Farsi (Persian)
Turkish
Vietnamese
Chinese (simplified characters)

Simon pointed out here that these languages are written in three completely distinct writing systems (Turkish and Vietnamese use extended Roman scripts, and Farsi uses an Arabic-based script).
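The point about writing systems can be checked roughly in code. The sketch below labels each letter by the first word of its Unicode character name, which is a crude stand-in for the Unicode Script property (Python's standard library does not expose that property directly). The function names and the sample strings, which are just the names of the five languages written in those languages, are chosen for illustration.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Tally letters by the first word of their Unicode character name.

    A rough heuristic only: 'LATIN SMALL LETTER A' counts as LATIN,
    'ARABIC LETTER ALEF' as ARABIC, 'CJK UNIFIED IDEOGRAPH-4E2D' as CJK.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skips spaces, digits and combining marks
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


def main_script(text: str) -> str:
    """Return the most common script label, or 'NONE' if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else "NONE"


# Language names written in each language (illustrative samples only).
SAMPLES = {
    "Arabic": "العربية",
    "Farsi": "فارسی",
    "Turkish": "Türkçe",
    "Vietnamese": "Tiếng Việt",
    "Chinese": "中文",
}

for language, sample in SAMPLES.items():
    print(language, main_script(sample))
```

On these samples the heuristic groups Arabic and Farsi together, Turkish and Vietnamese together, and Chinese on its own, which matches the three systems described above.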

This quick demonstration screencast shows a work-in-progress prototype of the LDaCA portal, which will give controlled access to language resources to those who are licensed to see them. In this demonstrator we have openly available multilingual Australian Government documents in PDF and text format, and a small history dataset containing interviews with women from Western Sydney, Farms to Freeways. Eventually the LDaCA repository will contain a wide variety of data including speech, video, sign, images and digitized text, with a browse and search interface to allow researchers to find data they are interested in (provided, of course, that they have been granted an appropriate licence to view and use the data). In this demonstration our colleague Moises Sacal Bonequi performs searches in different languages to find repository objects of interest. Each object has multiple translations stored in separate files, in both PDF and text format.
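One practical wrinkle with searching text in several languages is that the same visible word can be stored as different sequences of Unicode code points. A minimal sketch of a search that allows for this is given below. It normalises both query and text to NFC and case-folds them before comparing. This is an assumption about one reasonable approach, not a description of how the portal's search works, and the document ids and sample strings are invented.

```python
import unicodedata


def normalise(text: str) -> str:
    """Normalise to NFC and case-fold so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


def search(query: str, documents: dict) -> list:
    """Return ids of documents whose text contains the query after normalisation."""
    needle = normalise(query)
    return sorted(
        doc_id for doc_id, text in documents.items()
        if needle in normalise(text)
    )


# Invented sample texts. The Vietnamese one is stored in decomposed form:
# base letters followed by combining accents (it renders as "Tiếng Việt").
SAMPLE_DOCUMENTS = {
    "doc-vi": "Tie\u0302\u0301ng Vie\u0302\u0323t",
    "doc-tr": "Türkçe",
    "doc-zh": "中文",
}

# The query uses the precomposed letter U+1EC7 (renders as "Việt").
query = "Vi\u1ec7t"

print(query in SAMPLE_DOCUMENTS["doc-vi"])   # plain substring test fails
print(search(query, SAMPLE_DOCUMENTS))       # normalised search finds doc-vi
```

A plain substring test misses the Vietnamese document because the stored text and the query use different code point sequences for the same letters, while the normalised search finds it.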

Why bother? 
What are the relevant data sources?
Lots more government documents
Serial publications:
Tim Sherratt lists 52 non-English sources in Trove
31 commenced publication before 1945
German prominent in C19
Substantial resources >1945 in Italian and Greek
Chinese publications always present
Use of LOTEs in Australia is under-researched, huge opportunities to collect data

Is there data in Australia which makes it worth worrying about this? Yes – at least two important sources of written material, plus this is an under-researched field with lots of questions to be answered and therefore lots of data to be collected. For example, there is research on differing usage in Vietnamese depending on speakers' time of arrival in Australia (1970s v. later), yet to be replicated with other similarly time-layered communities.

Acknowledgments
The Language Data Commons of Australia project received investment from the NCRIS-enabled Australian Research Data Commons (ARDC) through two of its programs:
Data Partnerships Program: Developing policy and technology foundations of a nationally integrated research infrastructure for language data collections of high strategic importance for the Australian research community.
HASS Research Data Commons and Indigenous Research Capability Program: Capitalising on existing infrastructure, securing vulnerable and dispersed collections and linking with improved analysis environments for new research outcomes.
Software developer: Moises Sacal Bonequi