macht.sprache. Behind the Scenes 3: Developing the Text Checker

8. December 2021 / poco.lit.

Tagged Approximate String Matching, case.sensitive., macht.sprache., Natural Language Processing, sensitive terms, sensitive translation, text checker, translation, translation manifesto, translation tool

Picture: Timur Celikel

The macht.sprache. project has been running since the beginning of 2021. It’s curated by us, the editors of poco.lit., while Timur Celikel and Kolja Lange are in charge of the technical side of the project. The knowledge that we were able to collaboratively build with the help of the discussions on machtsprache.de and a number of online events has now been utilised to develop a practical tool. Here we offer some insights into the thought processes behind the development of the Text Checker and explain why it is accompanied by a translation manifesto.

macht.sprache. is developing a tool to help those who work in the cultural sphere in both English and German to translate with greater sensitivity. The result is an initial version of a freely accessible Text Checker that finds and highlights potentially sensitive terms in a given text, and offers some insights and suggestions relevant to their translation.

To check a text for sensitive terms, users can type or paste it into the Text Checker’s input field. The Text Checker compares this text with the collectively compiled terms from the macht.sprache. database and highlights those terms that have been suggested as sensitive. It displays definitions and translation options for these terms. Relevant guidelines and basic principles from the translation manifesto frame the explanations of terms and the ranking of translation options. In addition, the Checker provides links to the discussion of the respective terms on machtsprache.de. Users can continue to contribute to the discussion, address specific translation examples, and add new terms.
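The basic comparison step can be pictured with a minimal sketch. It assumes a tiny in-memory dictionary in place of the real macht.sprache. database; the names `TERMS` and `find_terms` are hypothetical, and the code is not taken from the actual Text Checker. It only finds exact root forms, which is the limitation the next section deals with.

```python
import re

# Hypothetical mini-database: root form -> definition (assumption, not the real data model).
TERMS = {
    "white": "(adj) describes a political position of people who, in the context "
             "of racism, have comparatively easy access to social resources such "
             "as work or education.",
    "queer": "(definition omitted in this sketch)",
}

def find_terms(text, terms=TERMS):
    """Return (start, end, matched_text, definition) for every exact, whole-word match."""
    hits = []
    for term, definition in terms.items():
        # \b keeps "white" from matching inside longer words such as "whiteboard".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(0), definition))
    return sorted(hits)

for start, end, word, definition in find_terms(
    "A white woman sits in the garden of her white villa."
):
    print(f"{start}-{end}: {word}")
```

The character positions are what a user interface would need in order to highlight the term and attach the definition to it.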

In developing this Text Checker, we have not only confronted challenges presented by the fact that language is constantly changing and no linguistic recommendation will be pertinent forever, but have also encountered numerous technical challenges.

How can the Text Checker find variants of the terms in its database?

The underlying issue is that the macht.sprache. database stores only the root form of a given term, whereas in the texts that users run through the Text Checker, terms also occur in other forms. The Text Checker must therefore be able to recognise the following alterations:

Declension (different cases, differences in number and gender – this is more of an issue in German grammar)
Adjectives functioning as nouns (the whites)
Conjugation (according to person, number, tense, etc. – to queer, she queers)
Variations of terms (Person of Colour and People of Colour)
Different spellings, e.g. British and American English (Person of Colour and Person of Color).

The Text Checker should be able to find terms even when they do not exactly match the form in the database. Various technical approaches could conceivably be used to tackle this:

Approximate String Matching

Many commonly used autocorrect programmes are based on approximate string matching: words that almost match a term in the database are automatically linked to it. Thus, if one writes “Aboriginl”, the programme might suggest “Aboriginal”. Applied to macht.sprache., this mechanism could be used to link “Aboriginal” and “Aborigine”, for example. Unfortunately, approximate string matching can also go wrong and match terms incorrectly: “emotion” and “emoticon”, for instance, could be linked because they differ by only one letter. Matching in this way cannot take into account the fact that their meanings are completely different.
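One common way to measure how closely two words match is the Levenshtein (edit) distance. The sketch below is an illustration only: the choice of this particular measure, the threshold of 2 and the function names are assumptions, not a description of what the Text Checker does.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(word, terms, max_distance=2):
    """Return every database term within max_distance edits of word."""
    word = word.lower()
    return [t for t in terms if levenshtein(word, t.lower()) <= max_distance]

database = ["Aboriginal", "emoticon"]
print(fuzzy_lookup("Aboriginl", database))  # typo, one edit away
print(fuzzy_lookup("Aborigine", database))  # related term, two edits away
print(fuzzy_lookup("emotion", database))    # unrelated word, one edit away
```

The third call shows the weakness described above: by edit distance alone, “emotion” is closer to “emoticon” than “Aborigine” is to “Aboriginal”.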

Natural Language Processing (NLP)

NLP is a more recent approach that relies on machine learning. With the help of NLP we can find, among other things, the root forms (lemmas) of terms. These root forms are then compared to the terms from the macht.sprache. database. Since we are now comparing root forms, the problem of declension and conjugation is solved. Related words of different kinds are still not linked, however: “decolonise” is not the root form of “decolonisation”, because one is a verb and the other a noun.
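The idea of comparing root forms can be sketched as follows. To keep the example self-contained, a small hand-written table (`TOY_LEMMAS`, a hypothetical name) stands in for a real NLP lemmatiser, which would derive lemmas from a trained language model; the database contents are likewise invented for illustration.

```python
import re

# Stand-in for an NLP lemmatiser (assumption): inflected form -> root form.
TOY_LEMMAS = {
    "queers": "queer", "queered": "queer", "queering": "queer",
    "whites": "white",
    "decolonises": "decolonise", "decolonised": "decolonise",
}

# Hypothetical database of root forms.
DATABASE = {"queer", "white", "decolonisation"}

def lemma(token):
    """Return the root form of a token, or the lowercased token if unknown."""
    token = token.lower()
    return TOY_LEMMAS.get(token, token)

def match_by_lemma(text):
    """Return (token, lemma) pairs whose lemma is a database term."""
    tokens = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    return [(tok, lemma(tok)) for tok in tokens if lemma(tok) in DATABASE]

print(match_by_lemma("She queers the canon"))
print(match_by_lemma("They decolonised the syllabus"))
```

The second call finds nothing: the lemma of “decolonised” is the verb “decolonise”, which is not the same entry as the noun “decolonisation”.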

Thanks to funding provided by the Prototype Fund, we also hope to use NLP by March 2022 to tackle another challenge that is specific to German: the language has a number of gendering options. We want to implement NLP so that the Text Checker can recognise all terms that refer to people and display corresponding insights about the different gendering options. NLP is fairly good at tagging terms that refer to people, whereas a dictionary (or database) could probably never be exhaustive in this regard because there are so many such terms.

Entering Variants Manually

The last option is manual entry. This means that all conceivable variants – from conjugated forms to different spellings – would be added to the database by the macht.sprache. team. This option would in all likelihood be the most precise, but also the most labour-intensive. Unfortunately, entering all variant forms manually is not feasible for a small team of four which is financed by limited project funds.
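In data terms, manual entry amounts to a lookup table from every variant to its root form. The following sketch assumes such a table; the names `VARIANTS`, `build_index` and `lookup` are hypothetical, and the variant list is illustrative rather than taken from the database.

```python
# Hypothetical manually maintained table: root form -> hand-entered variants.
VARIANTS = {
    "Person of Colour": [
        "People of Colour",
        "Person of Color",
        "People of Color",
    ],
}

def build_index(variants):
    """Map every variant (and the root form itself) to its root form."""
    index = {}
    for root, forms in variants.items():
        for form in [root, *forms]:
            index[form.lower()] = root
    return index

def lookup(phrase, index):
    """Return the root form for a phrase, or None if nobody entered it."""
    return index.get(phrase.lower())

INDEX = build_index(VARIANTS)
print(lookup("people of color", INDEX))
print(lookup("Persons of Colour", INDEX))
```

The second lookup returns `None` because that form was never entered, which is the labour problem in miniature: the table is exactly as complete as the work put into it.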

At the moment, we’re using Natural Language Processing for the Text Checker and adding some variants via manual entry. Since we will continue to develop the project until at least March 2022, we are also considering experimenting with approximate string matching. A combination of different approaches seems to us to make the most sense overall and is in tune with our commitment to creativity and being open to new ideas. In general, we are aware that a project like macht.sprache.’s Text Checker can never be completely finished or all-encompassing – not least because language is constantly evolving.

Frequently occurring terms and context dependence

Some terms are sensitive in certain contexts, but quite unproblematic in others (e.g.: they, other, colour, white). The difference is context-dependent and difficult – but not impossible – for a machine to determine. It is possible to use Natural Language Processing to ascertain whether an adjective describes a person or an object, and to match it only if it describes a person (i.e. in the case of a “white woman” and not a “white house”). Unfortunately, since the project funding for macht.sprache. only covers a few more months for the time being, it’s likely we won’t have enough time and resources to implement this distinction.

In unclear cases, the Text Checker will thus highlight these terms too often rather than too rarely. It is ultimately up to the user to form an opinion with the help of the support offered by macht.sprache. – i.e. with the definitions, discussions and the translation manifesto – and decide on a translation option that best suits their needs. Here is an example of what the Text Checker would recommend for the term “white”:

“A white woman sits in the garden of her white villa.”

Definition of white: (adj) describes a political position of people who, in the context of racism, have comparatively easy access to social resources such as work or education.
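For the example sentence above, the distinction between an adjective that describes a person and one that describes an object could be sketched roughly as follows. This is a deliberately crude stand-in: a real NLP pipeline would analyse the grammatical relation between adjective and noun, whereas this sketch simply looks at the next word and checks it against a hypothetical list of person nouns. All names and word lists are assumptions.

```python
import re

# Hypothetical word lists, for illustration only.
PERSON_NOUNS = {"woman", "man", "person", "people", "author", "translator"}
AMBIGUOUS_TERMS = {"white"}

def flag_in_context(text):
    """Return (term, following_word, describes_person) for each ambiguous term."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    flags = []
    for i, token in enumerate(tokens):
        if token in AMBIGUOUS_TERMS:
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            flags.append((token, following, following in PERSON_NOUNS))
    return flags

for term, following, is_person in flag_in_context(
    "A white woman sits in the garden of her white villa."
):
    print(term, following, "sensitive" if is_person else "probably unproblematic")
```

Without such a check, both occurrences of “white” are highlighted, which matches the current behaviour of highlighting too often rather than too rarely.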

Interplay of the Text Checker and the translation manifesto

To complement the Text Checker, we’ve written a translation manifesto, and relevant sections of this manifesto are directly integrated into the Text Checker. We chose the form of the manifesto because it is a type of document that aims to bring about change in the world. A manifesto essentially arises from an inadequate status quo and wants to move readers to change something. The macht.sprache. translation manifesto thus offers arguments for a politically sensitive approach to language, and presents some useful basic principles and guidelines for translation.

The fundamental principles explain what macht.sprache. stands for and recommends in relation to translating sensitively. The general guidelines have been developed by the macht.sprache. team to support translators in making their decisions about the right translation for a given word. Short explanations with examples and terms from the macht.sprache. database serve to illustrate the relevance of the individual points.

Users are welcome to add new terms to the database and to discuss terms and translation examples. Only our joint efforts make macht.sprache. as a whole a helpful translation tool.

Authors: Timur Celikel, Anna von Rath & Lucy Gasser

Behind the Scenes 2 – on the code of conduct, assessing translations and getting involved

Behind the Scenes 1 – on design, accessibility and dealing with discriminatory language
