Multilingual search: Decompounding with language-specific lexicons

Marc von Wyl in Code and deep dives on Dec 1, 2021

Decompounding is the act of cutting a compound word into its meaningful parts. It’s an important process for a multilingual search engine, as in many languages, compounding is used profusely to generate new words. A German user looking for a Hundehütte (dog house) would like to get the same results as if they had entered Hütte für meine Hund (house for my dog). Since most ecommerce search is keyword based, once we’ve removed the stop words für (for) and meine (my), breaking the compound Hundehütte into Hunde + hütte should provide similar results as when querying Hütte für meine Hund.

In this blog post, we’ll define a compound and talk about the concept of decompounding. We'll discuss typical decompounding methods and walk you through our preferred method. We'll also discuss the challenge of knowing when not to decompound (such as in the case of airport being split into air and port).

What is a compound? What do we mean by compounding languages?

If you’ve had a haircut or seen a motorcycle, you’re familiar with compounds. A compound is the combination of two or more words (lexemes) to form a new word. Compounds are usually composed of nouns, adjectives, and verbs. For example:

Open, as in soy sauce
Hyphenated, as in probability-based and gray-haired
Closed, as in shortcut and whiteboard

In the context of search, we are interested in the last type, closed compounds, as the other two are easier to split.

Most if not all languages use compounding to generate new words. For example:

In Dutch: verjaardagskalender (birthday calendar) = verjaardag (birthday) + kalender (calendar)
In Finish: kukkakimppu (flower bouquet) = kukka (flower) + kimppu (bouquet, collection)
In Japanese: お好み焼き (okonomiyaki) = お好み (okonomi, preference) + 焼き (yaki, cooking)

German compounding & compounds

Compounding can produce extremely long words in some languages. In German, compounds have no size limits. In 1996, the Guinness Book of World Records recognized Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft (from the Association for Subordinate Officials of the Main Maintenance Building of the Danube Steam Shipping Electrical Services) as the longest published word.

In Germanic languages, a compound is usually composed of a head and one or several modifiers. The head is on the right (the end of the word), with the modifier branch on the left. They are often connected using linking morphemes. In linguistics, a morpheme is the smallest unit of meaning (think dogs = dog + s or uncomfortable = un + comfort + able), and linking morphemes are used to connect words. They can also be used to express a degree of inflection, such as when using the genitive or plural:

Hundehütte
Voorlichtingssysteem (Voorlichtingssysteem, prevention system in Dutch)
Eftermiddagskaffe (Eftermiddagskaffe, afternoon coffee in Danish)

Linking morphemes

Linking morphemes are language dependent. Here are examples from several languages:

Swedish: o, u, e, s
Dutch: s, e, en
German: s, e, en, n, nen, ens, ns

Some words may be partially cut when connected, as in Fahrschule (driving school); the verb Fahren (to drive) is cut to its stem.

What is decompounding?

Decompounding is the act of splitting a compound into its meaningful lexemes; extracting every meaningful unit.

One thing we must avoid is breaking down words a bit too much. For example, the Dutch word Voorlichtingssysteem (prevention system) can be cut into Voorlichting (prevention) + s (plural) + Systeem (system). Stopping there would make sense. We could keep going by cutting Voorlichting into Voor (in front of) + lichting (batch), but at that point, we’d lose the true meaning of the compound.

Typical methods of decompounding

Most decompounding methods are based on a lexicon of words that shouldn’t be split further. We will call these words atoms. For example, in the German word Wissenschaftskolleg (science college), the atoms are Wissenschaft (science) and Kolleg (college). Wissenschaft could be cut into Wissen (knowledge) and Schaft (shaft), but by doing that, we’d lose the original meaning. Using this lexicon, we can easily split compounds by finding the longest atoms that can fit perfectly inside them.

We can do this in several ways. Decompounding technologies apply different methods:

JWord Splitter recursively cuts from left to right
Banana Splitter goes from right to left, greedily trying to match the longest atoms
ASV Toolbox uses a Patricia tree

Advanced decompounding techniques: frequency & natural language processing (NLP)

More advanced methods use statistical properties of words. They look at the frequency at which the atoms appear in large quantities of data and split the word only if the probability of the whole word as meaningful is lower than the probabilities of its parts. NLP machine learning has also been used to solve the decompounding word-splitting problem. In one of the most cited resources on decompounding query words^footnote-1, the authors combine statistical properties of a word, such as frequency of appearance in a large corpus and co-occurrences in queries, and combine them using a support-vector machine (SVM).

More recently, researchers Martin Riedl and Chris Biemann proposed a method called SECOS, which uses statistical properties of words and distance on a distributional thesaurus for splitting compounds into likely and related words.

Our approach: Building a good lexicon for decompounding

Speed is critical to search. We cannot implement an approach that would significantly slow down a search engine and thereby break such features as search-as-you-type. Thus, we cannot use a complex or heavy machine learning-based approach at query time.

With this in mind, we concluded that a good lexicon is the most important part of an efficient decompounding system. Below is an example of a German testing set for which we computed the following measures using a lexicon made up of only the atoms present in the testing set. The matches were made by using a simple greedy longest match algorithm. Half the test set consisted of words that should not be decompounded, for example Handschuh, glove, which should not be divided into Hand + schuh (hand + shoe); the other half contained compounds that should be split.

Compound	Translation	Split
Wissenschaftskolleg	Science college	Wissenschaft + Kolleg
Hundehütte	Dog house	Hund + Hütte
Handschuh	Glove	Handschuh
Hinterziehung	Evasion	Hinterziehung

We used all the atoms of the test set as our lexicon (e.g., Wissenschaft, Kolleg, Hunde) and got the following results:

Method	Vocabulary	F1-score	Precision	Recall	Accuracy
Greedy longest	3,702 words	0.998	0.998	0.998	0.999

This is quite impressive, but it’s cheating. In real life, we don’t have a perfect set of atoms. For example, a lexicon composed only of Hund and Hütte would have a perfect score if the test set were made up of only Hundehütte, but it wouldn’t fare well on real data.

We needed to build a more reliable and exhaustive lexicon. Our approach included these steps:

Finding a large amount of text in our target language
Extracting frequently used nouns, adjectives, and verbs to create a first lexicon
Trimming the lexicon by keeping only words that could not be decompounded by other words with respect to a given score (detailed below)
Enriching the lexicon with potential variations of its words

Finding a large quantity of data in a target language

Free-to-use, unsupervised textual data is available online from several sources. We like Wikimedia data, especially Wikipedia and Wikinews data. (For languages for which Wikipedia does not provide enough data, OSCAR is also an option, though we haven’t used it). Of course, the data had to be properly cleaned by extracting the text of the article and then cleaning it using standard text preprocessing, as well as using a language classifier to make sure that most of our extracted sentences were in the desired language (online options include fastText, langdetect, and CLD3).

Extracting parts of speech: nouns, adjectives, and verbs

Now that we had a large quantity of text in our target language, we wanted to find potential atoms within it. We knew that compounds are made of nouns, adjectives, and verbs, and that if we could find frequently used nouns, adjectives, and verbs in our data, we would have a good set of potential atoms.

To extract our atoms, we used part-of-speech tagging (POS tagging), a classification task that matches words with their parts of speech (e.g., nouns, pronouns, adjectives, adverbs). Fortunately, there are libraries that have deep learning models trained for part-of-speech tagging in several languages. The most well known are spaCy, Stanza, flair, and trankit. We also counted the number of occurrences of these potential atoms, which would be useful in our next step.

Lemmatization is identifying the base or dictionary form of an inflected word (its lemma). For example, the base form of dogs is dog, the base form of was is be, and the base form of walking is walk. As words are subject to morphological changes, we lemmatized them to get a more exhaustive amount. Then we filtered out infrequently used words, as they might have been noise or typos, as well as very long words, which were unlikely to be atoms. [latexpage]

Trimming our lexicon

At this stage, our lexicon was composed of relatively frequently used nouns, adjectives, and verbs that we considered potential atoms. However, a lot of these potential atoms were most likely compounds. To filter out these unwanted compounds, we used the statistical properties of our potential atoms. If a given word can be decompounded by a combination of other words, and the geometric mean (see below) of the probabilities of those other words is higher than the probability of the whole word, we took it out of our lexicon.

We computed the probability of a potential atom as:

Formula for calculating the probability for a potential atom

Where totalwordcount is the total number of words in our corpus and wordcount( $ W_{i} $ ) the number of times the word $ W_{i} $ appeared in the corpus.

We computed the geometric mean of probabilities of a potential decompounding $ W_{d} $ as:

Where N is the number of atoms in the split.

If, for a given word $ w_{n} $, we can find a split $ W_{d} $ for which the geometric mean of probability p($ W_{d} $) is higher than the probability of the full word p($ w_{n} $), we can remove $ w_{n} $ from our lexicon. Note that we also take into account language-specific linking morphemes in our potential splits.

For example, if in our corpus we had Akustikgitarre 50 times, Akustik 75 times, and Gitarre 150 times, with a total word count of 50,000:

Formula for calculating total word count of a corpus

Because Akustikgitarre has a lower probability than Akustik and Gitarre combined, we can remove it from our lexicon.

Enriching the lexicon

For our final step, we reverted what we did with the lemmatization. We have an internal declension dictionary that lists potential variations of given words in several languages. We used it to add all the variations of our atoms to our atom lexicon.

Decompounding algorithm

Now that we have a lexicon, we can apply a decompounding algorithm to potential compounds at indexing and query time. Unfortunately, using the decompounding on the geometric mean of probability described above is computationally costly (it produces an exponential number of potential splits). Which means that if we have a word such as Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft, with about 20 potential splits, we would have to test a million possibilities. So we stick to a simple right-to-left (right to left because all of our target languages are left branching), longest match algorithm. This greedy algorithm simply starts from the end of the word and tries matching the longest possible word in our lexicon, growing right to left. It then continues with the remainder of the word until it matches all the atoms in it. We also allow language-specific linking morphemes between atoms.

Evaluating the decompounding

We compared our results with those of SECOS^footnote-2. Our evaluation metrics are detailed in the appendix.

As a testing set, we used the German data set provided on the SECOS GitHub page. The base data set provides only compounds and no words that should not be decompounded (such as Handschuh).

Having words that should not be decompounded was important when evaluating the precision of our method. We utilized atoms of at least 7 characters that were not decompounded by other atoms in the test set as our should-not-be-split set. The final test set is composed of 3,702 words, including 1,851 compounds and 1,851 words that should not be decompounded. It is slightly noisy but good enough for comparing methods.

The following plots were made by modifying the size of the generated lexicon (using word frequency as the filtering criteria). Note that as the lexicon grows, the recall increases (we are able to decompound more compounds), but the precision falls (we start decompounding words that shouldn’t be split).

Combinatorial uses the combinatorial method described above (using the geometric mean) on the lexicon built by our method up to the third point (before augmenting the lexicon with our declension dictionary). Greedy represents the greedy method using the same lexicon. Greedy + alternative is the full lexicon enriched by our declension dictionary (up to point 4). Note that the vocabulary size for greedy + alternative is not right, as it doesn’t count the additional declensions. SECOS vocabulary does not vary, and we used the pre-trained model provided on their GitHub page.

Once again, we can’t use the combinatorial method in practice because of its computational cost. Our method (greedy + alternatives) reaches a close F1-score with a much faster computation time.

You may have noticed that the SECOS results are not similar to the ones reported in the researchers’ article. That’s because our evaluation was stricter than the one they used. We followed the evaluation defined in an article on compound splitting methods^footnote-3 and considered a split wrong even if it was partially correct. For example, while Wissenschaftskolleg -> Wissen + Schaft + Kolleg is completely wrong for us, it would be partially correct for the SECOS evaluation. Nevertheless, using their evaluation script on the German test set, their method had an F1-score of 0.87, while ours (greedy + alternatives) reached 0.92.

The decompounding feature in action

We support decompounding for six languages: Dutch (nl), German (de), Finnish (fi), Danish (da), Swedish (sv), and Norwegian Bokmål (no). To learn more about decompounding, take a look at our documentation on splitting compound words.

The generated lexicons are good for general data, but many of our customers work in niche fields that have very specific vocabularies. Using our customer dictionaries, you can force the decompounding of a given word in any way you want.

In summary

We showed that finding a suitable lexicon is the most important part of a good decompounding algorithm. We implemented a way to compute a good lexicon, which needs only:

A large quantity of target language text
A part-of-speech tagging classifier in the target language
A list of language-specific linking morphemes

This lexicon, combined with a greedy longest match decompounding algorithm, beats more-complex implementations and doesn’t sacrifice query and indexing speed.

Appendix

A.1 Measures

To evaluate the quality of a decompounding, we used the usual tests of precision, recall, F1-score, and accuracy. For precision, recall, and F1-score, we followed the definition that Koehn and Knight used in most of their papers on decompounding:

Correct split = #words which should be split and were split correctly
Correct non = #words which should not be split and were not
Wrong not = #words which should be split but were not
Wrong split = #words which shouldn’t have been split but were
Wrong faulty = #word which should have been split, were split, but wrongly
Correct = correct spilt + correct non
Wrong = wrong not + wrong split + wrong faulty

Which brings us to:

We also consider a split correct if it matches the desired split +/- linking morphemes. For example, when decompounding AdressenListe, if we find Adressen + Liste instead of Adresse + Liste, we still consider it a correct split. Note that we are using a strict evaluation, as it considers a split wrong even if it is partially correct.

Decompounding query keywords from compounding languages. E. Alfonseca et al., Proceedings of ACL-08: HLT, Short Papers, 2008.↩
Unsupervised Compound Splitting With Distributional Semantics Rivals Supervised Methods. M. Riedl, C. Biemann, HLT-NAACL, 2016.↩
Empirical Methods for Compound Splitting. P. Koehn, K. Knight, 10th Conference of the European Chapter of the Association for Computational Linguistics, 2003.↩

tech blog.Developer's Rooftop

// words behind the code