Tags: Fake Word Generator, Markov Chain, PanLex Translator, Shakespeare, stoat, text generator, translation, translation quality

Each month, PanLex generates and publishes new “fake words” such as “unequalitis” and “adjustache” to entertain our newsletter readers in the Fake Word of the Month challenge. But how, exactly, are these fake words generated? We use an emergent property of the linguistic information contained in the PanLex Database and a simple probabilistic algorithm.

If you have used the PanLex Translator, you may have noticed that beside each translated word is a small red bar of varying length. This bar represents PanLex’s translation quality score, a measure of the level of confidence we have in that particular translation of the original word into the target language. The translation quality score is based on the number of PanLex sources the translation is found in, and the quality of those sources. (PanLex can also infer translations that are not directly attested in any single source. We will leave discussion of inferred translation quality scores to a future post.)

PanLex Translator App with red bars indicating relative quality of translations of English “house” into French.

Each of the thousands of sources (dictionaries, databases, etc.) that PanLex consults for translation data is assigned a quality score ranging from 0 to 9, indicating our opinion of the accuracy of translations based on the source. A well-researched dictionary by a trained linguist will receive a quality score of 8 or 9. A simple wordlist produced by a hobbyist and published on a blog may receive a much lower score. These quality scores allow PanLex to make use of a very wide variety of sources, especially for languages with scant published data, while maintaining accuracy in translation results. To calculate a translation’s quality, we sum the quality scores of all the sources that attest the translation. This means that a high score can result from a translation that is attested either in a few high-quality sources or in many low-quality sources.

Unexpected benefit

An interesting unintended property discovered early in the history of PanLex is that expression scores can be calculated as well as translation quality scores. An expression score is the sum of the quality scores of the sources in which the expression (an individual word or word-like entry) is found. An expression score is essentially a representation of how likely an individual word is to be found in a dictionary, and the PanLex team hypothesizes that it is an approximation of the likelihood that the word is part of a speaker’s vocabulary. For example, a word like “house” is likely to be found in many translation dictionaries and receives a very high expression score. It is also extremely likely to be in a speaker’s vocabulary. On the other hand, a more obscure word like “stoat” is found in far fewer dictionaries, receives a much lower expression score, and is also much less likely to be part of a speaker’s vocabulary. We believe that emergent properties in the PanLex Database such as expression scores are an area ripe for further research, both inside and outside the PanLex team.
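Both scores described in this post are sums of source quality scores, so the arithmetic is easy to illustrate. What follows is a minimal sketch in Python, not PanLex's actual code: the source names, their quality scores, the word pairs, and the function names `translation_scores` and `expression_scores` are all invented for illustration. It also simplifies by treating expressions as bare strings, ignoring which language they belong to.

```python
from collections import defaultdict

# Hypothetical sources: name -> (quality score from 0 to 9, attested translations).
# Each translation is an (English expression, French expression) pair.
# All names and numbers here are made up for illustration.
SOURCES = {
    "linguist_dictionary": (9, {("house", "maison"), ("stoat", "hermine")}),
    "hobbyist_wordlist": (2, {("house", "maison"), ("house", "logement")}),
    "community_glossary": (3, {("house", "maison"), ("house", "logement")}),
}


def translation_scores(sources):
    """Sum the quality scores of every source that attests each translation."""
    scores = defaultdict(int)
    for quality, translations in sources.values():
        for pair in translations:
            scores[pair] += quality
    return dict(scores)


def expression_scores(sources):
    """Sum the quality scores of every source in which each expression is found."""
    scores = defaultdict(int)
    for quality, translations in sources.values():
        # Collect each expression once per source, however many
        # translations within that source it takes part in.
        expressions = {expr for pair in translations for expr in pair}
        for expr in expressions:
            scores[expr] += quality
    return dict(scores)


if __name__ == "__main__":
    for pair, score in sorted(translation_scores(SOURCES).items()):
        print(pair, score)
    for expr, score in sorted(expression_scores(SOURCES).items()):
        print(expr, score)
```

With this invented data, “house” to “maison” scores 14 (9 + 2 + 3), while “house” to “logement” scores only 5 because it is attested solely in the two low-quality sources. Likewise the expression “house” scores 14 and “stoat” scores 9, since “stoat” appears in a single source.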