Never Lose a Dead End
I tried to find the cause of what looked like an interesting text encoding bug. There wasn't one.
The background: ftfy
I make a Python package called ftfy, which detects and fixes encoding problems ("mojibake") in Unicode, like when someone means to say "İstanbul" but it comes out as "Ä°stanbul". Lots of people use it. It's not a machine learning thing; it's a heuristic that I design by hand.
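The whole interface is essentially one function. A minimal demonstration, using the example above:

```python
import ftfy

# "İstanbul" was encoded as UTF-8 but decoded as a legacy Windows encoding.
print(ftfy.fix_text("Ä°stanbul"))
# İstanbul
```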
One of ftfy's users recently asked if I could make it possible for ftfy to detect encoding mix-ups involving the Windows-1257 encoding, which in the dark days before UTF-8 was used for a lot of Lithuanian and Latvian text. I worked on it and adjusted the heuristic, and now version 6.3 can do that.
Looking for a real-world example
I could easily construct artificial examples of Lithuanian mojibake, but then I'd just be testing whether my own assumptions about Lithuanian mojibake fit my own assumptions about Lithuanian mojibake, so I wouldn't be testing anything at all. I wanted to see the new heuristic work to fix real text in the wild.
I've collected a bunch of interesting test cases. I used to use the Twitter API to find these. That's gone, of course. Now I use the OSCAR corpus, a collection of fragments of Web pages found by crawling the web, supposedly with a lot of the spam filtered out. (It's targeted at AI training, which is not what I'm doing with ftfy, but it does work for the purpose of finding test cases.)
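The search script is nothing fancy. Here's a minimal sketch of the approach; the Hugging Face dataset identifier and record layout are my assumptions (recent OSCAR releases are gated, so the details may differ):

```python
from datasets import load_dataset
import ftfy

# Stream the Lithuanian portion of OSCAR instead of downloading all of it.
# ("oscar-corpus/OSCAR-2301" and the "lt" config are assumptions.)
stream = load_dataset("oscar-corpus/OSCAR-2301", "lt", split="train", streaming=True)

for record in stream:
    for line in record["text"].splitlines():
        if ftfy.fix_text(line) != line:
            # ftfy would change this line, so it's probably mojibake.
            print(line)
```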
By running a script that searched through OSCAR until it found what looked like Lithuanian mojibake, I found this sentence:
Å iaip ÄÆdomu, kaip ÄÆsivaizduoji.
and ftfy decoded that to:
Šiaip įdomu, kaip įsivaizduoji.
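In code, that's the one-liner you'd hope for (with ftfy 6.3 or later):

```python
import ftfy  # needs ftfy >= 6.3 for Windows-1257 support

print(ftfy.fix_text("Å iaip ÄÆdomu, kaip ÄÆsivaizduoji."))
# Šiaip įdomu, kaip įsivaizduoji.
```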
Yes! YES! It worked on real text that existed on a web page! There are even multiple layers of mishaps that appear to have happened to the text:
- The text was written in the UTF-8 encoding (the encoding used almost everywhere today).
- It was decoded in the Windows-1257 encoding (the old Windows encoding for Baltic languages).
- The decoded first word contained a non-breaking space ("Å<NBSP>iaip"), which got reinterpreted as just a space.

ftfy can handle that, and it was great to see it handle it in this new encoding.
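You can reproduce the first two layers, including the tell-tale non-breaking space, in a couple of lines of Python:

```python
# Encode correct Lithuanian as UTF-8, then misread the bytes as Windows-1257.
broken = "Šiaip įdomu, kaip įsivaizduoji.".encode("utf-8").decode("windows-1257")
print(repr(broken))
# 'Å\xa0iaip ÄÆdomu, kaip ÄÆsivaizduoji.'   (\xa0 is the non-breaking space)
```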
I don't just want to paste stuff into my test cases if I don't know what it's about, so I asked Google Translate what this sentence meant:
Anyway, it's interesting how you imagine.
Hm. That's a vague sentence. I needed more context.
Finding the page was a bit difficult. OSCAR gave me the URL for it, but the domain it was on (ratu.lt) is now a parked domain. This would have been a job for the Internet Archive, except that it got cyberattacked and went offline that very day.
Today, the Internet Archive is mostly functioning again, so I could find the original page and try to figure out what it's about. The title translates to "Never Lose a Dead End Lyrics Review".
This translation seems a little weird
Finding the full text of the page was exciting at first, because ftfy could fix parts of it but not the whole thing. There seemed to be more layers to the encoding bug than I'd seen in just that one sentence.
Some sentences on the page have accented letters that look entirely normal. Others are full of mojibake. It's not even consistent — a glitched-out word containing an em dash ("vaikystÄ—je") comes right after an em dash that was actually used for its purpose as punctuation.
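In fact, that word is exactly what the UTF-8 bytes for "vaikystėje" ("in childhood") look like when read as Windows-1257, stray em dash included:

```python
# "ė" is UTF-8 bytes C4 97; in Windows-1257, byte 0x97 is the em dash.
print("vaikystėje".encode("utf-8").decode("windows-1257"))
# vaikystÄ—je
```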
That's okay. I've seen encoding errors become multifaceted and inconsistent, especially when Web pages are involved. There's one way to do Unicode right (UTF-8) and limitless ways to do it wrong. I asked Google Translate to translate more of the text, so I could try to understand more what's going on.
If you can read Lithuanian and you loaded the archived page, you can presumably already tell that something else is off. It took me some time to come to the same realization.
The text starts with a confusingly flowery introduction, then settles into what seems to be some kind of interview, discussing the lyrics of a Lithuanian song, "Niekada nepamesk aklavietės" (which might mean "Never Lose a Dead End"). The formatting of the page is apparently all broken, so it's hard to follow who's saying what.
One of the paragraphs turns out to mean "weight loss stories simple tips to get rid of belly fat". Okay, so this is a copy of the text with spammy links inserted into it; that's a familiar enough thing, and it might even be the point where the encoding error was introduced.
The song title keeps being mentioned. One of the speakers says that in childhood you might have "heard these lyrics to death".
If I could find this song, "Niekada nepamesk aklavietės", that Lithuanians have supposedly heard to death, I'd have the context. Oddly, the lyrics — supposedly the focus of the article — are not quoted anywhere in it. I started searching for the song title. There are no Google results.
The interview is odd. I know that Google is probably mistranslating some things, but the conversation seems to have no particular topic except for music, vaguely. Queen, Bach, and Deep Purple are name-dropped, but for no clear reason. Aside from the song title recurring, and the themes of making music and listening to music, no paragraph really has a connection to the previous one.
There's a YouTube video for a Lithuanian song embedded on the page. It is not the song they are talking about. The name of the group performing it translates to "The Lithuanians".
Farther down, one of the headings is "See, that’s what the app is perfect for." in English.
Finally, I have the realization
I've been bamboozled. This entire page is AI-generated.
There was no interview. There was no song. And there was no process with an encoding error that created the page.
The page is just complete nonsense. It's a Large Language Model generating words that supposedly statistically go with each other in Lithuanian. And its statistics think that reasonable Lithuanian words include "kokybÄ—s" and "ÄÆdomu" and "www.youtube.com/embed/PmqdXrR9wrU".
There never was an encoding error that led to this web page. The LLM that created it generated fake encoding errors because that's what it believed Lithuanian looks like.
Maybe that means that the examples the LLM was trained on include the kind of wild mojibake I was looking for in the first place, but more likely they just fucked up their own training data.
Text is fake now
I know text with mojibake in it is not the highest-quality writing. It means, for example, that nobody proofread it. But I love finding mojibake anyway, because it helps tell the story of what goes wrong with Unicode, which helps me try to fix it.
In this situation, it's like I'm a raccoon rooting through the garbage, and finding that it's all fake garbage molded from plastic to look like garbage.
I recently updated the documentation of another Python package, "wordfreq", essentially saying that I wouldn't be updating the frequency list anymore because text is fake now. This got a surprising amount of attention, mostly sympathetic. Journalists e-mailed me asking me to elaborate, or prove it, or talk about AI on their podcast, all of which I declined. But this whole silly nonsense is an example of the same thing. Don't worry, I'm still working on ftfy.
So I do not yet have a real-world example of Windows-1257 mojibake that ftfy can fix. My search led to a dead end. Which makes the fake title of the fake song oddly appropriate.