Estimating filename encodings with Okteta

Today I had to make a backup of a user’s complete home directory on a remote system. Options might have been scp or similar, but I opted for the KDE way: Opening a Konqueror with sftp://user@host/ and a Konqueror with a local directory. Set “Show hidden files” to true, selected all directories and files in the home directory (don’t ask me why I didn’t just select the home directory itself, perhaps I wanted to see what I was copying), then drag’n’dropped all into the local directory.

Fine so far: the sftp kio-slave started to collect all dirs and files (more than expected, I had missed some copies of big Java libraries, bah), did so for a while, until it complained about some file names with unknown chars in it (๏ฟฝ). Turned out there had been a system update which changed the official encoding of the file names to UTF-8, without touching existing names. Which hurts if you use non-ASCII characters, like you do if your language has e.g. umlaute. And then there were several file transfers from different operating systems which used even other encodings (remember, with unixoid(?) systems the filenames inside the filesystem are just handled as strings of bytes, and it’s up to the programs/libs to do the de-/encoding, based on environment variables).
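
To make the "strings of bytes" point concrete, here is a minimal Python sketch (not something from the original session). The file name `Bär.txt` and its byte values are made up for illustration; the sketch only shows how the same bytes read under ISO 8859-1 and under UTF-8.

```python
# Bytes as they might sit on disk for a hypothetical file "Bär.txt"
# that was written by a program assuming ISO 8859-1 (assumed example data).
raw_name = b"B\xe4r.txt"


def decode_or_none(raw: bytes, encoding: str):
    """Return the decoded name, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


as_latin1 = decode_or_none(raw_name, "iso-8859-1")  # decodes to "Bär.txt"
as_utf8 = decode_or_none(raw_name, "utf-8")         # None: 0xE4 alone is not valid UTF-8

# A lenient UTF-8 reader substitutes U+FFFD, the replacement character,
# which is the "unknown char" a file manager ends up showing.
as_utf8_lenient = raw_name.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
print(as_utf8_lenient)
```

The bytes never change; only the assumption made by the reading program does.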

Now there is a great feature in Konqueror (or the sftp kio-slave?) that lets you set the encoding to apply to the names in the remote filesystem (menu “Extras”->”Set remote encoding”). But this only works if exactly one encoding has been used for all the file names, not several as in this case.
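
One way to find out which names are affected at all is to ask for them as raw bytes and check which ones fail to decode as UTF-8. The following is a rough Python sketch under that assumption; the helper name `split_by_utf8_validity` is invented for this example, and it only looks at a single directory.

```python
import os


def split_by_utf8_validity(raw_names):
    """Split raw (bytes) file names into those that are valid UTF-8 and those that are not."""
    valid, invalid = [], []
    for raw_name in raw_names:
        try:
            raw_name.decode("utf-8")
            valid.append(raw_name)
        except UnicodeDecodeError:
            invalid.append(raw_name)
    return valid, invalid


if __name__ == "__main__":
    # Passing a bytes path makes os.listdir return the names as undecoded bytes.
    ok, suspicious = split_by_utf8_validity(os.listdir(b"."))
    for name in suspicious:
        print(name)  # shown as a bytes literal, so the odd bytes stay visible
```

Names that land in the second list still need an encoding guess each; the check only says that they are not UTF-8, not what they are.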

Remote commandline access and convmv from Björn Jacke to the rescue. First guess for the old names was ISO 8859-1, which solved most, but not all. As I am not too experienced with which strangely named encoding has been used on which system, for the rest I tried two or three which sounded familiar, but was not successful. As it was quite a bit of work to change the encoding all the time in the -f parameter on the commandline, I… no, did not write a foreach loop with a pause (Bash syntax is not up my sleeve), but just had the bytes of the filenames in question written to a file, with
ls --literal >listing
And bytes in a file, yes, can be looked at with Okteta 🙂
Right-clicked on the file “listing” in the Konqueror view of files accessed with sftp, “opened with” Okteta, put the mouse cursor over the char encoding combobox in the status bar, then stepped through the encodings with the wheel until the char in question showed the expected letter. Gave that encoding as parameter to convmv, repeated the same with another name in yet another encoding, and finally convmv reported no remaining names that were not valid UTF-8, phew. Not a use case I had in mind for Okteta so far 😉
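
The wheel-through-the-encodings trick can also be sketched in Python: decode the raw bytes with a handful of candidate encodings and keep those that produce the name you expected. The candidate list, the function name and the sample bytes below are assumptions picked for illustration, not the names from the actual backup.

```python
# Candidate encodings to try; an assumed shortlist of common legacy
# Western encodings, using Python's codec names.
CANDIDATES = ["iso-8859-1", "iso-8859-15", "cp1252", "cp437", "cp850", "mac_roman"]


def encodings_yielding(raw: bytes, expected: str, candidates=CANDIDATES):
    """Return the candidate encodings under which raw decodes to the expected text."""
    matches = []
    for encoding in candidates:
        try:
            if raw.decode(encoding) == expected:
                matches.append(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding at all.
            continue
    return matches


# Hypothetical name "Bär.txt" as a DOS-era tool might have stored it.
print(encodings_yielding(b"B\x84r.txt", "B\u00e4r.txt"))  # ['cp437', 'cp850']

# The same name as an old Mac might have stored it.
print(encodings_yielding(b"B\x8ar.txt", "B\u00e4r.txt"))  # ['mac_roman']
```

As the first result shows, several encodings can agree on a single letter, so a match on one name is an estimate rather than proof. Checking a second name with a different special character narrows it down.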

Now repeated the initial drag’n’dropping and had all files backed up, including the original timestamps, but with ownership updated to my local account. Just as I wanted it. That part was also a pleasure, Konqueror, Dolphin-KPart and sftp kio-slave authors, thanks 🙂

Update: Just curious. What would have been the real admin’s approach to this problem?


2 thoughts on “Estimating filename encodings with Okteta”

  1. Interesting 🙂

    I’m not an admin per se, but I think I would have gone with something like ssh user@host 'tar cf - /home/user' > userhome.tar

  2. I think the job of the admin is the job of getting things done — the more experienced admins probably already have a list of tricks, but there’s no official way that I’m aware of.

    BTW, that convmv looks really interesting; I could imagine finding it handy myself.
