Tuesday, November 1, 2016

Short Tip: Find files with non-ASCII characters in their names

I have a directory full of files downloaded from the Internet, and it turned out that some of their names contain non-ASCII (UTF-8 encoded) characters, which broke indexing. So I had to find every file whose name contains such characters. The solution I found was the following:
LC_ALL=C find . -name '*[! -~]*'
When run in a terminal, this command prints all file names that contain non-ASCII characters, with those characters shown as question marks. A few facts about this command:
  1. The assignment (LC_ALL=C) switches to the C locale only for the duration of the find(1) command. As a result, find(1) does not interpret multibyte UTF-8 sequences as characters but processes the input strictly byte by byte.
  2. find(1) then searches for file names that contain at least one byte outside the printable ASCII range. To see this, look at the glob pattern. The first and last stars mean that the bracket expression can match anywhere within the file name. The bracket expression itself matches any single character outside (the exclamation mark negates the set) the range from space (ASCII code 32) to tilde (ASCII code 126).
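To check the pattern without touching real data, it can be tried on a scratch directory. This is only a sketch: the directory and file names are made up for the demonstration, and -print0 is an extension (available in GNU and BSD find) used here so the matches can be counted reliably.

```shell
# Scratch directory with one plain-ASCII name and one name containing UTF-8 bytes.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.txt"
touch "$demo_dir/$(printf 'caf\303\251.txt')"   # "café.txt", written as raw bytes

# Each match is terminated by a NUL byte; keep only the NULs and count them.
matches=$(LC_ALL=C find "$demo_dir" -type f -name '*[! -~]*' -print0 | tr -cd '\0' | wc -c)
echo "files with non-ASCII names: $matches"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Only the second file should be counted, because -name is matched against the base name of each file and plain.txt consists entirely of bytes in the range from space to tilde.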
The output of find(1) will include a question mark in each place where a byte has a value below 32 or above 126. To see which Unicode character is in a particular place, you can pipe the output to, e.g., the cat(1) command, like this:
LC_ALL=C find . -name '*[! -~]*' | cat
This works because find(1) (at least the GNU version) replaces non-printable bytes with question marks only when its output goes directly to a terminal. When the output is piped, find(1) writes the file names byte for byte. cat(1) passes those bytes through unchanged, and the terminal, which is still set up for UTF-8, decodes the multibyte sequences and displays the actual characters. The value of LC_ALL isn't changed for cat(1), but that hardly matters, since cat(1) doesn't interpret anything.
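If the terminal itself cannot render the characters, the raw bytes can be inspected instead. The following is a minimal sketch, again using a made-up scratch directory and file name; it relies on od(1) to dump whatever bytes come out of the pipe as hexadecimal.

```shell
# Scratch directory with a single file whose name contains the UTF-8 bytes c3 a9 ("é").
demo_dir=$(mktemp -d)
touch "$demo_dir/$(printf 'caf\303\251.txt')"

# Print only the base name of each match and dump it as hex, one byte per pair.
hex=$(LC_ALL=C find "$demo_dir" -type f -name '*[! -~]*' -exec basename {} \; | od -An -tx1 | tr -d ' \n')
echo "$hex"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

The pair c3 a9 in the dump is the two-byte UTF-8 encoding of a single character, which shows that the multibyte sequence reaches the pipe intact.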
