Saturday, October 18, 2014

More RegEx Recipes for Text Processing

Yesterday I wrote a piece on some elementary Find-and-Replace techniques in Notepad++ for handling an issue that occasionally arises with transcripts.

In this post I thought I'd add some real RegEx recipes for more advanced text handling.

(I will be using two programs.  The first is the previously mentioned Notepad++, free (as in beer) software that I use extensively.  The other is TextPad by Helios Software Solutions.  It is only about $30 for a license, and if you need the advanced features, it is WELL worth the price.  I will indicate which recipes must use TextPad; otherwise the default is Notepad++.)

(I will refer to these RegEx processes as recipes, meaning there is a Find portion and a Replace portion.  I will use "F:" and "R:" instead of spelling out "Find:" and "Replace:".  The quotes around each recipe are there solely to make it easier to read; unless noted, they are NOT part of the recipe.  I will show a space as <space>, since spaces are difficult to see online.  For example: F: "\r\n<space>\r\n".)

Alright, first an explanation of the recipe for replacing the erroneous EOL characters from the earlier post.

That recipe was F:  "\r\r\n" R:  "\r\n\r\n".
     Explanation:  Find the lonely Carriage Return followed by the correct EOL sequence (\r\r\n) and replace it with the correct EOL sequence - twice (\r\n\r\n).
     Summary:  Clean up files with this odd combination of EOLs.
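For anyone who would rather script this than click through a dialog, here is a rough Python sketch of the same recipe.  The function name and the sample text are made up for illustration; the only thing taken from the recipe is the find string and the replace string.

```python
def fix_stray_cr(text: str) -> str:
    """Replace each CR CR LF sequence with two CR LF pairs (the recipe above)."""
    return text.replace("\r\r\n", "\r\n\r\n")


# Hypothetical sample: the first line ends with a stray CR before its CR LF.
sample = "line one\r\r\nline two\r\n"
fixed = fix_stray_cr(sample)
print(repr(fixed))
```

Like the Notepad++ version, this leaves the text double-spaced wherever a stray CR was found, so it is normally followed by the double-spacing recipe.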

Another version of this issue is a file that ONLY has Line Feed characters at the end of EVERY line.
This recipe would be F:  "\n"  R:  "\r\n"
(The explanation and summary are left as an exercise for the reader.)
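One caution if you script this one: the recipe assumes EVERY line ends in a bare Line Feed.  If some lines already end in "\r\n", a plain replace of "\n" turns those into "\r\r\n".  The Python sketch below guards against that with a lookbehind, which is my addition and not part of the recipe; the names are made up for illustration.

```python
import re

# Match a Line Feed only when it is NOT already preceded by a Carriage Return.
LONE_LF = re.compile(r"(?<!\r)\n")


def lf_to_crlf(text: str) -> str:
    """Convert bare LF line endings to CR LF, leaving existing CR LF pairs alone."""
    return LONE_LF.sub("\r\n", text)


# Hypothetical sample with mixed endings.
mixed = "a\nb\r\nc\n"
print(repr(lf_to_crlf(mixed)))
```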

The recipe to remove double-spacing is F: "\r\n\r\n" R: "\r\n".
     Explanation:  Find two ASCII EOL markers in a sequence (\r\n\r\n) and replace with a single EOL (\r\n).
     Summary:  Remove all empty lines, such as found in a double-spaced file.
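A single pass of this recipe only halves a run of empty lines, so three or four EOLs in a row need more than one pass.  Here is a small Python sketch that simply repeats the replacement until nothing is left to replace.  The function name is hypothetical.

```python
def remove_empty_lines(text: str) -> str:
    """Collapse every run of CR LF pairs down to a single CR LF."""
    # Each pass turns two EOLs into one; loop until no doubled EOL remains.
    while "\r\n\r\n" in text:
        text = text.replace("\r\n\r\n", "\r\n")
    return text


# Hypothetical sample: four EOLs in a row between the two lines.
spaced = "a\r\n\r\n\r\n\r\nb\r\n"
print(repr(remove_empty_lines(spaced)))
```

Keep in mind this removes ALL empty lines, including any that were put there on purpose.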

Trim trailing whitespace F: "[<space>\t]+$" R:  ""
     Explanation:  The brackets "[]" define a set of characters, here a space "<space>" or a tab "\t".  The "+" means one or more of them in a row, and the "$" anchors the match to the end of the line.  Whatever matches is replaced with nothing.
     Summary:  Trim the end of a line of extraneous whitespace.

Now the follow-on, to trim leading whitespace: F: "^[<space>\t]+" R:  ""
     Explanation:  Same as above, except "^" anchors the match to the start of a line.
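Both trimming recipes carry over to Python with one wrinkle: Python's "$" only recognizes "\n" as a line end, so on a CR/LF file the trailing pattern has to allow for an optional "\r" before it.  The sketch below is one way to do that, as far as I can tell; the names are made up for illustration.

```python
import re

# Trailing spaces/tabs, followed by an optional CR and then the end of the line.
TRAILING = re.compile(r"[ \t]+(?=\r?$)", re.MULTILINE)
# Leading spaces/tabs at the start of any line.
LEADING = re.compile(r"^[ \t]+", re.MULTILINE)


def trim_trailing(text: str) -> str:
    return TRAILING.sub("", text)


def trim_leading(text: str) -> str:
    return LEADING.sub("", text)


# Hypothetical sample with whitespace on both ends of its lines.
messy = "  alpha \t\r\n\tbeta  \r\ngamma  "
print(repr(trim_trailing(messy)))
print(repr(trim_leading(messy)))
```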

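To find most problem non-ASCII characters use F: "[^\x0d\x0a\x20-\x7e]"
     Explanation:  The "^" just inside the brackets negates the set, so this matches any character that is NOT a Carriage Return "\x0d", a Line Feed "\x0a", or a printable ASCII character "\x20-\x7e" (space through tilde).  Whatever it finds is a likely problem character, such as a curly quote or a stray control code.  Note that tabs are flagged too.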
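Here is a rough Python sketch of the same idea that lists where the problem characters are instead of just highlighting them.  The pattern spells the safe set as "\r", "\n", and "\x20-\x7e" (space through tilde), which is my assumption about which characters count as safe; adjust the range to taste.  The function name and the sample line are hypothetical.

```python
import re

# Anything that is not CR, LF, or printable ASCII (space through tilde).
SUSPECT = re.compile(r"[^\r\n\x20-\x7e]")


def find_suspects(text: str) -> list:
    """Return (position, character) pairs for every suspect character."""
    return [(m.start(), m.group()) for m in SUSPECT.finditer(text)]


# Hypothetical sample with an accented letter and curly quotes.
line = "caf\u00e9 \u201cquote\u201d\r\n"
for pos, ch in find_suspects(line):
    print(pos, repr(ch), hex(ord(ch)))
```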


Thanks and have a great day.

chuck

Monday, October 6, 2014

Atypical ASCII Line Breaks

WOW!  Bet I got your attention with that title!

This is a geekish post with real-world application to those importing transcripts into Visionary software.

I just received a support request that included this piece of information: "This transcript does not use typical line break ASCII characters."

I've seen enough of these to know that the issue is caused by the End Of Line (EOL) character(s).

In the PC (Windows) world, all text files are supposed to use two (non-printing) characters to signal the end of a line.  Those two characters reflect the old typewriter days (those of you too young to remember typewriters can skip past this explanation).  When the end of a line was reached on a typewriter, you returned the carriage to the first column and rolled the platen up one (or more) lines.  This passed into the personal computer world as "Carriage Return" and "Line Feed", or CR/LF, to indicate the end of a line.

However, there are some groups that do not use this standard.  The old Unix way (shout out to my old Unix SysAdmin buddies) uses only one character, the Line Feed, to indicate the EOL, and older Mac systems used only the Carriage Return.  The transcripts that cause this problem have single Carriage Returns mixed in with the normal CR/LF pairs.

This is what that looks like:



(I happen to use a program called Notepad++ that I have set to display all characters; that is why I can see the EOLs (and little dots for all the spaces).  Here is a link to the home page of this free - as in beer - software:  http://www.Notepad-plus-plus.org.)
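If you do not have an editor that shows the EOL characters, a few lines of Python can at least count them.  This is only a sketch: the function name and the sample bytes are made up, and it works on raw bytes so that nothing gets translated along the way.

```python
import re


def count_eols(data: bytes) -> dict:
    """Count proper CR LF pairs, lone CRs, and lone LFs in raw file bytes."""
    return {
        "CRLF": len(re.findall(rb"\r\n", data)),
        "CR": len(re.findall(rb"\r(?!\n)", data)),   # CR not followed by LF
        "LF": len(re.findall(rb"(?<!\r)\n", data)),  # LF not preceded by CR
    }


# Hypothetical sample containing every kind of line ending.
sample = b"one\r\r\ntwo\r\nthree\nfour\r"
print(count_eols(sample))
```

For a real transcript you would read the bytes with open(path, "rb").read() and pass them in.  Anything other than zero in the "CR" or "LF" columns means the file is not pure CR/LF.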

And there you have the problem.

Have a great day out there.

Questions? Answers - Support@VisionaryLegal.com

chuck


What?  Are you still reading?

OK, you know who you are and you want a solution.

Well, using Notepad++, choose Search, Replace.

In the Replace tab enter (without the quotes) "\r\r\n" in the Find What field and "\r\n\r\n" in the Replace With field.  Click the radio button for Extended (\n...) and click the Replace All button.

When it finishes, all the single CRs will have been replaced by the more appropriate CRLF combination.

Of course at this point you are staring at a transcript that is double-spaced, so you can clear that with one more Search and Replace operation using "\r\n\r\n" and "\r\n".  Click the Replace All button a couple times until there are no more replacements.
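For those with a stack of transcripts to fix, the two Search and Replace operations can be strung together in a short Python script.  Treat this as a sketch under my own assumptions: the function names and the file paths are hypothetical, and the files are handled as raw bytes so Python does not quietly change the line endings itself.

```python
from pathlib import Path


def fix_bytes(data: bytes) -> bytes:
    """Apply both replacements: fix stray CRs, then remove the double-spacing."""
    data = data.replace(b"\r\r\n", b"\r\n\r\n")
    # Same as clicking Replace All until there are no more replacements.
    while b"\r\n\r\n" in data:
        data = data.replace(b"\r\n\r\n", b"\r\n")
    return data


def fix_transcript(src: str, dst: str) -> None:
    """Read src as bytes, fix it, and write the result to dst (src is untouched)."""
    Path(dst).write_bytes(fix_bytes(Path(src).read_bytes()))


# Hypothetical sample of a damaged transcript.
damaged = b"Q. Hello\r\r\nA. Hi\r\r\n"
print(fix_bytes(damaged))
```

Writing to a separate output file keeps the original around in case something goes wrong.  As with the manual steps, any empty lines that were in the transcript on purpose are removed too.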

Just like voting; save early and save often.

crb