More Text Processing Features

From LMU BioDB 2013

This page extends the basic text processing from Introduction to the Command Line by revealing more command line features and tricks, with a focus on additional sed capabilities.

More on Command Line Editing

As you use the command line more and more — particularly for long, piped sequences of commands — you’ve probably come to appreciate the usefulness of the arrow keys for recalling past commands and moving backward/forward along the current command, as well as the convenience afforded by Tab autocompletion. It may not surprise you to know that there are quite a few more available tools for working with commands.

Keyboard Shortcuts

In addition to the left and right arrow keys for moving backward and forward along the current command, the following keys allow additional types of moves. All of these keys use “modifiers” such as the control or alt keys. To use these shortcuts, hold down the modifier first, then tap the letter. Once the key does its work, you can let go of the modifier:

  • control-a jumps to the beginning of the current command (mnemonic: “a” as in the beginning of the alphabet).
  • control-e jumps to the end of the current command (mnemonic: “e” as in the end of the command).
  • control-d does “forward delete” (i.e., what the Del key does in typical word processing applications).
  • control-k deletes everything from the cursor to the end of the line.

The following shortcuts use the alt key as the modifier; you can think of the control shortcuts as acting on single characters while the alt shortcuts act on words. If you’re using the command line via a remote connection like ssh or PuTTY, these might need additional configuration in order to work. Contact Dr. Dionisio if these shortcuts don’t seem to work for you:

  • alt-f moves forward one word at a time (mnemonic: f as in forward).
  • alt-b moves backward one word at a time (mnemonic: b as in backward).
  • alt-d does “forward delete” on words.

This last shortcut uses two modifier keys, but is well worth the effort:

  • control-shift-_ (technically the dash “-” key, but the shift key is held down so you’re really typing the underscore “_”) performs an undo of your last edit. Depending on the kind of edits you’ve made, you might be able to undo further, but that is not guaranteed.

The history Command

The up and down arrow keys have been mentioned as a means for navigating your past history of commands. More explicitly, you can type:

history

...and you will get a numbered list of past commands. To invoke a prior command by number, precede it with an exclamation point:

!8

This will invoke whatever the history command listed as command 8.

A little trivia: among techies, the exclamation point has the nickname “bang.” Thus, the history invocation above can be read as “bang eight.”

Note that the history command, like the other commands that you’ve seen so far, produces text as output, which can be piped into other commands. Thus, you can do things like list all of your past commands that involve sed by piping history through grep:

history | grep "sed"

This makes it easier to locate some prior command if you can think of some text that will distinguish that command from the others in your history.

More on sed

As has been mentioned before, only a fraction of sed’s features has been described so far. This section describes a selection of additional things that sed can do...and it still isn’t a comprehensive list.

Slash (/) in Your Text

The slash character (“/”) is used by sed as a separator: in the s///g and y/// rules, the slash separates the text/pattern to match from the text that will replace those matches. But what if you want to include the slash character for either matches or replacements? In this case, the slash isn’t a separator, but just another symbol in the pattern or replacement.

To tell sed that a particular slash should be interpreted as part of the text, and not a separator, precede it with a backslash (“\”). The slash that follows it is then treated as part of the pattern or replacement text:

sed "s/\// slash /g"

...will replace all “/” characters with the word “slash.”

sed "s/Title/<h1>&<\/h1>/g"

...will enclose all instances of the text “Title” in h1 HTML tags: <h1>Title</h1>.

This technique for indicating that a symbol should be treated as “plain” text and not something with special meaning is usually called escaping that symbol.
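To see escaping in action, here is a small sketch that feeds sample text through the two rules above using echo. The sample strings and variable names are made up for illustration. The last command uses a different separator character (a vertical bar) so that the slash needs no escaping; this is a standard sed feature, but it is an aside that the text above does not cover.

```shell
# Sample inputs and variable names are illustrative only.
slashes=$(echo "usr/local/bin" | sed "s/\// slash /g")
echo "$slashes"    # usr slash local slash bin

heading=$(echo "Title" | sed "s/Title/<h1>&<\/h1>/g")
echo "$heading"    # <h1>Title</h1>

# Same substitution as the first one, with | as the separator instead of /,
# so the slash in the pattern does not have to be escaped.
alt=$(echo "usr/local/bin" | sed "s|/| slash |g")
echo "$alt"        # usr slash local slash bin
```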

Newline/Linebreak Tricks

You may or may not have noticed that sed is very line-centric: it operates on patterns within lines of text, with each line being treated separately. This can be used to your advantage thanks to a few line-oriented commands.

Picking Lines

You can precede your s///g or y/// rules with one or two numbers — these numbers represent the lines which you would like the rule to affect. Two numbers represent a line range — they must be separated by a comma. The sed command will then only affect the lines within that range. For example:

sed "5s/hello/goodbye/g"

...will only replace “hello” with “goodbye” on the fifth line of the input text.

sed "3,5s/LMU/UCLA/g"

...will only affect lines 3, 4, and 5 of the input text.
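A quick way to check a line range is to feed sed several identical lines and see which ones change. The sketch below uses printf to produce six made-up lines; the variable name is only for illustration.

```shell
# Six identical lines make it easy to see which ones the range touches.
picked=$(printf 'LMU\nLMU\nLMU\nLMU\nLMU\nLMU\n' | sed "3,5s/LMU/UCLA/g")
echo "$picked"
# LMU
# LMU
# UCLA
# UCLA
# UCLA
# LMU
```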

Inserting Lines

A backslash followed by a lowercase n (\n) represents a new line. When used in the replacement section of either the s///g or y/// rules, \n will “break” the incoming line, thus adding a line to the resulting output. Thus,

sed "s/paragraph/&\n/g"

...adds a line break after every instance of “paragraph” in the input text.
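As a small illustration, the sketch below runs that rule on one made-up line of text. It assumes GNU sed (the version found on most Linux systems), which understands \n in the replacement text; other versions of sed may behave differently.

```shell
# Assumes GNU sed, which understands \n in the replacement text.
broken=$(echo "first paragraph second paragraph end" | sed "s/paragraph/&\n/g")
echo "$broken"
# first paragraph
#  second paragraph
#  end
```

Note that the spaces that followed each “paragraph” are still there, now at the start of the second and third lines.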

Combining Lines

Paradoxically, the reverse operation of removing line breaks is actually very tricky in sed, precisely because sed treats each line as a separate unit of work. When sed reads a line, the newline at its end is effectively invisible to it, so joining lines requires some trickery which, for now, will be left unexplained because what’s going on is just too involved.

Thus, for now, take on faith that this sed rule will combine all of the input text into a single line:

sed ':a;N;$!ba;s/\n//g'

A few notes on this particular rule:

  • Needless to say, you must type this in exactly — note the single quotes as opposed to the usual double quotes.
  • The latter part of the rule should look familiar: s/\n//g does mean “substitute all newlines with nothing.” The trick is that the previous sequence of symbols is what makes these newline characters visible to sed in the first place. s/\n//g by itself does not work.
  • A variant of the command above is to replace newlines with something else. For example:
    sed ':a;N;$!ba;s/\n/ /g'
    ...still combines the input text into a single line, but inserts a space wherever the linebreaks used to be.
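For the curious, the pieces of that rule can be read roughly as described in the comments below. This is only a sketch of the usual meaning of these sed commands, assuming GNU sed (which accepts the semicolon-separated form); the sample input and variable names are made up.

```shell
# :a        defines a label named "a"
# N         appends the next input line (newline included) to the text being edited
# $!ba      if this is not the last line ($!), branch (b) back to label a
# s/\n/ /g  once all lines have been gathered, replace each newline with a space
joined=$(printf 'one\ntwo\nthree\n' | sed ':a;N;$!ba;s/\n/ /g')
echo "$joined"    # one two three

# With nothing between the last two slashes, the lines are glued together directly.
glued=$(printf 'one\ntwo\nthree\n' | sed ':a;N;$!ba;s/\n//g')
echo "$glued"     # onetwothree
```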

Deleting Lines

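You can delete lines outright with a new “letter rule”: D. (The lowercase d is the form you will see most often; for line-by-line deletion like this, the two behave the same.) This invocation of sed, then, pretty much deletes everything: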

sed "D"

Thus, D is not very useful by itself. It does work with the “line picking” directive of preceding the letter with numbers, and this is typically how D is used:

sed "2,4D"

...deletes the second, third, and fourth line from the input text.

sed "10D"

...will delete only the tenth line of the input text.

There are other ways to tell sed which lines to delete, but for now we’ll stop with the above deletion by line number.
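Numbered sample lines make deletion by line number easy to verify. In this sketch the input and variable names are made up; the second command uses the lowercase d, which gives the same result for this kind of deletion.

```shell
# Five numbered lines; lines 2 through 4 are deleted.
kept=$(printf 'line 1\nline 2\nline 3\nline 4\nline 5\n' | sed "2,4D")
echo "$kept"
# line 1
# line 5

# The lowercase d produces the same output here.
kept_lower=$(printf 'line 1\nline 2\nline 3\nline 4\nline 5\n' | sed "2,4d")
echo "$kept_lower"
# line 1
# line 5
```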

Strategies with Picking, Inserting, Combining, and Deleting Lines

Combining these line-manipulation tricks can be very useful for modifications that can’t be captured by a combination of s///g and/or y/// rules alone. For example, what if you want a substitution to happen only after some other text pattern has been found? One way to mark that you’ve already seen that pattern is to insert a line break after it. You can then delete the line breaks, or the lines themselves, when you’re done.

Suppose you wanted to convert all “a”s on a single line into hyphens (“-”), but only if those “a”s appear after a letter “x”. Plain s///g or y/// rules cannot do this on their own. You could, however, break that line after the “x”, perform the “a” replacement only on the second line, and finally put the line back together:

sed "s/x/&\n/g" | sed "2s/a/-/g" | sed ':a;N;$!ba;s/\n//g'

Pass a single line of data into that command to see it at work; you can use the echo command to pass some text that you can type in on the spot:

echo aaaaaxaababab | sed "s/x/&\n/g" | sed "2s/a/-/g" | sed ':a;N;$!ba;s/\n//g'
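To follow what each stage of that pipeline does, it can help to run the stages one at a time and look at the text in between. The sketch below does exactly that, assuming GNU sed; the variable names are only for illustration.

```shell
input="aaaaaxaababab"

# Stage 1: break the line after the "x".
step1=$(echo "$input" | sed "s/x/&\n/g")
echo "$step1"
# aaaaax
# aababab

# Stage 2: replace the "a"s on the second line only.
step2=$(echo "$step1" | sed "2s/a/-/g")
echo "$step2"
# aaaaax
# --b-b-b

# Stage 3: join the two lines back together.
step3=$(echo "$step2" | sed ':a;N;$!ba;s/\n//g')
echo "$step3"
# aaaaax--b-b-b
```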

Changing Only the nth Match on a Line

At this point, we can expand the s///g substitution rule a little bit by giving you something other than g at the end of that rule. g stands for global, meaning that all matches of the given pattern should be replaced in the input text.

Instead of g, you can place a number at the end of the s rule. This will replace only the number-th occurrence of the given pattern on each line. Thus, replacing only the third “hello” in a line with “dolly” can be done as follows:

sed "s/hello/dolly/3"

This command will leave “hello” and “hello hello” unchanged, but will change “hello hello hello hello” into “hello hello dolly hello”.
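The three cases just described can be tried in one go by sending them to sed as three separate lines, since the count starts over on every line. The variable name in this sketch is made up.

```shell
# Each input line is counted separately; only the third line has a third "hello".
nth=$(printf 'hello\nhello hello\nhello hello hello hello\n' | sed "s/hello/dolly/3")
echo "$nth"
# hello
# hello hello
# hello hello dolly hello
```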

Pattern Shortcuts

The pattern section of the s/// rule (note how we’re no longer including the g since numbers can go at the end too) can handle quite a few more types of text sequences. The two new ones in this section require that you include -r between the sed command and the s/// rule. This is because, by default, sed doesn’t “turn on” all of its pattern-matching features; the -r modifier tells sed to acknowledge the following, among others:

A Specific Number of Repetitions

Suppose you want to substitute any instance of “ta” repeated 10 times with an asterisk (“*”). You can type the whole thing:

sed "s/tatatatatatatatatata/*/g"

Sometimes this is acceptable; other times, it’s a pain. For this, you can specify a number of repetitions for a desired pattern, enclosed in curly braces { }:

sed -r "s/(ta){10}/*/g"

With -r, sed recognizes this shortcut and interprets {10} as “the preceding item repeated 10 times.” The “item” in question can be any pattern enclosed in parentheses.

You can skip the parentheses for patterns of a single character or the square bracket [ ] choice of characters.

sed -r "s/z{5}/ snore /g"

...replaces all “zzzzz” occurrences with “ snore ”, while

sed -r "s/[cq]{2}/k/g"

...replaces any two consecutive characters drawn from “c” and “q” (that is, “cc”, “cq”, “qc”, or “qq”) with a single “k” in the output.
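Here is a short sketch of repetition counts on made-up input, assuming a sed that accepts -r (such as GNU sed). To keep the sample text short, the first command uses a count of 3 rather than 10.

```shell
# "ta" repeated exactly three times becomes an asterisk.
short=$(echo "tatata boom" | sed -r "s/(ta){3}/*/g")
echo "$short"     # * boom

# Two consecutive characters that are each c or q become one k.
pairs=$(echo "acquire soccer" | sed -r "s/[cq]{2}/k/g")
echo "$pairs"     # akuire soker

# A lone c is left alone because {2} needs two in a row.
lone=$(echo "cat" | sed -r "s/[cq]{2}/k/g")
echo "$lone"      # cat
```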

Multiple Choice

Another recognized -r pattern is “multiple choice.” Suppose you wanted to replace either “dog” or “cat” with the word “pet.” As with the number of repetitions, you can do it the long way:

sed "s/dog/pet/g" | sed "s/cat/pet/g"

...but with multiple choice, represented by the vertical bar | (yes, the same vertical bar for pipes, but now used in a different context), the command becomes a little more compact:

sed -r "s/dog|cat/pet/g"

Note what these -r patterns have in common: they are essentially shortcuts, because they can be done using “simpler” sed patterns. However, having the repetition count {n} or multiple choice | symbols can shorten things, thus making some commands more readable and less error-prone.
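The sketch below runs both the long way and the multiple-choice version on the same made-up sentence to show that they agree. It assumes a sed that accepts -r, such as GNU sed; many versions of sed also accept -E as another spelling of the same option, though nothing here relies on that.

```shell
sentence="my dog chased the cat"

# The long way: one sed per word.
long_way=$(echo "$sentence" | sed "s/dog/pet/g" | sed "s/cat/pet/g")
echo "$long_way"    # my pet chased the pet

# Multiple choice: one rule covers both words.
choice=$(echo "$sentence" | sed -r "s/dog|cat/pet/g")
echo "$choice"      # my pet chased the pet
```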

Multiple sed Rules in a Single Bound

To conclude this next “layer” of command line information, we introduce something relatively simple: stringing multiple sed commands together. This is also a shortcut of sorts, but does not require -r since sed can do this by default.

In many cases, instead of piping text information like so:

sed "y/actg/tgac/" | sed "s/t/u/g"

...you can just combine the rules into a single sed command — just put a semicolon in between them:

sed "y/actg/tgac/;s/t/u/g"

This makes sed apply the two rules one after another.
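As a quick check, the sketch below runs the piped and the combined versions on the same made-up input and shows that they give the same result. The last command swaps the two rules to show that the order in which they are written matters. Variable names are only for illustration.

```shell
# Piped and combined forms of the same two rules.
piped=$(echo "actg" | sed "y/actg/tgac/" | sed "s/t/u/g")
combined=$(echo "actg" | sed "y/actg/tgac/;s/t/u/g")
echo "$piped"       # ugac
echo "$combined"    # ugac

# Rules run in the order written, so swapping them changes the result.
swapped=$(echo "actg" | sed "s/t/u/g;y/actg/tgac/")
echo "$swapped"     # tguc
```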
