Dynamic Text Processing

From LMU BioDB 2017

As hinted in our Introduction to the Command Line, we actually have more power at our fingertips than one might expect thanks to the command line’s ability to pass a coherent stream of data from one command to another. On this page, we cover two commands that lend themselves particularly well to this approach: grep and sed.

Finding Text: grep

grep finds specific text within its input data according to some pattern. Unfortunately, explaining the name is too complicated for now, so let’s just leave it at grep:

grep "<pattern>"

This will try to find the desired pattern in the lines that you type. If a line matches, it will repeat that line. If it doesn’t match, it will just wait for the next line until you hit Control-d to end your input.

Try this:

grep "Romance"

Then, type any number of lines. Include the word Romance in some but not others. Notice that the only lines that repeat are the ones with Romance in them. Notice also that the matching is case-sensitive—i.e., romance will not match.

Non-Exact Matches

Exact matches are interesting, but most other everyday applications can do this without a problem. Note how we said that grep can match a pattern and not just search text. It turns out that grep can “understand” a wide variety of symbols that represent different patterns of text.

A period (.) represents any single character. Thus, this pattern:

grep "st..r"

...produces all lines that have “st” and “r” with any two symbols in between. So lines with steer or Fred Astaire will match, but store or restart will not.
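If you would rather not type the lines interactively, you can pipe sample lines into grep. The following is only a sketch: the use of printf to produce the lines and the variable name matches are choices made for this illustration.

```shell
# Feed four sample lines to grep; only lines that contain "st",
# then any two characters, then "r" are kept.
matches=$(printf 'steer\nFred Astaire\nstore\nrestart\n' | grep "st..r")
echo "$matches"
```

Only steer and Fred Astaire should appear in the output.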

Here are some other patterns that you’ll find useful. Needless to say, this is just the tip of the iceberg; as you get more comfortable with grep, you can learn more and more variations for text patterns.

[<characters>] Matches lines that have any of the characters listed in <characters>
^pattern Matches lines that start with the given pattern
pattern$ Matches lines that end with the given pattern
pattern* Matches zero or more repetitions of the last character (or bracketed set) of the given pattern
^[^<characters>]*$ Matches lines that do not have the characters listed in <characters>

Note the dual use of ^: as the first symbol within brackets [ ] it means “do not match these characters,” but as the first symbol of the whole pattern it represents the start of a line.

As mentioned, there are many more, but this is a start.
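As a rough illustration of the start-of-line, end-of-line, and repetition symbols, the sketch below runs three patterns over the same made-up input. The sample words and variable names are assumptions chosen for this example.

```shell
# Four short sample lines (made up for this illustration).
input=$(printf 'apple\nbanana\ngrape\nab\n')

# ^a keeps lines that start with "a".
starts_with_a=$(echo "$input" | grep "^a")

# e$ keeps lines that end with "e".
ends_with_e=$(echo "$input" | grep "e$")

# ^ab*$ keeps lines made of one "a" followed by zero or more "b"s.
a_then_bs=$(echo "$input" | grep "^ab*$")

echo "$starts_with_a"
echo "$ends_with_e"
echo "$a_then_bs"
```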

A Few Examples

It’s the patterns that truly reveal grep’s potential power. For example, try this:

grep "[qz]"

Here’s what appears on the screen if the user types “hello world,” “quit bugging me,” “Quit bugging me,” “what's up,” “Zounds!,” “zoundz!,” then Control-d:

hello world
quit bugging me
quit bugging me
Quit bugging me
what's up
Zounds!
zoundz!
zoundz!

Only “quit bugging me” and “zoundz!” contain a lowercase q or z, so only those lines are repeated by grep.

Negations ([^ ]) may seem unintuitive at first but after some consideration their behavior does make sense:

grep "[^qz]"

At first, one might think that this will match data that have neither q nor z within. However, this is not the case:

hello world
hello world
quit bugging me
quit bugging me
Quit bugging me
Quit bugging me
what's up
what's up
Zounds!
Zounds!
zoundz!
zoundz!

That’s because grep considers a line a match as soon as it finds any single character that is neither q nor z. Only data that consists entirely of qs and zs will not match:

qqqq
zzzzzzz
qzqzzzq
qq

The key to matching data that don’t contain those characters at all is to combine them with ^, *, and $:

grep "^[^qz]*$"

This pattern says that no character from the beginning to the end of the line may be a q or a z:

hello world
hello world
quit bugging me
Quit bugging me
Quit bugging me
what's up
what's up
Zounds!
Zounds!
zoundz!

Remember again, though, that we are case-sensitive by default, so you need to include Q and Z in your pattern if you want to factor in capital letters.
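For instance, adding Q and Z inside the brackets might look like the sketch below. The sample lines are reused from above; the variable name no_qz is made up for this illustration.

```shell
# Keep only lines with no q or z in either lowercase or uppercase.
no_qz=$(printf 'hello world\nquit bugging me\nQuit bugging me\nZounds!\n' | grep "^[^qzQZ]*$")
echo "$no_qz"
```

With these four lines, only hello world should survive.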

Modifying Text: sed

If grep is equivalent to the Find feature on many everyday applications, then sed is like Search and Replace, but on steroids. sed stands for stream editor—a name that makes sense because we’ve talked about data as something that “flows” from one program to another via pipes.

sed’s main function is to take a line of text, then modify it according to some rule that you give it. As with grep, there are lots of rules that you can specify, but for this section we will highlight just a subset.

This is how a sed invocation looks:

sed "<rule>"

As you might expect, this sed invocation is intended to be used as part of a pipe, such as:

cat <filename> | grep "<pattern>" | sed "<rule>"

The above pipe will extract the lines in filename that match pattern, then make sed modify those lines according to the given rule.

When used by itself, sed "<rule>" behaves as you would expect by now: it modifies the lines that you type in on the fly, until you hit Control-d.

  • Note: sed by itself does not make any permanent changes to files, so don’t worry about messing up any of the data that you’re using. All examples just display the edited text. Saving to a file is a matter of using output redirection (>).

Replacing Text That Fits a Pattern

Perhaps the most commonly-used modification rule in sed is s/<pattern>/<replacement>/g. The use of the term pattern is no accident—the patterns that sed recognizes for matching text are nearly identical to those recognized by grep. replacement then takes the place of those matched patterns of text:

sed "s/<pattern>/<replacement>/g"

Give this a shot:

sed "s/Hello/Goodbye/g"

If you then type Hello, hi, bye, and Hello World! as individual lines, sed interjects its output to produce this:

Hello
Goodbye
hi
hi
bye
bye
Hello World!
Goodbye World!

Observe that, unlike grep, sed does repeat every line you type, regardless of whether or not that line matches the pattern. The difference is that, when there is a match, sed performs the specified replacement. Thus, Hello becomes Goodbye, and Hello World! becomes Goodbye World!
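To see this rule inside the kind of pipe described earlier, here is a small sketch that filters a file with grep, edits the result with sed, and saves it with output redirection. The file names, the file contents, and the use of mktemp for a scratch directory are all assumptions made for this illustration.

```shell
# Create a small sample file in a scratch directory.
workdir=$(mktemp -d)
printf 'Hello Alice\nhi Bob\nHello Carol\n' > "$workdir/greetings.txt"

# Keep only the lines containing "Hello", change the greeting,
# and save the edited text to a new file.
cat "$workdir/greetings.txt" | grep "Hello" | sed "s/Hello/Goodbye/g" > "$workdir/farewells.txt"

# The original file is untouched; the edited text lives in the new file.
original=$(cat "$workdir/greetings.txt")
edited=$(cat "$workdir/farewells.txt")
echo "$edited"
```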

The power behind this search-and-replace functionality comes from the patterns that you can use, again very similar to those used by grep. In addition to exact matches (like the one shown above), you can do:

. Matches any single character
[<characters>] Matches any one of the characters listed in <characters>
^pattern Matches the given pattern only at the start of a line
pattern$ Matches the given pattern only at the end of a line
pattern* Matches zero or more repetitions of the last character (or bracketed set) of the given pattern
[^<characters>]* Matches a run of characters that are not listed in <characters>

Note how, when we are doing text replacement, the use of [^ ] is a little more intuitive:

sed "s/[^FLIMVSPTAYHQNKDECWRG-]/*/g"

...replaces every character that is not one of the characters in between the brackets with an asterisk (*); it is as simple as that. No need to use ^ and $ to indicate the beginning and end of the line.
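A quick way to check this is to run the rule on a single made-up line. The input string and the variable name cleaned are assumptions for this illustration.

```shell
# X, Z, B, and the space are not in the bracketed list, so each becomes *.
cleaned=$(echo "MKT-XZB LLA" | sed "s/[^FLIMVSPTAYHQNKDECWRG-]/*/g")
echo "$cleaned"
```

This should print MKT-****LLA: the dash survives because it is listed inside the brackets.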

In terms of replacement, in addition to replacement with an exact piece of text, you can also delete the matched text; this is a matter of placing nothing in between the second set of slashes (/):

sed "s/Evil//g"

...deletes the text “Evil” from any input lines that have it. Of course, you can use non-exact patterns, such as this:

sed "s/^..//g"

The above rule deletes the first two characters of each input line, no matter what those characters are. (A line with fewer than two characters is left unchanged.)

Perhaps even more powerful, you can also include the matched text—even though you don’t know what that is per line—in the replacement. Do this by including an ampersand (&) in the replacement text. The matched text replaces the ampersand in the final output. For example:

sed "s/.../& /g"

...this will replace any three characters with the same characters plus a space. Since those three characters will differ from line to line (and in fact many lines will have more than one set of three characters), having & available lets you keep those while adding a space.

As another example:

sed "s/[aeiou]/&&/g"

...will double up every lowercase vowel in the input text.
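Both ampersand rules can be tried on single lines, as in this sketch. The input strings and variable names are made up for this illustration.

```shell
# Every full group of three characters gets a space after it;
# the leftover "aa" is too short to match and stays as it is.
spaced=$(echo "atgcgtaa" | sed "s/.../& /g")

# Each lowercase vowel is replaced by two copies of itself.
doubled=$(echo "hello world" | sed "s/[aeiou]/&&/g")

echo "$spaced"
echo "$doubled"
```

The two lines printed should be atg cgt aa and heelloo woorld.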

Try those search-and-replace operations on Microsoft Word 😅 Not impossible, but probably harder than doing it with sed (once you learn the patterns).

Gathering Up a Bunch of Rules in a Single File

What if you want to perform a whole bunch of search/replace activities on some text data? On the one hand, you can type multiple sed commands in a pipe. For example, changing all “The”s to “Them” then changing “Bones” to “Brains” may be done this way:

sed "s/The /Them /g" | sed "s/Bones/Brains/g"

When you have a lot of substitutions to do, it would be a pain to write out a long pipe. For precisely this reason, sed allows a variation that does not include the actual rule in the command, but reads the rules from a separate file:

sed -f <file with rules>

The file with rules is a simple text file, with one sed substitution rule per line. Invoking sed -f <file with rules> on a stream of text data is equivalent to performing sed, sequentially, once for every rule in file with rules. It’s mainly a time saver, but a significant one.
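As a sketch, the two-step pipe above could be turned into a rules file like this. The temporary file created with mktemp and the sample input are assumptions for this illustration; in practice you would write the rules file in a text editor.

```shell
# Write the two substitution rules to a file, one rule per line.
rules=$(mktemp)
printf 's/The /Them /g\ns/Bones/Brains/g\n' > "$rules"

# Apply every rule in the file, in order, to the input.
result=$(echo "The Bones" | sed -f "$rules")
echo "$result"

# Clean up the scratch file.
rm "$rules"
```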

Replacing Characters With Another Set of Characters

As powerful as s/<pattern>/<replacement>/g is, it has limitations. For example, what if you wanted to do something similar to a “secret decoder ring,” where, say, every letter becomes the letter after it, and “z” cycles back to “a”? You might think that including a sequence of s/<pattern>/<replacement>/g rules in a file will do this, but it won’t (for simplicity, we’re only including lowercase):

s/a/b/g
s/b/c/g
s/c/d/g
...
s/z/a/g

This won’t work: since the replacements are done in sequence, a word like “adios” becomes “bdios” after the first substitution (i.e., “b” for “a”). Then, when “c” is substituted for “b”, “bdios” becomes “cdios”, which isn’t what you want. By the time the last rule has run, every lowercase letter has been carried all the way around to “a”.

What we need is a different rule, one that replaces each of several characters with its own counterpart in one fell swoop. This rule does exist in sed, and that is:

y/<original characters>/<new characters>/

Because the replacement must be one-to-one, there must be as many characters in <original characters> as there are in <new characters>. With the y/<original characters>/<new characters>/ rule, the “secret decoder ring” becomes possible:

sed "y/abcdefghijklmnopqrstuvwxyz/bcdefghijklmnopqrstuvwxyza/"

As you might expect, this sed command will “decode” the message produced by the one above:

sed "y/bcdefghijklmnopqrstuvwxyza/abcdefghijklmnopqrstuvwxyz/"

Inclusion of uppercase letters, plus any other substitutions, is left to you for practice. Do note, however, how y/<original characters>/<new characters>/ is materially different from s/<pattern>/<replacement>/g.
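A round trip through both commands is an easy way to convince yourself that they undo each other. In this sketch the sample word and the variable names encoded and decoded are made up for illustration.

```shell
# Shift every lowercase letter forward by one.
encoded=$(echo "adios" | sed "y/abcdefghijklmnopqrstuvwxyz/bcdefghijklmnopqrstuvwxyza/")

# Shift every lowercase letter back by one.
decoded=$(echo "$encoded" | sed "y/bcdefghijklmnopqrstuvwxyza/abcdefghijklmnopqrstuvwxyz/")

echo "$encoded"
echo "$decoded"
```

The encoded word should be bejpt, and decoding it should give back adios.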

Slash (/) in Your Text

The slash character (“/”) is used by sed as a separator: in the s///g and y/// rules, the slash separates the text/pattern to match from the text that will replace those matches. But what if you want to include the slash character for either matches or replacements? In this case, the slash isn’t a separator, but just another symbol in the pattern or replacement.

To tell sed that a particular slash should be interpreted as part of the text, and not a separator, precede it with a backslash (“\”). The slash that follows it is then treated as part of the pattern or replacement text:

sed "s/\// slash /g"

...will replace all “/” characters with the word “slash,” padded by a space on each side.

sed "s/Title/<h1>&<\/h1>/g"

...will enclose all instances of the text “Title” in h1 HTML tags: <h1>Title</h1>.

This technique for indicating that a symbol should be treated as “plain” text and not something with special meaning is usually called escaping that symbol. Escaping is necessary for any symbol that has special meaning in sed’s patterns. Thus, you can correctly infer that symbols like [, ], ^, ., *, and $ also need to be escaped.
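The period is a good symbol to experiment with, since an unescaped period matches any character. The sketch below compares the two cases; the input text and variable names are assumptions for this illustration.

```shell
# Unescaped: the period matches any character, so "3x14" is replaced too.
unescaped=$(echo "3.14 and 3x14" | sed "s/3.14/PI/g")

# Escaped: the period only matches a literal period.
escaped=$(echo "3.14 and 3x14" | sed "s/3\.14/PI/g")

echo "$unescaped"
echo "$escaped"
```

The first line printed should be PI and PI, the second PI and 3x14.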

Line-Based Actions

You may or may not have noticed that sed is very line-centric: it operates on patterns within lines of text, with each line being treated separately. This can be used to your advantage thanks to a few line-oriented commands.

Picking Lines

You can precede your s///g or y/// rules with one or two numbers — these numbers represent the lines which you would like the rule to affect. Two numbers represent a line range — they must be separated by a comma. The sed command will then only affect the lines within that range. For example:

sed "5s/hello/goodbye/g"

...will only replace “hello” with “goodbye” on the fifth line of the input text.

sed "3,5s/LMU/UCLA/g"

...will only affect lines 3, 4, and 5 of the input text.
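A shorter range is easier to check by eye. This sketch applies the same substitution to lines 2 and 3 of a four-line input; the input and the variable name result are made up for this illustration.

```shell
# Only the second and third lines are changed; the first and fourth are not.
result=$(printf 'LMU\nLMU\nLMU\nLMU\n' | sed "2,3s/LMU/UCLA/g")
echo "$result"
```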

Inserting Lines

A backslash followed by a lowercase n (\n) represents a new line. When used in the replacement section of either the s///g or y/// rules, \n will “break” the incoming line, thus adding a line to the resulting output. Thus,

sed "s/paragraph/&\n/g"

...adds a line break after every instance of “paragraph” in the input text.
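Here is a sketch of that rule on a single input line. It assumes GNU sed, which is what most Linux systems provide; other versions of sed may not treat \n in the replacement this way. The input text and the variable name broken are made up for this illustration.

```shell
# Each "paragraph" is followed by a line break in the output
# (assumes GNU sed's handling of \n in the replacement).
broken=$(echo "first paragraph second paragraph" | sed "s/paragraph/&\n/g")
echo "$broken"
```

Note that the second output line starts with the space that followed the first “paragraph” in the input.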

Combining Lines

Paradoxically, the reverse operation of eliminating line breaks is actually very tricky in sed; this is precisely due to sed’s use of lines as separate units of work. For input, the newlines are effectively invisible to sed, requiring some trickery which, for now, will be left unexplained because what’s going on is just too involved.

Thus, for now, take on faith that this sed rule will combine all of the input text into a single line:

sed ':a;N;$!ba;s/\n//g'

A few notes on this particular rule:

  • Needless to say, you must type this in exactly. Note the single quotes as opposed to the usual double quotes: inside double quotes, the shell would try to interpret $! itself before sed ever sees it.
  • The latter part of the rule should look familiar: s/\n//g does mean “substitute all newlines with nothing.” The trick is that the previous sequence of symbols is what makes these newline characters visible to sed in the first place. s/\n//g by itself does not work.
  • A variant of the command above is to replace newlines with something else. For example:
    sed ':a;N;$!ba;s/\n/ /g'
    ...still combines the input text into a single line, but inserts a space wherever the linebreaks used to be.
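To check the space-inserting variant without typing lines by hand, something like the following sketch can be used. It assumes GNU sed, as the rule itself does; the three input lines and the variable name joined are made up for this illustration.

```shell
# Three input lines come out as one line, with spaces where the
# line breaks used to be.
joined=$(printf 'one\ntwo\nthree\n' | sed ':a;N;$!ba;s/\n/ /g')
echo "$joined"
```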

Deleting Lines

You can delete lines outright with a new “letter rule:” d. This invocation of sed, then, pretty much deletes everything:

sed "d"

Thus, d is not very useful by itself. It does work with the “line picking” directive of preceding the letter with numbers, and this is typically how d is used:

sed "2,4d"

...deletes the second, third, and fourth line from the input text.

sed "10d"

...will delete only the tenth line of the input text.
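A numbered input makes the effect of a range deletion easy to see. In this sketch, the five input lines and the variable name kept are assumptions for illustration.

```shell
# Lines 2 through 4 are dropped; lines 1 and 5 pass through.
kept=$(printf '1\n2\n3\n4\n5\n' | sed "2,4d")
echo "$kept"
```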

There are other ways to tell sed which lines to delete, but for now we’ll stop with the above deletion by line number.

Quitting After a Specific Line

A special case for deleting lines is deleting from a certain line up to the very end. This can be tricky especially if you don’t know how many more lines are coming in. For this, there is the q rule: q for “quit.” If you think about it, deleting from a certain line onward is actually the same as just stopping further processing. If sed stops output past a certain line, then that is equivalent to deleting those lines:

sed "59q"

…will effectively cut off the data stream after the 59th line: lines 1 through 59 are output, and everything from line 60 onward is dropped.
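The same idea with a smaller number is sketched below; the five input lines and the variable name first_three are made up for this illustration.

```shell
# sed prints lines 1 through 3, then quits; "d" and "e" never appear.
first_three=$(printf 'a\nb\nc\nd\ne\n' | sed "3q")
echo "$first_three"
```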

Changing Only the nth Match on a Line

At this point, we can expand the s///g substitution rule a little bit by giving you something other than g at the end of that rule. g stands for global, meaning that all matches of the given pattern should be replaced in the input text.

Instead of g, you can place a number at the end of the s rule. This will replace only the number-th occurrence of the given pattern on each line. Thus, replacing only the third “hello” in a line with “dolly” can be done as follows:

sed "s/hello/dolly/3"

This command will leave “hello” and “hello hello” unchanged, but will change “hello hello hello hello” into “hello hello dolly hello”.
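Those two cases can be checked directly, as in this sketch (the variable names two and four are made up for this illustration):

```shell
# Only two matches on the line: there is no third one to replace.
two=$(echo "hello hello" | sed "s/hello/dolly/3")

# Four matches on the line: only the third one is replaced.
four=$(echo "hello hello hello hello" | sed "s/hello/dolly/3")

echo "$two"
echo "$four"
```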

Caveat: Pattern Matching is Greedy

We conclude this overview of sed with a note on how it matches text: sed is said to be greedy. It is probably best to illustrate this with an example:

sed "s/(.*)/term/g"

Before trying this out yourself, try to anticipate what this will do. Ostensibly, this should take any set of characters between parentheses and replace that with the string term. If you agree with that reasoning, go ahead and try it out. What does a line of (a + b) * (a - b) / 2 produce?

If you were surprised to see term / 2 in the output, and not term * term / 2 don’t feel bad—this catches almost everyone off guard. This is where being greedy comes in: the reason you see just term / 2 is that sed does not match the closing ) until the very last one that it finds. Thus, it replaces the entire (a + b) * (a - b) match with term. Think of it as gobbling everything up until there’s no longer any match for the rest of the line.

The solution to this is to make sure sed does not count intervening )s (or whatever symbol you consider as the terminator) within the in-between text:

sed "s/([^)]*)/term/g"

Note how we have replaced the period . here with “anything that is not )”—this will effectively force sed to declare a match at the first closing parenthesis ) that it finds:

(a + b) * (a - b) / 2
term * term / 2

This approach isn’t perfect—for example it does not handle nested parentheses correctly—but for certain situations, this trick will be just what you need (nudge nudge hint hint).
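The sketch below puts the greedy rule, the corrected rule, and the nested-parentheses limitation side by side. The nested input and the variable names are assumptions chosen for this illustration.

```shell
# Greedy: .* runs to the last ")" on the line.
greedy=$(echo "(a + b) * (a - b) / 2" | sed "s/(.*)/term/g")

# Corrected: [^)]* stops at the first ")" after each "(".
careful=$(echo "(a + b) * (a - b) / 2" | sed "s/([^)]*)/term/g")

# Nested parentheses: the match ends at the inner ")", leaving a stray ")".
nested=$(echo "((a + b) * c)" | sed "s/([^)]*)/term/g")

echo "$greedy"
echo "$careful"
echo "$nested"
```

The three lines printed should be term / 2, term * term / 2, and term * c).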

There’s More Where This Came From

Amazingly, there’s a lot more that grep and sed can do that has not been covered here. But what’s on this page should give you a broad enough palette of things to try when tackling any assignments involving the command line. You won’t necessarily need all of these features, but the features you do need for certain have definitely been covered.