Title: Workbook 8 String Processing Tools
1Workbook 8String Processing Tools
Pace Center for Business and Technology
2String Processing Tools
- Key Concepts
- When storing text, computers transform characters
into a numeric representation. This process is
referred to as encoding the text.
- In order to accommodate the demands of a variety
of languages, several different encoding
techniques have been developed. These techniques
are represented by a variety of character sets.
- The oldest and most prevalent encoding technique
is known as the ASCII character set, which still
serves as a least common denominator among other
techniques.
- The wc command counts the number of characters,
words, and lines in a file. When applied to
structured data, the wc command can become a
versatile counting tool.
- The cat command has options that allow
representation of nonprinting characters such as
NEWLINE.
- The head and tail commands have options that
allow you to print only a certain number of lines
or a certain number of bytes (one byte usually
correlates to one character) from a file.
3What are Files?
- Linux, like most operating systems, stores
information that needs to be preserved outside of
the context of any individual process in files.
(In this context, and for most of this Workbook,
the term file is meant in the sense of regular
file.) Linux (and Unix) files store information
using a simple model: information is stored as a
single, ordered array of bytes, starting at the
first and ending at the last. The number of bytes
in the array is the length of the file.
- What type of information is stored in files? Here
are but a few examples.
- The characters that compose the book report you
want to store until you can come back and finish
it tomorrow are stored in a file called (say)
/bookreport.txt.
- The individual colors that make up the picture
you took with your digital camera are stored in
the file (say)
/mnt/camera/dcim/100nikon/dscn1203.jpg.
- The characters which define the usernames of
users on a Linux system (and their home
directories, etc.) are stored in the file
/etc/passwd.
- The specific instructions which tell an x86
compatible CPU how to use the Linux kernel to
list the files in a given directory are stored in
the file /bin/ls.
4What is a Byte?
- At the lowest level, computers can only answer
one type of question: is it on or off? What is
"it"? When dealing with disks, it is a magnetic
domain which is oriented up or down. When dealing
with memory chips, it is a transistor which
either has current or doesn't. Both of these are
too difficult to mentally picture, so we will
speak in terms of light switches that can either
be on or off. To your computer, the contents of
your file are reduced to what can be thought of as
an array of (perhaps millions of) light switches.
Each light switch can be used to store one bit of
information (is it on, or is it off).
- Using a single light switch, you cannot store
much information. To be more useful, an early
convention was established: group the light
switches into bunches of 8. Each series of 8
light switches (or magnetic domains, or
transistors, ...) is a byte. More formally, a
byte consists of 8 bits. Each permutation of ons
and offs for a group of 8 switches can be
assigned a number. All switches off, we'll assign
0. Only the first switch on, we'll assign 1; only
the second switch on, 2; the first and second
switch on, 3; and so on. How many numbers will it
take to label each possible permutation for 8
light switches? A mathematician will quickly tell
you the answer is 2^8, or 256. After grouping the
light switches into groups of eight, your
computer views the contents of your file as an
array of bytes, each with a value ranging from 0
to 255.
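- The switch-labelling scheme can be sketched in a few lines of Python (used here purely for illustration). The helper name switches_to_number is hypothetical, and the sketch assumes the "first" switch is the one worth 1, as in the numbering above.

```python
def switches_to_number(switches):
    """Label a group of on/off switches with an integer.

    switches[0] is the "first" switch (worth 1), switches[1] the
    second (worth 2), the third is worth 4, and so on.
    """
    return sum(1 << position for position, on in enumerate(switches) if on)


all_off = [False] * 8
first_on = [True] + [False] * 7
second_on = [False, True] + [False] * 6
first_and_second = [True, True] + [False] * 6
all_on = [True] * 8

print(switches_to_number(all_off))           # 0
print(switches_to_number(first_on))          # 1
print(switches_to_number(second_on))         # 2
print(switches_to_number(first_and_second))  # 3
print(switches_to_number(all_on))            # 255, the largest byte value
print(2 ** 8)                                # 256 distinct permutations
```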
5Data Encoding
- In order to store information as a series of
bytes, the information must be somehow converted
into a series of values ranging from 0 to 255.
Converting information into such a format is
called data encoding. What's the best way to do
it? There is no single best way that works for
all situations. Developing the right technique to
encode data, which balances the goals of
simplicity, efficiency (in terms of CPU
performance and on disk storage), resilience to
corruption, etc., is much of the art of computer
science.
- As one example, consider the picture taken by a
digital camera mentioned above. One encoding
technique would divide the picture into pixels
(dots), and for each pixel, record three bytes of
information: the pixel's "redness", "greenness",
and "blueness", each on a scale of 0 to 255. The
first three bytes of the file would record the
information for the first pixel, the second three
bytes the second pixel, and so on. A picture
format known as "PNM" does just this (plus some
header information, such as how many pixels are
in a row). Many other encoding techniques for
images exist, some just as simple, many much more
complex.
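- A minimal Python sketch of the three-bytes-per-pixel idea follows. The function names and the three sample pixels are hypothetical, and the header that a real PNM file carries is deliberately left out.

```python
def encode_pixels(pixels):
    """Flatten (red, green, blue) tuples into one sequence of bytes."""
    data = bytearray()
    for red, green, blue in pixels:
        # bytes() rejects any value outside the range 0..255
        data += bytes([red, green, blue])
    return bytes(data)


def decode_pixels(data):
    """Regroup a byte sequence into (red, green, blue) tuples."""
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]


pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # red, green, blue dots
raw = encode_pixels(pixels)

print(len(raw))            # 9: three bytes for each of three pixels
print(list(raw[:3]))       # [255, 0, 0], the first pixel
print(decode_pixels(raw))  # the original list of tuples
```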
6Text Encoding
- Perhaps the most common type of data which
computers are asked to store is text. As
computers have developed, a variety of techniques
for encoding text have been developed, from the
simple in concept (which could encode only the
Latin alphabet used in Western languages) to
complicated but powerful techniques that attempt
to encode all forms of human written
communication, even attempting to include
historical languages such as Egyptian
hieroglyphics. The following sections discuss
many of the encoding techniques commonly used in
Red Hat Enterprise Linux.
7ASCII
- One of the oldest, and still most commonly used,
techniques for encoding text is called ASCII
encoding. ASCII encoding simply takes the 26
lowercase and 26 uppercase letters which compose
the Latin alphabet, 10 digits, and common English
punctuation characters (those found on a
keyboard), and maps each of them to an integer
between 0 and 127, as outlined in the following
table.
8ASCII
- What about the integers 0 - 32? These integers
are mapped to special keys on early teletypes,
many of which have to do with manipulating the
spacing on the page being typed on. The following
characters are commonly called "whitespace"
characters.
9ASCII
- Others of the first 32 integers are mapped to
keys which did not directly influence the
"printed page", but instead sent "out of band"
control signals between two teletypes. Many of
these control signals have special
interpretations within Linux (and Unix).
10Generating Control Characters from the Keyboard
- Control and whitespace characters can be
generated from the terminal keyboard directly
using the CTRL key. For example, an audible bell
can be generated using CTRL+G, while a backspace
can be sent using CTRL+H, and we have already
mentioned that CTRL+D is used to generate an "End
of File" (or "End of Transmission"). Can you
determine how the whitespace and control
characters are mapped to the various CTRL key
combinations? For example, what CTRL key
combination generates a tab? What does CTRL+J
generate? As you explore various control
sequences, remember that the reset command will
restore your terminal to sane behavior, if
necessary.
- A tab can be generated with CTRL+I, while CTRL+J
will generate a line feed (akin to hitting the
RETURN key). In general, CTRL+A will generate
ASCII character 1, CTRL+B will generate ASCII
character 2, and so on.
- What about the values 128-255? ASCII encoding
does not use them. The ASCII standard only
defines the first 128 values of a byte, leaving
the remaining 128 values to be defined by other
schemes.
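- The rule "CTRL with the Nth letter gives ASCII character N" can be checked with a short Python sketch. The helper name control_code is hypothetical; the sketch only does the arithmetic and does not read the keyboard.

```python
def control_code(letter):
    """ASCII value generated by holding CTRL while typing a letter."""
    return ord(letter.upper()) - ord("A") + 1


for letter in "ADGHIJ":
    print(f"CTRL+{letter} -> ASCII {control_code(letter)}")

# CTRL+I and CTRL+J land on the tab and line feed characters.
print(control_code("I") == ord("\t"))  # True
print(control_code("J") == ord("\n"))  # True
```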
11ISO 8859 and Other Character Sets
- Other standard encoding schemes have been
developed, which map various glyphs (such as the
symbols for the Yen and Euro), diacritical marks
found in many European languages, and non-Latin
alphabets to the latter 128 values of a byte
which the ASCII standard leaves undefined. These
standard encoding schemes are referred to as
character sets. The following table lists some
character sets which are supported in Linux,
including their informal name, formal name, and a
brief description.
12ISO 8859 and Other Character Sets
- Notice a couple of implications about ISO 8859
encoding.
- Each of the alternate encodings maps a single
glyph to a single byte, so that the number of
letters encoded in a file equals the number of
bytes which are required to encode them.
- Choosing a particular character set extends the
range of characters that can be encoded, but you
cannot encode characters from different character
sets simultaneously. For example, you could not
encode both a Latin capital A with a grave and a
Greek letter Delta simultaneously.
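- Both implications can be observed with Python's built-in codecs, shown here only as an illustration. The sketch assumes ISO 8859-1 (Latin-1) for the A with a grave and ISO 8859-7 (Greek) for the Delta.

```python
a_grave = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE
delta = "\u0394"    # GREEK CAPITAL LETTER DELTA

# One glyph maps to exactly one byte in its own character set.
latin_byte = a_grave.encode("iso-8859-1")
greek_byte = delta.encode("iso-8859-7")
print(len(latin_byte), len(greek_byte))  # 1 1

# The same byte value stands for a different glyph in another set:
# the Greek Delta byte reads as an A with a diaeresis in Latin-1.
misread = greek_byte.decode("iso-8859-1")
print(misread == "\u00c4")  # True

# Neither character set can hold both characters at once.
try:
    (a_grave + delta).encode("iso-8859-1")
    both_fit = True
except UnicodeEncodeError:
    both_fit = False
print(both_fit)  # False
```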
13Unicode (UCS)
- In order to overcome the limitations of ASCII and
ISO 8859 based encoding techniques, a Universal
Character Set has been developed, commonly
referred to as UCS, or Unicode. The Unicode
standard acknowledges the fact that one byte of
information, with its ability to encode 256
different values, is simply not enough to encode
the variety of glyphs found in human
communication. Instead, the Unicode standard, in
its simplest form (UCS-4), uses 4 bytes to encode
each character. Think of 4
bytes as 32 light switches. If we were to again
label each permutation of on and off for 32
switches with integers, the mathematician would
tell you that you would need 4,294,967,296 (over
4 billion) integers. Thus, a 4-byte encoding has
room for over 4 billion glyphs (nearly enough for
every person on the earth to have their own unique
glyph; the user prince would approve).
14Unicode (UCS)
- What are some of the features and drawbacks of
Unicode encoding?
- Scale
- The Unicode standard will easily be able to
encode the variety of glyphs used in human
communication for a long time to come.
- Simplicity
- The Unicode standard does have the simplicity of
a sledgehammer. The number of bytes required to
encode a set of characters is simply the number
of characters multiplied by 4.
- Waste
- While the Unicode standard is simple in concept,
it is also very wasteful. The ability to encode 4
billion glyphs is nice, but in reality, much of
the communication that occurs today uses less
than a few hundred glyphs. Of the 32 bits (light
switches) used to encode each character, the
first 20 or so would always be "off".
- ASCII Non-compatibility
- For better or for worse, a huge amount of
existing data is already ASCII encoded. In order
to convert fully to Unicode, that data, and the
programs that expect to read it, would have to be
converted.
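- The "multiply by 4" rule and the resulting waste can be seen in a short Python sketch. It assumes Python's utf-32-be codec as a stand-in for a fixed 4-byte encoding; the sample word is arbitrary.

```python
text = "Linux"

ascii_bytes = text.encode("ascii")
# utf-32-be stores every character in 4 bytes, most significant byte first,
# and does not prepend a byte-order mark.
four_byte = text.encode("utf-32-be")

print(len(ascii_bytes))  # 5
print(len(four_byte))    # 20, the number of characters multiplied by 4
print(four_byte[:4])     # b'\x00\x00\x00L': three "all off" bytes, then ASCII L
```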
15Unicode Transformation Format (UTF-8)
- UTF-8 encoding attempts to balance the
flexibility of Unicode, and the practicality and
pervasiveness of ASCII, with a significant
sacrifice: variable length encoding. With
variable length encoding, each character is no
longer encoded using simply 1 byte, or simply 4
bytes. Instead, the traditional 128 ASCII
characters are encoded using 1 byte (and, in
fact, are identical to the existing ASCII
standard). The next most commonly used 2000 or so
characters are encoded using two bytes. The next
63000 or so characters are encoded using three
bytes, and the more esoteric characters are
encoded using four bytes (the original design
allowed up to six). Details of
the encoding technique can be found in the
utf-8(7) man page. With full backwards
compatibility to ASCII, and the same functional
range of pure Unicode, what is there to lose? ISO
8859 (and similar) character set compatibility.
- UTF-8 attempts to bridge the gap between ASCII,
which can be viewed as the primitive days of text
encoding, and Unicode, which can be viewed as the
utopia to aspire toward. Unfortunately, the
"intermediate" methods, the ISO 8859 and other
alternate character sets, are as incompatible
with UTF-8 as they are with each other.
- Additionally, the simple relationship between the
number of characters that are being stored and
the amount of space (measured in bytes) it takes
to store them is lost. How much space will it
take to store 879 printed characters? If they are
pure ASCII, the answer is 879. If they are Greek
or Cyrillic, the answer is closer to twice that
much.
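- The variable length of UTF-8 can be measured directly. The Python sketch below uses a handful of arbitrarily chosen sample characters, and models the 879-character question with a string made only of Greek letters (no spaces or punctuation), which is an assumption of the example.

```python
samples = [
    ("A", "ASCII letter"),
    ("\u0394", "Greek capital Delta"),
    ("\u20ac", "Euro sign"),
    ("\U0001F600", "emoji outside the first 65536 characters"),
]

sizes = [len(char.encode("utf-8")) for char, _ in samples]
for (char, description), size in zip(samples, sizes):
    print(f"{description}: {size} byte(s)")

ascii_text = "a" * 879
greek_text = "\u03b1" * 879  # 879 copies of the Greek letter alpha

print(len(ascii_text.encode("utf-8")))  # 879
print(len(greek_text.encode("utf-8")))  # 1758, exactly twice as much
```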
16Text Encoding and the Open Source Community
- In the traditional development of operating
systems, decisions such as what type of character
encoding to use can be made centrally, with the
possible disadvantage that the decision is wrong
for some community of the operating system's
users. In contrast, in the open source
development model, these types of decisions are
generally made by individuals and small groups of
contributors. The advantage of the open source
model is a flexible system which can accommodate
a wide variety of encoding formats. The
disadvantage is that users must often be educated
and made aware of the issues involved with
character encoding, because some parts of the
assembled system use one technique while other
parts use another. The library of man pages is an
excellent example.
17Text Encoding and the Open Source Community
- When contributors to the open source community
are faced with decisions involving potentially
incompatible formats, they generally balance
local needs with an appreciation for adhering to
widely accepted standards where appropriate. The
UTF-8 encoding format seems to be evolving as an
accepted standard, and in recent releases has
become the default for Red Hat Enterprise Linux.
- The following paragraph, extracted from the
utf-8(7) man page, says it well
18Internationalization (i18n)
- As this Workbook continues to discuss many tools
and techniques for searching, sorting, and
manipulating text, the topic of
internationalization cannot be avoided. In the
open source community, internationalization is
often abbreviated as i18n, a shorthand for saying
"i-n with 18 letters in between". Applications
which have been internationalized take into
account different languages. In the Linux (and
Unix) community, most applications look for the
LANG environment variable to determine which
language to use.
- At the simplest, this implies that programs will
emit messages in the user's native language.
- More subtly, the choice of a particular language
has implications for sorting orders, numeric
formats, text encoding, and other issues.
19The LANG environment variable
- The LANG environment variable is used to define a
user's language, and possibly the default
encoding technique as well. The variable is
expected to be set to a string made up of a
language code, optionally followed by an
underscore and a country code, and a period and a
code set (for example, en_US.UTF-8).
- The locale command can be used to examine your
current configuration (as can echo $LANG), while
locale -a will list all settings currently
supported by your system. The extent of the
support for any given language will vary.
20The LANG environment variable
- The following tables list some selected language
codes, country codes, and code set
specifications.
21Revisiting cat, head, and tail
- Revisiting cat
- We have been using the cat command to simply
display the contents of files. Usually, the cat
command generates a faithful copy of its input,
without performing any edits or conversions. When
called with one of the following command line
switches, however, the cat command will indicate
the presence of tabs, line feeds, and other control
sequences, using the following conventions.
- Using the -A command line switch, the whitespace
structure of the file becomes evident, as tabs
are replaced with ^I, and line feeds are
decorated with $. E.g. cat -A /etc/hosts
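- The two conventions just described can be imitated in Python. This is only a rough sketch of what cat -A does for tabs and line feeds, not a reimplementation; the function name show_all and the sample line (written in the style of /etc/hosts) are hypothetical.

```python
def show_all(data):
    """Mark tabs as ^I and the end of each line with $, as cat -A would."""
    text = data.decode("ascii")
    return text.replace("\t", "^I").replace("\n", "$\n")


sample = b"127.0.0.1\tlocalhost\n"
marked = show_all(sample)
print(marked, end="")  # 127.0.0.1^Ilocalhost$
```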
22Revisiting head and tail
- The head and tail commands have been used to
display the first or last few lines of a file,
respectively. But what makes a line? Imagine
yourself working at a typewriter: click! clack!
click! clack! clack! ziiing! Instead of the
ziiing! of the typewriter carriage at the end of
each line, the line feed character (ASCII 10) is
chosen to mark the end of lines.
- Unfortunately, a common convention for how to
mark the end of a line is not shared among the
dominant operating systems in use today. Linux
(and Unix) uses the line feed character (ASCII
10, often represented \n), while classic Macintosh
operating systems (before Mac OS X) use the
carriage return character (ASCII 13, often
represented \r or ^M), and Microsoft operating
systems use a carriage return/line feed pair
(ASCII 13, ASCII 10).
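- The effect of the three conventions can be seen by counting line feed bytes, which is how Linux tools decide where lines end. The Python sketch below uses a made-up two-line text stored three ways.

```python
unix_text = b"one\ntwo\n"      # line feed only
mac_text = b"one\rtwo\r"       # carriage return only
dos_text = b"one\r\ntwo\r\n"   # carriage return/line feed pair

for label, data in [("unix", unix_text), ("mac", mac_text), ("dos", dos_text)]:
    # A tool that only looks for ASCII 10 sees no lines at all in mac_text.
    print(label, "line feeds:", data.count(b"\n"), "bytes:", len(data))

# bytes.splitlines() accepts all three conventions and recovers the same lines.
print(unix_text.splitlines() == mac_text.splitlines() == dos_text.splitlines())
```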
23Revisiting head and tail
- For example, the following file contains a list
of four musicians.
- Linux (and Unix) text files generally adhere to a
convention that the last character of the file
must be a line feed for the last line of text.
Following the cat of the file musicians.mac,
which does not contain any conventional Linux
line feed characters, the bash prompt is not
displayed in its usual location.
24Revisiting head and tail
25The wc (Word Count) Command
- Counting Made Easy
- Have you ever tried to answer a "25 words or
less" quiz? Did you ever have to write a
1500-word essay?
- With wc you can easily verify that your
contribution meets the criteria.
- The wc command counts the number of characters,
words, and lines. It will take its input either
from files named on its command line or from its
standard input. Below is the command line form
for the wc program
26The wc (Word Count) Command
- When used without any command line switches, wc
will report on the number of characters, lines,
and words. Command line switches can be combined
to return any combination of character count,
line count or word count.
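- The three counts can be approximated in a few lines of Python. This is a sketch of the idea rather than of the wc program itself: it counts line feed bytes, whitespace-separated words, and bytes, and the sample text is made up.

```python
def count_text(data):
    """Return (lines, words, bytes) for a sequence of bytes."""
    lines = data.count(b"\n")  # a line ends with a line feed
    words = len(data.split())  # words are separated by whitespace
    size = len(data)           # one byte per ASCII character
    return lines, words, size


sample = b"the quick brown fox\njumps over\n"
print(count_text(sample))  # (2, 6, 31)
```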
27How To Recognize A Real Character
- Text files are composed using an alphabet of
characters. Some characters are visible, such as
numbers and letters. Some characters are used for
horizontal distance, such as spaces and TAB
characters. Some characters are used for vertical
movement, such as carriage returns and line
feeds.
- A line in a text file is a series of any
character other than a NEWLINE (line feed)
character and then a NEWLINE character.
Additional lines in the file immediately follow
the first line.
- While a computer represents characters as
numbers, the exact value used for each symbol
varies depending on which alphabet has been
chosen. The most common alphabet for English
speakers is ASCII (Latin-1, also called ISO
8859-1, is a superset of it).
Different human languages are represented by
different computer encoding rules, so the exact
numeric value for a given character depends on
the human language being recorded.
28So, What Is A Word?
- A word is a group of printing characters, such as
letters and digits, surrounded by white space,
such as space characters or horizontal TAB
characters.
- Notice that our definition of a word does not
include any notion of meaning. Only the form of
the word is important, not its semantics. As far
as Linux is concerned, a line such as
29Questions: Chapter 1. Text Encoding and Word Counting
30Chapter 2. Finding Text: grep
- Key Concepts
- grep is a command that prints lines that match a
specified text string or pattern.
- grep is commonly used as a filter to reduce
output to only desired items.
- grep -r will recursively grep files underneath a
given directory.
- grep -v prints lines that do NOT match a
specified text string or pattern.
- Many other command line switches allow users to
specify grep's output format.
31Searching Text File Contents using grep
- In an earlier Lesson, we saw how the wc program
can be used to count the characters, words and
lines in text files. In this Lesson we introduce
the grep program, a handy tool for searching text
file contents for specific words or character
sequences.
- The name grep comes from the ed editor command
g/re/p, which globally searches for a regular
expression and prints the matching lines. What,
you may well ask, is a regular expression and why
on earth should I want to search for one? We will
provide a more formal
definition of regular expressions in a later
Lesson, but for now it is enough to know that a
regular expression is simply a way of describing
a pattern, or template, to match some sequence of
characters. A simple regular expression would be
Hello, which matches exactly five characters:
H, e, two consecutive l characters, and a
final o. More powerful search patterns are
possible and we shall examine them in the next
section.
- The figure below gives the general form of the
grep command line
32Searching Text File Contents using grep
- There are actually three different names for the
grep tool:
- fgrep: Does a fast search for simple patterns. Use
this command to quickly locate patterns without
any wildcard characters, useful when searching
for an ordinary word.
- grep: Pattern searches using ordinary regular
expressions.
- egrep: Pattern searches using more powerful
extended regular expressions.
- The pattern argument supplies the template
characters for which grep is to search. The
pattern is expected to be a single argument, so
if pattern contains any spaces, or other
characters special to the shell, you must enclose
the pattern in quotes to prevent the shell from
expanding or word splitting it.
33Searching Text File Contents using grep
- The following table summarizes some of grep's
more commonly used command line switches. Consult
the grep(1) man page (or invoke grep --help) for
more.
34Show All Occurrences of a String in a File
- Under Linux, there are often several ways of
accomplishing the same task. For example, to see
if a file contains the word even, you could
just visually scan the file
- Reading the file, we see that the file does
indeed contain the letters even. This method
suffers on a large file because we could
easily miss one word in a file of several
thousand, or even several hundred thousand,
words. We can use the grep tool to search through
the file for us in an automatic search
- Here we searched for a word using its exact
spelling. Instead of just a literal string, the
pattern argument can also be a general template
for matching more complicated character
sequences; we shall explore that in a later
Lesson.
35Searching in Several Files at Once
- An easy way to search several files is just to
name them on the grep command line
- Perhaps we are more interested in just
discovering which file mentions the word nine
than actually seeing the line itself. Adding the
-l switch to the grep line does just that
36Searching Directories Recursively
- grep can also search all the files in a whole
directory tree with a single command. This can be
handy when working with a large number of files.
- The easiest way to understand this is to see it
in action. In the directory /etc/sysconfig are
text files that contain much of the configuration
information about a Linux system. The Linux name
for the first Ethernet network device on a system
is eth0, so you can find which file contains
the configuration for eth0 by letting the grep -r
command do the searching for you
37Searching Directories Recursively
- Every file in /etc/sysconfig that mentions eth0
is shown in the results.
- We can further limit the files listed to only
those referring to an actual device by filtering
the grep -r output through a grep DEVICE
- This shows a common use of grep as a filter to
simplify the outputs of other commands.
- If only the names of the files were of interest,
the output can be simplified with the -l command
line switch.
38Inverting grep
- By default, grep shows only the lines matching
the search pattern. Usually, this is what you
want, but sometimes you are interested in the
lines that do not match the pattern. In these
instances, the -v command line switch inverts
grep's operation.
39Getting Line Numbers
- Often you may be searching a large file that has
many occurrences of the pattern. grep will list
each line containing one or more matches, but how
is one to locate those lines in the original
file? Using the grep -n command will also list
the line number of each matching line.
- The file /usr/share/dict/words contains a list of
common dictionary words. Identify which line
contains the word dictionary
- You might also want to combine the -n switch with
the -r switch when searching all the files below
a directory
40Limiting Matching to Whole Words
- Remember the file containing our nursery rhyme
earlier?
- Suppose we wanted to retrieve all lines
containing the word at. If we try the command
- Do you see what happened? We matched the at
string, whether it was an isolated word or part
of a larger word. The grep command provides the
-w switch to imply that the specified pattern
should only match entire words.
- The -w switch considers a sequence of letters,
numbers, and underscore characters, surrounded by
anything else, to be a word.
41Ignoring Case
- The string Bob has a meaning quite
different from the string bob. However,
sometimes we want to find either one, regardless
of whether the word is capitalized or not. The
grep -i command solves just this problem.
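- The behavior of the -v, -n, -w, and -i switches can be imitated with a small Python sketch. This is an approximation written for illustration, not how grep is implemented: the function name mini_grep and the sample lines are hypothetical, and whole-word matching is modeled with lookarounds on Python's \w class (letters, digits, and underscore).

```python
import re


def mini_grep(pattern, lines, invert=False, whole_word=False, ignore_case=False):
    """Yield (line_number, line) pairs for the selected lines.

    invert imitates -v, whole_word imitates -w, ignore_case imitates -i,
    and the line numbers are what -n would print.
    """
    if whole_word:
        # The match must not touch a letter, digit, or underscore on either side.
        pattern = r"(?<!\w)(?:" + pattern + r")(?!\w)"
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    for number, line in enumerate(lines, start=1):
        matched = regex.search(line) is not None
        if matched != invert:
            yield number, line


lines = ["The cat sat", "at the door", "Bob looked at it", "bob left"]

print([n for n, _ in mini_grep("at", lines)])                     # [1, 2, 3]
print([n for n, _ in mini_grep("at", lines, whole_word=True)])    # [2, 3]
print([n for n, _ in mini_grep("at", lines, invert=True)])        # [4]
print([n for n, _ in mini_grep("bob", lines)])                    # [4]
print([n for n, _ in mini_grep("bob", lines, ignore_case=True)])  # [3, 4]
```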
42Examples: Finding Simple Character Strings
- Verify that your computer has the system account
lp, used for the line printer tools. Hint: the
file /etc/passwd contains one line for each user
account on the system.
43Questions: Chapter 2. Finding Text: grep
44Chapter 3. Introduction to Regular Expressions
- Key Concepts
- Regular expressions are a standard Unix syntax
for specifying text patterns.
- Regular expressions are understood by many
commands, including grep, sed, vi, and many
scripting languages.
- Within regular expressions, . and [] are used to
match characters.
- Within regular expressions, +, *, and ? specify a
number of consecutive occurrences.
- Within regular expressions, ^ and $ specify the
beginning and end of a line.
- Within regular expressions, (, ), and | specify
alternative groups.
- The regex(7) man page provides complete details.
45Introducing Regular Expressions
- In the previous chapter you saw grep used to
match either a whole word or part of a word. This
by itself is very powerful, especially in
conjunction with arguments like -i and -v, but it
is not appropriate for all search scenarios. Here
are some examples of searches that the grep usage
you've learned so far would not be able to do.
- First, suppose you had a file that looked like
this
46Introducing Regular Expressions
- What if you wanted to pull out just the names of
the people in people_and_pets.txt? A command like
grep -w Name would match the 'Name' line for
each person, but also the 'Name' line for each
person's pet. How could we match only the 'Name'
lines for people? Well, notice that the lines for
pets' names are all indented, meaning that those
lines begin with whitespace characters instead of
text. Thus, we could achieve our goal if we had a
way to say "Show me all lines that begin with
'Name'".
- Another example: suppose you and a friend both
witnessed a hit-and-run car accident. You both
got a look at the fleeing car's license plate and
yet each of you recalls a slightly different
number. You read the license number as "4I35VBB"
but your friend read it as "413SV88". It seems
that what you read as an 'I' in the second
character, your friend read as a '1'. Similar
differences appear in your interpretations of
other parts of the license like '5' vs 'S' and
'BB' vs '88'. The police, having taken both of
your statements, now need to narrow down the
suspects by querying their database of license
plates for plates that might match what you saw.
47Introducing Regular Expressions
- One solution might be to do separate queries for
"4I35VBB" and "413SV88" but doing so assumes that
one of you is exactly right. What if the
perpetrator's license number was actually
"4135VB8"? In other words, what if you were right
about some of the characters in question but your
friend was right about others? It would be more
effective if the police could query for a pattern
that effectively said "Show me all license
numbers that begin with a '4', followed by an 'I'
or a '1', followed by a '3', followed by a '5' or
an 'S', followed by a 'V', followed by two
characters that are each either a 'B' or an '8'".
- Query scenarios like these can be solved using
regular expressions. While computer scientists
sometimes use the term "regular expression" (or
"regex" for short) to describe any method of
describing complex patterns, in Linux and many
programming languages the term refers to a very
specific set of special characters used for
solving problems like the above. Regular
expressions are supported by a large number of
tools including grep, vi, find and sed.
48Introducing Regular Expressions
- To introduce the usage of regular expressions,
let's look at some solutions to the two problems
introduced earlier. Don't worry if these seem a
bit complicated; the remainder of the unit will
start from scratch and cover regular expressions
in great detail.
- A regex that could solve the first problem, where
we wanted to say "Show me all lines that begin
with 'Name'" might look like this: '^Name'
- ...that's it! Regular expressions are all about
the use of special characters, called
metacharacters, to represent advanced query
parameters. The caret ("^"), as shown here, means
"Lines that begin with...". Note, by the way,
that the regular expression was put in
single-quotes. This is a good habit to get into
early on as it prevents bash from interpreting
special characters that were meant for grep.
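- The effect of the caret can be tried out in Python, whose re module treats ^ the same way for this simple case. The four lines below are a hypothetical stand-in for people_and_pets.txt, with the pet lines indented as described earlier.

```python
import re

# Hypothetical file contents: each pet's line is indented under its owner.
lines = [
    "Name: Alice",
    "    Name: Rex",
    "Name: Carlos",
    "    Name: Whiskers",
]

# Without the anchor, every line mentioning Name is selected.
anywhere = [line for line in lines if re.search(r"Name", line)]

# With the caret, only lines that begin with Name are selected.
people = [line for line in lines if re.search(r"^Name", line)]

print(len(anywhere))  # 4
print(people)         # ['Name: Alice', 'Name: Carlos']
```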
49Introducing Regular Expressions
- Ok, so what about the second problem? That one
involved a much more complicated query: "Show me
all license numbers that begin with a '4',
followed by an 'I' or a '1', followed by a '3',
followed by a '5' or an 'S', followed by a 'V',
followed by two characters that are each either a
'B' or an '8'". This could be represented by a
regular expression that looks like this:
'4[I1]3[5S]V[B8]{2}'
- Wow, that's pretty short considering how long it
took to write out what we were looking for! There
are only two types of regex metacharacters used
here: square braces ('[]') and curly braces
('{}'). When two or more characters are shown
within square braces it means "any one of these".
So '[B8]' near the end of the expression means
"'B' or '8'". When a number is shown within curly
braces it means "this many of the preceding
character". Thus, '[B8]{2}' means "two characters
that are each either a 'B' or an '8'". Pretty
powerful stuff!
- Now that you've gotten a taste of what regular
expressions are and how they can be used, let's
start from scratch and cover them in depth.
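- The license plate query can be tested in Python, which uses the same bracket and brace syntax. The pattern below is built directly from the spoken description; the list of candidate plates is made up, apart from the three plates mentioned in the story.

```python
import re

# '4', then I or 1, then '3', then 5 or S, then 'V', then two of B or 8
plate = re.compile(r"4[I1]3[5S]V[B8]{2}")

candidates = [
    "4I35VBB",  # what you read
    "413SV88",  # what your friend read
    "4135VB8",  # a mix of both readings
    "4X35VBB",  # second character is neither I nor 1
    "4135VB9",  # last character is neither B nor 8
]

suspects = [c for c in candidates if plate.fullmatch(c)]
print(suspects)  # ['4I35VBB', '413SV88', '4135VB8']
```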
50Regular Expressions, Extended Regular
Expressions, and the grep Command
- As the Unix implementation of regular expression
syntax has evolved, new metacharacters have been
introduced. In order to preserve backward
compatibility, commands usually choose to
implement either basic regular expressions or
extended regular expressions. In order to not become
bogged down with the differences, this Lesson
will introduce the extended syntax, summarizing
differences at the end of the discussion.
- One of the most common uses for regular
expressions is specifying search patterns for the
grep command. As was mentioned in the previous
Lesson, there are three versions of the grep
command. Reiterating, the three differ in how
they interpret regular expressions.
51Regular Expressions, Extended Regular
Expressions, and the grep Command
- fgrep
- The fgrep command is designed to be a "fast"
grep. The fgrep command does not support regular
expressions, but instead interprets every
character in the specified search pattern
literally.
- grep
- The grep command interprets each pattern using
the original, basic regular expression syntax.
- egrep
- The egrep command interprets each pattern using
extended regular expression syntax.
- Because we are not yet making a distinction
between the basic and extended regular expression
syntax, the egrep command should be used whenever
the search pattern contains regular expressions.
52Anatomy of a Regular Expression
- In our discussion of the grep program family, we
were introduced to the idea of using a pattern to
identify the file content of interest. Our
examples were carefully constructed so that the
pattern contained exactly the text for which we
were searching. We were careful to use only
literal characters in our regular expressions; a
literal character matches only itself. So when we
used hello as the regular expression, we were
using a five-character regular expression
composed only of literal characters. While this
let us concentrate on learning how to operate the
grep program, it didn't allow us to get a full
appreciation of the power of regular expressions.
Before we see regular expressions in use, we
shall first see how they are constructed.
53Anatomy of a Regular Expression
- A regular expression is a sequence of:
- Literal Characters: Literal characters match only
themselves. Examples of literals are letters,
digits and most special characters (see below for
the exceptions).
- Wildcards: Wildcard characters match any
character. Within a regular expression, a period
(.) matches any character, be it a space, a
letter, a digit, punctuation, anything.
- Modifiers: A modifier alters the meaning of the
immediately preceding pattern character. For
example, the expression ab*c matches the
strings ac, abc, abbc, abbbc, and so on,
because the asterisk (*) is a modifier that
means "any number of (including zero)". Thus, our
pattern means to match any sequence of characters
consisting of one a, a (possibly empty) series
of b characters, and a final c character.
- Anchors: Anchors establish the context for the
pattern, such as "the beginning of a line", or
"the end of a word". For example, the expression
cat would match any occurrence of the three
letters, while ^cat would only match lines that
begin with cat.
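- As a rough sketch of the modifier and anchor
just described, the patterns ab*c, cat and ^cat can
be tried against a few made-up lines (the sample
words are assumptions chosen for illustration).

```shell
# ab*c: one a, any number of b characters (including none), then one c.
star=$(printf 'ac\nabc\nabbbc\nadc\n' | grep -E 'ab*c')

# cat matches anywhere on a line; ^cat matches only at the start of a line.
anywhere=$(printf 'catalog\nbobcat\ndog\n' | grep -E 'cat')
anchored=$(printf 'catalog\nbobcat\ndog\n' | grep -E '^cat')

printf 'ab*c:\n%s\ncat:\n%s\n^cat:\n%s\n' "$star" "$anywhere" "$anchored"
```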
54Taking Literals Literally
- Literals are straightforward because each literal
character in a regular expression matches one,
and only one, copy of itself in the searched
text. Uppercase characters are distinct from
lowercase characters, so that A does not match
a.
- Wildcards
- The "dot" wildcard
- The character . is used as a placeholder, to
match one of any character. For example, the
pattern x..s matches any occurrence of
the literal characters x and s, separated by
exactly two other characters.
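- A minimal sketch of the dot wildcard, assuming
a pattern of x..s and a made-up word list:

```shell
# x..s: a literal x, exactly two characters of any kind, then a literal s.
# "axis" and "taxes" have only one character between x and s, so they fail.
dot_hits=$(printf 'axis\nexits\nboxers\ntaxes\n' | grep -E 'x..s')

printf '%s\n' "$dot_hits"
```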
55Bracket Expressions Ranges of Literal
Characters
- Normally a literal character in a regex pattern
matches exactly one occurrence of itself in the
searched text. Suppose we want to search for the
string hello regardless of how it is
capitalized: we want to match Hello and HeLLo
as well. How might we do that?
- A regex feature called a bracket expression
solves this problem neatly. A bracket expression
is a range of literals enclosed in square
brackets ([ and ]). For example, the regex
pattern [Hh] is a character range that matches
exactly one character: either an uppercase H or
a lowercase h letter. Notice that no matter how
large the set of characters within the
range is, the set matches exactly one character,
if it matches any at all. A bracket expression
that matches the set of lowercase vowels could be
written [aeiou] and would match exactly one
vowel.
- In the following example, bracket expressions are
used to find words from the file
/usr/share/dict/words. In the first case, the
first five words that contain three consecutive
(lowercase) vowels are printed. In the second
case, the first five words that contain lowercase
letters in the pattern of
vowel-consonant-vowel-consonant-vowel-consonant
are printed.
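- Bracket expressions can be sketched on a small
stand-in word list (the words are assumptions;
/usr/share/dict/words is not used here so that the
result does not depend on the installed dictionary).

```shell
# One bracket expression per letter matches hello in any capitalization.
hello_hits=$(printf 'hello\nHello\nHeLLo\nhelp\n' | grep -E '[Hh][Ee][Ll][Ll][Oo]')

# Three bracket expressions in a row: three consecutive lowercase vowels.
vowel_hits=$(printf 'beautiful\nqueue\nbanana\nradio\n' | grep -E '[aeiou][aeiou][aeiou]')

printf 'hello variants:\n%s\nthree vowels:\n%s\n' "$hello_hits" "$vowel_hits"
```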
56Bracket Expressions Ranges of Literal
Characters
- If the first character of a bracket expression is
a ^, the interpretation is inverted, and the
bracket expression will match any single
occurrence of a character not included in the
range. For example, the expression [^aeiou]
would match any character that is not a
lowercase vowel.
The following example first lists words which
contain three consecutive vowels, and secondly
lists words which contain three consecutive
consonant-vowel pairs.
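- A sketch of the inverted bracket expression,
again on made-up sample words. Note that [^aeiou]
matches any character outside the set, including
digits, punctuation and uppercase letters, not
only consonants.

```shell
# A line matches [^aeiou] if it holds at least one non-vowel character.
non_vowel=$(printf 'aei\nsky\n' | grep -E '[^aeiou]')

# Three consecutive (non-vowel, vowel) pairs.
cv_hits=$(printf 'banana\nqueue\nsolitary\nstrength\n' |
    grep -E '[^aeiou][aeiou][^aeiou][aeiou][^aeiou][aeiou]')

printf 'non-vowel present:\n%s\nthree pairs:\n%s\n' "$non_vowel" "$cv_hits"
```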
57Range Expressions vs. Character Classes Old
School and New School
- Another way to express a character range is by
giving the start- and end-letters of the sequence;
this way, [a-d] would match any character from
the set a, b, c or d. A typical usage of this
form would be [0-9] to represent any single
digit, or [A-Z] to represent all capital
letters.
58Range Expressions vs. Character Classes Old
School and New School
- As an alternative to such quandaries, modern
regular expressions make use of character classes.
Character classes match any single character,
using language specific conventions to decide if
a given character is uppercase or lowercase, or
if it should be considered part of the alphabet
or punctuation. The following table lists some
supported character classes, and the ASCII
equivalent range expression, where appropriate.
59Range Expressions vs. Character Classes Old
School and New School
- Character classes avoid problems you may run into
when using regular expressions on systems that
use different character encoding schemes where
letters are ordered differently. For example,
suppose you were to run a search using the range
expression [A-Z].
- On a Red Hat Enterprise Linux system, this would
match nearly every word in the file, not just
those that contain capital letters as one might
assume. This is because in the default UTF-8
locale that RHEL uses, characters are collated
case-insensitively, so that [A-Z] is
equivalent to AaBbCc...etc.
60Range Expressions vs. Character Classes Old
School and New School
- On older systems, though, a different character
encoding scheme is used where alphabetization is
done case-sensitively. On such systems [A-Z]
would be equivalent to ABC...etc. Character
classes avoid this pitfall. You can search with
the character class [[:upper:]]
on any system regardless of the encoding scheme
being used and it will only match lines that
contain capital letters.
- For more details about the predefined range
expressions, consult the grep manual page. For
more information on character encoding schemes
under Linux, refer back to chapter 8.3. To learn
about how character encoding schemes are used to
support other languages in Red Hat Enterprise
Linux, begin with the locale manual page.
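- A small sketch of character classes, assuming
the POSIX class names [:upper:] and [:digit:] and
a made-up list of words. A class name is written
inside a bracket expression, hence the doubled
brackets.

```shell
words=$(printf 'alpha\nBravo\ncharlie7\nDELTA\n')

# [[:upper:]] matches one uppercase letter, whatever the collating order is.
upper_hits=$(printf '%s\n' "$words" | grep -E '[[:upper:]]')

# [[:digit:]] matches one decimal digit.
digit_hits=$(printf '%s\n' "$words" | grep -E '[[:digit:]]')

printf 'uppercase:\n%s\ndigits:\n%s\n' "$upper_hits" "$digit_hits"
```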
61Common Modifier Characters
- We saw a common usage of a regex modifier in our
earlier example ab*c to match an a and c
character with some number of b letters in
between. The * character changed the
interpretation of the literal b character from
matching exactly one letter to matching any
number of b's.
- Here is a list of some common modifier
characters:
- b? The question mark (?) means "either one or
none": the literal character is considered to be
optional in the searched text. For example, the
regex pattern ab?c matches the strings ac
and abc, but not abbc.
- b* The asterisk (*) modifier means "any number
of (including zero)" of the preceding literal
character. The regex pattern ab*c matches the
strings ac, abc, abbc, and so on.
62Common Modifier Characters
- b+ The plus (+) modifier means "one or more",
so the regex pattern b+ matches a non-empty
sequence of b's. The regex pattern ab+c matches
the strings abc and abbc, but does not match
ac.
- b{m,n} The brace modifier is used to specify a
range of between m and n occurrences of the
preceding character. The regex pattern ab{2,4}c
would match abbc, abbbc, and abbbbc, but
not abc or abbbbbc.
- b{n} With only one integer, the brace modifier is
used to specify exactly n occurrences of the
preceding character.
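- The modifiers listed above can be compared side
by side. This is a sketch only: the word list and
the helper function match_words are made up for
the illustration.

```shell
words=$(printf 'ac\nabc\nabbc\nabbbc\nabbbbbc\n')

# Hypothetical helper: print the words matched by a pattern on one line.
match_words() {
    printf '%s\n' "$words" | grep -E "$1" | paste -s -d ' ' -
}

for pattern in 'ab?c' 'ab*c' 'ab+c' 'ab{2,4}c' 'ab{2}c'; do
    printf '%-10s %s\n' "$pattern" "$(match_words "$pattern")"
done
```

- Each output line shows a pattern followed by the
sample words it selects, which makes the effect of
?, *, + and the brace forms easy to compare.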
63Common Modifier Characters
- In the following example, egrep prints lines from
/usr/share/dict/words that contain patterns which
start with a (capital or lowercase) a, might or
might not next have a (lowercase) b, but then
definitely follow with a (lowercase) a.
- The following example prints lines which contain
patterns which start al, then use the .*
wildcard to specify 0 or more occurrences of any
character, followed by the pattern bra.
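- The two searches can be sketched as follows.
The patterns [Aa]b?a and al.*bra are reconstructed
from the descriptions, and the sample words stand
in for /usr/share/dict/words; both are assumptions.

```shell
# An a in either case, an optional b, then a lowercase a.
optional_b=$(printf 'Aaron\ncabal\nabbey\nbazaar\n' | grep -E '[Aa]b?a')

# al, then any run of characters (possibly empty), then bra.
glue=$(printf 'algebra\ncalibrate\nalbum\nzebra\n' | grep -E 'al.*bra')

printf '[Aa]b?a:\n%s\nal.*bra:\n%s\n' "$optional_b" "$glue"
```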
64Common Modifier Characters
- Notice we found variations on the words algebra
and calibrate. For the former, the .* expression
matched ge, while for the latter, it matched
the letter i.
- The expression .*, which is interpreted as "0
or more of any character", shows up often in
regex patterns, acting as the "stretchable glue"
between two patterns of significance.
- As a subtlety, we should note that the modifier
characters are greedy: they always match the
longest possible input string. For example, given
the regex pattern
65Anchored Searches
- Four additional search modifier characters are
available:
- ^foo A caret (^) matches the beginning of a
line. Our example ^foo matches the string foo
only when it is at the beginning of a line.
- foo$ A dollar sign ($) matches the end of a
line. Our example foo$ matches the string foo
only at the end of a line, immediately before the
newline character.
- \<foo\> By themselves, the less than sign (<)
and the greater than sign (>) are literals.
Using the backslash character to escape them
gives them the meanings "beginning of a word"
and "end of a word", respectively. Thus the
pattern \<cat\> matches the word cat but not
the word catalog.
- You will frequently see both ^ and $ used
together. The regex pattern ^foo$ matches a
whole line that contains only foo and would not
match that line if it contained any spaces.
- The \< and \> are also usually used as pairs.
66Anchored Searches
- In the following example, the first search
lists all lines that contain the letters ion
anywhere on the line. The second search only
lists lines which end in ion.
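- The anchors can be sketched on made-up input
lines. The word anchors \< and \> are assumed to
be available, as they are in GNU grep.

```shell
lines=$(printf 'lion\nionize\nnation\nonion ring\n')

# ion anywhere on the line, then ion only at the end of the line.
any_ion=$(printf '%s\n' "$lines" | grep -E 'ion')
end_ion=$(printf '%s\n' "$lines" | grep -E 'ion$')

# ^foo$ matches only a line consisting of exactly foo.
whole_line=$(printf 'foo\nfoo bar\n foo\n' | grep -E '^foo$')

# \<cat\> matches cat as a whole word, not as part of a longer word.
whole_word=$(printf 'cat\ncatalog\nmy cat naps\nbobcat\n' | grep -E '\<cat\>')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
    "$any_ion" "$end_ion" "$whole_line" "$whole_word"
```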
67Coming to Terms with Regex Grouping
- The same way that you can use parentheses to
group terms within a mathematical expression, you
also use parentheses to collect regular
expression pattern specifiers into groups. This
lets the modifier characters ?, *, and +
apply to groups of regex specifiers instead of
only the immediately preceding specifier.
- Suppose we need a regular expression to match
either foo or foobar. We could write the
regex as foo(bar)? and get the desired results.
This lets the ? modifier apply to the whole
string bar instead of only the preceding r
character.
- Grouping regex specifiers using parentheses
becomes even more flexible when the pipe symbol
(|) is used to separate alternative patterns.
Using alternatives, we could rewrite our previous
example as (foo|foobar). Writing this as
foo|foobar is simpler and works just as well,
because just like mathematics, regex specifiers
have precedence. While you are learning, always
enclose your groups in parentheses.
68Coming to Terms with Regex Grouping
- In the following example, the first search prints
all lines from the file /usr/share/dict/words
which contain four consecutive vowels (compare
the syntax to that used when first introducing
range expressions, above). The second search
finds words that contain a double o or a double
e, followed (somewhere) by a double e.
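- Grouping and alternation can be sketched as
below. The patterns [aeiou]{4} and (oo|ee).*ee are
reconstructed from the descriptions, and the sample
words are made up; ^ and $ are added to the foo
patterns so that only whole lines match.

```shell
# The group (bar) is optional as a unit; both spellings select foo and foobar.
optional_group=$(printf 'foo\nfoobar\nfoob\nfoobarbar\n' | grep -E '^foo(bar)?$')
alternatives=$(printf 'foo\nfoobar\nfoob\nfoobarbar\n' | grep -E '^(foo|foobar)$')

# Four consecutive lowercase vowels.
four_vowels=$(printf 'queue\nsequoia\nbeautiful\n' | grep -E '[aeiou]{4}')

# A double o or double e, followed somewhere later by a double e.
doubles=$(printf 'beekeeper\nfoodie\nvoodoo queen\ncoffee\n' | grep -E '(oo|ee).*ee')

printf '%s\n---\n%s\n---\n%s\n' "$optional_group" "$four_vowels" "$doubles"
```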
69Escaping Meta-Characters
- Sometimes you need to match a character that
would ordinarily be interpreted as a regular
expression wildcard or modifier character. To
temporarily disable the special meaning of these
characters, simply escape them using the
backslash (\) character. For example, the regex
pattern cat. would match the letters cat
followed by any character, as in cats or catchup.
To match only the letters cat. at the end of a
sentence, use the regex pattern cat\. to
disable interpreting the period as a wildcard
character.
- Note one distracting exception to this rule. When
the backslash character precedes a < or >
character, it enables the special interpretation
(anchoring the beginning or ending of a word)
instead of disabling the special interpretation.
Shudder. It gets even worse: see the footnote at
the bottom of the following table.
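- A short sketch of escaping, using made-up input
lines:

```shell
lines=$(printf 'cats\ncatchup\nI saw a cat.\ncat\n')

# Unescaped: the period matches any character, so three lines are selected.
# The bare line "cat" has no character after the t and is not selected.
wildcard=$(printf '%s\n' "$lines" | grep -E 'cat.')

# Escaped: the period must be a literal period.
literal=$(printf '%s\n' "$lines" | grep -E 'cat\.')

printf 'cat. :\n%s\ncat\\. :\n%s\n' "$wildcard" "$literal"
```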
70Summary of Linux Regular Expression Syntax
- The following table summarizes regular expression
syntax, and identifies which components are found
in basic regular expression syntax, and which are
found only in the extended regular expression
syntax.
72Regular Expressions are NOT File Globbing
- When first encountering regular expressions,
students understandably confuse regular
expressions with pathname expansion (file
globbing). Both are used to match patterns in
text. Both share similar metacharacters (*,
?, [...], etc.). However, they are
distinctly different. The following table
compares and contrasts regular expressions and
file globbing.
73Regular Expressions are NOT File Globbing
- In the following example, the first argument is a
regular expression, specifying text which starts
with an l and ends .conf, while the second
argument is a file glob which specifies all files
in the /etc directory whose filename starts with
l and ends .conf.
- Take a close look at the second line of output.
Why was it matched by the specified regular
expression?
- Why does the line containing the text krb5.conf
match the expression? The l is found way back
in the word default!
- In a similar vein, when specifying regular
expressions on the bash command line, care must
be taken to quote or escape the regex
meta-characters, lest they be expanded away by
the bash shell with unexpected results. In all of
the examples found in this discussion, the first
argument to the egrep command is protected with
single quotes for just this reason.
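- The contrast can be sketched without touching
/etc. The scratch directory, the file names and the
sample lines below are all assumptions; the regex
l.*\.conf and the glob l*.conf are reconstructed
from the description.

```shell
workdir=$(mktemp -d)
touch "$workdir/ldap.conf" "$workdir/krb5.conf" "$workdir/lilo.txt"

# File glob: the shell expands l*.conf to names that start with l
# and end with .conf. The whole name must fit the pattern.
globbed=$(cd "$workdir" && ls l*.conf)

# Regular expression: an l, then anything, then .conf, anywhere in the line.
# The single quotes keep the shell from treating * as a glob character.
regexed=$(printf 'ldap.conf\ndefault realm in krb5.conf\nlilo.txt\n' |
    grep -E 'l.*\.conf')

rm -r "$workdir"
printf 'glob:\n%s\nregex:\n%s\n' "$globbed" "$regexed"
```

- The regex also selects the krb5.conf line because
the l in default satisfies the start of the pattern,
whereas the glob must match the entire filename.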
74Where to Find More Information About Regular
Expressions
- We have barely scratched the surface of the
usefulness of regular expressions. The
explanation we have provided will be adequate for
your daily needs, but even so, regular
expressions offer much more power, making even
complicated text searches simple to perform.
- For more online information about regular
expressions, you should check:
- The regex(7) manual page.
- The grep(1) manual page.
75Examples
- Regular Expression Modifiers
76Chapter 4. Everything Sorting: sort and uniq
- Key Concepts
- The sort command sorts data alphabetically.
- sort -n sorts numerically.
- sort -u sorts and removes duplicates.
- sort -k and -t sort on a specific field in
patterned data.
77The sort Command
- Sorting is the process of arranging records into
a specified sequence. Examples of sorting would
be arranging a list of usernames into
alphabetical order, or a set of file sizes into
numeric order.
- In its simplest form, the sort command will
alphabetically sort lines (including any
whitespace or control characters which are
encountered). The sort command uses the current
locale (language definition) to determine the
order of the characters (referred to as the
collating order). In the following example,
madonna first displays the contents of the file
/etc/sysconfig/mouse as is, and then sorts the
contents of the file alphabetically.
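- A minimal sketch of the default sort on made-up
lines. LC_ALL=C is set here only to make the result
reproducible; it selects a collating order based on
byte values, in which uppercase letters come before
lowercase ones. Other locales may order the same
lines differently.

```shell
# Sort four sample lines using the C locale's collating order.
sorted=$(printf 'banana\nCherry\napple\nApple\n' | LC_ALL=C sort)

printf '%s\n' "$sorted"
```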
78Modifying the Sort Order
- By default, the sort command sorts lines
alphabetically. The following table lists command
line switches which can be used to modify this
default sort order.
79Examples of sort
- As an example, madonna is examining the file
sizes of all files that start with an m in the
/var/log directory.
- She next sorts the output with the sort command.
80Examples of sort
- Without being told otherwise, the sort command
sorted the lines alphabetically (with 1952 coming
before 20). Realizing this is not what she
intended, madonna adds the -n command line
switch.
81Examples of sort
- Better, but madonna would prefer to reverse the
sort order, so that the largest files come first.
She adds the -r command line switch.
- Why ls -1?
- Why was the -1 command line switch given to the
ls command in the first example, but not the
others? By default, when the ls command is using
a terminal for standard out, it will group the
filenames in multiple columns for easy
readability. When the ls command is using a pipe
or file for standard out, however, it will print
the files one file per line. The -1 command line
switch forces this behavior for terminal
output as well.
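- The three sorts can be sketched on a made-up
listing of sizes and names (the lines below are
assumptions, not real /var/log contents).

```shell
sizes=$(printf '1952 messages\n20 maillog\n4 mailman\n300 mysqld.log\n')

# Default sort compares characters, so 1952 comes before 20.
alphabetic=$(printf '%s\n' "$sizes" | LC_ALL=C sort)

# -n compares the leading numbers by value.
numeric=$(printf '%s\n' "$sizes" | sort -n)

# -r reverses the order, so the largest size comes first.
reversed=$(printf '%s\n' "$sizes" | sort -rn)

printf '%s\n---\n%s\n---\n%s\n' "$alphabetic" "$numeric" "$reversed"
```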
82Specifying Sort Keys
- In the previous examples, the sort command
performed its sort based on the first characters
found on a line. Often, formatted data is not
arranged so conveniently. Fortunately, the sort
command allows users to specify which column of
tabular data to use for determining the sort
order, or, more formally, which column should
be used as the sort key.
- The following table of command line switches can
be used to determine the sort key.
83Sorting Output by a Particular Column
- As an example, suppose madonna wanted to
reexamine her log files, using the long format of
the ls command. She tries simply sorting her
output numerically.
- Now that the sizes are no longer reported at the
beginning of the line, she has difficulty.
Instead, she repeats her sort using the -k
command line switch to sort her output by the 5th
column, producing the desired output.
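- Sorting by the 5th column can be sketched on
lines shaped like shortened ls -l output. The
listing is made up, and the date columns are left
out to keep it short.

```shell
listing=$(printf '%s\n' \
    '-rw------- 1 root root 1952 messages' \
    '-rw------- 1 root root 20 maillog' \
    '-rw------- 1 root root 300 mysqld.log')

# -k5 makes the sort key start at the 5th whitespace-separated field,
# and -n compares the number found at the start of that key.
by_size=$(printf '%s\n' "$listing" | sort -n -k5)

printf '%s\n' "$by_size"
```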
84Specifying Multiple Sort Keys
- Next, madonna is examining the file /etc/fdprm,
which tabulates low-level formatting parameters
for floppy drives. She uses the grep command to
extract the data from the file, stripping away
comments and blank lines.
85Specifying Multiple Sort Keys
- She next sorts the data numerically, using the
5th column as her key.
86Specifying Multiple Sort Keys
- Her data is successfully sorted using the 5th
column, with the formats specifying 40 tracks
grouped at the top, and 80 tracks grouped at the
bottom. Within these groups, however, she would
like to sort the data by the 3rd column. She adds
an additional -k command line switch to the sort
command, specifying the third column as her
secondary key.
- Now the data has been sorted primarily by the
fifth column. For rows with identical fifth
columns, the third column has been used to
determine the final order. An arbitrary number of
keys can be specified by adding more -k command
line switches.
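- A sketch of a primary and a secondary key. The
rows below are hypothetical and are not taken from
/etc/fdprm; only their shape (a name followed by
numeric columns) is meant to resemble such a table.

```shell
rows=$(printf '%s\n' \
    'fmtA 720 9 2 40' \
    'fmtB 2880 18 2 80' \
    'fmtC 800 10 2 40' \
    'fmtD 1440 9 2 80')

# Primary key: 5th column. Secondary key: 3rd column. Both numeric.
two_keys=$(printf '%s\n' "$rows" | sort -n -k5 -k3)

printf '%s\n' "$two_keys"
```

- Rows with 40 in the fifth column come first, and
within each group the third column decides the
order.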
87Specifying the Field Separator
- The above examples have demonstrated how to sort
data using a specified field as the sort key. In
all of the examples, fields were separated by
whitespace (i.e., a series of spaces and/or
tabs). Often in Linux (and Unix), some other
method is used to separate fields. Consider, for
example, the /etc/passwd file.
88Specifying the Field Separator
- The lines are structured into seven fields each,
but the fields are separated using a : instead
of whitespace. With the -t command line switch,
the sort command can be instructed to use some
specified character (such as a :) to separate
fields.
- In the following, madonna uses the sort command
with the -t command line switch to sort the first
10 lines of the /etc/passwd file by home
directory (the 6th field).
- The user bin, with a home directory of /bin, is
now at the top, and the user mail, with a home
directory of /var/spool/mail, is at the bottom.
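- The same idea can be sketched on a few sample
lines in /etc/passwd format. The entries are
assumptions for illustration rather than a copy of
any real system's file, and LC_ALL=C is set only to
make the ordering reproducible.

```shell
passwd_sample=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'bin:x:1:1:bin:/bin:/sbin/nologin' \
    'mail:x:8:12:mail:/var/spool/mail:/sbin/nologin' \
    'daemon:x:2:2:daemon:/sbin:/sbin/nologin')

# -t: sets the field separator; -k6 sorts on the 6th field (home directory).
by_home=$(printf '%s\n' "$passwd_sample" | LC_ALL=C sort -t: -k6)

printf '%s\n' "$by_home"
```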
89Summary
- In summary, we have seen that the sort command
can be used to sort structured data, using the -k
command line switch to specify the sort field
(perhaps more than once), and the -t command line
switch to specify the field delimiter.
- The -k command line switch can receive more
sophisticated arguments, which serve to specify
character positions within a field, or customize
sort options for individual fields. See the
sort(1) man page for details.
90The uniq Command
- The uniq program is used to identify, count, or
remove duplicate records in sorted data. If given
command line arguments, they are interpreted as
filenames for files on which to operate. If no
arguments are provided, the uniq command operates
on standard in. Because the uniq command only
compares adjacent lines, it works as intended
only on already sorted data, and so is almost
always used in conjunction with the sort command.
- The uniq command uses the following command line
switches to qualify its behavior.
91The uniq Command
- In order to understand the uniq command's
behavior, we need repetitive data on which to
operate. The following python script simulates
the rolling of three six-sided dice, writing the
sum of each of 100 rolls, one per line. The user
madonna makes the script executable, and then
records the output in a file called trial1.
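- As a stand-in for that script, the simulation can
be sketched with awk from the shell. This is not the
workbook's python script; the scratch directory and
the variable names are assumptions, and the output
differs on every run because the dice are random.

```shell
workdir=$(mktemp -d)

# Roll three six-sided dice 100 times and write each sum on its own line.
awk 'BEGIN {
    srand()
    for (i = 0; i < 100; i++) {
        sum = 0
        for (d = 0; d < 3; d++) sum += int(rand() * 6) + 1
        print sum
    }
}' > "$workdir/trial1"

rolls=$(wc -l < "$workdir/trial1")
lowest=$(sort -n "$workdir/trial1" | head -n 1)
highest=$(sort -n "$workdir/trial1" | tail -n 1)
rm -r "$workdir"

printf 'rolls: %s, lowest sum: %s, highest sum: %s\n' "$rolls" "$lowest" "$highest"
```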
92Reducing Data to Unique Entries
- Now, madonna would like to analyze the data. She
begins by sorting the data and piping the output
through the uniq command.
- Without any command line switches, the uniq
command has removed duplicate entries, reducing
the data from 100 lines to only 15. madonna
easily sees that the data looks reasonable: the
sum of every combination for three six-sided dice
is represented, with the exception of 3. Because
only one combination of the dice would yield a
sum of 3 (all ones), she expects it to be a
relatively rare occurrence.
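- A sketch of the same reduction on a short,
fixed list of sums (made-up data, so that the
result is reproducible):

```shell
sums=$(printf '10\n7\n10\n12\n7\n10\n15\n')

# Sorting first puts equal lines next to each other, so uniq can drop them.
unique=$(printf '%s\n' "$sums" | sort -n | uniq)

# Without sorting, no two equal lines are adjacent here, so nothing is removed.
unsorted_count=$(printf '%s\n' "$sums" | uniq | wc -l)

printf 'unique sums:\n%s\nlines left without sort: %s\n' "$unique" "$unsorted_count"
```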
93Counting Instances of Data
- A particularly convenient command line switch for
the uniq command is -c, or --count. This causes
the uniq command to count the number of
occurrences of a particular record, prepending
the result to the record on output.
- In the following example, madonna uses the uniq
command to reproduce its previous output, this
time prepending the number of occurrences of each
entry in the file.
94Counting Instances of Data
- As would be expected (by a statistician, at
least), the largest and smallest numbers have
relatively few occurrences, while the
intermediate numbers occur more often. The
first column can be summed to 100 to confirm that
the uniq command identified every occurrence.
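- Counting with -c, and summing the counts as a
check, can be sketched on a short list of made-up
sums (seven lines here rather than 100):

```shell
sums=$(printf '10\n7\n10\n12\n7\n10\n15\n')

# Each output line is a count followed by the value it belongs to.
counted=$(printf '%s\n' "$sums" | sort -n | uniq -c)

# Adding up the first column should give back the number of input lines.
total=$(printf '%s\n' "$counted" | awk '{ n += $1 } END { print n }')

printf '%s\ntotal: %s\n' "$counted" "$total"
```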
95Identifying Unique or Repeated Data with uniq
- Sometimes, people are just interested in
identifying unique or repeated data. The -d and
-u command line switches allow the uniq command
to do just that. In the first case, madonna
uses -u to identify the dice sums that occur
only once. In the second case, she uses -d to
identify sums that are repeated at least once.
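- Both switches can be sketched on the same kind
of made-up data:

```shell
sums=$(printf '10\n7\n10\n12\n7\n10\n15\n')

# -u prints only the values that occur exactly once.
only_once=$(printf '%s\n' "$sums" | sort -n | uniq -u)

# -d prints one copy of each value that occurs more than once.
repeated=$(printf '%s\n' "$sums" | sort -n | uniq -d)

printf 'once:\n%s\nrepeated:\n%s\n' "$only_once" "$repeated"
```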
96Questions: Chapter 4. Everything Sorting: sort
and uniq
97Chapter 5. Extracting and Assembling Text: cut
and paste
- Key Concepts
- The cut command extracts text from text files,
based on columns specified by bytes, characters,
or fields.
- The paste command merges two text files line by
line.
98The cut Command
- Extracting Text with cut
- The cut command extracts columns of text from a
text file or stream. Imagine taking a sheet of
paper that lists rows of names, email addresses,
and phone numbers. Rip the page vertically twice
so that each column is on a separate piece. Hold
onto the middle piece which contains email
addresses, and throw the other two away. This is
the mentality behind the cut command.
- The cut command interprets any command line
arguments as filenames of files on which to
operate, or operates on the standard in stream if
none are provided. In order to specify which
bytes, characters, or fields are to be cut, the
cut command must be called with one of the
following command line switches.
99The cut Command
- The list arguments are actually a comma-separated
list of ranges. Each range can take one of the
following forms.
100Extracting text by Character Position with cut
-c
- With the -c command line switch, the list
specifies a character's position in a line of
text, where the first character is character
number 1. As an example, the file
/proc/interrupts lists device drivers, the
interrupt request (IRQ) line to which they
attach, and the number of interrupts which have
occurred on that IRQ line. (Do not be concerned
if you are not yet familiar with the concepts of
a device driver or IRQ line. Focus instead on how
cut is used to manipulate the data).
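- Character positions and list ranges can be
sketched on a single made-up line of text rather
than on /proc/interrupts, whose contents differ
from machine to machine.

```shell
line='abcdefghij'

# A closed range: characters 1 through 3.
first_three=$(printf '%s\n' "$line" | cut -c 1-3)

# An open-ended range: character 5 through the end of the line.
from_fifth=$(printf '%s\n' "$line" | cut -c 5-)

# A comma-separated list of single positions and ranges.
picked=$(printf '%s\n' "$line" | cut -c 2,4,9-)

printf '%s\n%s\n%s\n' "$first_three" "$from_fifth" "$picked"
```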
101Extracting text by Character Position with cut
-