Title: Unix Lecture 6
1 Unix Lecture 6
2 HW6 - Part II
- solutions posted on my website
- see syllabus
3 Text Processing: Command Line Utility Programs
- ex
- iconv
- join
- paste
- sort
- tr
- uniq
- xargs
4 TextPro Lexicon File
- Lexicon file: core.text
- Background
- TextPro
- An information extraction system used at SRI International, Menlo Park, CA
- Developed by Doug Appelt
5 Copy machen.txt into your account
- > cd ..
- > cd c6932aab
- > ls
- machen.txt
- > cp machen.txt ../c6932aad
- > cd
- > ls
- machen.txt
6 Text Processing: Command Line Utility Programs
- tr: translate or delete characters
- Example 1: delete (-d) all the newline characters from machen.txt, and redirect the output to a file named machen-cont.txt.
- cat machen.txt | tr -d "\n" > machen-cont.txt
- Example 2: delete (-d) all characters from machen.txt except for (-c) alphabetical characters, newlines, and spaces, and redirect the output to a file named machen-alpha.txt.
- cat machen.txt | tr -c -d "[:alpha:]\n " > machen-alpha.txt
- Try also (no space in the set, so spaces are deleted too):
- cat machen.txt | tr -c -d "[:alpha:]\n" > machen-alpha.txt
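The two tr examples can be tried on a small input first. The sketch below uses a made-up two-line sample file in place of machen.txt (the file name and its contents are assumptions for illustration only).

```shell
# Hypothetical sample text standing in for machen.txt.
workdir=$(mktemp -d)
printf 'The Great God Pan,\nby Arthur Machen (1894).\n' > "$workdir/sample.txt"

# -d deletes every newline, so the two lines are joined into one.
joined=$(tr -d '\n' < "$workdir/sample.txt")
echo "$joined"

# -c complements the set, so -c -d deletes everything EXCEPT
# letters, newlines, and spaces (punctuation and digits disappear).
alpha_only=$(tr -c -d '[:alpha:]\n ' < "$workdir/sample.txt")
echo "$alpha_only"
```

Reading the file with < instead of cat gives the same result; the slides use cat so that every step is a stage in a pipeline.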
7 Text Processing: Command Line Utility Programs
- tr can be used to make a wordlist from a text. This can be done by replacing every space with a newline (\012 is the octal code for newline):
- cat machen.txt | tr " " "\n" | less
- cat machen.txt | tr " " "\012" | less
- We can combine the command above with the delete functionality of tr to make a wordlist without unwanted characters:
- cat machen.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" > lex
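A minimal sketch of the wordlist pipeline, run on a one-line made-up sentence rather than machen.txt (the sentence is an assumption for illustration):

```shell
# Step 1: turn each space into a newline, one word per line.
# Step 2: delete everything that is not a letter or a newline.
words=$(printf 'a strange, strange tale\n' | tr ' ' '\n' | tr -c -d '[:alpha:]\n')
echo "$words"

# "\012" is the octal spelling of newline, so this is equivalent to "\n".
octal=$(printf 'a b\n' | tr ' ' '\012')
echo "$octal"
```

Note that the comma after "strange" is removed, but the repeated word is still listed twice; removing duplicates needs sort and uniq.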
8 Text Processing: Command Line Utility Programs
- sort: prints the lines of its input, or the concatenation of all files listed in its argument list, in sorted order. (The -r flag reverses the sort order.)
- sort -r movie_characters
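As a small illustration of sorting the concatenation of several files, the sketch below creates two made-up name lists (names_a and names_b are hypothetical files, not part of the course material) and sorts them together.

```shell
# Two hypothetical input files.
workdir=$(mktemp -d)
printf 'Helen\nClarke\nVilliers\n' > "$workdir/names_a"
printf 'Austin\nRaymond\n' > "$workdir/names_b"

# With several file arguments, sort treats them as one combined input.
sorted=$(sort "$workdir/names_a" "$workdir/names_b")
echo "$sorted"

# -r reverses the order.
reversed=$(sort -r "$workdir/names_a" "$workdir/names_b")
echo "$reversed"
```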
9 Text Processing: Command Line Utility Programs
- uniq: takes a text file and outputs the file with adjacent identical lines collapsed to one
- it is a kind of filter program
- typically it is used after sort, so that identical lines end up adjacent
- cat machen.txt | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq > lex
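The reason uniq usually follows sort can be seen on a tiny made-up wordlist (the words are an assumption for illustration): uniq only collapses lines that are next to each other.

```shell
# Hypothetical wordlist in which one repeat of "pan" is not adjacent.
words='pan
god
pan
pan'

# Without sort, only the adjacent pair at the end is collapsed.
no_sort=$(printf '%s\n' "$words" | uniq)
echo "$no_sort"

# After sort, all copies are adjacent, so each word appears once.
with_sort=$(printf '%s\n' "$words" | sort | uniq)
echo "$with_sort"
```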
10 Text Processing: Command Line Utility Programs
- sed: stream editor
- a special editor for automatically modifying files
- a find and replace program: it reads text from standard input and writes the result to standard output (normally the screen)
- the search pattern is a regular expression (see references), essentially the same as a grep regular expression
- often used in a program to make changes in a file
11 Text Processing: Command Line Utility Programs
- sed simple example 1
- sed 's/United States/USA/' < usa-gaz.text > new-usa-gaz.text
- s: substitute command
- /../../: delimiters
- United States: regular expression pattern
- USA: replacement string
- < old_file > new_file: read from the old file, write to the new file
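A short sketch of the substitute command on made-up input lines (not the contents of usa-gaz.text). It also shows a detail worth knowing: without the g flag, s replaces only the first match on each line.

```shell
# Replace the first match on each line.
first_only=$(printf 'United States\nUnited States and United States\n' |
    sed 's/United States/USA/')
echo "$first_only"

# With the g flag, every match on the line is replaced.
all_matches=$(printf 'United States and United States\n' |
    sed 's/United States/USA/g')
echo "$all_matches"
```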
12 Text Processing: Command Line Utility Programs
- sed simple example 2
- sed 's/\(United\) \(States\)/\2 \1/' < usa-gaz.text > usa-switch-gaz.text
- switch two words around
- \( start of a saved group (here, one word)
- \) end of the group
- /../../ delimiters
- \1 register 1 (text matched by the first group)
- \2 register 2 (text matched by the second group)
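The word swap can be checked on a single made-up input line. This is a minimal sketch that assumes the two words are separated by exactly one space.

```shell
# \(United\) is saved in register 1, \(States\) in register 2;
# the replacement writes them back in the opposite order.
swapped=$(printf 'United States\n' | sed 's/\(United\) \(States\)/\2 \1/')
echo "$swapped"
```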
13 Text Processing: Command Line Utility Programs
- multiple sed commands may also be stored in a script file. The "-f" option is used on the command line to read the commands from the script:
- sed -f sedscript.sed file
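A sketch of the script-file approach. The two substitution commands and the input file gaz.txt are made up for illustration; only the name sedscript.sed comes from the slide.

```shell
workdir=$(mktemp -d)

# One sed command per line in the script file.
printf '%s\n' 's/United States/USA/' 's/United Kingdom/UK/' > "$workdir/sedscript.sed"

# Hypothetical input file.
printf 'United States\nUnited Kingdom\n' > "$workdir/gaz.txt"

# -f reads the commands from the script and applies all of them to each line.
result=$(sed -f "$workdir/sedscript.sed" "$workdir/gaz.txt")
echo "$result"
```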
14 Text Processing: Command Line Utility Programs
- sed 's/^/LexEntry /g;s/$/ ./' lex > newlex
- ^ matches the beginning of the line
- $ matches the end of the line
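The effect of the two anchors can be seen on a two-word made-up wordlist: text is inserted before the start and after the end of every line, and the words themselves are untouched.

```shell
# ^ and $ match positions, not characters, so nothing is removed:
# "LexEntry " is added in front of each line and " ." behind it.
entries=$(printf 'god\npan\n' | sed 's/^/LexEntry /;s/$/ ./')
echo "$entries"
```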
15 Text Processing: Command Line Utility Programs
shell script
- #! /usr/local/bin/tcsh
- # usage: make_lex filename1
- #        make_lex filename1 filename2 ...
- # first, make sure the user typed in at least one argument
- if ( $#argv < 1 ) then
-   echo "This script needs at least 1 argument."
-   echo "Exiting...(annoyed)"
-   exit 666
- endif
- foreach name ($argv)
-   cat $name | tr " " "\n" | tr -c -d "[:alpha:]\n" | sort | uniq > mylex
-   sed 's/^/LexEntry /g;s/$/ ./' mylex > newlex
-   echo "Your new lexical file is called 'newlex'."
- end
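For comparison, here is a rough sketch of the same idea as a POSIX sh function. This is a hypothetical port, not the script from the lecture: it concatenates all input files into one lexicon and writes to standard output instead of creating mylex and newlex, and it returns status 1 rather than 666 when no argument is given.

```shell
# make_lex: hypothetical sh variant of the lexicon script.
make_lex() {
    # Make sure the caller passed at least one file name.
    if [ "$#" -lt 1 ]; then
        echo "This script needs at least 1 argument." >&2
        return 1
    fi
    # One word per line, letters only, sorted, duplicates removed,
    # then wrapped as "LexEntry <word> .".
    cat "$@" | tr ' ' '\n' | tr -c -d '[:alpha:]\n' | sort | uniq |
        sed 's/^/LexEntry /;s/$/ ./'
}

# Usage on a made-up one-line input file.
workdir=$(mktemp -d)
printf 'pan god pan\n' > "$workdir/tiny.txt"
lexicon=$(make_lex "$workdir/tiny.txt")
echo "$lexicon"
```

Writing to standard output lets the caller choose the file name, for example make_lex machen.txt > newlex.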