Is it possible to escape regex metacharacters reliably with sed


I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string, so that it can be used in a subsequent sed command. Like this:

    #!/bin/bash
    # Trying to replace one regex by another in an input file with sed

    search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
    replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"

    # Sanitize input
    search=$(sed 'script to escape' <<< "$search")
    replace=$(sed 'script to escape' <<< "$replace")

    # Use it in a sed command
    sed "s/$search/$replace/" input

I know that there are better tools for working with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. Let's concentrate on basic POSIX regexes to have even more fun! :)

I have tried a lot of things, but every time I could find an input that broke my attempt. I thought that keeping it abstract as 'script to escape' would not lead anybody in the wrong direction.

Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.

Note:

  • If you're looking for prepackaged functionality based on the techniques discussed in this answer:
    • bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
    • @EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
  • All snippets assume bash as the shell (POSIX-compliant reformulations are possible):

Single-line solutions


Escaping a string literal for use as a regex in sed:

To give credit where credit is due: I found the regex used below in this answer.

Assuming that the search string is a single-line string:

    search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3'  # sample input containing metachars.

    searchescaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.

    sed -n "s/$searchescaped/foo/p" <<<"$search" # if ok, echoes 'foo'

  • Every character except ^ is placed in its own character set [...] expression so that it is treated as a literal.
    • Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
  • Then, ^ chars. are escaped as \^.
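
To make the transformation concrete, here is a minimal sketch that prints the escaped form of a short string. The sample value is hypothetical and chosen only because it mixes ordinary characters, regex metacharacters and a ^:

```shell
#!/bin/bash
# Hypothetical sample; any single-line string is handled the same way.
sample='a.b^c*'

# Same two substitutions as above: bracket everything except ^, then \-escape ^.
escaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$sample")

printf '%s\n' "$escaped"   # prints: [a][.][b]\^[c][*]
```

Each character ends up either as a one-element character set or, in the case of ^, as \^.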

The approach is robust, but not efficient.

The robustness comes from not trying to anticipate all special regex characters - which will vary across regex dialects - but from focusing on only 2 features shared by all regex dialects:

  • the ability to specify literal characters inside a character set.
  • the ability to escape a literal ^ as \^
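
The effect of relying on these two features can be checked with a small sketch. It assumes a hypothetical literal, 'a.c', that is also a valid regex, and compares how the unescaped and the escaped form behave:

```shell
#!/bin/bash
# Hypothetical literal: as a regex, the '.' would match any character.
literal='a.c'
escaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$literal")

# Unescaped: '.' acts as a wildcard, so the input 'abc' is matched too.
unescaped_on_abc=$(sed -n "s/$literal/HIT/p" <<<'abc')

# Escaped: only the exact text 'a.c' matches.
escaped_on_abc=$(sed -n "s/$escaped/HIT/p" <<<'abc')
escaped_on_literal=$(sed -n "s/$escaped/HIT/p" <<<'a.c')

printf 'unescaped on abc: [%s]\n' "$unescaped_on_abc"    # [HIT]
printf 'escaped on abc:   [%s]\n' "$escaped_on_abc"      # []
printf 'escaped on a.c:   [%s]\n' "$escaped_on_literal"  # [HIT]
```

Nothing in the escaping step needs to know that '.' is special; it is neutralized simply because it sits inside a character set.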

Escaping a string literal for use as the replacement string in sed's s/// command:

The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer either to the entire string matched by the regex (&) or to specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.

Assuming that the replacement string is a single-line string:

    replace='laurel & hardy; ps\2' # sample input containing metachars.

    replaceescaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it

    sed -n "s/\(.*\) \(.*\)/$replaceescaped/p" <<<"foo bar" # if ok, outputs $replace as is
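
As a rough illustration of why the escaping matters, the sketch below first uses an unescaped & (which sed expands to the matched text) and then an escaped replacement containing all three special characters. Both sample strings are hypothetical:

```shell
#!/bin/bash
# Unescaped: '&' is a placeholder for the whole match, so 'foo' leaks into the result.
plain='R&D'
unescaped_result=$(sed "s/foo/$plain/" <<<'foo')
printf '%s\n' "$unescaped_result"   # prints: RfooD

# Escaped: &, / and \ are each prefixed with a backslash.
replace='R&D a/b \1'
replaceescaped=$(sed 's/[&/\]/\\&/g' <<<"$replace")
printf '%s\n' "$replaceescaped"     # prints: R\&D a\/b \\1

# Used as a replacement, the escaped string reproduces $replace verbatim.
escaped_result=$(sed "s/\(foo\)/$replaceescaped/" <<<'foo')
printf '%s\n' "$escaped_result"     # prints: R&D a/b \1
```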


Multi-line solutions


Escaping a multi-line string literal for use as a regex in sed:

Note: This only makes sense if multiple input lines (possibly all) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
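
The line-at-a-time behaviour is easy to observe. In this sketch (the two-line input is hypothetical), the same substitution is run once in sed's default mode and once after a loop that first collects all lines in the pattern space:

```shell
#!/bin/bash
input=$'one\ntwo'   # hypothetical two-line input

# Default mode: each line is matched on its own, so a regex containing \n never matches.
per_line=$(sed 's/one\ntwo/JOINED/' <<<"$input")

# Slurp mode: N appends the next line until the last line has been read.
slurped=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/one\ntwo/JOINED/' <<<"$input")

printf '%s\n' "$per_line"   # prints the two lines unchanged
printf '%s\n' "$slurped"    # prints: JOINED
```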

    # Define sample multi-line literal.
    search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
    /def\n\t[a-z]\+\([^ ]\)\{3,4\}\4'

    # Escape it.
    searchescaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')

    # Use in a sed command that reads all input lines up front.
    # If ok, echoes 'foo'
    sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchescaped/foo/p" <<<"$search"

  • The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
  • $!a\'$'\n''\\n' appends the string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<)
  • tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
  • -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
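
The steps above can be followed one at a time in the sketch below, which prints the intermediate result before and after tr. The two-line sample is hypothetical:

```shell
#!/bin/bash
search=$'a.b\nc^d'   # hypothetical two-line literal

# Step 1: escape each line and append a line containing the two characters \n
# after every line except the last.
step1=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search")
printf '%s\n' "$step1"
# prints three lines:
#   [a][.][b]
#   \n
#   [c]\^[d]

# Step 2: delete the real newlines, leaving only the '\n' strings.
searchescaped=$(printf '%s\n' "$step1" | tr -d '\n')
printf '%s\n' "$searchescaped"   # prints: [a][.][b]\n[c]\^[d]

# Step 3: match against the whole input, read up front.
matched=$(sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchescaped/foo/p" <<<"$search")
printf '%s\n' "$matched"         # prints: foo
```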

Escaping a multi-line string literal for use as the replacement string in sed's s/// command:

    # Define sample multi-line literal.
    replace='laurel & hardy; ps\2
    masters\1 & johnson\2'

    # Escape it for use as a sed replacement string.
    IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
    replaceescaped=${REPLY%$'\n'}

    # If ok, outputs $replace as is.
    sed -n "s/\(.*\) \(.*\)/$replaceescaped/p" <<<"foo bar"

  • Newlines in the input string must be retained as actual newlines, but \-escaped.
  • -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
  • s/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
  • s/\n/\\&/g then \-prefixes all actual newlines.
  • IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
  • ${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
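
The difference between the two ways of capturing output can be seen in isolation. The sketch below uses a hypothetical produce function whose output ends in two newlines; the '|| true' is an addition here, since read returns a non-zero status when it reaches end of input without finding the NUL delimiter:

```shell
#!/bin/bash
# Hypothetical producer: 4 characters of text followed by two newlines.
produce() { printf 'text\n\n'; }

# Command substitution strips ALL trailing newlines.
via_subst=$(produce)

# read -d '' reads up to a NUL byte; none arrives, so everything lands in REPLY.
IFS= read -d '' -r < <(produce) || true
via_read=$REPLY

# Removing exactly one trailing newline, as done above for the one added by <<<.
trimmed=${via_read%$'\n'}

printf '%d\n' "${#via_subst}"   # prints: 4
printf '%d\n' "${#via_read}"    # prints: 6
printf '%d\n' "${#trimmed}"     # prints: 5
```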


bash functions based on the above (for sed):

  • quotere() quotes (escapes) for use in a regex
  • quotesubst() quotes for use in the substitution string of a s/// call.
  • Both handle multi-line input correctly
    • Note that because sed reads a single line at a time by default, use of quotere() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
    • Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedvalue < <(quotesubst "$value")

    # SYNOPSIS
    #   quotere <text>
    quotere() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }

    # SYNOPSIS
    #  quotesubst <text>
    quotesubst() {
      IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
      printf %s "${REPLY%$'\n'}"
    }
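
The caveat about trailing newlines can be reproduced with a short sketch. It repeats quotesubst so that it is self-contained (with '|| true' added after read, an assumption made so the copy also survives a shell running with set -e), and feeds it a hypothetical value that ends in a newline:

```shell
#!/bin/bash
# Copy of quotesubst from this post; '|| true' is an addition for set -e shells.
quotesubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1") || true
  printf %s "${REPLY%$'\n'}"
}

value=$'a&b\n'   # hypothetical value with a trailing newline

# $(...) strips the final newline but keeps the backslash that was escaping it.
broken=$(quotesubst "$value")

# Reading with read -d '' keeps the escaped newline intact.
IFS= read -d '' -r escapedvalue < <(quotesubst "$value") || true

printf '%q\n' "$broken"
printf '%q\n' "$escapedvalue"
```

The value in broken ends in a lone backslash; placed in "s/.../$broken/" it would escape the closing delimiter, so it is only printed here and never passed to sed.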

Example:

    from=$'cost\(*):\n$3.' # sample input containing metachars.
    to='you & i'$'\n''eating a\1 sauce.' # sample replacement string with metachars.

    # Should print the unmodified value of $to
    sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quotere "$from")/$(quotesubst "$to")/" <<<"$from"

Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.



perl solution:

Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or the equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:

    from=$'cost\(*):\n$3.' # sample input containing metachars.
    to='you owe me $1/$& for'$'\n''eating a\1 sauce.' # sample replacement string w/ metachars.

    # Should print the unmodified value of $to.
    # Note that the replacement value needs no escaping.
    perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"

  • Note the use of -0777 to read all input at once, so that the multi-line substitution works.

  • The -s option allows placing -<var>=<val>-style perl variable definitions following -- after the script, before any filename operands.

