I ran into a situation where I had to search a large file for 1100+ regexps.
Running grep 1100 times is trivial but slow: it took 5+ minutes (using MSYS2 on Windows 10), which is unacceptable in my build scenario.
So I came up with the script below, which merges as many regexp patterns into a single grep run as the platform's argument length limit allows (32000 bytes on cygwin / Windows, 131072 bytes on Linux).
TL;DR: the same amount of pattern matching now takes 10-12 seconds, which is about 25 times faster!😎
(N.B.: I know that dividing by the longest line does not use all of the available length, but it is safer this way, and the further speed gain was not worth a more complex solution for me.)
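To make the batch-size arithmetic concrete, here is a small worked example. The numbers are illustrative only: 32000 is the cygwin limit quoted above, while the 120 bytes budgeted per pattern and the round count of 1100 patterns are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical numbers, for illustration only.
arg_limit=32000      # argument length limit in bytes (the cygwin figure quoted above)
per_pattern=120      # assumed bytes budgeted per regexp (based on the longest line)
pattern_count=1100   # roughly the number of regexps mentioned in the post

# Every pattern is budgeted as if it were as long as the longest one.
batch_size=$(( (arg_limit - 10) / per_pattern ))
# Round up: a final, partial batch still needs its own grep run.
grep_runs=$(( (pattern_count + batch_size - 1) / batch_size ))

echo "batch size: $batch_size, grep runs: $grep_runs"
# prints "batch size: 266, grep runs: 5"
```

With these assumed numbers, 1100 separate grep runs shrink to 5, which is where the speed-up comes from.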
#!/usr/bin/env bash
# Validate the arguments: input file and pattern file must exist and be non-empty.
if [ ! -s "$1" ]
then
  echo "First parameter should contain the name of the input file!" >&2
  exit 1
fi
if [ ! -s "$2" ]
then
  echo "Second parameter should contain the name of the file containing the regexps (one per line)!" >&2
  exit 1
fi
if [ -z "$3" ]
then
  echo "Third parameter should contain the name of the output file!" >&2
  exit 1
fi
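# Length of the longest regexp in the pattern file.
LONGESTLINE=$(awk 'length > max_length { max_length = length } END { print max_length + 0 }' "$2")
LONGESTARG=$(getconf ARG_MAX)    # Get argument limit in bytes
# Linux limits a single argument to 131072 bytes even when ARG_MAX is larger.
if [ "$LONGESTARG" -gt 131072 ]; then LONGESTARG=131072; fi
# Budget every pattern as if it were the longest one, plus 1 byte for the "|" separator.
BATCHSIZE=$(( (LONGESTARG - 10) / (LONGESTLINE + 1) ))
echo "Safe batch size: (Argument length limit: $LONGESTARG - 10) / (Longest line: $LONGESTLINE + 1) = $BATCHSIZE batch size"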
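LINECOUNTER=0
PATTERN=""
# Join up to BATCHSIZE regexps with "|" and run one grep per batch.
time {
  while IFS= read -r || [ -n "$REPLY" ]
  do
    [ -z "$REPLY" ] && continue    # an empty alternative would match every line
    PATTERN="${PATTERN:+$PATTERN|}$REPLY"
    if [ $((++LINECOUNTER % BATCHSIZE)) -eq 0 ]
    then
      # Pass the pattern as a quoted argument, so the shell does not re-interpret it.
      grep -iE -- "$PATTERN" "$1"
      PATTERN=""
    fi
  done < "$2"
  # Run the last, partial batch as well.
  if [ -n "$PATTERN" ]
  then
    grep -iE -- "$PATTERN" "$1"
  fi
} | sort -u > "$3"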
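
To sanity-check the batching on a tiny data set, here is a minimal sketch. `batch_grep` is a hypothetical helper (not the script above) that takes the batch size as an explicit third argument, so a final, partial batch can be exercised with only three patterns. The sample file names and contents are made up for the test.

```shell
#!/usr/bin/env bash
# Hypothetical helper: grep INPUT for the regexps in PATTERNS, BATCH_SIZE at a time.
batch_grep() {
  local input=$1 patterns=$2 batch_size=$3
  local batch="" count=0 line
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue              # skip blank lines in the pattern file
    batch="${batch:+$batch|}$line"          # join patterns with "|"
    if [ $(( ++count % batch_size )) -eq 0 ]; then
      grep -iE -- "$batch" "$input"
      batch=""
    fi
  done < "$patterns"
  # Flush the final, partial batch.
  if [ -n "$batch" ]; then grep -iE -- "$batch" "$input"; fi
  return 0
}

# Made-up sample data: four input lines, three regexps.
workdir=$(mktemp -d)
printf '%s\n' 'ERROR: disk full' 'warning: low memory' 'info: all good' 'Error code 42' > "$workdir/input.txt"
printf '%s\n' '^error' 'memory$' 'code [0-9]+' > "$workdir/patterns.txt"

# A batch size of 2 gives one full batch and one partial batch.
result=$(batch_grep "$workdir/input.txt" "$workdir/patterns.txt" 2 | sort -u)
printf '%s\n' "$result"
rm -rf "$workdir"
```

The line "Error code 42" matches patterns in both batches, so it is printed twice before `sort -u` removes the duplicate; "info: all good" matches nothing and should not appear.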
 