I came across a situation where I had to look for 1100+ regexps in a large file.
Running grep 1100 times is trivial but slow: it took 5+ minutes (using MSYS2 on Windows 10), which is unacceptable in my build scenario.
So I came up with the script below, which merges as many regexp patterns into one run of grep as your platform's argument length limit allows (32000 bytes on Cygwin / Windows, 131072 on Linux).
TL;DR: the same amount of pattern matching now takes 10-12 seconds, which is 25 times faster!😎
(N.B.: I know that dividing by the longest line does not use all of the available argument length, but it is safer this way, and the further speed gain was not worth a more complex solution for me.)
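The core idea is simply grep's extended-regexp alternation: instead of one grep invocation per pattern, many patterns are joined with | and matched in a single pass over the file. A minimal illustration (the file name and patterns are made up):

grep -iE "error 1234" big.log
grep -iE "timeout" big.log
grep -iE "disk full" big.log

becomes

grep -iE "error 1234|timeout|disk full" big.log

The script below just automates this, packing as many patterns into each grep command line as ARG_MAX permits.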
#!/usr/bin/env bash
if [ ! -s "$1" ]
then
    echo "First parameter should contain the name of the input file!"
    exit 1
fi
if [ ! -s "$2" ]
then
    echo "Second parameter should contain the name of the file containing the regexps (one per line)!"
    exit 1
fi
if [ -z "$3" ]
then
    echo "Third parameter should contain the name of the output file!"
    exit 1
fi
# Length of the longest regexp line, in characters
LONGESTLINE=$(awk 'length > max_length { max_length = length } END { print max_length }' "$2")
LONGESTARG=$(getconf ARG_MAX) # Argument length limit in bytes
# Worst case: assume every pattern is as long as the longest one
BATCHSIZE=$(( (LONGESTARG - 10) / LONGESTLINE ))
echo "Safe batch size: (argument length limit $LONGESTARG - 10) / longest line $LONGESTLINE = $BATCHSIZE patterns per grep run"
LINECOUNTER=0
GREPCMD="grep -iE \""
# The command line is built as a string and run with eval, so the patterns
# must not contain characters that are special to the shell (e.g. double quotes).
time cat "$2" |
{
    while read -r
    do
        if [ $((++LINECOUNTER % BATCHSIZE)) -eq 0 ]
        then
            # Batch is full: close the quoted pattern list and run grep
            GREPCMD="$GREPCMD$REPLY\""
            eval "$GREPCMD" "\"$1\""
            GREPCMD="grep -iE \""
        else
            # Keep collecting patterns, joined by the alternation operator |
            GREPCMD="$GREPCMD$REPLY|"
        fi
    done
    # Run the last, possibly partial batch (dropping its trailing |)
    if [ "$GREPCMD" != "grep -iE \"" ]
    then
        eval "${GREPCMD%|}\"" "\"$1\""
    fi
} | sort | uniq > "$3"
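Assuming the script is saved as merge-grep.sh (the name is of course arbitrary), a run looks like this:

./merge-grep.sh big.log regexps.txt matches.txt

Note that because of the sort | uniq at the end, the matching lines end up in the output file sorted and deduplicated rather than in their original order.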