If you are struggling to convert PDF files into TXT files, here is a quick bash script. There are many alternatives out there, but none were reliable for me. You’ll need acroread and ghostscript installed for this to work. The script converts every PDF in the current directory, writing the intermediate PostScript files to ps/ and the resulting text files to txt/.
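#!/bin/bash
# Convert every PDF in the current directory to plain text.
mkdir -p ps txt
for f in *.pdf
do
    # Skip the literal "*.pdf" when the directory contains no PDFs
    [ -e "$f" ] || continue
    echo "Processing $f"
    # Write the PostScript version of the PDF into ps/
    acroread -toPostScript "$f" ps/
    g=$(basename "$f" .pdf)
    # Extract the text from the PostScript file
    ps2txt "ps/$g.ps" > "txt/$g.txt"
done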
You can also change the ps2txt line of the script to read
ps2txt "ps/$g.ps" | grep -v "EXCLUDE" > "txt/$g.txt"
where EXCLUDE is text that marks the lines you want to drop: grep -v removes every line containing it from each output file. Please let me know if you have any problems.
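If you need to drop more than one kind of line (say, page footers and a watermark), the grep -v idea can be wrapped in a small function. This is only a sketch: exclude_lines is a name I made up for illustration, not part of ghostscript or acroread, and it assumes you want literal string matching (grep -F) rather than regular expressions.

```shell
#!/bin/bash
# Hypothetical helper: drop every input line that contains any of the
# given fixed strings. Reads from stdin, writes to stdout.
exclude_lines() {
    # With no patterns there is nothing to filter, so pass input through
    if [ "$#" -eq 0 ]; then
        cat
        return
    fi
    local args=()
    local p
    for p in "$@"; do
        args+=(-e "$p")   # one -e option per pattern
    done
    # -v inverts the match; -F treats each pattern as a literal string
    grep -v -F "${args[@]}"
}
```

In the script, the modified line would then look something like
ps2txt "ps/$g.ps" | exclude_lines "EXCLUDE" "ANOTHER" > "txt/$g.txt"
where EXCLUDE and ANOTHER are placeholders for your own strings.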
enjoy,
db



