Redspin Security Blog

Converting Lots of PDFs to TXTs in Ubuntu/Debian

by David Bailey on Apr. 15, 2010, under Redspin Labs

If you are struggling to convert a batch of PDF files into TXT files, here is a quick bash script. There are many alternatives out there, but none of them worked reliably for me. You’ll need acroread and ghostscript installed for this to work. Run the script from the directory that holds the PDFs; it writes the intermediate PostScript files to ps/ and the text output to txt/.
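Before running anything, it can help to confirm that both tools are actually on your PATH. This is only a minimal sketch: the helper name missing_tools is made up for illustration, and it assumes the two commands the script calls (acroread and ps2txt) are the ones to look for.

```shell
#!/bin/bash
# missing_tools is a hypothetical helper: it prints each named command
# that cannot be found on the PATH, one per line.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: acroread and ps2txt are the two commands the script needs.
missing=$(missing_tools acroread ps2txt)
if [ -n "$missing" ]
then
    echo "Please install:" $missing >&2
fi
```

If nothing is printed, both commands were found.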


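#!/bin/bash
# Convert every PDF in the current directory to text via PostScript.
mkdir -p ps txt
for f in *.pdf
do
    # Skip the literal "*.pdf" when no PDFs are present.
    [ -e "$f" ] || continue
    echo "Processing $f"
    # acroread writes the PostScript file into the ps/ directory.
    acroread -toPostScript "$f" ps/
    g=$(basename "$f" .pdf)
    ps2txt "ps/$g.ps" > "txt/$g.txt"
done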
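
The basename step is what ties each PDF to its .ps and .txt files. As a rough sketch of that mapping (txt_path_for is a hypothetical helper, not part of the script above), the following shows the name that gets derived, and why quoting matters once a filename contains spaces:

```shell
#!/bin/bash
# txt_path_for is a hypothetical helper that mirrors the script's naming:
# strip the directory and the .pdf suffix, then place the result under txt/.
txt_path_for() {
    g=$(basename "$1" .pdf)
    echo "txt/$g.txt"
}

txt_path_for "reports/Q1 summary.pdf"   # prints: txt/Q1 summary.txt
```

Without the double quotes around "$1", the shell would split "Q1 summary.pdf" into two words and basename would see the wrong arguments.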

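You can also change the second-to-last line to read
ps2txt "ps/$g.ps" | grep -v "EXCLUDE" > "txt/$g.txt"
where EXCLUDE is text that appears on the lines you want to drop from every file. Note that grep -v removes each line that contains a match, not just the matching text, and that it treats EXCLUDE as a regular expression. Please let me know if you have any problems.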
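
To see what the filter does without converting anything, here is a small sketch on made-up input. The helper name drop_lines and the sample text are illustrative only, and -F is added as an assumption that you want EXCLUDE matched literally rather than as a regular expression:

```shell
#!/bin/bash
# drop_lines is a hypothetical helper: it removes every input line that
# contains the literal text passed as the first argument.
drop_lines() {
    grep -v -F -- "$1"
}

# Made-up sample text standing in for ps2txt output.
printf 'Page 1\nCONFIDENTIAL - do not distribute\nActual content\n' | drop_lines "CONFIDENTIAL"
# prints:
# Page 1
# Actual content
```

Keep in mind that grep exits with a non-zero status when it outputs no lines at all, which only matters if you run the script with set -e.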

enjoy,
db
