I haven’t posted to this blog in a while, as I’ve been busy with other things (mainly my PhD). The good news is that I’ve just passed my viva, so I can now pick up where I left off with my projects. In an earlier post I made a Wordle of my nine-month progress report, so I decided to make another one from my thesis. This time, however, I took a different approach to generating the Wordle: a Perl script that converts the LaTeX into a list of words.
First off, here is the generated Wordle:

A Wordle of my PhD thesis. The size of the word represents the frequency with which that word is used within my thesis.
Now, here is the Perl script I used to convert the LaTeX into a list of words:
#!/usr/bin/perl
#
use strict;
# An array of subroutine references, which will be called in order
# on each line of latex during the parseLatex subroutine.
our @functions = (
\&removeComments,
\&ignoreFigures,
\&ignoreMath,
\&removeInlineMaths,
\&readInclude,
\&removeEmph,
\&removeSectionHeader,
\&removeRefs,
\&removeCommands,
\&spaceToNewLine
);
# Require at least one argument: the main LaTeX file.
if(@ARGV < 1)
{
print STDERR "Please specify a file to start reading.\n";
exit 1;
}
&parseLatex($ARGV[0]);
sub parseLatex { my($file) = @_;
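my $fh;
# Three-argument open; warn and skip a file that cannot be read.
open($fh, '<', $file) or do { warn "Cannot open $file: $!\n"; return; };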
while(<$fh>)
{
my $line = $_;
foreach my $code (@functions)
{
$line = $code->($line, $fh);
}
print $line;
}
close $fh;
}
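sub removeComments { my ($line) = @_;
# Strip from the first unescaped % to the end of the line.
# The lookbehind leaves \% (a literal percent sign) alone.
$line =~ s/(?<!\\)%.*//;
$line;
}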
sub removeEmph { my ($line) = @_;
$line =~ s/\\emph\{([^\}]+)\}?/$1/g;
$line;
}
sub readInclude { my ($line) = @_;
# Match \include{file} or \input{file}, with or without a .tex extension,
# and parse the referenced file in place of this line.
if( $line =~ m/\\(?:include|input)\{([^\}]+?)(?:\.tex)?\}/)
{
&parseLatex("$1.tex");
"";
}
else
{
$line;
}
}
sub ignoreFigures { my ($line, $fh) = @_;
if($line =~ m/\\begin\{(((figure)|(table)|(equation)|(align))\*?)\}/)
{
# ignoreFiguresSub will keep reading until we find the matching
# \end{...} command.
&ignoreFiguresSub($fh, $1);
"";
}
else
{
$line;
}
}
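sub ignoreFiguresSub { my ($fh, $until) = @_;
# Use a lexical variable so nested calls cannot clobber $_.
while(my $line = <$fh>)
{
# Skip over any nested environment first.
&ignoreFigures($line, $fh);
# \Q...\E quotes the name, so the * in starred environments
# such as figure* is matched literally.
if( $line =~ m/\\end\{\Q$until\E\}/)
{
last;
}
}
"";
}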
sub removeInlineMaths { my ($line) = @_;
$line =~ s/\$[^\$]*\$//g;
$line;
}
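sub ignoreMath{ my ($line,$fh) = @_;
if($line =~ m/\\\[/)
{
# Only keep reading if the display maths is not closed on this line.
unless($line =~ m/\\\[.*\\\]/)
{
while(my $next = <$fh>)
{
last if $next =~ m/\\\]/;
}
}
"";
}
else
{
$line;
}
}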
sub removeSectionHeader{ my ($line) = @_;
$line =~ s/\\(sub)*section\{([^\}]+)\}/$2/;
$line =~ s/\\chapter\{([^\}]+)\}/$1/;
$line;
}
sub removeRefs{ my ($line) = @_;
$line =~ s/([A-Za-z]+~)?\\(auto)?ref\{[^\}]*\}//g;
$line =~ s/\\cite[a-z]*\{[^\}]*\}//g;
$line =~ s/\\label\{[^\}]*}//g;
$line;
}
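sub removeCommands { my ($line) = @_;
# Stop at any whitespace, so a command at the end of a line
# does not swallow the newline and join two words together.
$line =~ s/\\\S+//g;
$line;
}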
sub spaceToNewLine { my ($line) = @_;
$line =~ s/\s+/\n/g;
$line;
}
Finally, with the above Perl saved to a file named striptex.pl in the same folder as my main LaTeX file (Thesis.tex), I used the following pipeline:
./striptex.pl Thesis.tex | tr A-Z a-z | sed -e 's/[^a-z0-9]//g' -e '/^$/d' | sort > words.txt
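Wordle does the counting itself, but if you are curious about the numbers behind the picture, the word list can be tallied with standard tools. The following is only a rough sketch: word_freq is a helper name I made up for illustration, and sample_words.txt is a tiny invented sample standing in for words.txt (one word per line).

```shell
#!/bin/sh
# word_freq FILE: print "count word" pairs, most frequent word first.
# (word_freq is a hypothetical helper, not part of striptex.pl.)
word_freq() {
    # Drop blank lines, group identical words, count them, sort by count.
    grep -v '^$' "$1" | sort | uniq -c | sort -rn
}

# Tiny made-up sample standing in for words.txt, one word per line.
printf '%s\n' model data model thesis data model > sample_words.txt

# Show the most frequent words in the sample.
word_freq sample_words.txt | head -n 20
```

On the real output, the equivalent call would be word_freq words.txt | head -n 20. The sort before uniq -c matters, because uniq only merges adjacent duplicate lines.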
Thanks to Colin Williams for his improvements to the script. Have fun.


