Word clouds (tag clouds) have become a very popular way to visualize text data, even though they are almost useless for drawing statistically relevant conclusions. Word clouds can, however, be a quick way to present research interests on personal webpages.
In this blog post, I will show how to use Python to generate a word cloud from a list of PDF files (a common file format for scientific publications).
We will need several Python packages: wordcloud, PyPDF2, nltk and matplotlib, which can all be installed from the conda-forge channel with conda.
- setting up the environment
conda create -n pdf_wordcloud -c conda-forge python=3 wordcloud \
    pypdf2 matplotlib nltk nltk_data
- activating the environment
source activate pdf_wordcloud
- here goes the script
TL;DR The script is deposited on GitHub.
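Before running the script, it can be worth confirming that the environment actually contains the four packages. The snippet below is a minimal sketch using only the standard library; `missing_packages` and `REQUIRED` are names I made up for illustration, and the check only looks for importable modules, not for specific versions.

```python
from importlib.util import find_spec

# Import names of the packages used in this post (note the capitalisation of PyPDF2).
REQUIRED = ["wordcloud", "PyPDF2", "nltk", "matplotlib"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages found.")
```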
First import all the things that are needed:
import string
import re
import glob
import matplotlib.pyplot as plt
import wordcloud
import PyPDF2
import nltk
from calendar import month_name
from nltk.corpus import stopwords
In English, there are some general words (e.g. "you", "me", "is") that are not necessarily helpful in natural language processing. We call these stop words, and we want to exclude them from our text database. The NLTK and wordcloud packages each provide a list of stop words, so we will set up a list of stop words to filter them out in later steps.
ENGLISH_STOP = set(stopwords.words('english'))
I implemented the word cloud as a Python class whose only required initializing input is the directory of the PDF files. I also curated some extra words (self.paper_stop) that may be publication-specific stop words (e.g. "Figure", "Supplementary" and dates, in this case).
class research_wordcloud():
    '''
    Make word cloud from all PDF under a folder
    Usage:
        rs = research_wordcloud(paper_path)
        rs.extract_text()
        rs.filter_text()
        rs.generate_wordcloud(figurename)
    '''
    def __init__(self, paper_path):
        '''
        find all pdf under paper_path
        '''
        self.paper_path = paper_path
        self.PDFs = glob.glob(paper_path + '/*pdf') # any file name ending in "pdf"
        self.texts = '' # store all texts
        self.tokens = None
        self.words = None
        self.paper_stop = ['fig','figure','supplementary', 'author','press',
                           'PubMed', 'manuscript','nt','et','al', 'laboratory',
                           'article','cold','spring','habor','harbor',
                           'additional', 'additionalfile','additiona file']
        months = [month_name[i].lower() for i in range(1,13)]
        self.paper_stop.extend(months)
        # also add capitalized variants, since the later filtering is case-sensitive
        self.paper_stop.extend(list(map(lambda x: x.capitalize(), self.paper_stop)))
        self.paper_stop = set(self.paper_stop)
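One thing to keep in mind about the `'/*pdf'` glob pattern: it matches any file name that ends in "pdf" (a file called `notapdf` would qualify), and on case-sensitive file systems it misses files ending in `.PDF`. Here is a minimal sketch of a stricter lookup; `find_pdfs` is a hypothetical helper, not part of the original script, and it only looks at files directly under the folder.

```python
from pathlib import Path


def find_pdfs(paper_path):
    """Return sorted paths of files directly under paper_path with a .pdf extension (any case)."""
    folder = Path(paper_path)
    return sorted(
        str(p) for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

If this behaviour is preferred, `self.PDFs = find_pdfs(paper_path)` could stand in for the `glob.glob` call.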
And then, I implemented a function to retrieve texts from the PDF files using PyPDF2:
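    def extract_text(self):
        '''
        read pdf text
        '''
        for pdf_path in self.PDFs:
            with open(pdf_path, 'rb') as paper:
                # In PyPDF2 >= 3.0 these calls were renamed to PdfReader,
                # len(reader.pages), reader.pages[i] and page.extract_text()
                pdf = PyPDF2.PdfFileReader(paper)
                for page_num in range(pdf.getNumPages()-1): # skip the last page (references)
                    page = pdf.getPage(page_num)
                    self.texts += page.extractText()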
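Note that `range(pdf.getNumPages()-1)` assumes the references sit on the last page, and for a single-page PDF it yields no pages at all, so that paper contributes nothing. A minimal sketch of a helper that makes this choice explicit follows; `body_page_indices` and its `n_trailing` parameter are hypothetical names, and keeping at least one page is my own assumption about what is wanted.

```python
def body_page_indices(n_pages, n_trailing=1):
    """Indices of pages to read, dropping the last n_trailing pages but never all of them."""
    if n_pages <= 0:
        return range(0)
    # Keep at least the first page, even for very short documents.
    keep = max(n_pages - n_trailing, 1)
    return range(keep)
```

Inside `extract_text`, the loop could then read `for page_num in body_page_indices(pdf.getNumPages()):`.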
And also a function for filtering out stop words, as well as verbs. NLTK offers implementations for 1. tokenizing words (nltk.word_tokenize) from the full text, and 2. identifying whether a word is a noun, a verb, etc. (nltk.pos_tag).
    def filter_text(self):
        '''
        remove stop words and punctuations
        '''
        self.tokens = nltk.word_tokenize(self.texts)
        self.tokens = nltk.pos_tag(self.tokens) # tag the part of speech of each word (verb? noun?)
        self.words = []
        num_regex = re.compile('[0-9]+')
        punctuation = set(string.punctuation)
        for word, tag in self.tokens:
            IS_VERB = tag.startswith('V')
            IS_STOP = word in punctuation
            IS_ENGLISH_STOP = word in ENGLISH_STOP
            IS_WORDCLOUD_STOP = word in wordcloud.STOPWORDS
            IS_NUMBER = num_regex.search(word) # true for any token containing a digit
            IS_PAPER_STOP = word in self.paper_stop
            condition = [IS_VERB, IS_STOP, IS_ENGLISH_STOP,
                         IS_WORDCLOUD_STOP, IS_NUMBER, IS_PAPER_STOP]
            if not any(condition):
                if word == "coli":
                    self.words.append('E. coli') # unfortunate break down of E. coli
                else:
                    self.words.append(word)
        self.words = ' '.join(self.words)
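To see what the non-NLTK conditions do on their own, here is a minimal standard-library sketch. It is not the method above: there is no part-of-speech tagging (so verbs stay), the stop list is a toy one, and unlike `filter_text` it lowercases each word before the stop-word lookup. `keep_word`, `stops` and `tokens` are all made up for illustration.

```python
import re
import string

PUNCTUATION = set(string.punctuation)
HAS_DIGIT = re.compile(r"[0-9]")


def keep_word(word, stop_words):
    """Return True if word is not punctuation, contains no digit, and is not a stop word."""
    if word in PUNCTUATION:
        return False
    if HAS_DIGIT.search(word):
        # Same effect as num_regex.search: any token with a digit is dropped.
        return False
    return word.lower() not in stop_words


stops = {"the", "is", "figure"}  # toy stop list, for illustration only
tokens = ["The", "ribosome", "is", "shown", "in", "Figure", "2", ",", "p53"]
kept = [t for t in tokens if keep_word(t, stops)]
print(kept)
```

Note that "p53" is dropped along with "2", because the digit check matches anywhere in the token. That is worth knowing if gene or protein names matter for your cloud.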
Now, we can generate a wordcloud from the words we have curated.
    def generate_wordcloud(self, figurename):
        '''
        plot
        '''
        wc = wordcloud.WordCloud(
            collocations=False,
            background_color='white',
            max_words=200,
            max_font_size=40,
            scale=3
        )
        try:
            wc.generate(self.words)
            plt.imshow(wc, interpolation="bilinear")
            plt.axis('off')
            plt.savefig(figurename, bbox_inches='tight', transparent=True)
            print('Written %s' %figurename)
        except ValueError:
            print(self.words)
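Word size in the cloud reflects how often each word occurs, so it can help to look at the counts before plotting. The sketch below uses only the standard library; `word_frequencies` and the sample list are hypothetical, and it assumes the filtered words are still a list (that is, before the `' '.join` at the end of `filter_text`).

```python
from collections import Counter


def word_frequencies(words):
    """Count occurrences of each entry in a list of words; multi-word entries stay intact."""
    return dict(Counter(words))


# Toy input standing in for the filtered word list.
filtered = ["ribosome", "E. coli", "RNA", "ribosome", "E. coli", "ribosome"]
freqs = word_frequencies(filtered)
top = Counter(freqs).most_common(2)
print(top)
```

A dict like `freqs` is the kind of input that `WordCloud.generate_from_frequencies` accepts. As far as I can tell, that route keeps an entry such as 'E. coli' as a single item, whereas `generate` tokenizes the joined string again, so treat this as a possible alternative rather than a drop-in fix.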
So to run the whole thing:
PDF_path = '/home/wckdouglas/all_my_papers/'
wordcloud_image = '/home/wckdouglas/research_wordcloud.png'
wc = research_wordcloud(PDF_path)
wc.extract_text()
wc.filter_text()
wc.generate_wordcloud(wordcloud_image)