Using Python to extract district-wise COVID data from epid.gov.lk (from multiple PDF files)

Mohamed Faizan Cassim · Published in Analytics Vidhya · May 15, 2021 · 5 min read


My main motivation for obtaining district-wise data on Sri Lankan COVID-19 cases was to see how COVID-19 has evolved within the districts and to spot patterns in the data, such as movement between districts, national/religious festivals, political rallies, etc. But because the government publishes each day's COVID data (for all districts) in a separate PDF file, aggregating this data was a job for a Python program.

The process of aggregating the COVID data from the individual PDF files can be split into four separate tasks: crawling the main website for links to the PDF COVID reports, cropping out the tables of interest, reading the tables, and aggregating the data.

Crawling epid.gov.lk for download links to the PDF files.

This was one of the easiest processes, thanks to Python libraries like BeautifulSoup, which lets a user write simple HTML element queries, find all elements of the same type, and store them in a list.

This is what the website we are going to download the PDF files from looks like:

Source: epid.gov.lk

As you can see from the above image, the links to the PDF files are presented in a tabular format, with both English and Sinhala versions of each report.

Having the PDF files in two different languages adds a layer of complexity to finding only the English versions. Luckily, there is an easy way to differentiate them. Look at the following two links, for example:

web/images/pdf/corona_virus_report/sitrep-sl-sin-14-05_10_21.pdf (Sinhala version)

web/images/pdf/corona_virus_report/sitrep-sl-en-14-05_10_21.pdf (English version)

In the above example, one can see that filtering on the keyword sl-en selects the English versions of the PDF files.

The following code snippet extracts the links to the PDF files; it is explained after the image.

Python Code Snippet 1

The first step of the code is to traverse each row of the table. The second step is to extract every hyperlink in the table that points to a PDF file. The final step is to keep only the English versions of the PDF files.
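
A minimal sketch of those three steps might look like this, assuming the reports sit in a plain HTML table; the listing URL below is a placeholder rather than the exact page address:

```python
# A minimal sketch of the crawl, assuming the reports are listed in a plain
# HTML table; LISTING_URL is a placeholder, not the exact page address.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.epid.gov.lk"
LISTING_URL = BASE_URL + "/web/index.php"  # placeholder for the report listing page

soup = BeautifulSoup(requests.get(LISTING_URL).text, "html.parser")

pdf_links = []
for row in soup.find_all("tr"):             # step 1: traverse each table row
    for a in row.find_all("a", href=True):  # step 2: extract the hyperlinks
        href = a["href"]
        if href.endswith(".pdf") and "sl-en" in href:  # step 3: English PDFs only
            pdf_links.append(BASE_URL + "/" + href.lstrip("/"))

print(f"Found {len(pdf_links)} English PDF reports")
```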

Cropping out the tables of interest.

The Camelot library for Python is a great tool for extracting tables from PDF files into a pandas DataFrame object, which makes table manipulation easier. To make the process simpler and less error prone for our Python program to read the data from the tables, it was easier to crop each table out into a separate PDF file containing only that single table.

The following is the table that we wish to extract:

Table 1 Source: epid.gov.lk

If we know the pixel values for the boundaries of the table, then extracting it is easy. However, as the table can sit in a slightly different position in each PDF file, it is important to account for these variations by enlarging the crop area.

Obtaining the Pixel Coordinates to Crop

The sad reality is that there is no effortless way to obtain the crop pixel coordinates of a PDF file. However, after much research, I found it easiest to convert the PDF file into a JPEG and then use the cursor to obtain the crop coordinates.

To get the exact crop coordinates (pixel values can be relative), I used PyPDF2's mediaBox attribute to obtain the original width and height of the PDF page. The following image shows how this can be done:

Python Code Snippet 2
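
A minimal sketch of this step, using PyPDF2's 1.x API (the file name below is a placeholder):

```python
# A minimal sketch of reading page dimensions with PyPDF2 1.x;
# the file name is a placeholder.
from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("sitrep-sl-en-14-05_10_21.pdf", "rb"))
box = reader.getPage(0).mediaBox  # page boundaries in PDF points

width = float(box.upperRight[0]) - float(box.lowerLeft[0])
height = float(box.upperRight[1]) - float(box.lowerLeft[1])
print(f"Original page size: {width} x {height} points")
```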

Once these dimensions were obtained, the PDF file was converted to a JPEG image and opened in MS Paint. The image was then resized to the original width and height of the PDF page, and the cursor was used to read off the pixel coordinates of the table I wished to extract.

Transferring the Metadata onto the new PDF file.

As the table itself does not state the date it refers to, there was no way to track the tables by date. Luckily, the creation date of the PDF file coincided with the date relevant to the table. But when a new cropped version of the PDF file was created, this metadata was not automatically copied over. Transferring it, however, took only a couple of lines of code.

Python Code Snippet 3
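
A minimal sketch of the metadata transfer, again assuming PyPDF2's 1.x API and placeholder file names:

```python
# A minimal sketch of copying the creation date onto the cropped copy
# with PyPDF2 1.x; file names are placeholders.
from PyPDF2 import PdfFileReader, PdfFileWriter

src = PdfFileReader(open("sitrep-sl-en-14-05_10_21.pdf", "rb"))
writer = PdfFileWriter()
writer.addPage(src.getPage(0))

# Carry the creation date (the only field we rely on) across to the new file.
info = src.getDocumentInfo()
writer.addMetadata({"/CreationDate": info.get("/CreationDate", "")})

with open("cropped.pdf", "wb") as f:
    writer.write(f)
```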

The following function shows how the cropping was done in the Python code:

Python Code Snippet 4
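
A minimal sketch of such a cropping function; the coordinates in the usage example are made up for illustration:

```python
# A minimal sketch of the cropping step with PyPDF2 1.x; the coordinates
# passed at the bottom are placeholders, not the real measured values.
from PyPDF2 import PdfFileReader, PdfFileWriter

def crop_table(in_path, out_path, lower_left, upper_right):
    """Save a copy of page 1 of in_path cropped to the given bounding box."""
    page = PdfFileReader(open(in_path, "rb")).getPage(0)
    # Note: PDF y-coordinates run bottom-up, so pixel y-values measured on
    # the JPEG (top-down) must be flipped against the page height first.
    page.mediaBox.lowerLeft = lower_left
    page.mediaBox.upperRight = upper_right
    writer = PdfFileWriter()
    writer.addPage(page)
    with open(out_path, "wb") as f:
        writer.write(f)

# Example usage with made-up coordinates (x0, y0) to (x1, y1):
crop_table("sitrep-sl-en-14-05_10_21.pdf", "table_only.pdf", (30, 200), (565, 640))
```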

Reading and Manipulating the Data from the tables

Once the table of interest was cropped out and the metadata transplanted, reading the data off the tables was effortless. The Camelot library converted the table data into a pandas DataFrame object, which I subsequently converted into a Python list so that I could delete rows not relevant to the data I was trying to collect. As you can see from Table 1, the first and last rows can be deleted.

Python Code Snippet 5
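
A minimal sketch of this step with Camelot, using a placeholder file name for one of the cropped single-table PDFs:

```python
# A minimal sketch of the reading step with Camelot; the file name is a
# placeholder for one of the cropped single-table PDFs.
import camelot

tables = camelot.read_pdf("table_only.pdf", pages="1")
df = tables[0].df            # the extracted table as a pandas DataFrame

rows = df.values.tolist()    # convert to a plain Python list of rows
rows = rows[1:-1]            # drop the first and last rows, as noted above
```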

Aggregating the data for a specific user-selected district.

To aggregate the data for a particular district of the user's choice, a user input was taken and a case-sensitive string find was performed to select and aggregate the rows for that district.

Python Code Snippet 6
Python Code Snippet 7
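
A minimal sketch combining these two steps, assuming the cropped tables have already been read into (date, rows) pairs and that the district name sits in the first column with the case count beside it:

```python
# A minimal sketch of the aggregation step. The (date, rows) structure and
# the column layout (district in column 0, case count in column 1) are
# assumptions about the data built up in the previous steps.
def aggregate_district(daily_tables, district):
    """Collect (date, cases) pairs for rows whose district column matches."""
    series = []
    for date, rows in daily_tables:
        for row in rows:
            if row[0].find(district) != -1:  # case-sensitive string find
                series.append((date, int(row[1])))
    series.sort()  # chronological order by date
    return series

# Example with made-up data:
sample = [("2021-05-14", [["Colombo", "310"], ["Gampaha", "205"]])]
print(aggregate_district(sample, input("Enter a district (case sensitive): ")))
```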

Results

The following charts show the rate of COVID infections in three districts of Sri Lanka.

Chart 1
Chart 2
Chart 3

Conclusion

As a Robotics Engineer who is interested in Data Science but has yet to receive formal training in the field, I am immensely proud of this first step because of the complexity of the work involved. I managed to complete the task using three separate Python scripts. My next endeavour with the data obtained is to apply some real data analytics and find some interesting correlations.

Mohamed Faizan Cassim

Robotics Engineer from Kolonnawa, Sri Lanka. Lived in 4 different countries and been to 6. Programming: C, C++, C# and Python (beginner at Rust).