
Joyce Yanru Jiang

LDA Topic Modelling of Covid-19 media coverage in Uganda

August 31, 2020 By Joyce Yanru Jiang & Anne Whitehead

Uganda – like the rest of the world – has experienced a large volume of news coverage about the Covid-19 pandemic. Whitehead Communications used Artificial Intelligence (AI) to analyse the dominant topics that emerged within Covid-19 coverage by online news websites in Uganda. We gathered more than 13,000 news articles that included the terms “covid” or “corona” from 13 English language news websites. This analysis included the websites of Uganda’s major print publications Daily Monitor, New Vision, and Red Pepper as well as major online-only Ugandan media houses Chimp Reports, Nile Post, PML Daily, Observer and The Independent (the last two of which became online-only just recently during the pandemic lockdown), and other online-only news publishers including Softpower, The Tower Post, Eagle News, Trumpet News and The Brink News.

This was exploratory and experimental research to find out which topics stood out in Ugandan media coverage of Covid-19 according to a machine learning method called LDA Topic Modelling, which employs a Natural Language Processing (NLP) technique. This methodology alone cannot be considered comprehensive or conclusive, but it offers an initial indication and useful insights into how the subject was covered in the Ugandan press.

Figure 1 below shows the volume of articles that we collected from each news website. This represents only the online written news media landscape in Uganda, but it gives us an initial sample for further analysis.

OUR PURPOSE
This research aims to explore how Covid-19 has been covered by Ugandan online news websites within the first six months of 2020 using machine learning. We identified top news sites in Uganda that publish in English and applied an LDA Topic Modelling computational technique to discover which topics are being covered related to the coronavirus pandemic. Our research is intended to deliver insights to those who have a special interest in Uganda’s media industry, or the country’s experience with Covid-19, or those interested in the application of machine learning to media and communications research. This is part of a wider research project by Whitehead Communications exploring the application of multiple new research methods to see the bigger picture, draw correlations and build stronger research-based foundations on which to develop communications strategy.

OVERVIEW OF COVID-19 MEDIA COVERAGE IN UGANDA
The weekly number of articles produced by Ugandan online media increased dramatically in mid-March of 2020, as the first Ugandan case was declared on the 21st of March [2] and the country went into lockdown in the same week [3]. Covid-19 related coverage by volume of online articles reached its height in April of 2020, then began to decline. This indicates that media interest in the disease and its impact peaked during the period when lockdown was strictest and cases were just beginning to mount, but before the first Covid-19 death was announced in Uganda on the 23rd of July, 2020 [4]. We removed from our dataset any stories republished from foreign media outlets in order to focus only on news produced in Uganda.

According to our dataset, the Daily Monitor published the largest volume of Covid-19 related articles online in April, averaging 150 per week, followed by The Independent (~141/week), then PML Daily (~137/week), Chimp Reports (~114/week), Nile Post (~104/week) and New Vision (~68/week). The volume of articles began to drop again in May. You can see this trend in Figure 2 below.
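Weekly averages of this kind can be derived from publication dates alone. The following is a minimal sketch rather than our production code: it assumes each article is reduced to an (outlet, date) pair, and it averages only over weeks in which an outlet published at least one matching article. The outlet names and dates in the sample are invented for illustration.

```python
from collections import defaultdict
from datetime import date

def weekly_average(articles, year, month):
    """Average number of articles per ISO week, per outlet, for one month."""
    per_week = defaultdict(lambda: defaultdict(int))
    for outlet, published in articles:
        if published.year == year and published.month == month:
            week = published.isocalendar()[1]  # ISO week number
            per_week[outlet][week] += 1
    # Averages over the weeks in which the outlet published at least once
    return {
        outlet: sum(weeks.values()) / len(weeks)
        for outlet, weeks in per_week.items()
    }

# Hypothetical records, for illustration only
sample = [
    ("Outlet A", date(2020, 4, 1)),
    ("Outlet A", date(2020, 4, 2)),
    ("Outlet A", date(2020, 4, 8)),
    ("Outlet B", date(2020, 4, 15)),
]
averages = weekly_average(sample, 2020, 4)
```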

METHODOLOGY
We used the Latent Dirichlet allocation (LDA) Topic Modelling technique to categorize news based on topics and assign topics to each article. The following are the main steps in our research.

Data collection: First, using local media knowledge, we identified the Ugandan news websites likely to produce the highest volume of articles. Then we developed a tailored scraping script in Python for each news website. On each website, we searched for the case-insensitive stem keywords “covid” and “corona” to gather articles related to Covid-19 (this also captured extended forms of these keywords, such as “coronavirus” and “Covid-19”). The data we scraped ranged in date from January to mid-July. Since we did not get comprehensive data for July, this month was excluded from our analysis.
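To illustrate what case-insensitive stem matching means in practice, here is a small sketch. It is not the scraping script itself (each site's own search function did the matching during collection); the function name `mentions_covid` is ours, chosen for this example.

```python
import re

# Case-insensitive stems: matches "Covid-19", "COVID", "coronavirus", etc.
STEM_PATTERN = re.compile(r"covid|corona", re.IGNORECASE)

def mentions_covid(text):
    """Return True if the text contains either stem, in any letter case."""
    return STEM_PATTERN.search(text) is not None

examples = [
    "New Coronavirus cases confirmed",
    "COVID-19 lockdown extended",
    "Parliament passes budget",
]
flags = [mentions_covid(t) for t in examples]
```

Note that stem matching is deliberately broad: a headline about the Toyota Corona would also match, which is one reason the cleaning steps below were needed.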

Data cleaning: After gathering 10,427 articles including the keyword “covid”, we combined this dataset with 9,927 articles including the keyword “corona”. We then cleaned the combined data, which left 12,750 articles in our final dataset for LDA modelling. The steps we took for data cleaning included:

o Removed duplicate articles
o Kept only articles whose headline or body text contained Covid-related keywords (e.g. COVID, corona, lockdown, outbreak, curfew, reopen, facemask, respiratory, WHO, hospital, etc.). The keywords were chosen at our discretion, which may lead to some Type I and Type II errors (articles wrongly kept or wrongly dropped). However, we believe our result is mathematically reliable.
o Removed articles containing the keyword “Toyota” (to exclude articles about the Toyota Corona car model that were picked up in our data) or “coronation”
o Removed articles that mention AFP (Agence France-Presse) or Xinhua (Chinese news agency) and do not contain any Uganda-related keywords. Both AFP and Xinhua are international news agencies that sell news to Ugandan local media. Though the Ugandan audience is still exposed to this content, we chose to remove international articles unrelated to Uganda in order to focus on news produced locally.
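The cleaning steps above can be sketched as a single filter. This is an illustration under stated assumptions, not our exact script: duplicates are taken to be exact matches on headline and body, the Covid keyword list is only the partial list quoted above, and the Uganda-related keywords (`uganda`, `kampala`) are placeholders, since the full list is not given here. The sample articles are invented.

```python
import re

# Partial keyword list from the text above; the real list was longer
COVID_TERMS = re.compile(
    r"covid|corona|lockdown|outbreak|curfew|reopen|facemask|respiratory|hospital",
    re.IGNORECASE,
)
WHO_TERM = re.compile(r"\bWHO\b")  # case-sensitive, to avoid matching "who"
EXCLUDE = re.compile(r"toyota|coronation", re.IGNORECASE)
AGENCY = re.compile(r"\bAFP\b|Xinhua")
UGANDA_TERMS = re.compile(r"uganda|kampala", re.IGNORECASE)  # assumed placeholders

def clean(articles):
    """Apply the four cleaning steps to a list of {'headline', 'body'} dicts."""
    seen = set()
    kept = []
    for article in articles:
        text = article["headline"] + " " + article["body"]
        key = (article["headline"].strip().lower(), article["body"].strip().lower())
        if key in seen:  # step 1: drop exact duplicates
            continue
        seen.add(key)
        if not (COVID_TERMS.search(text) or WHO_TERM.search(text)):  # step 2
            continue
        if EXCLUDE.search(text):  # step 3: Toyota Corona, coronation
            continue
        if AGENCY.search(text) and not UGANDA_TERMS.search(text):  # step 4
            continue
        kept.append(article)
    return kept

# Hypothetical articles, for illustration only
sample = [
    {"headline": "Uganda confirms new Covid-19 cases", "body": "The Ministry of Health announced the results."},
    {"headline": "Uganda confirms new Covid-19 cases", "body": "The Ministry of Health announced the results."},
    {"headline": "Toyota Corona prices rise", "body": "Car dealers report higher demand."},
    {"headline": "Coronavirus spreads in Europe", "body": "AFP reports new cases in Italy."},
    {"headline": "Parliament passes budget", "body": "MPs debated roads."},
    {"headline": "Xinhua: Uganda receives Covid-19 test kits", "body": "Kits arrived at Entebbe."},
]
cleaned = clean(sample)
kept_headlines = [a["headline"] for a in cleaned]
```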

Challenges with specific media houses: The Observer was not included in our topic model because its search function initially returned no more than 100 articles at a time. We later contacted the Observer, and they updated their website to allow unlimited search results. This let us measure the overall volume of articles they published on the topic, but we were unable to run the algorithm again, so their content was not fully included in our topic model: adding articles changed the order of the data, which affected the modelling results. We also faced a challenge with Nile Post, whose dates were formatted as “X time ago” rather than as exact dates. We reached out to Nile Post, and they changed the date format on their website in time for us to include their articles in our LDA Topic Model. Finally, we noticed a gap in our data from the Red Pepper, and confirmed with this media house that their website was inactive at the time.

LDA Topic Modelling: We ran our algorithm in many variations, changing the number of topics between 15 and 22, and found that we received the best results when we set the number of topic clusters to 18. Two pairs of topics seemed redundant, so we merged each pair, which left us with 16 unique topics. Our algorithm generated two clusters that were a mix of personal stories, editorials and tips for coping with the effects of Covid-19, which we amalgamated into topic # 15. Similarly, stories about declaring the number of cases and efforts to conduct testing were amalgamated into topic # 1.

Topic analysis: Our first attempt at topic analysis was to check the keywords for each cluster. We applied text analysis to article bodies and headlines using Python to develop word clouds for each cluster. Since some topics comprised mixed subtopics, we were unable to tell the actual topic from the keywords alone. Figure 2 below demonstrates this challenge.
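A word cloud sizes each word by how often it occurs, so the underlying step is a keyword count per cluster. The sketch below shows that counting step only, with an invented stopword list and invented headlines; it is not our word cloud code.

```python
import re
from collections import Counter

# Illustrative stopword list, far from exhaustive
STOPWORDS = {"the", "a", "of", "in", "to", "and", "for", "on", "is", "as"}

def top_keywords(texts, n=5):
    """Most frequent non-stopword tokens across a cluster's texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Hypothetical clusters, for illustration only
cluster_texts = {
    0: ["Schools to reopen in phases", "Exams postponed as schools stay closed"],
    1: ["Police arrest curfew violators", "Police enforce directives"],
}
keywords = {c: top_keywords(texts, n=3) for c, texts in cluster_texts.items()}
```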

We then reviewed around 15 of the top matching sample articles in each cluster to further analyse and determine topic categories. We went over the article headlines together with our team manually again to confirm topic subjects and name each one.
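Selecting the top matching articles for manual review can be sketched as follows. We assume here that "top matching" means the articles with the highest weight for a given topic in the model's document-topic output; the weights and headlines below are invented.

```python
def top_articles(doc_topic, headlines, topic, n=15):
    """Headlines of the n articles with the highest weight for one topic."""
    ranked = sorted(
        range(len(headlines)),
        key=lambda d: doc_topic[d][topic],
        reverse=True,
    )
    return [headlines[d] for d in ranked[:n]]

# Hypothetical topic weights (rows: articles, columns: topics)
doc_topic = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
]
headlines = ["Schools to reopen", "Police enforce curfew", "Exams postponed"]
best = top_articles(doc_topic, headlines, topic=0, n=2)
```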

TOPIC ANALYSIS
Our analysis identified 16 topics that emerged in Ugandan online news media coverage of Covid-19, as listed below:

1. Cases & testing
2. Healthcare
3. Domestic outbreak (and government enforcement)
4. Parliament budgeting
5. Travel restrictions
6. Global outbreak and international response
7. Contributions to Covid-19 budget
8. Economy & finance
9. Education
10. Courts & justice
11. Sports
12. Culture & religion
13. Electoral politics
14. Police action
15. Editorials & personal stories
16. Presidential directives

These topics were identified through an unsupervised algorithm, which we ran several times using different parameters until we found the best result. We checked the articles it grouped together to identify what the topics were. Some topics were straightforward, as the articles shared common themes, such as # 9 Education and # 11 Sports. Others were made up of mixed subtopics under one major topic, such as # 3, which gathered together stories about the domestic outbreak and how the government was responding. We also chose to combine four automatically generated topics into two, since their subjects were very similar: stories about cases and testing were combined into topic # 1, and editorials and personal stories were combined into topic # 15. The algorithm also clustered together groups of stories that differed but shared a common thread, such as those mentioning the courts system (# 10), or those related to different types of restrictions on transport and travel (# 5).
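Merging redundant clusters amounts to relabelling the model's output. In the sketch below, the raw cluster ids 17 and 16 are hypothetical, since which raw clusters were merged is not recorded here; only the target topics (# 1 and # 15) come from the text above.

```python
# Hypothetical raw cluster ids mapped onto the topics they were merged into
MERGE = {17: 1, 16: 15}

def merged_topic(raw_topic):
    """Map a raw cluster id to its final topic number."""
    return MERGE.get(raw_topic, raw_topic)

raw_assignments = [1, 17, 9, 15, 16, 11]
final = [merged_topic(t) for t in raw_assignments]

# 18 raw clusters collapse to 16 unique topics
unique_topics = {merged_topic(t) for t in range(1, 19)}
```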

More details about our topic modelling, along with a further examination of a few key topics related to Covid-19 and how they manifested in Ugandan news coverage, are shared in the full report below.

Download the full report here.
