The Public Debate in News Media comment section

Part 1: Web scraping

Anders Grundtvig
7 min read · Oct 23, 2020

Prologue: I have long been interested in whether the so-called Social Web (digital infrastructure that enhances user contributions on the web) is democratizing the web by allowing people around the globe to share and engage in public debates, or dividing it by encouraging polarizing and aggressive opinions to blossom. So for me, the question here is how user contributions on the Social Web affect “the digital public debate”.

In this project I am looking at public debates taking place in the comment section of a Danish newspaper, Dagbladet Information. This first part focuses solely on the data gathering process.

The dataset

The dataset that I am describing here consists of 3080 articles from information.dk, published between the 1st of March 2020 and the 19th of October 2020. The data is web scraped from Dagbladet Information’s website with a web scraper tool I built in Python using the widely popular Beautiful Soup package. Let me elaborate a bit on the process and the thoughts behind this particular web scraping exercise.

I am not going to explain what web scraping is or go into detail about how it works. There are plenty of incredible resources out there that already do that. But I would like to highlight some aspects that, to me, have not been emphasized enough in the first 20 results you will find when searching for “web scraping”, and then go on to show a rough sketch of how my web scraper works.

Robots.txt

First off, there are a lot of debates and controversies about the legal aspects of web scraping. The key point is to be aware of them and do some research before starting to write code. Every self-respecting website has what is known as a robots.txt file. This is a machine-readable text file located at the root of the website that tells web scrapers, also known as robots, how they should behave on that particular website. The robots.txt file can easily be found by adding “/robots.txt” to the root of the website URL (e.g. information.dk/robots.txt). Try it out now, it is pretty cool. Please note that the robots.txt file can’t enforce how web scraping robots behave on the site; it can only set up behavioral guidelines, and then it is the programmer’s task to build robots that obey them. So please make sure your robots aren’t violating the requested behavior.

Another aspect of web scraping to keep in mind is that web scraping is basically the driving factor behind all web search engine infrastructure! The results showing up in your Bing, Yahoo! and Google searches are available simply because Microsoft, Verizon, and Google are constantly scraping the entire web for results to show you. What I am saying is that web scraping might feel like a black-hat business, but it is actually driving some of the biggest companies in the world.

The robots.txt file on information.dk told me that my robots were welcome on my desired destinations as long as I waited 10 seconds between each website request, also known as a crawl delay. For my program, that only meant more waiting time to scrape all the desired articles, since each article is one request. But all in all, that was good news for this project.
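As a side note, these rules can also be read programmatically. Here is a minimal sketch using Python’s built-in urllib.robotparser module; this is not part of my scraper, and the path in the example is only there for illustration:

# A minimal sketch: reading the robots.txt rules with Python's built-in robotparser
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.information.dk/robots.txt")
rp.read()

# Is a generic robot ("*") allowed to fetch this example path? (path chosen for illustration)
print(rp.can_fetch("*", "https://www.information.dk/debat"))

# What crawl delay does the site request for generic robots? (None if not specified)
print(rp.crawl_delay("*"))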

The web scraping robots

The first step was to understand the web structure of information.dk. Since I was interested in the user comment section, and the user comments are located on the same page as the related article, I simply needed a list of the URLs of all the articles I was interested in. Information.dk provides an archive of all their articles, structured month by month. That came in handy. So first off, I programmed my robots to visit each month in the archive and return with the URL for each day. Then I told my robots to visit each day and return with the URL for each article published that day. That adds up to my robots first visiting seven URLs (months) and then 214 URLs (days), and with the 10-second crawl delay (robots.txt requirement) between each request, we end at roughly 37 minutes of runtime.
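A rough sketch of that archive crawl could look something like the following. Note that the archive URL pattern and the link selectors here are assumptions made for illustration; the real structure has to be read off the pages with “inspect”, as described further down.

# Sketch of the month -> day -> article crawl (URL pattern and selectors are assumed)
import time
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

CRAWL_DELAY = 10  # seconds, as requested by the site's robots.txt

def get_links(url, css_selector):
    # Fetch a page and return the absolute URL of every link matching the selector
    time.sleep(CRAWL_DELAY)
    soup = bs(requests.get(url).text, "lxml")
    return [urljoin(url, a["href"]) for a in soup.select(css_selector) if a.get("href")]

# Hypothetical month pages in the archive (March to September 2020)
month_urls = [f"https://www.information.dk/arkiv/2020/{m:02d}" for m in range(3, 10)]

day_urls = []
for month_url in month_urls:
    day_urls += get_links(month_url, "a.day-link")  # assumed selector for day links

article_urls = []
for day_url in day_urls:
    article_urls += get_links(day_url, "a.article-link")  # assumed selector for article links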

Now I had a list of all the article URLs I was interested in and was ready for the real execution. The next set of robots I programmed to visit each of these URLs and this time return with the actual data I was looking for. For each URL I had two focal points: data about the article itself and data from the comment section. To begin with, the robots gathered data about the article: “author name”, “publishing date”, “title”, “text header”, “abbreviated text body”, and the “article image URL”. Then the robots found the comment section and gathered the comment data generated by the users: “commenter name”, “comment text”, “comment recommended by”, and “number of recommendations”.
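In code, that per-article step could be sketched roughly as below. Only the “article”/“h1” part is taken from the example further down; the selectors for the comment fields are placeholders made up for illustration, since the exact class names have to be read off the page with “inspect”.

# Sketch of the per-article extraction (comment selectors are placeholders)
import requests
from bs4 import BeautifulSoup as bs

def scrape_article(url):
    soup = bs(requests.get(url).text, "lxml")
    article = soup.find("article")

    article_data = {
        "url": url,
        "title": article.h1.text if article and article.h1 else None,
        # author name, publishing date, text header, abbreviated body and image URL
        # are pulled out the same way, each from the tag found with "inspect"
    }

    comments = []
    for c in soup.select("div.comment"):  # placeholder selector for one comment block
        comments.append({
            "commenter_name": c.select_one(".author").text.strip() if c.select_one(".author") else None,
            "comment_text": c.select_one(".comment-text").text.strip() if c.select_one(".comment-text") else None,
        })
    return article_data, comments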

The so-called robots are actually not small, square, grey-toned robots running all around the website searching, but a rather boring and less illustrative HTTP request whose HTML response is parsed with the Python package Beautiful Soup. This gave me access to all the content of the website, and from there I could ask for the parts I was interested in. Let me give an example:

All the articles on information.dk share the same HTML structure, which determines where the article objects, “Title”, “Header”, “Date”, and so forth, are located and how to find them. To see the HTML structure of an article you can right-click on an object and click “inspect”. This allows you to see the HTML code “behind” the website. When, for example, right-clicking the title of the article and clicking “inspect”, you will find that the title is located in the HTML under the “h1” tag. It is slightly more complicated than that, but basically, with the newly found “h1” tag you can ask Beautiful Soup to get the content of “h1”, which in this case is the title of the article. Since the structure of all the articles is the same, it is now simply a task of looping over the list of article URLs and executing the same code over and over to get the titles of all the articles.

This is how the above example could look in Python:

# Here I am importing the libraries
from bs4 import BeautifulSoup as bs
import requests

# This request gets the HTML text from the website, which is then parsed with the lxml parser
website = "https://www.information.dk/debat/2020/10/rune-lykkeberg-coronavirussen-virkelighedens-haevn-donald-trumps-totale-teater"
source = requests.get(website).text
soup = bs(source, "lxml")

# Here I am searching the HTML for the "article" section, where I know the article metadata is located.
# This way I am only making one request per article instead of one request per metadata point
article = soup.find("article")

# Here I am printing the text of the "h1" tag, where I know the title is
print(article.h1.text)

OUT: “Rune Lykkeberg: Coronavirussen er virkelighedens hævn over Donald Trumps totale teater”

One aspect of the web scraping task that ended up causing a bit of trouble was the articles with more than 50 comments. Apparently, information.dk only shows 50 comments per page, and then you have to press “next” to get the next 50 comments. This next button actually sends you to a new URL, which is the same as the original but with “?page=1” at the end. So in order to access more than the first 50 comments, I had to tweak my program a little. There are plenty of ways I could have proceeded, but I decided to tell my program to look for the “next” button and, if it was present on the page, visit the same URL again with “?page=1” appended, looping over this process by adding 1 to the “page=” part of the URL until there no longer was a “next” button present on the given page.
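A sketch of that loop, assuming the “next” button can be recognized by some class name (the class used here is made up for illustration), could look like this:

# Sketch of the comment pagination loop ("pager-next" is an assumed class name)
import time
import requests
from bs4 import BeautifulSoup as bs

def get_comment_pages(article_url, crawl_delay=10):
    pages = []
    page_number = 0
    while True:
        url = article_url if page_number == 0 else f"{article_url}?page={page_number}"
        time.sleep(crawl_delay)
        soup = bs(requests.get(url).text, "lxml")
        pages.append(soup)
        # Stop when there is no longer a "next" button on the page
        if soup.find("a", class_="pager-next") is None:
            break
        page_number += 1
    return pages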

Now I had all the elements for building the web scraping tool. I designed the tool so it takes two inputs: a start date and an end date. It then looks through the archive on information.dk, finds the available article URLs for the desired days, and then gathers the requested data by scraping one article at a time and saving the newly gathered information into a data frame.
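Put together, the top level of the tool could be sketched like this, reusing the helper functions from the sketches above and saving one row per comment into a pandas DataFrame (the date filtering of the archive is left out here):

# Sketch of the top-level scraper, building on the sketches above
import pandas as pd

def scrape_comments(article_urls):
    rows = []
    for url in article_urls:
        article_data, comments = scrape_article(url)  # from the sketch above
        for comment in comments:
            rows.append({**article_data, **comment})
    return pd.DataFrame(rows)

# df = scrape_comments(article_urls)
# df.to_csv("information_comments.csv", index=False)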

After about 9 hours of runtime (with the 10-second crawl delay between each URL request), I had gathered a dataset from information.dk with 35700 user comments spread over 3089 articles from the period 1st of March until the 19th of October.

Next up is asking questions of the dataset and exploring what it can show us about the public debate. That’s the theme for the next post.

//Anders Grundtvig, Master in Techno-Anthropology
