Have you ever come across a robojournalist? Robojournalism is a great way to make reports on repetitive events available to people while letting human journalists focus on work that requires research and human insight. One example of such a system is Quakebot by the LA Times. However, I found it surprisingly hard to find examples where the use of robojournalism is clearly stated.
What is robojournalism?
In robojournalism, or automated journalism, data is transformed into news reports written in a human language. This is achieved with natural language generation (NLG) techniques. According to Graefe [1, p.9], the motivation for robojournalism is as follows: “… not only can algorithms create thousands of news stories for a particular topic, they also do it more quickly, cheaply, and potentially with fewer errors than any human journalist.” According to the same paper, robojournalism is the best fit when structured data is available and the topic is repetitive in nature. Some traditional use cases are sports, financial and weather reporting.
According to Graefe, the process of automated journalism has five phases:
- collection of data,
- identification of interesting events,
- prioritisation of insights,
- generation of the narrative,
- and finally publishing the story.
They emphasise that automated journalism systems require input from domain experts to define domain-specific rules and criteria of newsworthiness. At its simplest, the system could be something like the one described in my earlier article How to create your own robojournalist. A more complex system could report more detailed information about a match, which would require it to prioritise the insights for each match. In other words, the system should score the events in a match by newsworthiness.
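The five phases above can be sketched in a few lines of code. Everything below — the event types, the scoring weights and the sentence wording — is invented for illustration; in a real system, domain experts would define the newsworthiness criteria.

```python
# A toy sketch of the five phases: collect data, identify events,
# prioritise them, generate a narrative, publish. All names, weights
# and thresholds are made-up examples, not a real system's rules.

def collect_data():
    # Phase 1: in practice this would poll an API or a database.
    return [
        {"minute": 12, "type": "goal", "team": "Home"},
        {"minute": 34, "type": "yellow_card", "team": "Away"},
        {"minute": 78, "type": "goal", "team": "Away"},
    ]

def newsworthiness(event):
    # Phases 2-3: score events; domain experts would set these weights.
    weights = {"goal": 10, "red_card": 8, "yellow_card": 2}
    late_bonus = 3 if event["minute"] > 75 else 0  # late events feel bigger
    return weights.get(event["type"], 0) + late_bonus

def generate_story(events, top_n=2):
    # Phase 4: verbalise only the highest-scoring events.
    ranked = sorted(events, key=newsworthiness, reverse=True)[:top_n]
    sentences = [
        f"{e['team']} recorded a {e['type'].replace('_', ' ')} "
        f"in minute {e['minute']}."
        for e in ranked
    ]
    return " ".join(sentences)

def publish(story):
    # Phase 5: here we just print; a real system would push to a CMS.
    print(story)

publish(generate_story(collect_data()))
```

Running this reports the two goals and drops the yellow card, which is the point: prioritisation decides what is worth telling, not just what happened.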
There are requirements human journalists must fulfil for a news report to qualify as good. From the requirements identified for human journalists, Leppänen et al. derived that a news generation system should be:
- modifiable and transferable to other domains,
- able to produce fluent output,
- based on available data,
- and able to produce topical news.
The idea of news automation is not novel at all. One of the earliest articles presenting the idea is from as early as 1970, by Glahn. In the article, they describe a process for weather forecast reporting that is similar to current template-based solutions by software providers, in which a set of predefined rules is used to determine which prewritten statements are selected to create a story.
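The template-based approach described above can be illustrated with a minimal sketch: predefined rules are checked against structured data, and the first matching rule selects a prewritten statement. The conditions and wording here are invented examples, not Glahn's actual rules or any vendor's templates.

```python
# A minimal sketch of rule-based template selection for weather
# reporting. Each entry pairs an invented condition with an invented
# prewritten statement; the first matching rule wins.

FORECAST_TEMPLATES = [
    (lambda d: d["rain_mm"] > 5, "Heavy rain is expected in {city}."),
    (lambda d: d["rain_mm"] > 0, "Light showers are possible in {city}."),
    (lambda d: d["temp_c"] > 25, "It will be warm and dry in {city}."),
    (lambda d: True, "No notable weather is expected in {city}."),
]

def weather_report(data):
    # Walk the rules in priority order and fill in the chosen template.
    for condition, template in FORECAST_TEMPLATES:
        if condition(data):
            return template.format(city=data["city"])

print(weather_report({"city": "Helsinki", "temp_c": 27, "rain_mm": 0}))
# → It will be warm and dry in Helsinki.
```

The final catch-all rule guarantees a story is always produced, which matches how template systems favour accuracy and fluency: every possible sentence was written by a human beforehand.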
Where are the robojournalists (now)?
If the idea is old and the required technology exists, why is it so difficult to find examples of automatically generated news in major news publications? In 2020, Graefe and Bohlken summarised how readers perceive the credibility, quality, and readability of automated news in comparison to human-written news by doing a meta-analysis of 11 peer-reviewed papers published in English-language scientific journals between 2017 and 2020, and noted the same difficulty. Furthermore, according to them, the number of media organisations listed as clients of NLG providers is rather small. Then again, they add: “…although this may have to do with reasons of commercial confidentiality.” Based on these observations, they argue that the field of automated journalism remains in an early market-expansion stage, just as Dörr stated already in 2016.
They anticipate that media houses might hold back on robojournalism as long as their readers disapprove of automated news. According to the MAIN model (Modality-Agency-Interaction-Navigability), automatically generated news could be perceived in conflicting ways by readers.
Graefe and Bohlken identified the following heuristics, mentioned by Sundar, as applying to this confrontation between human journalists and robojournalists. The authority heuristic suggests that readers should prefer human journalists, as they can be seen as subject-matter experts. Further to the advantage of human journalists, the social presence heuristic suggests that readers should prefer human journalists, as it feels more like interacting with a human than with a machine. What is conflicting is that the machine heuristic suggests that machine-generated news should be perceived as more objective than human-written news, as machines are assumed not to introduce bias.
Interestingly, Graefe and Bohlken state the following [4, p.57]: “In other words, regardless of the actual source [human or robojournalist], participants assigned higher ratings simply if they thought that they read a human-written article.” According to them, the overall comparison results show that readers perceived credibility to be the same for both human and robojournalists, quality to be slightly in favour of human writers, and readability to be hugely in favour of human writers.
Graefe and Bohlken anticipate that this finding, that people score machine-generated news lower merely because they know it was written by a machine, may encourage news providers to hold back from stating that a story was reported by a robojournalist. They state that this emphasises the ethical challenges of robojournalism (see Dörr & Hollnbuchner for a discussion of the ethical challenges). Lastly, I’d like to add that not stating the real reporter of a news story, be it a human or a robot, violates the requirement of transparency.
In a 2019 study by Sirén-Heikel et al., 13 of the 26 interviewed American and European media representatives mentioned domain-specific, template-based NLG as the type of automation used at their media house. Based on this, and on the fact that in 2020 Graefe and Bohlken identified no major progress in the industry, I dare to anticipate that template-based systems remain the industry standard.
The use of template-based systems is encouraged by the fact that they respect the requirements of transparency, accuracy and fluency of output. One of their downsides is that they do not respect the requirement of modifiability and transferability. Here, I refer to the requirements of automated journalism mentioned in the previous section of this article.
As in any field facing automation, robojournalism has of course raised discussion about whether the future of human journalists is at stake. There are some quite recent news stories about human journalists losing their jobs to robojournalists, such as Microsoft ‘to replace journalists with robots’ in 2020.
Benefits and limitations
Although the use of robojournalism might be progressing slowly, it has some clear benefits. Graefe listed the following benefits in 2016.
- Speed–Automation allows speedy publishing of news reports — a report can be published nearly in real time as soon as the source data becomes available.
- Scale–The scale at which reports are published can be increased — for example, rather than reporting only the major earthquakes, a report can be published about all observations by seismographic sensors without running out of journalistic resources.
- Error-free–Automated journalism systems are argued to be less error-prone, as they do not make mistakes such as misspellings or calculation errors. In other words, their accuracy is higher than that of human journalists — assuming that there are no errors in the code and that the source data is correct.
- Objectivity–Again, assuming that no subjectivity is coded into the system and that the data is objective, the system will produce objective output.
- Personalisation–Finally, automation allows personalisation of news reports for smaller target groups — even individuals — and offering news on demand.
In addition to the potential benefits, Graefe mentions the following limitations. It should be noted that NLG has advanced tremendously since the article was published.
- An automated journalism system can only be as good as the data it uses. In other words, the availability and quality of the source data is key to success.
- Although the system might identify interesting events in the source data, it cannot ask the question “why?”. Thus, human validation and reasoning is still required.
- The algorithms lack ingenuity, and are thus limited in their ability to observe society and fulfil journalistic tasks.
- People prefer reading human-written rather than automated news, according to experimental evidence.
As mentioned, robojournalism has many benefits and is already well suited for reporting repetitive events for which structured data is available. News automation allows us to report news for smaller target audiences, report the same thing in multiple languages, and personalise content–all this without creating massive additional costs.
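Multilingual reporting in particular falls naturally out of the template approach: the same structured data can be rendered through per-language templates. The sketch below uses invented English and Finnish templates and made-up match data purely to illustrate the idea.

```python
# A hedged sketch of serving several languages from one data record.
# Templates and data fields are invented examples for illustration.

TEMPLATES = {
    "en": "{team} won the match {score}.",
    "fi": "{team} voitti ottelun {score}.",
}

def localised_report(data, lang):
    # One structured record, one template per language.
    return TEMPLATES[lang].format(**data)

result = {"team": "HJK", "score": "2-1"}
print(localised_report(result, "en"))  # → HJK won the match 2-1.
print(localised_report(result, "fi"))  # → HJK voitti ottelun 2-1.
```

Adding a language is then a translation task rather than a software project, which is why this kind of scaling comes without massive additional costs.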
Despite the clear benefits, it is difficult to find examples of articles produced by robojournalists. I wonder if this is due to news houses holding back the development of the systems, or holding back the information that they are using them.
Author: Miia Rämö, Data Consultant @Solita. I build data pipelines and develop APIs in Azure. MSc in Data Science with a focus on NLP.