Computers read newspapers too.

Staying up to date has always been the bread-and-butter activity of any media company. Faced with the flood of data now inundating the world's news networks and social media channels, computer scientists have entered the field and are developing machine-reading software that can read up on the stories for which human researchers do not have time.

Pioneering research into the use of machine reading in media is being conducted at the London Media Technology Campus (LMTC), the main base of operations for the strategic partnership formed between the BBC and University College London (UCL) in 2014. This new research center is also the new home of UCL’s machine reading group, led by German machine-reading expert Dr. Sebastian Riedel. He states that models using artificial neural networks made an impressive comeback in 2014 by winning several machine translation contests with almost no training data. The entire field of natural language processing (NLP) has been going through a revival ever since.

Machine Reading

Machine reading (also referred to as natural language understanding) is a branch of NLP concerned with transforming textual information into more abstract data representations that computers can manipulate more easily. The earlier standard approach to machine reading resembled a computational assembly line. Computer scientists would write down a set of predefined rules describing how a language functions in terms of its syntax, grammar, and semantics. The software would then use these rules to try to establish meaning. The problem with this approach is that languages are incredibly complex systems with a myriad of ambiguities, making it next to impossible to predefine all of the rules.

Modern machine reading has therefore increasingly turned towards statistical methods to ‘learn’ the language in a more natural way. This is done by computationally examining a large volume of sample texts (called corpora) as ‘training data’ to establish statistical relationships between the elements in sentences (words, parts of speech, etc.). Statistical models that achieve this can generally be classified as either supervised or unsupervised learning models. Supervised models require so-called annotated corpora: texts in which humans have already marked the answers that are expected from the model. While this approach is completely valid, its biggest flaw is that improving the models and expanding their capabilities requires more annotated training data, and creating these datasets takes a lot of time.
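
As a rough illustration of what ‘establishing statistical relationships’ between words can mean, the sketch below counts which word follows which in a tiny, made-up corpus and turns the counts into probabilities. The corpus and the function name `bigram_probabilities` are assumptions for this example only; real systems use far larger corpora and richer models.

```python
from collections import Counter, defaultdict

def bigram_probabilities(sentences):
    """Estimate P(next word | current word) from raw, non-annotated sentences."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Count every adjacent word pair in the sentence.
        for current, following in zip(tokens, tokens[1:]):
            pair_counts[current][following] += 1

    probabilities = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probabilities[current] = {word: count / total
                                  for word, count in followers.items()}
    return probabilities

# Hypothetical three-sentence corpus, invented for illustration.
corpus = [
    "the minister gave a speech",
    "the minister gave an interview",
    "the reporter gave a summary",
]

model = bigram_probabilities(corpus)
print(model["gave"])  # "a" follows "gave" in two of three cases
```

No human annotation is involved here: the only input is raw text, which is why this kind of counting scales easily to very large corpora.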

Unsupervised learning models (like neural networks) are the big winners behind the aforementioned advances in machine translation. Given the same amount of training data, these models would inherently be less accurate than supervised methods, but their great advantage is that they can also be trained on non-annotated data. This is what makes methods such as neural networks suitable for direct transfer to machine reading. The sheer mass of raw training data that is publicly available on the Internet can offset the inherently lower accuracy and allows these models to be trained with relatively little annotated input.

Neural Networks in NLP

Generally, whenever an input is fed into a neural-net model, it is run through several ‘hidden layers’ of activation functions (see Figure 1). The simplest example is the step function of a ‘perceptron’, which returns 1 if the input is greater than some threshold and 0 otherwise (this is similar to the neurons in the human brain, which either fire or don’t, depending on the strength of the input). The outputs of the activation functions are passed on to the next layer until the end is reached, where a final output is computed. The output of the first round is nothing more than a random guess. Nonetheless, the model can calculate how far it was from the correct result by comparing its outputs to the actual solutions. This enables the model to adjust the weights that feed its activation functions to achieve a lower error rate for the next round of inputs. This technique is called back-propagation and is used to fine-tune the hidden layers over the course of many iterations. In this training phase, the model requires some human input to provide the solutions. However, after its training is complete, it can process data and generate outputs without the need for further human help.

Figure 1: The structure of a neural network model. Note that in reality, the number of hidden layers can vary.
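
The guess, compare, adjust loop described above can be sketched with a single perceptron. Note the simplification: this uses the classic perceptron learning rule for one unit, not full back-propagation through hidden layers, and it starts from zero weights rather than random ones so that the run is reproducible. The task (learning the logical AND of two inputs) and all names are assumptions chosen for illustration.

```python
def step(weighted_sum, threshold=0.0):
    """Perceptron activation: 1 if the input exceeds the threshold, else 0."""
    return 1 if weighted_sum > threshold else 0

def predict(weights, bias, inputs):
    """Compute the unit's output for one input vector."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(weighted_sum)

def train(examples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge the weights whenever a guess is wrong."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # Compare the model's output with the human-provided solution.
            error = target - predict(weights, bias, inputs)
            # Adjust weights and bias in the direction that reduces the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training data with solutions: the logical AND of two binary inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_EXAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(predictions)  # [0, 0, 0, 1]
```

After training, `predict` needs no further solutions: it simply applies the learned weights, mirroring the point that a trained model can run without human help.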

In NLP, such neural networks are used to create ‘word embeddings’ (see Figure 2) by reading hundreds of thousands of sentences. The goal is to place every word in a vector space that captures how far away (mathematically) it is from any other word. Moreover, it has been discovered that the offset between two words can be related back to the nature of their difference. Gender is one example: the mathematical distance from “queen” to “king” is approximately the same as the distance from “aunt” to “uncle”. This ‘encoding’ of relationships is a general property of word embeddings, which makes them a very natural representation of language data.

Figure 2: Visualisation of a word embedding in the jobs region. (Source)
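
The “queen”/“king” versus “aunt”/“uncle” relationship can be illustrated with a few lines of vector arithmetic. The three-dimensional vectors below are hand-made for this example and are not learned embeddings (real ones typically have hundreds of dimensions and come out of training); the helper names are likewise hypothetical.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
# Informal axes: (royalty, male, female).
EMBEDDINGS = {
    "king":      [1.0, 1.0, 0.0],
    "queen":     [1.0, 0.0, 1.0],
    "uncle":     [0.2, 1.0, 0.0],
    "aunt":      [0.2, 0.1, 0.9],
    "newspaper": [0.5, 0.5, 0.5],
}

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def analogy(a, b, c):
    """Return the word whose vector is closest to c + (b - a)."""
    target = add(EMBEDDINGS[c], subtract(EMBEDDINGS[b], EMBEDDINGS[a]))
    # Exclude the query words themselves, as is common practice.
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return min(candidates, key=lambda w: distance(EMBEDDINGS[w], target))

# Apply the king -> queen offset to "uncle".
print(analogy("king", "queen", "uncle"))  # aunt
```

The offset from “king” to “queen” acts as a reusable ‘gender direction’: adding it to “uncle” lands closest to “aunt” in this toy space.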

Applications

Newspapers, magazines, and broadcasters alike are increasingly dependent on external news agencies to supply them with both breaking news and factual data to support their articles. In a constant race not to miss out on any big stories, broadcasting giants are forced to spend millions of pounds per year to get access to current information. In-house research and information gathering have become increasingly impractical and unreliable. Part of the LMTC research group is therefore attempting to use machine reading to automatically read through the news reports and social media feeds of the world, and then to create knowledge databases about entities such as countries, companies, and influential people, aggregating news stories and constantly updating the information.

“We want our software to not only understand language, but also to answer questions”, says Dr. Riedel. He states that such ambitions frequently invite comparisons to present-day personal assistants like Apple’s Siri, Google Now and Microsoft’s Cortana. However, the technology at work within these applications is a world apart from what runs inside modern machine-reading software. While the algorithms behind smart assistants, with their ability to recognise and extract meaning from anybody’s voice, represent a remarkable (and at times quite goofy) achievement of computer science, these systems lack the knowledge needed to answer any “real” questions that cannot easily be looked up online.

Riedel believes that much more interesting applications of intelligent personal assistants arise when they are coupled with machine-reading algorithms. In the years to come, he plans to drive the lab’s research towards building an expert assistant system that can be commanded to familiarise itself with any topic its users might need assistance with, not only the aforementioned media knowledge bases. After the AI has finished reading everything there is to read about the topic, it should be able to answer even complicated questions. “Lawyers will be able to task their personal assistant to read the legislation of an entire country. The AI will then be able to answer specialised questions and find the excerpts the lawyer is looking for”.

To make this vision a reality, the research group is currently building a machine-reading AI that can pass elementary-level science exams just by reading the relevant textbooks. This may sound like child’s play in comparison to reading a 2000-page law code, but one must not forget that, for a computer, a complex subject matter is potentially more computationally intensive, but not inherently more difficult. Bluntly put, software doesn’t care whether it is reading a complicated manual for an MRI scanner or a collection of children’s stories. And it will read everything.
