Organisation

When our team members were gathered, we contacted our clients for our very first meeting to introduce each other and gather project requirements. As part of project initiation, we formed a group chat with the clients for easy communication, a project board on Trello for progress tracking and a repository on GitHub for easy sharing and viewing of code. Every part of the project is shared simultaneously with the clients and our teaching assistant so that they can monitor our contributions and progress easily.

Furthermore, we scheduled weekly meetings with the client to explain our completed tasks and seek constructive guidance if we have questions or issues. The weekly sessions also helped our clients to understand our levels of expertise and offer appropriate feedback. As we are working remotely both with the clients and our team members, we would delegate tasks among ourselves each week for teamwork and efficiency.

Research and Readings

Answering complex questions about tabular information is a challenging task. No two tables are the same and the answers can come in multiple forms and usually computed from a subset of cells. To begin our research the client suggested two papers that explored how a program would be able to answer natural language questions over tables and which parsing technique would be most beneficial.

TaPas (Table Parser) is one of two papers that explored this concept. TaPas is a question answering model that reasons over tables without generating logical forms. “TaPas effectively pre-trains over large scale data of text-tablepairs and successfully restores masked words and table cells.” The results showed that TaPas is better or on par compared to traditional semantic parsers.

The second paper explored TaBERT. TaBERT is a trained langage model (LM), built on a collection of 26 million large tables that learns natural language sentences and table structures. TaBERT is built on top of the BERT, a natural language processing (NLP) model and takes a combination of natural language queries and tables as input. TaBERT encodes only the subsets of the table that are more important to the query , this means TaBERT can handle large quantities of data.

Data Exploration

So far we have been collaborating with our client for a few weeks and all of the tasks given to us revolved around getting to know the basis of natural language processing, parsing the tabular data as well as searching for and generating relevant tabular data in order to train the model we are going to create.

In order to provide a wide range of data for the training purposes, we chose 5 different kinds of lab results (such as blood testing, cytology, etc.). Next, we found 5 different results for each of the kind. Examples for particular testing were different not only in measurements, but also in what substances were measured and in what way. For example one blood test could measure glucose level while the other not. This way we ensure that the model will not be adjusted to only one type of reports.

When we had all the results converted to csv files we started generating new reports. There were 25 different reports and we generated 40 reports for each of the 25 templates. Summing up, 1000 reports were created. To quickly finish this task we used Python as a programming language.