(29.11.2021 - 12.12.2021)
TaPas, as mentioned in the previous section of the blog, is a model that approaches question answering over tables without generating logical forms. TaPas trains from weak supervision and predicts denotations by selecting table cells and optionally applying a corresponding aggregation operator to that selection. Extending BERT's architecture with additional positional embeddings to encode the tabular structure allows TaPas to learn operations from natural language without the need to specify them in logical forms. Pre-training the model on 6.2 million tables with at most 500 cells per table ensures that it can handle different kinds of data and table templates.
After investigating the model, we found that TaPas works reasonably well on clinical data, but there are still problems with certain vocabulary and abbreviations. We therefore decided to improve the accuracy of the answers by fine-tuning the model on the previously generated fake data.
The TaPas model is pre-trained using the masked language modelling (MLM) objective on a large collection of tables from Wikipedia and associated texts. However, given that our task is to reason over clinical tabular data, we need to further pre-train the model using lab reports. This increases the reasoning capabilities of TaPas, which in turn improves performance on downstream tasks specific to clinical data.
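To give an intuition for the MLM objective: a fraction of the tokens in a serialized table row is replaced with a mask token, and the model is trained to recover the originals. The toy sketch below illustrates only the masking step in plain Python; the lab values and the 15% masking rate are illustrative, not the actual TaPas pipeline.

```python
import random

# Illustrative serialized lab-report row (not real patient data).
row = "Haemoglobin 13.5 g/dL Leukocytes 6.2 10^9/L Platelets 250 10^9/L"
tokens = row.split()

def mask_tokens(tokens, rate=0.15):
    """Mask ~rate of the tokens; return the masked sequence and MLM labels."""
    n_mask = max(1, round(len(tokens) * rate))
    positions = set(random.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels hold the original token at masked positions (what the model
    # must predict); positions that were not masked do not enter the loss.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

masked, labels = mask_tokens(tokens)
print(" ".join(masked))
```

During further pre-training on lab reports, the model sees many such masked sequences and learns the clinical vocabulary and abbreviations that Wikipedia tables lack.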
To do this, we built an Excel file with the data generated as explained in blog 1, where each row corresponds to a question related to a table.
The position column identifies whether the question is the first, second, and so on in a sequence of questions related to a table.
The table_file column identifies the name of the table file, which refers to a CSV file in the table_csv directory.
The answer_coordinates and answer_text columns indicate the answer to the question. The answer_coordinates column holds a list of tuples, each being a (row_index, column_index) pair, and the answer_text column a list of strings with the corresponding cell values.
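A minimal sketch of how such a file can be assembled with pandas; the question texts, file names and coordinates below are made-up examples, not rows from our actual data:

```python
import pandas as pd

# Hypothetical rows in the format described above; two sequential
# questions about the same table, distinguished by the position column.
data = [
    {
        "position": 0,                    # first question about this table
        "question": "What is the haemoglobin value?",
        "table_file": "table_csv/report_0.csv",
        "answer_coordinates": [(1, 2)],   # list of (row_index, column_index)
        "answer_text": ["13.5"],          # the corresponding cell values
    },
    {
        "position": 1,                    # follow-up question, same table
        "question": "And the leukocyte count?",
        "table_file": "table_csv/report_0.csv",
        "answer_coordinates": [(2, 2)],
        "answer_text": ["6.2"],
    },
]

df = pd.DataFrame(data)
# Written as CSV here for simplicity; df.to_excel(...) produces the
# Excel file described above.
df.to_csv("finetune_questions.csv", index=False)
```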
We fine-tuned TaPas on Sequential Question Answering (SQA), a dataset built by Microsoft Research for asking questions about a table in a conversational set-up. This improves the flow of the conversation between the user and the bot, and therefore the user's experience.
In order to compare and analyse how much accuracy our fine-tuned TaPas model gained, we generate tests involving pseudo-reports through Python scripting and run them with the Pytest module.
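A hedged sketch of what such a test can look like. `answer_question` is a hypothetical name standing in for our model wrapper; it is stubbed here with a simple table lookup so the example is self-contained and runs without the model:

```python
import pandas as pd

def answer_question(table: pd.DataFrame, question: str) -> str:
    # Stub standing in for the fine-tuned TaPas wrapper: look the analyte
    # up by name. The real function runs the model over the table.
    analyte = question.split("of ")[-1].rstrip("?")
    row = table[table["Analyte"] == analyte]
    return row["Value"].iloc[0]

# A pseudo-report table of the kind our Python scripts generate.
table = pd.DataFrame(
    {"Analyte": ["Haemoglobin", "Leukocytes"], "Value": ["13.5", "6.2"]}
)

# Pytest collects functions named test_*; plain asserts report failures.
def test_exact_value():
    assert answer_question(table, "What is the value of Haemoglobin?") == "13.5"

def test_follow_up_value():
    assert answer_question(table, "What is the value of Leukocytes?") == "6.2"
```

Running `pytest` over a file of such tests gives a pass/fail count that lets us compare the base and fine-tuned models on the same pseudo-reports.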
The types of tests include: