AITOK

📄 Paper: https://doi.org/10.1007/s44230-025-00131-4

AITOK in NTCIR-18 MedNLP-CHAT

NTCIR-18 MedNLP-CHAT

This page publishes the tools and data that Team AITOK used for additional experiments and accuracy evaluation after participating in NTCIR-18 MedNLP-CHAT.

Dataset

Sample

Each sample consists of a question, an answer, and labels for the answer. The labels are objective labels (‘medicalRisk’, ‘ethicalRisk’, and ‘legalRisk’) judged by experts in light of German laws and medical guidelines.

Data size: The German dataset comprises 112 question-answer pairs as the test set. The questions and answers were created by humans, referencing responses from a chatbot. The answer labels, which are what this task estimates, represent the evaluation of the answers; the three labels (medicalRisk, ethicalRisk, and legalRisk) were assigned by experts based on German laws and medical guidelines.

Languages: The data is created first (Step 1) and then translated into the other languages (Step 2). The training and test data are translated into English and French manually by professional translators.

Tool

LLMs_for_MedNLP_CHAT_Accuracy_Evaluation.ipynb

We are releasing an accuracy evaluation tool in IPython notebook (ipynb) format that can be run on Google Colaboratory. To run it, open the notebook in the Google Colaboratory environment, create a results directory, place the ground truth data (xlsx) and the LLM output results (csv) in it, and then execute each cell.
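As a minimal sketch of the kind of per-risk evaluation the notebook performs, the snippet below thresholds an LLM's predicted probabilities against the expert labels. The risk names come from the dataset; the 0.5 threshold, the data layout, and the toy values are assumptions for illustration only, not the notebook's actual schema.

```python
# Sketch of per-risk accuracy evaluation (illustrative only).
# The 0.5 threshold and the in-memory data layout are assumptions.

RISKS = ["medicalRisk", "ethicalRisk", "legalRisk"]

def accuracy(ground_truth, probabilities, threshold=0.5):
    """Fraction of answers where the thresholded probability matches the label."""
    preds = [p >= threshold for p in probabilities]
    return sum(p == g for p, g in zip(preds, ground_truth)) / len(ground_truth)

# Toy example: four answers, expert labels vs. one LLM's TRUE-probabilities.
truth = {"medicalRisk": [True, False, True, False]}
probs = {"medicalRisk": [0.9, 0.2, 0.4, 0.1]}

acc = accuracy(truth["medicalRisk"], probs["medicalRisk"])
print(acc)  # 3 of 4 predictions correct -> 0.75
```

In the notebook itself the ground truth would come from the xlsx file and the probabilities from the per-LLM CSV files placed in the results directory.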

ROC Curve (Fig. 1)

This tool generates a ROC curve for each output result and calculates the AUC. The ROC curves are overlaid to compare and evaluate the LLMs, as in Fig. 1.
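The AUC that the tool reports can be understood as the probability that a randomly chosen positive (risky) answer is scored higher than a randomly chosen negative one. A minimal dependency-free sketch of that computation (the notebook itself may use a library such as scikit-learn, which yields the same value):

```python
def roc_auc(labels, scores):
    """AUC via the rank (Mann-Whitney U) formulation: the fraction of
    positive/negative pairs ranked correctly, counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: two risky (positive) and two safe (negative) answers.
print(roc_auc([False, False, True, True], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Computing this per LLM, language, and risk gives the values that the overlaid ROC curves in Fig. 1 summarize.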

Results

LLMs output: results

All data required for accuracy evaluation is located under the results directory. This includes the 112 questions of the German task released in NTCIR-18 MedNLP-CHAT, containing only the correct answers for the three risks, along with 117 CSV files listing the probability values (TRUE/FALSE) for the three risks across three languages (de/en/fr), obtained using 13 different LLMs.
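The file count is consistent with one CSV per (LLM, language, risk) combination, since 13 × 3 × 3 = 117. This per-combination layout is an assumption inferred from the description, and the LLM names below are placeholders:

```python
from itertools import product

llms = [f"llm{i:02d}" for i in range(13)]  # placeholder names for the 13 LLMs
languages = ["de", "en", "fr"]
risks = ["medicalRisk", "ethicalRisk", "legalRisk"]

# One result file per (LLM, language, risk) combination.
combos = list(product(llms, languages, risks))
print(len(combos))  # 13 * 3 * 3 = 117
```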

Statistical results: csv xlsx

This document summarizes the ROC-AUC values and statistical data for the 13 LLMs across the three languages and three risks.

License

Code

The analysis code is released under the MIT License (see LICENSE).

Data

The dataset and results are released under the Creative Commons Attribution 4.0 International License (CC BY 4.0) (see LICENSE-data).

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP22K12293.
This content has been managed by Hiroki Tanioka (taniokah[at]gmail.com) since 2025.

Citation

If you use this repository, dataset, or evaluation framework, please cite:

Tanioka, H. (2025).
Towards Safe and Trustworthy Healthcare AI: Risk Assessment of Medical Dialogue Using LLMs.
Human-Centric Intelligent Systems.
https://doi.org/10.1007/s44230-025-00131-4