Analysis of laboratory tests at Huddinge hospital (STO)

Application domain: Analysis of laboratory tests at Huddinge hospital
Source: M.D. Rutger Bentz, Huddinge hospital, Sweden
Dataset size: 10.2 MB
Data format: in preparation for use with SPECTRE
Pointers: Dr. Lars Asker (

The data (STO)

In an attempt to automatically generate comments from laboratory test results, we have started a collaboration with physicians at the clinical chemistry laboratory at Huddinge hospital in Stockholm, Sweden. We have started analysing three datasets relevant to this task. The first is a file containing patient information such as age, sex, and links to the patient's test results. It covers more than 17,000 patients who were tested at the hospital laboratory during 1996 and 1997. The second file contains the corresponding results for up to 13 laboratory tests performed for each patient, and the third file contains the comments given by the laboratory physicians for each patient and set of laboratory test results.
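As a rough illustration of how the three files relate, the following sketch links patient records, test results, and comments through a shared patient identifier. The field names, the `patient_id` key, and the row layouts are assumptions made for this example; the actual file formats are not described here. For simplicity it also assumes one set of test results and one comment per patient.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MAX_TESTS = 13  # the description mentions up to 13 laboratory tests per patient


@dataclass
class Patient:
    patient_id: str  # hypothetical link key between the three files
    age: int
    sex: str


@dataclass
class Case:
    patient: Patient
    tests: Dict[str, float] = field(default_factory=dict)
    comment: Optional[str] = None


def link_records(
    patients: List[Patient],
    test_rows: List[Tuple[str, str, float]],  # (patient_id, test_name, value)
    comment_rows: List[Tuple[str, str]],      # (patient_id, comment_text)
) -> Dict[str, Case]:
    """Join the three (assumed) tables into one Case per patient."""
    cases = {p.patient_id: Case(patient=p) for p in patients}
    for patient_id, test_name, value in test_rows:
        case = cases.get(patient_id)
        if case is None:
            continue  # skip results that have no matching patient record
        case.tests[test_name] = value
        if len(case.tests) > MAX_TESTS:
            raise ValueError(f"more than {MAX_TESTS} tests for {patient_id}")
    for patient_id, text in comment_rows:
        if patient_id in cases:
            cases[patient_id].comment = text
    return cases
```

Each resulting `Case` then holds one learning example: patient information and test results as input, and the physician's comment as the target.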

The dataset is interesting from several points of view.

First, the major part of all comments consists of a relatively limited set of standardized sentences. These sentences occur so frequently that, to reduce the workload of the laboratory physicians, the laboratory has introduced a collection of 106 codes, one for each such standard sentence. This indicates that automatic generation of such comments could be possible.
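The coding scheme can be pictured as a lookup table from codes to standard sentences. The sketch below is only an illustration: the code names and the sentences are invented placeholders, since the laboratory's actual 106 codes are not listed here.

```python
from typing import Dict, List

# Invented placeholder entries; the real table would hold the 106 laboratory codes.
STANDARD_SENTENCES: Dict[str, str] = {
    "C001": "Placeholder standard sentence one.",
    "C002": "Placeholder standard sentence two.",
}


def expand_codes(codes: List[str], table: Dict[str, str] = STANDARD_SENTENCES) -> str:
    """Expand a sequence of codes into the corresponding comment text."""
    missing = [c for c in codes if c not in table]
    if missing:
        raise KeyError(f"unknown comment codes: {missing}")
    return " ".join(table[c] for c in codes)
```

Under this view, generating a comment reduces in the simplest case to predicting which codes apply to a given set of test results.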

Second, the physicians most often use the codes, but also edit the resulting text to some extent to give the comments a more natural structure. This makes the problem interesting from an NLP point of view as well, since a grammar and a limited lexicon will most likely be necessary for interpreting comments (in the training phase) and generating them (in the evaluation phase).

Third, many comments refer to specific values of the test results. For example, "M-component, IgG-lambda, mostly unchanged concentration of 12-14 g/L, since 960916". The example sentence contains the values 12-14 and 960916, which are taken directly from the input test results. It would be reasonable to expect the same comment to occur for another patient with, say, a concentration of 11-13 and a different test date. The learned rules should reflect this and hence allow for variables that can be instantiated at execution time. This makes the application suitable for techniques like ILP.
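The idea of a comment with variables can be sketched as a template whose slots are filled from the test results. This is only an illustration in Python, not the ILP representation itself; the slot names `low`, `high`, and `date`, and the regular expression used to recover them from an existing comment, are assumptions made for this example.

```python
import re
from string import Template
from typing import Dict

# The example comment with its concrete values replaced by variables.
COMMENT_TEMPLATE = Template(
    "M-component, IgG-lambda, mostly unchanged concentration of "
    "$low-$high g/L, since $date"
)

# Hypothetical pattern for recovering the variable bindings from a comment.
VALUE_PATTERN = re.compile(
    r"concentration of (?P<low>\d+)-(?P<high>\d+) g/L, since (?P<date>\d{6})"
)


def instantiate(low: int, high: int, date: str) -> str:
    """Fill the template with values taken from the test results."""
    return COMMENT_TEMPLATE.substitute(low=low, high=high, date=date)


def extract_bindings(comment: str) -> Dict[str, str]:
    """Return the variable bindings found in a comment, or {} if none match."""
    match = VALUE_PATTERN.search(comment)
    return match.groupdict() if match else {}
```

Calling `instantiate(12, 14, "960916")` reproduces the example sentence, while `instantiate(11, 13, ...)` with another date yields the same comment for a different patient.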

Fourth, the actual set of comments depends on a complex relationship between several of the test results as well as specific patient information. This makes it a difficult problem that requires state-of-the-art machine learning techniques to solve.
