Victor's Thesis: 2007

Monday, November 5, 2007

Final Progress of CDAL

The final version of CDAL is version 2.6. The code is located on the computer at the RPAH, under "C:\Program Files\Apache Group\Apache2\cgi-bin\cdal26\". This folder is over 200MB because it also includes the SNOMED-CT server.

CDAL must be run at the RPAH, since it requires a connection to the CareVue Information System; anywhere else you will receive connection errors. The URL for CDAL is "127.0.0.1/cgi-bin/cdal26/cdal.py". At the interface, you can enter a question using CDAL's pre-defined syntax.

The web server is a bit trickier to explain. Whenever you change the source code, you must make the same change under the "..\htdoc\cdal26\" folder. In other words, the "htdoc\cdal26\" folder and the "cgi-bin\cdal26\" folder must be identical for any change to take effect.
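Since the two folders must stay identical, a small check can list anything that has drifted. The following is a minimal sketch using Python's standard filecmp module; the function name is hypothetical and the two folder paths are supplied by the caller. It is not part of CDAL itself.

```python
import filecmp
import os


def find_differences(dir_a, dir_b):
    """Return relative paths that differ between two directory trees."""
    differences = []

    def walk(cmp, prefix):
        # Entries present on only one side.
        for name in cmp.left_only + cmp.right_only:
            differences.append(os.path.join(prefix, name))
        # Files present on both sides that differ (filecmp's default
        # shallow comparison) or that could not be compared.
        for name in cmp.diff_files + cmp.funny_files:
            differences.append(os.path.join(prefix, name))
        # Recurse into sub-folders that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(dir_a, dir_b), "")
    return sorted(differences)
```

An empty list means the two folders match. Note that the shallow comparison trusts file size and modification time, so an edit that keeps both unchanged would be missed.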

Finally, Matlab must be installed on the machine (which it already is) in order to run hypothesis testing. All the hypothesis testing code is written in Matlab. Under "/hypothesis_testing", you must generate an .exe file for every method that you write, so that "ht_interface.py" can call it. To generate an .exe file, use the mcc command inside Matlab.
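For anyone extending this, calling a compiled executable from Python can look roughly like the sketch below. This is not the actual code of "ht_interface.py": the function name is hypothetical, passing the inputs as command-line arguments and reading the answer from standard output are assumptions, and subprocess.run requires a much newer Python than the one on the RPAH machine.

```python
import subprocess


def run_compiled_test(exe_path, args, timeout=60):
    """Run a compiled executable and return what it printed to stdout."""
    completed = subprocess.run(
        [exe_path] + [str(a) for a in args],  # arguments are passed as text
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()
```

Passing the command as a list (rather than one shell string) avoids quoting problems with paths that contain spaces, such as "C:\Program Files\...".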

For future work on CDAL, please refer to my thesis under "Future Work".

For any clarification or problem please contact me. Thanks.

Sunday, October 7, 2007

Hypothesis Testing

The current CDAL does not support hypothesis testing, so the task is to implement this function in CDAL. Two outside sources contain built-in functions for hypothesis testing: R and Matlab. Matlab is chosen for this task because it is the easier of the two to integrate.

Example: An interesting question for researchers is whether the heart rate of patients who are ventilated is higher than that of patients who are not ventilated.
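To make the example concrete, here is a minimal sketch of one test that could answer it: Welch's two-sample t-test. It is written in Python for illustration only (the CDAL implementation uses Matlab), the choice of Welch's test is an assumption, and the heart rates are made-up numbers, not patient data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's t-test on mean(sample_a) - mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, from the sample variances (n - 1 denominator).
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = diff / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Made-up heart rates (beats per minute), for illustration only.
ventilated = [88, 92, 95, 101, 97, 90]
not_ventilated = [78, 85, 80, 83, 88, 79]
t, df = welch_t(ventilated, not_ventilated)
```

For the one-sided question "is the ventilated mean higher?", a large positive t supports the claim. Turning t and df into a p-value needs the t distribution, which this sketch leaves out.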

Expected time-frame: 1 week (mid Wk 10-11)

Expected finished date: Wk 11 Thursday

Tuesday, October 2, 2007

Review of the Architecture of CDAL

The current architecture for CDAL is as follows:

The architecture of CDAL has been converted to be object oriented. Thus, what the user enters as the query will first be checked by the syntax parser (including David's SNOMED server for terminology correctness when implemented). Once checked, the query will be split by the semantic parser, which produces many different answer objects and condition objects (if any).

Both the answer objects and the condition objects can belong to different categories (we call each category an event).

The categories (with their corresponding numbers of attributes and definitions) are as follows:

Chart_events (total): 786
- Chart_events (numeric): 734 - All the numerical charted information for patients (E.g. heart rate, peep, cvp, etc.)
- Chart_events (categoric): 52 - All the categorical charted information for patients (E.g. ventilation mode, airway, etc.)

Medication_events: 52 - All the iv-drip-infusion (sedation and inotropes) information for patients (E.g. Propofol, Fentanyl, etc.)

Patient_events: 6 - All the basic demographic information for patients (E.g. medical record number, sex, etc.)

Lab_events: 63 - All the chemical information for patients (E.g. Chloride, Sodium, pH, etc.)

Group_events (total): 74 - All the group-of-variables pre-defined by the medical staff. Unlike the other event types, this returns more than a single attribute. For example, sedation will return all the propofol, fentanyl, etc. that the patient has taken.
- Sedation: 8
- Inotropes: 14
- Antibiotics: 46
- Thromboembolic_prophylaxis: 6

Total: 981 attributes

For example, for a condition, there can be a patient_event (age > 30), or a chart_event (heart rate > 60), or a medication_event (propofol > 1), etc. Note that a chart_event can either be numeric (heart rate > 60) or categoric (ventilation mode = PS). Furthermore, the conditions can be connected by logical operators (AND / OR).

Similarly, for an answer, there can be a patient_event (all values of mrn), or a chart_event (all values of heart rate), or a medication_event (all values of propofol). One extra is the inclusion of group_event, so the user can retrieve not just one attribute but a whole pre-defined grouping of attributes. For example, all values of sedation will return every member of the sedation group, including propofol, fentanyl, morphine, etc. Furthermore, each answer object contains its corresponding reference entity (all values, any value, last value) and statistical entity (mean, sd, max, min, range, mode, etc).
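As a rough illustration of the condition and answer objects described above, they could be modelled as two small classes. This is a sketch only: the class and field names are hypothetical and are not taken from the CDAL source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    event_type: str  # e.g. "patient_event", "chart_event", "medication_event"
    attribute: str   # e.g. "age", "heart rate", "propofol"
    operator: str    # e.g. ">", "=", "<>"
    value: str       # kept as text; numeric or categoric depending on the event


@dataclass
class Answer:
    event_type: str  # may also be "group_event" for an answer
    attribute: str
    reference: str = "all values"  # or "any value", "last value"
    statistics: List[str] = field(default_factory=list)  # e.g. ["mean", "sd"]


# The examples from the text, expressed as objects.
conditions = [
    Condition("patient_event", "age", ">", "30"),
    Condition("chart_event", "ventilation mode", "=", "PS"),
]
answer = Answer("chart_event", "heart rate", statistics=["mean", "sd"])
```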

The medical groupings (Sedation, Inotropes, Antibiotics and Thromboembolic prophylaxis) are defined by Angela from RPAH and are the same for the auto-population project and WRIS project.

After the semantic parsing, these condition and answer objects are passed to the SQL generator, which produces the corresponding atomised query tree. Basically, a large complex query is split into many separate simpler queries, and the individual answers are then joined to compute the final results. The performance issue to note is as follows:

pid is the patient identifier, an index used within the database. It is different from the medical record number that the staff use. To enhance performance, queries should be split according to pid for the archival database, and gprid (global patient record identifier) for the real-time database.

For the archival database, queries are about 2-3 times faster, and it is no longer necessary to wait more than 1 minute for any query. For the real-time database, the improvement is not significant.
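The idea of splitting by pid can be sketched as follows: run one small query per patient and join the individual answers. The example uses an in-memory SQLite database as a stand-in for the archival database; the table name, column names and rows are all made up, and the sketch shows only the shape of the technique, not the speed-up.

```python
import sqlite3


def fetch_by_pid(conn, pids, attribute):
    """Run one small query per pid and join the individual answers."""
    rows = []
    for pid in pids:
        cursor = conn.execute(
            "SELECT pid, value FROM chart_events "
            "WHERE pid = ? AND attribute = ?",
            (pid, attribute),
        )
        rows.extend(cursor.fetchall())
    return rows


# In-memory stand-in for the archival database, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chart_events (pid INTEGER, attribute TEXT, value REAL)")
conn.executemany(
    "INSERT INTO chart_events VALUES (?, ?, ?)",
    [(1, "heart rate", 72.0), (2, "heart rate", 95.0), (1, "peep", 5.0)],
)
result = fetch_by_pid(conn, [1, 2], "heart rate")
```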

After the SQL generator creates the query, it is passed to the database transceiver, which sends the queries to be executed by the DBMS software. The results (in an array) are then passed to the response generator, which creates the corresponding result objects. Again, this is an object-oriented approach. So each result has an attribute name (heart rate), a type (such as a chart_event, etc.) and its corresponding values, mrn, and chart-time.

These result objects (all stored in a single class called Results) are finally passed back to the interface, where the values are displayed (in David's interface).
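A minimal sketch of what the result objects and the Results container could look like is given below. Only the class name Results and the listed fields come from the description above; the method names, the row layout and the sample rows are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    attribute: str   # e.g. "heart rate"
    event_type: str  # e.g. "chart_event"
    value: float
    mrn: str         # medical record number
    chart_time: str  # kept as text in this sketch


@dataclass
class Results:
    items: List[Result] = field(default_factory=list)

    def add_rows(self, attribute, event_type, rows):
        """Wrap raw (mrn, chart_time, value) rows returned by the database."""
        for mrn, chart_time, value in rows:
            self.items.append(Result(attribute, event_type, value, mrn, chart_time))

    def values_for(self, attribute):
        """Return all stored values of one attribute, in insertion order."""
        return [r.value for r in self.items if r.attribute == attribute]


# Made-up rows, for illustration only.
results = Results()
results.add_rows(
    "heart rate",
    "chart_event",
    [("0001", "2007-10-01 08:00", 72.0), ("0001", "2007-10-01 09:00", 75.0)],
)
```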

That's the overall structure of the current version of CDAL. The prototype has now been completed.

One more thing that may be added (if time permits and if we have ideas) is the retrieval of freetext_event.

Tuesday, September 25, 2007

Plans for coming 5 weeks

1. Updating the built-in dictionary used by CDAL - from the list mentioned above (0.5 week: in semester break).
2. Initial Implementation - My work (database) and David's work (interface) need to be implemented together (0.5 week: in semester break).
3. Explore further areas in CDAL - data mining from free-text fields in database. Note that this is not expected to be implemented into CDAL, but only provide an idea in this area. (0.5 week: week 10)
4. Further Testing - At the moment, the CDAL prototype has only been informally tested. The tests performed so far are non-systematic and non-automated. The final CDAL version must be tested for completeness and soundness, using automated tests that follow a properly designed test model (1 week: week 10-11).
5. Final Implementation - My work (database) and David's work (SNOMED-CT) need to be implemented together (0.5 week: week 11)
6. Demo - Final version of CDAL needs to be presented to Jon and hospital staff (1 day: week 12).
7. Documentation - User manual and thesis need to be completed (1 week: week 12).

Work done Up to 25/9

SQL Generation

The CDAL prototype is now completed. This prototype is only connected to the ISM and the GICU real-time, and only a limited set of variables across the different tables can be retrieved.
However, all the major categories can now be extracted, including the following:

1. Patient event - All the basic demographic information for patients (E.g. medical record number, sex, etc.)
2. Chart event (Numerical) - All the numerical charted information for patients (E.g. heart rate, peep, cvp, etc.)
3. Chart event (Categorical) - All the categorical charted information for patients (E.g. ventilation mode, airway, etc.)
4. Medication event - All the iv-drip-infusion (sedation and inotropes) information for patients (E.g. Propofol, Fentanyl, etc.)
5. Laboratory event - All the chemical information for patients (E.g. Chloride, Sodium, pH, etc.)
6. Group event - All the group-of-variables pre-defined by the medical staff. Unlike the other event types, this returns more than a single attribute. For example, sedation will return all the propofol, fentanyl, etc. that the patient has taken.

Dictionaries

A list of database terms has been mapped to the terminology that doctors use, including all the corresponding synonyms and abbreviations. Please see attached. All terms on this list can now be extracted by the CDAL prototype.
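Such a mapping can be pictured as a lookup table from every synonym or abbreviation to one database term. The sketch below is illustrative only: the entries are made up (the real list is the attachment), and the function names are hypothetical.

```python
# Hypothetical entries: database term -> synonyms and abbreviations.
SYNONYMS = {
    "heart rate": ["hr", "pulse"],
    "medical record number": ["mrn"],
}


def build_lookup(synonyms):
    """Map every term, synonym and abbreviation to its database term."""
    lookup = {}
    for term, aliases in synonyms.items():
        for name in [term] + aliases:
            lookup[name.lower()] = term
    return lookup


def to_database_term(lookup, user_term):
    """Return the database term for what the user typed, or None if unknown."""
    return lookup.get(user_term.strip().lower())


lookup = build_lookup(SYNONYMS)
```

Returning None for an unknown term leaves room for a later fallback, such as asking a terminology server.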

Monday, September 17, 2007

Work done Up to 17/9

SQL Generation

The SQLGenerator has been implemented on both the real-time and the archival databases. This prototype now allows the user to make any query involving any attribute that is defined by the underlying dictionaries. The results extracted from both databases are then combined and shown to the user in a text-based interface.

In addition, one more event type called "Group Event" has been defined. Unlike the other event types (such as Chart event, patient event, etc.), which each return a single attribute, the group event type returns an "aggregated result", meaning that a group of pre-defined attributes is returned to the user.

The most common group events are sedation and inotropes, and a typical clinical question that a physician may ask is: "For each patient in the GICU, find the time and dosage of all the sedation that the patient had taken during the last 24 hrs."

The SQLGenerator will then output in the format:
[patient's] [sedation] [dosage] [chart-time]
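The group event can be sketched as a simple expansion step: the group name is replaced by its member attributes before the query is generated, and each returned row is printed in the format above. The member list here is partial (only the drugs named in this blog), and the function names and sample row are hypothetical.

```python
# Partial, illustrative member list; the real sedation group is larger.
GROUPS = {
    "sedation": ["propofol", "fentanyl", "morphine"],
}


def expand_group(name, groups=GROUPS):
    """Return the member attributes of a group, or the name itself if it is not a group."""
    return groups.get(name.lower(), [name])


def format_rows(rows):
    """Format (patient, drug, dosage, chart_time) rows, one line per row."""
    return ["[{}] [{}] [{}] [{}]".format(*row) for row in rows]


# Made-up row, for illustration only.
lines = format_rows([("0001", "propofol", 10, "2007-09-17 08:00")])
```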

User Interface
The current user interface should contain the following features:
1. Automatically update the query as the user types and makes selection (with the use of AJAX).
2. Check whether the variable names that the user entered are found by the SNOMED-CT server.
3. Map the terms entered by the user to the underlying database terms.
4. Trace a variable using the SNOMED-CT server in the case that a variable name is not defined.
5. Display the query result in tabular format.

At the moment, feature 1 is completed. Features 2 - 5 are in progress. Feature 2 is currently implemented with a dictionary replacing the SNOMED-CT server.

Monday, August 27, 2007

Work done Up to 28/8

SQL Generation
The SQLGenerator has been extended to include the following categories in the answer and condition part of the query:
1. Patient event - All the basic demographic information for patients (E.g. medical record number, sex, etc.)
2. Chart event (Numerical) - All the numerical charted information for patients (E.g. heart rate, peep, cvp, etc.)
3. Chart event (Categorical) - All the categorical charted information for patients (E.g. ventilation mode, airway, etc.)
4. Medication event (drip) - All the iv-drip-infusion (sedation and inotropes) information for patients (E.g. Propofol, Fentanyl, etc.)
5. Medication event (dose) - All the dosage (antibiotics and thromboembolic prophylaxis) information for patients (E.g. Panadol, etc.)
6. Laboratory event - All the chemical information for patients (E.g. Chloride, Sodium, pH, etc.)
7. Output event - All the output information for patients (E.g. urine, etc.)

At the moment, categories 1,2,3,4 are completed. Categories 5,6,7 are in progress.

User Interface
The user interface has been extended to include the following features:
1. Automatically update the query as the user types and makes selection (with the use of AJAX).
2. Check whether the variable names that the user entered are found in the dictionary.
3. Display the query result in the format selected by the user (E.g. Table, List, etc.)

At the moment, features 1,2 are completed. Feature 3 is in progress. Feature 2 will later be implemented with the SNOMED-CT server to replace the dictionary.

Saturday, August 18, 2007

Work done Up to 19/8

SQL Generation
The SQL Generator has been extended to handle the following:
1. Multiple conditions are processed by splitting them at the logical operators (and/or) into sub-queries, which are run separately and then combined into a single set of results.
2. Data can be retrieved from the archive database.
3. Categorical and numerical entities are distinguished and handled accordingly.
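The combining step for multiple conditions can be sketched with sets: each sub-query yields the set of patients that satisfy one condition, and the logical operators decide how the sets are merged. This is a sketch under two assumptions: results are identified by patient id, and the operators are applied strictly left to right with no precedence between AND and OR.

```python
def combine(result_sets, operators):
    """Combine per-condition sets of patient ids, applying operators left to right."""
    if len(operators) != len(result_sets) - 1:
        raise ValueError("need exactly one operator between each pair of sets")
    combined = set(result_sets[0])
    for op, ids in zip(operators, result_sets[1:]):
        if op.lower() == "and":
            combined &= set(ids)  # keep patients that satisfy both
        elif op.lower() == "or":
            combined |= set(ids)  # keep patients that satisfy either
        else:
            raise ValueError("unknown logical operator: %r" % op)
    return combined
```

For example, combine([{1, 2, 3}, {2, 3}, {9}], ["and", "or"]) first intersects the first two sets and then adds the third.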

User Interface
The user interface has been extended to allow:
1. Selecting patients by demographic information and historical information.

Other Features
The following statistical functions can now be included in CDAL:
1. mean
2. sd (standard deviation)
3. median
4. max (maximum)
5. min (minimum)
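These five functions map directly onto Python's standard library, as the sketch below shows. Whether CDAL computes the sample or the population standard deviation is not stated here; the sketch assumes the sample version, and the names STAT_FUNCTIONS and summarise are hypothetical.

```python
import statistics

# Statistical entity name -> function applied to the retrieved values.
STAT_FUNCTIONS = {
    "mean": statistics.mean,
    "sd": statistics.stdev,  # sample standard deviation (n - 1 denominator)
    "median": statistics.median,
    "max": max,
    "min": min,
}


def summarise(values, names):
    """Apply the requested statistical functions to a list of values."""
    return {name: STAT_FUNCTIONS[name](values) for name in names}
```

A query such as "FIND mean and sd OF heart rate" would then amount to calling summarise(values, ["mean", "sd"]) on the retrieved heart rates.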

Thursday, August 2, 2007

Input User Interface

The current user interface uses combo-boxes and text-boxes to help the user to correctly (according to the syntax) enter the query question.

The stages / milestones for this module include the following:
1. Correctly transform the question from combo-boxes and text-boxes into a single language (CDAL).
2. Keep track of the question as it is being entered.
3. Perform validation checking on the syntax.
4. Perform validation checking on the medical terms used. This requires the use of SNOMED server.

At the moment, stage 1 has been completed. Stage 2 is under research (implementing AJAX).

Parser

Since the existing parser (version 1) only deals with 2 tables (ptevent and pteventclass), the code must be rewritten / extended to support multiple tables. This means that each of the 8 modules needs to be changed accordingly. This requires the following stages / milestones:

1. The current system (version 2) can now enable the user to type in the query question from the interface module.
2. The syntactic parser then splits the question into its component parts, and generates separate objects (from patient_class, answer_class and condition_class) accordingly.
3. These objects are then passed to the SQL generator, which is responsible for converting the question into the corresponding SQL for the underlying databases. Note that there are 2 sub-tasks here: the realtime database and the archive database.
4. The results are then passed to the Response Generator, which is responsible for displaying the query results.

At the moment, stage 2 is completed. Stage 1 is under review (see below). Stage 3 is in progress.

Syntax of CDAL

The syntax of CDAL has been extended to consist of the following elements:
1. USING TOC: where TOC = [SNOMED]
2. IN [database_ent]+: where database_ent = [GICU-DB, NSICU-DB, CICU-DB, ISM-DB, all DB]+
3. FIND [reference_ent statistical_ent]+ : where reference_ent = [all values, any value]; statistical_ent = [mean, median, sd, max, min]
4. OF medical_ent
5. FOR Patient_Class: where Patient_Class is in the form: PATIENTS WHOSE [demographics_ent operator value]*
6. WITH [medical_ent operator value]* [joint_ent medical_ent operator value]*
7. DURING [time_ent]*
8. IN [location_ent]*

Example: The following question is valid:
1. USING SNOMED
2. IN GICU-DB, ISM-DB
3. FIND all values and mean and sd
4. OF heart rate
5. FOR patients whose age > 40 and sex = male
6. WITH ventilation mode = PS AND PEEP <> 20
7. DURING the last 72 hours
8. IN GICU
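The first step of parsing such a question, splitting it into its clauses, can be sketched as below. This is not the CDAL parser itself: the function name and the "IN_LOCATION" key are hypothetical, and the sketch assumes one clause per line with the leading keyword as its first word.

```python
import re

CLAUSE_KEYWORDS = {"USING", "IN", "FIND", "OF", "FOR", "WITH", "DURING"}


def split_clauses(lines):
    """Split clause lines into a dict keyed by the leading keyword."""
    clauses = {}
    for line in lines:
        text = re.sub(r"^\s*\d+\.\s*", "", line).strip()  # drop a "1. " prefix
        if not text:
            continue
        keyword, _, rest = text.partition(" ")
        keyword = keyword.upper()
        if keyword not in CLAUSE_KEYWORDS:
            raise ValueError("unknown clause keyword: %r" % keyword)
        if keyword == "IN" and "IN" in clauses:
            keyword = "IN_LOCATION"  # the second IN names the location
        clauses[keyword] = rest.strip()
    return clauses


question = [
    "1. USING SNOMED",
    "2. IN GICU-DB, ISM-DB",
    "3. FIND all values and mean and sd",
    "4. OF heart rate",
    "5. FOR patients whose age > 40 and sex = male",
    "6. WITH ventilation mode = PS AND PEEP <> 20",
    "7. DURING the last 72 hours",
    "8. IN GICU",
]
clauses = split_clauses(question)
```

Because IN is used for both the database and the location, this sketch tells them apart by order only; a question that omits the database clause would need a stricter rule.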

Thursday, July 26, 2007

Project Plan Submitted

Project plan has been completed and submitted

Work Summary

It has been decided by the team that the existing parser needs to be extended, by allowing the user to do the following:
1. select more than 1 attribute in a single query
2. use algebraic entities such as '>', '<', '=', etc.
3. use disjunction 'OR'
4. use SNOMED terminology

Work done on 26/7:
* Done Syntactic parser: split question into meaningful parts such as conditions, TOC, database selected, time selected, etc.
* Done the condition part for the Semantic parser: split the conditions into an array of conditions.
* Done the input User Interface: allow user to enter a question using combo-boxes and text-boxes.
* Done the code for the SNOMED dictionary: converting SNOMED terms to database terms
* Planned for work on next week: answer part for the Semantic parser, SQL Query Generator
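Splitting the conditions into an array, as done for the semantic parser, can be sketched with two regular expressions: one to cut the text at AND / OR, and one to pull each condition apart into attribute, operator and value. The names are hypothetical, and the operators '>=' and '<=' are an assumption on top of the ones listed above.

```python
import re

# attribute, operator, value; longer operators are listed before shorter ones.
CONDITION = re.compile(r"^\s*(.+?)\s*(<>|>=|<=|=|>|<)\s*(.+?)\s*$")


def parse_conditions(text):
    """Return (conditions, connectors) for the condition part of a question."""
    parts = re.split(r"\s+(AND|OR)\s+", text, flags=re.IGNORECASE)
    conditions, connectors = [], []
    for index, part in enumerate(parts):
        if index % 2 == 1:  # odd positions hold the captured AND / OR
            connectors.append(part.upper())
            continue
        match = CONDITION.match(part)
        if match is None:
            raise ValueError("cannot parse condition: %r" % part)
        conditions.append(match.groups())
    return conditions, connectors
```

One limitation of this sketch is that a value containing the word "and" or "or" surrounded by spaces would be cut in the wrong place.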

Wednesday, July 25, 2007

Weekly Schedule

Each project team member (myself and David) is expected to work 20+ hours each week on the project, made up of the following time-slots:
Monday: 9-1pm (RPAH, David)
Monday: 2-5pm (RPAH, team+)
Tuesday: 2-5pm (SIT, team+)
Wednesday: 2-5pm (SIT, Victor)
Thursday: 9-5pm (RPAH, team+)
Weekend: 5+ hrs (Home, individual)
+ team refers to both Victor and David working together at the same time

Sunday, July 8, 2007

Project's Aim and Deliverables

This project aims to define and develop a general purpose analytical language for use in clinical information systems using a restricted subset of natural language.

Previous work: This project has progressed as far as designing the first version of the CDAL and implementing a subset on the CareVue ICU system at the RPAH.

The project for this semester is split into a few major components or deliverables, as listed below:
1. Expand the system for mapping the CDAL components to the underlying CareVue schema.
2. Clarify the semantics of the CDAL.
3. Expand the scope of the CDAL.
4. Introduce access to the SNOMED CT terminology as part of the CDAL.

The project team consists of the following members:

Jon Patrick (USYD Professor)
Yuzhong Cheng (Semester 1 Student)
Victor Chan (Semester 2 Student)
David Ding (Semester 2 Student)
Robert Herkes (RPAH Staff)
Angela Ryan (RPAH Staff)

Sunday, May 27, 2007

Welcome to Victor's Thesis

Text and Data Mining of Clinical Data (supervised by Jon Patrick and Irena Koprinska)

Intensive Care Unit Data - Royal Prince Alfred Hospital & Children’s Hospital Westmead

Thesis' Description:

The ICU at the Royal Prince Alfred Hospital has been collecting all its clinical data for the last ten years in electronic form. This includes all the bedside instrumentation, laboratory reports, clinical notes and bedside observations. The staff wish to understand how their outcomes of care have changed over time as their clinical practice has changed with the advent of new care strategies, for example whether their control of blood sugar levels in diabetic patients has improved after changing medication strategies. The investigation will have a longitudinal time dimension in the machine learning and data mining strategies. This work involves collecting data from their databases, anonymising and cleaning it, and then applying time-based data mining strategies. This project provides an opportunity to conduct cutting edge research on real data and to engage with leading health care professionals.
