How to use RM for this Paper

DocMusher Member Posts: 249   Unicorn
edited November 2018 in Help
Dear RM community,
Is somebody able to help me get a bit closer? I know data mining approaches sometimes differ from the way a researcher needs to present his results. This paper uses data from the MIMIC II database, which is a clinical database with 40000 ICU patients (https://mimic.physionet.org/). I think the authors have done a nice job, and I would like to use this approach for the analysis of other attributes. My data is preprocessed, but I can't find how to use a variance inflation factor or the lowess smoothing technique, or how to have the odds ratios calculated and presented in the results.
Hoping someone can help me.
Cheers
Sven

This article is the subject of my question:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0095204

In the methods I read: Continuous variables were tested for normality by using Kolmogorov–Smirnov test. Data of normal distribution were expressed as mean±SD and compared using t test. Otherwise, Wilcoxon rank-sum test was used for comparison. Categorical variables were expressed as percentage and compared using Chi square test or Fisher's exact test as appropriate. ICU mortality was used as the study endpoint. To exclude confounding factors that may influence the association of iCa and mortality, logistic regression model was used to adjust for the odds ratios (OR). We built two models separately for Ca0 and Camean during ICU stay. The full model included all variables listed in Table 1.[8] Covariate selection was performed by using stepwise forward selection and backward elimination technique, with Ca0 and Camean remaining in the model. The significance level for selection was predefined as 0.15 and that for elimination was 0.2. After this step the main effect model was built. Lowess smooth technique was used to examine the relationship between iCa and mortality in logit.[9] To facilitate clinical interpretation of our results and to meet the interests of subject-matter audience, we planned to use linear spline function for model building.[10] The knots were chosen according to conventional classification of iCa ranges: relative to the normal range of 1.15–1.25 mmol/L, we defined hypocalcemia as mild, moderate and severe as 0.9–1.15, 0.8–0.9 and <0.8 mmol/L, respectively. Hypercalcemia was divided into mild, moderate and severe as 1.25–1.35, 1.35–1.45 and >1.45 mmol/L, respectively.[11], [12] Potential multicollinearity between covariates in the model were quantified by using variance inflation factor (VIF) which provided an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.[13] As a common rule of thumb, a VIF>5 was considered for the existence of multicollinearity. 
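
The VIF step quoted above can be sketched outside RapidMiner. This is a minimal sketch in plain Python, assuming the covariates are available as equal-length numeric columns; the helper names (`correlation_matrix`, `invert`, `vif`) are made up for illustration and come neither from the paper nor from RapidMiner. It relies on the identity that VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal element of the inverse of the covariates' correlation matrix.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    if any(norm == 0 for norm in norms):
        raise ValueError("a constant column has no defined correlation")
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: covariates are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """One VIF per column: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Made-up toy columns, only to show the call; x3 is close to x1 + x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9]
for name, value in zip(["x1", "x2", "x3"], vif([x1, x2, x3])):
    # The threshold of 5 is the rule of thumb quoted from the paper.
    print(name, round(value, 2), "collinear" if value > 5 else "ok")
```

With the paper's rule of thumb, any covariate whose value exceeds 5 would be flagged. If statsmodels is available, its `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` is probably the more convenient route.
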
Furthermore, iCa was categorized into intervals and incorporated into regression models as design variable. Design variable, also known as dummy variable, is one that takes the value of 0 or 1 to indicate the presence or absence of some categorical effect that is expected to shift the outcome. It is frequent used for categorical variables with more than two categories. Normal range between 1.15 and 1.25 mmol/l was used as reference and ORs were reported for other intervals. Receiver operating characteristic curve (ROC) was depicted to show the diagnostic performance of fitted logistic regression models.
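
The design-variable step can be sketched as follows. This is only an illustration in plain Python: the cut points are the ones quoted above, but the label names, the function names and the choice to put a value that sits exactly on a cut point into the upper interval are assumptions. The odds ratios computed here are crude (unadjusted) ratios from counts, not the covariate-adjusted ORs the paper obtains from its logistic regression.

```python
from bisect import bisect_right
from collections import Counter

# Cut points in mmol/L, taken from the quoted methods text.
CUTS = [0.8, 0.9, 1.15, 1.25, 1.35, 1.45]
# Label names are hypothetical; the paper only names the severity grades.
LABELS = ["severe_hypo", "moderate_hypo", "mild_hypo", "normal",
          "mild_hyper", "moderate_hyper", "severe_hyper"]
REFERENCE = "normal"


def ica_category(ica):
    """Map an iCa value to its interval (boundary values go to the upper interval, an assumption)."""
    return LABELS[bisect_right(CUTS, ica)]


def design_row(ica):
    """0/1 design (dummy) variables for every non-reference interval."""
    category = ica_category(ica)
    return {label: int(label == category) for label in LABELS if label != REFERENCE}


def crude_odds_ratios(records):
    """records: iterable of (ica, died) pairs with died in {0, 1}.

    Returns the unadjusted odds ratio of death for each interval
    relative to the reference interval.
    """
    deaths, survivors = Counter(), Counter()
    for ica, died in records:
        (deaths if died else survivors)[ica_category(ica)] += 1
    result = {}
    for label in LABELS:
        if label == REFERENCE:
            continue
        numerator = deaths[label] * survivors[REFERENCE]
        denominator = survivors[label] * deaths[REFERENCE]
        # An empty cell makes the ratio undefined; report None instead of guessing.
        result[label] = numerator / denominator if denominator else None
    return result
```

Feeding the rows from `design_row` into a logistic regression together with the other covariates, and exponentiating the fitted coefficients, would give adjusted odds ratios of the kind the paper reports.
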

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,134  RM Data Scientist
    Hi Sven,

    I read the paper and it is really a different way of thinking.
    Let me sum up: what the author does is create features and then run a logistic regression with and without one attribute. I see nowhere which validation he uses. Without a validation, this approach is simply wrong.
    I guess I need to think about this a bit more. It is definitely not a predictive task.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,134  RM Data Scientist
    Okay, I have now read the article three times and it is kind of hard to understand. I guess their p-values are kind of tricky because they ignore covariance matrices (which are hard to calculate, I admit).

    So the key question is: does mortality depend on the iCa level?
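
One way to look at this question before fitting any model is the lowess smooth on the logit scale that the paper's methods mention. The sketch below is a simplified single-pass version in plain Python (tricube weights, local linear fit, no robustness iterations); the function names and the `frac` default are assumptions, and it is not the authors' implementation.

```python
import math


def tricube(u):
    """Tricube kernel weight for a scaled distance u."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess(x, y, frac=0.5):
    """Locally weighted linear regression evaluated at every x value."""
    n = len(x)
    k = min(n, max(2, int(math.ceil(frac * n))))  # neighbours per local fit
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest point (x0 itself counts).
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [tricube((xi - x0) / h) for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def smoothed_logit(ica_values, died, frac=0.5):
    """Smooth the 0/1 outcome against iCa, then move to the logit scale."""
    return [logit(p) for p in lowess(ica_values, died, frac)]
```

Plotting the output of `smoothed_logit` against iCa shows whether the relationship looks linear, U-shaped or piecewise, which is what motivates the spline knots in the paper. statsmodels ships a full lowess implementation (`statsmodels.nonparametric.smoothers_lowess.lowess`) that would usually be preferable to this sketch.
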

    What a data scientist might do is run two analyses, one with and one without iCa. Afterwards one can compare the ROC (AUC) of the two using a cross-validation (X-Val) and a t-test. Then we can answer the question "Does it help to know the iCa to predict mortality?", which might be related to the question above. Sadly this all depends on preprocessing etc., so I do not know how much to trust those p-values.
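
The comparison step can be sketched like this. The code assumes the per-fold predictions already exist (from RapidMiner's cross-validation or any other tool) and only covers the two statistics involved: the AUC of one fold and a paired t statistic over the fold-wise AUCs of the two models. Function names are made up for illustration. Because cross-validation folds share training data, the independence assumption behind the t-test is only approximately met, so the result is better read as a rough indication.

```python
import math
from statistics import mean, stdev


def auc(scores, labels):
    """Area under the ROC curve as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_t(auc_with, auc_without):
    """Paired t statistic and degrees of freedom for fold-wise AUC differences."""
    diffs = [a - b for a, b in zip(auc_with, auc_without)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two folds")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

The returned t value and degrees of freedom can be looked up in a t table, or passed to a statistics package, to obtain the p-value.
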

    I am also not sure this paper is a good data mining task. If the question were "how high is the mortality for this person?", then RapidMiner would be the way to go. This sounds more like traditional statistics combined with multivariate methods to get more significance.
  • DocMusher Member Posts: 249   Unicorn
    The database comes with scores related to severity of disease. The authors analyse whether an additional attribute, here calcium, could increase predictability of survival. As such, this is valuable. If I could use RM for such a task, other medical parameters could be tested for possible additional value.
    Any comments?
    Sven
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,134  RM Data Scientist
    Hi again,

    I thought a bit deeper. You can of course define a standard analysis to predict mortality. Afterwards you can use a better technique with more information (like iCa) and see whether it gets better. Usually the error of the cross-validation should worsen your p-values. I think this was forgotten in the mentioned paper.
    All those p-values are calculated like P(Better than before | Condition Preprocessing, Condition Learner, ...), so I am not sure how useful they are.

    What might be way more useful is to use the model to advise. You can design a model that calculates mortality from all given measurements. Then you get a function you can minimize for your patient. The confusing point for me is that the advice would be "lower the blood pressure" and not how to do it. But this might be way more beneficial.

    Martin
  • DocMusher Member Posts: 249   Unicorn
    There is no value for an individual patient here, only the question of whether some attributes at time x or y could add information on outcome. Like I mentioned before, this database could demonstrate the value of RM. Shouldn't we consider a hackathon-type approach?
    Sven
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,134  RM Data Scientist
    There is nothing in the paper for the individual person, but the database has it, right?
  • DocMusher Member Posts: 249   Unicorn
    The database consists of 40000 patient admissions to intensive care between 2001 and 2008, with all data (lab, text, ECG tracings). It is unique in its kind. To get full access you need to pass an NIH examination resulting in a DUA, which I have.
    Cheers
    Sven
  • DocMusher Member Posts: 249   Unicorn
    The same authors also published these papers using MIMIC II: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120171/ and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4378844/. I would really like to be able to use a similar approach, but with RM.
    Cheers
    Sven
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,134  RM Data Scientist
    So is there individual data stored in the DB or not?

    And do you have a RapidMiner process for reading in the data once I have credentials?
  • DocMusher Member Posts: 249   Unicorn
    The database consists of patients with a unique subject_id. The database can be downloaded from http://physionet.org/mimic2/mimic2_clinical_overview.shtml
    I think full access is only possible if medically certified.
    Some more info, database description: http://mimic.physionet.org/UserGuide/node18.html
    Once you have access, you get:
    You may download the database from this page, or you may explore it online (see MIMIC II Explorer below). The flat files should be compatible with PostgreSQL version 8.4.8 or later.
    What's New (Changes from 2.5):
    Added Patients:
    5,880 new subjects
    6,556 new hospital admissions
    8,058 new ICU admissions
    Added Data Types:
    Demographics: religion, ethnicity, marital status, insurance type, admission source
    Procedure (CPT) codes
    Diagnosis-related groups (DRGs)
    Elixhauser comorbidity scores
    Microbiology test results
    LOINC coding for lab tests
    Note that in previous releases, timestamps were a mixture of standard and daylight savings times. Starting with version 2.6, timestamps are uniformly expressed in EST (Eastern Standard Time), so that the interval between any two timestamps in a given record is simply the difference between them, even if a daylight savings time change occurred during the interval.
    Added documentation:
    MIMIC II SQL Cookbook: a collection of about 20 "recipes" for useful queries, including calculation of Elixhauser comorbidity scores from DRGs and ICD-9 codes (contributed by Joon Lee).
    Virtual Machine:
    We also provide a virtual machine hosting a complete copy of the MIMIC II database. The virtual machine image contains a bootable Linux system which has been pre-configured to download and import the MIMIC II database. It is particularly suited to researchers who would like to perform intensive processing of the data and require more flexible access than that provided by the MIMIC II Explorer (Query Builder).
    Downloads:
    All downloads are in the form of gzip-compressed tar archives ("tarballs"). See "How can I unpack a .tar.gz archive?" in the PhysioNet FAQ if you are unfamiliar with this format. The individual flat files, once unpacked, are in CSV format; within each line (table row), fields (columns) are separated by commas, and text strings are surrounded by double quotes. A Linux script is available for downloading all the files from the command line using the wget command.
    Definitions: The definition tables contain information needed to interpret elements of the subject-specific data tables (As well as a folder regarding the database schema in PostgreSQL syntax). They consist of 11 files that can be extracted from mimic2cdb-2.6-Definitions.tar.gz.
    Subject-specific data: All data for a given patient are contained in a set of 33 flat files for that patient. The data archives contain the flat files for about 1000 subjects each. These archives are typically 75-90 Mb each, and expand when decompressed to roughly ten times their size. The decompressed flat files occupy about 31 GB in all.
    mimic2cdb-2.6-00.tar.gz (00001-00999)
    mimic2cdb-2.6-01.tar.gz (01000-01999)
    mimic2cdb-2.6-02.tar.gz (02000-02999)
    mimic2cdb-2.6-03.tar.gz (03000-03999)
    mimic2cdb-2.6-04.tar.gz (04000-04999)
    mimic2cdb-2.6-05.tar.gz (05000-05999)
    mimic2cdb-2.6-06.tar.gz (06000-06999)
    mimic2cdb-2.6-07.tar.gz (07000-07999)
    mimic2cdb-2.6-08.tar.gz (08000-08999)
    mimic2cdb-2.6-09.tar.gz (09000-09999)
    mimic2cdb-2.6-10.tar.gz (10000-10999)
    mimic2cdb-2.6-11.tar.gz (11000-11999)
    mimic2cdb-2.6-12.tar.gz (12000-12999)
    mimic2cdb-2.6-13.tar.gz (13000-13999)
    mimic2cdb-2.6-14.tar.gz (14000-14999)
    mimic2cdb-2.6-15.tar.gz (15000-15999)
    mimic2cdb-2.6-16.tar.gz (16000-16999)
    mimic2cdb-2.6-17.tar.gz (17000-17999)
    mimic2cdb-2.6-18.tar.gz (18000-18999)
    mimic2cdb-2.6-19.tar.gz (19000-19999)
    mimic2cdb-2.6-20.tar.gz (20000-20999)
    mimic2cdb-2.6-21.tar.gz (21000-21999)
    mimic2cdb-2.6-22.tar.gz (22000-22999)
    mimic2cdb-2.6-23.tar.gz (23000-23999)
    mimic2cdb-2.6-24.tar.gz (24000-24999)
    mimic2cdb-2.6-25.tar.gz (25000-25999)
    mimic2cdb-2.6-26.tar.gz (26000-26999)
    mimic2cdb-2.6-27.tar.gz (27000-27999)
    mimic2cdb-2.6-28.tar.gz (28000-28999)
    mimic2cdb-2.6-29.tar.gz (29000-29999)
    mimic2cdb-2.6-30.tar.gz (30000-30999)
    mimic2cdb-2.6-31.tar.gz (31000-31999)
    mimic2cdb-2.6-32.tar.gz (32000-32809)
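
The naming pattern in the list above (one archive per block of 1000 subject IDs) can be expressed in a few lines. This is a small sketch with a made-up function name; it only reproduces the pattern visible in the list and does not check that a given subject actually exists in the archive.

```python
def archive_for_subject(subject_id):
    """Name of the 2.6 archive whose ID range covers the given subject_id."""
    # The listed archives run from subject 00001 to subject 32809.
    if not 1 <= subject_id <= 32809:
        raise ValueError("subject_id outside the range covered by the listed archives")
    return "mimic2cdb-2.6-{:02d}.tar.gz".format(subject_id // 1000)
```

This helps when only a handful of subjects are needed and downloading all archives would be wasteful.
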
    The MIMIC Importer: Software for automatically creating a PostgreSQL database from the flat files above is available. Download and unpack MIMIC-Importer-2.6.tar.gz first, then download the definitions and subject-specific tarballs into the MIMIC-Importer-2.6 directory created by unpacking the MIMIC Importer tarball. Detailed instructions for using the software are available (a copy of the README included in the tarball). (Note: MIMIC II user Andrea Bravi has developed a Python version of the MIMIC Importer that Windows users may find simpler to run; find it at Andrea's GitHub page.)
    Definition tables and maps
    The definition tables are:
    D_CAREGIVERS
    D_CAREUNITS
    D_CHARTITEMS
    D_CODEDITEMS
    D_CHARTITEMS_DETAIL
    D_IOITEMS
    D_LABITEMS
    D_DEMOGRAPHICITEMS
    D_MEDITEMS
    D_PARAMMAP_ITEMS
    PARAMETER_MAPPING
    * D_WAVEFORM_SIG
    * The D_WAVEFORM_SIG definitions table is not used in this release.
    Subject data tables
    The data archives unpack into directories for each subject. Each subject's directory contains 32 tables (flat files):
    A_CHARTDURATIONS
    ADDITIVES
    ADMISSIONS
    A_IODURATIONS
    A_MEDDURATIONS
    CENSUSEVENTS
    CHARTEVENTS
    COMORBIDITY_SCORES
    DELIVERIES
    DEMOGRAPHIC_DETAIL
    DEMOGRAPHICEVENTS
    D_PATIENTS
    DRGEVENTS
    ICD9
    ICUSTAY_DAYS
    ICUSTAY_DETAIL
    ICUSTAYEVENTS
    IOEVENTS
    LABEVENTS
    MEDEVENTS
    MICROBIOLOGYEVENTS
    NOTEEVENTS
    POE_MED
    POE_ORDER
    PROCEDUREEVENTS
    TOTALBALEVENTS
    * WAVEFORM_METADATA
    * WAVEFORM_SEGMENTS
    * WAVEFORM_SEG_SIG
    * WAVEFORM_SIGNALS
    * WAVEFORM_TRENDS
    * WAVEFORM_TREND_SIGNALS
    * The WAVEFORM_* tables are not included in these flat files, although they are present in the on-line MIMIC II Explorer (see below).
    An empty flat file indicates that patient's record does not include data of the corresponding type.
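
Reading one of these flat files outside the database could look like the sketch below, using only Python's standard `csv` module. It assumes the first row of each file holds the column names, which the description above does not state; if that turns out to be wrong, the column names can be passed in through the hypothetical `fieldnames` parameter. The function name is also made up for illustration.

```python
import csv
import os


def read_flat_file(path, fieldnames=None):
    """Read one comma-separated flat file into a list of dicts.

    An empty file means the patient has no data of that type, so an
    empty list is returned. Quoted text fields may contain commas;
    the csv module handles that.
    """
    if os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as handle:
        # With fieldnames=None the first row is used as the header (assumption).
        return list(csv.DictReader(handle, fieldnames=fieldnames))
```

The resulting rows can then be written to a single table per data type and loaded into RapidMiner, or any other tool, without going through PostgreSQL.
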
    MIMIC II Explorer (Query Builder)
    The MIMIC project provides the MIMIC II Explorer, a direct SQL interface to the MIMIC II Clinical Database, hosted on its secure web site.
    To access the MIMIC II Explorer, you must use a MIMIC user name and password. Your PhysioNetWorks user name and password will not work on the MIMIC project's web site.
    First-time users: Please note that your user name and a temporary password for the MIMIC portal were sent to you in two emails from [email protected] with the subject lines Your MIMIC-II User Account and Your MIMIC-II Password. Follow the instructions in the emails to change your MIMIC password (you may change it to match your PhysioNetWorks password if you wish). If you did not receive these emails, your spam filter may have rejected them; please check before writing to [email protected] to request that they be sent again.
    The MIMIC project web site currently uses a self-signed SSL certificate. Your browser will warn you that it does not recognize the certificate the first time you visit; accept it in order to enter the site.


    Cheers and thanks
    Sven
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,134  RM Data Scientist
    I now have a PhysioNet account and have applied for access. Let's see.
  • DocMusher Member Posts: 249   Unicorn
    Can't wait to find a complete solution to analyse this database. If we could succeed in that, it would be 1-0 RapidMiner vs MIT (Boston) ;)