Importing data and using custom python code to read non-csv/txt file -> graph

pypjpypj Member Posts: 8 Contributor I
edited November 2018 in Help

Hello,

 

I am learning RM and have a working piece of Python code that I'd like to run to produce a graph, but I am having a hard time setting everything up and running it properly. The dataset is neither a text nor a CSV file, so I can't upload it through the user interface in the usual way, but it can be read with a Python module called nmrglue. I have moved all the necessary files into the local repository and checked that the Python extension is set up properly (it is up to date and matching). However, it does not seem to pick up the imports from class to class, despite the many ways I've tried connecting the input/output ports in the process tree.

 

I have attached the raw Python script and the related file, and would like visual instructions on how to import them properly.

 

I just need help setting it up but I think I am missing something obvious in the process.

 

Thanks,

Answers

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    I quickly looked into it. Where can I find your rm_main function?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    As @mschmitz pointed out, to use Python inside RapidMiner you'll need to wrap your code in a function (see example image below) and import Pandas by default. Pandas is needed because RapidMiner and Python exchange data as DataFrames.

    Python Example.png
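
    A minimal sketch of that pattern in text form (not a transcription of the image). It assumes one Example Set is connected to the operator's input port; the `row_sum` column and the summing step are purely illustrative.

```python
import pandas as pd

def rm_main(data):
    # `data` arrives as a pandas DataFrame built from the connected Example Set.
    result = data.copy()
    # Illustrative step only: sum the numeric columns of each row.
    numeric = result.select_dtypes(include="number")
    result["row_sum"] = numeric.sum(axis=1)
    # Whatever is returned must again be a DataFrame.
    return result
```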

     

     

     

     

  • pypjpypj Member Posts: 8 Contributor I

    Hello,

     

    I wrapped my 'main' class in rm_main(data), but I keep running into a whitespace error, as shown.

  • pypjpypj Member Posts: 8 Contributor I

    Here is the input as well:

    import nmrglue as ng
    import matplotlib.pyplot as plt
    import pandas as pd
    import scipy.stats
    import numpy as np
    import os
    import pylab

    def rm_main(data):

        def load(path):

        dic, data = ng.fileio.bruker.read_pdata(path)
        udic = ng.bruker.guess_udic(dic,data)

        for k in udic[0].items():
          print(k)
        [udic[n]["size"] for n in range(udic["ndim"])]

        spectrum = data[:]
        # store them as float
        CAR = float(udic[n]["car"])
        SW = float(udic[n]["sw"])
        OBS = float(udic[n]["obs"])

        num_points = float(len(spectrum))

        # needed top divide car by obs to get the carrier in ppm
        freq_max = (.5)                            /float(OBS/SW)+float(CAR/OBS);
        freq_min = (.5-((num_points-1)/num_points))/float(OBS/SW)+float(CAR/OBS);
        step = (freq_max-freq_min)/(num_points-1)

        domain = []
        spectrum_flip = []
        for i in range(len(spectrum)):
            domain.append(freq_min+i*step)
            spectrum_flip.append(spectrum[len(spectrum)-i-1])

        return domain, spectrum_flip

  • pypjpypj Member Posts: 8 Contributor I

    Essentially, what I want to accomplish is to run my Python code and turn that data into a usable matrix that can take advantage of RapidMiner's analysis tools. I can combine all the modules (main, load, draw, etc.) into one process by combining all the Python functions, but I essentially just want the program to run after the import glob module given by looping over the first 'block'.

     

    Thanks,

     

     

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    You need to return pandas DataFrames, not lists. That should do it.
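
    As a rough sketch of that change, two equal-length lists such as the `domain` and `spectrum_flip` returned by the load() function above can be combined into one DataFrame. The column names `ppm` and `intensity` and the placeholder values are assumptions for illustration, not taken from the original script.

```python
import pandas as pd

def lists_to_frame(domain, spectrum):
    """Combine two equal-length lists into a two-column DataFrame."""
    if len(domain) != len(spectrum):
        raise ValueError("domain and spectrum must have the same length")
    return pd.DataFrame({"ppm": domain, "intensity": spectrum})

def rm_main():
    # Placeholder values standing in for the output of load(path).
    domain = [0.0, 0.5, 1.0]
    spectrum_flip = [10.0, 12.5, 9.0]
    return lists_to_frame(domain, spectrum_flip)
```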

     

    ~Martin

  • pypjpypj Member Posts: 8 Contributor I

    Hello,

     

    I replaced all my list comprehensions with pandas.DataFrame, along with append() to mirror the append() calls in the original code. However, I still keep receiving this error:
    IndentationError: expected an indented block (line 13, 17, etc.)

     

    I have cut out the white space and tried virtually every combination of indentation below each iteration, but to no avail.

  • pschlunderpschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research

    Hey,

     

    I guess you already understood the main concepts, but let me repeat some things to potentially fill in missing information before providing a potential solution.

     

    Using the "Execute Python" operator, you're able to execute Python code within RapidMiner. It uses the Python version specified under "Settings" -> "Preferences..." -> "Python Scripting" -> "Path to Python executable".

    [Screenshot: Define Python Version to use]

    The screenshot shows an example path for Python from an Anaconda installation under Windows. You can then access the libraries installed for this Python version in the operator by importing them as you normally would in Python.

     

    The Python code in the operator uses 4 spaces as one indentation level. So if you receive indentation errors, make sure each line's indentation equals 4 spaces times the desired indentation level. For example, when I copy the code of your "rm_main", it contains a mixture of tabs and spaces, as well as indentations consisting of only 2 spaces. Some editors (Sublime Text, for example) offer an option to display whether tabs or spaces are used.
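
    If your editor does not show this, a small standalone check can flag the suspicious lines. This is only a sketch: the function name `find_indentation_problems` is made up for illustration, and it looks at leading whitespace only, so continuation lines aligned under an open bracket may be reported even though Python accepts them.

```python
def find_indentation_problems(source, width=4):
    """Return (line_number, message) pairs for suspicious leading whitespace."""
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines cannot cause indentation errors
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((number, "tab in indentation"))
        elif len(indent) % width != 0:
            problems.append(
                (number, f"{len(indent)} spaces, not a multiple of {width}")
            )
    return problems

# Small sample with a 2-space indent (line 2) and a tab (line 3).
sample = "def rm_main(data):\n  x = 1\n\treturn data\n    return data\n"
for number, message in find_indentation_problems(sample):
    print(f"line {number}: {message}")
```

    The standard library's `tabnanny` module performs a related check for ambiguous mixes of tabs and spaces.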

     

    After dealing with the indentation error make sure to form a proper Pandas DataFrame object. I looked into the "nmrglue" library and the "fileio.bruker.read_pdata" method seems to already return a dictionary of the given data. Fortunately Pandas DataFrames take that as an input. So you might directly create a DataFrame out of the returned object. This even has the advantage, that columns are properly named from the beginning.

     

    Once you have your Pandas DataFrame instance, you can deliver it in the return statement of your "rm_main" function. Afterwards the "Execute Python" operator converts the DataFrame to an Example Set (which RapidMiner uses to manage matrix-like data). You can access this Example Set at the operator's output port. The first DataFrame returned is delivered at the topmost output port, and so on.

     

    Here is some example code, where you only need to adjust the path to the file you want to read:

    import nmrglue as ng
    import pandas as pd

    def rm_main():
        path = "C:\\my_great_data_file.ending"
        # read data using nmrglue from file located at path
        dic, _ = ng.fileio.bruker.read_pdata(path)

        # create pandas data frame from the given dictionary
        df = pd.DataFrame(dic)

        # check if data frame creation worked
        if not isinstance(df, pd.DataFrame):
            print("Conversion to data frame failed.")

        # deliver the data frame to the operator's output port
        return df

    Notes:

    • You always need a function called "rm_main" in the "Execute Python" operator. If you connect Example Sets to its input port(s), you need to specify the same number of parameters for the function. In your case you would not need to provide anything at the input port, hence you would not need any parameter for "rm_main()".
    • Everything printed using Python's "print" function is displayed in RapidMiner's Log. You can enable it via the menu option "View" -> "Show Panel" -> "Log".
    • RapidMiner offers operators to change the type of attributes after they have been loaded. If you still need to define these types within the "Execute Python" operator, convert the read data to numpy arrays first. For them you can specify the attribute types with the so-called "dtype" parameter. Find an example here.
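
    A brief sketch of the "dtype" idea from the last note: the values are converted to numpy arrays with explicit types before the DataFrame is built. The function name `typed_frame` and the column names are assumptions for illustration, and how RapidMiner maps each dtype to an attribute type is not covered here.

```python
import numpy as np
import pandas as pd

def typed_frame(ppm, intensity):
    """Build a DataFrame whose column types are fixed explicitly via dtype."""
    return pd.DataFrame({
        # Force floating point even if the inputs happen to be integers.
        "ppm": np.asarray(ppm, dtype=np.float64),
        "intensity": np.asarray(intensity, dtype=np.float64),
        # An integer index column, one entry per data point.
        "point": np.arange(len(ppm), dtype=np.int64),
    })
```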

    Edit:

    If you are using Windows, make sure to escape backslashes when providing the path. This means that you need to write two backslashes for each one; I added this to my code example above.
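
    For illustration, here are three equivalent ways to spell the placeholder path from the example in Python. The file name is the same placeholder as above, not a real file.

```python
from pathlib import PureWindowsPath

escaped = "C:\\my_great_data_file.ending"   # each backslash doubled
raw = r"C:\my_great_data_file.ending"       # raw string, no doubling needed
forward = "C:/my_great_data_file.ending"    # forward slashes

# The escaped and raw spellings are the same string.
print(escaped == raw)
# As Windows paths, the forward-slash spelling names the same file.
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```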

  • pypjpypj Member Posts: 8 Contributor I

    Again, I have set the directory correctly, and the code works just fine outside RapidMiner. I am also using Linux, and I copy-pasted your code with the parameters in mind. It does not work. I suspect I may have to combine all the class objects into a single block. I also use gedit and Sublime to keep track of indentation, and I removed all unnecessary spaces/tabs.

     

    Do I have to move the python files into the module folders where my Python install is? For some reason it is not able to detect the class attached to it sequentially to the right.

     

    The script could not be parsed.
    Please check your Python script: Import Error: No module named bin_spectrum

     

    bin_spectrum is the name of my other class with its own set of functions to be called on.

  • pschlunderpschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research

    Have you tried adding the folder containing your lib files to the PYTHONPATH?

     

    There is an environment variable, PYTHONPATH (ref.: pythonpath), that Python uses to look up modules. A single file such as `bin_spectrum.py` can be imported once its folder is on that path. If you want the folder itself to be importable as a package, it needs an `__init__.py` file; the file can be empty, but it has to be inside the folder you want to add.
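
    If changing the environment variable is inconvenient, a similar effect can be had from inside the script by extending `sys.path` before the import. A minimal sketch, where the helper name `add_module_dir` and the folder in the commented usage lines are hypothetical:

```python
import os
import sys

def add_module_dir(folder):
    """Put `folder` at the front of sys.path unless it is already listed."""
    resolved = os.path.abspath(os.path.expanduser(folder))
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved

# Hypothetical usage; replace the folder with the one holding bin_spectrum.py:
# add_module_dir("~/nmr_scripts")
# import bin_spectrum
```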

     

    Another option is to create an installable module out of your code using a `setup.py` (ref.: creating a setup file). This allows for installation via pip. If you choose this solution, you might want to use the `-e` option during installation. It lets you keep working on the Python files without having to reinstall the module over and over again (possible installation call: `pip install -e folder_containing_the_setup_py`).
