RapidMiner

Reading Excel files directly from your companies OneDrive
RMStaff
RMStaff
3 weeks ago
3 weeks ago

Nowadays many corporate Excel files tend to be online. This seems quite natural, when you consider the wide range of benefits: ease of sharing and collaboration, access from any pc with Internet connection, user right management and constant backups, just to name a few. Wouldn't it be great to include data from those files directly into your Data Mining process, even without downloading the whole file? Now you can! Introducing the new Read Excel Online Operator. It is part of the latest update of the Spreadsheet Table Extraction extension (download link). Example Excel Online SheetExample Excel Online Sheet Above you can see an Excel Sheet stored in a OneDrive for Business instance. Its filename is 'Customer Data.xlsx', it contains one sheet called 'Sheet1' and note, that the 4th column is not given a name, yet. Now let's access this sheet. Open up RapidMiner and search the Marketplace (found in the Extensions menu item) for Spreadsheet Table Extraction. After the installation is complete, you will find a new Operator, called Read Excel Online in your Operator view. Drag it into your process and fill in the parameters. For the sheet shown above it will look like this:

Example process extracting data from OneDriveExample process extracting data from OneDrive

The File Path needed is the relative path in your OneDrive for Business. Unfortunately, Microsoft's API does not support private OneDrive accounts, yet. As you can see, my Excel Sheet is located directly in the upper most folder of my OneDrive for Business. If I would move it into a folder named 'Data', I would need to provide 'Data/Customer Data.xlsx' as the File Path. Note, that no leading slash is used.

The Sheet Name is set per default to Microsoft's current default sheet name. Make sure, that you provide the name of the sheet you intend to access. If you don't provide a Cell Range the whole data containing range will be used. Leading rows and columns that are empty, will be skipped. Providing a Cell Range is done using the A1 notation used in Excel. Selecting all cells starting from the 2nd column (B) and the 3rd row up to the 8th column (H) and the 10th row would require to provide 'B3:H10' as the Cell Range.

Since your OneDrive for Business is an internal resource, you also need to verify, that you have access to the Excel Sheet. Therefore you need a so called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your OneDrive credentials (often equivalent to your Microsoft Account, e.g. Office 365). After having logged in, copy the URL from the address bar into the Auth Token field and the Operator will extract the token information automatically.

Executing the process results in the ExampleSet shown below:Resulting ExampleSet from the extraction processResulting ExampleSet from the extraction process

 

As you can see, all data was extracted successfully. Empty cells were filled up with missing values and the 4th column, which had no name, was given a generic name (here: Attribute 4). And the best part about it, the file was never completely downloaded to your machine. Only data present in the cells you wanted to access was send to your machine. So now you are able to extract only the data of interest while for example facing a large excel document with many sheets, without creating lots of temporary versions of the file on your machine.

In the rare case, that your Excel Sheet only contains data without any column names, check the No column names given option to generate generic attribute names instead of using the first data row as column names.

Advanced Integration

If you have documents stored in Microsoft SharePoint sites, you can use the SharePoint Connector extension (download link) to obtain a list of existing files and load them directly into your process. Therefore have a look at this articles explaining the usage of the extension, and extract the IDs of the desired files and the site itself as described there. Using these information you can provide them via Macros to the Read Excel Online Operator to directly access the data from your SharePoint even without downloading the files. Checking the Read from SharePoint parameter enables you to enter the SharePoint Site ID as well as the SharePoint File ID.

Read from SharePoint optionRead from SharePoint option

An example process using both extensions for direct access to SharePoint data in Excel Sheets is given below:

<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="sharepoint_connector:list_files_sharepoint" compatibility="0.1.000" expanded="true" height="68" name="List SharePoint Files" width="90" x="246" y="238">
        <parameter key="SharePoint Site" value="SharepointTesting"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="238">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="filename.contains.\.xlsx"/>
        </list>
      </operator>
      <operator activated="true" class="extract_macro_from_annotation" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro from Annotation" width="90" x="782" y="238">
        <parameter key="macro" value="siteId"/>
        <parameter key="annotation" value="sharepointSiteId"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="7.5.003" expanded="true" height="82" name="Loop Values" width="90" x="983" y="238">
        <parameter key="attribute" value="sharepointId"/>
        <parameter key="iteration_macro" value="fileId"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="spreadsheet_table_extraction:read_excel_online" compatibility="0.2.001" expanded="true" height="68" name="Read Excel Online" width="90" x="447" y="85">
            <parameter key="Read from SharePoint" value="true"/>
            <parameter key="SharePoint Site ID" value="%{siteId}"/>
            <parameter key="SharePoint File ID" value="%{fileId}"/>
          </operator>
          <connect from_op="Read Excel Online" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="center" color="orange" colored="true" height="201" resized="true" width="223" x="381" y="29">Obtaining the SharePoint site and file IDs from previously set macros&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Make sure to check the Sheet Name and to provide a valid token.</description>
        </process>
      </operator>
      <connect from_op="List SharePoint Files" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro from Annotation" to_port="object"/>
      <connect from_op="Extract Macro from Annotation" from_port="object" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="orange" colored="false" height="322" resized="true" width="403" x="100" y="38">#1 Login to https://developer.microsoft.com/en-us/graph/graph-explorer and copy the URL from your browsers address bar into the &amp;quot;Auth Token&amp;quot; field.&lt;br&gt;#2 Go to your SharePoint and choose a site you want to access.&lt;br&gt;#3 Fill in the parameters 'SharePoint URL' with the URL you used to access your SharePoint and fill in the parameter 'SharePoint Site' with the name of the site you want to access.</description>
      <description align="center" color="orange" colored="false" height="322" resized="true" width="219" x="513" y="37">Select only entries containing Excle Sheets (ending with .xlsx).</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="185" x="737" y="37">Extract the SharePoint site ID from the annotated ExampleSet and provide it as a macro called 'siteId'</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="199" x="928" y="37">Loop over all remaining entries (only the ones being Excel Sheets) to access the SharePoint file IDs from the 'sharepointId' column and providing it via a macro called 'fileId'</description>
    </process>
  </operator>
</process>

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German ministry of education and research (BMBF).