Companies and organizations often store and share information via Microsoft SharePoint sites. They are a great way of collecting and sharing information around a given topic, so many sites contain lots of Office documents and files in other formats. Integrating this information into a data mining process often involves manually searching through sites and folders and downloading files by hand. This is neither fast nor simple, so we created the SharePoint Connector extension to speed things up. You can download it through the RapidMiner Marketplace. It consists of the List SharePoint Files operator, which creates a list of all available files and folders, and the Download from SharePoint operator, which downloads the files of interest.
Below you can see the document section of a SharePoint site created for a little demonstration. This site groups together a project folder and a few documents in various file formats.
Figure: Demo SharePoint Site
The first step in integrating your SharePoint data into your data mining process is to find out the SharePoint URL of your company or organization. Just have a look at your browser's address bar and extract it along with your site's name. Both are underlined in the picture above. Now enter this information into the List SharePoint Files operator, which comes with the SharePoint Connector extension, as shown in the picture below.
Figure: List SharePoint Files Operator configuration
Since your SharePoint site is an internal resource, you also need to prove that you have access to the information. For this you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your SharePoint credentials (often the same as your Microsoft account, e.g. Office 365). After logging in, copy the URL from the address bar into the Auth Token field, and the operator will extract the token information automatically.
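How the operator extracts the token internally is not described in this post. As a rough illustration of the idea, here is a minimal Python sketch that pulls a token out of a pasted URL. It assumes, hypothetically, that the token is carried as an access_token parameter in the URL fragment or query string; the example URL, the parameter name and the function name are made up for illustration and are not taken from the extension.

```python
from urllib.parse import urlsplit, parse_qs


def extract_token(url, key="access_token"):
    """Return the value of `key` from the URL fragment or query, or None.

    The parameter name is an assumption for this sketch, not the
    documented behaviour of the SharePoint Connector extension.
    """
    parts = urlsplit(url)
    # Look in the fragment (after '#') first, then in the query (after '?').
    for component in (parts.fragment, parts.query):
        values = parse_qs(component).get(key)
        if values:
            return values[0]
    return None


# Made-up URL, used only to exercise the function.
example_url = "https://example.invalid/explorer#access_token=abc123&token_type=bearer"
print(extract_token(example_url))  # prints: abc123
```

If the URL carries no such parameter, the function returns None, which is a convenient signal that a fresh login is needed.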
If you now run the process, an ExampleSet is created that contains information about the files stored in the site you accessed. Below you can see the result of scanning the demonstration SharePoint site shown at the beginning of this post. The author and lastModifiedBy columns are redacted for this post.
Figure: Result view containing all files and folders found in the site
For each entry you get the filename, its location within the site (path), a URL for downloading it manually (url), the author's name, the creation date and time (creationDateTime), the person who last modified it (lastModifiedBy), the date and time of the last change (lastModificationDateTime), a unique sharepointId, and a flag indicating whether the entry is a folder. The operator only scans the given folder level. If you need to dig deeper, you can use the information obtained above together with the Scan specific folder parameter to search for files and folders in a subfolder.
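To make the idea of digging deeper concrete, here is a small Python sketch that models a few rows of such a result and picks out the folder entries, which are the candidates for a follow-up scan. The class, the sample rows and the field name is_folder are hypothetical stand-ins; the post does not give the exact name of the folder column.

```python
from dataclasses import dataclass


@dataclass
class SiteEntry:
    filename: str
    path: str
    is_folder: bool  # hypothetical name for the folder flag column


def folder_entries(entries):
    """Return only the entries that are folders."""
    return [entry for entry in entries if entry.is_folder]


# Made-up rows resembling one scan of a single folder level.
entries = [
    SiteEntry("Project", "/", True),
    SiteEntry("notes.docx", "/", False),
    SiteEntry("figures.xlsx", "/", False),
]

for folder in folder_entries(entries):
    # Each of these would be scanned again, one level deeper.
    print(folder.filename)  # prints: Project
```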
With this information you can, for example, select all entries created by a given author or of a desired file format in order to download them. To do so, add the Filter Examples operator or any other operator to create a more specific list of the files you want to download. Providing this list to the Download from SharePoint operator lets you download all files to the destination defined in the Download Path parameter, or continue working on them by using the collection of files provided at its output port. An example process using this filtering is shown below and is provided as a tutorial process that comes with the Download from SharePoint operator.
Figure: File download and integration
To continue using the files directly in your process you can, for example, use the Loop Collection operator to handle each file and one of RapidMiner's many reading operators to extract the data into your process.
Don't worry, you don't need to provide the Auth Token to the Download from SharePoint operator again. It is stored alongside the ExampleSet (as an annotation). However, if you store the ExampleSet in your repository and want to download files later, your token might have expired by then. The operator therefore offers an option to set a new token. Again, you can just provide the URL obtained after logging in to the Microsoft Graph Explorer.
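For readers who think in code, the filtering step described above can be sketched in plain Python. This is not how RapidMiner implements Filter Examples; it only illustrates the selection logic on made-up rows. The column names filename and author come from the result table described in this post, while isFolder, the sample values and the function name are assumptions.

```python
from pathlib import PurePosixPath

# Made-up rows standing in for the file list.
rows = [
    {"filename": "Project", "author": "Alice", "isFolder": True},
    {"filename": "report.docx", "author": "Alice", "isFolder": False},
    {"filename": "data.xlsx", "author": "Bob", "isFolder": False},
    {"filename": "summary.DOCX", "author": "Bob", "isFolder": False},
]


def select_files(rows, author=None, extensions=None):
    """Keep non-folder rows matching the optional author and extensions."""
    selected = []
    for row in rows:
        if row["isFolder"]:
            continue  # folders cannot be downloaded as files
        if author is not None and row["author"] != author:
            continue
        suffix = PurePosixPath(row["filename"]).suffix.lower()
        if extensions is not None and suffix not in extensions:
            continue
        selected.append(row)
    return selected


word_files = select_files(rows, extensions={".docx"})
print([row["filename"] for row in word_files])
```

Comparing the lower-cased suffix means that report.docx and summary.DOCX are both treated as Word documents.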
Philipp for the RapidMiner Research Team
The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).