Mining a PDF document

Gjor Member Posts: 1 Learner III
edited November 2018 in Help
I'm new to RapidMiner. I would like to mine a PDF to create a word and number vector. I am using the following operators:
1. Read Document (content type: pdf, encoding: system)
2. Process Documents from Data (prune method: absolute, data management: double_sparse_array)
    Inside Process Documents from Data:
    2.a Extract Information (query type: string matching)
    2.b Tokenize (mode: non letters)
    2.c Transform Cases (transform to: lower case)
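
For readers unfamiliar with these operators, the inner steps (split on non-letters, lower-case, then count terms) can be sketched in plain Java. This is only an illustration of the idea, not RapidMiner's implementation; the class name `WordVectorSketch` and the method `termCounts` are made up for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of RapidMiner.
class WordVectorSketch {

    // Split on runs of non-letter characters, lower-case each token,
    // and count how often each term occurs (a simple term-occurrence vector).
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split yields a leading "" when the text starts with a non-letter
            }
            counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {a=1, has=1, mining=1, pages=1, pdf=2, the=1}
        System.out.println(termCounts("Mining a PDF: the pdf has 3 pages."));
    }
}
```

Note that in this sketch digits act as separators, so the "3" does not survive as a token. If numbers are wanted in the vector, a splitting rule that only breaks on non-letters would need to be adjusted.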

Error Message: com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet


Stack trace:
------------

Exception: java.lang.ClassCastException
Message: com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet
Stack trace:
  com.rapidminer.operator.text.io.ExampleSetDocumentInputOperator.getTextObjects(ExampleSetDocumentInputOperator.java:110)
  com.rapidminer.operator.text.io.AbstractDocumentInputOperator.doWork(AbstractDocumentInputOperator.java:224)
  com.rapidminer.operator.Operator.execute(Operator.java:833)
  com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
  com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
  com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
  com.rapidminer.operator.Operator.execute(Operator.java:833)
  com.rapidminer.Process.run(Process.java:925)
  com.rapidminer.Process.run(Process.java:848)
  com.rapidminer.Process.run(Process.java:807)
  com.rapidminer.Process.run(Process.java:802)
  com.rapidminer.Process.run(Process.java:792)
  com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)





Hi Neil. I'm getting "com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet". The sequence is: 1. Read Document (PDF) ---> 2. Process Documents from Data, containing 2.a Tokenize and 2.b Transform Cases. I'm trying to create a word vector. Thank you for your assistance.

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    The output of the Read Document operator is a Document, whereas Process Documents from Data expects an ExampleSet on its input. That type mismatch is what produces the ClassCastException.

    One option is to insert a Documents to Data operator between them.

    Another, better option would be to use the Read Documents from Files operator.

    regards

    Andrew
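
The mismatch described in this answer can be reproduced in a few lines of plain Java. The sketch below uses hypothetical stand-in classes (`IOObject`, `Document`, `ExampleSet`) and made-up method names that only mimic the shape of the problem; they are not the real RapidMiner classes or operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not the real RapidMiner API.
class PortTypeSketch {

    interface IOObject {}

    static final class Document implements IOObject {
        final String text;
        Document(String text) { this.text = text; }
    }

    static final class ExampleSet implements IOObject {
        final List<String> rows;
        ExampleSet(List<String> rows) { this.rows = rows; }
    }

    // Mimics an operator that assumes its input is tabular data.
    static int processDocumentsFromData(IOObject input) {
        ExampleSet examples = (ExampleSet) input; // throws ClassCastException for a Document
        return examples.rows.size();
    }

    // Mimics a conversion step: each document's text becomes one row.
    static ExampleSet documentsToData(List<Document> documents) {
        List<String> rows = new ArrayList<>();
        for (Document document : documents) {
            rows.add(document.text);
        }
        return new ExampleSet(rows);
    }

    public static void main(String[] args) {
        Document doc = new Document("text extracted from a PDF");
        try {
            processDocumentsFromData(doc); // wiring the document straight in
        } catch (ClassCastException e) {
            System.out.println("Direct connection fails with ClassCastException");
        }
        // Converting first gives the operator the type it expects.
        int rowCount = processDocumentsFromData(documentsToData(List.of(doc)));
        System.out.println("Rows after conversion: " + rowCount);
    }
}
```

The cast in the stand-in operator fails for the same reason as the one in the stack trace: the object arriving at the input is a document, not a table, so a conversion step (or an operator that reads documents itself) has to come first.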