Grab Meta-keywords, frequency lists

leptserkhan Member Posts: 7 Contributor II
edited November 2018 in Help
Hello.  I am new to RapidMiner and I am wondering if it is suitable for my project.

My project needs to analyze the similarity or dissimilarity between the meta keywords contained in web pages.

My basic questions for this type of analysis are:
  • Can RapidMiner take a list of URLs and crawl those domains, grabbing ONLY the meta keywords?  I am not interested in analyzing the entire content of those websites, only the categorization/analysis of the meta keywords they contain.
  • Can RapidMiner do some standard categorization on the meta keywords, providing frequency lists and themes of words?
  • Can it then produce a graph of that analysis?
  • Can RapidMiner be configured to apply more weight to certain words?  E.g., the word "employment", if contained in a meta keyword on a web page, would "weigh" heavier in results than other words in this analysis.  If so, how is that feature accomplished?
  • What would be the general steps to take to import the data and provide this analysis?
Thank you.


    el_chief Member Posts: 63 Contributor II
    1. you need to install the text processing plugin, and the web mining plugin (help menu)

    2. "Can rapid miner take a list of URLs and crawl those domains" yes. web mining:crawl web (or getpages)

    3. "grabbing ONLY the meta-keywords" yes. text mining:keep document parts (regex based)
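    As an illustration of what that regex-based extraction step does, here is a rough Python sketch that keeps only the content of a page's `<meta name="keywords">` tag. The pattern and sample HTML are illustrative, not RapidMiner's internals:

```python
import re

def extract_meta_keywords(html):
    # Look for a <meta name="keywords" content="..."> tag (simplified
    # pattern; real-world HTML may order the attributes differently).
    match = re.search(
        r'<meta\s+name=["\']keywords["\']\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if not match:
        return []
    # Split the comma-separated keyword list and trim whitespace.
    return [kw.strip() for kw in match.group(1).split(",") if kw.strip()]

page = '<html><head><meta name="keywords" content="employment, jobs, careers"></head></html>'
print(extract_meta_keywords(page))  # ['employment', 'jobs', 'careers']
```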

    4. "Can RapidMiner do some standard categorization on the meta keywords, providing frequency lists and themes of words?"

    Yes, but it depends what you mean by theme. It can count occurrences, relative frequencies, frequencies relative to the other documents, or binary occurrence. You might be able to analyze synonyms using SVD.
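    The vectorization modes mentioned above can be sketched in a few lines of Python (illustrative only; RapidMiner's Process Documents operator does this internally):

```python
from collections import Counter

def vectorize(tokens, mode="occurrences"):
    # Count each token, then turn the counts into the requested vector type.
    counts = Counter(tokens)
    if mode == "occurrences":       # raw term counts
        return dict(counts)
    if mode == "relative":          # counts divided by document length
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}
    if mode == "binary":            # 1 if the term occurs at all
        return {t: 1 for t in counts}
    raise ValueError(mode)

tokens = ["jobs", "employment", "jobs"]
print(vectorize(tokens))            # {'jobs': 2, 'employment': 1}
print(vectorize(tokens, "binary"))  # {'jobs': 1, 'employment': 1}
```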

    5. "Can it then produce a graph of that analysis?"

    What kind of graph?

    6. "Can RapidMiner be configured to apply more weight to certain words? E.g., the word "employment", if contained in a meta keyword on a web page, would "weigh" heavier in results than other words in this analysis. If so, how is that feature accomplished?"

    Text Processing:Process Documents operator -> select attributes and weights

    7. "What would be the general steps to take to import the data and provide this analysis?"

    remove all but meta
    lower case
    process documents
    - vectorize
    - weight attributes

    then Modeling - Similarity - Similarity to Data - Cosine Distance
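    The steps above (lower-case, tokenize, vectorize with optional per-word weights, then cosine similarity) can be sketched in plain Python. The weight on "employment" is an illustrative assumption, not a value from the thread:

```python
import math

def term_vector(text, weights=None):
    # Lower-case, whitespace-tokenize, and accumulate weighted term counts.
    weights = weights or {}
    vec = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + weights.get(token, 1.0)
    return vec

def cosine(a, b):
    # Standard cosine similarity between two sparse term vectors.
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Up-weight "employment" so it dominates the comparison, as asked above.
d1 = term_vector("employment jobs careers", weights={"employment": 3.0})
d2 = term_vector("employment hiring", weights={"employment": 3.0})
print(round(cosine(d1, d2), 3))  # 0.858
```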

    leptserkhan Member Posts: 7 Contributor II
    Thank you for the quick reply!

    A question I left out that I think is important in all of this.  I know there is a process in text mining (forgot the name) whereby low-frequency words are given higher significance and high-frequency words lower significance.  Let me illustrate:

    Say we pull back a corpus of 1,000 meta keywords from 50 websites.  Suppose 40 of those websites mostly contain engineering-related words, while the remaining 10 websites contribute only a small fraction of the total meta keywords, say 60 words, but those are the ones I need to stand out the most.  How does one go about highlighting the words from those 10 websites (60 words) as more significant than the words in the remaining 940, and vice versa?

    I don't know if text mining has a feature or algorithm for this?
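    (The scheme described above sounds like inverse document frequency weighting: a term that appears in few documents gets a larger weight than one that appears everywhere. A minimal sketch with a made-up corpus of keyword sets:)

```python
import math

def idf(term, docs):
    # Inverse document frequency: log(N / number of docs containing the term).
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

docs = [
    {"engineering", "jobs"},
    {"engineering", "design"},
    {"employment", "jobs"},
    {"engineering", "jobs"},
]
# "engineering" appears in 3 of 4 docs, "employment" in only 1,
# so "employment" receives the larger weight.
print(round(idf("engineering", docs), 3), round(idf("employment", docs), 3))
```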
    el_chief Member Posts: 63 Contributor II
    I am not sure if I understand you correctly.
    leptserkhan Member Posts: 7 Contributor II
    Before I pick your brain for information, maybe I better study the product a little more.  It is a handful and I want to make sure I am asking questions that are actually applicable to RapidMiner.

    Right now I could use some direction, help, tutorial on using the web crawler features.

    Something step by step somewhere that shows me how to do basic web crawling: grab a page or several pages, download and extract and/or categorize text from those pages.

    That I think would be the best use of my time so I can learn and not sound foolish with questions.

    Thank you.

    Any advice greatly appreciated on how to start learning the web crawl features.
    land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Although I did not fully understand what you are trying to accomplish, here are some directions:

    The "process web" operator might suit your needs even better than the crawl operator, because you can extract the information from a website during crawling and only keep this information. This might lower memory consumption of large crawling runs significantly.

    There's documentation available for each operator if you click on its help page; I think the most important parameters for crawling, like the rule definition, are explained there. If you hold the mouse cursor over a parameter for a few seconds, a tooltip will explain what it does.
    With this information it should be understandable after some time of experimenting.

    For more detailed information, please consider taking part in a text mining and web mining course available in our shop. It will give you detailed descriptions as well as demonstrate the whole process of crawling/processing/extracting/learning and applying.

    leptserkhan Member Posts: 7 Contributor II
    The idea of attending a seminar or webinar is very exciting, but as a non-profit community group we don't have the resources right now.  This is just my suggestion, but it would help if there were some simple examples provided somewhere of using the various process modules.  The explanations are great, but it does require a lot of experimenting to find results, with a lot of starting, stopping, and redoing.  By examples I mean more than a description of the process module: an actual worked example.

    Although I can see that this product is fantastic and head and shoulders above anything else on the market now, the cost of seminars excludes organizations like mine from participating.  Even for the most basic understanding a user needs a lot of patience and familiarity with regular expressions.  The existing documentation gives a general overview of the product with only examples of the more sophisticated uses, which again requires one to attend not just one but several seminars/webinars to understand it fully.

    Good luck with this product.  I see that it is still evolving and holds great promise.
    el_chief Member Posts: 63 Contributor II
    You give up so easily! :)

    Yes, text mining is complicated. Try using GATE...now that is hard. RapidMiner makes it "easy".

    There are plenty of videos on YouTube. Check out VancouverData (me), NeuralMarketTrends1, and DrMarkusHofmann channels.

    In the meantime, here is an *example* process that does a simple similarity check:

    The input data is an excel sheet that looks like this:
    docid  terms
    1      one
    2      one, two
    3      one, two, three
    4      one, two, three, four
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <process expanded="true" height="426" width="435">
          <operator activated="true" class="read_excel" compatibility="5.0.8" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="wordlist.xls"/>
            <list key="annotations"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.0.8" expanded="true" height="76" name="Nominal to Text" width="90" x="180" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="terms"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="5.0.6" expanded="true" height="60" name="Data to Documents" width="90" x="45" y="165">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents" width="90" x="45" y="300">
            <parameter key="vector_creation" value="Term Frequency"/>
            <parameter key="keep_text" value="true"/>
            <process expanded="true" height="426" width="465">
              <operator activated="true" class="text:tokenize" compatibility="5.0.6" expanded="true" height="60" name="Tokenize" width="90" x="128" y="123"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="5.0.8" expanded="true" height="76" name="Data to Similarity" width="90" x="179" y="300">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    You can copy and paste the above into RapidMiner's XML window and run it.

    It can create a pretty graph that shows the similarity.

    leptserkhan Member Posts: 7 Contributor II
    Thank you for your help!  I haven't given up, just need to pick my battles, as they say.  So much to do, so little time; sometimes I need to follow the path of least resistance.
    I know what you mean, some other data mining tools are extremely difficult to learn and use.  I do think RapidMiner is a great tool, and far easier than most, but for a newbie data miner/text analyzer it does require a wee bit more learning than I have time for now.  But I will put my faith in it and keep trying, as you suggest.

    I haven't given up.  I see that rapidminer will do almost anything I need once I learn the skill set.

    It would be very useful if someone could build a video on the use of the web crawler processes.  The other videos are excellent and in fact answer many of the questions a new user would have, but web crawling has been left out of the available videos.  So I will examine each of the other videos to tease out useful information that I can then apply to web crawling.

    Thank you.
    NeuralMarket Member Posts: 13 Contributor II

    Thanks for sharing the XML code for this keyword similarity process; it helped me look at things a bit differently as I'm learning text mining.

    Best Regards,

    PS: nice blog!