Extract Sample Size and Population Type from a group of Article Abstracts

MichaelWhite Member Posts: 1 Newbie
edited March 2019 in Help
TL;DR: I want to extract some specific information from a series of similarly (but not identically) formatted paragraphs of text and I hope someone here can point me in the right direction.

I work in the research office of a private university and as part of a project I want to analyze information found in journal article abstracts. Journal databases like Scopus allow me to download a CSV file with article titles, URLs, DOIs (a unique identifier), author names, and abstracts. I want to mine the abstracts to find specific information for each article.

To give a specific example, I have an exported list of around 400 articles, each one a row in a CSV file, and all of them relate to the development of surveys, questionnaires, scales, or similar academic instruments. Some articles were included by mistake: they relate to "instruments" not in the sense of questionnaires but in the sense of machines that measure and quantify data such as meteorological phenomena. I need to ignore these articles. The relevant articles generally include the sample size of the group the instrument was administered to and a brief description of the type of people who participated (university students, children from 9-12 years of age, patients with type II diabetes, etc.).

I want to mine the abstracts and produce two additional columns: one containing the sample size and the other listing the type of people who participated in the study. So I need to mine two types of data, one numerical and one textual. To complicate things, an abstract can include other numbers that are unrelated to the sample size, and some authors even write out the sample size in words, like "four hundred and thirty."

I have attached a sample file with 12 abstracts, my manual analysis of the sample size, and a Yes/No field showing whether the study was carried out on students (meaning anyone from schoolchildren through university students).

I know very little about text mining and text analysis, but from what I have read, the RapidMiner platform seems to be the most promising solution. I am hoping that someone here can point me in the right direction to see how this could be done.


    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello, @MichaelWhite

    We will gladly help you. Let me ask: how proficient are you with technology? I ask because there is one thing RapidMiner can't do, which is getting information from JavaScript-generated sites like the ones in your example. I am building an extension for that, but given the extreeeme lack of time, it has been delayed several times.

    Other than that, everything else seems doable.

    All the best,


    lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @MichaelWhite,

    1/ I notice that most of your sample sizes have 3 digits, so I used a regex to extract such numbers from your abstracts.
     It gives relatively good results on your sample dataset (12 abstracts).
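To make the regex step concrete, here is a minimal Python sketch of what such a pattern does. The first pattern is the one from the process shared later in this thread (its two alternatives are identical, so a single copy is used); the second, stricter pattern and the example sentence are assumptions made up for illustration, not taken from the sample file.

```python
import re

# Pattern from the shared process: three digits followed by a character
# that is not "s", "^", or another digit.
THREAD_PATTERN = re.compile(r"\d{3}(?=[^s^\d])")

# Hypothetical stricter variant: exactly three digits that are not part of
# a longer number and do not directly follow a decimal point.
STRICT_PATTERN = re.compile(r"(?<![\d.])\d{3}(?!\d)")

# Invented example sentence, not one of the 12 abstracts.
text = "Of 295 students invited, 282 completed the scale (p < .001) in 2017."

first_match = THREAD_PATTERN.search(text).group()   # only the first hit
all_thread = THREAD_PATTERN.findall(text)           # every hit
all_strict = STRICT_PATTERN.findall(text)

print(first_match)  # 295
print(all_thread)   # ['295', '282', '001', '017']
print(all_strict)   # ['295', '282']
```

Listing every candidate instead of only the first one at least makes it visible when an abstract contains several plausible numbers, so those rows can be checked by hand.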

     Explanations for the errors:
     - 295 instead of 282: RapidMiner captures only the first matching group (which is 295).
     - 001 instead of 58: "58" is written out in words.

     - 341 instead of 404: 404 is the sum of two numbers (341 and 63).

    It is difficult (maybe impossible) to extract and sum the numbers automatically in a text mining process...
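For the sample sizes written in words, one possible workaround is to convert number phrases to integers before or alongside the regex step. The sketch below is an assumption-laden illustration, not part of the RapidMiner process: the helper names `words_to_int` and `find_written_numbers` are hypothetical, the example sentence is invented, and only English numbers below one million are handled.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}

# Longest words first so that e.g. "eighteen" is tried before "eight".
_WORDS = sorted([*UNITS, *TENS, "hundred", "thousand"], key=len, reverse=True)
_ALT = "|".join(_WORDS)
# A number phrase: number words joined by spaces/hyphens, optionally with "and".
PHRASE = re.compile(
    rf"\b(?:{_ALT})(?:[\s-]+(?:and[\s-]+)?(?:{_ALT}))*\b", re.IGNORECASE)


def words_to_int(phrase):
    """Convert e.g. 'four hundred and thirty' to 430; None if a word is unknown."""
    total, current, seen = 0, 0, False
    for word in re.findall(r"[a-z]+", phrase.lower()):
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            return None
        seen = True
    return total + current if seen else None


def find_written_numbers(text):
    """Return the integer value of every written-out number phrase in text."""
    return [words_to_int(m.group()) for m in PHRASE.finditer(text)]


# Invented example sentence, not one of the 12 abstracts.
print(find_written_numbers(
    "Fifty-eight nurses and four hundred and thirty patients took part."))
# [58, 430]
```

Note that this also picks up small words such as "one" or "two" used in ordinary prose, so the results would still need filtering (for example, keeping only values above some threshold).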

    2/ Your "Students?" attribute:

    I encourage you to train a classifier after processing your text attribute (tokenize, etc.). The target attribute will be "Students?" with possible values 'Yes', 'No' or 'Mixed'. You can then apply this model to your unlabelled abstracts to predict whether the study/survey was carried out with students, non-students, or a mix.
    To train the classifier, you first have to manually label part of your 400 abstracts (Yes, No, Mixed), like you did for the first 12 abstracts.
    I think 12 is too few to train a useful classifier, though. You could start by labelling 80 abstracts, for example (20% of the whole dataset).
    Note: keep in mind that text mining is not an exact science but a probabilistic one, so the resulting classifier will necessarily make (significant) errors.
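The label-then-train-then-predict workflow described above can be sketched outside RapidMiner as well. The following is a minimal, self-contained Python illustration using a hand-written multinomial naive Bayes classifier; it stands in for whatever learner you would pick in RapidMiner and is not the method used in this thread. The four training sentences are invented, and the function names `tokenize`, `train`, and `predict` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labelled:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented, hand-labelled examples standing in for the labelled abstracts.
labelled = [
    ("Questionnaire administered to 320 undergraduate university students", "Yes"),
    ("Scale validated with primary school children aged 9 to 12", "Yes"),
    ("Survey completed by adult patients with type II diabetes", "No"),
    ("Instrument tested among hospital nurses and physicians", "No"),
]
model = train(labelled)
print(predict(model, "The scale was given to high school students"))  # Yes
print(predict(model, "Completed by elderly patients in a hospital"))  # No
```

With only a handful of training rows the predictions are driven by a few keywords, which is exactly why labelling a larger share of the 400 abstracts matters.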

    Hope it helps,



    lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi again @MichaelWhite,

    I forgot to share the process to extract the sample size... :/
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="9.2.000" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
            <parameter key="csv_file" value="C:\Users\Lionel\Downloads\12Abstracts.csv"/>
            <parameter key="column_separators" value=";"/>
            <parameter key="trim_lines" value="false"/>
            <parameter key="use_quotes" value="true"/>
            <parameter key="quotes_character" value="&quot;"/>
            <parameter key="escape_character" value="\"/>
            <parameter key="skip_comments" value="true"/>
            <parameter key="comment_characters" value="#"/>
            <parameter key="starting_row" value="1"/>
            <parameter key="parse_numbers" value="true"/>
            <parameter key="decimal_character" value="."/>
            <parameter key="grouped_digits" value="false"/>
            <parameter key="grouping_character" value=","/>
            <parameter key="infinity_representation" value=""/>
            <parameter key="date_format" value=""/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="encoding" value="windows-1252"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="DOI.true.polynominal.attribute"/>
              <parameter key="1" value="Link.true.polynominal.attribute"/>
              <parameter key="2" value="Abstract.true.polynominal.attribute"/>
              <parameter key="3" value="Sample Size.true.integer.attribute"/>
              <parameter key="4" value="Students?.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="text:generate_extract" compatibility="8.1.000" expanded="true" height="68" name="Generate Extract" width="90" x="246" y="34">
            <parameter key="source_attribute" value="Abstract"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries">
              <parameter key="Sample_size" value="(\d{3}(?=[^s^\d]))|(\d{3}(?=[^s^\d]))"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
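For readers who want to try the same idea without RapidMiner, the process above (read a semicolon-separated CSV, then extract the first regex match from the Abstract column into a new Sample_size column) can be approximated in plain Python. This is a rough sketch under stated assumptions: the two CSV rows and their DOIs are invented placeholders rather than rows from 12Abstracts.csv, the function name `add_sample_size` is hypothetical, and a missing match is written as an empty string here.

```python
import csv
import io
import re

# Same pattern as in the Generate Extract operator (duplicate alternative dropped).
PATTERN = re.compile(r"\d{3}(?=[^s^\d])")


def add_sample_size(rows, source="Abstract", target="Sample_size"):
    """Return copies of rows with the first regex match stored under target."""
    out = []
    for row in rows:
        match = PATTERN.search(row[source])
        out.append({**row, target: match.group() if match else ""})
    return out


# Invented two-row stand-in for the real file; ";" and '"' mirror the
# column_separators and quotes_character settings of the Read CSV operator.
sample = io.StringIO(
    "DOI;Abstract\n"
    '10.0000/example-1;"The scale was completed by 295 adults."\n'
    '10.0000/example-2;"Fifty-eight nurses answered the survey."\n'
)
rows = list(csv.DictReader(sample, delimiter=";", quotechar='"'))
result = add_sample_size(rows)

for row in result:
    print(row["DOI"], repr(row["Sample_size"]))
# 10.0000/example-1 '295'
# 10.0000/example-2 ''
```

To run it on a real export, replace the `io.StringIO` object with `open("your_file.csv", newline="", encoding="windows-1252")`, adjusting the file name and encoding to your own data.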


