Classifying English Articles Based on Difficulty

alaa_albarazi Member Posts: 1 Learner I
edited December 2018 in Help

Hi, the Common European Framework of Reference for Languages (CEFR) categorises language difficulty into three main level groups, A, B, and C, and each level group has two sublevels. The levels run from A1 (Beginner) and A2 (Elementary) up to C2 (Mastery).

I have thousands of documents that I need to group by difficulty level using RapidMiner or Python. One idea is to take a list of the most commonly used words and measure how close the words in an article are to it, for example to the 1000 most common words. But this approach ignores grammatical difficulty. In addition to word difficulty, I need to add part-of-speech tagging for each article and the length of each sentence, and then find a way to rate the article as easy or difficult. It would be great if there is a ready-to-use library that can do this.

What packages could help with this? And what process do you recommend?
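As a rough illustration of two of the features described above (the share of words outside a common-word list and the average sentence length), here is a minimal sketch that uses only the Python standard library. The tiny `COMMON_WORDS` set and the function names are hypothetical placeholders; a real run would load an actual frequency list, and the regex-based sentence splitter is deliberately naive.

```python
import re

# Hypothetical stand-in for a real frequency list (e.g. the 1000 most common
# English words); in practice this would be loaded from a file.
COMMON_WORDS = {"the", "a", "is", "on", "cat", "sat", "mat", "and", "it", "was"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Lowercased word tokens made of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def difficulty_features(text, common_words=COMMON_WORDS):
    """Return the share of uncommon words and the mean sentence length in words."""
    sentences = split_sentences(text)
    words = tokenize(text)
    if not words or not sentences:
        return {"uncommon_ratio": 0.0, "avg_sentence_len": 0.0}
    uncommon = sum(1 for w in words if w not in common_words)
    return {
        "uncommon_ratio": uncommon / len(words),
        "avg_sentence_len": len(words) / len(sentences),
    }


easy = "The cat sat on the mat. It was a cat."
hard = "Notwithstanding considerable epistemological ambiguity, the cat persisted."
print(difficulty_features(easy))
print(difficulty_features(hard))
```

Features like these could then be fed, together with POS-based features, into whatever rule or classifier is used to assign a CEFR level; the mapping from feature values to levels is not something this sketch attempts.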

 

Answers

  • kayman Member Posts: 662 Unicorn

    If you are a bit familiar with Python, I would recommend NLTK; it works well (and fast) for POS tagging.

     

    This post shows a practical implementation: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Filter-Tokens-by-POS-Tags-slow/m-p/43192#M28838
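To give an idea of how POS tags could feed into a difficulty measure, here is a minimal sketch that turns tagged tokens into a relative-frequency profile. It assumes input shaped like the output of `nltk.pos_tag` (pairs of token and Penn Treebank tag); the sample data is hand-written so the sketch runs without NLTK, and the coarse grouping in `coarse_tag` is a hypothetical choice, not something taken from the linked post.

```python
from collections import Counter

# With NLTK installed and its tokenizer/tagger models downloaded, tagged input
# of this shape would come from:
#   import nltk
#   tagged = nltk.pos_tag(nltk.word_tokenize(text))
# The list below is hand-written sample data so the sketch runs without NLTK.
SAMPLE_TAGGED = [
    ("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("slept", "VBD"),
    ("under", "IN"), ("the", "DT"), ("table", "NN"), (".", "."),
]


def coarse_tag(tag):
    """Map a Penn Treebank tag to a coarse class (hypothetical grouping)."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("RB"):
        return "adverb"
    if tag == "IN":
        return "preposition_or_subordinator"
    return "other"


def pos_profile(tagged):
    """Relative frequency of each coarse class, ignoring punctuation tokens."""
    words = [(tok, tag) for tok, tag in tagged if any(ch.isalnum() for ch in tok)]
    counts = Counter(coarse_tag(tag) for _, tag in words)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


print(pos_profile(SAMPLE_TAGGED))
```

A profile like this (for example, a high share of subordinators or adverbs) is one possible proxy for grammatical complexity; which tag classes actually correlate with CEFR level would need to be checked against labelled articles.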

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @alaa_albarazi,

     

    I would go with Python and NLTK too, as @kayman suggested. The RapidMiner extension for Text Mining can handle some of the preprocessing, which makes the documents easier to analyze in Python, and you can then use the Python Scripting extension to connect the two. Just make sure you have the Anaconda Python Distribution installed; it already contains the nltk and pattern packages, which can help you.

     

    All the best,

     

    Rodrigo.
