Classifying English Articles Based on Difficulty

alaa_albarazi · October 2018

Hi, the Common European Framework of Reference for Languages (CEFR) categorises language difficulty into three main level groups, A, B, and C, and each level group has two sublevels. The levels are A1 Beginner, A2 Elementary, ..., C2 Mastery.

I have thousands of documents that I need to group by difficulty level using RapidMiner or Python. One idea is to take a list of the most commonly used words and measure how close the words in an article are to, for example, the 1000 most common words. But this approach ignores grammatical difficulty. In addition to word difficulty, I need to add part-of-speech tagging for each article and the length of each sentence, and then find a way to classify the article as easy or difficult. It would be great if there were a ready-to-use library that can do this.
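As a rough illustration of the word-list idea, here is a minimal sketch using only the Python standard library. The tiny `COMMON_WORDS` set is a hypothetical stand-in for a real 1000-word frequency list, and the sentence splitter and tokenizer are deliberately naive assumptions, not a recommended method.

```python
import re

# Hypothetical stand-in for a real list of the 1000 most common words.
COMMON_WORDS = {"the", "a", "is", "on", "cat", "sat", "mat", "it", "was", "and"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(text):
    """Lowercase word tokens; keeps internal apostrophes (e.g. "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def difficulty_features(text, common_words=COMMON_WORDS):
    """Return the share of common words and the mean sentence length in words."""
    sentences = split_sentences(text)
    words = tokenize(text)
    if not words:
        return {"common_ratio": 0.0, "avg_sentence_len": 0.0}
    common = sum(1 for w in words if w in common_words)
    return {
        "common_ratio": common / len(words),
        "avg_sentence_len": len(words) / len(sentences),
    }

print(difficulty_features("The cat sat on the mat. It was ephemeral."))
```

A lower `common_ratio` and a higher `avg_sentence_len` would both suggest a harder article, but the thresholds that map these numbers to levels such as A1 or C2 would have to be calibrated on documents with known levels.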

What packages could help with this, and what process do you recommend?

kayman · October 2018

If you are a bit familiar with Python, I would recommend NLTK; it works well (and fast) for POS tagging.

This post shows a practical implementation: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Filter-Tokens-by-POS-Tags-slow/m-p/43192#M28838
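For reference, a minimal sketch of what POS tagging with NLTK might look like, and how the tags could be turned into per-article features. It assumes NLTK is installed and its tokenizer and tagger data have been fetched with `nltk.download(...)` (the exact resource names vary between NLTK versions). The helper names `tag_text` and `pos_proportions` are hypothetical, and this is not the implementation from the linked post.

```python
from collections import Counter

def tag_text(text):
    """Tokenize and POS-tag text with NLTK (Penn Treebank tags by default)."""
    # Imported lazily so pos_proportions() can be used without NLTK installed.
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(text))

def pos_proportions(tagged):
    """Share of each POS tag in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}

# Hand-tagged example in the (word, tag) format that nltk.pos_tag returns.
example = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
           ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(pos_proportions(example))
```

With the data installed, `pos_proportions(tag_text(article))` would give one feature dictionary per article, which could be fed to a classifier together with word-frequency and sentence-length features.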

rfuentealba · October 2018

Hi @alaa_albarazi,

I would go with Python and NLTK too, as @kayman suggested. The RapidMiner extension for Text Mining can handle some of the preprocessing, which makes the documents easier to analyze in Python, and you can then use the Python Scripting extension to connect the two. Just make sure you have the Anaconda Python Distribution installed; it already contains the nltk and pattern packages, which can help you.

All the best,

Rodrigo.
