Music Lyrics Analyzer: how to handle repeated lyrics?

mt_12345 Member Posts: 4 Contributor I
edited December 2018 in Help

Hey guys,

I'm currently working on an automatic Music Lyrics Analyzer (MLA). The MLA uses text analytics methods on an established platform to analyze the vocabulary used in the song lyrics of different artists / genres and to build clusters of songs based on their lyrics. In many songs, some sections of the lyrics are repeated twice, which is indicated by the string "x2".


In my opinion, I have to account for those repetitions to avoid skewing the classification model's results. Do you agree? If yes, how should I handle this? Which operators should I choose?


Many thanks for your help! Have a good day!



  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hmm, I'm not really sure whether you should be weighting the repetitions, but if you use tokenization and TF-IDF, the repetitions will be weighted accordingly anyway.
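To illustrate the point about weighting, here is a minimal sketch in plain Python (not RapidMiner). It assumes whitespace tokenization and a simple raw count times log(N / df) weighting; the `tf_idf` helper and the sample lyrics are made up for this example. It suggests that a repeated section only gets extra weight if the text is physically present twice, while a literal "x2" marker just becomes a token of its own.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: weight} dict per document (raw count * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    return [
        {tok: cnt * math.log(n_docs / df[tok]) for tok, cnt in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Hypothetical lyrics: the same song with a marker vs. with the chorus written out.
marked = "walking home alone love me tender x2"
expanded = "walking home alone love me tender love me tender"
other = "rain falls on the road"

w_marked = tf_idf([marked, other])[0]
w_expanded = tf_idf([expanded, other])[0]

# "love" weighs twice as much once the chorus really appears twice.
print(w_marked["love"], w_expanded["love"])
# The marker itself is only a token in the unexpanded version.
print("x2" in w_marked, "x2" in w_expanded)
```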



  • mt_12345 Member Posts: 4 Contributor I

    Thanks a lot for your answer. I will try it out!



  • mt_12345 Member Posts: 4 Contributor I

    Just to make sure that everyone gets my question right: the repetitions are only indicated by the string x2; the text itself is not included twice in the lyrics. So we have to do some transformation so that the text really appears twice, right? Any ideas how we can do this?

    I think what Scott suggested is what comes one step later. 


  • David_A Administrator, Moderator, Employee, RMResearcher, Member Posts: 297 RM Research



    It depends a bit on how the lyrics are returned, for example as one token per line or per stanza. If this is the case, you can play with regular expressions and the Replace operator.

    Perhaps a bit cumbersome, but something like this should do the trick:

    • Replace what: (.+) x2
    • Replace with: $1 $1

    Then you can repeat that pattern for x3, x4, ... 
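As a rough illustration outside RapidMiner, the same replacement can be sketched in Python, where the back-reference is written `\1` instead of `$1`. The `expand_repeats` helper is hypothetical, and capturing the count with `x(\d+)` is an assumption that avoids writing one pattern per x2, x3, x4, ...

```python
import re

def expand_repeats(line):
    """Replace 'text xN' by N space-separated copies of 'text'."""
    def repeat(match):
        text, count = match.group(1), int(match.group(2))
        return " ".join([text] * count)
    # (.+) captures the lyric, x(\d+) captures the repetition count.
    return re.sub(r"(.+) x(\d+)", repeat, line)

print(expand_repeats("another sentence x2"))  # marker replaced by a second copy
print(expand_repeats("la la la x3"))          # three copies
print(expand_repeats("no marker here"))       # unchanged, nothing matches
```

This assumes one lyric line per string with at most one marker; several markers on one line need more care because `(.+)` is greedy.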


    Hope this helps.


  • kayman Member Posts: 662 Unicorn

    Regular expressions are indeed probably the best approach here, but the quality will depend on your original data. The one given by David would already work to some extent, but since it is greedy it can capture too much if you have multiple x2's in your data. If your structure is as follows (so with line breaks):


    some sentence

    another sentence x2

    yet again another sentence

    and some other x2


    The regular expression that will work best in that case is (?m)^(.*?) x2$


    Roughly translated, this means: for each line, start at the beginning and group everything that appears until you reach an x2 at the end of the line.


    Replacing it with $1 $1 will then give you the same string twice. If there is no x2 in the string/line, it will simply keep the original.
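For anyone who wants to try the multi-line pattern outside RapidMiner, here is a small Python sketch using the sample structure from above. Python writes the back-reference as `\1` rather than `$1`; the variable names are just for illustration.

```python
import re

lyrics = (
    "some sentence\n"
    "another sentence x2\n"
    "yet again another sentence\n"
    "and some other x2"
)

# (?m) makes ^ and $ match at every line start and end, not only at the
# start and end of the whole string.
expanded = re.sub(r"(?m)^(.*?) x2$", r"\1 \1", lyrics)
print(expanded)
```

Lines without a trailing x2 pass through untouched, and each marked line is replaced by two copies of its text.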


    If everything is on one line, (.*?) x2 will also do fine, but make sure you use the question mark if x2 occurs more than once in your string. This ensures the capture stops as soon as it finds an x2; otherwise it will take everything until the last x2 it finds.
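The difference between the greedy and the lazy version can be seen in a short Python sketch (with `\1` in place of `$1`). The sample line is made up, and collapsing whitespace afterwards is an extra step assumed here because the second capture starts with the space that follows the first x2.

```python
import re

line = "first part x2 second part x2"

# Greedy: the group runs up to the LAST x2, so the first marker stays in the text.
greedy = re.sub(r"(.+) x2", r"\1 \1", line)

# Lazy: each match stops at the next x2, so both markers are expanded.
lazy = re.sub(r"(.*?) x2", r"\1 \1", line)
lazy = " ".join(lazy.split())  # collapse the doubled space left by the second match

print(greedy)
print(lazy)
```

Note that on a single line the lazy group still captures everything since the previous marker, so this only duplicates the right text if the repeated section starts directly after the previous x2.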


    Note that if your x2 is in parentheses, the pattern becomes (.*?) \(x2\)
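A quick Python sketch of the parenthesized variant, again with `\1` instead of `$1`. Making the parentheses optional with `\(?` and `\)?` is an assumption added here so that one pattern covers both spellings; the `expand_x2` helper is hypothetical.

```python
import re

def expand_x2(text):
    """Duplicate every line that ends in ' x2' or ' (x2)'."""
    # \(? and \)? make the escaped parentheses optional.
    return re.sub(r"(?m)^(.*?) \(?x2\)?$", r"\1 \1", text)

print(expand_x2("and some other (x2)"))
print(expand_x2("another sentence x2"))
```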

  • mt_12345 Member Posts: 4 Contributor I

    Thanks a lot guys! I need to try it out to see if the results are satisfying. 
