Automatic Text Signal Finder for Binary Response

NoobieNoobie Member Posts: 2 Contributor I
edited October 2019 in Help
I have 2 datasets:

Dataset 1 - this has the response variable and some potential categorical predictors (the response is 1 or 0). Each entity has a unique record (let's call them entities A to Z)

Dataset 2 - this has thousands of records with lots of text for each entity. So each entity could have thousands of rows, each with paragraphs of information

I want to predict the response in Dataset 1 based on the text information in Dataset 2. So here is what I think should happen next:

1) Concatenate the thousands of rows for each entity in Dataset 2 so that the resulting table has one row per entity (with a ton of text per record).

2) Join Dataset 1 with Dataset 2 based on entity ID
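Steps 1 and 2 could look roughly like this in Python with pandas. This is only a sketch: the tiny frames and the column names `entity_id`, `response` and `text` are made up for illustration and are not taken from the actual datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; column names are assumptions.
dataset1 = pd.DataFrame({
    "entity_id": ["A", "B", "C"],
    "response": [1, 0, 1],
})
dataset2 = pd.DataFrame({
    "entity_id": ["A", "A", "B", "C", "C"],
    "text": [
        "late payment notice",
        "account overdue",
        "paid in full",
        "payment overdue again",
        "final notice sent",
    ],
})

# Step 1: concatenate all text rows of an entity into one document per entity.
docs = (
    dataset2.groupby("entity_id")["text"]
    .agg(" ".join)
    .reset_index()
)

# Step 2: join on the entity ID. validate= raises an error if either side
# unexpectedly has more than one row per entity.
merged = dataset1.merge(docs, on="entity_id", how="left", validate="one_to_one")
print(merged)
```

A left join keeps entities that have a response but no text; their `text` value would be missing and would need handling before modelling.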

Assuming the above is correct so far (please correct me if there is a better way, as I haven't done this yet), I am wondering if there's an ML algorithm that could find all the words/phrases/fuzzy combos that are predictive of the response variable in Dataset 1. Please advise!




  • kaymankayman Member Posts: 662 Unicorn
    Not sure how you define the concatenation. There is no problem at all with having multiple rows for the same entity; this is actually what models expect. The whole idea is that a machine is trained to make an educated guess based on this data, so if you have 10 lines with the same entity, the machine needs to be trained to understand why this entity (or label) is given instead of another one.

    The typical approach would be to use the Process Documents from Data operator, split your sentences into tokens, strip all stop words and create a TF-IDF vector set. Be sure to prune enough: if you have plenty of data you can set the pruning boundaries fairly wide, but experiment a bit with it.

    This should give you the most meaningful words for your record set, and this reduced content set is what you can then use to set up a predictive model, where your entities will become your label. Which model works best depends on a few variables, but SVM or Naive Bayes are typically good starting points for this type of challenge.

    All a bit dry and technical, but there are quite a few examples floating around, so hopefully it gets you started.
  • NoobieNoobie Member Posts: 2 Contributor I
    Just to be clear, I am looking to predict a value of 1 or 0 associated with the entity, not the entity itself. The entity itself is like an ID for lack of better description. But, that might not change your response I suppose. 

    In terms of the split you speak of, does this allow the flexibility of phrases? Also, sometimes some words/phrases are not entered in the same order or spelled consistently; is there a way to find predictors that are approximately the same text/phrase?

    Thanks for the response
  • kaymankayman Member Posts: 662 Unicorn
    edited October 2019
    Hi @Noobie, for phrases you could use n-grams, or use the part-of-speech filter. The latter would, for instance, allow you to filter on multiple consecutive nouns, which typically indicate a phrase. But it's a bit on the slow side, so don't use it for big sets, or port it to Python.

    As for the label, it indeed doesn't make a difference whether you have 2 possible options or more. It just changes the models you can use as you go from binary to multiclass, but Naive Bayes handles multiple classes pretty well too.