Partial matching of strings in two different datasets to obtain the match with the higher frequency


df1 <- data.frame(A=c(.87,.11,.44,.45), B=c("I have a beard", "I slept for two hours", "I have had two courses","this is not true"))
df2 <- data.frame(X=c(127,10,433,344,890,4),Y=c("have","beard","syllabus","true","three","maths"))

A    B                       X    Y
.87  I have a beard          127  have
.11  I slept for two hours   NA   NA
.44  I have had two courses  127  have
.45  this is not true        344  true

This dplyr method does not need a join, which is reasonable because the two data frames have no common column to join on. Instead, it pairs every row of df1 with every row of df2 and then checks each pair for a match. As long as you don't have thousands of rows, it will run fast enough. The script could be made shorter, but written this way you can run it step by step to see how it works.

df1<- data.frame(A=c(.87,.11,.44,.45), B=c("I have a beard", "I slept for two hours", "I have had two courses","this is not true"))

df2<- data.frame(X=c(127,10,433,344,890,4),Y=c("have","beard","syllabus","true","three","maths"))

library(dplyr)

df1 %>% 
  rowwise() %>%
  do(data.frame(.,df2)) %>%                    # combine datasets
  do(data.frame(.,flag = grepl(.$Y,.$B))) %>%  # for each row check if there's a match and name it flag
  ungroup %>%
  group_by(A,B) %>%                            # for each A and B
  mutate(N=sum(flag)) %>%                      # count how many matches you have
  filter(flag==TRUE | N == 0) %>%              # keep only A,B where you have some matches or no match at all
  top_n(1,X) %>%                               # pick one row based on max value of X
  ungroup %>%
  mutate(Y = ifelse(flag==FALSE,NA,as.character(Y)),   # if there's no match replace Y with NA
         X = ifelse(flag==FALSE,NA,X)) %>%             # if there's no match replace X with NA
  select(-c(flag,N)) 


#      A                      B   X    Y
# 1 0.87         I have a beard 127 have
# 2 0.11  I slept for two hours  NA   NA
# 3 0.44 I have had two courses 127 have
# 4 0.45       this is not true 344 true
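To see what the first two steps of the pipeline produce, here is a minimal base R sketch of them. It is an illustration under assumptions, not the code above: `merge(..., by = NULL)` stands in for the `rowwise()`/`do()` combination, `fixed = TRUE` is added so each keyword is matched literally rather than as a regular expression, and `pairs` is a name introduced here.

```r
df1 <- data.frame(A = c(.87, .11, .44, .45),
                  B = c("I have a beard", "I slept for two hours",
                        "I have had two courses", "this is not true"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c(127, 10, 433, 344, 890, 4),
                  Y = c("have", "beard", "syllabus", "true", "three", "maths"),
                  stringsAsFactors = FALSE)

# Step 1: pair every row of df1 with every row of df2 (Cartesian product).
pairs <- merge(df1, df2, by = NULL)

# Step 2: flag the pairs where keyword Y occurs inside string B.
# fixed = TRUE is an assumption added here: Y is matched literally.
pairs$flag <- mapply(grepl, pairs$Y, pairs$B,
                     MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

nrow(pairs)                           # 4 * 6 = 24 candidate pairs
pairs[pairs$flag, c("B", "Y", "X")]   # only the pairs that matched
```

The number of candidate pairs is the product of the two row counts, which is why this approach slows down as the data frames grow.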

Experiment by changing the column values to see how the code behaves. That way you may spot bugs in advance.
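If you would rather avoid building every pair in a data frame, the same result can probably be reached in base R alone. The following is a sketch under stated assumptions rather than a drop-in replacement: `best_row` and `result` are names made up for this example, keywords are matched literally with `fixed = TRUE`, and ties on X go to the first matching keyword.

```r
df1 <- data.frame(A = c(.87, .11, .44, .45),
                  B = c("I have a beard", "I slept for two hours",
                        "I have had two courses", "this is not true"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c(127, 10, 433, 344, 890, 4),
                  Y = c("have", "beard", "syllabus", "true", "three", "maths"),
                  stringsAsFactors = FALSE)

# For each string in df1$B, find the df2 row whose keyword occurs in the
# string and has the largest X. NA means no keyword occurred at all.
best_row <- vapply(df1$B, function(b) {
  hits <- which(vapply(df2$Y, function(y) grepl(y, b, fixed = TRUE),
                       logical(1), USE.NAMES = FALSE))
  if (length(hits) == 0) NA_integer_ else hits[which.max(df2$X[hits])]
}, integer(1), USE.NAMES = FALSE)

# Indexing with NA yields a row of NAs, which gives the left-join behaviour.
result <- cbind(df1, df2[best_row, , drop = FALSE])
rownames(result) <- NULL
result
```

Note that substring matching also fires inside longer words (for example "true" inside "untrue"). If whole-word matches are needed, a regular expression with word boundaries would be one option, at the cost of dropping `fixed = TRUE`.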

manpreet Best Answer 3 years ago

I have strings in two datasets and I would like to do a partial match. Here is the code that I have written

I want to do a pmatch and I am expecting the following output

I would like to do a partial match with a left join on df1, keeping the higher of the two matches. For example, in the string "I have a beard", the match "have" has 127 and "beard" has 10, and I want to get the higher one. Any suggestions?
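A short check of why `pmatch` may not be the right tool here: it matches a string against the beginning of the candidates, not against words in the middle of them, whereas `grepl` looks for the keyword anywhere in the string. The variable names below are only for illustration.

```r
sentence <- "I have a beard"

# pmatch only succeeds when the first argument is a prefix of a candidate.
middle_word <- pmatch("have", sentence)   # NA: "have" is not at the start
prefix      <- pmatch("I ha", sentence)   # 1: matches the first candidate

# grepl finds the keyword anywhere in the string.
anywhere <- grepl("have", sentence, fixed = TRUE)   # TRUE
```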


manpreet 3 years ago


Manpreet Singh
