
Given a data frame from the user, the function extracts a network of citations by searching for patterns. The function first constructs a pattern by adding a prefix and a suffix to each text in the pattern_varname column. These patterns are then searched for in the content_varname column, returning a data frame with the row number ("line number") where each match occurred.
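The two steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: the word-boundary prefix and suffix, the use of Perl-style regular expressions, and the names entries, patterns, hits and result are all hypothetical.

```r
# Illustrative sketch only; the real function may build and search patterns differently.
df <- data.frame(
  content = c("Citation (Bob, 2021)", "Another Bob"),
  first_match = c("Bob", NA),
  stringsAsFactors = FALSE
)

prefix <- "\\b"  # assumed prefix: word boundary
suffix <- "\\b"  # assumed suffix: word boundary

# Step 1: build one regex per non-missing entry of the pattern column.
entries <- unique(df$first_match[!is.na(df$first_match)])
patterns <- paste0(prefix, entries, suffix)

# Step 2: search each pattern in the content column and record the row numbers.
hits <- lapply(patterns, function(p) {
  rows <- grep(p, df$content, perl = TRUE)
  found <- regmatches(df$content[rows], regexpr(p, df$content[rows], perl = TRUE))
  data.frame(row_number = rows, matches = found, stringsAsFactors = FALSE)
})
result <- do.call(rbind, hits)
result
```

In this sketch every row of content is searched; the filtering controlled by keep_only_row_without_a_pattern is left out on purpose.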

Usage

get_citations_network_from_df(
  df,
  content_varname = "content",
  pattern_varname = "first_match",
  prefix_for_regex_from_string = "",
  suffix_for_regex_from_string = "",
  keep_only_row_without_a_pattern = TRUE,
  varname_for_matches = "matches"
)

Arguments

df

A data frame containing the data to be processed.

content_varname

character, default = "content". The name of the column containing the text to be searched.

pattern_varname

character, default = "first_match". The name of the column containing the patterns that will be matched.

prefix_for_regex_from_string

character, default = "". A character string to be used as a prefix in the regex pattern.

suffix_for_regex_from_string

character, default = "". A character string to be used as a suffix in the regex pattern.

keep_only_row_without_a_pattern

logical, default = TRUE. If TRUE, keeps only rows without an initial entry for constructing the pattern (i.e. rows of the user-supplied df that have a value in the pattern_varname column are filtered out).

varname_for_matches

character, default = "matches". The name of the column of matches in the returned data frame.
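To see why a prefix and suffix can matter, the following base R sketch compares an entry used as-is with the same entry wrapped in word boundaries. The "\\b" values, the sample strings and the use of grepl() with perl = TRUE are assumptions made for illustration; the function's own matching engine may behave differently.

```r
# Illustrative sketch only: how a prefix/suffix changes what a pattern matches.
content <- c("Citation (Bob, 2021)", "Bobby wrote this")
entry <- "Bob"

# Empty prefix and suffix (the documented defaults): the entry is used as-is,
# so it also matches inside longer words such as "Bobby".
plain <- paste0("", entry, "")
plain_hits <- grepl(plain, content, perl = TRUE)

# Assumed word-boundary prefix and suffix: only the standalone word matches.
bounded <- paste0("\\b", entry, "\\b")
bounded_hits <- grepl(bounded, content, perl = TRUE)

plain_hits    # TRUE TRUE
bounded_hits  # TRUE FALSE
```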

Value

A data frame with the extracted citations network.

Details

The returned data frame contains the following columns:

row_number

The row number of the original data frame where the text is matched.

matches

The text matched by the pattern, e.g., name of a person.

content

The text content in which the pattern was searched, i.e. the column identified by content_varname.

first_match

The original pattern searched for (filled with NA if keep_only_row_without_a_pattern is TRUE).
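The NA values in first_match follow from the row filter. A minimal base R sketch of that filter, assuming it amounts to keeping the rows whose pattern column is missing (the name searched is hypothetical, and the function may implement the filter differently):

```r
# Illustrative sketch only: rows kept when keep_only_row_without_a_pattern = TRUE.
df <- data.frame(
  content = c("Citation (Bob, 2021)", "Another Bob"),
  first_match = c("Bob", NA),
  stringsAsFactors = FALSE
)

# Assumed equivalent of the filter: keep rows with no entry in the pattern column.
searched <- df[is.na(df$first_match), , drop = FALSE]
searched
```

Only the row containing "Another Bob" remains, so its first_match value is necessarily NA.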

Examples

if (FALSE) { # \dontrun{
df <- data.frame(
  content = c("Citation (Bob, 2021)", "Another Bob"),
  first_match = c("Bob", NA)
)
# Returns only the 2nd line (it matches 'Bob')
get_citations_network_from_df(df)
# Returns the lines matching 'Bob'
get_citations_network_from_df(df, keep_only_row_without_a_pattern = FALSE)
} # }