Given one or more languages, local folder paths and/or GitHub repositories, return a list of data frames. The list has the additional class corpus.list. The data frames are: (1) codes and (2) comments, with text metrics about each line; (3) files, with global metrics over the files; (4) functions, with metrics about the functions of the programming project; (5) files.network and (6) functions.network (networks of internal dependencies).

Usage

construct_corpus(
  folders = NULL,
  languages = "R",
  repos = NULL,
  .verbose = F,
  pattern_to_exclude = NULL,
  fn_to_exclude = "warning",
  ...
)

Arguments

folders

character. Default = NULL. A character vector of local folder paths to scan for code files.

languages

character. Default = "R". A character vector specifying the programming language(s) to include in the corpus.

repos

character. Default = NULL. A character vector of GitHub repository URLs or repository identifiers to extract files from (e.g., "user/repo").

.verbose

logical. Default = FALSE. If FALSE, messages in the console are silenced.

pattern_to_exclude

character. Default = NULL. A character string containing a regex used to filter out file paths.

fn_to_exclude

character. Default = "warning". A vector of values that will not be returned as a match (prefix or suffix, nchar to append a suffix, etc.).

...

Arguments passed on to add_doc_network_to_corpus

corpus

corpus.list A corpus.list object returned by the construct_corpus function.

matches_colname

character, default = 'name' The name of the column of the functions df that will be used to construct a regex.

content_colname

character, default = 'code' The name of the column of the functions df that will be searched for matches and from which text will be extracted.

prefix_for_2nd_matches

character A string representing the prefix to add to each first match when it is turned into a new regular expression. The default is an empty string.

suffix_for_2nd_matches

character A string representing a regex to add as a suffix of each match, in order to have a complete regular expression. The default is an empty string.

filter_egolink_within_a_file

logical, default = TRUE. A logical value indicating whether to filter out "ego links" (a document referring to itself).

exclude_quoted_content

logical, default = FALSE. A logical value indicating whether quoted content should be excluded. If set to TRUE, text within " or ' on the same line is removed before the matches are performed.
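
The matching arguments above can be illustrated with a minimal base R sketch. It is not the package's implementation: the small functions data frame, the "\\(" suffix, and the strip_quotes() helper are hypothetical, chosen only to show how a regex built from the name column can be searched in the code column, with quoted content removed first and self-matches dropped.

```r
# Hypothetical stand-in for the functions df (columns 'name' and 'code').
functions <- data.frame(
  name = c("clean_text", "count_words", "report"),
  code = c(
    "clean_text <- function(x) tolower(trimws(x))",
    "count_words <- function(x) lengths(strsplit(clean_text(x), ' '))",
    "report <- function(x) message('see clean_text() for details: ', count_words(x))"
  ),
  stringsAsFactors = FALSE
)

prefix <- ""      # plays the role of prefix_for_2nd_matches
suffix <- "\\("   # plays the role of suffix_for_2nd_matches (assumed value)

# Remove text within " or ' on the same line (cf. exclude_quoted_content = TRUE).
strip_quotes <- function(x) gsub("\"[^\"]*\"|'[^']*'", "", x)
content <- strip_quotes(functions$code)

# One regex per function name, searched in every function body.
patterns <- paste0(prefix, functions$name, suffix)

edges <- do.call(rbind, lapply(seq_along(patterns), function(i) {
  hit <- grepl(patterns[i], content, perl = TRUE)
  data.frame(
    from = functions$name[hit],
    to = rep(functions$name[i], sum(hit)),
    stringsAsFactors = FALSE
  )
}))

# Drop self-references, in the spirit of the ego-link filter.
edges <- edges[edges$from != edges$to, ]
print(edges)
```

Without the strip_quotes() step, the quoted mention of clean_text() inside report would be counted as a call, which is the kind of false match exclude_quoted_content is described as avoiding.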

Value

A list of data frames containing the corpus of collected files. The data frames include columns such as:

file_path

character The local file path or constructed GitHub URL.

line_number

integer The line number within the file.

content

character The content in a line for the corpus.lines df, or the full content of the file.

file_ext

character File extension of the file.

n_char

integer Number of characters - including spacing - in the entire file (files df), a line of the file (codes and comments df), or within the function code (functions df).

n_char_wo_space

integer Number of characters - without spacing - in the entire file (files df), a line of the file (codes and comments df), or within the function code (functions df).

n_word

integer Number of words in the entire file (files df), a line of the file (codes and comments df), or within the function code (functions df).

n_vowel

integer Number of vowels in the entire file (files df), a line of the file (codes and comments df), or within the function code (functions df).

n_total_lines

integer Number of lines: commented lines (comments df), code lines (codes df), lines within the file (files df), or lines within the function code (functions df).

comments

logical TRUE if the entire line is commented. Set to FALSE for the codes df and TRUE for the comments df.

commented

character (only in the codes df) The inline comment, or NA if there is no inline comment.

parameters

character (only in the functions df) The content that defines the default parameters of a function.

code

character (only in the functions df) The code of a function.

n_func

integer (only in the files df) The number of exposed functions within a file.

n_params

integer (only in the functions df) The number of parameters of a function.

freq

integer (only in the files.network df) The number of functions defined in the 'to' file that are called within a 'from' file.

from

character (only in the citations.network df) The function that calls another function (functions.network df) or the local file path or GitHub URL that calls a function defined in another file (files.network df).

to

character (only in the citations.network df) The function that is called (functions.network df) or the local file path or constructed GitHub URL where the called function is defined (files.network df).

file_path_from

character (only in the functions.network df) The file path of the function that calls another function.

file_path_to

character (only in the functions.network df) The file path where the called function is defined.

indeg_fn

integer (only in the files and functions df) Number of functions that call this function (functions df) or number of files with functions that call the functions of this file (files df).

outdeg_fn

integer (only in the files and functions df) Number of functions called by this function (functions df) or number of files where the functions called by the functions of this file are defined (files df).
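
As an illustration of the indeg_fn and outdeg_fn columns at the function level, the sketch below counts distinct callers and callees from a from/to edge list in base R. The net data frame and the function names are hypothetical; this shows the definition given above, not necessarily how the package computes it.

```r
# Hypothetical functions.network-like edge list (caller -> callee).
net <- data.frame(
  from = c("count_words", "report", "report"),
  to   = c("clean_text", "count_words", "clean_text"),
  stringsAsFactors = FALSE
)
fn_names <- c("clean_text", "count_words", "report")

# Count each caller/callee pair once.
net <- unique(net)

# In-degree: number of functions that call this function.
indeg <- table(factor(net$to, levels = fn_names))
# Out-degree: number of functions called by this function.
outdeg <- table(factor(net$from, levels = fn_names))

degrees <- data.frame(
  name = fn_names,
  indeg_fn = as.integer(indeg),
  outdeg_fn = as.integer(outdeg),
  stringsAsFactors = FALSE
)
print(degrees)
```

Using factor() with explicit levels keeps functions that never call or are never called in the result with a degree of 0.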

Details

  • If folders is provided (one or a list), the function scans the directories and retrieves file paths matching the specified languages.

  • If repos is provided (one or a list), it constructs URLs to the raw content of files from the specified GitHub repositories.

  • Both local paths and GitHub URLs can be combined in the final output.
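
The local folder scan described above can be approximated in base R as follows. This is a sketch under stated assumptions: the ext_patterns mapping from language to file-extension regex and the demo folder are hypothetical, and the package may select files differently.

```r
# Build a throwaway folder so the sketch runs anywhere.
root <- file.path(tempdir(), "corpus_demo")
dir.create(file.path(root, "tests"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, c("analysis.R", "helpers.py", "notes.txt"))))
invisible(file.create(file.path(root, "tests", "test_analysis.R")))

# Assumed mapping from language name to a file-extension regex.
ext_patterns <- c(R = "\\.[Rr]$", Python = "\\.py$")
languages <- c("R", "Python")

# Recursively list the files whose extension matches a requested language.
paths <- list.files(
  root,
  pattern = paste(ext_patterns[languages], collapse = "|"),
  recursive = TRUE,
  full.names = TRUE
)

# Drop paths matching an exclusion regex (cf. pattern_to_exclude).
pattern_to_exclude <- "tests/"
kept <- paths[!grepl(pattern_to_exclude, paths)]
print(basename(kept))
```

Here notes.txt is ignored because it matches no requested language, and the file under tests/ is removed by the exclusion regex.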

The returned list is tagged with the class corpus.list, and contains the following attributes:

  • date_creation : a Date indicating when the corpus list was created (from Sys.Date()).

  • have_citations_network : a logical indicating whether a network of internal dependencies was processed (construct_corpus does not return a citations_network, so it will be set to FALSE).

  • languages_patterns : a list with the default patterns associated with the requested languages.

  • duplicated_corpus_lines : a logical. If TRUE, some lines of the codes data.frame are duplicated (it should be FALSE in nearly all cases).
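
These attributes can be read with base R's attr() and inherits(). The sketch below uses a hand-built stand-in object rather than a real call to construct_corpus, so the empty data frames, the class vector c("corpus.list", "list"), and the attribute values are assumptions made for illustration only.

```r
# Hypothetical stand-in for a value returned by construct_corpus().
corpus <- structure(
  list(codes = data.frame(), comments = data.frame()),
  class = c("corpus.list", "list"),  # assumed class vector
  date_creation = Sys.Date(),
  have_citations_network = FALSE,
  languages_patterns = list(),
  duplicated_corpus_lines = FALSE
)

# Check the class tag before relying on the attributes.
is_corpus <- inherits(corpus, "corpus.list")

# Read individual attributes.
created <- attr(corpus, "date_creation")
has_network <- attr(corpus, "have_citations_network")

# Flag the rare case where lines of the codes df are duplicated.
if (isTRUE(attr(corpus, "duplicated_corpus_lines"))) {
  warning("Some lines of the codes data.frame are duplicated.")
}
```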

Examples

# Example 1: Construct a corpus from local folders
corpus <- construct_corpus(folders = "~", languages = c("R", "Python"))
#> Error in sub(re, "", x, perl = TRUE): input string 8 is invalid UTF-8
if (FALSE) { # \dontrun{
# Example 2: Construct a corpus from GitHub repositories (default is R)
cr2 <- construct_corpus(repos = c("tidyverse/stringr", "tidyverse/readr"))

# Example 3: Combine local folders and GitHub repositories
cr3 <- construct_corpus("~", "Python", "prabhupant/python-ds", .verbose = TRUE)
} # }