Construct a List of Data Frames of Lines Read From Files Within GitHub Repositories and/or Local Folders
Given one or more languages and folder path(s) and/or GitHub repo(s),
return a list of data frames. The list has an additional corpus.list class. The data frames are:
(1) codes and (2) comments, with text metrics about each line;
(3) files, with global metrics over the files; (4) functions, with metrics about the functions of the programming project;
(5) files.network and (6) functions.network (networks of internal dependencies).
Usage
construct_corpus(
folders = NULL,
languages = "R",
repos = NULL,
.verbose = F,
pattern_to_exclude = NULL,
fn_to_exclude = "warning",
...
)
Arguments
- folders
character. Default = NULL. A character vector of local folder paths to scan for code files.
- languages
character. Default = "R". A character vector specifying the programming language(s) to include in the corpus.
- repos
character. Default = NULL. A character vector of GitHub repository URLs or repository identifiers to extract files from (e.g., "user/repo").
- .verbose
logical. Default = FALSE. A logical controlling whether messages are printed to the console.
- pattern_to_exclude
character. Default = NULL. A character string containing a regex used to filter out file paths.
- fn_to_exclude
character. Default = "warning". A vector of values that will not be returned as a match (prefix or suffix, nchar to append a suffix, etc.).
- ...
Arguments passed on to add_doc_network_to_corpus:
  - corpus
  character. A corpus.list object from the construct_corpus function.
  - matches_colname
  character. Default = 'name'. The name of the column of the functions df that will be used to construct a regex.
  - content_colname
  character. Default = 'code'. The name of the column of the functions df that will be searched for matches and from which text is extracted.
  - prefix_for_2nd_matches
  character. A string representing the prefix to add to each first match, which is then turned into a new regular expression. The default is an empty string.
  - suffix_for_2nd_matches
  character. A string representing a regex to add as a suffix to each match, in order to form a complete regular expression. The default is an empty string.
  - filter_egolink_within_a_file
  logical. Default = TRUE. A logical value indicating whether to filter results based on "ego links" (a document referring to itself).
  - exclude_quoted_content
  logical. Default = FALSE. A logical value indicating whether quoted content should be excluded. If set to TRUE, text within " or ' on the same line is removed before the matches are computed.
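To illustrate the kind of regex that pattern_to_exclude expects, here is a minimal base R sketch. The file paths and the pattern are hypothetical, and the filtering is done with grepl() for illustration only; how construct_corpus applies the regex internally (for example, on full or relative paths) is not documented here and is an assumption.

```r
# Hypothetical file paths, such as a folder scan might return.
paths <- c(
  "R/construct_corpus.R",
  "tests/testthat/test-corpus.R",
  "inst/scripts/demo.py",
  "renv/activate.R"
)

# A regex in the style of pattern_to_exclude:
# drop every path located under tests/ or renv/.
pattern_to_exclude <- "^(tests|renv)/"

# Keep only the paths that do NOT match the exclusion pattern.
kept <- paths[!grepl(pattern_to_exclude, paths)]
print(kept)
```

Testing a pattern against a handful of known paths this way, before passing it to construct_corpus, can help confirm that it excludes only what is intended.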
Value
A list of data frames containing the corpus of collected files. The data frames include columns such as:
- file_path
character. The local file path or constructed GitHub URL.
- line_number
integer. The line number of the file.
- content
character. The content of a line for the corpus.lines df, or the full content of the file.
- file_ext
character. The file extension of the file.
- n_char
integer. Number of characters (including spacing) in the entire file (files df), in a line of the file (codes and comments df), or within the function code, without commented content (functions df).
- n_char_wo_space
integer. Number of characters (without spacing) in the entire file (files df), in a line of the file (codes and comments df), or within the function code, without commented content (functions df).
- n_word
integer. Number of words in the entire file (files df), in a line of the file (codes and comments df), or within the function code, without commented content (functions df).
- n_vowel
integer. Number of vowels in the entire file (files df), in a line of the file (codes and comments df), or within the function code, without commented content (functions df).
- n_total_lines
integer. Number of commented lines (comments df), code lines (codes df), lines within the file (files df), or lines of the function code, without commented content (functions df).
- comments
logical. TRUE if the entire line is commented. Set to FALSE for the codes df and TRUE for the comments df.
- commented
character. (only in the codes df) Inline comments, or NA if there are no inline comments.
- parameters
character. (only in the functions df) The content that defines the default parameters of a function.
- code
character. (only in the functions df) The code of a function.
- n_func
integer. (only in the files df) The number of exposed functions within a file.
- n_params
integer. (only in the functions df) The number of parameters of a function.
- freq
integer. (only in the files.network df) The number of functions defined in the 'to' file that are called within a 'from' file.
- from
character. (only in the citations.network df) The function that calls another function (functions.network df), or the local file path or GitHub URL that calls a function defined in another file (files.network df).
- to
character. (only in the citations.network df) The function called (functions.network df), or the local file path or constructed GitHub URL where the called function is defined (files.network df).
- file_path_from
character. (only in the functions.network df) The file path of the function that calls another function.
- file_path_to
character. (only in the functions.network df) The file path where the called function is defined.
- indeg_fn
integer. (only in the files and functions df) Number of functions that call this function (functions df), or number of files with functions that call the functions of this file (files df). Internal links are excluded from the indegree and outdegree metrics.
- outdeg_fn
integer. (only in the files and functions df) Number of functions called by this function (functions df), or number of files where the functions called by the functions of this file are defined (files df). Internal links are excluded from the indegree and outdegree metrics.
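The indegree and outdegree columns can be understood as counts over a from/to edge list. The following base R sketch shows one way such counts could be derived. The edge list and function names are hypothetical, self-calls are assumed to be the "internal links" that get excluded, and this is not presented as the package's own implementation.

```r
# Hypothetical edge list shaped like a functions.network data frame:
# each row means the function in `from` calls the function in `to`.
edges <- data.frame(
  from = c("read_files", "read_files", "clean_lines", "clean_lines"),
  to   = c("clean_lines", "count_words", "count_words", "clean_lines"),
  stringsAsFactors = FALSE
)

# Drop self-calls (assumption: these are the excluded "internal links").
edges <- edges[edges$from != edges$to, ]

# Keep one row per distinct caller/callee pair.
edges <- unique(edges)

fn_names <- sort(unique(c(edges$from, edges$to)))

# Indegree: number of distinct functions that call each function.
indeg <- table(factor(edges$to, levels = fn_names))
# Outdegree: number of distinct functions that each function calls.
outdeg <- table(factor(edges$from, levels = fn_names))

degrees <- data.frame(
  name      = fn_names,
  indeg_fn  = as.integer(indeg),
  outdeg_fn = as.integer(outdeg)
)
print(degrees)
```

Using factor() with explicit levels keeps functions with a degree of zero in the result instead of silently dropping them.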
Details
If folders is provided (one or a list), the function scans the directories and retrieves file paths matching the specified languages.
If repos is provided (one or a list), it constructs URLs to the raw content of files from the specified GitHub repositories.
Both local paths and GitHub URLs can be combined in the final output. The returned list is tagged with the class corpus.list, and contains the following attributes:
- date_creation: Date. A Date indicating when the corpus list was created (as Sys.Date()).
- have_citations_network: logical. Indicates whether a network of internal dependencies was processed (construct_corpus doesn't return a citations_network, so it will be set to FALSE).
- languages_patterns: a list with the default patterns associated with the requested languages.
- duplicated_corpus_lines: logical. If TRUE, line(s) of the codes data.frame are duplicated (it should be FALSE in nearly all cases).
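The class and attributes described above can be inspected with base R accessors. The sketch below uses a hand-built mock object rather than a real construct_corpus result: the empty data frames, the attribute values, and the exact class vector are illustrative assumptions.

```r
# Mock object standing in for the value returned by construct_corpus().
# The data frames are empty placeholders; attribute values are illustrative.
mock_corpus <- structure(
  list(
    codes     = data.frame(),
    comments  = data.frame(),
    files     = data.frame(),
    functions = data.frame()
  ),
  class = c("corpus.list", "list"),
  date_creation = Sys.Date(),
  have_citations_network = FALSE,
  duplicated_corpus_lines = FALSE
)

# Check the class tag and read individual attributes.
inherits(mock_corpus, "corpus.list")
attr(mock_corpus, "date_creation")
attr(mock_corpus, "have_citations_network")

# Flag the rare case where lines of the codes data frame are duplicated.
if (isTRUE(attr(mock_corpus, "duplicated_corpus_lines"))) {
  warning("Duplicated lines found in the codes data frame.")
}
```

The same inherits() and attr() calls should apply to an actual corpus.list object, assuming the attributes are stored under the names listed above.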
Examples
# Example 1: Construct a corpus from local folders
corpus <- construct_corpus(folders = "~", languages = c("R", "Python"))
#> Error in sub(re, "", x, perl = TRUE): input string 8 is invalid UTF-8
if (FALSE) { # \dontrun{
# Example 2: Construct a corpus from GitHub repositories (default is R)
cr2 <- construct_corpus(repos = c("tidyverse/stringr", "tidyverse/readr"))
# Example 3: Combine local folders and GitHub repositories
cr3 <- construct_corpus("~", "Python", "prabhupant/python-ds", .verbose = TRUE)
} # }