Construct a List of Data Frames of Lines Read From Files Within GitHub Repositories and/or Local Folders
construct_corpus.Rd
Given one or more languages, local folder path(s), and/or GitHub repo(s), return a list of data frames. The list has an additional corpus.list class. The data frames are: (1) codes and (2) comments, with text metrics about each line; (3) files, with global metrics over the files; (4) functions, with metrics about the functions of the programming project; and (5) files.network and (6) functions.network (networks of internal dependencies).
Usage
construct_corpus(
folders = NULL,
languages = "R",
repos = NULL,
.verbose = F,
pattern_to_exclude = NULL,
fn_to_exclude = "warning",
...
)
Arguments
- folders
character. Default = NULL. A character vector of local folder paths to scan for code files.
- languages
character. Default = "R". A character vector specifying the programming language(s) to include in the corpus.
- repos
character. Default = NULL. A character vector of GitHub repository URLs or repository identifiers to extract files from (e.g., "user/repo").
- .verbose
logical. Default = FALSE. A logical; set to FALSE to silence the messages printed to the console.
- pattern_to_exclude
character. Default = NULL. A character string containing a regex, used to filter out file paths.
- fn_to_exclude
character. Default = "warning". A vector of values that will not be returned as a match (prefix or suffix, nchar to append a suffix, etc.).
- ...
Arguments passed on to add_doc_network_to_corpus
  - corpus
  character. A corpus.list object from the construct_corpus function.
  - matches_colname
  character. Default = 'name'. The name of the column of the functions df that will be used to construct a regex.
  - content_colname
  character. Default = 'code'. The name of the column of the functions df that will be searched for matches and from which text is extracted.
  - prefix_for_2nd_matches
  character. A string used as a prefix for each first match, which is then turned into a new regular expression. The default is an empty string.
  - suffix_for_2nd_matches
  character. A string representing a regex to add as a suffix to each match, in order to obtain a complete regular expression. The default is an empty string.
  - filter_egolink_within_a_file
  logical. Default = TRUE. A logical value indicating whether to filter out "ego links" (a document referring to itself).
  - exclude_quoted_content
  logical. Default = FALSE. A logical value indicating whether quoted content should be excluded. If set to TRUE, text enclosed in " or ' on the same line is removed before matching.
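To illustrate how a pattern_to_exclude regex could filter file paths, here is a minimal sketch using base R grepl(). The paths and the pattern are hypothetical, and the exact matching performed inside construct_corpus (for instance, the regex flavour it uses) is an assumption, not something this page confirms.

```r
# Hypothetical file paths, as they might be found when scanning a folder
paths <- c(
  "R/construct_corpus.R",
  "R/utils.R",
  "tests/testthat/test-corpus.R",
  "inst/doc/vignette.R"
)

# A regex describing the paths to drop (here: anything under tests/ or inst/)
pattern_to_exclude <- "^(tests|inst)/"

# Keep only the paths that do NOT match the pattern
kept <- paths[!grepl(pattern_to_exclude, paths)]
print(kept)
```

Testing a pattern this way on a handful of known paths before passing it to construct_corpus can help avoid excluding more files than intended.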
Value
A list of data.frame containing the corpus of collected files. The data frames include columns such as:
- file_path
character. The local file path or constructed GitHub URL.
- line_number
integer. The line number within the file.
- content
character. The content of a line for the corpus.lines df, or the full content of the file.
- file_ext
character. The file extension of the file.
- n_char
integer. Number of characters, including spacing, in the entire file (files df), in a line of the file (codes and comments df), or within the function code (functions df).
- n_char_wo_space
integer. Number of characters, without spacing, in the entire file (files df), in a line of the file (codes and comments df), or within the function code (functions df).
- n_word
integer. Number of words in the entire file (files df), in a line of the file (codes and comments df), or within the function code (functions df).
- n_vowel
integer. Number of vowels in the entire file (files df), in a line of the file (codes and comments df), or within the function code (functions df).
- n_total_lines
integer. Number of commented lines (comments df), code lines (codes df), lines within the file (files df), or lines within the function code (functions df).
- comments
logical. TRUE if the entire line is commented. Set to FALSE for the codes df and TRUE for the comments df.
- commented
character. (only in the codes df) The inline comment, or NA if there is no inline comment.
- parameters
character. (only in the functions df) The content that defines the default parameters of a function.
- code
character. (only in the functions df) The code of a function.
- n_func
integer. (only in the files df) The number of exposed functions within a file.
- n_params
integer. (only in the functions df) The number of parameters of a function.
- freq
integer. (only in the files.network df) The number of functions defined in the 'to' file that are called within a 'from' file.
- from
character. (only in the files.network and functions.network df) The function that calls another function (functions.network df), or the local file path or GitHub URL that calls a function defined in another file (files.network df).
- to
character. (only in the files.network and functions.network df) The function called (functions.network df), or the local file path or constructed GitHub URL where the called function is defined (files.network df).
- file_path_from
character. (only in the functions.network df) The file path of the function that calls another function.
- file_path_to
character. (only in the functions.network df) The file path where the called function is defined.
- indeg_fn
integer. (only in the files and functions df) Number of functions that call this function (functions df), or number of files with functions that call the functions of this file (files df).
- outdeg_fn
integer. (only in the files and functions df) Number of functions called by this function (functions df), or number of files where the functions called by the functions of this file are defined (files df).
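The indeg_fn and outdeg_fn definitions above can be illustrated with a small sketch in base R. The data frame below is a hypothetical stand-in for a functions.network df (only the from and to columns are used, and the function names are invented); it shows how such degrees can be counted, not how the package computes them internally.

```r
# Toy functions.network-like data frame (hypothetical function names)
fn_network <- data.frame(
  from = c("main", "main", "helper", "report"),
  to   = c("helper", "report", "clean", "clean"),
  stringsAsFactors = FALSE
)

# All functions appearing in the network, as callers or callees
fns <- sort(unique(c(fn_network$from, fn_network$to)))

# In-degree: number of distinct functions that call each function
indeg <- vapply(
  fns,
  function(f) length(unique(fn_network$from[fn_network$to == f])),
  integer(1)
)

# Out-degree: number of distinct functions called by each function
outdeg <- vapply(
  fns,
  function(f) length(unique(fn_network$to[fn_network$from == f])),
  integer(1)
)

degrees <- data.frame(
  name = fns,
  indeg_fn = unname(indeg),
  outdeg_fn = unname(outdeg)
)
print(degrees)
```

In this toy network, clean is called by two functions and calls none, while main calls two functions and is called by none.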
Details
If folders is provided (a single path or several), the function scans the directories and retrieves file paths matching the specified languages. If repos is provided (a single repository or several), it constructs URLs to the raw content of files from the specified GitHub repositories. Both local paths and GitHub URLs can be combined in the final output.
The returned list is tagged with the class corpus.list and contains the following attributes:
- date_creation: a Date indicating when the corpus list was created (as Sys.Date()).
- have_citations_network: a logical indicating whether a network of internal dependencies was processed (construct_corpus does not return a citations_network, so it is set to FALSE).
- languages_patterns: a list with the default patterns associated with the requested languages.
- duplicated_corpus_lines: logical. If TRUE, line(s) of the codes data.frame are duplicated (this should be FALSE in nearly all cases).
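The class tag and attributes listed above can be inspected with base R inherits() and attr(). The sketch below builds a hypothetical stand-in object instead of calling construct_corpus, so the empty data frames, the attribute values, and the order of the class vector are all assumptions made for illustration.

```r
# Hypothetical stand-in for the object returned by construct_corpus();
# the data frames are left empty and the attribute values are made up.
corpus <- structure(
  list(
    codes     = data.frame(),
    comments  = data.frame(),
    files     = data.frame(),
    functions = data.frame()
  ),
  class = c("corpus.list", "list"),  # class order is an assumption
  date_creation = Sys.Date(),
  have_citations_network = FALSE,
  languages_patterns = list(),
  duplicated_corpus_lines = FALSE
)

# Check the class tag before relying on the attributes
is_corpus <- inherits(corpus, "corpus.list")

# Read the attributes described above
created     <- attr(corpus, "date_creation")
has_network <- attr(corpus, "have_citations_network")
has_dup     <- attr(corpus, "duplicated_corpus_lines")

# Flag the unusual case where lines of the codes data frame are duplicated
if (isTRUE(has_dup)) {
  warning("Some lines of the codes data frame are duplicated.")
}

print(names(corpus))
```

Checking duplicated_corpus_lines right after building a corpus is a cheap way to detect the rare case where the codes data frame contains repeated lines.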
Examples
# Example 1: Construct a corpus from local folders
corpus <- construct_corpus(folders = "~", languages = c("R", "Python"))
#> Error in sub(re, "", x, perl = TRUE): input string 8 is invalid UTF-8
if (FALSE) { # \dontrun{
# Example 2: Construct a corpus from GitHub repositories (default is R)
cr2 <- construct_corpus(repos = c("tidyverse/stringr", "tidyverse/readr"))
# Example 3: Combine local folders and GitHub repositories
cr3 <- construct_corpus("~", "Python", "prabhupant/python-ds", .verbose = TRUE)
} # }