Turn a programming project into a corpus
construct_corpus
Minimum Requirements. codexplor::construct_corpus turns a programming project into a text-mining corpus, given:
- the folder path(s) and/or GitHub repo(s) of the project;
- the programming language(s) (default is 'R').
Supported languages. By default, construct_corpus analyzes .R files. One or several languages can be specified. Recognized languages are R and Python.
Examples
library(codexplor)
# Construct a corpus from a GitHub repo
corpus <- construct_corpus(repos = "tidyverse/stringr")
str(corpus, max.level = 1)
#> List of 6
#> $ codes :Classes 'corpus.lines' and 'data.frame': 2089 obs. of 10 variables:
#> $ comments :Classes 'corpus.lines' and 'data.frame': 1460 obs. of 9 variables:
#> $ files :Classes 'corpus.nodelist' and 'data.frame': 37 obs. of 11 variables:
#> $ functions :Classes 'corpus.nodelist' and 'data.frame': 159 obs. of 14 variables:
#> $ functions.network:Classes 'citations.network', 'internal.dependancies' and 'data.frame': 516 obs. of 5 variables:
#> $ files.network :Classes 'citations.network', 'internal.dependancies' and 'data.frame': 94 obs. of 3 variables:
#> - attr(*, "class")= chr [1:2] "list" "corpus.list"
#> - attr(*, "date_creation")= Date[1:1], format: "2025-04-01"
#> - attr(*, "languages_patterns")=List of 1
#> - attr(*, "repos")= chr "tidyverse/stringr"
# pseudo-example : find dependencies within functions
corpus$functions$name[grepl(x = corpus$functions$code, "cli\\:\\:")]
#> [1] "str_like" "str_ilike" "str_match"
#> [4] "str_match_all" "type.character" "type.default"
#> [7] "str_transform_all" "str_split_i" "str_trunc"
#> [10] "no_boundary" "no_empty" "str_view_highlighter"
#> [13] "str_view_special" "print.stringr_view"
# i.e., the stringr functions that call cli::
# or find the functions with more than 6 parameters and print their params
tm_params <- corpus$functions[corpus$functions$n_params > 6, c("name", "params")]
knitr::kable(head(tm_params, 2)) # only the first 2 rows are shown below
| | name | params |
|---|---|---|
| 15 | stop_input_type | x, what, ..., allow_na = FALSE, allow_null = FALSE, show_value = TRUE, arg = caller_arg(x), call = caller_env() |
| 58 | check_string | x, ..., allow_empty = TRUE, allow_na = FALSE, allow_null = FALSE, arg = caller_arg(x), call = caller_env() |
construct_corpus() can deal with several local folders and/or repos at once.
corpus <- construct_corpus(
  c("~", "M:/"),
  languages = c("Python", "R"),
  repos = c("clement-LVD/codexplor", "secdev/scapy")
)
Understand the corpus.list
List of dataframes. construct_corpus returns a corpus.list object, a standard list of dataframes:
- codes (classes corpus.lines & data.frame)
- comments (classes corpus.lines & data.frame)
- files (classes corpus.nodelist & data.frame)
- functions (classes corpus.nodelist & data.frame)
- files.network (classes citations.network, internal.dependencies & data.frame)
- and a functions.network (classes citations.network, internal.dependencies & data.frame)
Returned data.frame. The dataframes of this corpus.list offer insights on the programming project, at various levels:
Name | Level | Example use |
---|---|---|
codes | Line of code | Identify problematic lines, e.g., the longest ones |
comments | Commented line | . |
functions | Function-level metrics | e.g., see the number of parameters, number of internal dependencies, length of the code, etc. for each function |
files | File-level metrics | e.g., quantify the number of functions within files and critical internal dependencies |
files.network | Network (file-level) | Documents network / add metrics to the files df |
functions.network | Network (function-level) | Functions network / add metrics to the functions df |
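Because each element is a regular data.frame, base R is enough to query it. The sketch below uses a small mock-up of the functions data.frame, restricted to the columns that appear earlier in this vignette (name, params, n_params, code). The values are invented for illustration and a real corpus has more columns.

```r
# Mock-up standing in for corpus$functions (invented values, assumption)
functions_df <- data.frame(
  name     = c("hello", "add", "check_input"),
  params   = c("", "i, a", "x, what, allow_na = FALSE"),
  n_params = c(0L, 2L, 3L),
  code     = c("{ print('hi') }", "{ i + a }", "{ stopifnot(is.character(x)); x }"),
  stringsAsFactors = FALSE
)

# Length of each function's code, in characters
functions_df$code_length <- nchar(functions_df$code)

# Functions sorted from the longest to the shortest code
functions_df[order(-functions_df$code_length), c("name", "n_params", "code_length")]
```

The same kind of one-liner should apply to the other dataframes of the corpus.list, once their column names have been checked with str() or names().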
The files.network and functions.network df both have the classes citations.network and internal.dependencies: they describe the network of internal dependencies of the project. See the vignette dedicated to these citations.network dataframes.
Technical Details
Match function names. construct_corpus relies on some default patterns, in order to analyze a programming project in a standardized way. According to these patterns, construct_corpus will:
- read the files according to their extension, e.g., the R language is associated with '.R' files.
- process commented lines according to each language definition, e.g., R & Python use '#' and don’t allow multi-line comments, contrary to JavaScript.
- isolate the exposed content, e.g., not quoted and not within a '{ }' for most of the programming languages.
- extract function names from the exposed content, in order to list the internally exposed functions of the programming project.
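The first two steps above can be sketched in base R. This is only an illustration under stated assumptions, not the patterns that codexplor actually ships: the helper names (guess_language, split_comments) and the extension table are hypothetical, and the comment test is deliberately naive (it only detects lines that start with '#', not trailing comments or a '#' inside a string).

```r
# Hypothetical extension table (assumption, not codexplor's internal data)
ext_to_language <- c(R = "R", py = "Python")

# Guess the language of each file from its extension
guess_language <- function(paths) {
  ext <- tools::file_ext(paths)
  unname(ext_to_language[ext])
}

# Separate full-line comments from code lines
split_comments <- function(lines, comment_prefix = "#") {
  is_comment <- startsWith(trimws(lines), comment_prefix)
  list(codes = lines[!is_comment], comments = lines[is_comment])
}

guess_language(c("R/utils.R", "scripts/run.py"))

src <- c("# say hello", "hello <- function() {", "  print('hi')", "}")
parts <- split_comments(src)
parts$comments
```

Keeping the language-specific parts (extension, comment prefix) as data rather than code is what would let the same function handle several languages.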
Examples. The table below shows a hello function defined in several languages: construct_corpus has to match 'hello' despite the differences between languages.
Language | Example | Definition keyword | Operator after keyword | Operator after name | Start-instructions operator | Prefix to exclude | Regex fn parameters after names | Anonymous |
---|---|---|---|---|---|---|---|---|
Python | def hello(): pass | def | | ( | : | | ( | FALSE |
R | hello <- function() { } | function | ( | <-\|= | { | FUN\|error | (<-\|=)function( | TRUE |
Here, R is the only language that assigns an anonymous function to an object. Thus, contrary to other languages, the name of an R function has to be found before the keyword ‘function’, the parameter names are within the parentheses after ‘function(’, etc.
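A minimal sketch of this difference, in base R. The two regular expressions below are simplified assumptions inspired by the table, not the patterns used by construct_corpus, and extract_fn_name is a hypothetical helper.

```r
# Simplified patterns (assumptions): the first capture group holds the name.
# In R the name comes BEFORE the keyword, in Python it comes AFTER it.
fn_patterns <- c(
  R      = "^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*(<-|=)\\s*function\\s*\\(",
  Python = "^\\s*def\\s+([A-Za-z_][A-Za-z0-9_]*)\\s*\\("
)

extract_fn_name <- function(line, language) {
  m <- regmatches(line, regexec(fn_patterns[[language]], line, perl = TRUE))[[1]]
  # regmatches() returns character(0) when the line does not match
  if (length(m) == 0) NA_character_ else m[2]
}

extract_fn_name("hello <- function() { }", "R")
extract_fn_name("def hello(): pass", "Python")
extract_fn_name("print('hello')", "Python")  # no 'def' keyword, so no match
```

The first two calls return the name captured by the pattern, and the last one returns NA because the line does not define a function.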
Unmatched definition
R. Some ways of assigning a function in R are currently not supported. Support for the following cases is planned:
- The right-hand assignment style is not recognized yet by codexplor::construct_corpus(). According to Google’s R Style Guide:
  "This convention differs substantially from practices in other languages and makes it harder to see in code where an object is defined. E.g. searching for foo <- is easier than searching for foo <- and -> foo"
  (https://google.github.io/styleguide/Rguide.html#pipes).
- The re-assignment of an R function to a new object is not supported yet, as shown hereafter.
aaa <- function(i, a){i + a} # construct_corpus matches the 'aaa' function
my_func_a <- aaa             # but does not notice the creation of an alias
my_func_a(1,2)               # will not be matched during the post-processing
Depending on the length of the function’s name, this can lead codexplor::construct_corpus() to a small mistake about this function: the 'aaa' on the right-hand side of the assignment is counted as a normal call of the function, instead of being recognized as a ‘re-assignment’.
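One possible way to catch such aliases is sketched below in base R, under assumptions: known_functions stands for the function names already matched in the project, and find_aliases is a hypothetical helper that is not part of codexplor. It flags the lines that assign a bare, already known function name to a new object.

```r
find_aliases <- function(lines, known_functions) {
  # A bare 'new_name <- old_name' (or '=') with nothing else on the line
  pattern <- "^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*(<-|=)\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*$"
  m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  # Keep the matching lines whose right-hand side is a known function
  hits <- Filter(function(x) length(x) > 0 && x[4] %in% known_functions, m)
  data.frame(
    alias    = vapply(hits, function(x) x[2], character(1)),
    original = vapply(hits, function(x) x[4], character(1)),
    stringsAsFactors = FALSE
  )
}

code <- c(
  "aaa <- function(i, a){i + a}",
  "my_func_a <- aaa",
  "my_func_a(1, 2)"
)
find_aliases(code, known_functions = "aaa")
```

With such a table of aliases, a later call to my_func_a() could be attributed to aaa. This sketch ignores trailing comments and right-hand assignment.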
Python. The function ‘hello’, defined as a lambda (anonymous) function in the Python line hereafter, is not matched yet by construct_corpus:
hello = lambda: None
One-liner code-extraction limits
Usually the code of a function is within ‘{’ and ‘}’ brackets (e.g., R, C, C++, C#, Java, JavaScript, Swift, Kotlin, Go). codexplor::construct_corpus relies on these brackets in order to extract the code of a function into the functions data.frame of the corpus.list. Thus, the code of a one-liner function is excluded from the functions data.frame if it is not within brackets. Contrary to this standardized pattern, Python uses indentation to create a block, and thus the code of a one-liner function is matched.
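The bracket-based extraction can be illustrated with a minimal base R sketch. It is an assumption about the general technique (counting opening and closing brackets), not the code of construct_corpus. The helper extract_body is hypothetical and ignores brackets found inside strings or comments.

```r
extract_body <- function(code) {
  chars <- strsplit(code, "")[[1]]
  start <- match("{", chars)
  if (is.na(start)) return(NA_character_)  # no bracket: nothing to extract
  # Nesting depth after each character: +1 for '{', -1 for '}'
  depth <- cumsum((chars == "{") - (chars == "}"))
  # First position at or after 'start' where every bracket is closed again
  end <- which(depth == 0 & seq_along(chars) >= start)[1]
  if (is.na(end)) return(NA_character_)    # unbalanced brackets
  paste(chars[start:end], collapse = "")
}

# Nested brackets are kept together
extract_body("add <- function(i, a) { if (i > 0) { i + a } else { a } }")

# A one-liner without brackets yields no code
extract_body("add_one <- function(x) x + 1")
```

The second call returns NA, which mirrors the limit described above: without brackets there is no delimiter to rely on.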