
construct_corpus

Minimum Requirements. codexplor::construct_corpus turns a programming project into a text-mining corpus, given:

  • the folder path(s) and/or GitHub repo(s) of the project;
  • the programming language(s); the default is 'R'.

Supported languages. By default, construct_corpus analyzes .R files. One or several languages can be specified. The recognized languages are R and Python.

Examples

library(codexplor)
# Construct a corpus from github
corpus <- construct_corpus(repos =  "tidyverse/stringr")
str(corpus, max.level = 1) 
#> List of 6
#>  $ codes            :Classes 'corpus.lines' and 'data.frame':    2089 obs. of  10 variables:
#>  $ comments         :Classes 'corpus.lines' and 'data.frame':    1460 obs. of  9 variables:
#>  $ files            :Classes 'corpus.nodelist' and 'data.frame': 37 obs. of  11 variables:
#>  $ functions        :Classes 'corpus.nodelist' and 'data.frame': 159 obs. of  14 variables:
#>  $ functions.network:Classes 'citations.network', 'internal.dependancies' and 'data.frame':  516 obs. of  5 variables:
#>  $ files.network    :Classes 'citations.network', 'internal.dependancies' and 'data.frame':  94 obs. of  3 variables:
#>  - attr(*, "class")= chr [1:2] "list" "corpus.list"
#>  - attr(*, "date_creation")= Date[1:1], format: "2025-04-01"
#>  - attr(*, "languages_patterns")=List of 1
#>  - attr(*, "repos")= chr "tidyverse/stringr"
# pseudo-example : find dependencies within functions
corpus$functions$name[grepl(x = corpus$functions$code, "cli\\:\\:")]
#>  [1] "str_like"             "str_ilike"            "str_match"           
#>  [4] "str_match_all"        "type.character"       "type.default"        
#>  [7] "str_transform_all"    "str_split_i"          "str_trunc"           
#> [10] "no_boundary"          "no_empty"             "str_view_highlighter"
#> [13] "str_view_special"     "print.stringr_view"
# e.g., the stringr functions that call cli::

# or find the functions with more than 6 parameters and print their params
tm_params <- corpus$functions[corpus$functions$n_params > 6, c("name", "params")]
knitr::kable(head(tm_params, 2)) # only the first 2 rows are shown below
      name             params
15    stop_input_type  x, what, …, allow_na = FALSE, allow_null = FALSE, show_value = TRUE, arg = caller_arg(x), call = caller_env()
58    check_string     x, …, allow_empty = TRUE, allow_na = FALSE, allow_null = FALSE, arg = caller_arg(x), call = caller_env()

construct_corpus() can handle several local folders and/or repos in a single call.

corpus <- construct_corpus(
  c("~", "M:/"),
  languages = c("Python", "R"),
  repos = c("clement-LVD/codexplor", "secdev/scapy")
)

Understand the corpus.list

List of data frames. construct_corpus returns a corpus.list object, a standard list of data frames:

  • codes (classes corpus.lines & data.frame)
  • comments (classes corpus.lines & data.frame)
  • files (classes corpus.nodelist & data.frame)
  • functions (classes corpus.nodelist & data.frame)
  • files.network (classes citations.network, internal.dependancies & data.frame)
  • functions.network (classes citations.network, internal.dependancies & data.frame)

Returned data frames. The data frames of the corpus.list offer insights into the programming project at various levels:

Name               Level                     e.g.
codes              Line of code              Identify problematic lines, e.g., the longest ones
comments           Commented line            .
functions          Function-level metrics    e.g., see the number of parameters, number of internal dependencies, length of the code, etc. for each function
files              File-level metrics        e.g., quantify the number of functions within files and critical internal dependencies
files.network      Network (file-level)      Documents network / add metrics to the files df
functions.network  Network (function-level)  Functions network / add metrics to the functions df
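
As an illustration of the function-level metrics, here is a minimal sketch on a toy data frame that stands in for corpus$functions. Only the columns used in the examples above (name, code, n_params) are reproduced; the values and the n_char column are made up for the illustration.

```r
# Toy stand-in for corpus$functions: the values are invented, only the column
# names (name, code, n_params) come from the examples above.
functions <- data.frame(
  name     = c("short_fn", "long_fn", "wide_fn"),
  code     = c("{ x }", "{ y <- x + 1; z <- y * 2; z }", "{ a + b }"),
  n_params = c(1L, 1L, 7L),
  stringsAsFactors = FALSE
)

# Function-level metric: rank the functions by the length of their code
# (n_char is a hypothetical helper column, not part of the corpus.list)
functions$n_char <- nchar(functions$code)
by_length <- functions[order(-functions$n_char), c("name", "n_char")]

# Same filter as in the examples above: functions with more than 6 parameters
many_params <- functions$name[functions$n_params > 6]
```

The same two expressions should apply to a real corpus$functions, as long as the code and n_params columns are present.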

The files.network and functions.network data frames have the classes citations.network and internal.dependancies. See the vignette dedicated to these citations.network data frames.

Technical Details

Match function names. construct_corpus relies on a set of default patterns in order to analyze a programming project in a standardized way. According to these patterns, construct_corpus will:

  1. read the files according to their extension, e.g., the R language is associated with '.R' files.
  2. process commented lines according to each language's definition, e.g., R and Python use '#' and do not allow multi-line comments, contrary to JavaScript.
  3. isolate the exposed content, e.g., content that is neither quoted nor within '{ }' for most programming languages.
  4. extract function names from the exposed content, in order to list the internally exposed functions of the programming project.
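
The four steps above can be sketched with base R on a few in-memory lines. This is only an illustration under simplifying assumptions (a naive '#' split, brace counting that ignores quoted braces, a hand-written name pattern); it is not the code or the patterns used by codexplor.

```r
# Step 1 stand-in: lines as readLines() would return them for a '.R' file
lines <- c(
  "# a full-line comment",
  "hello <- function(x) { # trailing comment",
  "  inner <- function(y) { y }",
  "  paste('hi', x)",
  "}",
  "bye = function() { 'bye' }"
)

# Step 2: split comments from code (naive: a '#' inside a string is cut too)
comments <- sub("^[^#]*", "", lines)
comments <- comments[nzchar(comments)]
codes <- sub("#.*$", "", lines)

# Step 3: keep the 'exposed' lines, i.e. lines that start at brace depth 0
opens  <- lengths(regmatches(codes, gregexpr("{", codes, fixed = TRUE)))
closes <- lengths(regmatches(codes, gregexpr("}", codes, fixed = TRUE)))
depth_before <- cumsum(opens - closes) - (opens - closes)
exposed <- codes[depth_before == 0]

# Step 4: extract the name assigned just before the 'function' keyword
pattern <- "^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*(<-|=)\\s*function\\s*\\("
hits <- regmatches(exposed, regexec(pattern, exposed))
fn_names <- vapply(hits[lengths(hits) > 0], `[`, character(1), 2)
```

In this sketch the nested function 'inner' is not listed, because its definition sits inside the braces of 'hello' and is therefore not exposed.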

Examples. The table below shows a hello function defined in several languages: construct_corpus has to match 'hello' despite the differences between languages.

Field                                Python             R
Example definition                   def hello(): pass  hello <- function() { }
Keyword                              def                function
Operator after keyword                                  (
Operator after name                  (                  <-|=
Start-instructions operator          :                  {
Prefix to exclude                                       FUN|error
Regex for fn parameters after names  (                  (<-|=)function(
Anonymous                            FALSE              TRUE

Here, R is the only language that assigns an anonymous function to an object. Thus, contrary to other languages, the name of an R function has to be found before the keyword ‘function’, the R parameter names are within the parentheses after ‘function(’, etc.
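
To make the difference concrete, here is a hedged sketch with two hand-written regular expressions, one per language. The name_patterns list and the match_name helper are hypothetical and only loosely modelled on the table above; they are not the patterns shipped with codexplor.

```r
# Hypothetical per-language patterns (NOT the ones used by codexplor):
# in R the name comes before the keyword, in Python it comes after it.
name_patterns <- list(
  R      = "^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*(?:<-|=)\\s*function\\s*\\(",
  Python = "^\\s*def\\s+([A-Za-z_][A-Za-z0-9_]*)\\s*\\("
)

definitions <- c(
  R      = "hello <- function() { }",
  Python = "def hello(): pass"
)

# Apply a pattern to one line and return capture group 1, or NA if no match
match_name <- function(line, pattern) {
  m <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

# Each definition is matched with the pattern of its own language
matched <- mapply(match_name, definitions, name_patterns[names(definitions)])
```

Both entries of matched hold the same function name, although the two definitions place it on opposite sides of the keyword.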

Unmatched definition

R. Some ways of assigning a function in R are not supported yet. Support for the following cases is planned:

  • Right-hand assignment (->) is not supported yet. The Google R style guide advises against it: "This convention differs substantially from practices in other languages and makes it harder to see in code where an object is defined. E.g. searching for foo <- is easier than searching for foo <- and -> foo" (https://google.github.io/styleguide/Rguide.html#pipes).

  • Re-assigning an R function to a new object (an alias) is not supported yet, as shown below.

    aaa <- function(i, a){i + a} # construct_corpus matches the 'aaa' function
    my_func_a <- aaa             # but does not notice the creation of an alias
    my_func_a(1,2)               # this call will not be matched during the post-processing

Depending on the length of the function’s name, this leads codexplor::construct_corpus() to a small mistake about this function: the 'aaa' on the right-hand side of the assignment is recognized as a normal call of the function, instead of being caught as a re-assignment.
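
One possible way to spot such aliases, sketched here with base R, is to look for an assignment whose right-hand side is exactly the name of an already matched function. The known_functions vector and the alias_pattern are assumptions made for this illustration; this is not how codexplor works today.

```r
# Names already matched as function definitions (assumed to be known)
known_functions <- c("aaa")

code_lines <- c(
  "aaa <- function(i, a){i + a}",
  "my_func_a <- aaa",
  "my_func_a(1,2)",
  "total <- aaa(1, 2)"
)

# An alias is '<new name> <- <known function>' with nothing after it; an
# opening parenthesis after the name would make it a call instead.
# Note: names containing '.' would need escaping before being pasted here.
alias_pattern <- sprintf(
  "^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*(<-|=)\\s*(%s)\\s*$",
  paste(known_functions, collapse = "|")
)
hits <- regmatches(code_lines, regexec(alias_pattern, code_lines))
hits <- hits[lengths(hits) > 0]

# Capture groups: 2 = new name, 3 = operator, 4 = original function
aliases <- data.frame(
  alias  = vapply(hits, `[`, character(1), 2),
  target = vapply(hits, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
```

With such a table, calls to my_func_a could then be attributed to 'aaa' during post-processing.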

Python. A function defined as a lambda (anonymous) function is not matched yet by construct_corpus, e.g., the function ‘hello’ in the Python line below:

 hello = lambda: None
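
For illustration only, a pattern along the following lines could catch a lambda assigned to a name at the start of a line. It is written in R, like the rest of the package; lambda_pattern is a hypothetical pattern, not one that construct_corpus currently uses.

```r
python_lines <- c(
  "hello = lambda: None",
  "add = lambda a, b: a + b",
  "def world(): pass",
  "result = sorted(items, key=lambda x: x[0])"
)

# '<name> = lambda' at the start of a line; a lambda passed as an argument
# (last line above) is deliberately left out, since it is never named.
lambda_pattern <- "^\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*=\\s*lambda\\b"
hits <- regmatches(
  python_lines,
  regexec(lambda_pattern, python_lines, perl = TRUE)
)
lambda_names <- vapply(hits[lengths(hits) > 0], `[`, character(1), 2)
```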
 

One-liner code-extraction limits

Usually, the code of a function is enclosed in ‘{’ and ‘}’ braces (e.g., R, C, C++, C#, Java, JavaScript, Swift, Kotlin, Go). codexplor::construct_corpus relies on these braces to extract the code of a function into the functions data.frame of the corpus.list. Thus, the code of a one-liner function is excluded from the functions data.frame if it is not within braces. Python is the exception to this standardized pattern: it uses indentation to delimit a block, and thus the code of a one-liner function is matched.
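
The effect of relying on braces can be reproduced with a small base R sketch. The extract_body helper is hypothetical and much simpler than a real extraction (it takes the text between the first '{' and the last '}'), but it shows why a one-liner without braces yields no code.

```r
definitions <- c(
  braced    = "add <- function(a, b) { a + b }",
  one_liner = "inc <- function(x) x + 1"
)

# Return the text between the first '{' and the last '}', or NA when the
# definition has no braces at all (hypothetical helper, for illustration)
extract_body <- function(def) {
  open_pos  <- regexpr("{", def, fixed = TRUE)
  close_pos <- max(gregexpr("}", def, fixed = TRUE)[[1]])
  if (open_pos < 0 || close_pos < open_pos) return(NA_character_)
  trimws(substr(def, open_pos + 1, close_pos - 1))
}

bodies <- vapply(definitions, extract_body, character(1))
```

Here the body of 'add' is recovered, while 'inc' gets NA. Wrapping the one-liner in braces, i.e. inc <- function(x) { x + 1 }, makes it extractable under this approach.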