README.md
In lucasan93/dadosCVM: Obtains investment funds' data from CVM

dadosCVM

The goal of dadosCVM is to obtain and adjust data of investments funds registered on Comissão de Valores Mobiliários (Security and Exchange Commissions from Brazil). Data is collected from files available at http://dados.cvm.gov.br/dados/.

Main goals/ideas are descripted below:

cad_fi(): Download the latest Registration Data (‘Dados Cadastrais’) available on CVM at http://dados.cvm.gov.br/dados/FI/CAD/DADOS/ - 90%;
dados_diarios(): Download the daily fund information (CNPJ, fund raisings and withdraws, investors’ amount, equity value, share value) of selected funds between two dates. Data is available at http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/ - under development;
dados_fidc(): Develop a function that downloads monthly FIDC data between two dates from a specified table of the FIDC files located at http://dados.cvm.gov.br/dados/FIDC/DOC/INF_MENSAL/DADOS/ - under development;
simpl_names(): Develop a function that simplifies funds names (i.e. transforms ‘FUNDO DE INVESTIMENTO EM DIREITOS CREDITORIOS’ to ‘FIDC’) - 0%.

You can install the development version of dadosCVM from GitHub with:

# install.packages("devtools")
devtools::install_github("lucasan93/dadosCVM")

The function cad_fi() downloads the latest registration data available on CVM.

library(dadosCVM)
library(dplyr)
library(kableExtra)

Below the top-10 funds in operation are extracted, sorted by their equity and identified by their CNPJ (Brazil National Registry of Legal Entities number). Note that the data.frame contains 40 columns and more than 60 thousand observations.

dados_cadastrais <- cad_fi() %>%
                      filter(situacao == 'EM FUNCIONAMENTO NORMAL') %>% 
                      select(cnpj,
                             classe,
                             tipo,
                             inicio_atv,
                             pl_data,
                             pl) %>% 
                      arrange(desc(pl)) %>% 
                      slice_head(n = 10) %>% 
                      mutate(pl = format(pl, big.mark = ','))

kable(dados_cadastrais)

cnpj classe tipo inicio\_atv pl\_data pl 01.608.573/0001-65 Fundo de Renda Fixa FI 1997-01-02 2022-03-23 159,651,227,089 27.146.328/0001-77 Fundo de Renda Fixa FI 2017-05-03 2022-03-23 137,429,549,362 07.593.972/0001-86 Fundo de Renda Fixa FI 2005-09-21 2022-03-23 122,607,768,840 22.985.157/0001-56 FIP Multi FIP 2015-09-01 2021-12-31 107,568,311,434 01.597.187/0001-15 Fundo de Renda Fixa FI 1996-12-30 2022-03-23 95,925,774,092 00.822.055/0001-87 Fundo de Renda Fixa FI 1995-10-02 2022-03-23 94,753,353,219 42.592.302/0001-46 Fundo de Renda Fixa FI 2021-09-30 2022-03-23 72,417,562,362 42.592.315/0001-15 Fundo de Renda Fixa FI 2021-09-30 2022-03-23 72,412,668,571 04.288.966/0001-27 Fundo de Renda Fixa FI 2002-07-15 2022-03-23 59,114,466,417 03.737.219/0001-66 Fundo de Renda Fixa FI 2005-08-24 2022-03-23 55,208,583,595

The function dados_diarios() downloads daily data (CNPJ, portfolio value, share value, equity, withdraws, fund raisings, and number of investors) for specifics fund (identified by theis CNPJs) between two given dates. Note that daily data is only available for funds of type ‘FI’ and that initial date must be greater than 2005-01-01.

library(dadosCVM)
library(dplyr)
library(ggplot2)
library(scales)

Below we obtain data from funds displayed in the previous example and plot their equity value over time:

dados_diarios(cnpj  = dados_cadastrais$cnpj,
              start = as.Date('2005-01-01'),
              end   = as.Date('2022-03-01')) %>% 
      select(data, cnpj, pl) %>% 
      ggplot() +
      aes(x = data, y = pl, color = cnpj) +
      geom_line() +
      scale_y_continuous(labels = unit_format(unit = "Bi", scale = 1e-9)) +
      xlab('Date') +
      ylab('Equity (BRL)') +
      theme_minimal()
#> [1] "Obtaining data between 2021-01-01 and 2022-03-01"
#> [1] "Obtaining data between 2005-01-01 and 2022-03-01"

The function dados_fidc() downloads monthly data for a group of funds (identified by theis CNPJs) between two given dates and for a specified database (identified in CVM’s FIDC’s META file) out of 16 different databases. Initial date must be greater than 2013-01-01.

This function is currently under development. I’m working on simplifying column names and adding the opting to pivot the data.frame format from wide to long.

library(dadosCVM)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(scales)
library(kableExtra)

First, let’s select the top-10 FIDCs in operation with the highest equity value and extract their CNPJs and the operations’ start date using the cad_fi function:

infos <- cad_fi() %>% 
          filter(situacao == 'EM FUNCIONAMENTO NORMAL',
                 tipo     == 'FIDC') %>% 
          arrange(desc(pl)) %>% 
          slice_head(n = 10) %>% 
          select(cnpj,
                 inicio_atv)

Now we obtain the database which provides us with information about the top-10 funds’ aggregated portfolio. CVM has a ‘META’ file describing the content of each FIDC database. A quick look at it and we find that the one we are interested in is the ‘II’ table. Soon the package will provide a reference database so this search can be done quicker. Note that although the oldest fund started in 2009, the function automatically set the start date to 2013-01-01 since there’s no data prior to that date. Note that the function pivots the CVM databases to longer and provides two identification columns (‘segment’ and ‘item’), which greatly facilitates analysis and plots.

dados_fidc(cnpj  = infos$cnpj,
           start = min(infos$inicio_atv),
           end   = as.Date('2022-02-01'),
           table = 'II') %>% 
        replace(is.na(.), 0) %>% 
        filter(value > 0) %>% 
        group_by(data, segment) %>% 
        summarise(value = sum(value)) %>% 
        ggplot() +
        aes(x = data, y = value, fill = segment) +
        geom_bar(position = 'stack', stat     = 'identity') +
        scale_y_continuous(labels = unit_format(unit = "Bi", scale = 1e-9)) +
        xlab('Date') +
        ylab('Value (BRL)') +
        theme_minimal()
#> [1] "Obtaining data between 2019-01-01 and 2022-02-01"
#> [1] "Obtaining data between 2013-01-01 and 2018-12-31"

If we are interested in analyzing the items within a specific segment, let’s say ‘financeiro’, we can plot the following:

dados_fidc(cnpj  = infos$cnpj,
           start = min(infos$inicio_atv),
           end   = as.Date('2022-02-01'),
           table = 'II') %>% 
        replace(is.na(.), 0) %>% 
        filter(value > 0, segment == 'financeiro') %>% 
        group_by(data, item) %>% 
        summarise(value = sum(value)) %>% 
        ggplot() +
        aes(x = data, y = value, fill = item) +
        geom_bar(position = 'stack', stat     = 'identity') +
        scale_y_continuous(labels = unit_format(unit = "Bi", scale = 1e-9)) +
        xlab('Date') +
        ylab('Value (BRL)') +
        theme_minimal()
#> [1] "Obtaining data between 2019-01-01 and 2022-02-01"
#> [1] "Obtaining data between 2013-01-01 and 2018-12-31"

And in order to verify how the segments and its items are organized based on their CVM categories, take a look at the defs_fidc database within the package. Note that the observations in which item == ‘total’ are excluded when running the dados_fidc() function in order to avoid double counting.

defs_fidcs %>% 
    filter(str_detect(category, '^TAB_II_')) %>% 
    kable(align = c('l', 'l')) %>% 
    kable_styling(latex_options = 'striped')

category table id base segment class item TAB\_II\_VL\_CARTEIRA II II carteira carteira total total TAB\_II\_A\_VL\_INDUST II II.A carteira industrial industrial industrial TAB\_II\_B\_VL\_IMOBIL II II.B carteira imobiliario imobiliario imobiliario TAB\_II\_C\_VL\_COMERC II II.C carteira comercial total total TAB\_II\_C1\_VL\_COMERC II II.C.1 carteira comercial comercial comercial TAB\_II\_C2\_VL\_VAREJO II II.C.2 carteira comercial varejo varejo TAB\_II\_C3\_VL\_ARREND II II.C.3 carteira comercial arrendamento mercantil arrendamento mercantil TAB\_II\_D\_VL\_SERV II II.D carteira servicos total total TAB\_II\_D1\_VL\_SERV II II.D.1 carteira servicos servicos servicos TAB\_II\_D2\_VL\_SERV\_PUBLICO II II.D.2 carteira servicos servicos publicos servicos publicos TAB\_II\_D3\_VL\_SERV\_EDUC II II.D.3 carteira servicos servicos educacionais servicos educacionais TAB\_II\_D4\_VL\_ENTRET II II.D.4 carteira servicos servicos entretenimento servicos entretenimento TAB\_II\_E\_VL\_AGRONEG II II.E carteira agronegocio agronegocio agronegocio TAB\_II\_F\_VL\_FINANC II II.F carteira financeiro total total TAB\_II\_F1\_VL\_CRED\_PESSOA II II.F.1 carteira financeiro credito pessoal credito pessoal TAB\_II\_F2\_VL\_CRED\_PESSOA\_CONSIG II II.F.2 carteira financeiro credito pessoal consignado credito pessoal consignado TAB\_II\_F3\_VL\_CRED\_CORP II II.F.3 carteira financeiro credito corporativo credito corporativo TAB\_II\_F4\_VL\_MIDMARKET II II.F.4 carteira financeiro middle market middle market TAB\_II\_F5\_VL\_VEICULO II II.F.5 carteira financeiro veiculos veiculos TAB\_II\_F6\_VL\_IMOBIL\_EMPRESA II II.F.6 carteira financeiro carteira imobiliaria empresarial carteira imobiliaria empresarial TAB\_II\_F7\_VL\_IMOBIL\_RESID II II.F.7 carteira financeiro carteira imobiliaria residencial carteira imobiliaria residencial TAB\_II\_F8\_VL\_OUTRO II II.F.8 carteira financeiro outros outros TAB\_II\_G\_VL\_CREDITO II II.G carteira cartao de credito cartao de credito cartao de credito TAB\_II\_H\_VL\_FACTOR II II.H carteira factoring total total TAB\_II\_H1\_VL\_PESSOA II II.H.1 carteira factoring pessoal pessoal TAB\_II\_H2\_VL\_CORP II II.H.2 carteira factoring corporativo corporativo TAB\_II\_I\_VL\_SETOR\_PUBLICO II II.I carteira setor publico total total TAB\_II\_I1\_VL\_PRECAT II II.I.1 carteira setor publico precatorios precatorios TAB\_II\_I2\_VL\_TRIBUT II II.I.2 carteira setor publico creditos tributarios creditos tributarios TAB\_II\_I3\_VL\_ROYALTIES II II.I.3 carteira setor publico royalties royalties TAB\_II\_I4\_VL\_OUTRO II II.I.4 carteira setor publico outros outros TAB\_II\_J\_VL\_JUDICIAL II II.J carteira acoes judiciais acoes judiciais acoes judiciais TAB\_II\_K\_VL\_MARCA II II.K carteira prop intelectual prop intelectual prop intelectual

Now let’s start developing a fund’s balance sheet. We can see through the tabs_fidc database available in the dadosCVM package that the assets’ data are located in the first table:

tabs_fidcs %>% 
kable(align = c('l', 'l')) %>% 
    kable_styling(latex_options = 'striped')

tabela conteudo I\_TODOS informacoes cadastrais dos fundos e ativos II classificacoes da carteira III passivos IV patrimonio liquido V direitos creditorios classificados por prazo de vencimento inadimplencia e pagamentos antecipados VI direitos creditorios classificados por prazo de vencimento inadimplencia e pagamentos antecipados VII aquisicoes substituicoes e recompras IX taxas de desconto e de juros das compras e vendas X\_1 informacoes sobre cotistas por classe e serie X\_1\_1 classificacoes dos cotistas X\_2 informacoes sobre as cotas X\_3 rentabilidade X\_4 transacoes de cotas do fundo X\_5 ativos classificados pela liquidez X\_6 taxas de desempenho real e esperada X\_7 garantias

Let’s then gather the largest FIDC’s assets data and plot its main segments:

dados_fidc(cnpj  = infos$cnpj[1],
           start = infos$inicio_atv[1],
           end   = as.Date('2022-03-31'),
           table = 'I') %>%
      filter(base == 'ativo', value != 0) %>% 
      group_by(data, segment) %>% 
              summarise(valor = sum(value)) %>% 
      ggplot() +
      aes(x = data, y = valor, fill = segment) +
      geom_bar(position = 'stack', stat     = 'identity') +
      scale_y_continuous(labels = unit_format(unit = "Bi", scale = 1e-9)) +
      xlab('Date') +
      ylab('Value (BRL)') +
      theme_minimal()
#> [1] "Obtaining data between 2019-01-01 and 2022-02-28"
#> [1] "Obtaining data between 2016-04-20 and 2018-12-31"

We can also analyze the items within each asset segment:

dados_fidc(cnpj  = infos$cnpj[1],
           start = infos$inicio_atv[1],
           end   = as.Date('2022-03-31'),
           table = 'I') %>%
      filter(base == 'ativo', value != 0) %>% 
      group_by(data, segment, item) %>% 
              summarise(valor = sum(value)) %>% 
      ggplot() +
      aes(x = data, y = valor, fill = item) +
      geom_bar(position = 'stack', stat     = 'identity') +
      scale_y_continuous(labels = unit_format(unit = "MM", scale = 1e-6)) +
      xlab('Date') +
      ylab('Value (BRL)') +
      theme_minimal() +
      facet_wrap(~segment, scales = 'free_y')
#> [1] "Obtaining data between 2019-01-01 and 2022-02-28"
#> [1] "Obtaining data between 2016-04-20 and 2018-12-31"