cfda provides functions for the analysis of categorical functional data.

The main contribution is the computation of an optimal encoding (real functional variable) of each state of the categorical functional data. This can be done using the compute_optimal_encoding function that takes in arguments the data in a specific format and a basis of functions created using the fda package (cf. create.basis). The output can be analysed with summary.fmca, plot.fmca, get_encoding, plotEigenvalues and plotComponent.

Moreover, cfda contains functions to visualize and compute some statistics about categorical functional data. A summary of the dataset is available with summary_cfd. plotData shows a graphical representation of the dataset. Basic statistics can be computed: the number of jumps (compute_number_jumps), the duration (compute_duration), the time spent in each state (compute_time_spent), the probability to be in each state at any given time (estimate_pt), the transition table (statetable).

The parameters of a Markov process can be estimated using estimate_Markov function.

In order to test the different functions, a real dataset is provided (biofam2) as well as two functions for generating data: (generate_Markov and generate_2State).

Details

See the vignette for a detailed example and mathematical background: RShowDoc("cfda", package = "cfda")

References

  • Deville J.C. (1982) Analyse de données chronologiques qualitatives : comment analyser des calendriers ?, Annales de l'INSEE, No 45, p. 45-104.

  • Deville J.C. et Saporta G. (1980) Analyse harmonique qualitative, DIDAY et al. (editors), Data Analysis and Informatics, North Holland, p. 375-389.

  • Saporta G. (1981) Méthodes exploratoires d'analyse de données temporelles, Cahiers du B.U.R.O, Université Pierre et Marie Curie, 37-38, Paris.

  • Preda C, Grimonprez Q, Vandewalle V. Categorical Functional Data Analysis. The cfda R Package. Mathematics. 2021; 9(23):3074. https://doi.org/10.3390/math9233074

Examples

# Simulate the Jukes-Cantor model of nucleotide replacement
K <- 4
Tmax <- 5
PJK <- matrix(1 / 3, nrow = K, ncol = K) - diag(rep(1 / 3, K))
lambda_PJK <- c(1, 1, 1, 1)
d_JK <- generate_Markov(n = 10, K = K, P = PJK, lambda = lambda_PJK, Tmax = Tmax)
d_JK2 <- cut_data(d_JK, Tmax)

# create basis object
m <- 5
b <- create.bspline.basis(c(0, Tmax), nbasis = m, norder = 4)

# compute encoding
encoding <- compute_optimal_encoding(d_JK2, b, computeCI = FALSE, nCores = 1)
#> ######### Compute encoding #########
#> Number of individuals: 10
#> Number of states: 4
#> Basis type: bspline
#> Number of basis functions: 5
#> Number of cores: 1
#> 
  |                                                  | 0 % elapsed=00s   
  |==========                                        | 20% elapsed=00s, remaining~00s
  |====================                              | 40% elapsed=00s, remaining~00s
  |==============================                    | 60% elapsed=00s, remaining~00s
  |========================================          | 80% elapsed=00s, remaining~00s
  |==================================================| 100% elapsed=00s, remaining~00s
#> 
#> DONE in 0.08s
#> ---- Compute U matrix:
#> 
  |                                                  | 0 % elapsed=00s   
  |====                                              | 7 % elapsed=00s, remaining~00s
  |=======                                           | 13% elapsed=00s, remaining~00s
  |==========                                        | 20% elapsed=00s, remaining~00s
  |==============                                    | 27% elapsed=00s, remaining~00s
  |=================                                 | 33% elapsed=00s, remaining~00s
  |====================                              | 40% elapsed=00s, remaining~00s
  |========================                          | 47% elapsed=00s, remaining~00s
  |===========================                       | 53% elapsed=00s, remaining~00s
  |==============================                    | 60% elapsed=00s, remaining~00s
  |==================================                | 67% elapsed=00s, remaining~00s
  |=====================================             | 73% elapsed=00s, remaining~00s
  |========================================          | 80% elapsed=00s, remaining~00s
  |============================================      | 87% elapsed=00s, remaining~00s
  |===============================================   | 93% elapsed=00s, remaining~00s
  |==================================================| 100% elapsed=00s, remaining~00s
#> 
#> DONE in 0.41s
#> ---- Compute encoding: 
#> DONE in 0s
#> Run Time: 0.5s
summary(encoding)
#> #### FMCA
#> 
#> ## Data 
#> Number of individuals: 10 
#> Number of states: 4 
#> Time Range: 0 to 5 
#> States:  1 2 3 4 
#> 
#> ## Basis 
#> Type: bspline 
#> Number of basis functions: 5 
#> 
#> ## Outputs
#> Eigenvalues:
#>   2.829354 2.282478 1.799415 1.469461 0.8318814 0.5187159 
#> 
#> Explained variance:
#>   0.279 0.504 0.682 0.827 0.909 0.96 
#> 
#> Optimal encoding:
#>                1           2          3           4
#> [1,]  0.07724101 -0.08412610 -0.9027121  2.12720845
#> [2,] -0.25779381 -0.02334159 -0.1051384 -0.06511869
#> [3,]  0.67368417  1.95592737 -0.8359562 -0.52846472
#> [4,]  0.37502000  0.52170320  0.7508660 -0.91307196
#> [5,]  0.60975716 -0.85600285  0.5711580 -0.46221971
#> 
#> Principal components:
#>          [,1]       [,2]       [,3]        [,4]        [,5]        [,6]
#> 1  0.39384758  0.5060123 -0.5843695  0.10808683 -0.11784780 -0.92015914
#> 2  2.68821991  0.2163696  3.0389829 -0.72581432 -0.06233262  0.05730442
#> 3 -1.95315107  0.8408190  0.4365884 -0.44306349  0.47350431 -1.06325243
#> 4  0.03479717 -3.8778186 -0.5971048 -1.00341518  0.69688666 -0.08205782
#> 5 -1.47492477 -1.0091604  0.4422508  0.71320378 -2.28927560  0.22667263
#> 6 -1.81422297  0.6278327  0.3942427 -0.06794839  0.82622991  1.54360109
#> 
#> Total elapsed time: 0.495 s

# plot eigenvalues
plotEigenvalues(encoding, cumulative = TRUE, normalize = TRUE)


# plot the two first components
plotComponent(encoding, comp = c(1, 2))


# plot the encoding using the first harmonic
plot(encoding)
#> Warning: Removed 50 rows containing missing values (`geom_line()`).


# extract the encoding using the first harmonic
encod <- get_encoding(encoding)