---
title: "Other Utility Functions in bulkreadr"
output: rmarkdown::html_vignette
author: "Ezekiel Ogundepo and Ernest Fokoué"
vignette: >
  %\VignetteIndexEntry{Other functions in bulkreadr}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
description: >
  The `bulkreadr` package includes specialized functions beyond bulk data reading,
  aimed at enhancing data analysis efficiency. These functions are designed to
  operate on individual vectors, except for `inspect_na()` and
  `fill_missing_values()`, which work on data frames.
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  comment = "#>",
  fig.path = "man/figures/",
  out.width = "100%")
options(tibble.print_min = 5, tibble.print_max = 5)
options(rmarkdown.html_vignette.check_title = FALSE)
```

The `bulkreadr` package includes specialized functions beyond bulk data reading, aimed at enhancing data analysis efficiency. These functions are designed to operate on individual vectors, except for `inspect_na()` and `fill_missing_values()`, which work on data frames.

## pull_out()

`pull_out()` extracts or replaces parts of vectors, matrices, arrays, or lists. It works with both the magrittr (`%>%`) and base (`|>`) pipe operators.

```{r example4}
library(bulkreadr)
library(dplyr)

top_10_richest_nig <- c(
  "Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze",
  "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor",
  "Jimoh Ibrahim", "Tony Elumelu"
)

# Extract specific elements from the vector
top_10_richest_nig |>
  pull_out(c(1, 5, 2))

# Exclude specific elements from the vector
top_10_richest_nig |>
  pull_out(-c(1, 5, 2))
```

## convert_to_date()

`convert_to_date()` parses dates written in a variety of formats into `POSIXct` date objects, so that mixed date input can be handled and analysed consistently.
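Purely numeric values are one kind of input a date parser like this may receive. A common interpretation, and an assumption here rather than something this vignette states, is that such numbers are Excel serial day counts measured from the origin `1899-12-30`. The base-R sketch below illustrates only that single case; `excel_serial_to_date()` is a hypothetical helper written for this illustration and is not part of `bulkreadr`.

```{r}
# Hypothetical helper (not part of bulkreadr): treat a number as an
# Excel serial day count, assuming the usual 1899-12-30 origin.
excel_serial_to_date <- function(x) {
  as.Date(as.numeric(x), origin = "1899-12-30")
}

excel_serial_to_date(44869)
excel_serial_to_date(c(42370, 44562))
```

Under this assumption, the serial number 44869 corresponds to 4 November 2022.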
```{r example5}
# heterogeneous dates
dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
  "13-01- 2022", "2023", "2023-2", 41750.2, 41751.99,
  "11 07 2023", "2023-4"
)

# Convert to POSIXct or Date object
convert_to_date(dates)

# Convert date-time object to date object
convert_to_date(lubridate::now())
```

## Handling Missing Values with `inspect_na()` and `fill_missing_values()`

- `inspect_na()`: Quickly checks for missing data across a data frame.
- `fill_missing_values()`: Offers several imputation strategies for filling missing values.

```{r}
# Inspect missing data in the 'airquality' dataset
inspect_na(airquality)
```

`inspect_na()` also works with grouped data frames, so you can inspect missing values within each group. For example, to check for missing values in the `airquality` dataset grouped by `Month`:

```{r}
airquality %>%
  group_by(Month) %>%
  inspect_na()
```

**Imputing Missing Values**

`fill_missing_values()` fills in missing values in a data frame using imputation by function, also known as column-based imputation: each missing value is replaced by a summary statistic computed from the observed values in the same column. For continuous variables, the supported methods are `minimum`, `maximum`, `mean`, `median`, `harmonic mean`, and `geometric mean`. For categorical variables, missing values are replaced with the `mode` of the column. Because every replacement is derived from the column it fills, the imputed columns come back complete, with no missing entries.
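The column-based idea can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the implementation used by `fill_missing_values()`: the helpers `geometric_mean()`, `stat_mode()`, and `impute_column()` and the data frame `toy_cols` are all hypothetical names introduced here.

```{r}
# Hypothetical helpers for illustration only; not part of bulkreadr.

# Geometric mean of the observed (non-missing, positive) values.
geometric_mean <- function(x) exp(mean(log(x), na.rm = TRUE))

# Most frequent observed value; ties go to the value seen first.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill NAs in one column: numeric columns use `stat`, others use the mode.
impute_column <- function(x, stat = mean) {
  fill <- if (is.numeric(x)) stat(x[!is.na(x)]) else stat_mode(x)
  x[is.na(x)] <- fill
  x
}

toy_cols <- data.frame(
  width = c(2, NA, 8),
  species = c("setosa", NA, "setosa"),
  stringsAsFactors = FALSE
)

# Apply the imputation column by column, keeping the data frame structure.
toy_cols[] <- lapply(toy_cols, impute_column, stat = geometric_mean)
toy_cols
```

Each column is handled independently, which is what distinguishes this approach from model-based methods that borrow information across columns.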
```{r example6}
df <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c(
    "setosa", NA, "versicolor", "setosa",
    NA, "virginica", "setosa"
  )
)

df
```

**Impute using the mean method for continuous variables**

```{r}
result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean
```

**Impute using the geometric mean for continuous variables and specify variables `Petal_Length` and `Petal_Width`**

```{r}
result_df_geomean <- fill_missing_values(
  df,
  selected_variables = c("Petal_Length", "Petal_Width"),
  method = "geometric"
)

result_df_geomean
```

**Impute missing values (NAs) in a grouped data frame**

To impute within groups, split the data frame by group and apply `fill_missing_values()` to each piece, for example with `dplyr::group_split()` and `purrr::map_df()`:

```{r}
sample_iris <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c(
    "setosa", "setosa", "versicolor", "setosa",
    "virginica", "virginica", "setosa"
  )
)

sample_iris
```

```{r}
# map_df() comes from purrr, which is not attached above, so call it explicitly
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  purrr::map_df(fill_missing_values, method = "median")
```
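The grouped workflow above is an instance of split-apply-combine. As a rough base-R sketch of the same pattern, assuming median imputation of a single numeric column, the steps look like this; `toy_groups` and `impute_median()` are hypothetical names for illustration and are not part of `bulkreadr`.

```{r}
# Hypothetical toy data and helper; not part of bulkreadr.
toy_groups <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, NA, 3, 10, NA)
)

# Replace NAs with the median of the observed values in the vector.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Split by group, impute within each piece, then bind the pieces back together.
pieces <- split(toy_groups, toy_groups$group)
pieces <- lapply(pieces, function(piece) {
  piece$value <- impute_median(piece$value)
  piece
})
grouped_result <- do.call(rbind, pieces)
rownames(grouped_result) <- NULL

grouped_result
```

Each missing value is filled from its own group only, so the `NA` in group `a` and the `NA` in group `b` receive different replacements.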