Title: | Tool for Diagnosis of Tables Joins and Complementary Join Features |
---|---|
Description: | Tool for diagnosing table joins. It combines the speed of `collapse` and `data.table`, the flexibility of `dplyr`, and the diagnosis and features of the `merge` command in `Stata`. |
Authors: | R.Andres Castaneda [aut, cre], Zander Prinsloo [aut], Rossana Tatulli [aut] |
Maintainer: | R.Andres Castaneda <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.3 |
Built: | 2024-11-20 05:21:16 UTC |
Source: | https://github.com/randrescastaneda/joyn |
This is a joyn wrapper that works in a similar fashion to dplyr::anti_join.
anti_join( x, y, by = intersect(names(x), names(y)), copy = FALSE, suffix = c(".x", ".y"), keep = NULL, na_matches = c("na", "never"), multiple = "all", relationship = "many-to-many", y_vars_to_keep = FALSE, reportvar = getOption("joyn.reportvar"), reporttype = c("factor", "character", "numeric"), roll = NULL, keep_common_vars = FALSE, sort = TRUE, verbose = getOption("joyn.verbose"), ... )
x |
data frame: referred to as left in R terminology, or master in Stata terminology. |
y |
data frame: referred to as right in R terminology, or using in Stata terminology. |
by |
a character vector of variables to join by. If NULL, the default,
joyn will do a natural join, using all variables with common names across
the two tables. A message lists the variables so that you can check they're
correct (to suppress the message, simply explicitly list the variables that
you want to join). To join by different variables on x and y use a vector
of expressions. For example, |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
keep |
Should the join keys from both
|
na_matches |
Should two |
multiple |
Handling of rows in
|
relationship |
Handling of the expected relationship between the keys of
|
y_vars_to_keep |
character: Vector of variable names in |
reportvar |
character: Name of reporting variable. Default is ".joyn". This is the same as variable "_merge" in Stata after performing a merge. If FALSE or NULL, the reporting variable will be excluded from the final table, though a summary of the join will be displayed after the join concludes. |
reporttype |
character: One of "factor", "character", or "numeric". Default is "factor", the first choice in the usage above. If "numeric", the reporting variable will contain numeric codes of the source and the contents of each observation in the joined table. See below for more information. |
roll |
double: to be implemented |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
sort |
logical: If TRUE, sort by key variables in |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
... |
Arguments passed on to
|
A data frame of the same class as x. The properties of the output are as close as possible to the ones returned by the dplyr alternative.
Other dplyr alternatives: full_join(), inner_join(), left_join(), right_join()
# Simple anti join
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = c(1, 2, 4),
                y  = c(11L, 15L, 16))
anti_join(x1, y1, relationship = "many-to-one")
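For orientation, the row selection that an anti join performs can be sketched in base R. This is only an illustration of the semantics, not joyn's implementation: the helper `anti_rows` is hypothetical, it assumes a single key column, and it does not produce a reporting variable.

```r
# Toy tables mirroring the example above (plain data frames, base R only)
x1 <- data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_),
                 t  = c(1L, 2L, 1L, 2L, NA_integer_),
                 x  = 11:15)
y1 <- data.frame(id = c(1, 2, 4),
                 y  = c(11, 15, 16))

# Hypothetical helper: keep the rows of `x` whose key has no match in `y`
anti_rows <- function(x, y, by) {
  x[!(x[[by]] %in% y[[by]]), , drop = FALSE]
}

res <- anti_rows(x1, y1, by = "id")
res
```

In this sketch the rows with `id` 3 and `NA` are kept, because neither key appears in `y1`.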
Tabulate the frequencies of one variable.
freq_table(x, byvar, digits = 1, na.rm = FALSE)
x |
data frame |
byvar |
character: name of variable to tabulate. Use Standard evaluation. |
digits |
numeric: number of decimal places to display. Default is 1. |
na.rm |
logical: report NA values in frequencies. Default is FALSE. |
data.table with frequencies.
library(data.table)
library(joyn)
x4 = data.table(id1 = c(1, 1, 2, 3, 3),
                id2 = c(1, 1, 2, 3, 4),
                t   = c(1L, 2L, 1L, 2L, NA_integer_),
                x   = c(16, 12, NA, NA, 15))
freq_table(x4, "id1")
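As a rough point of comparison, a one-variable frequency table can be built in base R as follows. This is a sketch under assumptions: the column names `n` and `percent` are chosen for illustration, and the exact layout that freq_table() returns may differ.

```r
x4 <- data.frame(id1 = c(1, 1, 2, 3, 3),
                 id2 = c(1, 1, 2, 3, 4))

# Count each distinct value of id1; NA would get its own cell if present
counts <- table(x4$id1, useNA = "ifany")

# Assemble counts and percentages, rounded to one decimal place
freq <- data.frame(
  id1     = names(counts),
  n       = as.integer(counts),
  percent = round(100 * as.integer(counts) / sum(counts), digits = 1)
)
freq
```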
This is a joyn wrapper that works in a similar fashion to dplyr::full_join.
full_join( x, y, by = intersect(names(x), names(y)), copy = FALSE, suffix = c(".x", ".y"), keep = NULL, na_matches = c("na", "never"), multiple = "all", unmatched = "drop", relationship = "one-to-one", y_vars_to_keep = TRUE, update_values = FALSE, update_NAs = update_values, reportvar = getOption("joyn.reportvar"), reporttype = c("factor", "character", "numeric"), roll = NULL, keep_common_vars = FALSE, sort = TRUE, verbose = getOption("joyn.verbose"), ... )
x |
data frame: referred to as left in R terminology, or master in Stata terminology. |
y |
data frame: referred to as right in R terminology, or using in Stata terminology. |
by |
a character vector of variables to join by. If NULL, the default,
joyn will do a natural join, using all variables with common names across
the two tables. A message lists the variables so that you can check they're
correct (to suppress the message, simply explicitly list the variables that
you want to join). To join by different variables on x and y use a vector
of expressions. For example, |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
keep |
Should the join keys from both
|
na_matches |
Should two |
multiple |
Handling of rows in
|
unmatched |
How should unmatched keys that would result in dropped rows be handled?
|
relationship |
Handling of the expected relationship between the keys of
|
y_vars_to_keep |
character: Vector of variable names in |
update_values |
logical: If TRUE, it will update all values of variables
in x with the actual values of variables in y with the same name as the ones in x.
NAs from y won't be used to update actual values in x. Yet, by default,
NAs in x will be updated with values in y. To avoid this, make sure to set
|
update_NAs |
logical: If TRUE, it will update NA values of all variables
in x with actual values of variables in y that have the same name as the
ones in x. If FALSE, NA values won't be updated, even if |
reportvar |
character: Name of reporting variable. Default is ".joyn". This is the same as variable "_merge" in Stata after performing a merge. If FALSE or NULL, the reporting variable will be excluded from the final table, though a summary of the join will be displayed after the join concludes. |
reporttype |
character: One of "factor", "character", or "numeric". Default is "factor", the first choice in the usage above. If "numeric", the reporting variable will contain numeric codes of the source and the contents of each observation in the joined table. See below for more information. |
roll |
double: to be implemented |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
sort |
logical: If TRUE, sort by key variables in |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
... |
Arguments passed on to
|
A data frame of the same class as x. The properties of the output are as close as possible to the ones returned by the dplyr alternative.
Other dplyr alternatives: anti_join(), inner_join(), left_join(), right_join()
# Simple full join
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = c(1, 2, 4),
                y  = c(11L, 15L, 16))
full_join(x1, y1, relationship = "many-to-one")
This function aims to display and store info on joyn options
get_joyn_options(env = .joynenv, display = TRUE, option = NULL)
env |
environment, which is joyn environment by default |
display |
logical, if TRUE displays (i.e., print) info on joyn options and corresponding default and current values |
option |
character or NULL. If character, name of a specific joyn option. If NULL, all joyn options |
joyn options and values invisibly as a list
JOYn options functions: set_joyn_options()
## Not run:
# display all joyn options, their default and current values
joyn:::get_joyn_options()

# store list of option = value pairs AND do not display info
joyn_options <- joyn:::get_joyn_options(display = FALSE)

# get info on one specific option and store it
joyn.verbose <- joyn:::get_joyn_options(option = "joyn.verbose")

# get info on two specific options
joyn:::get_joyn_options(option = c("joyn.verbose", "joyn.reportvar"))
## End(Not run)
This is a joyn wrapper that works in a similar fashion to dplyr::inner_join.
inner_join( x, y, by = intersect(names(x), names(y)), copy = FALSE, suffix = c(".x", ".y"), keep = NULL, na_matches = c("na", "never"), multiple = "all", unmatched = "drop", relationship = "one-to-one", y_vars_to_keep = TRUE, update_values = FALSE, update_NAs = update_values, reportvar = getOption("joyn.reportvar"), reporttype = c("factor", "character", "numeric"), roll = NULL, keep_common_vars = FALSE, sort = TRUE, verbose = getOption("joyn.verbose"), ... )
x |
data frame: referred to as left in R terminology, or master in Stata terminology. |
y |
data frame: referred to as right in R terminology, or using in Stata terminology. |
by |
a character vector of variables to join by. If NULL, the default,
joyn will do a natural join, using all variables with common names across
the two tables. A message lists the variables so that you can check they're
correct (to suppress the message, simply explicitly list the variables that
you want to join). To join by different variables on x and y use a vector
of expressions. For example, |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
keep |
Should the join keys from both
|
na_matches |
Should two |
multiple |
Handling of rows in
|
unmatched |
How should unmatched keys that would result in dropped rows be handled?
|
relationship |
Handling of the expected relationship between the keys of
|
y_vars_to_keep |
character: Vector of variable names in |
update_values |
logical: If TRUE, it will update all values of variables
in x with the actual values of variables in y with the same name as the ones in x.
NAs from y won't be used to update actual values in x. Yet, by default,
NAs in x will be updated with values in y. To avoid this, make sure to set
|
update_NAs |
logical: If TRUE, it will update NA values of all variables
in x with actual values of variables in y that have the same name as the
ones in x. If FALSE, NA values won't be updated, even if |
reportvar |
character: Name of reporting variable. Default is ".joyn". This is the same as variable "_merge" in Stata after performing a merge. If FALSE or NULL, the reporting variable will be excluded from the final table, though a summary of the join will be displayed after the join concludes. |
reporttype |
character: One of "factor", "character", or "numeric". Default is "factor", the first choice in the usage above. If "numeric", the reporting variable will contain numeric codes of the source and the contents of each observation in the joined table. See below for more information. |
roll |
double: to be implemented |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
sort |
logical: If TRUE, sort by key variables in |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
... |
Arguments passed on to
|
A data frame of the same class as x. The properties of the output are as close as possible to the ones returned by the dplyr alternative.
Other dplyr alternatives: anti_join(), full_join(), left_join(), right_join()
# Simple inner join
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = c(1, 2, 4),
                y  = c(11L, 15L, 16))
inner_join(x1, y1, relationship = "many-to-one")
Check if the data frame is balanced by group of columns, i.e., if it contains every combination of the elements in the specified variables
is_balanced(df, by, return = c("logic", "table"))
df |
data frame |
by |
character: variables used to check if |
return |
character: either "logic" or "table". If "logic", returns |
logical, if return == "logic", else returns data frame of unbalanced observations
library(joyn)
x1 = data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
# returns combinations of elements in "id" and "t" not present in df
is_balanced(df = x1, by = c("id", "t"), return = "table")
is_balanced(df = x1, by = c("id", "t"), return = "logic") # FALSE
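The idea of a balanced data frame, one that contains every combination of the values in the `by` variables, can be sketched in base R. This is an illustration only, not how is_balanced() is implemented; in particular, dropping missing values before building the combinations is an assumption of the sketch.

```r
x1 <- data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_),
                 t  = c(1L, 2L, 1L, 2L, NA_integer_),
                 x  = 11:15)
by <- c("id", "t")

# Every combination of the observed, non-missing values of the `by` variables
lvls       <- lapply(x1[by], function(v) sort(unique(v[!is.na(v)])))
all_combos <- expand.grid(lvls)

# Combinations that actually occur in the data, encoded as strings
present <- paste(x1$id, x1$t)

# Combinations that are expected but absent
missing_combos <- all_combos[!(paste(all_combos$id, all_combos$t) %in% present), ]
balanced <- nrow(missing_combos) == 0

missing_combos
balanced
```

Here the combinations (id = 3, t = 1) and (id = 2, t = 2) are absent, so the data frame is not balanced.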
Report whether dt is uniquely identified by the by variables or, if return_report = TRUE, return the duplicates in the by variables.
is_id(dt, by, verbose = getOption("joyn.verbose"), return_report = FALSE)
dt |
either right or left table |
by |
variable to merge by |
verbose |
logical: if TRUE messages will be displayed |
return_report |
logical: if TRUE, returns data with summary of duplicates.
If FALSE, returns logical value depending on whether |
logical or data.frame, depending on the value of argument return_report
library(data.table)
library(joyn)

# example with data frame not uniquely identified by `by` var
y <- data.table(id = c("c", "b", "c", "a"),
                y  = c(11L, 15L, 18L, 20L))
is_id(y, by = "id")
is_id(y, by = "id", return_report = TRUE)

# example with data frame uniquely identified by `by` var
y1 <- data.table(id = c("1", "3", "2", "9"),
                 y  = c(11L, 15L, 18L, 20L))
is_id(y1, by = "id")
This is the primary function in the joyn package. It executes a full join, performs a number of checks, and filters to allow the user-specified join.
joyn( x, y, by = intersect(names(x), names(y)), match_type = c("1:1", "1:m", "m:1", "m:m"), keep = c("full", "left", "master", "right", "using", "inner", "anti"), y_vars_to_keep = ifelse(keep == "anti", FALSE, TRUE), update_values = FALSE, update_NAs = update_values, reportvar = getOption("joyn.reportvar"), reporttype = c("factor", "character", "numeric"), roll = NULL, keep_common_vars = FALSE, sort = FALSE, verbose = getOption("joyn.verbose"), suffixes = getOption("joyn.suffixes"), allow.cartesian = deprecated(), yvars = deprecated(), keep_y_in_x = deprecated(), na.last = getOption("joyn.na.last"), msg_type = getOption("joyn.msg_type") )
x |
data frame: referred to as left in R terminology, or master in Stata terminology. |
y |
data frame: referred to as right in R terminology, or using in Stata terminology. |
by |
a character vector of variables to join by. If NULL, the default,
joyn will do a natural join, using all variables with common names across
the two tables. A message lists the variables so that you can check they're
correct (to suppress the message, simply explicitly list the variables that
you want to join). To join by different variables on x and y use a vector
of expressions. For example, |
match_type |
character: one of "m:m", "m:1", "1:m", "1:1". Default is "1:1" since this is the most restrictive. However, following Stata's recommendation, it is better to be explicit and use any of the other three match types (see details in the match types section). |
keep |
atomic character vector of length 1: One of "full", "left",
"master", "right",
"using", "inner", "anti". Default is "full". Even though this is not the
regular behavior of joins in R, the objective of |
y_vars_to_keep |
character: Vector of variable names in |
update_values |
logical: If TRUE, it will update all values of variables
in x with the actual values of variables in y with the same name as the ones in x.
NAs from y won't be used to update actual values in x. Yet, by default,
NAs in x will be updated with values in y. To avoid this, make sure to set
|
update_NAs |
logical: If TRUE, it will update NA values of all variables
in x with actual values of variables in y that have the same name as the
ones in x. If FALSE, NA values won't be updated, even if |
reportvar |
character: Name of reporting variable. Default is ".joyn". This is the same as variable "_merge" in Stata after performing a merge. If FALSE or NULL, the reporting variable will be excluded from the final table, though a summary of the join will be displayed after the join concludes. |
reporttype |
character: One of "factor", "character", or "numeric". Default is "factor", the first choice in the usage above. If "numeric", the reporting variable will contain numeric codes of the source and the contents of each observation in the joined table. See below for more information. |
roll |
double: to be implemented |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
sort |
logical: If TRUE, sort by key variables in |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
suffixes |
A character(2) specifying the suffixes to be used for making non-by column names unique. The suffix behaviour works in a similar fashion as the base::merge method does. |
allow.cartesian |
logical: Check documentation in official web site.
Default is |
yvars |
|
keep_y_in_x |
|
na.last |
|
msg_type |
character: type of messages to display by default |
a data.table joining x and y.
Using the same wording as the Stata manual:
1:1: specifies a one-to-one match merge. The variables specified in by uniquely identify single observations in both tables.
1:m and m:1: specify one-to-many and many-to-one match merges, respectively. This means that in one of the tables the observations are uniquely identified by the variables in by, while in the other table many (two or more) of the observations are identified by the variables in by.
m:m: refers to a many-to-many merge. The variables in by do not uniquely identify the observations in either table. Matching is performed by combining observations with equal values in by; within matching values, the first observation in the master (i.e. left or x) table is matched with the first matching observation in the using (i.e. right or y) table; the second, with the second; and so on. If there is an unequal number of observations within a group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group.
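The match types above come down to whether the `by` variables uniquely identify rows in each table. A minimal base R sketch of that check follows; the helpers `is_unique_key` and `guess_match_type` are hypothetical names used for illustration and are not part of joyn.

```r
x1 <- data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_),
                 t  = c(1L, 2L, 1L, 2L, NA_integer_),
                 x  = 11:15)
y1 <- data.frame(id = 1:2, y = c(11L, 15L))

# Hypothetical helper: TRUE when no combination of `by` values is repeated
is_unique_key <- function(df, by) {
  anyDuplicated(df[by]) == 0L
}

# Hypothetical helper: "1" if a table is uniquely identified by `by`, else "m"
guess_match_type <- function(x, y, by) {
  paste0(if (is_unique_key(x, by)) "1" else "m",
         ":",
         if (is_unique_key(y, by)) "1" else "m")
}

guess_match_type(x1, y1, by = "id")
```

Because `id` repeats in `x1` but not in `y1`, the sketch reports a many-to-one relationship, which is the match type used in the examples for these two tables.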
If reporttype = "numeric", then the numeric values have the following meaning:
1: row comes from x, i.e. "x"
2: row comes from y, i.e. "y"
3: row from both x and y, i.e. "x & y"
4: row has NA in x that has been updated with y, i.e. "NA updated"
5: row has value in x that has been updated with y, i.e. "value updated"
6: row from x that has not been updated, i.e. "not updated"
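If a numeric reporting variable needs to be read back as text, the mapping listed above can be applied with a base R factor. This is a sketch: the vector `codes` is made-up input for illustration, not output produced by joyn.

```r
# Labels in the order of the numeric codes 1 to 6 listed above
report_labels <- c("x", "y", "x & y", "NA updated",
                   "value updated", "not updated")

# Made-up numeric report values, for illustration only
codes <- c(1L, 3L, 3L, 2L, 4L)

# Attach the labels; codes that do not occur remain as empty levels
labelled <- factor(codes, levels = 1:6, labels = report_labels)

as.character(labelled)
table(labelled)
```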
NAs are placed either first or last in the resulting data.frame depending on the value of getOption("joyn.na.last"). The default is FALSE, as it is the default value of data.table::setorderv.
# Simple join
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = 1:2,
                y  = c(11L, 15L))
x2 = data.table(id = c(1, 1, 2, 3, NA),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = c(16, 12, NA, NA, 15))
y2 = data.table(id = c(1, 2, 5, 6, 3),
                yd = c(1, 2, 5, 6, 3),
                y  = c(11L, 15L, 20L, 13L, 10L),
                x  = c(16:20))
joyn(x1, y1, match_type = "m:1")

# Bad merge for not specifying by argument or match_type
joyn(x2, y2)

# Good merge, ignoring variable x from y
joyn(x2, y2, by = "id", match_type = "m:1")

# Update NAs in variable x with values from y
joyn(x2, y2, by = "id", update_NAs = TRUE, match_type = "m:1")

# Update values in x with variables from y
joyn(x2, y2, by = "id", update_values = TRUE, match_type = "m:1")
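The difference between update_NAs and update_values, as described in the arguments, can be sketched in base R on the shared variable x. This only illustrates the documented rules for rows of the left table; it is not joyn's implementation and it ignores the reporting variable and rows that exist only in y.

```r
x2 <- data.frame(id = c(1, 1, 2, 3, NA),
                 x  = c(16, 12, NA, NA, 15))
y2 <- data.frame(id = c(1, 2, 5, 6, 3),
                 x  = 16:20)

# For each row of x2, look up the value of x in y2 (NA when the key is unmatched)
y_x <- y2$x[match(x2$id, y2$id)]

# update_NAs-like rule: fill only the NAs of x2$x, and only with non-missing y values
x_na_updated <- ifelse(is.na(x2$x) & !is.na(y_x), y_x, x2$x)

# update_values-like rule: overwrite with y wherever y is non-missing;
# NAs from y never replace actual values in x
x_val_updated <- ifelse(!is.na(y_x), y_x, x2$x)

data.frame(id = x2$id, x = x2$x, x_na_updated, x_val_updated)
```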
Display type of joyn message.
joyn_msg(msg_type = getOption("joyn.msg_type"), msg = NULL)
msg_type |
character: one or more of the following: all, basic, info, note, warn, timing, or err |
msg |
character vector to be parsed to |
Returns a data frame with the messages invisibly and prints the messages in the console.
Messages functions: clear_joynenv(), joyn_msgs_exist(), joyn_report(), msg_type_dt(), store_msg(), style(), type_choices()
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = 1:2,
                y  = c(11L, 15L))
df <- joyn(x1, y1, match_type = "m:1")
joyn_msg("basic")
joyn_msg("all")
Print JOYn report table
joyn_report(verbose = getOption("joyn.verbose"))
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
invisible table of frequencies
Messages functions: clear_joynenv(), joyn_msg(), joyn_msgs_exist(), msg_type_dt(), store_msg(), style(), type_choices()
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = 1:2,
                y  = c(11L, 15L))
d <- joyn(x1, y1, match_type = "m:1")
joyn_report(verbose = TRUE)
This is a joyn wrapper that works in a similar fashion to dplyr::left_join.
left_join( x, y, by = intersect(names(x), names(y)), copy = FALSE, suffix = c(".x", ".y"), keep = NULL, na_matches = c("na", "never"), multiple = "all", unmatched = "drop", relationship = NULL, y_vars_to_keep = TRUE, update_values = FALSE, update_NAs = update_values, reportvar = getOption("joyn.reportvar"), reporttype = c("factor", "character", "numeric"), roll = NULL, keep_common_vars = FALSE, sort = TRUE, verbose = getOption("joyn.verbose"), ... )
x |
data frame: referred to as left in R terminology, or master in Stata terminology. |
y |
data frame: referred to as right in R terminology, or using in Stata terminology. |
by |
a character vector of variables to join by. If NULL, the default,
joyn will do a natural join, using all variables with common names across
the two tables. A message lists the variables so that you can check they're
correct (to suppress the message, simply explicitly list the variables that
you want to join). To join by different variables on x and y use a vector
of expressions. For example, |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
keep |
Should the join keys from both
|
na_matches |
Should two |
multiple |
Handling of rows in
|
unmatched |
How should unmatched keys that would result in dropped rows be handled?
|
relationship |
Handling of the expected relationship between the keys of
|
y_vars_to_keep |
character: Vector of variable names in |
update_values |
logical: If TRUE, it will update all values of variables
in x with the actual values of variables in y with the same name as the ones in x.
NAs from y won't be used to update actual values in x. Yet, by default,
NAs in x will be updated with values in y. To avoid this, make sure to set
|
update_NAs |
logical: If TRUE, it will update NA values of all variables
in x with actual values of variables in y that have the same name as the
ones in x. If FALSE, NA values won't be updated, even if |
reportvar |
character: Name of reporting variable. Default is ".joyn". This is the same as variable "_merge" in Stata after performing a merge. If FALSE or NULL, the reporting variable will be excluded from the final table, though a summary of the join will be displayed after the join concludes. |
reporttype |
character: One of "factor", "character", or "numeric". Default is "factor", the first choice in the usage above. If "numeric", the reporting variable will contain numeric codes of the source and the contents of each observation in the joined table. See below for more information. |
roll |
double: to be implemented |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
sort |
logical: If TRUE, sort by key variables in |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
... |
Arguments passed on to
|
A data frame of the same class as x. The properties of the output are as close as possible to the ones returned by the dplyr alternative.
Other dplyr alternatives: anti_join(), full_join(), inner_join(), right_join()
# Simple left join
library(data.table)
library(joyn)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = c(1, 2, 4),
                y  = c(11L, 15L, 16))
left_join(x1, y1, relationship = "many-to-one")
This is a joyn wrapper that works in a similar fashion to base::merge and data.table::merge, which is why merge masks the other two.
merge( x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE, allow.cartesian = getOption("datatable.allow.cartesian"), match_type = c("m:m", "m:1", "1:m", "1:1"), keep_common_vars = TRUE, ... )
x , y
|
|
by |
A vector of shared column names in |
by.x , by.y
|
Vectors of column names in |
all |
logical; |
all.x |
logical; if |
all.y |
logical; analogous to |
sort |
logical. If |
suffixes |
A |
no.dups |
logical indicating that |
allow.cartesian |
See |
match_type |
character: one of "m:m", "m:1", "1:m", "1:1". Default is "1:1" since this is the most restrictive. However, following Stata's recommendation, it is better to be explicit and use any of the other three match types (see details in the match types section). |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
... |
Arguments passed on to
|
data.table merging x and y
x1 = data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15) y1 = data.frame(id = c(1,2, 4), y = c(11L, 15L, 16)) joyn::merge(x1, y1, by = "id") # example of using by.x and by.y x2 = data.frame(id1 = c(1, 1, 2, 3, 3), id2 = c(1, 1, 2, 3, 4), t = c(1L, 2L, 1L, 2L, NA_integer_), x = c(16, 12, NA, NA, 15)) y2 = data.frame(id = c(1, 2, 5, 6, 3), id2 = c(1, 1, 2, 3, 4), y = c(11L, 15L, 20L, 13L, 10L), x = c(16:20)) jn <- joyn::merge(x2, y2, match_type = "m:m", all.x = TRUE, by.x = "id1", by.y = "id2") # example with all = TRUE jn <- joyn::merge(x2, y2, match_type = "m:m", by.x = "id1", by.y = "id2", all = TRUE)
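Before choosing a match_type, it can help to check which relationship the keys actually support. The following is a minimal base R sketch, not joyn's own implementation; `infer_match_type()` is a hypothetical helper introduced only for illustration. It labels a side "1" when the key never repeats in that table and "m" otherwise.

```r
# Hypothetical helper (not part of joyn): report the match type the data
# support by checking whether the key is duplicated in each table.
infer_match_type <- function(x, y, by.x, by.y = by.x) {
  x_unique <- anyDuplicated(x[, by.x, drop = FALSE]) == 0L
  y_unique <- anyDuplicated(y[, by.y, drop = FALSE]) == 0L
  paste0(if (x_unique) "1" else "m", ":", if (y_unique) "1" else "m")
}

x1 <- data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_),
                 t  = c(1L, 2L, 1L, 2L, NA_integer_),
                 x  = 11:15)
y1 <- data.frame(id = c(1, 2, 4),
                 y  = c(11L, 15L, 16))

# id repeats in x1 but not in y1
mt <- infer_match_type(x1, y1, by.x = "id")
print(mt)
```

The result could then be passed explicitly as match_type rather than relying on a default.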
Identify possible variables uniquely identifying x
possible_ids( dt, exclude = NULL, include = NULL, verbose = getOption("possible_ids.verbose") )
dt |
data frame |
exclude |
character: Variables to exclude from selection as identifiers. It can be either the names of the variables or a variable type prefixed by "_", for instance "_numeric" or "_character". |
include |
character: Name of variable to be included, that might belong
to the group excluded in the |
verbose |
logical: If FALSE no message will be displayed. Default is TRUE |
list with possible identifiers
library(data.table)
x4 = data.table(id1 = c(1, 1, 2, 3, 3),
                id2 = c(1, 1, 2, 3, 4),
                t   = c(1L, 2L, 1L, 2L, NA_integer_),
                x   = c(16, 12, NA, NA, 15))
possible_ids(x4)
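To make "uniquely identifying" concrete, here is a rough base R sketch of the underlying idea. It is an assumption about the concept, not how possible_ids() is implemented, and `is_id()` is a hypothetical helper: a set of columns identifies the table when no combination of their values repeats across rows.

```r
x4 <- data.frame(id1 = c(1, 1, 2, 3, 3),
                 id2 = c(1, 1, 2, 3, 4),
                 t   = c(1L, 2L, 1L, 2L, NA_integer_),
                 x   = c(16, 12, NA, NA, 15))

# Hypothetical helper (not joyn's implementation): TRUE when the given
# columns, taken together, never repeat across rows.
is_id <- function(df, vars) anyDuplicated(df[, vars, drop = FALSE]) == 0L

# Try every single column, then every pair of columns.
singles <- Filter(function(v) is_id(x4, v), names(x4))
pairs   <- Filter(function(p) is_id(x4, p),
                  combn(names(x4), 2, simplify = FALSE))

print(singles)
print(pairs)
```

Note that base R treats repeated NA values as duplicates here; possible_ids() may handle missing values differently.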
Rename to syntactically valid names
rename_to_valid(name, verbose = getOption("joyn.verbose"))
name |
character: name to be coerced to syntactically valid name |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
valid character name
joyn:::rename_to_valid("x y")
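For comparison, base R offers make.names() for the same kind of coercion. This is only an analogy: whether rename_to_valid() returns exactly the same string is an assumption not confirmed by this page.

```r
# Base R analogue (not joyn's function): coerce a string to a
# syntactically valid R name.
valid <- make.names("x y")
print(valid)

# A name that is already valid is left unchanged.
unchanged <- make.names("x_y")
print(unchanged)
```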
This is a joyn wrapper that works in a similar fashion to dplyr::right_join
right_join( x, y, by = intersect(names(x), names(y)), copy = FALSE, suffix = c(".x", ".y"), keep = NULL, na_matches = c("na", "never"), multiple = "all", unmatched = "drop", relationship = "one-to-one", y_vars_to_keep = TRUE, update_values = FALSE, update_NAs = update_values, reportvar = getOption("joyn.reportvar"), reporttype = c("factor", "character", "numeric"), roll = NULL, keep_common_vars = FALSE, sort = TRUE, verbose = getOption("joyn.verbose"), ... )
x |
data frame: referred to as left in R terminology, or master in Stata terminology. |
y |
data frame: referred to as right in R terminology, or using in Stata terminology. |
by |
a character vector of variables to join by. If NULL, the default,
joyn will do a natural join, using all variables with common names across
the two tables. A message lists the variables so that you can check they're
correct (to suppress the message, simply explicitly list the variables that
you want to join). To join by different variables on x and y use a vector
of expressions. For example, |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
keep |
Should the join keys from both
|
na_matches |
Should two |
multiple |
Handling of rows in
|
unmatched |
How should unmatched keys that would result in dropped rows be handled?
|
relationship |
Handling of the expected relationship between the keys of
|
y_vars_to_keep |
character: Vector of variable names in |
update_values |
logical: If TRUE, it will update all values of variables
in x with the actual values of variables in y that have the same name as the ones in x.
NAs from y won't be used to update actual values in x. Yet, by default,
NAs in x will be updated with values in y. To avoid this, make sure to set
|
update_NAs |
logical: If TRUE, it will update NA values of all variables
in x with actual values of variables in y that have the same name as the
ones in x. If FALSE, NA values won't be updated, even if |
reportvar |
character: Name of reporting variable. Default is ".joyn". This is the same as the variable "_merge" in Stata after performing a merge. If FALSE or NULL, the reporting variable will be excluded from the final table, though a summary of the join will be displayed once it concludes. |
reporttype |
character: One of "factor", "character" or "numeric". Default is "factor", the first value listed in the usage above. If "numeric", the reporting variable will contain numeric codes of the source and the contents of each observation in the joined table. See below for more information. |
roll |
double: to be implemented |
keep_common_vars |
logical: If TRUE, it will keep the original variable from y when both tables have common variable names. Thus, the prefix "y." will be added to the original name to distinguish from the resulting variable in the joined table. |
sort |
logical: If TRUE, sort by key variables in |
verbose |
logical: if FALSE, it won't display any message (programmer's option). Default is TRUE. |
... |
Arguments passed on to
|
A data frame of the same class as x. The properties of the output
are as close as possible to the ones returned by the dplyr alternative.
Other dplyr alternatives: anti_join(), full_join(), inner_join(), left_join()
# Simple right join
library(data.table)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)
y1 = data.table(id = c(1, 2, 4),
                y  = c(11L, 15L, 16))
right_join(x1, y1, relationship = "many-to-one")
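The difference between update_NAs and update_values, as described in the arguments above, can be sketched on two plain vectors. This is a base R illustration of the documented rules only, not joyn's code; `x_val` and `y_val` are made-up stand-ins for a variable that has the same name in x and y after rows have been matched.

```r
# Made-up values of a common variable in x and y, aligned row by row.
x_val <- c(16, 12, NA, NA, 15)
y_val <- c(16, 17, 18, NA, 20)

# update_NAs-like rule: only fill the gaps in x with values from y.
only_nas <- ifelse(is.na(x_val), y_val, x_val)

# update_values-like rule: overwrite x with y wherever y is not NA,
# so NAs from y never replace actual values in x.
all_vals <- ifelse(is.na(y_val), x_val, y_val)

print(only_nas)
print(all_vals)
```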
This function is used to change the value of one or more joyn options
set_joyn_options(..., env = .joynenv)
... |
pairs of option = value |
env |
environment, which is joyn environment by default |
The new joyn options and their values, returned invisibly as a list
JOYn options functions
get_joyn_options()
joyn:::set_joyn_options(joyn.verbose = FALSE, joyn.reportvar = "joyn_status")
joyn:::set_joyn_options() # return to default options