Title: | Matrix Profile for R |
---|---|
Description: | These are the core functions needed by the 'tsmp' package. The low-level, carefully checked mathematical functions are here. These are implementations of the Matrix Profile concept that was created by CS-UCR <http://www.cs.ucr.edu/~eamonn/MatrixProfile.html>. |
Authors: | Francisco Bischoff [aut, cre] , Michael Yeh [res, ccp, ctb] , Diego Silva [res, ccp, ctb] , Yan Zhu [res, ccp, ctb] , Hoang Dau [res, ccp, ctb] , Michele Linardi [res, ccp, ctb] |
Maintainer: | Francisco Bischoff <[email protected]> |
License: | GPL-3 |
Version: | 0.1.7.9000 |
Built: | 2024-10-13 04:48:53 UTC |
Source: | https://github.com/matrix-profile-foundation/matrixprofiler |
This algorithm uses a rolling window to compute the distance throughout the whole data. This means that the
minimum distance found is the motif and the maximum distance is the discord on that time series. Attention:
you first need to create an object using mass_pre(). Read below.
mass(pre_obj, data, query = data, index = 1, version = c("v3", "v2"), n_workers = 1)
mass_pre(data, window_size, query = NULL, type = c("normalized", "non_normalized", "absolute", "weighted"), weights = NULL)
pre_obj | Required. The object resulting from mass_pre().
data | Required. Any 1-dimension series of numbers.
query | Optional. Accepts the same types as data; defaults to data itself.
index | An integer. The index of the rolling window inside query that will be used as the query window. Defaults to 1.
version | A string. The version of the MASS algorithm to use: "v3" (default) or "v2".
n_workers | An integer. The number of threads used for computing. Defaults to 1.
window_size | Required. An integer defining the rolling window size.
type | A string. This changes how the MASS algorithm will compare the rolling window and the data: "normalized" (default), "non_normalized", "absolute" or "weighted". (See details.)
weights | Optional. Used when type is "weighted"; a vector of weights with the same length as window_size.
There are currently four ways to compare the window with the data (a short sketch illustrating them follows this list):
normalized: this normalizes both the data and the query window. This is the most frequently used.
non_normalized: this won't normalize the query window. The data is still normalized.
absolute: this won't normalize either the data or the query window.
weighted: this normalizes the data and the query window, and also applies a weight vector to the query.
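As a minimal sketch, the options can be tried side by side on the bundled motifs_discords_small dataset; the triangular weight vector below is purely illustrative and assumes weights must have the same length as window_size:
w <- 50
weights <- 1 - abs(seq(-1, 1, length.out = w)) # hypothetical triangular weights
pre_norm <- mass_pre(motifs_discords_small, w, type = "normalized")
pre_wgt <- mass_pre(motifs_discords_small, w, type = "weighted", weights = weights)
dp_norm <- mass(pre_norm, motifs_discords_small)
dp_wgt <- mass(pre_wgt, motifs_discords_small)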
mass() returns a list with the distance_profile and the last_product, the latter being useful only for computing the Matrix Profile.
mass_pre() returns a list with several precomputations to be used by MASS later. Attention: this must be called before mass().
pre <- mass_pre(motifs_discords_small, 50)
dist_profile <- mass(pre, motifs_discords_small)
This package is derived from the former package tsmp. It is intended to make a clear separation between the
Matrix Profile computation itself and the data mining processes that can be built on top of it.
The Matrix Profile has the potential to revolutionize time series data mining because of its generality, versatility, simplicity and scalability. In particular, it has implications for time series motif discovery, time series joins, shapelet discovery (classification), density estimation, semantic segmentation, visualization, rule discovery, clustering, etc.
This package uses RcppParallel in order to do multithreaded computations. By default it uses the 'TBB' backend.
If for any reason you want to change the backend to 'tinythread', you may use
Sys.setenv(RCPP_PARALLEL_BACKEND = "tinythread"). To configure back to 'TBB', use
Sys.setenv(RCPP_PARALLEL_BACKEND = "tbb").
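As a minimal sketch, assuming the bundled motifs_discords_small dataset and any of the multithreaded functions below (here stamp()):
Sys.setenv(RCPP_PARALLEL_BACKEND = "tinythread") # switch backend before computing
mp <- stamp(motifs_discords_small, 50, n_workers = 2)
Sys.setenv(RCPP_PARALLEL_BACKEND = "tbb") # restore the default 'TBB' backend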
Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, et al. Matrix profile I: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. Proc - IEEE Int Conf Data Mining, ICDM. 2017;1317-22.
Zhu Y, Imamura M, Nikovski D, Keogh E. Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. Knowl Inf Syst. 2018 Jun 2;1-27.
Zhu Y, Zimmerman Z, Senobari NS, Yeh CM, Funning G. Matrix Profile II : Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. Icdm. 2016 Jan 22;54(1):739-48.
Website: http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
motifs_discords_small
A synthetic dataset for testing: a vector with 875 observations.
These functions do not handle NA values
mov_mean(data, window_size, type = c("ogita", "normal", "weighted", "fading"), eps = 0.9)
mov_var(data, window_size, type = c("ogita", "normal", "weighted", "fading"), eps = 0.9)
mov_sum(data, window_size, type = c("ogita", "normal", "weighted", "fading"), eps = 0.9)
mov_max(data, window_size)
mov_min(data, window_size)
mov_std(data, window_size, rcpp = TRUE)
movmean_std(data, window_size, rcpp = TRUE)
muinvn(data, window_size, n_workers = 1)
zero_crossing(data, window_size)
data | A vector of numbers.
window_size | An integer. The size of the rolling window.
type | A string. The algorithm used to compute the result: "ogita" (default), "normal", "weighted" or "fading". (See details.)
eps | A numeric. The weight/fading factor used by the "weighted" and "fading" types. Defaults to 0.9.
rcpp | A logical. If TRUE (default), uses the Rcpp (C++) implementation.
n_workers | An integer. The number of threads used for computing. Defaults to 1.
Some functions may use different algorithms to compute the results. The available types are listed below (a short sketch comparing them follows this list):
ogita: This is the default. It uses Ogita et al.'s Accurate Sum and Dot Product for precision. It is not the fastest algorithm, but the extra time spent is worth the guaranteed precision.
normal: This uses the cumsum method, which is faster but unreliable in some situations (references pending).
weighted: This uses Rodrigues P. et al.'s algorithm, which applies a weighted window for online purposes. The eps argument controls the factor. (The function is not online yet.)
fading: This also uses Rodrigues P. et al.'s algorithm, in this case with a fading factor, also for online purposes. The eps argument controls the factor. (The function is not online yet.)
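A minimal sketch comparing the types on the bundled dataset (the "weighted" and "fading" variants take the eps factor described above):
m_ogita <- mov_mean(motifs_discords_small, 50) # default, precise
m_normal <- mov_mean(motifs_discords_small, 50, type = "normal") # cumsum, faster
m_fading <- mov_mean(motifs_discords_small, 50, type = "fading", eps = 0.9)
all.equal(m_ogita, m_normal) # expected to agree up to floating-point error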
Another important detail is that the standard deviation used in all computations is the population SD (i.e.,
divided by n), not the sample SD (i.e., divided by n - 1). That is why we also provide internally the
:::std() function, which computes the population SD, unlike stats::sd(), which computes the sample SD.
Furthermore, movmean_std() shall be used when you need both results in one computation; it is faster than calling
mov_mean() followed by mov_std(). Finally, muinvn() is kept as-is for historical reasons, as it is the
function used by mpx(). It returns the sig (stable inverse centered norm) instead of the std (sig equals
1 / (std * sqrt(window_size))).
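A minimal sketch of that relation, assuming the returned lists use the element names avg and sig:
w <- 50
res <- muinvn(motifs_discords_small, w)
sd_pop <- mov_std(motifs_discords_small, w)
all.equal(res$sig, 1 / (sd_pop * sqrt(w))) # sig == 1 / (std * sqrt(window_size))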
mov_mean() returns a vector with the moving average.
mov_var() returns a vector with the moving variance.
mov_sum() returns a vector with the moving sum.
mov_max() returns a vector with the moving maximum.
mov_min() returns a vector with the moving minimum.
mov_std() returns a vector with the moving standard deviation.
movmean_std() returns a list with vectors of the moving avg, sd, sig, sum and sqrsum.
muinvn() returns a list with vectors of the moving avg and sig.
zero_crossing() returns a vector with the number of times the data crossed the 'zero' line inside a rolling window.
mov <- mov_mean(motifs_discords_small, 50)
mov <- mov_var(motifs_discords_small, 50)
mov <- mov_sum(motifs_discords_small, 50)
mov <- mov_max(motifs_discords_small, 50)
mov <- mov_min(motifs_discords_small, 50)
mov <- mov_std(motifs_discords_small, 50)
mov <- movmean_std(motifs_discords_small, 50)
mov <- muinvn(motifs_discords_small, 50)
zero_cross <- zero_crossing(motifs_discords_small, 50)
STAMP computes the best-so-far Matrix Profile and Profile Index for univariate time series.
STOMP is a faster implementation, with the caveat that it is not anytime, unlike STAMP or SCRIMP.
SCRIMP is a faster implementation, like STOMP, but with the ability to return anytime results like STAMP.
MPX is by far the fastest implementation, with the caveat that it is not anytime, unlike STAMP or SCRIMP.
stamp(data, window_size, query = NULL, exclusion_zone = 0.5, s_size = 1, n_workers = 1, progress = TRUE)
stomp(data, window_size, query = NULL, exclusion_zone = 0.5, n_workers = 1, progress = TRUE)
scrimp(data, window_size, query = NULL, exclusion_zone = 0.5, s_size = 1, pre_scrimp = 0.25, n_workers = 1, progress = TRUE)
mpx(data, window_size, query = NULL, exclusion_zone = 0.5, s_size = 1, idxs = TRUE, distance = c("euclidean", "pearson"), n_workers = 1, progress = TRUE)
data | Required. Any 1-dimension series of numbers.
window_size | Required. An integer defining the rolling window size.
query | Optional. Another series of the same type as data; when provided, an AB-join is computed (see details). Not yet available for every algorithm above.
exclusion_zone | A numeric. Defines the size of the area around the rolling window that will be ignored to avoid trivial matches. Default is 0.5 (i.e., half of window_size).
s_size | A numeric. Used on anytime algorithms (stamp, scrimp, mpx) if only part of the computation is needed. Default is 1 (the full computation).
n_workers | An integer. The number of threads used for computing. Defaults to 1.
progress | A logical. If TRUE (default), shows a progress bar.
pre_scrimp | A numeric. If not zero, pre-scrimp is computed first, using a fraction of the data. Default is 0.25.
idxs | A logical (mpx() only). If TRUE (default), the Profile Index is also computed.
distance | A string (mpx() only). The distance measure to use: "euclidean" (default) or "pearson".
The Matrix Profile has the potential to revolutionize time series data mining because of its generality, versatility, simplicity and scalability. In particular, it has implications for time series motif discovery, time series joins, shapelet discovery (classification), density estimation, semantic segmentation, visualization, rule discovery, clustering, etc.
About progress: it is highly recommended as feedback for long computations. It does add some
(negligible) overhead, but the benefit of knowing that your computer is still computing is much bigger than the
seconds you may lose in the final benchmark. About n_workers: on Windows systems this package uses TBB for
multithreading, while Linux and macOS use TinyThread++. This may or may not raise issues in the future, so be
aware of slower processing due to different mutex implementations or even unexpected crashes; the Windows version
is usually more reliable. The data and query parameters will be internally converted to a single vector using
as.numeric(), so bear in mind that a multidimensional matrix may not work as you expect, but most 1-dimensional
data types will work normally. If query is provided, expect the same pre-processing done for data; in addition,
exclusion_zone will be ignored and set to 0. data and query don't need to have the same size, and they
can be interchanged if both are provided; the difference will be in the returned object. An AB-join returns the Matrix
Profiles 'A' and 'B', i.e., the distance between a rolling window from query to data and from data to query.
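A minimal sketch of such an AB-join; the shorter query below is just an illustrative slice of the same dataset:
query <- motifs_discords_small[1:300] # hypothetical second series
mp_ab <- stomp(motifs_discords_small, 50, query = query) # exclusion_zone is set to 0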
The anytime STAMP computes the Matrix Profile and Profile Index in such a manner that it can be stopped before its complete calculation and return the best-so-far results, allowing ultra-fast approximate solutions.
STOMP uses a faster implementation to compute the Matrix Profile and Profile Index. It can be stopped earlier by
the user, but the result is not considered anytime, just incomplete. For an anytime algorithm, use stamp() or
scrimp().
The SCRIMP algorithm is the anytime solution for STOMP. It is as fast as STOMP but allows the user to cancel the computation and get an approximation of the final result. This implementation uses the SCRIMP++ code, which means it first computes the pre-scrimp stage (a very fast and good approximation) and then keeps improving with scrimp. The exception is when using multithreading, which skips the pre-scrimp stage.
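A minimal sketch of an approximate (anytime) run, assuming s_size is the fraction of the computation to perform:
mp_approx <- scrimp(motifs_discords_small, 50, pre_scrimp = 0.25, s_size = 0.5)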
MPX was developed apart from the main Matrix Profile branch, which relies on the Fast Fourier Transform (FFT) in at least one part of the process. This algorithm doesn't use the FFT at all and is several times faster. It also relies on Ogita's work for more precise computation of the mean and standard deviation (part of the process).
Returns a list with the matrix_profile, the profile_index (if idxs is TRUE in mpx()), and some
information about the settings used to build it, such as ez, and partial when the algorithm was finished early.
Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, et al. Matrix profile I: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. Proc - IEEE Int Conf Data Mining, ICDM. 2017;1317-22.
Zhu Y, Imamura M, Nikovski D, Keogh E. Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. Knowl Inf Syst. 2018 Jun 2;1-27.
Zhu Y, Zimmerman Z, Senobari NS, Yeh CM, Funning G. Matrix Profile II : Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. Icdm. 2016 Jan 22;54(1):739-48.
Website: http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
mass() for the underlying algorithm that finds the best match of a query.
mpxab() for the forward and reverse join-similarity.
mp <- stamp(motifs_discords_small, 50)
mp <- stomp(motifs_discords_small, 50)
mp <- scrimp(motifs_discords_small, 50)
mp <- mpx(motifs_discords_small, 50)
znorm(): Normalizes data to mean zero and standard deviation one.
ed_corr(): Converts Euclidean distances into correlation values.
corr_ed(): Converts correlation values into Euclidean distances.
mode(): Returns the most common value from a vector of integers.
std(): Population SD; R always calculates with n - 1 (sample), here we fix it to n (population).
normalize(): Normalizes data to be between min and max.
complexity(): Computes the complexity index of the data.
binary_split(): Creates a vector with the indexes of a binary split.
znorm(data, rcpp = TRUE)
ed_corr(data, w, rcpp = TRUE)
corr_ed(data, w, rcpp = TRUE)
mode(x, rcpp = FALSE)
std(data, na.rm = FALSE, rcpp = TRUE)
normalize(data, min_lim = 0, max_lim = 1, rcpp = FALSE)
complexity(data)
binary_split(n, rcpp = TRUE)
data | A vector of numbers.
rcpp | A logical. If TRUE, uses the Rcpp (C++) implementation.
w | An integer. The window size.
x | A vector of integers.
na.rm | A logical. If TRUE, removes NA values before computing.
min_lim | A number. The minimum value of the normalized output. Defaults to 0.
max_lim | A number. The maximum value of the normalized output. Defaults to 1.
n | An integer. The size of the vector to be split.
znorm(): Returns the normalized data.
ed_corr(): Returns the values converted from Euclidean distances to correlation values.
corr_ed(): Returns the values converted from correlation values to Euclidean distances.
mode(): Returns the most common value from a vector of integers.
std(): Returns the standard deviation corrected from sample to population.
normalize(): Returns the data normalized between min and max.
complexity(): Returns the complexity index of the data provided (normally a subset).
binary_split(): Returns a vector with the binary split indexes.
normalized <- znorm(motifs_discords_small)
fake_data <- c(rep(3, 100), rep(2, 100), rep(1, 100))
correlation <- ed_corr(fake_data, 50)
fake_data <- c(rep(0.5, 100), rep(1, 100), rep(0.1, 100))
euclidean <- corr_ed(fake_data, 50)
fake_data <- c(1, 1, 4, 5, 2, 3, 1, 7, 9, 4, 5, 2, 3)
mode <- mode(fake_data)
fake_data <- c(1, 1.4, 4.3, 5.1, 2, 3.6, 1.24, 2, 9, 4.3, 5, 2.1, 3)
res <- std(fake_data)
fake_data <- c(1, 1.4, 4.3, 5.1, 2, 3.6, 1.24, 1, 9, 4.3, 5, 2.1, 3)
res <- normalize(fake_data)
fake_data <- c(1, 1.4, 4.3, 5.1, 2, 3.6, 1.24, 8, 9, 4.3, 5, 2.1, 3)
res <- complexity(fake_data)
fake_data <- c(10)
res <- binary_split(fake_data)