Trading strategy: Making the most of the out of sample data

https://www.r-bloggers.com/trading-strategy-making-the-most-of-the-out-of-sample-data/


August 19, 2016
(This article was first published on R – The R Trader, and kindly contributed to R-bloggers)

In the chart below the blue area represents the out of sample performance for one of my strategies.

[Figure: system1Performance]

A simple visual inspection reveals a good fit between the in and out of sample performance, but what degree of confidence do I have in this? At this stage not much, and this is the issue. What is truly needed is a measure of similarity between the in and out of sample data sets. In statistical terms this could be translated as the likelihood that the in and out of sample performance figures come from the same distribution. There is a non-parametric statistical test that does exactly this: the Kruskal-Wallis test. A good definition of this test can be found on R-Tutor: “A collection of data samples are independent if they come from unrelated populations and the samples do not affect each other. Using the Kruskal-Wallis Test, we can decide whether the population distributions are identical without assuming them to follow the normal distribution.” The added benefit of this test is that it does not assume a normal distribution.
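As a minimal sketch of what such a test looks like in R, the snippet below runs `kruskal.test()` from the base `stats` package on two synthetic samples. The vectors `inSample` and `outSample` are made up for illustration and are not the strategy data discussed in this post.

```r
# Synthetic figures for illustration only (hypothetical, not the author's data)
set.seed(42)
inSample  <- rnorm(500, mean = 0.05, sd = 1)  # stands in for in sample figures
outSample <- rnorm(120, mean = 0.05, sd = 1)  # stands in for out of sample figures

# kruskal.test() accepts a list of numeric vectors, one per group
kw <- kruskal.test(list(inSample, outSample))

kw$statistic  # Kruskal-Wallis chi-squared statistic
kw$parameter  # degrees of freedom (number of groups minus one)
kw$p.value    # a large p-value means no evidence of a difference between groups
```

Because the test works on ranks, it needs no assumption about the shape of the underlying distribution, which is the property the quote above refers to.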

There are other tests of the same nature that could fit into this framework. The Mann-Whitney-Wilcoxon test or the Kolmogorov-Smirnov test would suit the framework described here perfectly; however, it is beyond the scope of this article to discuss the pros and cons of each of these tests. A good description along with R examples can be found here.
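For reference, both alternatives are available in base R as `wilcox.test()` and `ks.test()`. The sketch below applies them to two synthetic samples; the vector names and the data are assumptions made for illustration only.

```r
# Synthetic figures for illustration only (hypothetical, not the author's data)
set.seed(1)
inSample  <- rnorm(500)  # stands in for in sample figures
outSample <- rnorm(120)  # stands in for out of sample figures

# Mann-Whitney-Wilcoxon rank-sum test (two independent samples)
mw <- wilcox.test(outSample, inSample)

# Two-sample Kolmogorov-Smirnov test (compares the empirical distribution functions)
ks <- ks.test(outSample, inSample)

c(wilcoxon = mw$p.value, ks = ks$p.value)
```

Either call could replace `kruskal.test()` in the resampling framework described in this post, since each one returns an object with a `p.value` element.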

Here’s the code used to generate the chart above and the analysis:

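################################################
## Making the most of the OOS data
##
## thertrader@gmail.com - Aug. 2016
################################################
library(xts)
library(PerformanceAnalytics)

thePath <- "myPath" # change this
theFile <- "data.csv"
# file.path() inserts the path separator that paste0() would omit
data <- read.csv(file.path(thePath, theFile), header = TRUE, sep = ",")
# column 1: dates (dd/mm/yyyy), column 2: the strategy's performance figures
data <- xts(data[, 2], order.by = as.Date(as.character(data[, 1]), format = "%d/%m/%Y"))

##----- Strategy's Chart
par(mex = 0.8, cex = 1)
thePeriod <- c("2012-02/2016-05") # period shaded on the chart
chart.TimeSeries(cumsum(data),
 main = "System 1",
 ylab = "",
 period.areas = thePeriod,
 grid.color = "lightgray",
 period.color = "slategray1")

##----- In and out of sample data
## Assumption (isData and osData are not defined in the listing):
## the shaded period is the out of sample window and every other
## observation is in sample.
osData <- as.numeric(data[thePeriod])
isData <- as.numeric(data[!(index(data) %in% index(data[thePeriod]))])

##----- Kruskal tests
nSim <- 1000 # number of random in sample subsets
pValue <- numeric(nSim)
for (i in seq_len(nSim)) {
 # random in sample subset with the same length as the out of sample data
 isSample <- sample(isData, length(osData))
 pValue[i] <- kruskal.test(list(osData, isSample))$p.value
}

##----- Mean of p-values
mean(pValue)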

In the example above the in sample period is longer than the out of sample period, therefore I randomly created 1000 subsets of the in sample data, each of them having the same length as the out of sample data. Then I tested each in sample subset against the out of sample data and recorded the p-values. This process creates not a single p-value for the Kruskal-Wallis test but a distribution, making the analysis more robust. In this example the mean of the p-values (0.478) is well above the usual significance levels, indicating that the null hypothesis cannot be rejected: the test finds no evidence that the in and out of sample data come from different distributions.
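Since the procedure yields a whole distribution of p-values, it may be worth looking at more than the mean. The sketch below is one possible way to do that, using synthetic stand-ins for `isData` and `osData` (the data, the 5% threshold, and the choice of summaries are assumptions for illustration, not part of the original analysis).

```r
# Synthetic stand-ins for the in and out of sample figures (hypothetical data)
set.seed(123)
isData <- rnorm(1000)
osData <- rnorm(250)

nSim <- 1000  # number of random in sample subsets

# One Kruskal-Wallis p-value per random in sample subset
pValue <- replicate(nSim, {
  isSample <- sample(isData, length(osData))
  kruskal.test(list(osData, isSample))$p.value
})

summary(pValue)                        # spread of the p-values
quantile(pValue, c(0.05, 0.50, 0.95))  # selected quantiles
mean(pValue < 0.05)                    # share of subsets rejected at the 5% level
```

A small share of rejections alongside a mean p-value far from zero would point in the same direction as the result reported above. Note that the subsets overlap and all share the same out of sample data, so the p-values are not independent of each other.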

As usual, what is presented in this post is a toy example that only scratches the surface of the problem and should be tailored to individual needs. However, I think it proposes an interesting and rational statistical framework to evaluate out of sample results.