Hi Gregg,
That is interesting. Maybe I did not understand exactly what you were trying to do. Here is what the logic in the stack does:
It counts the recurrences of a number pattern (word 1 of a line) that is followed by a number (word 2 of the line).
The first occurrence of a pattern is neither counted nor included in the sum of word 2. Only the recurrences of the pattern go into the count and the sum.
The average is the sum of word 2 over the recurrences divided by the number of recurrences, i.e. the total number of occurrences of the pattern minus 1.
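To make the three steps concrete, here is a minimal sketch in Python. It is not the script from the stack itself, only my reading of the logic described above. The function name recurrence_averages is made up for illustration. The description does not say what happens when a pattern occurs only once; the sketch then returns the single value, which is an assumption chosen so that a lone line such as 1,2,1 0.317183 keeps its value.

```python
def recurrence_averages(lines):
    """Average word 2 over the recurrences of each pattern (word 1).

    The first occurrence of a pattern is skipped: it is neither
    counted nor added to the sum.
    """
    first = {}  # pattern -> word 2 of the first occurrence
    count = {}  # pattern -> number of recurrences
    total = {}  # pattern -> sum of word 2 over the recurrences

    for line in lines:
        words = line.split()
        if len(words) < 2:
            continue  # ignore empty or incomplete lines
        pattern, value = words[0], float(words[1])
        if pattern not in first:
            # First occurrence: remember it, but do not count or sum it.
            first[pattern] = value
            count[pattern] = 0
            total[pattern] = 0.0
        else:
            count[pattern] += 1
            total[pattern] += value

    averages = {}
    for pattern in first:
        if count[pattern] > 0:
            averages[pattern] = total[pattern] / count[pattern]
        else:
            # Assumption: a pattern seen only once keeps its single value.
            averages[pattern] = first[pattern]
    return averages


data = """1,1,1 0.393107
1,1,1 0.425075
1,1,1 0.45005
1,1,1 -0.117383"""

for pattern, average in recurrence_averages(data.splitlines()).items():
    print(pattern, round(average, 6))  # prints: 1,1,1 0.252581
```

Keeping the count and the sum in separate dictionaries mirrors the three intermediate results (count, sum, average), so each one can be checked on its own.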
I tested with a very small subset of the type of data you indicated:
first:
1,1,1 0.393107
1,1,1 0.425075
1,1,1 0.45005
1,1,1 -0.117383
This gives me:
1,1,1 3
= three recurrences of pattern 1,1,1
and
1,1,1 0.757742
= the sum of the recurrences of pattern 1,1,1 is 0.757742
and
1,1,1 0.252581
= the average of the recurrences of pattern 1,1,1 is 0.252581
If I calculate this manually (spreadsheet) I get
sum of
0.425075
0.45005
-0.117383
is 0.757742
and the average is
0.757742 divided by three = 0.252580666666667
This looks ok if my assumptions are right.
Maybe you could test this with my test set:
1,1,1 0.393107
1,1,1 0.425075
1,1,1 0.45005
1,1,1 -0.117383
1,1,2 0.274226
1,1,2 0.409091
1,1,2 0.337163
1,2,1 0.317183
2,1,1 -0.215285
2,1,1 0.04046
2,1,1 -0.349151
2,1,2 -0.19031
2,1,2 -0.27023
2,2,1 -0.422078
2,2,2 -0.272228
For the test set my averages are (a pattern that occurs only once, such as 1,2,1, shows its single value):
1,1,1 0.252581
1,1,2 0.373127
1,2,1 0.317183
2,1,1 -0.154345
2,1,2 -0.27023
2,2,1 -0.422078
2,2,2 -0.272228
Please let me know what you get, so we can see what is going on.
I am glad it works for you, but I would also like to know where the difference in results comes from. I would have to check the logic of the algorithm again, but from my understanding of the problem it seems to work.
Kind regards
Bernd