Crosstab of every possible combination of columns in Excel or SPSS

1

For a research, I have the following dataset in Excel (simplified for illustration purposes):

Article   | Boolean1    Boolean2    Boolean3    Boolean4|
--------------------------------------------------------|
1         | Yes         No          Yes         No      |
2         | No          Yes         No          Yes     |
3         | Yes         No          No          Yes     |
4         | Yes         Yes         No          Yes     |
5         | Yes         No          No          No      |

I want to calculate the Jaccard Index for each possible combination of booleans, and thus first make a large crosstab worksheet that looks something like this:

             | Boolean2 No   Boolean2 Yes | Boolean3 No   Boolean3 Yes |
-----------------------------------------------------------------------|
Boolean1 No  |          0               1 |           1              0 |
Boolean1 Yes |          3               1 |           3              1 |
-----------------------------------------------------------------------|
Boolean4 No  |          2               0 |           1              3 |
Boolean4 Yes |          1               2 |           1              0 |

However, I want all combinations in such a table, so in this case boolean1-boolean2, boolean1-boolean3, boolean1-boolean4, boolean2-boolean3, boolean2-boolean4, and boolean3-boolean4 (1-1, 2-2, etc. I can filter out myself).

Now, this could be done by using the 'Crosstabs' on all combinations in SPSS or making a PivotTable in Excel for every possible combination; however, as I have 144 articles and 29 variables (all boolean), this would be quite time-consuming to copy to a separate worksheet. As PivotTables 'stack' the variables, putting all variables in the same rows/columns in a PivotTable also doesn't make sense. The answer that came closest to what I was looking for was this answer; however, it did not work for me on Excel 2016.

Is there any way to create a giant table containing all possible crosstabs more efficiently than copying manually to a new or existing worksheet in Excel, or perhaps in another tool like SPSS?

Duc

Posted 2018-11-06T15:45:53.310

Reputation: 11

Answers

0

I would do this with COUNTIFS .

First make the table heading rows and columns, which is fairly straightforward, in particular if the variables are numbered as in your example (Boolean1, Boolean2, ...), because you just use Autofill. Otherwise you have to do some string manipulation/extraction to get a nice table :

             |   Boolean2        Boolean2 |    Boolean3       Boolean3 |
             |         No             Yes |         No             Yes |
-----------------------------------------------------------------------|
Boolean1 No  |          X                 |                            |
Boolean1 Yes |                            |                            |
-----------------------------------------------------------------------|
Boolean4 No  |                            |                            |
Boolean4 Yes |                            |                            |

You also have to define your dataset as an Excel table (Ctrl-L) with header rows (I used Table1 as default name).

And in X (assuming this is cell C3) you put formula :
=COUNTIFS(INDIRECT("Table1["& C$1 &"]");C$2;(INDIRECT("Table1["& $A3 &"]"));$B3)
Which you of course can autofill across the entire area.

The Table1[Boolean2] refers to that column in your dataset (the INDIRECT just avoids that you have to hardcode the column reference and you can use the value in cell C1). So the overall formula count the instances where Table1[Boolean2] is "No" and Table1[Boolean1] is also "No" (using the two conditions in the COUNTIFS).

The output will be fully symmetric, so depending on how you want to calculate the Jaccard Index you have to only take the output below the diagonal (discarding the duplicates (N,M) & (M,N), and the trivial (N,N) ones). If you want a clean output at once, you can of course add an additional IF to only calculate the number when the variable index in the row his greater than the variable index in the column. You can probably also do this in the same formula, but for visibility, I would just add an additional row and column with the index number of the variable (from 1 to 29), and test on these.

Peter K.

Posted 2018-11-06T15:45:53.310

Reputation: 282

Dear Peter, thank you for your response. However, the formula doesn't work as it gives the popup of "not trying to type a formula?". I presume there's just a syntax error in there, as the INDIRECT reference doesn't seem to work; however, the formula is too syntactically difficult for me to understand exactly what is going on (sorry, I'm a noob). – Duc – 2018-11-13T11:27:29.353

Depending on your region, you have to replace the ; (semicolon) with a , (comma). Other than that, which part of the formula do you not understand ? – Peter K. – 2018-11-13T13:02:35.883