How to build a Sankey diagram in Tableau without any data prep beforehand

I first published this blog post here: https://til.bi/2oZrRD2

First of all I want to say this post is hugely indebted to Olivier Catherin and his post from four years ago on building Sankeys in Tableau, to Jeff Schaffer, whose work Olivier built on, and to countless others I probably don’t even realise.

This post sets out how to build a Sankey Diagram without any data prep before Tableau. The viz below is built off the vanilla Superstore data packaged with Tableau, but I must warn you there’s a labyrinth of table calculations to get to this point.

There are 20 calculations to get through to reach this stage, so what makes it worth it? All previous solutions I’ve seen for building Sankeys have needed to multiply the size of the data by at least a factor of two beforehand. Some data sets are too large for that, and sometimes we work with sources we don’t own, such as published data sources on Tableau Server, where reshaping isn’t an option. So if we can work off the vanilla data set, it’s worth the effort.

So without further ado let’s get going.

Step 1: Dimension and measure parameterisation

Set up three calculations: two for your dimensions and one for your measure. Referencing these is useful down the line, because if you want to change which dimensions are in the view part way through building, you can edit the calcs rather than fiddling with “replace references”. In my example I used the calculations below, where [Select Dimension 1] and [Select Dimension 2] are integer parameters I made to fit the case statements:

Dimension 1

CASE [Select Dimension 1]
WHEN 1 THEN [Region]
WHEN 2 THEN [Category]
WHEN 3 THEN [Sub-Category]
WHEN 4 THEN [Segment]
WHEN 5 THEN [Ship Mode]
END

Dimension 2

CASE [Select Dimension 2]
WHEN 1 THEN [Region]
WHEN 2 THEN [Category]
WHEN 3 THEN [Sub-Category]
WHEN 4 THEN [Segment]
WHEN 5 THEN [Ship Mode]
END

Chosen Measure

[Sales]

Step 2: Create a frame for data densification

Comparing our measure to the fixed minimum of the measure guarantees two different values (0 and 97 below) to hang our data densification from.

Path Frame

IF [Chosen Measure] = {FIXED : MIN([Chosen Measure])} THEN 0 ELSE 97 END

Path Frame (bin)

For this, right-click Path Frame, choose Create > Bins and set the size of the bins to 1.

 

Step 3: Index

Path Index

INDEX()

This is computed along Path Frame (bin) and allows us to do calculations across the Path Frame range.

Step 4: Sigmoid set up

First we create a variable, T, which sweeps from -6 to 6 over the first half of the path and back from 6 to -6 over the second half; the sigmoid curve is then calculated from it.

T

IF [Path Index] < 50
THEN (([Path Index]-1)%49)/4-6
ELSE 12 - (([Path Index]-1)%49)/4-6
END
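To see what this is doing, note that the densified Path Frame (bin) axis runs from 0 to 97, so Path Index runs from 1 to 98. Substituting a few values in:

Path Index 1 gives T = -6 and Path Index 49 gives T = 6 (the first pass along the curve)
Path Index 50 gives T = 6 and Path Index 98 gives T = -6 (the second pass back along it)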

Sigmoid

1/(1+EXP(1)^-[T])
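Since EXP(x) returns e raised to the power of x, the same curve can also be written a little more directly as:

1/(1+EXP(-[T]))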

Step 5: Sankey arm sizing

This gives us the size of each Sankey arm as a proportion of the total of the chosen measure.

Sankey Arm Size

SUM([Chosen Measure])/TOTAL(SUM([Chosen Measure]))

Step 6: Top line calculations

This method requires separate lines for the top and bottom of each Sankey arm. Throughout these calculations, “Position 1” refers to Dimension 1 on the left-hand side and “Position 2” to Dimension 2 on the right. Here are the calculations for the tops:

Max Position 1

RUNNING_SUM([Sankey Arm Size])

Max Position 1 Wrap

WINDOW_SUM([Max Position 1])

Max Position 2

RUNNING_SUM([Sankey Arm Size])

Max Position 2 Wrap

WINDOW_SUM([Max Position 2])

Step 7: Bottom line calculations

These calculations help generate the bottom lines:

Max for Min Position 1

RUNNING_SUM([Sankey Arm Size])

Min Position 1

RUNNING_SUM([Max for Min Position 1])-[Sankey Arm Size]

Min Position 1 Wrap

WINDOW_SUM([Min Position 1])

Max for Min Position 2

RUNNING_SUM([Sankey Arm Size])

Min Position 2

RUNNING_SUM([Max for Min Position 2])-[Sankey Arm Size]

Min Position 2 Wrap

WINDOW_SUM([Min Position 2])

Step 8: Sankey polygon calculation

This calculation brings together all of the above once it is set up correctly; the table calculation setup itself comes in later steps. For the first 49 path indices it traces the bottom edge of each arm, easing between the two Min positions with the sigmoid, and for the remaining indices it traces the top edge back in the other direction, closing the polygon.

Sankey Polygons

IF [Path Index] > 49
THEN [Max Position 1 Wrap]+([Max Position 2 Wrap]-[Max Position 1 Wrap])*[Sigmoid]
ELSE [Min Position 1 Wrap]+([Min Position 2 Wrap]-[Min Position 1 Wrap])*[Sigmoid]
END

Step 9: Prepare the sheet

Put Path Frame (bin) onto Rows, then right-click it and select “Show Missing Values” as below.

 

This should give you something like below. This is the step that enables our data densification within Tableau.

Then move Path Frame (bin) to Detail and add Dimension 1 and Dimension 2 to Detail too. Next, put [T] on Columns and compute it using Path Frame (bin). Finally, change the mark type to Polygon, add [Path Index] to Path and compute it using Path Frame (bin) as well. This should leave you with something like this:

Step 10: Add Sankey polygons

Add [Sankey Polygons] to Rows. This is going to look a bit messy until it is calculated correctly across all nested calculations.

This first image shows all of the nested calculations involved, and we need to go through each one, setting it up correctly.

Below I’ve shown the settings for each of the 12 nested calculations needed to make this work. If it doesn’t look right at the end, make sure not only that the correct fields are ticked in each box but also that they’re ordered correctly.

 

Now hopefully you have something that looks like this! (If not, go back and check each table calculation step carefully.)

Step 11: Dashboarding

Now we’re ready to put this onto a dashboard with stacked bars on either side for Dimensions 1 and 2 to complete our Sankey.

This graph is filterable if you want it to be; just make sure any filters are added to context so that they are applied before the FIXED LOD calculation in Tableau’s order of operations. Otherwise the filters might knock out the Path Frame calculation that all of this hinges upon.

 

I hope you’ve made it to the end and have been able to replicate this Sankey. Thanks for reading 🙂

How Do I Prove Tableau Online is in Dublin?

I first published this blog post here: https://til.bi/2ojbot4

When we sign up for Tableau Online we get a choice of three locations to host our site: the west coast of America, the east coast of America, or Dublin, Ireland if we’re in the EU. So how do we prove which one we’re connecting to?

When we visit Tableau Online in a browser, or viz against extracts, we always hit Tableau’s load balancer, and the IPs from that can point to a variety of locations in Northern Europe, none of which are the safelisted ones in Tableau’s documentation. What we need is a live connection where we can check the IP address of the requester.

At The Information Lab we have a hosted Microsoft SQL Server where we hold many of our larger demo data sets, and SQL Server is one of the many live connections Tableau Online supports. If we connect to this live data source through Tableau Online, the request will come from one of Tableau’s safelisted IP addresses, and we can use the SQL Server’s system tables to see where the request came from.

To do this with our Microsoft SQL Server we can use this custom SQL code to return the time of connection and the client’s IP address.

Here it is in text form so you can copy and paste it:

SELECT connect_time, client_net_address
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;

This gives us two fields with a single row of data about our connection. When we connect from Desktop it will show our own location, but once published to Online and viewed in the browser, the request will come from Tableau itself.

Here we can see the connection time and the requesting IP address to the SQL Server.

Tableau Online Connection

This address looks familiar. It’s one of Tableau’s safelisted IP addresses! And if we look it up using an IP locator, it does indeed tell us it’s in Dublin, Ireland.

Thanks for reading 🙂

How to Create Template Project Folders with Alteryx – Cleaning Up my Computer Part 2

I first published this blog post here:  https://til.bi/2oijsdD

I’m lazy and inconsistent. When I’m starting new data projects it’s usually on the fly. I get some data, I spin up an Alteryx workflow or two, save a handful of parsed .yxdb and .tde files and maybe after that I have a Tableau workbook or two connecting to these .tdes. It’s usually only after all this that I create a folder to box this content and sometimes I partition it into sub folders and sometimes I don’t. To stay organised, I needed to create a tool that makes setting up projects ludicrously easy.

This Alteryx app I’ve made creates a new project folder in a given location, filled with a set of pre-determined sub-folders, and it also creates a template Alteryx workflow to use in the Alteryx Workflows sub-folder. It’s on the public gallery here and is free to download and reconfigure however you please, including the template workflow that’s packaged with it.

https://gallery.alteryx.com/#!app/Create-New-Project/598b1cf3effc2a0f60df8ce0

Like my previous workflow for deleting files with Alteryx, this app utilises Windows Command Prompt to create folders in your directory and also to write a template Alteryx workflow line by line. The workflow has three primary action streams unioned together that are used to create and run a single batch file.

Stage 1: Set the Project Location and Name

Stage 1

This section has boilerplate Text Inputs that are edited with the app interface tools. The first column, “Action”, contains Command Prompt commands: ‘cd’ navigates to a folder and ‘md’ makes a folder in your current location. So line 1 navigates to where our project will be saved, line 2 makes the project folder, line 3 navigates inside the project folder, and lines 4 onward follow a similar pattern of making folders and navigating deeper (a sketch of the generated commands is shown after the formula below). At the moment my default sub-folders are set to ‘Data’, ‘Alteryx Workflows’, ‘Tableau Workbooks’ and ‘Gallery & Documentation’, and these can easily be changed by editing the respective lines in the Text Input. Within Alteryx Workflows there is another folder for Archived Workflows.

Text Input to Create and Navigate Through New Folders

The Formula tool after the Text Input uses the following code to combine the Action and Target fields into a command that Command Prompt can understand, with a space between them and the target wrapped in quotation marks:

[Action] + ' "' + [Target] + '"'
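To make that concrete, here’s a rough sketch of what the assembled commands could look like for a project called “New Project” saved under C:\Projects (both are placeholder names supplied through the app interface, and the exact lines depend on how the Text Input is configured):

cd "C:\Projects"
md "New Project"
cd "New Project"
md "Data"
md "Tableau Workbooks"
md "Gallery & Documentation"
md "Alteryx Workflows"
cd "Alteryx Workflows"
md "Archived Workflows"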

Stage 2: Write a Template Workflow

I’ve created a template workflow that looks like this:

My Template Workflow

This workflow is packaged with the app on the gallery and can be edited however we want, as the app simply reads in our template and rewrites its XML elsewhere. It has all of my key documentation laid out already, and once it’s open I can just double-click on any text to edit it. At clients I can change up the colour scheme and add internal branding images as we see fit for them. I love clear and consistent documentation, and this nudges me and other users into setting up workflows with a common structure everyone can understand. For personal work I also love the board-style to-do notes at the top for picking up workflows the next morning when I’ve forgotten what I did the day before.

To write this template to our new project, we read in its XML then prepare it to be written line by line by the command prompt.

Stage 2

For the XML input of the workflow, we read the .yxmd file as a .csv with no delimiter (\0) to ensure each line of text comes through as a new row in the field “Field_1”, with “First Row Contains Field Names” unchecked.

Bringing .yxmd XML into Alteryx

After this, we use regex to throw an escape caret (^) in front of each of these characters ^&<>\| so Command Prompt treats them as literal text. Then the Formula tool wraps each line in the following code:

"echo " + [Field_1] + " >> " +'"Workflow1.yxmd"'

Using echo [text] >> file.format, Command Prompt appends new lines to a text file one by one. One of the great things about this syntax is that if the file doesn’t exist yet, Windows creates it, so the first row of code doesn’t need any special treatment. And by saving the text file as .yxmd, Windows knows to treat it as an Alteryx workflow when you go to open it.
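As a rough illustration, the first couple of generated commands could look something like this (the XML snippets are placeholders rather than the real template content, and “Workflow1.yxmd” is the file name from the formula above):

echo ^<?xml version="1.0"?^> >> "Workflow1.yxmd"
echo ^<AlteryxDocument yxmdVer="2018.1"^> >> "Workflow1.yxmd"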

Stage 3: Open the Project Folder

Stage 3

This is just two lines to add a nice final touch to the code. ‘cd ..’ navigates up one level to take us back to the project folder, and then ‘start .’ opens that folder in File Explorer. If we had created deeper sub-folders we would need more ‘cd ..’ commands to move up further levels.
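In batch form this final stage is simply:

cd ..
start .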

Text Input to Open the Parent Folder

Bringing the Code Together

Now we’ve got the three blocks of code, we need to bring them together into one batch file. The Union tool brings the streams together on the field “Command”, and then we can concatenate them with a line-break separator (\n) using the Summarize tool. This gives us one big string blob in a single cell that we can write to a batch file.

Summarize Configuration

Setting Up the Batch File

As in Part 1 of this series, we need to trick Alteryx into thinking it is writing a .csv file and replace the extension with .bat.

Below is how we want to configure the Run Command tool: it writes a .bat file to an output location and then runs it immediately. Whatever file path we write our .bat file to has to be the same file path that we enter in the Command field just below.

Run Command Config Part 1

Within the Output configuration, we need to set it up like this. The file format needs to be *.csv, but in the write location we replace “.csv” with “.bat”. This works because both formats are essentially text files interpreted differently by the computer, so we can write as one, change the extension, and Windows will interpret it as the other.

To stop the csv chucking in any accidental commas, we need to change the delimiter to \0.

And finally we need to untick “First Row Contains Field Names” so that the first line of text in the batch file isn’t “Field_1” or whatever, and we only get the body of the text coming through.

Run Command Config Part 2

And there we have it. When we run the app it should perform like this:

How to Delete Files with Alteryx – Cleaning Up my Computer Part 1

I first wrote this blog post here: https://til.bi/2oiQz0G

***WARNING THIS POST IS GEARED TOWARDS DELETING FILES WITH COMMAND PROMPT, PLAY WITH CARE***

My downloads folder is a mess. It’s full of random files: data files, workbooks from Tableau Public, images, gifs, pdfs, application installers, you name it and it’s there. And nearly all of it is single-use files I have no need for anymore. My documents folder is a mess too. My big projects are well grouped together, but smaller flights of fancy are a disorganised jumble of data, workflows, workbooks and screenshots that I have no need for any more. But now I’ve got an Alteryx workflow that helps me keep folders in check and purge old unwanted files.

The base of the workflow is on the Alteryx Gallery here but will need some re-configuring to set up for your own use:
https://gallery.alteryx.com/#!app/Delete-Old-Files/5989dc85f499c716ec3725a7

The gist of the workflow is to utilise Alteryx’s Directory tool to get the last used date of every file in a folder and to then call the Windows Command Prompt to delete old files.

Step 1: Directory Input

This gives us a list of all files in the folder and the full file path for each. If we want to look for all file types, set the file spec to *, and if we want to search within sub-folders we need to tick “Include SubDirectories”.

Directory Configuration

From this, the important fields that help us determine old files are LastWriteTime and LastAccessTime:

Step 2: Filter to Old Files

Now we need to use a Filter tool to keep only files used or written since a given date. In my example below, it keeps files used or written on or after the 1st of January this year, and I take all of the old files through from the False path of the filter. You can configure this formula however you see fit.

DateTimeFormat(MAX([LastWriteTime],[LastAccessTime]),"%Y-%m-%d") >= "2017-01-01"

Configuration for the Filter Tool

Step 3: Wrap Command Prompt Code Around the Files to Delete

The code below in a Formula tool builds the Command Prompt command to delete a file at a given file path.

'del /f ' +'"' + [FullPath] +'"'

 

Configuration for the Formula Tool

Step 4: Concatenate Every Row of Your Current List into a Text Blob

For this stage we want to use a Summarize tool and concatenate the field we created with the string formula using \n as our concatenation separator.

Configuration for the Summarize Tool

This concatenates every database row into a text blob with the \n creating a new line in the blob for each row.
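Written out, that blob is just one delete command per line. With some made-up file paths it would look something like this:

del /f "C:\Users\me\Downloads\old_report.pdf"
del /f "C:\Users\me\Downloads\superstore_extract.tde"
del /f "C:\Users\me\Downloads\installer_setup.exe"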

Step 5: Write and Run this Code as a Batch File

To write batch files with Alteryx, we need to trick it into thinking it is writing a .csv file but replace the extension with .bat.

Below is how we want to configure the Run Command tool: it writes a .bat file to an output location and then runs it immediately. Whatever file path we write our .bat file to has to be the same file path that we enter in the Command field.

Run Command Configuration

Within the Output configuration, we need to set it up like this:

The file format needs to be *.csv, but in the write location we replace “.csv” with “.bat”. Both formats are just text files interpreted differently by the computer, so we can write as one, change the extension, and Windows will interpret it as the other.

To stop the csv chucking in any accidental commas, we need to change the delimiter to \0.

And finally we need to untick “First Row Contains Field Names” so that the first row in the batch file isn’t “Field_1” or whatever, and we only get the body of the text.

Output Configuration for Run Command

 

And there we go: running this workflow on my downloads folder purged any file I’ve not used this year.

Thanks for reading 🙂

How to Calculate Working Days Between Dates in Tableau

I first wrote this blog post here: https://til.bi/2ogP9UA

This is a question that comes up quite regularly when working with dates in business analytics: how can I calculate date differences in working days? Here’s the code for the Tableau calculation in full, and I’ll break it down beneath.

(DATEDIFF('day',[Day1],[Day2]) -
(7-DATEPART('weekday',[Day1])) -
DATEPART('weekday',[Day2])
) / 7*5
+ MIN(5,(7-(DATEPART('weekday',[Day1]))))
+ MIN(5,(DATEPART('weekday',[Day2])-1))
-1

Part 1: Working Days in Full Weeks Spanned

This gives us the number of working days in the full weeks spanned. First of all it calculates the total day difference between the start and end date here:

(DATEDIFF('day',[Day1],[Day2])

Then we trim off the days in the first partial week of the time frame. Subtracting 7 minus the weekday number takes us up to the end of the first Saturday, the end of that first (possibly partial) week.

- (7-DATEPART('weekday',[Day1]))

Then we trim off the days in the partial week at the end of the time frame to take us back to the end of the last full week.

- DATEPART('weekday',[Day2])

Now we have a number that is a multiple of 7 and will reflect the number of days in the full weeks between the start and end date. We divide this by 7 and multiply it by 5 to get the number of working days in these weeks.

) / 7*5

Part 2: Working Days in the First Partial Week

By default Tableau counts Saturday as the 7th day of the week, and we want to count how many working days run from our start date [Day1] to the end of that first partial week. If we take the weekday number away from 7, we get the number of days from [Day1] up to Saturday. To stop us getting a count of 6 working days when the range runs from Sunday through to Saturday, the DATEPART snippet is wrapped in a MIN calculation with the number 5 so we get no more than 5 working days in a week.

+ MIN(5,(7-(DATEPART('weekday',[Day1]))))

Part 3: Working Days in the Final Week

As Tableau counts Sunday as day 1 in its weekday calculations, we need to subtract 1 from the weekday number of our final date [Day2] to count the working days in the final week. This makes the Sunday-through-Saturday count go from 0 to 6. As with the first partial week, we then wrap the weekday calculation in a MIN formula with the number 5 so we get no more than 5 working days in a week.

+ MIN(5,(DATEPART('weekday',[Day2])-1))

Part 4: Making the Calculation Exclusive of the Start Date

At the moment the calculation is inclusive of both the start and end dates. This means a project starting and ending on the same day has a working-day count of 1. If we want to count full working days instead, so that something starting on Wednesday and finishing on Thursday counts as 1 and a same-day project counts as 0, we need to subtract 1 from our calculation.

-1
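As a quick sanity check with some made-up dates, suppose [Day1] is Wednesday 4 January 2017 (weekday 4) and [Day2] is Tuesday 10 January 2017 (weekday 3), so DATEDIFF('day',[Day1],[Day2]) is 6:

Part 1: (6 - (7-4) - 3) / 7 * 5 = 0 working days in full weeks
Part 2: MIN(5, 7-4) = 3 working days (Wed, Thu, Fri)
Part 3: MIN(5, 3-1) = 2 working days (Mon, Tue)
Total: 0 + 3 + 2 - 1 = 4 full working days after the start date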

Thanks for reading 🙂

How to do Box Plot Calculations in Tableau

I first wrote this blog post here: https://til.bi/2ohXeIL

Box plots are great for displaying distribution, and in Tableau they’re incredibly easy to make. In fact the simplest box plot in Tableau takes only four clicks: click a dimension, hold Ctrl and click a measure, click the “Show Me” tab and select the box plot option, and there you go, you have a box plot!

The World's Fastest Box Plot in Tableau

The Constituent Parts

Box plots are made of five key components: the median, the upper and lower hinges, and the upper and lower whiskers. The median is a standard statistical measure included in Tableau’s library of formulae: it’s the 50th percentile value, the middle number in a data set. The hinges, however, are harder to work out because they sit near, but not exactly at, the 25th and 75th percentiles, and the distance away from those values depends on the size of the data set.

Tukey Hinges

The hinges Tableau uses are Tukey inclusionary hinges, named after John Tukey, the person who first created box plots. Tukey hinges are the midway points of the first and second halves of the data. If there is an even number of data points, the data is split straight down the middle; if there is an odd number of data points, the median value is duplicated and used by both halves. This is illustrated below for the number sets 1-5, 1-7 and 1-9, each of which duplicates its median value. The hinges are then the medians of their respective halves of the data.
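Working those three sets through by hand:

1, 2, 3, 4, 5 splits into {1, 2, 3} and {3, 4, 5}, giving hinges 2 and 4
1, 2, 3, 4, 5, 6, 7 splits into {1, 2, 3, 4} and {4, 5, 6, 7}, giving hinges 2.5 and 5.5
1, 2, 3, 4, 5, 6, 7, 8, 9 splits into {1, 2, 3, 4, 5} and {5, 6, 7, 8, 9}, giving hinges 3 and 7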

The Whiskers

The whiskers then stretch out to the furthest data points within 1.5 interquartile ranges of the hinges, where the interquartile range [IQR] is the distance between the two hinges.

So How Do We Calculate This?

First of all, the data being plotted needs to be at its most granular level. This method uses table calculations that require index formulae computed using the measure itself, so before you start, untick “Aggregate Measures” in the Analysis menu. For the calculations that follow, we are using Superstore data to generate box plots of Quantity by State to create the view below. When you want to tailor them to your own data, just replace any references to Quantity with your measure of choice and make sure each of your table calculations is computed using it. If you can’t see Quantity as an option for ‘compute using’, make sure the measure is not aggregated and is set as a continuous dimension.

Lower Hinge

WINDOW_MEDIAN( SUM([Quantity]),
FIRST(),
INT((WINDOW_COUNT(SUM([Quantity]))+1)/2) - INDEX()
)

The first part of the window median says that we are taking the median of quantity values, the middle part says the range is starting with the first value, and the last part anchors the end of the range in the middle of the data set, inclusive of the median.
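To see how the offsets behave, suppose there are 9 marks in the window and the calculation is being evaluated at the first mark, so INDEX() = 1. FIRST() returns 0 and the end offset is INT((9+1)/2) - 1 = 4, so the window spans rows 1 to 5 and the lower hinge is the median of the first five values, which is exactly the inclusive lower half from the Tukey splits above. Because both offsets shift with INDEX(), the same rows are covered whichever mark the calculation is evaluated at.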

Upper Hinge

WINDOW_MEDIAN( SUM([Quantity]),
INT(ROUND((WINDOW_COUNT(SUM([Quantity]))+1)/2,0)) - INDEX(),
LAST()
)

The first part of the window median says that we are taking the median of quantity values, the middle part says the range starts in the middle of the data set, inclusive of the median, and the last part anchors the end of the range at the last value in the data set.

IQR (Interquartile Range)

This calculation is a good descriptor of the spread of a data set and also helps to make the whisker calculations clearer.

[Upper Hinge] - [Lower Hinge]

Lower Whisker

WINDOW_MIN(
IF SUM([Quantity]) > [Lower Hinge] - 1.5*[IQR] OR SUM([Quantity]) = [Lower Hinge]
THEN SUM([Quantity])
END
)

This takes the minimum value for any data within 1.5 IQRs of the lower hinge or the lower hinge itself.

 

Upper Whisker

WINDOW_MAX(
IF SUM([Quantity]) < [Upper Hinge] + 1.5*[IQR] OR SUM([Quantity]) = [Upper Hinge]
THEN SUM([Quantity])
END
)

This takes the maximum value for any data within 1.5 IQRs of the upper hinge or the upper hinge itself.

Sorting

Now for a final touch of flair: if you want to sort your dimension by these values, create an integer parameter from 1 to 5 for the statistics and another from 1 to 2 for ascending or descending, then create a sorting calculation. This new calculation needs to go before your first dimension, with its header hidden. The calculation should look roughly like the one below, also computed using your designated measure.

CASE [Ascending or Descending]
WHEN 1 THEN
CASE [Sort States by]
WHEN 1 THEN [Lower Whisker]
WHEN 2 THEN [Lower Hinge]
WHEN 3 THEN [Median]
WHEN 4 THEN [Upper Hinge]
WHEN 5 THEN [Upper Whisker]
END
WHEN 2 THEN
CASE [Sort States by]
WHEN 1 THEN -[Lower Whisker]
WHEN 2 THEN -[Lower Hinge]
WHEN 3 THEN -[Median]
WHEN 4 THEN -[Upper Hinge]
WHEN 5 THEN -[Upper Whisker]
END
END
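One note: the [Median] field referenced in the sort isn’t built earlier in this post. If you don’t already have one, a minimal window version, computed using the measure in the same way as the hinges, could be:

WINDOW_MEDIAN(SUM([Quantity]))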

 

And there you go, that’s box plot calculations in Tableau!