TRIZ: A Creative Problem-Solving System – Data, DataViz, and Stats with the Stars

Agenda!

Orange? What is this Orange stuff, anyhow?
Throwing it All Away with Brad Pitt: Data Summaries
Counting Letters with Sherlock Holmes: Bar Charts
Nursery Rhymes with Ben Affleck: Line Charts
Being a Mermaid with Katie Ledecky: Box Plots
Jack and Rose lived happily ever after: Mosaic Plots
The Art of Surprise with Gabbar Singh: Permutation Tests

Orange? What is this Orange stuff, anyhow?

Orange is a visual drag-and-drop tool for

Data visualization
Statistical Tests
Machine Learning
Data mining

and much more. You can download and install Orange from here:

https://orangedatamining.com/download/

Basic Usage of Orange

Widgets and Channels

The Orange Visual Interface

Let us create some simple visualizations using Orange.

Use the File Widget to import the iris dataset into your session
Use the Data Table Widget to look at the data, and note its variable names
Use the Visualization Widgets (Scatter Plot, Bar Plot, and Distributions) to look at the properties of the variables and examine relationships between them.

What does Data Look like?

Orange Practice Session #1

Let’s use the Datasets widget
Click on it to select one of the built-in datasets: Auto MPG
Let’s look at the Data using the Data Table widget
And create a Scatter Plot with the Scatter Plot widget (Horsepower vs Displacement)
Try the menu options on the left side to see how they alter the plot

Brad Pitt: Throwing it All Away

A Pillar of Statistical Wisdom

Stephen Stigler (2016) in “The Seven Pillars of Statistical Wisdom”:

One of the Big Ideas in Statistics is: Aggregation
How is it revolutionary?
By stipulating that, given a number of observations, you can actually gain information by throwing information away
In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

Brad Pitt: Throwing it All Away

What was he throwing away?

All the “Variables”

Age
Previous Seasons
Waist Size
Treadmill Test Score
Bat Speed?
Smoke Weed?
Girlfriend?
Girlfriend Looks Rating?
Waddles like a Duck?
Looks Weird?

And he was looking ONLY at…

How do we throw away data?

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

Dataset length: How many rows/observations?
Dataset breadth: How many columns/variables?
How many Quant variables?
How many Qual variables?
Quant variables: min, max, mean, median, sd
Qual variables: levels, counts per level
Both: means, medians for each level of a Qual variable…
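
The summary numbers listed above can be sketched in a few lines of Python. This is only an illustration, not what Orange runs internally: the six births values and six ethnicity labels are copied from the table previews that appear later in this deck, and the helper names summarize_quant and summarize_qual are made up for this example.

```python
from collections import Counter
from statistics import mean, median, stdev

# Six `births` values and six `ethnicity` labels copied from the
# table previews shown later in this deck (tiny samples, for illustration).
births = [9083, 8006, 11363, 13032, 12558, 12466]
ethnicity = ["cauc", "cauc", "afam", "afam", "cauc", "cauc"]


def summarize_quant(values):
    """Collapse a Quant variable into a handful of summary numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation
    }


def summarize_qual(labels):
    """Collapse a Qual variable into its levels and the count per level."""
    return dict(Counter(labels))


quant_summary = summarize_quant(births)
qual_summary = summarize_qual(ethnicity)
print(quant_summary)
print(qual_summary)
```

Six individual numbers become one small dictionary: that is the "throwing away" in action.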

Orange Practice Session #2

So what do we throw away now? And how?

Let’s look at the existing dataset, Auto MPG
Pull in the Feature Statistics widget
Connect the Datasets widget to the Feature Statistics widget
Appreciate the Summary Statistics that are generated

Wait, But Why?

To summarize is to understand.
Add to that the fact that our working memory can hold only about 7 items, so summarizing also helps us retain information.
In “Funes the Memorious,” Borges wrote: “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.”
Brad Pitt aka Billy Beane was throwing away the details, and looking at the aggregated picture to pick his future Oakland A’s team.

Counting Letters with Sherlock Holmes

Sherlock Holmes: The Adventure of the Dancing Men

In the Sherlock Holmes story, The Adventure of the Dancing Men, a criminal known to one of the characters communicates with her using a childish/child-like drawing which looks like this:

How would Holmes decipher this message?

Sherlock Holmes: The Adventure of the Dancing Men

Using Conjectures:
Symbols -> Letters
Based on well-known letter frequencies in English (frequency analysis, a cousin of Zipf’s Law for words)
Holmes deduces that the most common symbol in the message stands for “E”
He then reasons onward to the next most common letters, such as “T”
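
Holmes's counting step can be sketched in Python as below. The sample text is an arbitrary English phrase chosen for illustration (it is not the cipher from the story), and the helper name letter_counts is hypothetical.

```python
from collections import Counter


def letter_counts(message):
    """Tally how often each letter occurs, ignoring case and non-letters."""
    return Counter(ch for ch in message.lower() if ch.isalpha())


# Arbitrary sample text (an assumption for illustration only).
sample = "The Adventure of the Dancing Men"
counts = letter_counts(sample)

# Each (letter, count) pair is one bar of a bar chart, tallest bar first.
for letter, n in counts.most_common(3):
    print(letter, "#" * n)
```

Even in this short phrase, "e" comes out on top, which is the regularity Holmes relies on.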

What Charts work for counting?

Variable #1	Variable #2	Chart Names	Chart Shape
Qual	None	Bar Chart

Bar charts are used to show “counts” and “tallies” with respect to Qual variables.
For instance, in a survey, how many people are there of each Gender?
In a Target Audience survey on Weekly Consumption, how many low, medium, or high expenditure people?
NOTE: we count with Qual variables! (Why? We count labels, like letters.)
Which is why the Bars are not touching each other.

Orange Practice Session #3

Banned Books!!!

OK, Let’s get some data to count:

Rows: 1586 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Author, Title, Type of Ban, State, District, Date of Challenge/Remo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

And let’s for now use a pre-set Workflow in Orange

Download the Orange Barchart Workflow

We will look at the data
Make a Data dictionary
Identify the Qual and Quant variables
Prepare Counts and Bar Charts wrt Qual variables
In Orange! Point, Click, and See!

Data Dictionary

Qualitative Variables

Author: Author of the book (Qual)
Title: Title of the book (Qual)
Origin: Origin of the Challenge (Qual)
Type of Ban: Type of ban on the book (Qual)
State: State in which the book was banned (Qual)
District: District in which the book was banned (Qual)

Quantitative Variables

None

Counting our Data

Research Question

Do some States ban more books than some others?

What is the Story Here?

Texas is the worst at book banning!
Florida, Oklahoma, Kansas, Indiana, ... are next in line
Is there a “Bible Belt” story here?

Counting our Data

Research Question

What are the kinds of bans that are being imposed on books? How many books banned by each type of ban?

What is the Story Here?

There are four types of ban imposed on books
“Investigation” is the most common type of ban
How does one “investigate” a book???

Wait, But Why?

Counts first give you an absolute sense of how much data you have.
Counts by different Qual variables give you a sense of the combinations you have in your data: (State) * (District) * (Ban)
Counts then give an idea whether your data is lop-sided
Since the X-axis in bar charts is Qualitative (the bars don’t touch, remember!) it is possible to sort the bars at will.

Nursery Rhymes with Ben Affleck

In “The Accountant,” Christian Wolff is heard reciting “Solomon Grundy.”
The nursery rhyme tells the life and death of a man named Solomon Grundy, all within a single week.
It was innocently used to help children learn their days of the week.
However, Thursday through Sunday detail the tragic end of Mr. Grundy from an unspecified illness, so it’s hard to ignore the dark undertones.

What is the Data here? And the Chart?

The data is the days of the week.
The data is the number of events that happen on each day.
The y-variable is a Quant variable, a number
The x-variable is also a Quant variable, a time variable

Note

Tourist: Any famous people born around here?
Guide: No sir, best we can do is babies.

Timing our Data

OK, Let’s get some data to chart:

Rows: 5479 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): year, month, date_of_month, day_of_week, births

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

And let’s for now use a pre-set Workflow in Orange

Download Orange Time Series Workflow

We will look at the data
Make a Data dictionary
Identify the Qual and Quant variables
Prepare Time Series Charts with Quant variables, and Qual variables

Data Dictionary

Qualitative Variables

year: Year of birth (Qual)
month: Month of the year (Qual)
date_of_month: Day of the month (Qual)
day_of_week: Day of the week (Qual)

Quantitative Variables

births: Number of births on that day (Quant)

# A tibble: 6 × 5
   year month date_of_month day_of_week births
  <dbl> <dbl>         <dbl>       <dbl>  <dbl>
1  2000     1             1           6   9083
2  2000     1             2           7   8006
3  2000     1             3           1  11363
4  2000     1             4           2  13032
5  2000     1             5           3  12558
6  2000     1             6           4  12466

Orange Practice Session #4

The data is the number of births in the USA, by day, month, and year
Let us use the Group By widget to group by month and day_of_week
AND compute the mean(births) for each group in the same widget
We then plot the mean(births) vs month, and colour by day_of_week
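
What the Group By widget does can be sketched in plain Python as "split the rows into groups, then average each group". This is a rough sketch, not Orange's implementation. It uses only the six rows previewed above, so instead of grouping by month it groups by weekday vs weekend. It assumes that day_of_week codes 6 and 7 mean Saturday and Sunday (1 January 2000 was a Saturday); the helper names group_mean and day_type are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (day_of_week, births) pairs from the six-row preview above.
rows = [(6, 9083), (7, 8006), (1, 11363), (2, 13032), (3, 12558), (4, 12466)]


def group_mean(pairs, key):
    """Group the values by key(label), then take the mean of each group."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[key(label)].append(value)
    return {group: mean(values) for group, values in groups.items()}


def day_type(day_of_week):
    # Assumption: codes 6 and 7 are Saturday and Sunday.
    return "weekend" if day_of_week in (6, 7) else "weekday"


mean_births = group_mean(rows, day_type)
print(mean_births)
```

Swapping day_type for a function that returns the (month, day_of_week) pair would give the grouping used in the practice session.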

Wait, But Why?

Time series data is a special kind of Quantitative data, where the x-variable is a time variable.
The y-variable is a Quant variable, a number.
The x-variable is a Quant variable, a time variable.
By colouring by day_of_week, we can see how the number of births varies by day of the week.

What is the Story Here?

Could there be a staffing shortage at hospitals on weekends?
Is this a “revealed preference” thing?
Or should we watch Grey’s Anatomy, or even House?

Being a Mermaid with Katie Ledecky

Katie Ledecky is a swimmer, and a mermaid.
She has won 9 Olympic gold medals (through Paris 2024), and more than 20 World Championship gold medals.
She holds the world records in the 800 and 1500 meter freestyle events, and previously held the 400 meter freestyle record as well.
What does that make her? An Outlier…

So how do we find, and show, outliers?

Outliers are data points that are significantly different from the rest of the data.
They can be identified using box plots, which show the distribution of the data.
Box plots show the median, quartiles, and outliers of the data.
Of course, Ledecky was in the water! Well in.
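
One common convention for flagging outliers in a box plot is the 1.5 × IQR rule: anything more than 1.5 interquartile ranges beyond the quartiles is marked as an outlier. The sketch below assumes that rule (Orange's Box Plot widget may draw things differently) and uses made-up numbers purely for illustration; the helper name iqr_outliers is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements (not from any dataset in this deck).
measurements = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]
outliers = iqr_outliers(measurements)
print(outliers)
```

The value 120 sits far outside the box and whiskers, just as Ledecky sits far ahead of the field.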

Being an Outlier with Katie Ledecky

Let’s get some data to plot:

Rows: 397 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): rank, discipline, sex
dbl (3): yrs.since.phd, yrs.service, salary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

And let’s use a pre-set Workflow in Orange

Download the Orange Box Plot Workflow

Data Dictionary

Qualitative Variables

rank: Rank of the academic (Qual)
discipline: Discipline of the academic (Qual)
sex: Male / Female (Qual)

Quantitative Variables

yrs.since.phd: Years since PhD (Quant). Can be Qual??
yrs.service: Years of service (Quant)
salary: Salary of the academic (Quant)

Research Question #1

Question

Q1. What is the distribution of salary? How does it change if we split by sex?

Research Question #2

Question

Q2. What is the distribution of salary, when we split by other Qual variables, such as rank?

Wait, But Why?

Box Plots tell us distributions of Quant variables, and show us outliers.
They show us the median, quartiles, and outliers of the data.
They are useful for comparing distributions of Quant variables across Qual variables (e.g., sex or rank).

What is the Story Here?

Could there be a systemic bias in salaries?
A statistical t-test / ANOVA would tell us if that is true.
Look in the figures for a t-test and ANOVA report at the bottom.

Jack and Rose lived happily ever after

Jack and Rose lived happily ever after?

The Titanic sank on 15 April 1912, after hitting an iceberg.
What are the chances that Jack survived too?
What did his chances depend on?

Jack and Rose lived happily ever after?

Let’s get the titanic data, using the Datasets widget in Orange.
There were 2201 people on board (passengers and crew), as per this dataset.
And let’s use a pre-set Workflow in Orange

Download the Orange Mosaic Chart Workflow

Data Dictionary: `titanic`

Quantitative Data

None.

Qualitative Data

survived: (chr) yes or no
status: (chr) Class of Travel, else “crew”
age: (chr) Adult, Child
sex: (chr) Male / Female.

What kind of Data Variables will we choose?

Variable #1	Variable #2	Chart Names	Chart Shape
Qual	Qual	~~Pies~~, Mosaic Charts

Here, area ∝ count: the area of each tile is proportional to the count of observations in that tile.

Research Question #1

Note

What is the dependence of survived upon sex?

Note

Note the huge imbalance in survived with sex
Men have clearly perished in larger numbers than women.
Colouring shows large positive residuals for men who died, and large negative residuals for women who died.

So sadly Jack is far more likely to have died than Rose.

Research Question #2

How does survived depend upon status?

Note

Crew has seen deaths in large numbers, as seen by the large negative residual for crew survivals.
First Class passengers have had speedy access to the boats and have survived in larger proportions than say second or third class.
There is a large positive residual for first-class survivals.
Rose travelled first class and Jack was third class. So again the odds are stacked against him.

What are these Residuals anyhow?

Residuals are the differences between the actual counts and the counts we would expect if the two Qual variables were unrelated. When these differences are large, we deduce that one Qual variable has an effect on the other Qual variable (speaking counts-wise or ratio-wise).
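
As a rough sketch of the idea, the code below computes expected counts under "no relationship" and the corresponding Pearson residuals, (actual − expected) / √expected, for survived vs sex. The four counts are those of the widely used 2201-person Titanic dataset; that Orange's built-in file matches them exactly, and that its mosaic colouring uses this exact residual, are assumptions worth checking. The helper name pearson_residuals is hypothetical.

```python
from math import sqrt

# (sex, survived) counts. Assumption: these match the standard
# 2201-person Titanic dataset; verify against the Orange Data Table.
observed = {
    ("male", "no"): 1364,
    ("male", "yes"): 367,
    ("female", "no"): 126,
    ("female", "yes"): 344,
}


def pearson_residuals(table):
    """Return (actual - expected) / sqrt(expected) for every cell."""
    total = sum(table.values())
    row_totals, col_totals = {}, {}
    for (row, col), n in table.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n
    residuals = {}
    for (row, col), n in table.items():
        # Expected count if sex and survival were unrelated.
        expected = row_totals[row] * col_totals[col] / total
        residuals[(row, col)] = (n - expected) / sqrt(expected)
    return residuals


residuals = pearson_residuals(observed)
for cell, value in residuals.items():
    print(cell, round(value, 2))
```

The signs agree with the colouring described earlier: positive for men who died, negative for women who died.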

Wait, But Why?

Mosaic Charts are used to show the relationship between two Qual variables.
They show the counts of observations in each combination of the two Qual variables.
The area of each tile is proportional to the count of observations in that tile
The colour of the tile shows the residuals, which are the differences between the actual and expected counts.

The Art of Surprise with Gabbar Singh

What was the “data” Gabbar was looking at?

Chamber	Bullet
1	Y / N
2	Y / N
..	…
6	Y / N

The number of people in the village, and the number of people who were armed. (Quant)
The number of bullets in the gun (Quant)
The ID of the pistol chamber which contains a bullet (Qual)

The Art of Surprise with Gabbar Singh

And Gabbar’s Hypothesis?

That three bullets would never line up just ready to be fired, especially after he had fired three off!!
So he could claim “ignorance”!
“Ignorance” == “Fairness” == “Justice” !
But he was pretend-surprised when the three bullets were lined up, and he could fire them all off at once!
But no one could accuse him of anything, because he was “ignorant” of the fact that the bullets were lined up!

Gabbar’s Gun Chamber Permutations

So how do we become Gabbar Singh?

Are Emily and Greg more Employable than Lakisha and Jamal?

Is ethnicity (as revealed by first names) a basis for racial discrimination, in the US?

This dataset was generated as part of a landmark research study done by Marianne Bertrand and Sendhil Mullainathan.

Read the description therein to really understand how you can prove causality with a well-crafted research experiment.

# A tibble: 6 × 3
  name    ethnicity call 
  <chr>   <chr>     <fct>
1 Allison cauc      no   
2 Kristen cauc      no   
3 Lakisha afam      no   
4 Latonya afam      no   
5 Carrie  cauc      no   
6 Jay     cauc      no

# A tibble: 4 × 3
# Groups:   ethnicity [2]
  ethnicity call      n
  <chr>     <fct> <int>
1 afam      yes     157
2 afam      no     2278
3 cauc      yes     235
4 cauc      no     2200

# A tibble: 2 × 2
  ethnicity call_prop
  <chr>         <dbl>
1 afam           6.45
2 cauc           9.65

diffprop 
0.108199

The Art of Surprise with Gabbar Singh

So it appears the call percentage is different for the two ethnicities, afam and cauc.
But is it statistically significant? Would Gabbar be surprised?
Let us pretend ethnicity does not matter and spin the revolver!!
We shuffle the ethnicity labels some 5000 times, and recompute the difference in call proportions each time.
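
A minimal sketch of this shuffling procedure, in plain Python, is given below. It rebuilds the data from the four counts in the table above, pools all the callbacks, and reshuffles them between the two groups, which is equivalent to shuffling the ethnicity labels. The function names, the seed, and the two-sided p-value with a +1 correction are choices made for this illustration, not necessarily what produced the slides. The default is 5000 shuffles as in the text; the demo call uses 500 only to keep the pure-Python run short.

```python
import random

# Callback outcomes rebuilt from the counts table above: 1 = call, 0 = no call.
afam_calls = [1] * 157 + [0] * 2278
cauc_calls = [1] * 235 + [0] * 2200


def diff_in_call_rate(group_a, group_b):
    """Call rate of group_b minus call rate of group_a."""
    return sum(group_b) / len(group_b) - sum(group_a) / len(group_a)


def permutation_test(group_a, group_b, n_perm=5000, seed=42):
    """Shuffle outcomes across the two groups and recompute the difference."""
    rng = random.Random(seed)
    observed = diff_in_call_rate(group_a, group_b)
    pooled = group_a + group_b  # pretend ethnicity does not matter
    n_a = len(group_a)
    null_diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null_diffs.append(diff_in_call_rate(pooled[:n_a], pooled[n_a:]))
    # Two-sided p-value, with +1 so that it is never exactly zero.
    extreme = sum(abs(d) >= abs(observed) for d in null_diffs)
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, null_diffs, p_value


observed, null_diffs, p_value = permutation_test(afam_calls, cauc_calls, n_perm=500)
print("observed difference:", round(observed, 4))
print("largest shuffled difference:", round(max(null_diffs), 4))
print("p-value:", round(p_value, 4))
```

If the observed difference lies far outside the shuffled differences, "ethnicity does not matter" is a poor explanation for the data.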

The Art of Surprise with Gabbar Singh

Even after 5000 shuffles, we are not able to mimic Mother Nature aka Reality.
The red line is the observed difference in proportions, and it is way out of the null distribution.
So we can reject the NULL Hypothesis that ethnicity does not matter.
Hence we infer that there was bias in the callback process, and that afam candidates were discriminated against.

Gabbar vis-à-vis a Stats Teacher

Gabbar	Stats Teacher
“Kitne aadmi thay?” (“How many men were there?”)	How many observations do you have? n < 30 is a joke.
“Kya Samajh kar aaye thay? Gabbar khus hoga? Sabaasi dega kya?” (“What were you thinking when you came back? That Gabbar would be pleased? That he would pat you on the back?”)	What are the levels in your Factors? Are they binary? Don’t do ANOVA just yet!
(Fires off three rounds) “Haan, ab theek hai!” (“Yes, now it is fine!”)	Yes, now the dataset is balanced wrt the factor (Treatment and Control).
“Is pistol mein teen zindagi aur teen maut bandh hai. Dekhte hain kisko kya milega.” (“Three lives and three deaths are locked in this pistol. Let us see who gets what.”)	This is our Research Question, for which we will Design an Experiment.
“Hume kuchh nahi pataa!” (“We know nothing!”)	Let us perform a non-parametric Permutation Test for this Factor!
“Kamaal ho gaya!” (“How amazing!”)	Fantastic! Our p-value is so small that we can reject the NULL Hypothesis!!

Thank You!!

Questions? Comments? Suggestions?

arvind.venkatadri@gmail.com

This Presentation: https://av-quarto.netlify.app/content/projects/modules/talks/vizchitra25/

Data and Workflow files are available at https://www.dropbox.com/scl/fo/15ha6pmtqd3t17c39idxi/AEviMggAcido-xQHiD82dH4?rlkey=h9xj09pdtx3st9basdfljwwyw&st=06dvqnbb&dl=0
