Data, DataViz, and Stats with the Stars

Arvind V.

2025-06-28

Agenda!

  • Orange? What is this Orange stuff, anyhow?
  • Throwing it All Away with Brad Pitt: Data Summaries
  • Counting Letters with Sherlock Holmes: Bar Charts
  • Nursery Rhymes with Ben Affleck: Line Charts
  • Being a Mermaid with Katie Ledecky: Box Plots
  • Jack and Rose lived happily ever after: Mosaic Plots
  • The Art of Surprise with Gabbar Singh: Permutation Tests

Orange? What is this Orange stuff, anyhow?

Orange is a visual drag-and-drop tool for

  • Data visualization
  • Statistical Tests
  • Machine Learning
  • Data mining

and much more. You can download and install Orange from here:

https://orangedatamining.com/download/

Figure 1: Orange Data Mining GUI

Basic Usage of Orange

Widgets and Channels

The Orange Visual Interface

Let us create some simple visualizations using Orange.

  • Use the File Widget to import the iris dataset into your session
  • Use the Data Table Widget to look at the data, and note its variable names
  • Use the Visualization Widgets (Scatter Plot, Bar Plot, and Distributions) to look at the properties of the variables, and examine relationships between them.

What does Data Look like?

Variable Types

Tidy Data

Orange Practice Session #1

  • Let’s use the Datasets widget
  • Click on it to select one of the built-in CSV files: Auto MPG
  • Let’s look at the Data using the Data Table widget
  • And create a Scatter Plot with the Scatter Plot widget (Horsepower vs Displacement)
  • Try the menu options on the left side to see how they alter the plot

Brad Pitt: Throwing it All Away

A Pillar of Statistical Wisdom

Stephen Stigler (2016) in “The Seven Pillars of Statistical Wisdom”:

  • One of the Big Ideas in Statistics is: Aggregation
  • How is it revolutionary?
  • By stipulating that, given a number of observations, you can actually gain information by throwing information away
  • In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

Brad Pitt: Throwing it All Away

What was he throwing away?

All the “Variables”

  • Age
  • Previous Seasons
  • Waist Size
  • Treadmill Test Score
  • Bat Speed?
  • Smoke Weed?
  • Girlfriend?
  • Girlfriend Looks Rating?
  • Waddles like a Duck?
  • Looks Weird?

And he was looking ONLY at…

OBP Data

How do we throw away data?

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
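
The same summaries can be sketched outside Orange. This is a minimal Python sketch using only the standard library; the six rows and the column names `mpg` and `origin` are made-up stand-ins for illustration, not the actual Auto MPG file, and `summarize` is a hypothetical helper.

```python
import statistics
from collections import Counter

# Hypothetical toy rows (NOT the real Auto MPG data):
# one Quant column (mpg) and one Qual column (origin).
rows = [
    {"mpg": 18.0, "origin": "USA"},
    {"mpg": 15.0, "origin": "USA"},
    {"mpg": 24.0, "origin": "Japan"},
    {"mpg": 26.0, "origin": "Europe"},
    {"mpg": 31.0, "origin": "Japan"},
    {"mpg": 21.0, "origin": "USA"},
]


def summarize(rows, quant, qual):
    """Return dataset size, Quant summaries, and per-level Qual summaries."""
    values = [r[quant] for r in rows]
    by_level = {}
    for r in rows:
        by_level.setdefault(r[qual], []).append(r[quant])
    return {
        "n_rows": len(rows),          # dataset length
        "n_cols": len(rows[0]),       # dataset breadth
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "level_counts": dict(Counter(r[qual] for r in rows)),
        "mean_by_level": {k: statistics.mean(v) for k, v in by_level.items()},
    }


summary = summarize(rows, quant="mpg", qual="origin")
for name, value in summary.items():
    print(f"{name}: {value}")
```

Six rows go in and a handful of numbers come out: that is the "throwing away" in action.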

Orange Practice Session #2

  • So what do we throw away now? And how?
  • Let’s look at the existing dataset, Auto MPG
  • Pull in the Feature Statistics widget
  • Connect the Datasets widget to the Feature Statistics widget
  • Appreciate the Summary Statistics that are generated

Wait, But Why?

  • To summarize is to understand.
  • Add to that the fact that our working memory can hold maybe 7 items at a time, so summarizing also helps us retain information.
  • Borges wrote in “Funes the Memorious”: “To think is to forget differences, generalize, make abstractions. In the teeming world of Funes, there were only details.”
  • Brad Pitt aka Billy Beane was throwing away the details, and looking at the aggregated picture to pick his future Oakland A’s team.

Counting Letters with Sherlock Holmes

Sherlock Holmes: The Adventure of the Dancing Men

In the Sherlock Holmes story, The Adventure of the Dancing Men, a criminal known to one of the characters communicates with her using a childish/child-like drawing which looks like this:

Am Here, Abe Slaney

How would Holmes decipher this message?

Sherlock Holmes: The Adventure of the Dancing Men

  • Using Conjectures:
  • Symbols -> Letters
  • Based on well-known Counts of letters (Zipf’s Law)
  • Holmes deduces that the most common symbol in the message stands for “E”
  • He then deduces that the second most common symbol stands for “T”
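
The counting step can be sketched in a few lines of Python. Holmes counted symbols; this sketch counts the letters of the decoded message "Am Here, Abe Slaney" instead, which gives the same tallies if we assume a one-to-one substitution of symbols for letters. The helper name `letter_counts` is hypothetical.

```python
from collections import Counter


def letter_counts(message):
    """Count alphabetic characters, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in message.upper() if ch.isalpha())


counts = letter_counts("Am Here, Abe Slaney")

# A text-only bar chart: one '#' per occurrence, most common letter first.
for letter, n in counts.most_common():
    print(f"{letter}: {'#' * n} ({n})")
```

Even in this short message, "E" comes out on top, which is the foothold Holmes needed.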

Zipf’s Law

What Charts work for counting?

Variable #1 | Variable #2 | Chart Names | Chart Shape
Qual | None | Bar Chart |

  • Bar charts are used to show “counts” and “tallies” with respect to Qual variables.
  • For instance, in a survey, how many people of each Gender?
  • In a Target Audience survey on Weekly Consumption, how many low, medium, or high expenditure people?
  • NOTE: we count with Qual variables! (Why? We count labels, like letters.)
  • Which is why the Bars are not touching each other.

Orange Practice Session #3

Banned Books!!!

OK, Let’s get some data to count:

Rows: 1586 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Author, Title, Type of Ban, State, District, Date of Challenge/Remo...

And let’s for now use a pre-set Workflow in Orange

Download the Orange Barchart Workflow

  • We will look at the data
  • Make a Data dictionary
  • Identify the Qual and Quant variables
  • Prepare Counts and Bar Charts wrt Qual variables
  • In Orange! Point, Click, and See!

Data Dictionary

Qualitative Variables

  • Author: Author of the book (Qual)
  • Title: Title of the book (Qual)
  • Origin: Origin of the Challenge (Qual)
  • Type of Ban: Type of ban on the book (Qual)
  • State: State in which the book was banned (Qual)
  • District: District in which the book was banned (Qual)

Quantitative Variables

  • None

Counting our Data

Research Question

Do some States ban more books than some others?

What is the Story Here?

  • Texas bans the most books!
  • Florida, Oklahoma, Kansas, Indiana, ... are next in line
  • Is there a “Bible Belt” story here?
Figure 2: Bible Belt

Counting our Data

Research Question

What are the kinds of bans that are being imposed on books? How many books are banned under each type of ban?

What is the Story Here?

  • There are four types of ban
  • “Investigation” is the commonest kind of ban
  • How does one “investigate” a book???

Wait, But Why?

  • Counts first give you an absolute sense of how much data you have.
  • Counts by different Qual variables give you a sense of the combinations you have in your data: (State) * (District) * (Ban)
  • Counts then give an idea whether your data is lop-sided
  • Since the X-axis in bar charts is Qualitative (the bars don’t touch, remember!) it is possible to sort the bars at will.

Nursery Rhymes with Ben Affleck

  • In “The Accountant,” Christian Wolff is heard reciting “Solomon Grundy.”
  • The nursery rhyme tells the life and death of a man named Solomon Grundy, all within a single week.
  • It was innocently used to help children learn their days of the week.
  • However, when we look into the fact that Thursday through Sunday detail the tragic end of Mr. Grundy due to an unspecified illness…
  • it’s hard to ignore the dark undertones.

What is the Data here? And the Chart?

  • The data is the days of the week.
  • The data is the number of events that happen on each day.
  • The y-variable is a Quant variable, a number
  • The x-variable is also a Quant variable, a time variable

Note

Tourist: Any famous people born around here?
Guide: No sir, best we can do is babies.

Line Chart for Time Series

Timing our Data

OK, Let’s get some data to chart:

Rows: 5479 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): year, month, date_of_month, day_of_week, births

And let’s for now use a pre-set Workflow in Orange

Download Orange Time Series Workflow

  • We will look at the data
  • Make a Data dictionary
  • Identify the Qual and Quant variables
  • Prepare Time Series Charts with Quant variables, and Qual variables

Data Dictionary

Qualitative Variables

  • year: Year of birth (Qual)
  • month: Month of the year (Qual)
  • date_of_month: Day of the month (Qual)
  • day_of_week: Day of the week (Qual)

Quantitative Variables

  • births: Number of births on that day (Quant)
# A tibble: 6 × 5
   year month date_of_month day_of_week births
  <dbl> <dbl>         <dbl>       <dbl>  <dbl>
1  2000     1             1           6   9083
2  2000     1             2           7   8006
3  2000     1             3           1  11363
4  2000     1             4           2  13032
5  2000     1             5           3  12558
6  2000     1             6           4  12466

Orange Practice Session #4

  • The data is the number of births in the USA, by day, month, and year
  • Let us use the Group By widget to group by month and day_of_week
  • AND compute the mean(births) in the same widget
  • We plot the mean(births) vs month, and colour by day_of_week
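
What the Group By widget computes can be sketched in plain Python. This is a minimal illustration on the six rows printed in the table above, grouping by month and day_of_week (both are needed to plot mean births against month, coloured by day_of_week). The helper `group_mean` is hypothetical, and reading day_of_week 6 and 7 as Saturday and Sunday is an assumption.

```python
from collections import defaultdict
from statistics import mean

# The six rows shown above: (year, month, date_of_month, day_of_week, births).
rows = [
    (2000, 1, 1, 6, 9083),
    (2000, 1, 2, 7, 8006),
    (2000, 1, 3, 1, 11363),
    (2000, 1, 4, 2, 13032),
    (2000, 1, 5, 3, 12558),
    (2000, 1, 6, 4, 12466),
]


def group_mean(rows, key):
    """Group rows by key(row) and return the mean of births per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[4])  # births is the fifth column
    return {k: mean(v) for k, v in groups.items()}


# One mean per (month, day_of_week) pair.
by_month_and_day = group_mean(rows, key=lambda r: (r[1], r[3]))

# Assumption for illustration: day_of_week 6 and 7 are the weekend.
by_weekend = group_mean(rows, key=lambda r: "weekend" if r[3] >= 6 else "weekday")

print(by_month_and_day)
print(by_weekend)
```

With only six rows each (month, day_of_week) group holds a single day, so the weekend/weekday split is the more telling summary here; on the full 5479 rows every group averages many days.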

Mean Births Time Series

Wait, But Why?

  • Time series data is a special kind of Quantitative data, where the x-variable is a time variable.
  • The y-variable is a Quant variable, a number.
  • The x-variable is a Quant variable, a time variable.
  • By colouring by day_of_week, we can see how the number of births varies by day of the week.

What is the Story Here?

  • Could there be a staffing shortage at hospitals on weekends?
  • Is this a “revealed preference” thing?
  • Or should we watch Grey’s Anatomy, or even House?

Being a Mermaid with Katie Ledecky

  • Katie Ledecky is a swimmer, and a mermaid.
  • She has won 9 Olympic gold medals, and 21 World Championship gold medals.
  • She is the world record holder in the 800 and 1500 meter freestyle events, and a former world record holder in the 400 meter freestyle.
  • What does that make her? An Outlier

So how do we find, and show, outliers?

  • Outliers are data points that are significantly different from the rest of the data.
  • They can be identified using box plots, which show the distribution of the data.
  • Box plots show the median, quartiles, and outliers of the data.
  • Of course, Ledecky was in the water! Well in.
Figure 3: Box Plot Definitions
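
The numbers behind a box plot can be sketched with the Python standard library. The sketch assumes the common 1.5 × IQR convention (Tukey's rule) for flagging outliers; Orange's own quartile and whisker rules may differ in detail. The values are made up, and `box_plot_stats` is a hypothetical helper.

```python
import statistics


def box_plot_stats(values, whisker=1.5):
    """Return quartiles, IQR, fences, and the points lying outside the fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        "outliers": outliers,
    }


# Hypothetical values: eight ordinary observations and one Ledecky.
stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the quartiles are 4, 6, and 8, so the fences sit at -2 and 14, and only the value 30 is flagged as an outlier.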

Being an Outlier with Katie Ledecky

  • Let’s get some data to plot:
Rows: 397 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): rank, discipline, sex
dbl (3): yrs.since.phd, yrs.service, salary

Data Dictionary

Qualitative Variables

  • rank: Rank of the academic (Qual)
  • discipline: Discipline of the academic (Qual)
  • sex: Male / Female (Qual)

Quantitative Variables

  • yrs.since.phd: Years since PhD (Quant). Can be Qual??
  • yrs.service: Years of service (Quant)
  • salary: Salary of the academic (Quant)
Figure 4: Salaries Data Table

Research Question #1

Question

Q1. What is the distribution of salary? How does it change if we split by sex?

Figure 5: Salaries Box Plot
Figure 6: Salaries Box Plot by Sex

Research Question #2

Question

Q2. What is the distribution of salary, when we split by other Qual variables, such as rank?

Figure 7: Salaries Box Plot by Rank

Wait, But Why?

  • Box Plots tell us distributions of Quant variables, and show us outliers.
  • They show us the median, quartiles, and outliers of the data.
  • They are useful for comparing distributions of Quant variables across Qual variables (e.g. sex or rank).

What is the Story Here?

  • Could there be a systemic bias in salaries?
  • A statistical t-test / ANOVA would tell us whether the differences in mean salary between groups are statistically significant.
  • Look in the figures for a t-test and ANOVA report at the bottom.

Jack and Rose lived happily ever after

Jack and Rose lived happily ever after?

  • The Titanic sank on 15 April 1912, after hitting an iceberg.
  • What are the chances that Jack survived too?
  • What did his chances depend on?

Jack and Rose lived happily ever after?

  • Let’s get the titanic data, using the Datasets widget in Orange.

  • There were 2201 people on board (passengers and crew), as per this dataset.

  • And let’s use a pre-set Workflow in Orange

Download the Orange Mosaic Chart Workflow

Data Dictionary: titanic

Quantitative Data

None.

Qualitative Data

  • survived: (chr) yes or no
  • status: (chr) Class of Travel, else “crew”
  • age: (chr) Adult, Child
  • sex: (chr) Male / Female.

What kind of Data Variables will we choose?

Variable #1 | Variable #2 | Chart Names | Chart Shape
Qual | Qual | Pies, Mosaic Charts |

Here, area ∝ count: the area of each tile is proportional to the count of observations in that tile.

Research Question #1

Note

What is the dependence of survived upon sex?

Note

  • Note the huge imbalance in survived with sex
  • Men have clearly perished in larger numbers than women.
  • Colouring shows large positive residuals for men who died, and large negative residuals for women who died.

So sadly Jack is far more likely to have died than Rose.

Research Question #2

How does survived depend upon status?

Note

  • Crew has seen deaths in large numbers,
    • as seen by the large negative residual for crew-survivals.
  • First Class passengers have had speedy access to the boats and have survived in larger proportions than say second or third class.
  • There is a large positive residual for first-class survivals.
  • Rose travelled first class and Jack was third class. So again the odds are stacked against him.

What are these Residuals anyhow?

Expected counts are the counts we would see in each tile if the two Qual variables were independent of each other. When the differences between the actual and expected counts are large, we infer that the two Qual variables are associated (speaking counts-wise or ratio-wise).

Actual Counts

Expected Counts!!

Tile-Wise Differences = Residuals
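
The three steps (actual counts, expected counts, tile-wise residuals) can be sketched in Python. Two assumptions are made here: the survived-by-sex counts below are the ones widely reported for the 2201-row Titanic data and should be checked against your own Data Table, and the residual is scaled as a Pearson residual, (actual - expected) / sqrt(expected), a common choice for colouring mosaic tiles. The helper name `expected_and_residuals` is hypothetical.

```python
import math

# Assumed counts for (sex, survived); verify against the Orange Data Table.
actual = {
    ("male", "no"): 1364,
    ("male", "yes"): 367,
    ("female", "no"): 126,
    ("female", "yes"): 344,
}


def expected_and_residuals(actual):
    """Expected counts under independence, plus Pearson residuals per tile."""
    total = sum(actual.values())
    row_totals, col_totals = {}, {}
    for (row, col), n in actual.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n

    expected, residuals = {}, {}
    for (row, col), n in actual.items():
        # Independence: expected = row total * column total / grand total.
        e = row_totals[row] * col_totals[col] / total
        expected[(row, col)] = e
        residuals[(row, col)] = (n - e) / math.sqrt(e)
    return expected, residuals


expected, residuals = expected_and_residuals(actual)
for tile in actual:
    print(tile, actual[tile], round(expected[tile], 1), round(residuals[tile], 2))
```

Tiles with a positive residual hold more observations than independence would predict, and tiles with a negative residual hold fewer.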

Wait, But Why?

  • Mosaic Charts are used to show the relationship between two Qual variables.
  • They show the counts of observations in each combination of the two Qual variables.
  • The area of each tile is proportional to the count of observations in that tile
  • The colour of the tile shows the residuals, which are the differences between the actual and expected counts.

The Art of Surprise with Gabbar Singh

  • What was the “data” Gabbar was looking at?
Chamber | Bullet
1 | Y / N
2 | Y / N
.. | ..
6 | Y / N
  • The number of people in the village, and the number of people who were armed. (Quant)
  • The number of bullets in the gun (Quant)
  • The ID of the pistol chamber which contains a bullet (Qual)

The Art of Surprise with Gabbar Singh

And Gabbar’s Hypothesis?

  • That three bullets would never line up just ready to be fired, especially after he had fired three off!!
  • So he could claim “ignorance”!
  • “Ignorance” == “Fairness” == “Justice” !
  • But he was pretend-surprised when the three bullets were lined up, and he could fire them all off at once!
  • But no one could accuse him of anything, because he was “ignorant” of the fact that the bullets were lined up!

Gabbar’s Gun Chamber Permutations
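
The arrangements can be enumerated in a few lines of Python. This is a sketch under stated assumptions: every placement of the three remaining bullets in the six chambers is taken to be equally likely after the spin, and chambers 4, 5, and 6 are arbitrarily taken to be the ones "lined up". Strictly speaking these are combinations (which chambers are loaded), not permutations.

```python
from fractions import Fraction
from itertools import combinations

CHAMBERS = range(1, 7)  # six chambers, numbered 1 to 6
N_BULLETS = 3           # three rounds left after three were fired off

# Every way of choosing which three chambers hold a bullet.
arrangements = list(combinations(CHAMBERS, N_BULLETS))

# Assumption: chambers 4, 5, 6 are the ones "lined up" together.
lined_up = [a for a in arrangements if a == (4, 5, 6)]

probability = Fraction(len(lined_up), len(arrangements))
print(len(arrangements), probability)
```

Under these assumptions there are 20 arrangements and exactly one of them has the bullets lined up, a chance of 1 in 20, or 0.05.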

So how do we become Gabbar Singh?

Is ethnicity (as revealed by first names) a basis for racial discrimination, in the US?

This dataset was generated as part of a landmark research study done by Marianne Bertrand and Sendhil Mullainathan.

Read the description therein to really understand how you can prove causality with a well-crafted research experiment.

# A tibble: 6 × 3
  name    ethnicity call 
  <chr>   <chr>     <fct>
1 Allison cauc      no   
2 Kristen cauc      no   
3 Lakisha afam      no   
4 Latonya afam      no   
5 Carrie  cauc      no   
6 Jay     cauc      no   
# A tibble: 4 × 3
# Groups:   ethnicity [2]
  ethnicity call      n
  <chr>     <fct> <int>
1 afam      yes     157
2 afam      no     2278
3 cauc      yes     235
4 cauc      no     2200
# A tibble: 2 × 2
  ethnicity call_prop
  <chr>         <dbl>
1 afam           6.45
2 cauc           9.65
diffprop 
0.108199 

The Art of Surprise with Gabbar Singh

  • So it appears the call percentage is different for the two ethnicities, afam and cauc
  • But is it statistically significant? Would Gabbar be surprised?
  • Let us pretend ethnicity does not matter and spin the revolver!!
  • We mess with the ethnicity variable, some 5000 times
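
The shuffle can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not the code behind the slides: the function name `permutation_test`, the seed, and the choice of statistic (call-back rate for cauc minus call-back rate for afam, 235/2435 − 157/2435 ≈ 0.032) are all choices made here. The printed diffprop of 0.108199 appears to condition the other way round (share of afam among the "no" answers minus their share among the "yes" answers, 2278/4478 − 157/392); either statistic can be shuffled in the same manner.

```python
import random

# Counts copied from the summary table above.
CALLS = {"afam": {"yes": 157, "no": 2278}, "cauc": {"yes": 235, "no": 2200}}


def permutation_test(calls, n_perm=5000, seed=42):
    """Two-sided permutation test on the difference in call-back rates."""
    n_afam = calls["afam"]["yes"] + calls["afam"]["no"]
    n_cauc = calls["cauc"]["yes"] + calls["cauc"]["no"]
    n_yes = calls["afam"]["yes"] + calls["cauc"]["yes"]
    n_total = n_afam + n_cauc
    observed = calls["cauc"]["yes"] / n_cauc - calls["afam"]["yes"] / n_afam

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling ethnicity labels is equivalent to dealing the n_yes
        # call-backs to random resume positions; positions below n_afam
        # count as "afam" in this shuffle.
        yes_positions = rng.sample(range(n_total), n_yes)
        afam_yes = sum(1 for p in yes_positions if p < n_afam)
        cauc_yes = n_yes - afam_yes
        diff = cauc_yes / n_cauc - afam_yes / n_afam
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


observed, p_value = permutation_test(CALLS)
print(f"observed difference: {observed:.4f}, p-value: {p_value:.4f}")
```

If ethnicity truly did not matter, shuffled differences as large as the observed one would turn up often; a small p-value says they almost never do.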

The Art of Surprise with Gabbar Singh

  • Our shuffled datasets are not able to mimic Mother Nature, aka Reality
  • The red line is the observed difference in proportions, and it lies far outside the null distribution.
  • So we can reject the NULL Hypothesis that ethnicity does not matter.
  • Hence we infer that there was bias in the hiring process, and that afam candidates were discriminated against.

Gabbar vis-à-vis a Stats Teacher

Gabbar | Stats Teacher
“Kitne aadmi thay?” | How many observations do you have? n < 30 is a joke.
Kya Samajh kar aaye thay? Gabbar khus hoga? Sabaasi dega kya? | What are the levels in your Factors? Are they binary? Don’t do ANOVA just yet!
(Fires off three rounds) Haan, ab theek hai! | Yes, now the dataset is balanced wrt the factor (Treatment and Control).
Is pistol mein teen zindagi aur teen maut bandh hai. Dekhte hain kisko kya milega. | This is our Research Question, for which we will Design an Experiment.
Hume kuchh nahi pataa! | Let us perform a non-parametric Permutation Test for this Factor!
Kamaal ho gaya! | Fantastic! Our p-value is so small that we can reject the NULL Hypothesis!!

Thank You!!

Questions? Comments? Suggestions?

arvind.venkatadri@gmail.com

This Presentation: https://av-quarto.netlify.app/content/projects/modules/talks/vizchitra25/

Data and Workflow files are available at https://www.dropbox.com/scl/fo/15ha6pmtqd3t17c39idxi/AEviMggAcido-xQHiD82dH4?rlkey=h9xj09pdtx3st9basdfljwwyw&st=06dvqnbb&dl=0