Can You Find The Fish In State Names?
By Benjamin Soltoff in R
May 22, 2020
The recent Riddler classic offered this intriguing challenge:
Ohio is the only state whose name doesn't share any letters with the word “mackerel.” It's strange, but it's true.
But that isn't the only pairing of a state and a word you can say that about — it's not even the only fish! Kentucky has “goldfish” to itself, Montana has “jellyfish” and Delaware has “monkfish,” just to name a few.
What is the longest “mackerel?” That is, what is the longest word that doesn't share any letters with exactly one state? (If multiple “mackerels” are tied for being the longest, can you find them all?)
Extra credit: Which state has the most “mackerels?” That is, which state has the most words for which it is the only state without any letters in common with those words?
Given the provided dictionary, we need to find every word that shares no letters with exactly one state name (those words are the “mackerels”).
To do that, I tokenized each word at the character level, then compared every dictionary word against all 50 state names to find the non-overlapping pairs.
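Before building the full pipeline, the core check can be sketched for a single pair (a toy example of my own, not part of the pipeline below):

```r
# Toy example of the core check: does "mackerel" share any letters with "Ohio"?
word_letters  <- unique(strsplit("mackerel", "")[[1]])
state_letters <- unique(strsplit(tolower("Ohio"), "")[[1]])
any(word_letters %in% state_letters)
## FALSE -- no overlap, so Ohio is a candidate match for "mackerel"
```

The full solution does exactly this, just scaled up to every word/state pair.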
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
theme_set(theme_minimal(base_size = 18))
First we retrieve the dictionary of words. R already includes the state names as the state.name constant.
# get list of words
all_words <- tibble(word = readLines("https://norvig.com/ngrams/word.list"))
states <- tibble(word = state.name)
Next I use tidytext::unnest_tokens() to tokenize at the character level. Since it doesn't matter how many times a word uses a specific character, I only need each distinct appearance.1 From here I nest() the data frame so each word is a row and its unique characters are stored in a character vector.
# combine together
tokens <- bind_rows(
all_words = all_words,
states = states,
.id = "source"
) %>%
# tokenize by character
unnest_tokens(
output = letter,
input = word,
token = "characters",
drop = FALSE
) %>%
# remove duplicates
distinct() %>%
# store data as one row per word and all unique letters as character vectors
nest(cols = c(letter)) %>%
mutate(letters = map(cols, ~ .$letter)) %>%
select(-cols) %>%
# split by source
group_split(source, .keep = FALSE)
tokens
## <list_of<
## tbl_df<
## word : character
## letters: list
## >
## >[2]>
## [[1]]
## # A tibble: 263,533 x 2
## word letters
## <chr> <list>
## 1 aa <chr [1]>
## 2 aah <chr [2]>
## 3 aahed <chr [4]>
## 4 aahing <chr [5]>
## 5 aahs <chr [3]>
## 6 aal <chr [2]>
## 7 aalii <chr [3]>
## 8 aaliis <chr [4]>
## 9 aals <chr [3]>
## 10 aardvark <chr [5]>
## # … with 263,523 more rows
##
## [[2]]
## # A tibble: 50 x 2
## word letters
## <chr> <list>
## 1 Alabama <chr [4]>
## 2 Alaska <chr [4]>
## 3 Arizona <chr [6]>
## 4 Arkansas <chr [5]>
## 5 California <chr [8]>
## 6 Colorado <chr [6]>
## 7 Connecticut <chr [7]>
## 8 Delaware <chr [6]>
## 9 Florida <chr [7]>
## 10 Georgia <chr [6]>
## # … with 40 more rows
From here, I need to form all possible combinations between each dictionary word and state name. expand.grid()
lets me form all possible combinations, while a couple of left_join()
operations bring together all the character tokens.
# form all possible combinations of the dictionary words and state names
tokens_combo <- expand.grid(
word_all = tokens[[1]]$word,
word_state = tokens[[2]]$word
) %>%
left_join(tokens[[1]], by = c("word_all" = "word")) %>%
rename(letters_all = letters) %>%
left_join(tokens[[2]], by = c("word_state" = "word")) %>%
rename(letters_state = letters) %>%
as_tibble()
tokens_combo
## # A tibble: 13,176,650 x 4
## word_all word_state letters_all letters_state
## <chr> <chr> <list> <list>
## 1 aa Alabama <chr [1]> <chr [4]>
## 2 aah Alabama <chr [2]> <chr [4]>
## 3 aahed Alabama <chr [4]> <chr [4]>
## 4 aahing Alabama <chr [5]> <chr [4]>
## 5 aahs Alabama <chr [3]> <chr [4]>
## 6 aal Alabama <chr [2]> <chr [4]>
## 7 aalii Alabama <chr [3]> <chr [4]>
## 8 aaliis Alabama <chr [4]> <chr [4]>
## 9 aals Alabama <chr [3]> <chr [4]>
## 10 aardvark Alabama <chr [5]> <chr [4]>
## # … with 13,176,640 more rows
The tricky part was finding the right syntax to check each character of the dictionary word individually: if any of its letters appears in the state name, the pair is not a valid match. Combining %in% with any() does exactly that.
# only keep rows where the dictionary word does not have any letters in common
# with the state name
words_not_in_common <- tokens_combo %>%
mutate(any_letters_in_common = map2_lgl(letters_all,
letters_state,
~ any(.x %in% .y))) %>%
filter(!any_letters_in_common)
It turns out that even with over 13 million pairs to iterate through, the comparisons took only about 40 seconds to complete on my computer.
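If runtime mattered more, one alternative (a sketch of my own, not the approach used here) would skip the pairwise map2() loop and test words against a regex character class built from each state's letters:

```r
# Sketch: a word is a candidate for a state iff it contains
# none of the state's letters (assumes lowercase words)
state_class <- function(state) {
  paste0("[", paste(unique(strsplit(tolower(state), "")[[1]]), collapse = ""), "]")
}
state_class("Ohio")
## "[ohi]"
!grepl(state_class("Ohio"), "mackerel")
## TRUE -- "mackerel" contains no "o", "h", or "i"
```

Because grepl() is vectorized over its second argument, one call per state would cover the entire dictionary at once.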
From here, we just need to find all the mackerels and check their word length.
# count, for each word, how many states share no letters with it
mackerels <- words_not_in_common %>%
count(word_all) %>%
# keep only words matched by exactly one state (the mackerels)
filter(n == 1) %>%
# what are their lengths?
left_join(words_not_in_common) %>%
mutate(length = nchar(word_all)) %>%
select(-starts_with("letters"), -any_letters_in_common) %>%
arrange(-length)
## Joining, by = "word_all"
# what are the longest mackerels?
mackerels %>%
filter(length == max(length)) %>%
knitr::kable(col.names = c(
"Word", "Number of mackerels",
"State", "Word length"
))
| Word | Number of mackerels | State | Word length |
|---|---|---|---|
| counterproductivenesses | 1 | Alabama | 23 |
| hydrochlorofluorocarbon | 1 | Mississippi | 23 |
Two words tie for the longest mackerel at 23 characters each.
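As a quick sanity check (my own addition, not part of the original pipeline), we can verify one of the winners directly:

```r
# confirm "hydrochlorofluorocarbon" shares no letters with "Mississippi"
w <- unique(strsplit("hydrochlorofluorocarbon", "")[[1]])
s <- unique(strsplit(tolower("Mississippi"), "")[[1]])
any(w %in% s)
## FALSE
```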
As for the extra credit: which state has the most mackerels? Given what we already calculated, we just need to aggregate the total number of mackerels for each state.
# join with states info
state_mackerels <- mackerels %>%
count(word_state) %>%
# add back in states which have no mackerels
full_join(states, by = c("word_state" = "word")) %>%
mutate(n = ifelse(is.na(n), 0, n)) %>%
arrange(-n, word_state)
ggplot(
data = state_mackerels,
mapping = aes(
x = factor(word_state, levels = word_state) %>% fct_rev(),
y = n
)
) +
geom_col() +
coord_flip() +
labs(
title = "Number of mackerels by state",
x = NULL,
y = "Number of mackerels"
)
Ohio is the clear winner; Alabama, Utah, and Mississippi round out the top four.
Session Info
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.4 (2021-02-15)
## os macOS Big Sur 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Chicago
## date 2021-06-01
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
## backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
## blogdown 1.3 2021-04-14 [1] CRAN (R 4.0.2)
## bookdown 0.22 2021-04-22 [1] CRAN (R 4.0.2)
## broom 0.7.6 2021-04-05 [1] CRAN (R 4.0.4)
## bslib 0.2.5 2021-05-12 [1] CRAN (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] CRAN (R 4.0.2)
## callr 3.7.0 2021-04-20 [1] CRAN (R 4.0.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
## cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.2)
## colorspace 2.0-1 2021-05-04 [1] CRAN (R 4.0.2)
## crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
## DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
## dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.0.4)
## desc 1.3.0 2021-03-05 [1] CRAN (R 4.0.2)
## devtools 2.4.1 2021-05-05 [1] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## dplyr * 1.0.6 2021-05-05 [1] CRAN (R 4.0.2)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
## fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
## ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
## haven 2.4.1 2021-04-23 [1] CRAN (R 4.0.2)
## here 1.0.1 2020-12-13 [1] CRAN (R 4.0.2)
## hms 1.1.0 2021-05-17 [1] CRAN (R 4.0.4)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
## janeaustenr 0.1.5 2017-06-10 [1] CRAN (R 4.0.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.2)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
## knitr 1.33 2021-04-24 [1] CRAN (R 4.0.2)
## lattice 0.20-44 2021-05-02 [1] CRAN (R 4.0.2)
## lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
## lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.0.2)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## Matrix 1.3-3 2021-05-04 [1] CRAN (R 4.0.2)
## memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.2)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
## pillar 1.6.1 2021-05-16 [1] CRAN (R 4.0.4)
## pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
## pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.0.2)
## prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
## processx 3.5.2 2021-04-30 [1] CRAN (R 4.0.2)
## ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.2)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
## readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.0)
## remotes 2.3.0 2021-04-01 [1] CRAN (R 4.0.2)
## reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.2)
## rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
## rmarkdown 2.8 2021-05-07 [1] CRAN (R 4.0.2)
## rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
## rvest 1.0.0 2021-03-09 [1] CRAN (R 4.0.2)
## sass 0.4.0 2021-05-12 [1] CRAN (R 4.0.2)
## scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
## SnowballC 0.7.0 2020-04-01 [1] CRAN (R 4.0.0)
## stringi 1.6.1 2021-05-10 [1] CRAN (R 4.0.2)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
## testthat 3.0.2 2021-02-14 [1] CRAN (R 4.0.2)
## tibble * 3.1.1 2021-04-18 [1] CRAN (R 4.0.2)
## tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.0.2)
## tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.2)
## tidytext * 0.3.1 2021-04-10 [1] CRAN (R 4.0.2)
## tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.0.2)
## tokenizers 0.2.1 2018-03-29 [1] CRAN (R 4.0.0)
## usethis 2.0.1 2021-02-10 [1] CRAN (R 4.0.2)
## utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.2)
## vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
## withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.2)
## xfun 0.23 2021-05-15 [1] CRAN (R 4.0.2)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
- unnest_tokens() automatically converts the text to lowercase, so I don’t need to convert state.name first. ↩︎