Day 1: Sorting and Distance Calculation

Problem Statement

Given two columns of numbers in string format, we need to:

  • Parse the input into a left list and a right list
  • Sort both lists
  • Pair the values by rank (smallest with smallest, second smallest with second smallest, and so on) and sum the absolute differences to get the total distance
  • Calculate a similarity score based on how many times each number from the left list appears in the right list

Example Input

58990   83989
26183   15707
48195   12659
20176   26012
26730   42699

Why Do We Need Sorting?

Let's examine why sorting is necessary with a small example:

Example with Two Different Matchings

Left    Right
   1       10
   5        2

Matching 1: After Sorting Both Lists

Left    Right    Difference
   1        2    |diff| = 1
   5       10    |diff| = 5
                 Total: 6

Matching 2: Without Sorting

Left    Right    Difference
   1       10    |diff| = 9
   5        2    |diff| = 3
                 Total: 12

Sorting both lists gives the minimum total difference; in this example the only other matching totals 12. More generally, whenever two pairs are "crossed" (a smaller left value matched with a larger right value, and vice versa), swapping their right-hand values never increases the total, so the fully sorted matching is always at least as good as any other.
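
For a tiny input like this you can brute-force every possible matching to confirm the claim (a quick sketch using Python's itertools):

from itertools import permutations

left = [1, 5]
right = [10, 2]

# Pair sorted(left) with every ordering of the right values and total the differences
totals = {
    perm: sum(abs(l - r) for l, r in zip(sorted(left), perm))
    for perm in permutations(right)
}
print(totals)  # {(10, 2): 12, (2, 10): 6} -> the sorted pairing (2, 10) is cheapest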

Approach 1: Single Pass with zip and a Generator Expression


# Split each non-empty line into two ints, then transpose into left and right tuples
left, right = zip(*(map(int, line.split())
    for line in input.splitlines() if line.strip()))
left_sorted = sorted(left)
right_sorted = sorted(right)
  • ✓ Clean, compact solution
  • ✓ Uses efficient built-in functions
  • ✓ Good readability
  • ⚠ Creates intermediate lists in memory
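
As a quick sanity check, here is Approach 1 run against the five-line example input shown earlier (a sketch; the input variable simply holds that example as a string):

input = """58990   83989
26183   15707
48195   12659
20176   26012
26730   42699"""

left, right = zip(*(map(int, line.split())
    for line in input.splitlines() if line.strip()))

print(sorted(left))   # [20176, 26183, 26730, 48195, 58990]
print(sorted(right))  # [12659, 15707, 26012, 42699, 83989]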

Approach 2: Direct to Arrays with NumPy


import numpy as np
# Convert directly to 2D array and then split columns
arr = np.array([line.split() for line in input.splitlines() 
   if line.strip()], dtype=int)
left_sorted = np.sort(arr[:, 0])
right_sorted = np.sort(arr[:, 1])
  • ✓ Very efficient for large datasets
  • ✓ Vectorized operations
  • ✓ Good for numerical computations
  • ⚠ Requires NumPy dependency
  • ⚠ Overhead for small datasets
  • ⚠ Additional memory for array conversion
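
To make the column slicing concrete, here is what the arr built above looks like for the five-line example input (a quick sketch):

print(arr.shape)    # (5, 2): five rows, two columns
print(arr[:, 0])    # [58990 26183 48195 20176 26730]  left column, unsorted
print(arr[:, 1])    # [83989 15707 12659 26012 42699]  right column, unsorted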

Approach 3: Generator-Based Parsing


def parse_numbers(input_str):
   for line in input_str.splitlines():
       if line.strip():
           l, r = line.split()
           yield int(l), int(r)

left, right = zip(*parse_numbers(input))
left_sorted = sorted(left)
right_sorted = sorted(right)
  • ✓ Memory efficient for large files
  • ✓ Lazy evaluation
  • ✓ Good for streaming data
  • ⚠ More complex implementation
  • ⚠ Still needs memory for final sorting
  • ⚠ Laziness only helps during parsing; zip(*...) still collects every pair
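
The streaming benefit is clearer when the pairs come straight from a file rather than an in-memory string. A sketch of that variant (the file name input.txt is an assumption):

def parse_file(path):
    # Yield one (left, right) pair per line; the file is never loaded wholesale
    with open(path) as f:
        for line in f:
            if line.strip():
                l, r = line.split()
                yield int(l), int(r)

left, right = zip(*parse_file("input.txt"))
left_sorted = sorted(left)
right_sorted = sorted(right)

As noted above, zip(*...) still collects every pair before sorting, so the savings apply to the parsing stage only.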

Recommendation

For parsing and sorting the input data, Approach 1 (zip with a generator expression) is recommended for most cases because:

  • The input size is moderate (fits easily in memory)
  • Code clarity is important for maintenance
  • No external dependencies required
  • Performance is good enough for typical use cases

Consider alternatives when:

  • → Working with very large datasets (Approach 2)
  • → Memory is severely constrained (Approach 3)
  • → Already using NumPy in your project (Approach 2)
  • → Need to process data in chunks (Approach 3)

Calculating the Total Distance

Once we have our sorted lists, we need to calculate the sum of absolute differences between corresponding elements. Here are several approaches:

Approach 1: Generator Expression

total = sum(abs(l - r) for l, r in zip(left_sorted, right_sorted))
  • ✓ Clean, Pythonic syntax
  • ✓ Good readability
  • ✓ Memory efficient (generates values one at a time)
  • ⚠ Intermediate differences aren't stored, so they can't be inspected afterwards

Approach 2: NumPy Vectorization

import numpy as np
total = np.abs(np.array(left_sorted) - np.array(right_sorted)).sum()
  • ✓ Very fast for large arrays due to vectorization
  • ✓ Clean, mathematical syntax
  • ⚠ Overhead of converting to numpy arrays
  • ⚠ Only worth it for large datasets (>10k elements)
  • ⚠ Additional dependency required

Approach 3: Map with Lambda

total = sum(map(lambda x: abs(x[0] - x[1]), 
   zip(left_sorted, right_sorted)))
  • ✓ Memory efficient (lazy evaluation)
  • ✓ No intermediate list created
  • ⚠ Less readable than list comprehension
  • ⚠ Lambda functions can be harder to debug

Approach 4: Traditional For Loop

total = 0
for l, r in zip(left_sorted, right_sorted):
   total += abs(l - r)
  • ✓ Most explicit and easiest to understand
  • ✓ Good for debugging (can add print statements)
  • ✓ No hidden memory operations
  • ⚠ More verbose than other approaches
  • ⚠ Slightly slower for large lists
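
The debugging advantage is easy to demonstrate: the loop body is a natural place for a per-pair trace (a sketch, not part of the final solution):

total = 0
for l, r in zip(left_sorted, right_sorted):
    diff = abs(l - r)
    print(f"{l} vs {r}: difference {diff}")  # inspect each pairing while developing
    total += diff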

Recommendation

For this specific problem, Approach 1 (the generator expression) is likely the best choice because:

  • The data size is moderate (no need for NumPy optimization)
  • The operation is simple (one-liner is appropriate)
  • Code readability is important
  • We don't need debugging capabilities

If you're dealing with very large datasets (millions of entries), consider Approach 2 with NumPy. For debugging or educational purposes, Approach 4 might be more appropriate.
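
Putting the parsing and the distance calculation together on the five-line example input from the top of this post (the numbers below are my own worked values for that example, not an official answer):

left_sorted = [20176, 26183, 26730, 48195, 58990]
right_sorted = [12659, 15707, 26012, 42699, 83989]

total = sum(abs(l - r) for l, r in zip(left_sorted, right_sorted))
print(total)  # 7517 + 10476 + 718 + 5496 + 24999 = 49206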

Calculating Similarity Score

After finding the minimum total distance, we need to calculate a similarity score. This score is based on how many times each number from the left list appears in the right list.

Step 1: Create Set for Efficient Lookup

left_set = set(left_sorted)
  • ✓ Converts list to set for O(1) lookups
  • ✓ Removes duplicates automatically
  • ✓ Often smaller than the original list when there are many duplicates
  • ⚠ Order is not preserved (but not needed here)

Step 2: Count Matching Numbers

matches = {num: 0 for num in left_set}  # initialize all counts to 0
for num in right_sorted:
   if num in left_set:
       matches[num] += 1
  • ✓ Initialize counts to zero for all left numbers
  • ✓ Efficiently count occurrences in right list
  • ✓ Only counts numbers that exist in left list
  • ⚠ Requires additional memory for dictionary

Step 3: Filter Non-Zero Matches

# Keep only the numbers that actually appeared in the right list
non_zero_matches = {k: v for k, v in matches.items() if v > 0}
  • ✓ Removes numbers with zero matches
  • ✓ Creates cleaner dictionary for final calculation
  • ✓ Makes debugging easier
  • ⚠ Creates new dictionary in memory

Step 4: Calculate Similarity Score

similarity_score = sum(num * count for num, count in non_zero_matches.items())
  • ✓ Multiplies each number by its count
  • ✓ Sums all products for final score
  • ✓ Uses generator expression for memory efficiency
  • ✓ Only processes numbers that had matches

Why This Approach?

This multi-step approach provides an efficient way to calculate similarity:

  • Using a set makes lookups very fast
  • Dictionary keeps track of counts without multiple passes
  • Filtering non-zero matches reduces final calculation work
  • Generator expression in final sum keeps memory usage low
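
For reference, the same counting logic can be collapsed with collections.Counter. This is an alternative sketch rather than the approach described above, but it produces the same score, because a Counter reports 0 for numbers that never appear, which makes the explicit non-zero filter unnecessary:

from collections import Counter

right_counts = Counter(right_sorted)  # occurrences of each number in the right list
similarity_score = sum(num * right_counts[num] for num in set(left_sorted))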