Day 1!
Day 1: Sorting and Distance Calculation
Problem Statement
Given two columns of numbers in a string format, we need to:
- Sort both columns independently from lowest to highest
- Calculate the absolute difference between corresponding numbers
- Sum up all the differences
Example Input
58990 83989 26183 15707 48195 12659 20176 26012 26730 42699
Why Do We Need Sorting?
Let's examine why sorting is necessary with a small example:
Example with Two Different Matchings
Left | Right |
---|---|
1 | 10 |
5 | 2 |
Matching 1: After Sorting Both Lists
Left | Right | Difference |
---|---|---|
1 | 2 | |diff| = 1 |
5 | 10 | |diff| = 5 |
Total: | 6 |
Matching 2: Without Sorting
Left | Right | Difference |
---|---|---|
1 | 10 | |diff| = 9 |
5 | 2 | |diff| = 3 |
Total: | 12 |
We see that sorting both lists gives us the minimum total difference. Any other matching would result in a larger total.
Approach 1: Single Pass with List Comprehension
left, right = zip(*(map(int, line.split())
for line in input.splitlines() if line.strip()))
left_sorted = sorted(left)
right_sorted = sorted(right)
- ✓ Clean, one-liner solution
- ✓ Uses efficient built-in functions
- ✓ Good readability
- ⚠ Creates intermediate lists in memory
Approach 2: Direct to Arrays with NumPy
import numpy as np
# Convert directly to 2D array and then split columns
arr = np.array([line.split() for line in input.splitlines()
if line.strip()], dtype=int)
left_sorted = np.sort(arr[:, 0])
right_sorted = np.sort(arr[:, 1])
- ✓ Very efficient for large datasets
- ✓ Vectorized operations
- ✓ Good for numerical computations
- ⚠ Requires NumPy dependency
- ⚠ Overhead for small datasets
- ⚠ Additional memory for array conversion
Approach 3: Generator-based parsing
def parse_numbers(input_str):
for line in input_str.splitlines():
if line.strip():
l, r = line.split()
yield int(l), int(r)
left, right = zip(*parse_numbers(input))
left_sorted = sorted(left)
right_sorted = sorted(right)
- ✓ Memory efficient for large files
- ✓ Lazy evaluation
- ✓ Good for streaming data
- ⚠ More complex implementation
- ⚠ Still needs memory for final sorting
- ⚠ Multiple passes through data
Recommendation
For parsing and sorting the input data, Approach 1 (List Comprehension) is recommended for most cases because:
- The input size is moderate (fits easily in memory)
- Code clarity is important for maintenance
- No external dependencies required
- Performance is good enough for typical use cases
Consider alternatives when:
- → Working with very large datasets (Approach 2)
- → Memory is severely constrained (Approach 3)
- → Already using NumPy in your project (Approach 2)
- → Need to process data in chunks (Approach 3)
Calculating the Total Distance
Once we have our sorted lists, we need to calculate the sum of absolute differences between corresponding elements. Here are several approaches:
Approach 1: List Comprehension
total = sum(abs(l - r) for l, r in zip(left_sorted, right_sorted))
- ✓ Clean, Pythonic syntax
- ✓ Good readability
- ✓ Memory efficient (generates values one at a time)
- ⚠ Creates generator object in memory
Approach 2: NumPy Vectorization
import numpy as np
total = np.abs(np.array(left_sorted) - np.array(right_sorted)).sum()
- ✓ Very fast for large arrays due to vectorization
- ✓ Clean, mathematical syntax
- ⚠ Overhead of converting to numpy arrays
- ⚠ Only worth it for large datasets (>10k elements)
- ⚠ Additional dependency required
Approach 3: Map with Lambda
total = sum(map(lambda x: abs(x[0] - x[1]),
zip(left_sorted, right_sorted)))
- ✓ Memory efficient (lazy evaluation)
- ✓ No intermediate list created
- ⚠ Less readable than list comprehension
- ⚠ Lambda functions can be harder to debug
Approach 4: Traditional For Loop
total = 0
for l, r in zip(left_sorted, right_sorted):
total += abs(l - r)
- ✓ Most explicit and easiest to understand
- ✓ Good for debugging (can add print statements)
- ✓ No hidden memory operations
- ⚠ More verbose than other approaches
- ⚠ Slightly slower for large lists
Recommendation
For this specific problem, Approach 1 (list comprehension) is likely the best choice because:
- The data size is moderate (no need for NumPy optimization)
- The operation is simple (one-liner is appropriate)
- Code readability is important
- We don't need debugging capabilities
If you're dealing with very large datasets (millions of entries), consider Approach 2 with NumPy. For debugging or educational purposes, Approach 4 might be more appropriate.
Calculating Similarity Score
After finding the minimum total distance, we need to calculate a similarity score. This score is based on how many times each number from the left list appears in the right list.
Step 1: Create Set for Efficient Lookup
left_set = set(left_sorted)
- ✓ Converts list to set for O(1) lookups
- ✓ Removes duplicates automatically
- ✓ Memory efficient for large datasets
- ⚠ Order is not preserved (but not needed here)
Step 2: Count Matching Numbers
matches = {num: 0 for num in left_set} # initialize all counts to 0
for num in right_sorted:
if num in left_set:
matches[num] += 1
- ✓ Initialize counts to zero for all left numbers
- ✓ Efficiently count occurrences in right list
- ✓ Only counts numbers that exist in left list
- ⚠ Requires additional memory for dictionary
Step 3: Filter Non-Zero Matches
# show non zero matches
non_zero_matches = {k: v for k, v in matches.items() if v > 0}
- ✓ Removes numbers with zero matches
- ✓ Creates cleaner dictionary for final calculation
- ✓ Makes debugging easier
- ⚠ Creates new dictionary in memory
Step 4: Calculate Similarity Score
similarity_score = sum(num * count for num, count in non_zero_matches.items())
- ✓ Multiplies each number by its count
- ✓ Sums all products for final score
- ✓ Uses generator expression for memory efficiency
- ✓ Only processes numbers that had matches
Why This Approach?
This multi-step approach provides an efficient way to calculate similarity:
- Using a set makes lookups very fast
- Dictionary keeps track of counts without multiple passes
- Filtering non-zero matches reduces final calculation work
- Generator expression in final sum keeps memory usage low