When translating your thoughts into code, most likely, you use the methods that you are most familiar with. These are methods that are top of mind and come automatically to you: you see a string that needs cleaning up and your fingers type the methods that will get the result.
Often, the methods you type automatically are the most generic Ruby methods, because they are the ones we read and write most. For example, #gsub is a generic method for substituting characters in strings. But Ruby has much more to offer, with specialized convenience methods for standard operations.
I love Ruby's rich idiom mostly because it makes code more elegant and easier to read. If we want to benefit from this richness, we need to spend time refactoring even the simplest parts of our code—for instance, cleaning up a string—and it takes a bit of an effort to expand our vocabulary. The question is: is the extra effort worth it?
Four Ways to Remove Spaces
Here's a string that represents a credit card number: "055 444 285". To work with it, we want to remove the spaces. #gsub can do this; with #gsub you can substitute anything with anything. But there are other options.
string = "055 444 285"
string.gsub(/ /, '')
string.gsub(' ', '')
string.tr(' ', '')
string.delete(' ')

# => "055444285"
It's the expressiveness that I like most about the convenience methods. The last one is a good example: it doesn't get more obvious than "delete spaces". When weighing trade-offs between options, readability is my first priority, unless of course it causes performance problems. So, let's see how much pain my favorite solution, #delete, really causes.
I benchmarked the examples above. Which one of these methods do you think is the fastest?
require 'benchmark/ips' # provided by the benchmark-ips gem

string = "055 444 285"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub') { string.gsub(/ /, '') }
  x.report('gsub, no regex') { string.gsub(' ', '') }
  x.report('tr') { string.tr(' ', '') }
  x.report('delete') { string.delete(' ') }

  x.compare!
end
Guess the order from most to least performant, then check the result:
Comparison:
              delete:  2326817.5 i/s
                  tr:  2121629.8 i/s - 1.10x slower
      gsub, no regex:   868184.1 i/s - 2.68x slower
                gsub:   474970.5 i/s - 4.90x slower
I wasn't surprised by the order, but the size of the differences still surprised me. #gsub is not only slower, it also requires extra effort from the reader to 'decode' the arguments. Let's see how this comparison works out when cleaning up more than just spaces.
Pick Your Numbers
Take the following phone number: '(408) 974-2414'. Let's say we only need the digits: 4089742414. I added a #scan variant as well, because I like that it expresses more clearly that we are picking out particular things, instead of trying to remove all the things we don't want.
require 'benchmark/ips' # provided by the benchmark-ips gem

string = '(408) 974-2414'

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub') { string.gsub(/[^0-9]/, '') }
  x.report('tr') { string.tr("^0-9", "") }
  x.report('delete_chars') { string.delete("^0-9") }
  x.report('scan') { string.scan(/[0-9]/).join }
  x.compare!
end
Again, guess the order, then check the answer:
Comparison:
        delete_chars:  2006750.8 i/s
                  tr:  1856429.0 i/s - 1.08x slower
                gsub:   523174.7 i/s - 3.84x slower
                scan:   227717.4 i/s - 8.81x slower
Using a regex slows things down; that's not surprising. And the intention-revealing expressiveness of #scan costs us dearly. But seeing how Ruby's specialized methods handle cleaning up gave me a taste for more.
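Speed only matters if the variants return the same thing, so here is a quick sanity check as a minimal sketch. It is not part of the benchmark, and the variable names (phone, digits, via_tr and so on) are only for illustration. Note that the "^0-9" passed to #delete and #tr is a character selector rather than a regex: "0-9" is a range and the leading "^" negates it.

```ruby
phone = '(408) 974-2414'

# "^0-9" selects every character that is NOT in the range 0-9,
# so #delete removes everything except the digits.
digits = phone.delete("^0-9")

# #tr with an empty replacement string deletes the selected characters too.
via_tr = phone.tr("^0-9", "")

# The regex-based variants: remove non-digits, or collect digits and join.
via_gsub = phone.gsub(/[^0-9]/, '')
via_scan = phone.scan(/[0-9]/).join

puts digits                                                  # prints 4089742414
puts [via_tr, via_gsub, via_scan].all? { |s| s == digits }   # prints true
```

None of these methods mutates the receiver; each returns a new string, so phone still holds the original value afterwards.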
On the Money
Let's try some ways of removing the substring "€ " from the string "€ 300". Some of the following solutions specify the exact substring "€ "; others simply remove all currency symbols or all non-numerical characters.
require 'benchmark/ips' # provided by the benchmark-ips gem

string = "€ 300"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('delete specific chars') { string.delete("€ ") }
  x.report('delete non-numericals') { string.delete("^0-9") }
  x.report('delete prefix') { string.delete_prefix("€ ") }
  x.report('delete prefix, strip') { string.delete_prefix("€").strip }

  x.report('gsub') { string.gsub(/€ /, '') }
  x.report('gsub-non-nums') { string.gsub(/[^0-9]/, '') }
  x.report('tr') { string.tr("€ ", "") }
  x.report('slice array') { string.chars.slice(2..-1).join }
  x.report('split') { string.split.last }
  x.report('scan nums') { string.scan(/\d/).join }
  x.compare!
end
You may expect, and correctly so, that the winner is one of the #delete variants. But which one do you expect to be the fastest? Plus: one of the other methods is faster than some of the #delete variants. Which one?
Guess, then check the answer:
Comparison:
        delete prefix:  4236218.6 i/s
 delete prefix, strip:  3116439.6 i/s - 1.36x slower
                split:  2139602.2 i/s - 1.98x slower
delete non-numericals:  1949754.0 i/s - 2.17x slower
delete specific chars:  1045651.9 i/s - 4.05x slower
                   tr:   951352.0 i/s - 4.45x slower
          slice array:   681196.2 i/s - 6.22x slower
                 gsub:   548588.3 i/s - 7.72x slower
        gsub-non-nums:   489744.8 i/s - 8.65x slower
            scan nums:   418978.8 i/s - 10.11x slower
I was surprised that even slicing an array is faster than #gsub, and I'm always pleased to see how fast #split is. Also note that deleting all non-numericals with #delete("^0-9") is faster than deleting the specific characters with #delete("€ ").
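One thing the numbers don't show is that these variants are not interchangeable. #delete treats its argument as a set of characters and removes them wherever they occur, while #delete_prefix only removes the exact substring at the start. A small sketch, where the second input string is a hypothetical example chosen to make the difference visible:

```ruby
price = "€ 300"

# On the benchmark input, all three return "300".
by_chars  = price.delete("€ ")         # removes every "€" and every space
by_prefix = price.delete_prefix("€ ")  # removes only the leading "€ "
by_split  = price.split.last           # last whitespace-separated token

# Hypothetical input with the same characters in other positions.
messy = "€ 1 300 €"
messy_by_chars  = messy.delete("€ ")         # => "1300"
messy_by_prefix = messy.delete_prefix("€ ")  # => "1 300 €"
messy_by_split  = messy.split.last           # => "€"

puts by_chars, by_prefix, by_split
puts messy_by_chars, messy_by_prefix, messy_by_split
```

So the fastest option is also the strictest one: it only does the right thing when the input really starts with "€ ".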
Follow the Money
Let's remove the currency after the number. (I skipped the slower #gsub variants.)
require 'benchmark/ips' # provided by the benchmark-ips gem

string = "300 USD" # assumed input; the original value is not shown

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub') { string.gsub(/ USD/, '') }
  x.report('tr') { string.tr(" USD", "") }
  x.report('delete_chars') { string.delete("^0-9") }
  x.report('delete_suffix') { string.delete_suffix(" USD") }
  x.report('to_i.to_s') { string.to_i.to_s }
  x.report("split") { string.split.first }
  x.compare!
end
There's a draw between winners. Which 2 do you expect to compete for being the fastest?
And: guess _how much_ slower `#gsub` is here.
Comparison:
       delete_suffix:  4354205.4 i/s
           to_i.to_s:  4307614.6 i/s - same-ish: difference falls within error
               split:  2870187.8 i/s - 1.52x slower
        delete_chars:  1989566.1 i/s - 2.19x slower
                  tr:  1853957.1 i/s - 2.35x slower
                gsub:   524080.6 i/s - 13.22x slower
There isn't always a specialized method that suits your needs. You can't use #to_i if you need to keep a leading "0". And #delete_suffix leans heavily on the assumption that the currency is US dollars.
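Both caveats are easy to see in a short sketch. The input strings here are hypothetical, chosen only to trigger the edge cases:

```ruby
# Hypothetical inputs, chosen to trigger the edge cases.
padded = "055 USD"
euros  = "300 EUR"

# String#to_i parses the leading digits as an integer, so the "0" is lost.
padded_via_to_i   = padded.to_i.to_s              # => "55"
padded_via_suffix = padded.delete_suffix(" USD")  # => "055"

# #delete_suffix returns the string unchanged when the suffix doesn't match.
euros_via_suffix = euros.delete_suffix(" USD")    # => "300 EUR"
euros_via_split  = euros.split.first              # => "300"

puts padded_via_to_i, padded_via_suffix
puts euros_via_suffix, euros_via_split
```

In both cases the slower but less specific #split variant still returns the number as written.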
The specialized methods are like precision tools: suitable for a specific task in a specific context. So there will always be cases where #gsub is exactly what we need. It is versatile, and it's always top of mind. But it can be a bit harder to process and is often slower, even slower than I expected. To me, Ruby's richness is also one of the things that make it so much fun to work with. The speed wins are a nice bonus.