Huh. I was thinking it seemed like a fairly reasonable question. You sort D by length, and then iterate through it doing a kind of string matching. Both of those operations are pretty "foundational" in programming, right?
I'd be curious what you think a better question would be. It's not for brand new programmers - it says the pre-req is a couple of previous computing courses.
Maybe I'm off here, but sorting D shouldn't be necessary. It adds O(|D| log(|D|)) to the runtime (with |D| being the number of elements in D) when a single linear pass through D is sufficient. Checking whether a single word d in D is a subsequence of S is linear, thus O(|S|).
While sorting might work out in the best case (a long string and a few words of different lengths, so you can terminate early), you've just taken a problem solvable on average in O(|S| * |D|) and made it O(|S| * |D| * log(|D|)). If you don't realize this and its implications, I would assume you just failed your interview.
Sorting D is a one-time O(n log n) operation, so the overall complexity is O(n log n + n * m), which reduces to O(n * m), where n is the number of words in D and m is the length of S (assuming m is at least on the order of log n).
On another note: I absolutely do not understand why we substitute the international mathematical symbol for cardinality (|A|) with variables we have to explain.
Computer science has a math background, so we all know set theory. At least I also learned using cardinality with Big O notation in university. But for some reason industry prefers to use variables here.
I'm not sure my math classes ever actually mentioned that particular notation. I agree it's more elegant to formalize the analysis, but it's not particularly elegant on the webpage -- the kerning feels off.
What is a subsequence then? I thought you were looking for words that show up in the long string. I visualize sliding each word across the string until it lines up with the same characters.
The subsequences of a string X are the strings Y obtainable by dropping zero or more characters from X, with the remaining characters kept in their original order. As an example, the string "help" has 16 subsequences: ["", "h", "e", "l", "p", "he", "hl", "hp", "el", "ep", "lp", "hel", "hep", "hlp", "elp", "help"]. Note that the subsequences are in bijection with the power set of character positions (treating repeated characters as distinct), so a string of n distinct characters has 2^n subsequences.
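The linear check mentioned upthread follows directly from that definition: scan the long string once, consuming the word's characters in order. A minimal sketch (the function name is mine):

```python
def is_subsequence(word, s):
    """Return True if `word` is a subsequence of `s`, in O(len(s))."""
    it = iter(s)
    # `ch in it` scans the iterator forward, consuming characters of s,
    # which forces the matches to appear in their original order.
    return all(ch in it for ch in word)
```

The iterator trick is equivalent to a two-pointer scan: each `ch in it` picks up where the previous search left off.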
Okay gotcha. So for each word in the length-sorted set, walk the string "picking up" characters as you go. If you complete the word, that's your answer.
I'm sure there's a faster or more mathematical answer. But that would be my napkin python.
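Something along these lines, sketching the walk described above (names are mine; I'm keeping a dict of per-word progress, though it's arguably overkill):

```python
def longest_word_in_string(s, words):
    # Dict of candidate -> count of characters "picked up" so far,
    # with the longest words first.
    picked = {w: 0 for w in sorted(words, key=len, reverse=True)}
    for word in picked:
        # Walk the whole string for this candidate.
        for ch in s:
            if picked[word] < len(word) and ch == word[picked[word]]:
                picked[word] += 1
        if picked[word] == len(word):
            return word  # longest-first order, so the first complete word wins
    return None
```

For example, `longest_word_in_string("abppplee", ["able", "ale", "apple", "bale", "kangaroo"])` should give "apple".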
You probably don't need a dict; just a list of candidates. And to be clear, you'd sort that list by descending word length and keep another list of indexes into each word.
EDIT: You shouldn't sort the list beforehand. You should compile the list of matched candidates after walking through the whole string and pick the longest match. (Or just represent the matched candidates in a list of bools). It's possible that your answer only happens to match the last few characters of your haystack, so you must walk the entire length of the haystack.
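That single-pass version might look like this (a sketch under the same assumptions; names are mine): one walk over the haystack, one progress index per candidate, and the longest completed match picked at the end.

```python
def longest_match_single_pass(s, words):
    # progress[j] = how many characters of words[j] have been matched so far.
    progress = [0] * len(words)
    for ch in s:  # walk the entire haystack exactly once
        for j, word in enumerate(words):
            if progress[j] < len(word) and word[progress[j]] == ch:
                progress[j] += 1
    # Collect fully-matched candidates only after the full walk.
    matched = [w for w, p in zip(words, progress) if p == len(w)]
    return max(matched, key=len, default=None)
```

No pre-sorting needed, and a match that only completes on the last few characters of the haystack is still found.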
EDIT2: This solution also makes a bunch of assumptions, and you can find a better runtime in certain cases. For example, if you have many more candidates than the length of your haystack, it might be worthwhile to generate all O(2^n) subsequences of the haystack and store them in a hashmap/bloom filter. Then you can do (almost) O(1) lookups for each word. This would also lend itself well to a distributed compute approach: you could shard across k machines, with the first machine computing and storing the first 2^n/k subsequences in its hashmap/bloom filter, the second the next 2^n/k, and so on. Then you can either 1) in the case of hashmaps, fan out each request to every machine, succeeding if any machine succeeds, or 2) merge the bloom filters on a central machine (assuming the same hash functions and modulus constants) and do O(1) lookups there. (Copying hashmaps is O(data stored inside); merging bloom filters is O(bloom filter size).)
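The precompute step could be sketched like this (only feasible for short haystacks, since the set can hold up to 2^n entries; `itertools.combinations` picks index subsets in order, which is exactly the subsequence property):

```python
from itertools import combinations

def all_subsequences(s):
    # Every in-order subset of characters of s; duplicates arising from
    # repeated characters collapse in the set.
    return {"".join(combo)
            for r in range(len(s) + 1)
            for combo in combinations(s, r)}
```

Once built, each dictionary word becomes a single set-membership test, e.g. `"hlp" in all_subsequences("help")`.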
I guess it depends on profiling real world behaviour. If you walk the string only a few times because long words are present, that's better than walking the string once and, for each character, walking your whole set of words. Or if most words fail a match early on, you'll cull the candidate set pretty quickly.
Yeah I'm convinced I wouldn't bother with something beyond my basic idea until I profile.