Huh. I was thinking it seemed like a fairly reasonable question. You sort D by length, and then iterate through it doing a kind of string matching. Both of those operations are pretty "foundational" in programming, right?
I'd be curious what you think a better question would be. It's not for brand new programmers - it says the pre-req is a couple of previous computing courses.
Maybe I'm off here, but sorting D shouldn't be necessary. It adds O(|D| log(|D|)) to the runtime (with |D| being the number of elements in D) when a single linear pass through D is sufficient. Checking whether a single word d in D is a subsequence of S is linear, thus O(|S|).
While sorting might work out in the best case (a long string and a few words of different lengths, so you can terminate early), you've just taken a problem solvable on average in O(|S| * |D|) and made it O(|S| * |D| * log(|D|)). If you don't realize this and its implications, I would assume you just failed your interview.
Sorting D is a one-time O(n log n) operation, so the overall complexity is O(n log n + n * m), which reduces to O(n * m), where n is the number of words in D and m is the length of S (assuming m is at least on the order of log n).
On another note: I absolutely do not understand why we substitute the international mathematical symbol for cardinality (|A|) with variables we have to explain.
Computer science has a math background, so we all know set theory. At least I also learned using cardinality with Big O notation in university. But for some reason industry prefers to use variables here.
I'm not sure my math classes ever actually mentioned that particular notation. I agree it's more elegant to formalize the analysis, but it's not particularly elegant on the webpage -- the kerning feels off.
What is a subsequence then? I thought you were looking for words that show up in the long string. I visualize sliding each word across the string until it lines up with the same characters.
The subsequences of a string X are the strings Y obtainable by dropping zero or more characters from X, with the remaining characters kept in their original order. As an example, the string "help" has 16 subsequences: ["", "h", "e", "l", "p", "he", "hl", "hp", "el", "ep", "lp", "hel", "hep", "hlp", "elp", "help"]. Note that the subsequences are in bijection with the power set of character positions (treating repeated characters as distinct), so a string of n distinct characters has 2^n subsequences.
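The linear check mentioned upthread follows directly from that definition: scan the long string once, consuming the word's characters in order. A minimal sketch (the function name is mine):

```python
def is_subsequence(word, s):
    """Return True if `word` is a subsequence of `s`, in O(len(s))."""
    it = iter(s)
    # `ch in it` scans the iterator forward, consuming characters of s,
    # which forces the matches to appear in their original order.
    return all(ch in it for ch in word)
```

The iterator trick is equivalent to a two-pointer scan: each `ch in it` picks up where the previous search left off.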
Okay gotcha. So for each word in the length-sorted set, walk the string "picking up" characters as you go. If you complete the word, that's your answer.
I'm sure there's a faster or more mathematical answer. But that would be my napkin python.
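Something along these lines, sketching the walk described above (names are mine; I'm keeping a dict of per-word progress, though it's arguably overkill):

```python
def longest_word_in_string(s, words):
    # Dict of candidate -> count of characters "picked up" so far,
    # with the longest words first.
    picked = {w: 0 for w in sorted(words, key=len, reverse=True)}
    for word in picked:
        # Walk the whole string for this candidate.
        for ch in s:
            if picked[word] < len(word) and ch == word[picked[word]]:
                picked[word] += 1
        if picked[word] == len(word):
            return word  # longest-first order, so the first complete word wins
    return None
```

For example, `longest_word_in_string("abppplee", ["able", "ale", "apple", "bale", "kangaroo"])` should give "apple".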
You probably don't need a dict; just a list of candidates. And to be clear, you'd sort that list by descending word length and keep another list of indexes into each word.
EDIT: You shouldn't sort the list beforehand. You should compile the list of matched candidates after walking through the whole string and pick the longest match. (Or just represent the matched candidates in a list of bools). It's possible that your answer only happens to match the last few characters of your haystack, so you must walk the entire length of the haystack.
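That single-pass version might look like this (a sketch under the same assumptions; names are mine): one walk over the haystack, one progress index per candidate, and the longest completed match picked at the end.

```python
def longest_match_single_pass(s, words):
    # progress[j] = how many characters of words[j] have been matched so far.
    progress = [0] * len(words)
    for ch in s:  # walk the entire haystack exactly once
        for j, word in enumerate(words):
            if progress[j] < len(word) and word[progress[j]] == ch:
                progress[j] += 1
    # Collect fully-matched candidates only after the full walk.
    matched = [w for w, p in zip(words, progress) if p == len(w)]
    return max(matched, key=len, default=None)
```

No pre-sorting needed, and a match that only completes on the last few characters of the haystack is still found.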
EDIT2: This solution also makes a bunch of assumptions, and you can find a better runtime in certain cases. For example, if you have many more candidates than the length of your haystack, it might be worthwhile to generate all O(2^n) subsequences of the haystack and store them in a hashmap/bloom filter. Then you can do (almost) O(1) lookups for each word. This would also lend itself well to a distributed compute approach: you could shard across k machines, with the first machine computing and storing the first 2^n/k subsequences in its hashmap/bloom filter, the second the next 2^n/k, and so on. Then you can either 1) in the case of hashmaps, fan out each request to every machine, succeeding if any machine succeeds, or 2) merge the bloom filters on a central machine (assuming the same hash functions and modulus constants) and do O(1) lookups there. (Copying hashmaps is O(data stored inside); merging bloom filters is O(bloom filter size).)
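The precompute step could be sketched like this (only feasible for short haystacks, since the set can hold up to 2^n entries; `itertools.combinations` picks index subsets in order, which is exactly the subsequence property):

```python
from itertools import combinations

def all_subsequences(s):
    # Every in-order subset of characters of s; duplicates arising from
    # repeated characters collapse in the set.
    return {"".join(combo)
            for r in range(len(s) + 1)
            for combo in combinations(s, r)}
```

Once built, each dictionary word becomes a single set-membership test, e.g. `"hlp" in all_subsequences("help")`.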
I guess it depends on profiling real world behaviour. If you walk the string only a few times because long words are present, that's better than walking the string once and, for each character, walking your whole set of words. Or if most words fail a match early on, you'll cull the candidate set pretty quickly.
Yeah I'm convinced I wouldn't bother with something beyond my basic idea until I profile.