Wednesday, December 05, 2007

regex to split a document into words

Say you wanted to split a document into words, but works like are'nt shouldnt be split on the ' and numbers like 10,004,333 should remain intact, but other punctuation should be removed from the resulting word array.

A good way to do this is using Scanner's findWithinHorizon and a regular expression. This way you dont need to read the entire document into memory before processing it.

1 comment:

  1. Which language would this be? Java?..Can I do the same in C#?