If you have one or two files, you can take the time to manually work out what they are, what they contain, and how to get the useful bits out (probably....). However, this approach really doesn't scale, mechanical turks or no! Luckily, there are open source projects and libraries out there which can help, and which can scale!
In this talk, we'll first look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how to use things like Apache Tika to do this, along with some other libraries to complement it. Once that part's all sorted, we'll look at how to roll this all out for a large-scale Search or Big Data setup, helping you turn those 1s and 0s into useful content at scale!