Tuesday, May 21, 2013

Copying Tables in PDF to Excel

I am currently scouting for species population data for our market study. All was going well (meaning all the data could be exported directly to Excel for easy tabulation and counting) until I found this file listing registered horses from the Philippine Racing Commission.




It was a PDF file, and whenever I tried to save it as text, everything ended up in one column.
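For the curious, here is a rough Python sketch of what fixing that single column by hand would involve. It assumes the text export puts exactly one cell on each line, in row order, and that you know how many columns the table has. The three-column layout, the sample values, and the function names are all made up for illustration; a real export may break cells differently, and empty cells would throw the grouping off.

```python
import csv
import io


def column_to_rows(lines, n_cols):
    """Group a flat list of cell values into rows of n_cols cells each."""
    # Drop blank lines and surrounding whitespace (assumes no cell is empty).
    cells = [line.strip() for line in lines if line.strip()]
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


def rows_to_csv(rows):
    """Render the rows as CSV text that Excel can open."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical sample: one cell per line, three columns per row.
sample = """Name
Sex
Year
Horse A
Male
2009
Horse B
Female
2010
"""

rows = column_to_rows(sample.splitlines(), 3)
print(rows_to_csv(rows))
```

This only works when the export is that tidy, which is a big "if" for a 42-page scanned list.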


If you are like me, too lazy to fix the file manually in Excel (and optimistic that there is a better alternative *wink!), you will thank Ray Kurzweil, who developed the first omni-font Optical Character Recognition (OCR) system. What OCR basically does is recognize typewritten, printed, or even handwritten text in images and convert it to a digital form that is searchable and editable. No more manual encoding!

Now, this may sound like expensive software, but surprisingly it isn't. In fact, there are online OCR sites that will let you convert your image for free! The most dependable one for me in terms of recognition accuracy is OnlineOCR.net.

Here is the screenshot of the converted registered horses file.



Cool, right?

But this site has its limitations also (don't we all?).
  • You can only convert one page at a time (the image above is only the first of the 42 pages in the PDF file).
  • You can convert at most 20 pages if you create an account, because they will give you free credits. Beyond that, you have to purchase credits.


But who needs to purchase credits if you can just snip your file into separate single-page files, right?

And, as with every OCR tool, recognition is never perfect (at least not yet), so it is still better to double-check the output before you hand it to your boss. At least you can now use Spelling and Grammar to help you with that.

Now that Ray has saved you a lot of encoding time, you should use that saved time properly. Smile for the whole time you would have been encoding. You may look silly, but you will feel a lot better.
