I have a digital subscription to a textbook, but it’s super annoying to have to use the website to access the book. I’d like to scrape the ebook and dump the contents into a PDF. I have downloaded proprietary PDFs from websites before using downloader browser plugins and predictable URLs, but this site is pretty locked down, with randomly generated URL tokens and a combination of XML and image data.

Has anyone managed to scrape a digital textbook like this? Any ideas where I should begin?

  • FlyForABeeGuy@lemmy.dbzer0.com
    1 year ago

    I had a few books like that, hosted directly on a scummy academic publisher’s website. No PDF or usable files. I’m currently far from home, so I can’t tell you exactly what program I used. But I noticed that every page was downloaded to my temporary files as image data (a cached version of the page). So I had to manually flip a few pages, download them one by one, and name them correctly. I’ll look on my PC to try to find the program that did that when I’m back.
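
    A rough sketch of the renaming step, not the program mentioned above. It assumes each cached image has its page number somewhere in the file name (cache files often have hashed names, in which case this will not help), and the names `page_number`, `rename_plan`, and `copy_in_order` are made up for illustration. Copying into a separate folder avoids clobbering files while renaming:

```python
import re
import shutil
from pathlib import Path


def page_number(name: str) -> int:
    """Return the last run of digits in the file name, or -1 if there is none."""
    digits = re.findall(r"\d+", Path(name).stem)
    return int(digits[-1]) if digits else -1


def rename_plan(names, prefix="page_", width=4):
    """Map each original name to a zero-padded name, in numeric page order.

    Sorting by the integer value puts "img2" before "img10", which a plain
    alphabetical sort would get wrong. Names without digits sort first.
    """
    ordered = sorted(names, key=page_number)
    return {
        old: f"{prefix}{i:0{width}d}{Path(old).suffix}"
        for i, old in enumerate(ordered, start=1)
    }


def copy_in_order(src, dst):
    """Copy every file in `src` into `dst` under its zero-padded name."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    names = [p.name for p in src.iterdir() if p.is_file()]
    for old, new in rename_plan(names).items():
        shutil.copy2(src / old, dst / new)
```

    With zero-padded names, any tool that combines images alphabetically will keep the pages in the right order.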

      • KevonLooney@lemm.ee
        1 year ago

        Why don’t you simply open the book in a virtual machine like VMware and hit print? It can print to a PDF.

        • TokyoMonsterTrucker@lemmy.dbzer0.com (OP)
          1 year ago

          I can print pages to PDF without a VM. The problem with printing is that these books are over 1000 pages, so I need to automate a good chunk of the process. Ideally, I’d also like to capture the XML text for the PDF, since it will look much better and I won’t have to manually crop 1000 PDFs with annoying borders.
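
          If the pages do end up as images, the cropping at least can be automated. Below is a minimal sketch of finding the content box, assuming the border is near-white and the page is available as rows of 0-255 grayscale values. The names `content_box` and `crop` and the threshold of 245 are illustrative guesses, not taken from any tool in this thread:

```python
def content_box(pixels, threshold=245):
    """Return (left, upper, right, lower) around every pixel darker than
    `threshold`, or None if the whole page is blank.

    `pixels` is a list of equal-length rows of grayscale values (0 = black,
    255 = white). `right` and `lower` are exclusive, like Python slices.
    """
    rows = [y for y, row in enumerate(pixels) if any(v < threshold for v in row)]
    if not rows:
        return None
    width = len(pixels[0])
    cols = [x for x in range(width) if any(row[x] < threshold for row in pixels)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(pixels, box):
    """Cut the grid down to the given (left, upper, right, lower) box."""
    left, upper, right, lower = box
    return [row[left:right] for row in pixels[upper:lower]]


# Tiny 5x6 "page": white everywhere except two dark pixels.
page = [[255] * 6 for _ in range(5)]
page[1][2] = 0
page[3][4] = 30
box = content_box(page)
trimmed = crop(page, box)
```

          The box uses the same (left, upper, right, lower) ordering that Pillow’s `Image.crop` takes, so the same numbers could be applied to real page images. Computing one box from a few sample pages and reusing it for the rest would probably give more consistent margins than cropping each page separately.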

          • KevonLooney@lemm.ee
            1 year ago

            Yeah, I believe you can do that by printing to a non-existent printer and then finding the image file waiting in the print queue. I don’t know if it works on Windows 11, but it used to work pretty well.