arXiv moving from Cornell servers to Google Cloud

(info.arxiv.org)

212 points | by ColinWright 1 day ago

34 comments

x_may 1 day ago
It may be that it was time for the hardware that was previously running Arxiv to be retired and this is just another Capex -> Opex decision being made by so many tech companies.
I'd like to know if GCP is covering part of the bill? Or will Cornell be paying all of it? The new architecture smells of "[GCP] will pay/credit all of these new services if you agree to let one of our architects work with you". If GCP is helping, stay tuned for a blog post from google some time around the completion of the migration with a title like "Reaffirming our commitment to science" or something similarly self affirming.
[-]
- khuey 19 hours ago
  > If GCP is helping, stay tuned for a blog post from google some time around the completion of the migration with a title like "Reaffirming our commitment to science" or something similarly self affirming.
  "Google pays to run an enormous intellectual resource in exchange for a self-congratulatory blogpost" seems like a perfectly acceptable outcome for society here.
  [-]
  - stonogo 14 hours ago
    It wasn't when it happened to Usenet.
    [-]
    - toomuchtodo 10 hours ago
      Frequent backups to the Internet Archive for rehydration when needed. RIP Dejanews. Hopefully we’ve learned from past experience.
  - mistrial9 9 hours ago
    mirrors, please
- yumraj 14 hours ago
  > If GCP is helping, stay tuned for a blog post from google some time around the completion of the migration with a title like "Reaffirming our commitment to science" or something similarly self affirming.
  This is an odd criticism. If a company is footing the bill, it can’t even talk about it to gain some publicity/good will?
  [-]
  - nophunphil 8 hours ago
    Footing the bill for how long?
- flakiness 20 hours ago
  https://info.arxiv.org/about/supporters.html
```
  Our Supporters
  ...
  Gold Sponsors
  Google, Inc (USA)
```
- TZubiri 13 hours ago
  "Reaffirming our commitment to science" or something similarly self affirming.
  While I understand that something is more genuine if done in secret, it doesn't stop being a real commitment to science just because you make a pr post about it.
  If company X contributes to Y open source foundation, that's real and they get to claim clout, nobody cares about a post anyways.
tokai 19 hours ago
Would love to see arXiv set up as a consortium of international academic libraries instead. Scientific publishing is where it is today because universities and scientific societies sold off or gave their journals to private enterprises. Letting Google in is a move in the wrong direction imo.
[-]
- fc417fc802 14 hours ago
  Some sort of federated preprint protocol where anyone could stand up a node and clone the existing data would be ideal. The current centralized operator then becomes "just" a curator (and competing curators are easy to set up).
harywilke 1 day ago
This is something that started as far back as march 2023.
https://investinopen.org/blog/ioi-partners-with-arxiv-to-dev... https://blog.arxiv.org/2023/06/12/arxiv-is-hiring-software/
kiproping 23 hours ago
I recently read about arxiv, it's history and all the mini-drama's around it https://www.wired.com/story/inside-arxiv-most-transformative....
I wonder if Ginsparg is finally retiring and relinquishing access.
[-]
- chubot 19 hours ago
  Wow, this is a great article! (other archive link - https://archive.is/XVCi7 )
  I didn't realize arXiV was started in 1991. And then I wondered why I had never heard of it while I was at Cornell from 1997-2001. Apparently it only assumed the arXiV name in 1999.
  I like that it was a bunch of shell scripts :)
  Long before arXiv became critical infrastructure for scientific research, it was a collection of shell scripts running on Ginsparg’s NeXT machine.
  Interesting connections:
  As an undergrad at Harvard, he was classmates with Bill Gates and Steve Ballmer; his older brother was a graduate student at Stanford studying with Terry Winograd, an AI pioneer.
  On the move to the web in the early 90's:
  He also occasionally consulted with a programmer at the European Organization for Nuclear Research (CERN) named Tim Berners-Lee
  And then there was a 1994 move to Perl, and 2022 move to Python ...
  Although my favorite/primary language is Python, I can't help but wonder if "rewrite in Python" is mainly a social issue ... i.e. maybe they don't know how to hire Perl programmers and move to the cloud. I guess rewrites are often an incomplete transmission of knowledge about the codebase.
  [-]
  - chubot 16 hours ago
    Another tidbit: https://arxiv.org/abs/1706.04188
    FAQ 1: Why did you create arXiv if journals already existed? Has it developed as you had expected?
    Answer: Conventional journals did not start coming online until the mid to late 1990s. I originally envisioned it as an expedient hack, a quick-and-dirty email-transponder written in csh to provide short-term access to electronic versions of preprints until the existing paper distribution system could catch up, within about three months.
    So it was in csh on NeXT. Tim Berners-Lee also developed the web on NeXT!
    [-]
    - LarsDu88 14 hours ago
      John Carmack and John Romero also developed the original Doom on NeXT
kevinventullo 20 hours ago
If funding is an issue, I’m quite certain they could set up a nonprofit to support it. I would happily donate to keep arxiv around.
[-]
- tough 19 hours ago
  its not a nonprofit (cornell, its a university) but they do accept donations
  https://info.arxiv.org/about/donate.html
  [-]
  - philipov 18 hours ago
    Cornell is a nonprofit.
    [-]
    - tough 16 hours ago
      didn't occur to me that was possible in the US cause private unis spend most on marketing i guess but glad it is
      stand corrected
gcr 1 day ago
Cornell is currently in hiring freeze. These roles will not be filled.
Source: I applied to a Cornell-related lab in March. A week after submitting my application the role was rescinded and my contact emailed me explaining the situation.
[-]
- wyclif 7 hours ago
  It looks like the links are broken to the software engineer and DevOps open roles.
- quantumHazer 23 hours ago
  Is it related with policies from the US administration?
  [-]
  - trop 23 hours ago
    From 3/17/25:
    > Together with all of American higher education, Cornell is entering a time of significant financial uncertainty. The potential for deep cuts in federal research funding, as well as tax legislation affecting our endowment income, has now been added to existing concerns related to rapid growth and cost escalations. It is imperative that we navigate this challenging financial landscape with a shared understanding and common purpose, to continue to advance our mission, strengthen our academic community, and deepen our impact. [0]
    [0] https://hr.cornell.edu/2025-hiring-pause
  - mindslight 11 hours ago
    Do bears shit in the woods?
darkoob12 1 day ago
Fantastic. Now countries like Iran are going to be blocked. Internet is not a public network anymore It is owned by mostly American cooperation and they will decide what content to show and which group of people can access it.
[-]
- whygcp 1 day ago
  That's true. I recently had to move a VM from gcp to hetzner because gcp would silently drop all packets to some countries, Iran included. And a Stack overflow question was the easiest way to learn about it, not gcp docs.
  [-]
  - Orygin 23 hours ago
    I have looked at it recently and it seems Iran is blocking GCP, not the other way around. Not sure if Google keep a doc up to date with who blocks them.
    [-]
    - sciurus 14 hours ago
      From my past experience, I can say that Google Cloud services (e.g load balancers) by default blocked traffic from ITAR sanctioned countries. Not just blocking people in those countries from becoming customers of GCP, but blocking them from accessing content hosted on GCP.
      I didn't know how that situation had evolved since I last used GCP.
    - whygcp 22 hours ago
      Why do you think it's Iran doing blocking?
      That's not what Google says: https://support.google.com/a/answer/2891389?hl=en
      [-]
      - surajrmal 19 hours ago
        Workspace is not gcp.
- infecto 23 hours ago
  Is it not ideal to block Iran?
  [-]
  - sd9 20 hours ago
    Why should the 90 million people in Iran not have easy access to scientific knowledge?
    [-]
    - infecto 15 hours ago
      Unfortunately their government is a historic adversary to many western nations. The people there are nice but that’s what happens.
      [-]
      - Wilder7977 13 hours ago
        So it is not _ideal_ , considering that human knowledge is a common good.
        [-]
        infecto 13 hours ago
        Disagree. It is ideal. Iran has been behind many terrorist activities over the decades.
        [-]
        Wilder7977 5 hours ago
        The point is that this is not a punishment against "them", it's a loss for everyone (human kind), who affects for the vast majority innocent people both in Iran and all over the world. I don't see a world where this is "ideal", even if you agree with the block.
        We don't know where the people who will make scientific breakthrough will be. Imagine losing the cure for cancer, or a form of clean energy (or anything that could change the world for everyone) due to this.
        axus 10 hours ago
        So has the USA but we're mostly good people here.
        I can't be angry at Google for following US law, any more than I can be angry at Huawei for following Chinese law.
dvrp 1 day ago
Hope they stay vendor unlocked though
perihelions 1 day ago
Is this related to the federal government's vendetta against Cornell? And is there a risk Cornell becomes unable to operate arXiv because of this?
https://www.npr.org/2025/04/09/g-s1-59090/trump-officials-ha... ("Trump officials halt $1 billion in funding for Cornell, $790 million for Northwestern")
[-]
- harywilke 1 day ago
  No, they announced that they are starting this project back in June 2023 [0], though it is good to see Cornell suing the administration for the second time, first was back in February. And as an alum, who also had relatives up at syracuse, i appreciate the snark from syracuse.com [1] calling Cornell a 'central NY college' heh
  0. https://blog.arxiv.org/2023/06/ 1. https://www.syracuse.com/news/2025/04/central-ny-college-sue...
sppfly 1 day ago
I'm wondering who will pay for the service fee to Google. arXiv will still be free right? That would not be just 88,000/year...
Daviey 21 hours ago
If it means I can download papers without spoofing my user-agent, then I am happy!
[-]
- fc417fc802 14 hours ago
  On the contrary, I'd expect Google to be much more proficient at blocking requests based on various factors. It wouldn't surprise me in the slightest if recaptcha made an appearance.
  By all rights arxiv should be moving towards decentralization as opposed to being picked up by one of the largest centralized players.
andyjohnson0 1 day ago
Looks to me like their motivation was largely down to skillset issues and recruitment.
If this is motivated by the prospect of being menaced by the current US government then, while Google might be a safer home, arXiv is still vulnerable to having its funding disrupted by malicious actors.
einpoklum 1 day ago
Another piece of (effectively) public infrastructure coming under the control of the mega-corporations, and Alphabet specifically.
[-]
- johann8384 1 day ago
  I don't understand why people are being so dramatic about them using GCP.
  [-]
  - throwaway984393 23 hours ago
    [dead]
  - pharrington 1 day ago
    Google absolutely will pull the rug on arXiv in the future. Any corporate entity would.
- dewey 1 day ago
  Just because you are hosting at a company doesn’t mean they are “controlling” it.
  [-]
  - johnisgood 23 hours ago
    Can't they stop providing services at a whim? Can't they do whatever with the data?
    [-]
    - surajrmal 18 hours ago
      If cloud providers could do whatever they want with their customers data, they wouldn't have any customers.
      [-]
      - johnisgood 15 hours ago
        > If cloud providers could do whatever they want with their customers data, they wouldn't have any customers.
        I disagree, but it depends on what we mean by "doing".
    - kelnos 23 hours ago
      Yes to the first, but other platforms that run k8s exist.
      No to the second, unless they've come to some sure of agreement with Cornell that lets them.
      [-]
      - johnisgood 23 hours ago
        The first is already awful. Without notice?
        [-]
        surajrmal 18 hours ago
        Generally speaking all companies are capable of stopping services at a whim, unless there are contractual obligations for otherwise. Singling out Google here isn't helpful unless there is a unique provision in their contract that others don't have.
        Also worth noting that gcp has over a decade of continuous service with no indication to think it should disappear any time soon. It's not clear why Google's consumer product strategies can be used to infer how their cloud products are run.
        [-]
        johnisgood 17 hours ago
        I wonder why they chose not to use in-house tools instead.
        In any case, are there any documented instances of Google Cloud discontinuing service or terminating a client's hosting for <reasons>?
  - einpoklum 15 hours ago
    Ah, but it's not merely "hosting"; they're rewriting the code to fit Google Cloud, and possibly to depend on it.
- spenrose 20 hours ago
  Networking and power infrastructure are "under the control of" large corporations also. Without them, Arxiv doesn't exist.
dig1 1 day ago
> We are already underway on the arXiv CE ("Cloud Edition") project... replace the portion of our backends still written in perl and PHP...re-architect our article processing to be fully asynchronous,.. we can deploy via Kubernetes or services like Google Cloud Run...improve our monitoring and logging facilities
Why do I smell someone from G was there and sold them fancy cloud story (or they wanted VMs and reseller sold them CloudRun)? Anyway, goodbye simplicity and stability, hello exorbitant monthy costs for the same/less service quality. Would love to be wrong.
[-]
- jstummbillig 1 day ago
  I noticed that while everyone on hn is quite clever, we are regularly not clever enough to assume that other people in similar settings are just as clever, and recognize when they probably spent a lot more time thinking about an issue we just skim the headline of.
  [-]
  - dogleash 21 hours ago
    > I noticed that while everyone on hn is quite clever
    Are we though? I see a pockets of world-expert level knowledge, some reasonable shop talk, and quite a bit of really dumb nonsense that is contradicted by my professional experience. Or just pedestrian level wrong. I mostly shitpost.
    I don't have an opinion about arxiv's hosting, but it does read to be one of those projects that includes cleaning up of long standing technical debt that they probably couldn't get funded if not for the flashy goal. The flashy goal will, regardless of it's own merits, also be credited for improvements that they should have made all along.
  - Moto7451 22 hours ago
    I’m not sure of the perspective of the OP but his comment hits home in that a common theme for the past ten years of my career has been “let’s move something to X, Y, and Z because that’s what Google says you should do.”
    Note that Google doesn’t outright define an architecture for anyone, but people who worked at Google who come in as the hot hire bring that with them. Ditto for other large employers of the day. One of my mentors had to deal with this when Yahoo was the big deal.
    In some cases, when abstractions are otherwise correct, this hasn’t been a big deal for the software projects and companies I’ve been involved with. It’s simply “there’s an off the shelf well supported industry standard we can use so we can focus on our customer/end goal/value add/etc.” Using an alternative docker runtime “that Google recommends” (aka is suggested by Kubernetes) is just a choice.
    Where people get bit and then look at this with a squint, is when you work at several places where on the suggestion of the new big hire from Google/Amazon/Meta/etc, software that runs just fine on a couple server instances and has a low and stable KTLO budget ends up being broken down into microservices on Kubernetes and suddenly needs a team of three and doesn’t provide any additional value.
    The worst I’ve experienced is a company that ended up listing the cost of maintaining their fork of Kubernetes and development environment as a material reason for a layoff.
    Google’s marketing arm also has made deals to help move people to Google Cloud from AWS. Where I am working now this didn’t work to plan on either side it seems so we’re staying dual cloud, a situation no one is happy about. Before my time there was an executive on the finance side that believed Google was always the company to bet on and didn’t see Amazon as more than a book store. Also money. Different type of hubris, different type of pressure, same technical outcome as a CTO that runs on a “well Google says” game plan.
    At the end of the day, Google is a big place and employs a lot of people. You’re going to have a lot of individuals who experience hucksters trying to parlay Google experience into an executive or high ranked IC role and they’re going to lean on what they know. That has nothing to do with Google itself, but their attempts to pry people away from AWS are about the same flavor from my personal experience.
  - kadushka 23 hours ago
    The people on both ends of that conversation (google vs cornell) are clever but the result will probably be enshittification.
    [-]
    - keepamovin 23 hours ago
      Impossible to argue with HN's regression to entropy cynicism hehehe :)
    - jeffbee 23 hours ago
      I'm torn on flagging comments that throw out "enshittification". Do you feel that this stands in for actual thought?
      [-]
      - kadushka 22 hours ago
        [flagged]
        [-]
        Analemma_ 20 hours ago
        I'm going to fling that response right back at you. "Enshittification" is not a generic term for "update I don't like", it describes a specific dynamic that happens when a company inserts itself as a middleman into a two-sided market. arXiv could get worse in other ways, but enshittification in particular can't happen to it, that's a category error.
        I think you should offer better thoughts instead of mad-libbing in buzzwords where they don't apply. Enshittification is actually a useful concept, and I don't want it to go the way of "FUD", which had a similar trajectory in the later years of Slashdot where people just reduced it to a useless catch-all phrase whenever Microsoft said anything about anything.
        [-]
        kadushka 19 hours ago
        I'm using "enshittification" as defined here: https://www.merriam-webster.com/slang/enshittification
        I believe there's a real chance of it happening here as a result of this transition. I personally experienced results of several of similar transitions over the course of my career. What I haven't experienced are problems with Arxiv that would motivate such a change. There might be actual problems they are trying to solve - but I still believe things will probably get worse as a result.
  - wordofx 23 hours ago
    [flagged]
    [-]
    - fuckbrownpeople 23 hours ago
      [dead]
- specialp 1 day ago
  I don't think it is that. I work for an org with close ties to arXiv, and just like us they are getting a lot more demand due to AI crawling. As a primary source of information there is a lot of traffic. They do have technical issues from time to time due to this demand, and I think their stability is just due to the exceptional amount of effort they take to keep it going. They are also getting more submissions and interest.
  Kubernetes does add complexity but it does add a lot of good things too. Auto scaling, cycling of unhealthy pods, and failover of failed nodes are some of them. I know there is this feeling here sometimes that cloud services and orchestrated containers are too much for many applications, but if you are running a very busy site like arXiv I can't see how running on bare metal is going to be better for your staff and experience. I don't think they are naive and got conned into GCP as the OP alludes to. They are smart people that are dealing with scaling and tech debt issues just like we all end up with at some point in our careers.
  [-]
  - JackC 23 hours ago
    > I work for an org with close ties to arXiv, and just like us they are getting a lot more demand due to AI crawling
    Funny, I also work on academic sites (much smaller than arXiv) and we're looking at moving from AWS to bare metal for the same reason. The $90/TB AWS bandwidth exit tariff can be a budget killer if people write custom scripts to download all your stuff; better to slow down than 10x the monthly budget.
    (I never thought about it this way, but Amazon charges less to same-day deliver a 1TB SSD drive for you to keep than it does to download a TB from AWS.)
    [-]
    - Imustaskforhelp 23 hours ago
      I don't understand, why don't you use cloudflare? Don't they have an unlimited egress policy with R1?
      Its way more predictable in my opinion that you only pay per month a fixed amount to your storage, it can also help the fact that its on the edge so users would get it way faster than lets say going to bare metal (unless you are provisioning a multi server approach and I think you might be using kubernetes there and it might be a mess to handle I guess?)
      [-]
      - sitkack 6 hours ago
        Regardless, if you are delivering PDFs, you should be using a CDN.
        If crawling is a problem, 1 it is pretty easy to rate limit crawlers, 2 point them at a requestor pays bucket and 3, offer a torrent with anti leech.
      - mcmcmc 23 hours ago
        Could have something to do with Cloudflare’s abhorrent sales practices.
        [-]
        keepamovin 23 hours ago
        Can you tell me more? I think my business needs some abhorrent sales practices. That's how it's done, right?
        [-]
        mcmcmc 22 hours ago
        One example
        https://robindev.substack.com/p/cloudflare-took-down-our-web...
        [-]
        ryao 6 hours ago
        I suspect that is the result of this:
        https://www.reddit.com/r/sales/comments/134u0mq/cloudflare_c...
        They got rid of all of the “underperforming” sales people and hired new ones. That nightmare is the result. I suspect the higher the sales performance, the more likely they were doing things like this.
        keepamovin 22 hours ago
        Wow, okay. That's a little too extreme. How is cloudflare acting insecure when so large? hmmm, confused.
    - ryao 22 hours ago
      The two are not comparable. The 1TB of transit at Amazon can be subdivided over many recipients, while the solid state drive is empty and only can be sent to one.
      That said, I agree that transit costs are too high.
      [-]
      - fc417fc802 13 hours ago
        So order multiple drives, transfer the data to them, and drop them in the mail to the client. That should always be the higher bandwidth option, but in a sane world it would also be less cost effective given the differences in amount of energy and sorts of infrastructure involved.
        The reason to switch away from fiber should be sustained aggregate throughput, not transfer cost.
        [-]
        ryao 10 hours ago
        The other guy was also comparing them based on transfer cost. Given that 1TB can be divided across billions of locations, shipping physical drives is not a feasible alternative to transit at Amazon in general.
        [-]
        fc417fc802 9 hours ago
        I'm not trying to claim that it's generally equivalent or a viable alternative or whatever to fiber. That would be a ridiculous claim to make.
        The original example cited people writing custom scripts to download all your stuff blowing your budget. A reasonable equivalent to that is shipping the interested party a storage device.
        More generally, despite the two things being different their comparison can nonetheless be informative. In this case we can consider the up front cost of the supporting infrastructure in addition to the energy required to use that infrastructure in a given instance. The result appears to illustrate just how absurd the current pricing model is. Bandwidth limits notwithstanding, there is no way that the OPEX of the postal service should be lower than the OPEX of a fiber network. It just doesn't make sense.
        [-]
        ryao 6 hours ago
        That is true. I was imagining the AWS egress costs at my own work where things are going to so many places with latency requirements that the idea of sending hard drives is simply not feasible, even with infinite money and pretending the hard drives had the messages prewritten on them from the factory. Delivery would never be fast enough. Infinite money is not feasible either, but it shows just how this is not feasible in general in more than just the cost dimension.
  - CharlieDigital 1 day ago
    Sound like all they needed was a CDN if the problem is AI crawlers. Adding auto-scaling compute just increases costs faster.
    [-]
    - specialp 23 hours ago
      CDN is one part of strategy to deal with load. But it is not the only solution unless your site is exclusively static content. Their search, APIs, submission pipelines, duplicate detectors and a lot of other things are not going to be powered by CDNs.
      [-]
      - motorest 23 hours ago
        Thank you for the insight. It's very easy to prescribe simple solutions when we are oblivious to the actual problems being solved.
      - GTP 23 hours ago
        But, under the assumption that the problem are indeed AI crawlers, of the things you listed only the search would be under increased load.
        [-]
        specialp 23 hours ago
        I am sure their pages are not entirely static either. The APIs are used by researchers and AI companies too. Also with search you end up having people trying to use it for RAG with AI. I have dealt with this all and there is no one dead simple solution to deal with things. AI crawlers are one part of things, but they also have increasing submissions, have to deal with AI generated spam papers, and all sorts of stuff. There's always this feeling here on HN that oh it is dead simple you just do "X" as if the people that are dealing with it don't know that.
      - coliveira 23 hours ago
        All these services can be throttled to deal with AI. I don't see this as a justification. The idea that a service like arXiv should be run as a startup is, simply put, foolish.
      - Imustaskforhelp 23 hours ago
        pardon me but cloudflare workers seem better for this approach.
        If we can get for the fact that we require javascript to run it, aside from that. Cloudflare workers is literally the best single thing to happen at least to me. With a single domain, I have done so many personal projects for problems I found interesting and I built so many projects for literally free, no Credit card. No worries whatsoever.
        I might ditch writing other languages for server based like golang even though I like golang more just because cloudflare workers exists.
        [-]
        anelson 22 hours ago
        I too am impressed by Cloudflare Workers’ potential.
        However Workers supports WASM so you don’t necessarily have to switch to JavaScript to use it.
        I wrote some Rust code that I run in Cloudflare Functions, which is a layer on top of Cloudflare Workers which also supports WASM. I wrote up the gory details if you’re interested:
        https://127.io/2024/11/16/generating-opengraph-image-cards-f...
        JavaScript is most definitely the path of least resistance but it’s not the only way.
    - miyuru 23 hours ago
      they already use fastly.
      https://blog.arxiv.org/2023/12/18/faster-arxiv-with-fastly/
  - evrythingisfin 22 hours ago
    GC seems like a bad choice to me. The GC CLI and tools aren’t terrible, but their services in-general have historically been not been friendly to use and their support involves average documentation and humans that talk at users instead of serve them. Google as a company is not what it was, either. A lot of their funding was driven from advertising, and between social media, streaming, and LLMs, that seems like it may be starting to dry up.
    Azure? Microsoft as a company is still the choice of most IT departments due to its ubiquitousness and low cost barrier to entry. I personally wouldn’t use Azure if I had the choice, because it’s easy and cheap at the surface, with hell underneath (except for products based on other things, like AD which was just a nice LDAP server, or C# which was modelled after Java).
    I’d have gone with AWS. EKS isn’t bad to setup and is solid once it’s up. As far as the health of Amazon itself, China entering the their space hasn’t significantly changed their retail business, though eventually they’ll likely be in price wars.
    The greatest risk to any cloud provider I think would be a war that could force national network isolation or government taking over infrastructure. And the grid would go down for a good while and water would stop, so everyone would start migrating, drinking polluted water, then maybe stealing food. At that point, whether or not they chose GC doesn’t matter anymore.
    [-]
    - surajrmal 18 hours ago
      Everything has pros and cons. Just because their calculus came out different from yours doesn't mean they made the wrong decision for their situation. Hundreds of thousands of organizations have made similar conclusions to arxiv.
      Google cloud profitable these days and advertising or other income streams drying up will only entice Google to further invest in cloud to ensure they are more diversified. Google isn't going to go away overnight and cloud is perhaps the least risky business they operate in.
    - Kwpolska 20 hours ago
      Isn't Azure noticeably more expensive than AWS?
  - sitkack 6 hours ago
    They don't need K8S, containerization yes, but not K8S.
  - Funes- 23 hours ago
    >AI crawling
    Can you not reliably block crawlers in this day and age?
    [-]
    - masklinn 23 hours ago
      AI crawlers are a plague, they are intentionally badly behaved and designed to be hard to flag without nuking legit traffic. That’s why projects like nepenthes exist.
    - specialp 23 hours ago
      You can to some degree with Cloudflare and other solutions. But, do you want to block them all? AI is a very useful tool for people to discover information and summarize results. Especially in scholarly publishing where one would have to previously search on dumb keywords, and have to read loads of abstracts to find the research that pertains to their interests. So by blocking AI crawlers and bots completely, you are shutting off what will probably end up being the primary way people use your resource not too long from now. arXiv is a hub of research, and their mission is to make that research freely available to the world.
    - Imustaskforhelp 23 hours ago
      Dude, I don't want to sound cloudflare advocate because my last 2 comments on this thread are just shilling cloudflare...
      but I think cloudflare is the answer to this thing as well.. (Sorry if I am being annoying) (Cloudflare isn't sponsoring me, I just love their service so much)
- londons_explore 1 day ago
  This....
  I bet arXiv was run on server hardware costing under $10k before...
  And now it'll end up costing $10k per month (with free credit from Google which will eventually go away and then arXiv will shut down or be forced to go commercial)
  [-]
  - perihelions 1 day ago
    arXiv budgets about $88,000/year for server costs as of 2019,
    (pdf) https://info.arxiv.org/about/reports/arXiv_CY19_midyear.pdf
    [-]
  - falcor84 1 day ago
    I assume they would still have the serving code they use now and if they do choose to go back to maintaining it on their own hardware they'll always have that option. It seems they just don't want that anymore.
    [-]
    - gapan 1 day ago
      That will be several years down the line, when everything has bit-rotten to death and nobody in the team remembers how the old setup worked.
    - londons_explore 13 hours ago
      That's why they're being encouraged to use cloud run and all the other cloud functionality so a migration back becomes very hard.
- sightbroke 1 day ago
  > This is a project to re-home all arXiv services from VMs at Cornell to a cloud provider (Google Cloud).
  They are already using VMs but one of the things it'll do is:
  > containerize all, or nearly all arXiv services so we can deploy via Kubernetes or services like Google Cloud Run
  And further state:
  > The modernization will enable: - arXiv to expand the subject areas that we cover - improve the metadata we collect and make available for articles, adding fields that the research community has requested such as funder identification - deal with the problem of ambiguous author identities - improve accessibility to support users with impairments, particularly visual impairments - improve usability for the entire arXiv community
  [-]
  - notpushkin 1 day ago
    Containers part I can understand. Why not spin up a tiny Docker Swarm (or k3s/k0s) cluster instead of straight out going to Google though?
    [-]
    - kelnos 23 hours ago
      Because those other things require more maintenance effort to run.
      Getting creative is often just a pain in the ass. Doing the standard things, walking the well-trod path, is generally easier to do, even if it may not be the cheapest or most hardware/software-efficient thing to do.
- elif 1 day ago
  I think this is more likely a move to prevent Cornell funding from getting tied up with dictations about what gets published.
  But in all likelihood someone was probably just like "we're tired of doing ops on 2 decades old stacks"
  [-]
  - johann8384 1 day ago
    It sounds like they are just using GCP. It doesn't really change that.
    [-]
    - whatever1 23 hours ago
      Google cloud can easily move the the instance to a region that is not science/free speech hostile
    - elif 23 hours ago
      Yea but GCP was state of the art 20 years ago, php+perl was already crufty
      [-]
      - ordersofmag 23 hours ago
        So state of the art it wasn't even available yet (preview launch was in 2008).
        [-]
        elif 20 hours ago
        Sounds about right. I remember hearing about it first in a talk being given by Doug Crockford at my university around that time. It blew my mind. I thought it was like gcc for the Internet. It's kind of wild that in the interim we have experienced the complete rise and fall of mongodb, node.js, even today the react paradigm are all expressions of this tiny little functional scripting language..
- johann8384 1 day ago
  Moving to K8s, adding in additional instrumentation, just sounds like some new folks took over or joined the project and are doing some renovations. Seems like pretty standard stuff, doesn't really seem as sinister as you make it to be.
- moralestapia 23 hours ago
  >we can deploy via Kubernetes
  Oh noes ... they got scammed
paologiacometti 21 hours ago
I would be curious if by any chance google has given a discount as long as they allow them to use latex sources of the papers to addrest their artificial intelligence models
[-]
- staunton 14 hours ago
  Arxiv makes the latex source available for download since a while ago. I'm sure all of that data has long been used for training already.
johnisgood 23 hours ago
So, is this good or bad news? Why?
musicale 11 hours ago
RIP
jeffbee 1 day ago
Job opening for the rare perl+latex hackers.
[-]
- rahen 1 day ago
  Perl is becoming rare, but is LaTeX also falling out of use for technical papers? Most of the scientific and CS papers I’ve seen lately still seem to use it, even those coming from Microsoft.
  That said, it's often generated via Org-mode or WYSIWYG tools these days.
  [-]
  - bluenose69 23 hours ago
    LaTeX is still very much in favour in my field of geophysical research. I believe that the same is true across most fields that rely on mathematical notation to explain things in written form. I've never found it difficult to write in a non-WSIWYG system ... indeed, it is really quite convenient.
    Years ago I had a student in my class who was unable to make out written material, but had a machine that read text aloud. My class notes (some 300+ pages) are chock full of mathematics. I went to the student's house to see how that reading machine worked. Provided with LaTeX input, it said a lot of things like "backslash alpha" and "begin open brace equation close brace" stuff. I wrote a quick perl script to change it, so it said "alpha" and "begin equation". Presto -- it was exactly what the student needed. This was, as I say, many years ago. Maybe now there is software that can handle MSword files, etc., but that definitely did not exist at the time. The result? The student was able to take the class, and did very well in it.
  - KeplerBoy 1 day ago
    LaTeX is alive and well. I absolutely don't have the feeling that's going to change anytime soon.
    All the conferences I'm looking at have word and latex templates. Word clearly isn't going to replace LaTeX for many reasons.
    Truth to be told: As long as you're given a template and have to stick to that template, LaTeX is all you could ever want. With overleaf there's barely a tooling learning curve either.
  - toxik 1 day ago
    It’s still 100% LaTeX in my fields (AI, ML, robotics, telecom).
    [-]
    - rahen 22 hours ago
      Out of curiosity, do you happen to know which tools are most commonly used to generate those LaTeX papers? Is auctex still a thing nowadays?
  - btrettel 22 hours ago
    > is LaTeX also falling out of use for technical papers?
    In mechanical and aerospace engineering broadly, I see less use of LaTeX over time. It's hard to estimate precisely which percent of the market is LaTeX vs. Word vs. something else, but I think I can see trends. Almost no one where I work uses LaTeX, though LaTeX used to be more popular there.
    I think it probably varies a lot by narrow specialty and publication venue too. Papers submitted to Journal of Fluid Mechanics seem to overwhelmingly use LaTeX. The main conference I would submit papers to during my PhD is primarily Word (though I used LaTeX). I have seen at least one Word-only engineering journal, though it wasn't something I would publish in.
ilynd 19 hours ago
Is this why my submission has been on hold for so long..
ck2 20 hours ago
Odd but related question if anyone knows:
Are all preprints on arXiv public?
Or is there actually a private unlisted preprint queue?
From behavior I've observed, I'm guessing maybe authors have the ability to hide papers and send private invites for select peer-review?
[-]
- prof-dr-ir 14 hours ago
  > I'm guessing maybe authors have the ability to hide papers
  No they do not.
  In fact, authors cannot even delete submitted papers after they have officially appeared in the (approximately) daily cycle. You can update your paper with a new version (which happens frequently) or mark it as withdrawn (which happens rarely). But in either case all the old versions remain available.
- evanb 20 hours ago
  Papers can be held my moderators, but as an author when you post you don't have the ability to hold or hide your paper.
  Postings go up once a day as a batch, so you could wait 24ish hours to see your paper appear, longer if your posting is just before the weekend.
kkfx 23 hours ago
A small note the defunct ZeroNet prove to be performant enough for a distributed web, why not start to really go distributed for things where the network overhead is not so heavy? A distributed arXiv sustained by many individual peers spread all over the world is possible.
Funes- 23 hours ago
Great, more self-reliance instead of playing into the hands of giga corporat... ah, shit.
ArinaS 14 hours ago
And this is a yet another instance of Google monopolizing the web.
cruzcampo 1 day ago
More capitalist capture of the commons - another day in america.
nsibr 1 day ago
[dead]
throwaway81523 1 day ago
Crap, there goes privacy.
[-]
- einpoklum 1 day ago
  Can you elaborate regarding the privacy concern? arXiv content is public, isn't it?
  [-]
  - tgv 1 day ago
    Possibly refering to google now tracking who visits which articles.
  - Tepix 1 day ago
    Google will likely see who accesses which papers
    [-]
    - dewey 1 day ago
      They could already just look at Google Search popularity if that would be so interesting.
      [-]
      - Tepix 18 hours ago
        Google already has large scale surveillance of large parts of the internet thanks to their tracking cookies that are used for google analytics. They can track individual users, not just get some statistics.
    - surajrmal 18 hours ago
      People seem to generally confuse the ability to do something with the actual doing of said thing.
paologiacometti 21 hours ago
sarei curioso di sapere se per caso google ha concesso uno sconto purché gli permettano di usare di sorgenti latex dei paper per addrestrare i loro modelli di intelligenza artificial
Brosper 1 day ago
They should definitely update UI of the website.
[-]
- jruohonen 1 day ago
  Why? I'd argue most users of that site like the skeleton ASCII-style UI, which is also fast (at least when compared to the today's JS-riddled sites).
  [-]
  - sightbroke 1 day ago
    I am curious too as to what needs changed as well.
    It appears functionality efficient, well organized, and fast as you observed.
- dark-star 1 day ago
  Why, what's wrong with the UI?
  [-]
  - worthless-trash 1 day ago
    Not react enough.
progbits 1 day ago
I'm sure many are going to hate on this, but a simple k8s setup (just container+deployment+ingress) is going to be good for preventing vendor lock in.
Too bad it's US only, devops is actually a good role for full remote setups. If you can't make it work asynchronously across timezones you are doing it wrong :)
[-]
- orochimaaru 23 hours ago
  Hiring and payroll can be a complete nightmare especially with publicly funded organizations and initiatives like arxiv.
  Also plenty of North Korean bad actors masquerading as US remote workers. I’m betting as European remote workers too. I like remote work but shit like that is why we can’t have good things.
  https://www.yahoo.com/news/thousands-north-korean-workers-in...
- dvrp 1 day ago
  You deal with emergencies with international on call rotation?
  [-]
  - tgsovlerkhgsel 1 day ago
    "Follow the sun" rotations are a common way to handle 24h oncall rotations. Three regular shifts (8 hours of work plus some lunch break that gives you time for a short overlap at the beginning/end of the shift) during that region's local daytime hours, in three different regions with time zones roughly 8 hours apart.
    That way nobody has to work nights and you still get 24/7 coverage.
    [-]
    - Spivak 1 day ago
      You have to get management to get over themselves and the stupid office policy to make it happen though. Synchronous collaboration isn't all it's cracked up to be.
  - progbits 1 day ago
    Not just emergencies but good timezone coverage is nice for oncall, nobody likes getting woken up by a page at 3am.
    But I meant things like having good and automated tests, deploys etc so everyone can be productive on their own without multiple people getting involved ("hey can you please deploy X for me when you are here?").
bustling-noose 1 day ago
My theory : Engineers rented servers and maintained them with software and packages and infrastructure scripts / code etc. Then this moved to cloud VMs cause it became easier for higher availability and sometimes cheaper too. Then VM costs started rising and cloud providers started offering tempting prices to use some of their services instead that integrated well with their VM infrastructure. Simultaneously engineers started to cost more money to maintain these systems and more people trained in 'cloud' became available for relatively cheap. So people moved their infrastructure to cloud offerings. Now both engineers and cloud services cost a lot of money. But now engineers who can maintain such infrastructure are far and few in between and also cost a lot of money while cloud offerings became turnkey literally.
So costs went from $80k an engineer a year maybe a decade ago with a few thousand to servers to $200k an engineer which you would struggle to find or $100k for a 'cloud engineer / architect' plus $100k to a cloud provider.
This sounds great in theory. Except that cloud providers are messy and once vendor locked in, you are in a big spiral. Secondly the costs can be hidden and climb exponentially if you don't know exactly what you are doing. You might also get into weird bugs that could be solved by a patch over a Monday to some package you could just update which might take months or never happen on a cloud provider. The reality of moving to cloud is not as rosy as it sounds.
Universities used to be the birth place of big projects that were created to solve problems they ran into hosting / running their own infrastructure. I hope that is still true.