Microsoft's computer vision model will generate alt text for Reddit images

Two years ago, Microsoft announced Florence, an AI system that it pitched as a “complete rethinking” of modern computer vision models. Unlike most vision models at the time, Florence was both “unified” and “multimodal,” meaning it could (1) understand language as well as images and (2) handle a range of tasks rather than being limited to specific applications, like generating captions.

Now, as part of Microsoft’s broader, ongoing effort to commercialize its AI research, Florence is arriving in an update to the Vision APIs in Azure Cognitive Services. The Florence-powered Microsoft Vision Services launches today in preview for existing Azure customers, with capabilities ranging from automatic captioning, background removal and video summarization to image retrieval.

“Florence is trained on billions of image-text pairs. As a result, it’s incredibly versatile,” John Montgomery, CVP of Azure AI, told TechCrunch in an email interview. “Ask Florence to find a particular frame in a video, and it can do that; ask it to tell the difference between a Cosmic Crisp apple and a Honeycrisp apple, and it can do that.”

The AI research community, which includes tech giants like Microsoft, has increasingly coalesced around the idea that multimodal models are the best path forward to more capable AI systems. Naturally, multimodal models (models that, once again, understand multiple modalities, such as language and images or videos and audio) are able to perform tasks in one shot that unimodal models simply cannot (e.g., captioning videos).

Why not string several “unimodal” models together to achieve the same end, like a model that understands only images and another that understands exclusively language? A few reasons, the first being that multimodal models in some cases perform better at the same task than their unimodal counterparts thanks to the contextual information from the additional modalities. For example, an AI assistant that understands images, pricing data and purchasing history is likely to offer better-personalized product suggestions than one that only understands pricing data.

The second reason is that multimodal models tend to be more efficient from a computational standpoint, leading to speedups in processing and (presumably) cost reductions on the backend. Microsoft being the profit-driven business that it is, that is, no doubt, a plus.

So what about Florence? Well, because it understands images, video and language and the relationships between those modalities, it can do things like measure the similarity between images and text or segment objects in a photo and paste them onto another background.

I asked Montgomery which data Microsoft used to train Florence (a timely question, I thought, in light of pending lawsuits that could decide whether AI systems trained on copyrighted data, including images, are in violation of the rights of intellectual property holders). He wouldn’t give specifics, save that Florence uses “responsibly-obtained” data sources “including data from partners.” In addition, Montgomery said that Florence’s training data was scrubbed of potentially problematic content, another all-too-common feature of public training datasets.

“When using large foundational models, it is paramount to assure the quality of the training dataset, to create the foundation for the adapted models for each Vision task,” Montgomery said. “Furthermore, the adapted models for each Vision task has been tested for fairness, adversarial and challenging cases and implement the same content moderation services we’ve been using for Azure Open AI Service and DALL-E.”

We’ll have to take the company’s word for it. Some customers are, it seems. Montgomery says that Reddit will use the new Florence-powered APIs to generate captions for images on its platform, creating “alt text” so users with vision challenges can better follow along in threads.

“Florence’s ability to generate up to 10,000 tags per image will give Reddit much more control over how many objects in a picture they can identify and help generate much better captions,” Montgomery said. “Reddit will also use the captioning to help all users improve article ranking for searching for posts.”

Microsoft is also using Florence across a swath of its own platforms, products and services.

On LinkedIn, as on Reddit, Florence-powered services will generate captions to edit and support alt text image descriptions. In Microsoft Teams, Florence is driving video segmentation capabilities. PowerPoint, Outlook and Word are leveraging Florence’s image captioning abilities for automatic alt text generation. And Designer and OneDrive, courtesy of Florence, have gained better image tagging, image search and background generation.

Montgomery sees Florence being used by customers for much more down the line, like detecting defects in manufacturing and enabling self-checkout in retail stores. None of those use cases requires a multimodal vision model, I’d note. But Montgomery asserts that multimodality adds something valuable to the equation.

“Florence is a complete re-thinking of vision models,” Montgomery said. “Once there’s easy and high-quality translation between images and text, a world of possibilities opens up. Customers will be able to experience significantly improved image search, to train image and vision models and other model types like language and speech into entirely new types of applications and to easily improve the quality of their own customized versions.”

